Claude Opus 4.5 · Anthropic · partial | 76.80% SWE-bench Verified · confidence medium · Benchmark source: SWE-bench Leaderboards · checked 2026-06-02. Captured source names the row as Claude 4.5 Opus — high reasoning; verify the exact API model alias before display. | Input: $5 · Output: $25 | Cache: $0.5 read; $6.25 5m write · Context: not disclosed | Coding agents, Repo refactor, Code review | Strong SWE-bench evidence in the captured source, but output pricing is expensive; do not present as the universal best model. Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high |
Claude Sonnet 4.5 · Anthropic · partial | 71.40% SWE-bench Verified · confidence medium · Benchmark source: SWE-bench Leaderboards · checked 2026-06-02. Captured source names the row as Claude 4.5 Sonnet — high reasoning; verify the exact API model alias before display. | Input: $3 · Output: $15 | Cache: $0.3 read; $3.75 5m write · Context: not disclosed | Coding agents, Frontend generation, Repo refactor, Test generation | Good candidate for the default coding workflow shortlist, but the scenario label must cite benchmark and price fields rather than claim overall superiority. Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high |
Claude Haiku 4.5 · Anthropic · partial | 66.60% SWE-bench Verified · confidence medium · Benchmark source: SWE-bench Leaderboards · checked 2026-06-02. Captured source names the row as Claude 4.5 Haiku — high reasoning; verify the exact API model alias before display. | Input: $1 · Output: $5 | Cache: $0.1 read; $1.25 5m write · Context: not disclosed | Low-cost automation, Test generation, Bug fixing | A lower token price does not guarantee a lower task cost if the retry rate rises; the calculator must expose retry assumptions. Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high |
GPT-5.4 mini · OpenAI · partial evidence | not publicly benchmarked (SWE-bench Verified) · confidence low · Benchmark source: SWE-bench Leaderboards exact alias recheck · checked 2026-06-05. The exact SWE-bench row for GPT-5.4 mini was not verified. A public SWE-bench row exists for a different OpenAI Mini alias, so this row intentionally does not attach its numeric value. | Input: $0.75 · Output: $4.5 | Cache: $0.075 read; write not disclosed · Context: not disclosed | Low-cost automation, Test generation | OpenAI API pricing is source-backed for GPT-5.4 mini standard short-context pricing. The exact public coding benchmark for this alias is not verified, so benchmark evidence is marked not publicly benchmarked. Pricing source: OpenAI API pricing · checked 2026-06-02 · confidence high |
Gemini 3 Flash — high reasoning · Google · partial | 75.80% SWE-bench Verified · confidence medium · Benchmark source: SWE-bench Leaderboards · checked 2026-06-02. Captured benchmark row is Gemini 3 Flash — high reasoning; the Google API pricing row for this exact model alias was not captured. | Input: not disclosed · Output: not disclosed | Cache: not disclosed · Context: not disclosed | Coding agents, Long context, Frontend generation | Strong captured SWE-bench result, but price and context must remain unknown until the exact Gemini model docs are mapped. Pricing source: Gemini Developer API pricing · checked 2026-05-28 · confidence low |
DeepSeek V4 Flash · DeepSeek · partial | not publicly benchmarked (SWE-bench / Aider) · confidence low · Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28. Captured benchmark sources mention DeepSeek V3.2 / R1 variants, not the exact DeepSeek V4 Flash. | Input: $0.14 · Output: $0.28 | Cache: $0.0028 read; write not disclosed · Context: 1,000,000 tokens | Low-cost automation, Chinese coding workflow, Long context | Excellent token price and context signal, but the exact public coding benchmark row for V4 Flash was not captured; mark coding evidence as incomplete. Pricing source: DeepSeek API Docs — Models & Pricing · checked 2026-05-28 · confidence high |
DeepSeek V4 Pro · DeepSeek · partial | not publicly benchmarked (SWE-bench / Aider) · confidence low · Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28. Captured benchmark sources mention DeepSeek V3.2 / R1 variants, not the exact DeepSeek V4 Pro. | Input: $0.435 · Output: $0.87 | Cache: $0.0036 read; write not disclosed · Context: 1,000,000 tokens | Chinese coding workflow, Long context, Repo refactor | Pricing and context are source-backed; coding benchmark evidence for the exact V4 Pro model still needs source verification. Pricing source: DeepSeek API Docs — Models & Pricing · checked 2026-05-28 · confidence high |
Kimi K2.5 — high reasoning · Moonshot AI / Kimi · partial | 70.80% SWE-bench Verified · confidence medium · Benchmark source: SWE-bench Leaderboards · checked 2026-06-02. Captured source names the row as Kimi K2.5 — high reasoning. | Input: not disclosed · Output: not disclosed | Cache: not disclosed · Context: not disclosed | Chinese coding workflow, Coding agents, Repo refactor | Useful Chinese coding workflow candidate, but price and context must remain unknown until the exact Moonshot pricing and model docs are captured. Pricing source: Kimi API Platform pricing index · checked 2026-05-28 · confidence low |
Qwen3 235B A22B · Alibaba Cloud / Qwen · partial | 59.6% Aider polyglot coding benchmark · confidence medium · Benchmark source: Aider LLM Leaderboards · checked 2026-05-28. Captured source row: Qwen3 235B A22B diff, no think, Alibaba API. | Input: not disclosed · Output: not disclosed | Cache: not disclosed · Context: not disclosed | Chinese coding workflow, Low-cost automation | Benchmark-backed partial row only. Do not show a price until the exact Alibaba Cloud model pricing is captured. Pricing source: Alibaba Cloud Model Studio pricing search result · checked 2026-05-28 · confidence low |
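
The Claude Haiku 4.5 row notes that a lower token price does not guarantee a lower task cost once retries are counted. The sketch below illustrates that point under several assumptions that are not stated in the table: prices are read as USD per 1,000,000 tokens, each attempt succeeds independently with the same probability (so the expected number of attempts is 1 / success rate), and cache pricing is ignored. The `Pricing` class, the `expected_task_cost` function, the token counts, and the success rates are all hypothetical and chosen only for illustration; the success rates are not derived from any benchmark figure above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the table's Input/Output prices are USD per 1,000,000 tokens.
TOKENS_PER_PRICE_UNIT = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Hypothetical container for one row's token prices (None = not disclosed)."""
    input_usd: Optional[float]
    output_usd: Optional[float]


def expected_task_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    success_rate: float,
) -> Optional[float]:
    """Expected USD cost to finish one task, retrying until it succeeds.

    Assumes independent attempts with a fixed success probability, so the
    expected number of attempts is 1 / success_rate. Returns None when the
    row's price is not disclosed, instead of guessing a value.
    """
    if pricing.input_usd is None or pricing.output_usd is None:
        return None
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    per_attempt = (
        input_tokens * pricing.input_usd + output_tokens * pricing.output_usd
    ) / TOKENS_PER_PRICE_UNIT
    return per_attempt / success_rate


# Prices copied from the table rows; everything else is an illustrative guess.
haiku = Pricing(input_usd=1.0, output_usd=5.0)
sonnet = Pricing(input_usd=3.0, output_usd=15.0)
undisclosed = Pricing(input_usd=None, output_usd=None)

haiku_cost = expected_task_cost(haiku, 20_000, 4_000, success_rate=0.25)
sonnet_cost = expected_task_cost(sonnet, 20_000, 4_000, success_rate=0.90)
unknown_cost = expected_task_cost(undisclosed, 20_000, 4_000, success_rate=0.90)

if __name__ == "__main__":
    print(f"Haiku expected task cost:  {haiku_cost}")
    print(f"Sonnet expected task cost: {sonnet_cost}")
    print(f"Undisclosed pricing:       {unknown_cost}")
```

With these made-up inputs, the cheaper-per-token model costs about $0.16 per completed task and the pricier one about $0.13, because the assumed retry rate outweighs the price gap. Returning `None` for undisclosed prices keeps the calculator from showing a number for rows whose pricing was not captured.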