Scenario-based coding model guide
Choose the Best LLM for Your Coding Workflow
Stop asking which model is best in general. Pick the model that fits the job: coding agents, repo refactors, frontend generation, code review, test generation, Chinese coding workflows, or budget-sensitive automation.
Best for workflow · Evidence visible · Caveats included · Cost estimate next
Short answer
The best coding LLM depends on the job.
The right choice depends on workflow, public benchmark evidence, API pricing, context needs, retry risk, and output length. Use this page to shortlist models by scenario, then compare source-backed rows and estimate task cost before committing to a model.
Evidence table
Use evidence before recommendation labels
Use the evidence table to see what supports each recommendation. The table should show benchmark/source, price availability, context status, data status, confidence, caveat, and calculator handoff. Recommendations can be reordered as data improves, but unsupported claims should not be shipped.
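As a rough sketch of how one evidence row and the "no unsupported claims" rule could be represented, the snippet below uses a small Python dataclass. The field names, the `unsupported_reasons` helper, and the specific checks are assumptions made for illustration, not the page's actual schema; the two sample rows reuse values shown in the table, with shortened caveats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvidenceRow:
    # Hypothetical field names; None marks an unknown value instead of a guess.
    model: str
    data_status: str
    benchmark_name: str
    benchmark_score: Optional[float]      # None = not publicly benchmarked
    benchmark_source: str
    benchmark_confidence: str
    input_price_per_m: Optional[float]    # USD per 1M input tokens
    output_price_per_m: Optional[float]   # USD per 1M output tokens
    pricing_source: str
    pricing_confidence: str
    context_tokens: Optional[int]         # None = not disclosed
    caveat: str


def unsupported_reasons(row: EvidenceRow) -> List[str]:
    """List why a row should not yet carry a recommendation label (assumed rules)."""
    reasons = []
    if row.benchmark_score is None:
        reasons.append("benchmark score not verified")
    if row.input_price_per_m is None or row.output_price_per_m is None:
        reasons.append("price not available")
    if not row.benchmark_source or not row.pricing_source:
        reasons.append("source missing")
    if not row.caveat.strip():
        reasons.append("caveat missing")
    return reasons


opus = EvidenceRow(
    model="Claude Opus 4.5", data_status="partial",
    benchmark_name="SWE-bench Verified", benchmark_score=76.80,
    benchmark_source="SWE-bench Leaderboards", benchmark_confidence="medium",
    input_price_per_m=5.0, output_price_per_m=25.0,
    pricing_source="Anthropic Claude API pricing", pricing_confidence="high",
    context_tokens=None,
    caveat="Expensive output pricing; not a universal best model.",
)
gpt_mini = EvidenceRow(
    model="GPT-5.4 mini", data_status="partial evidence",
    benchmark_name="SWE-bench Verified", benchmark_score=None,
    benchmark_source="SWE-bench Leaderboards exact alias recheck",
    benchmark_confidence="low",
    input_price_per_m=0.75, output_price_per_m=4.5,
    pricing_source="OpenAI API pricing", pricing_confidence="high",
    context_tokens=None,
    caveat="Exact alias has no verified public coding benchmark.",
)

for row in (opus, gpt_mini):
    print(row.model, unsupported_reasons(row))
```

In this sketch an undisclosed context window is recorded as `None` and displayed as unknown rather than treated as a blocker; a stricter policy could add it to the checks.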
Anthropic
Claude Opus 4.5
Data status: partial
SWE-bench Verified: 76.80% (SWE-bench Leaderboards · checked 2026-06-02)
Price: $5 input / 1M tokens · $25 output / 1M tokens
Context: not disclosed
Scenarios: Coding agents · Repo refactor · Code review
Caveat: Strong SWE-bench evidence in captured source, but expensive output pricing; do not present as universal best model.
Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium
Anthropic
Claude Sonnet 4.5
Data status: partial
SWE-bench Verified: 71.40% (SWE-bench Leaderboards · checked 2026-06-02)
Price: $3 input / 1M tokens · $15 output / 1M tokens
Context: not disclosed
Scenarios: Coding agents · Frontend generation · Repo refactor · Test generation
Caveat: Good candidate for default coding workflow shortlist, but scenario label must cite benchmark and price fields rather than claim overall superiority.
Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium
Anthropic
Claude Haiku 4.5
Data status: partial
SWE-bench Verified: 66.60% (SWE-bench Leaderboards · checked 2026-06-02)
Price: $1 input / 1M tokens · $5 output / 1M tokens
Context: not disclosed
Scenarios: Low-cost automation · Test generation · Bug fixing
Caveat: Lower token price does not guarantee lower task cost if retry rate rises; calculator must expose retry assumptions.
Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium
OpenAI
GPT-5.4 mini
Data status: partial evidence
SWE-bench Verified: not publicly benchmarked (SWE-bench Leaderboards exact alias recheck · checked 2026-06-05)
Price: $0.75 input / 1M tokens · $4.50 output / 1M tokens
Context: not disclosed
Scenarios: Low-cost automation · Test generation
Caveat: Pricing is source-backed from OpenAI API pricing at the standard short-context rate for GPT-5.4 mini. No public coding benchmark result has been verified for this exact alias, so the benchmark field is marked not publicly benchmarked.
Pricing source: OpenAI API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards exact alias recheck · checked 2026-06-05 · confidence low
What is the best LLM for coding?
There is no single best LLM for every coding task. The best choice depends on your workflow, benchmark evidence, price, context, retry risk, and output length. Prefer scenario labels such as "best for coding agents", "best for refactor", or "cheapest good-enough" over a single overall ranking.
Which LLM should I use for a coding agent?
Start with models that have public coding evidence, reliable context behavior, visible caveats, and source-backed pricing. Then estimate the cost of your agent loop. Agent tasks can become expensive through retries, tool calls, long outputs, and failed trajectories.
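To see why an agent loop can cost more than its per-token price suggests, here is a minimal sketch of one run in which every step re-sends the accumulated context. The function name, the token counts, and the no-caching assumption are illustrative only; the $3 / $15 prices are taken from the Claude Sonnet 4.5 row above.

```python
def agent_run_cost(base_prompt_tokens, steps, output_tokens_per_step,
                   tool_result_tokens_per_step,
                   input_price_per_m, output_price_per_m):
    """Estimate the USD cost of one agent trajectory.

    Assumptions (not from the page): no prompt caching, and each step
    re-sends the full context, which grows by the model output plus the
    tool result of the previous step.
    """
    context = base_prompt_tokens
    input_total = 0
    output_total = 0
    for _ in range(steps):
        input_total += context                  # whole context billed as input
        output_total += output_tokens_per_step  # new tokens billed as output
        context += output_tokens_per_step + tool_result_tokens_per_step
    return (input_total * input_price_per_m
            + output_total * output_price_per_m) / 1_000_000


# Hypothetical 10-step run: 4,000-token prompt, 500 output tokens and
# 1,500 tool-result tokens per step, priced at $3 / $15 per 1M tokens.
cost = agent_run_cost(4_000, 10, 500, 1_500, 3.0, 15.0)
print(f"${cost:.3f} per trajectory")
```

With these made-up numbers the run bills 130,000 input tokens against only 5,000 output tokens, so the re-sent context dominates the total (about $0.465). A failed trajectory costs the same again, which is why the retry rate matters as much as the price.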
What is the cheapest LLM for coding?
The cheapest LLM for coding depends on task complexity and retry rate. A low price per million tokens can be attractive, but the real cost depends on how many prompts, outputs, retries, cache reads, and human corrections the workflow needs.
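The retry effect can be made concrete with a small expected-cost sketch. It assumes attempts are independent and repeated until one succeeds, so the expected number of attempts is 1 / success rate. The prices mirror the $1 / $5 and $3 / $15 rows above, but the success rates and token counts are invented for illustration and are not measurements of any model.

```python
def attempt_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of a single attempt at the task."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def expected_cost_per_success(cost_per_attempt, success_rate):
    """Expected spend until the first success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Same hypothetical task for both options: 20,000 input and 2,000 output tokens.
cheap = expected_cost_per_success(attempt_cost(20_000, 2_000, 1.0, 5.0), 0.25)
pricier = expected_cost_per_success(attempt_cost(20_000, 2_000, 3.0, 15.0), 0.90)

print(f"cheaper per token: ${cheap:.2f} per solved task")
print(f"pricier per token: ${pricier:.2f} per solved task")
```

Under these invented rates the option that is three times cheaper per token ends up costing more per solved task (about $0.12 versus $0.10). Cache reads and human correction time are left out of the sketch and would shift the comparison further.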
How should I compare Claude, GPT, Gemini, DeepSeek, Kimi, and Qwen for coding?
Compare them by workflow and source coverage. Separate API token pricing from subscription-tool pricing, keep benchmark sources separate, mark unknown fields, and avoid forcing all models into one absolute winner list.
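One way to compare by workflow without forcing a single winner is to filter rows by scenario and keep unknown fields visibly unknown. The sketch below is an assumption about how such a shortlist could be built; the `ROWS` structure and the `shortlist` helper are hypothetical, and the values are copied from the evidence table above.

```python
ROWS = [
    {"model": "Claude Opus 4.5", "swe_bench_verified": 76.80,
     "scenarios": ["Coding agents", "Repo refactor", "Code review"]},
    {"model": "Claude Sonnet 4.5", "swe_bench_verified": 71.40,
     "scenarios": ["Coding agents", "Frontend generation",
                   "Repo refactor", "Test generation"]},
    {"model": "Claude Haiku 4.5", "swe_bench_verified": 66.60,
     "scenarios": ["Low-cost automation", "Test generation", "Bug fixing"]},
    {"model": "GPT-5.4 mini", "swe_bench_verified": None,  # unknown, not zero
     "scenarios": ["Low-cost automation", "Test generation"]},
]


def shortlist(rows, scenario):
    """Return (model, benchmark label) pairs for one scenario, in table order.

    Rows are deliberately not sorted into a ranking, and a missing score
    is shown as text rather than dropped or treated as 0.
    """
    result = []
    for row in rows:
        if scenario not in row["scenarios"]:
            continue
        score = row["swe_bench_verified"]
        label = f"{score:.2f}%" if score is not None else "not publicly benchmarked"
        result.append((row["model"], label))
    return result


for model, label in shortlist(ROWS, "Test generation"):
    print(f"{model}: SWE-bench Verified {label}")
```

Keeping the unknown score as `None` prevents an unbenchmarked model from being sorted to the bottom as if it had scored zero.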