AI coding model decision engine

Compare Coding Models by Workflow Cost

Raw token price is only the starting point. AICodingPricing connects coding benchmark evidence, API pricing, cache rules, context limits, model caveats, and task-cost assumptions so you can choose a model for the work you are actually running.

Public sources only · Official pricing preferred · Missing values marked · No fake universal score
short answer

There is no single coding model that wins every workflow.

A model can lead one benchmark, cost less per token, or offer a larger context window, but still be the wrong choice if it retries more often, produces longer trajectories, lacks exact source-backed pricing, or has weak evidence for your task type.

Coding agents · Frontend generation · Repo refactor · Low-cost automation · Chinese coding workflow · Long context
workflow shortlist

Best AI coding models by workflow

Use the filters to compare models by workflow, not by hype. A filter is not an absolute ranking. It is a lens over source-backed fields, confidence labels, and caveats.
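A filter of this kind can be sketched as a plain membership test over the best-for labels. This is a minimal illustration, not the site's implementation: the `Row` type, the `filter_by_workflow` name, and the shortened caveat strings are assumptions made for the example. The labels are copied from the table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    best_for: tuple  # workflow labels copied from the table
    caveat: str      # shortened here for illustration


ROWS = [
    Row("Claude Opus 4.5", ("Coding agents", "Repo refactor", "Code review"),
        "expensive output pricing"),
    Row("Claude Haiku 4.5", ("Low-cost automation", "Test generation", "Bug fixing"),
        "retry rate can erase the token-price advantage"),
    Row("DeepSeek V4 Flash", ("Low-cost automation", "Chinese coding workflow", "Long context"),
        "exact coding benchmark row not captured"),
    Row("DeepSeek V4 Pro", ("Chinese coding workflow", "Long context", "Repo refactor"),
        "coding benchmark evidence needs verification"),
]


def filter_by_workflow(rows, workflow):
    """Return rows labeled for `workflow`, in their original order.

    Nothing is scored or re-sorted: the filter narrows the view, it does not rank.
    """
    return [row for row in rows if workflow in row.best_for]


for row in filter_by_workflow(ROWS, "Long context"):
    # The caveat travels with the row so a filtered view never hides it.
    print(f"{row.model}: {row.caveat}")
```

Because the function only selects rows, a model that matches a filter carries the same caveat and confidence labels it has in the full table.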

source-led table

Coding benchmark evidence, token price, context, and caveats

Each row shows the model, provider, benchmark evidence, input price, output price, cache price, context window, speed signal when available, best-for labels, caveat, source, last checked date, confidence, and data status. If a field is unknown, the reason stays visible.

Columns: Model · Benchmark evidence · Input / Output · Cache / Context · Best for · Caveat / source · Task cost
Claude Opus 4.5
Anthropic
Data status: partial
Benchmark evidence: 76.80% on SWE-bench Verified · confidence medium
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02
Captured source names the row as Claude 4.5 Opus — high reasoning; verify exact API model alias before display.
Input: $5 per 1M tokens
Output: $25 per 1M tokens
Cache: $0.5 read; $6.25 5-minute write
Context: not disclosed
Best for: Coding agents, Repo refactor, Code review

Strong SWE-bench evidence in captured source, but expensive output pricing; do not present as universal best model.

Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium

Claude Sonnet 4.5
Anthropic
Data status: partial
Benchmark evidence: 71.40% on SWE-bench Verified · confidence medium
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02
Captured source names the row as Claude 4.5 Sonnet — high reasoning; verify exact API model alias before display.
Input: $3 per 1M tokens
Output: $15 per 1M tokens
Cache: $0.3 read; $3.75 5-minute write
Context: not disclosed
Best for: Coding agents, Frontend generation, Repo refactor, Test generation

Good candidate for default coding workflow shortlist, but scenario label must cite benchmark and price fields rather than claim overall superiority.

Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium

Claude Haiku 4.5
Anthropic
Data status: partial
Benchmark evidence: 66.60% on SWE-bench Verified · confidence medium
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02
Captured source names the row as Claude 4.5 Haiku — high reasoning; verify exact API model alias before display.
Input: $1 per 1M tokens
Output: $5 per 1M tokens
Cache: $0.1 read; $1.25 5-minute write
Context: not disclosed
Best for: Low-cost automation, Test generation, Bug fixing

Lower token price does not guarantee lower task cost if retry rate rises; calculator must expose retry assumptions.

Pricing source: Anthropic Claude API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium

GPT-5.4 mini
OpenAI
Data status: partial evidence
Benchmark evidence: not publicly benchmarked on SWE-bench Verified · confidence low
Benchmark source: SWE-bench Leaderboards exact alias recheck · checked 2026-06-05
Exact SWE-bench row for GPT-5.4 mini was not verified. A public SWE-bench row exists for a different OpenAI Mini alias, so this row intentionally does not attach its numeric value.
Input: $0.75 per 1M tokens
Output: $4.5 per 1M tokens
Cache: $0.075 read; write not disclosed
Context: not disclosed
Best for: Low-cost automation, Test generation

OpenAI API pricing is source-backed for GPT-5.4 mini standard short-context pricing. Exact public coding benchmark for this alias is not verified, so benchmark evidence is marked not publicly benchmarked.

Pricing source: OpenAI API pricing · checked 2026-06-02 · confidence high
Benchmark source: SWE-bench Leaderboards exact alias recheck · checked 2026-06-05 · confidence low

Gemini 3 Flash — high reasoning
Google
Data status: partial
Benchmark evidence: 75.80% on SWE-bench Verified · confidence medium
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02
Captured benchmark row is Gemini 3 Flash — high reasoning; exact Google API pricing row for this exact model alias was not captured.
Input: not disclosed
Output: not disclosed
Cache: not disclosed
Context: not disclosed
Best for: Coding agents, Long context, Frontend generation

Strong captured SWE-bench result, but price/context must remain unknown until exact Gemini model docs are mapped.

Pricing source: Gemini Developer API pricing · checked 2026-05-28 · confidence low
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium

DeepSeek V4 Flash
DeepSeek
Data status: partial
Benchmark evidence: not publicly benchmarked (SWE-bench / Aider) · confidence low
Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28
Captured benchmark sources mention DeepSeek V3.2 / R1 variants, not exact DeepSeek V4 Flash.
Input: $0.14 per 1M tokens
Output: $0.28 per 1M tokens
Cache: $0.0028 read; write not disclosed
Context: 1,000,000 tokens
Best for: Low-cost automation, Chinese coding workflow, Long context

Excellent token price and context signal, but exact public coding benchmark row for V4 Flash was not captured; mark coding evidence as incomplete.

Pricing source: DeepSeek API Docs — Models & Pricing · checked 2026-05-28 · confidence high
Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28 · confidence low

DeepSeek V4 Pro
DeepSeek
Data status: partial
Benchmark evidence: not publicly benchmarked (SWE-bench / Aider) · confidence low
Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28
Captured benchmark sources mention DeepSeek V3.2 / R1 variants, not exact DeepSeek V4 Pro.
Input: $0.435 per 1M tokens
Output: $0.87 per 1M tokens
Cache: $0.0036 read; write not disclosed
Context: 1,000,000 tokens
Best for: Chinese coding workflow, Long context, Repo refactor

Pricing/context are source-backed; coding benchmark evidence for the exact V4 Pro model still needs source verification.

Pricing source: DeepSeek API Docs — Models & Pricing · checked 2026-05-28 · confidence high
Benchmark source: DeepSeek API Docs + public benchmark sources checked · checked 2026-05-28 · confidence low

Kimi K2.5 — high reasoning
Moonshot AI / Kimi
Data status: partial
Benchmark evidence: 70.80% on SWE-bench Verified · confidence medium
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02
Captured source names Kimi K2.5 — high reasoning.
Input: not disclosed
Output: not disclosed
Cache: not disclosed
Context: not disclosed
Best for: Chinese coding workflow, Coding agents, Repo refactor

Useful Chinese coding workflow candidate, but price/context must remain unknown until exact Moonshot pricing/model docs are captured.

Pricing source: Kimi API Platform pricing index · checked 2026-05-28 · confidence low
Benchmark source: SWE-bench Leaderboards · checked 2026-06-02 · confidence medium

Qwen3 235B A22B
Alibaba Cloud / Qwen
Data status: partial
Benchmark evidence: 59.6% on Aider polyglot coding benchmark · confidence medium
Benchmark source: Aider LLM Leaderboards · checked 2026-05-28
Captured source row: Qwen3 235B A22B diff, no think, Alibaba API.
Input: not disclosed
Output: not disclosed
Cache: not disclosed
Context: not disclosed
Best for: Chinese coding workflow, Low-cost automation

Benchmark-backed partial row only. Do not show price until exact Alibaba Cloud model pricing is captured.

Pricing source: Alibaba Cloud Model Studio pricing search result · checked 2026-05-28 · confidence low
Benchmark source: Aider LLM Leaderboards · checked 2026-05-28 · confidence medium

unknown value legend

Missing data is rendered as data

not disclosed

The provider or verified source did not publish this value.

not publicly benchmarked

This exact model was not found in the selected public benchmark source.

source needs recheck

A source exists, but exact model alias, price mode, or context value was not verified.

partial evidence

The row has useful data, but not enough to support a strong recommendation.
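One way to keep missing values visible is to make "unknown" a first-class value instead of an empty cell. The sketch below is an assumption about how that could be modeled, not the site's schema: the `Unknown` enum and `Field` class are hypothetical names. "partial evidence" is left out because the legend describes it as a property of the whole row, not of a single field.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Unknown(Enum):
    """Field-level labels from the legend, stored as data."""
    NOT_DISCLOSED = "not disclosed"
    NOT_PUBLICLY_BENCHMARKED = "not publicly benchmarked"
    SOURCE_NEEDS_RECHECK = "source needs recheck"


@dataclass(frozen=True)
class Field:
    """Holds either a verified number or the reason it is missing, never both."""
    value: Optional[float] = None
    unknown: Optional[Unknown] = None

    def __post_init__(self):
        # Exactly one of the two must be set, so a blank cell cannot exist.
        if (self.value is None) == (self.unknown is None):
            raise ValueError("set exactly one of value or unknown")

    def render(self, fmt="{:g}"):
        if self.value is not None:
            return fmt.format(self.value)
        return self.unknown.value


input_price = Field(value=0.14)                      # DeepSeek V4 Flash input price
context = Field(unknown=Unknown.NOT_DISCLOSED)       # e.g. a row with no verified context
print(input_price.render("${:g}"), "|", context.render())
```

With this shape, a renderer cannot silently print a zero or copy a value from a similar model: it either has a number or it has the label to show.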

methodology

No fake universal score

Benchmarks stay separate

This leaderboard does not combine SWE-bench, Aider, LiveCodeBench, arena scores, pricing tables, and usage signals into one fake universal score.
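Keeping benchmarks separate can be sketched as grouping results by benchmark and ordering rows only within each group. The `group_by_benchmark` helper and the tuple layout are assumptions for illustration. The scores are the ones listed in the table above.

```python
from collections import defaultdict

# (model, benchmark, score in percent), copied from the table on this page
RESULTS = [
    ("Claude Haiku 4.5", "SWE-bench Verified", 66.60),
    ("Claude Opus 4.5", "SWE-bench Verified", 76.80),
    ("Qwen3 235B A22B", "Aider polyglot", 59.6),
    ("Claude Sonnet 4.5", "SWE-bench Verified", 71.40),
]


def group_by_benchmark(results):
    """Group rows per benchmark and sort within each group only.

    Scores from different benchmarks are never averaged or placed
    on a shared scale.
    """
    groups = defaultdict(list)
    for model, benchmark, score in results:
        groups[benchmark].append((model, score))
    for rows in groups.values():
        rows.sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)


for benchmark, rows in group_by_benchmark(RESULTS).items():
    print(benchmark)
    for model, score in rows:
        print(f"  {model}: {score}%")
```

In this layout the Aider result for Qwen3 235B A22B is never compared against a SWE-bench Verified number, since the two measure different tasks.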

Official pricing preferred

Pricing data prefers official provider pages. Aggregator or route-specific pricing must be labeled before it is used for a calculator prefill.
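The preference rule can be sketched as a small selection function. Everything here is an illustrative assumption: the `PriceQuote` type, the `kind` labels, and the `pick_quote` name are hypothetical, and the aggregator quote is made-up example data, not a captured price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceQuote:
    source: str
    kind: str            # "official" or "aggregator" (hypothetical labels)
    input_per_1m: float
    output_per_1m: float


def pick_quote(quotes) -> Optional[PriceQuote]:
    """Prefer an official quote; fall back to a labeled aggregator quote.

    Quotes with any other label are ignored, so unlabeled pricing
    never reaches a calculator prefill.
    """
    for kind in ("official", "aggregator"):
        for quote in quotes:
            if quote.kind == kind:
                return quote
    return None


quotes = [
    # Hypothetical aggregator entry, included only to show the fallback order.
    PriceQuote("example-aggregator (hypothetical)", "aggregator", 3.2, 15.5),
    # Claude Sonnet 4.5 prices as listed in the table above.
    PriceQuote("Anthropic Claude API pricing", "official", 3.0, 15.0),
]

chosen = pick_quote(quotes)
print(chosen.source, "| label shown to user:", chosen.kind)
```

Returning the quote object, not only the numbers, keeps the source name and label attached to whatever the calculator displays.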

Speed stays unknown without a source

TTFT (time to first token) and tokens/sec are shown only when a public source exists for the exact model or route. Otherwise, speed is marked not disclosed.

Task cost is computed from assumptions

A cheaper token price can become expensive if retries, output length, cache behavior, or failure cleanup change the workflow.
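A small cost model makes this concrete. It is a sketch under assumptions that are not from this page: every attempt uses the same token counts, r is the average number of extra attempts per task, h is the fraction of input tokens served from cache, cache-write charges and failure cleanup are ignored, and all prices p are in USD per 1M tokens.

```latex
\[
C_{\text{attempt}} =
  \frac{T_{\text{in}}\bigl[(1-h)\,p_{\text{in}} + h\,p_{\text{read}}\bigr]
        + T_{\text{out}}\,p_{\text{out}}}{10^{6}},
\qquad
C_{\text{task}} = (1+r)\,C_{\text{attempt}}
\]
```

As an illustration with hypothetical token counts (T_in = 20,000, T_out = 2,000, h = 0) and the prices listed above, one attempt costs $0.03 on Claude Haiku 4.5 and $0.09 on Claude Sonnet 4.5. Under this model the cheaper model stays cheaper per task only while 1 + r_Haiku < 3 (1 + r_Sonnet). If Sonnet needs no retries, Haiku loses its price advantage once it averages more than two extra attempts per task.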

calculator bridge

Found a candidate model?

Open the calculator with this model prefilled and test your own workflow assumptions: input tokens, output tokens, retry rate, cache hit rate, and monthly task volume.
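The calculation behind those inputs might look like the sketch below. It is not the calculator's actual formula: the `Pricing` and `Workflow` types are hypothetical, retry rate is assumed to mean the average number of extra attempts per task, cache hit rate is assumed to be the fraction of input tokens billed at the cache-read price, and cache-write charges are ignored. The cache-read price is also assumed to be per 1M tokens, like the input and output prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per 1M tokens."""
    input_per_1m: float
    output_per_1m: float
    cache_read_per_1m: float


@dataclass(frozen=True)
class Workflow:
    input_tokens: int      # per attempt
    output_tokens: int     # per attempt
    retry_rate: float      # assumed: average extra attempts per task
    cache_hit_rate: float  # assumed: fraction of input tokens read from cache
    monthly_tasks: int


def cost_per_task(p: Pricing, w: Workflow) -> float:
    cached = w.input_tokens * w.cache_hit_rate
    fresh = w.input_tokens - cached
    attempt = (
        fresh * p.input_per_1m
        + cached * p.cache_read_per_1m
        + w.output_tokens * p.output_per_1m
    ) / 1_000_000
    # Each retry is assumed to cost as much as the first attempt.
    return attempt * (1 + w.retry_rate)


def monthly_cost(p: Pricing, w: Workflow) -> float:
    return cost_per_task(p, w) * w.monthly_tasks


# Claude Sonnet 4.5 prices from the table; the workflow numbers are hypothetical.
sonnet = Pricing(input_per_1m=3.0, output_per_1m=15.0, cache_read_per_1m=0.3)
workflow = Workflow(input_tokens=20_000, output_tokens=2_000,
                    retry_rate=0.2, cache_hit_rate=0.5, monthly_tasks=1_000)

print(f"per task: ${cost_per_task(sonnet, workflow):.4f}")
print(f"per month: ${monthly_cost(sonnet, workflow):.2f}")
```

Changing one assumption at a time (for example, raising `retry_rate` or lowering `cache_hit_rate`) shows which input the monthly figure is most sensitive to for a given workflow.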

What is an AI coding model leaderboard?

An AI coding model leaderboard compares language models for coding workflows such as coding agents, frontend generation, repo refactors, code review, bug fixing, and test generation. This page uses public benchmark evidence, pricing, context, caveats, and confidence labels instead of a single generic intelligence score.

What is the best LLM for coding agents?

There is no universal best LLM for coding agents. Start with models that have public coding benchmark evidence, source-backed API pricing, reliable context handling, and acceptable retry behavior for your workflow. Then estimate task cost before choosing a default model.

Why does the page show not disclosed or partial evidence?

Those labels protect the user from fake precision. If an exact price, context window, speed signal, or benchmark result was not verified from a public source, the page says so instead of copying values from a similar model or making an assumption.

Is token price the same as real task cost?

No. Real task cost depends on input length, output length, retries, tool calls, cache behavior, batch mode, failure rate, and human review time. Use benchmarks for evidence, then use task-cost assumptions for budgeting.
