Same task, same budget: how much do models differ? This page brings together price-value, authoritative benchmarks, and per-scenario capability in one benchmark suite. For the gateway layer (compliance / price / safety / stability), see Gateway Benchmarks.
Pick a task or set custom input/output token counts to compute each model's cost and value in real time. Two tables below: the left (①) asks "what does this task cost," the right (②) asks "how many tokens does this budget buy."
① What this task costs
| Model | Input $/1M | Output $/1M | Task cost |
|---|---|---|---|
② How many tokens this budget buys
| Model | Input affordable | Output affordable |
|---|---|---|
Green highlight = cheapest / best value in that column. Prices are official list prices (asOf —) and change as vendors update them. Local / self-hosted open-source models are priced at 0 because you pay for the compute rather than per token, so the tokens a budget buys are treated as ∞. To change prices or add models, edit data/models.json.
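The arithmetic behind the two tables can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names and the example prices are hypothetical, and it assumes prices are quoted in dollars per 1M tokens, as in the column headers.

```python
import math


def task_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one task, given prices in $ per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def affordable_tokens(budget, price_per_m):
    """Tokens a budget buys at a $ per 1M price; a price of 0 is treated as infinity."""
    if price_per_m == 0:
        # Local / self-hosted models: no per-token price, so no token limit.
        return math.inf
    return budget / price_per_m * 1_000_000


# Hypothetical prices, for illustration only (not real list prices).
cost = task_cost(2_000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
output_budget = affordable_tokens(10.0, 15.0)
local_budget = affordable_tokens(10.0, 0.0)
```

Treating a zero price as infinity keeps the budget table well defined for self-hosted models without a division by zero; their real limit is compute, which this sketch does not model.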
Authoritative scores come from each model's technical report and public leaderboards. The table below records traceable representative scores (different sources / collection dates, annotated individually); for a complete, real-time comparison, see the dedicated aggregator leaderboards (external links below the table). PRs with sources are welcome.
See more authoritative / complete leaderboards →
Combines the authoritative scores above with price to answer "for the same money, who has the highest knowledge density." Only lists models that have both an MMLU-Pro score and a price; local / free models are limited by self-hosted compute and listed separately.
| Model | MMLU-Pro | Output $/1M | Knowledge points per dollar |
|---|---|---|---|
A rough quality-per-dollar measure, for order-of-magnitude reference only. High quality at a low price does not mean a model is right for your use case; read this figure alongside the per-scenario results and the gateway stability / compliance benchmarks.
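One plausible way to compute the "knowledge points per dollar" column is to divide the MMLU-Pro score by the output price. The page does not state the exact formula, so treat this as an assumption; the function name, model names, and numbers below are hypothetical.

```python
def knowledge_per_dollar(mmlu_pro, output_price_per_m):
    """MMLU-Pro points per dollar of output price ($ per 1M tokens).

    Returns None when the score is missing or the model has no positive
    price, mirroring the rule that such models are listed separately.
    """
    if mmlu_pro is None or output_price_per_m is None or output_price_per_m <= 0:
        return None
    return mmlu_pro / output_price_per_m


# Hypothetical (name, MMLU-Pro score, output $/1M) rows, not real data.
models = [
    ("model-a", 80.0, 10.0),
    ("model-b", 70.0, 2.0),
    ("local-model", 60.0, 0.0),
]

# Keep only models with a defined ratio, then sort by best value first.
ranked = sorted(
    ((name, knowledge_per_dollar(score, price)) for name, score, price in models),
    key=lambda row: row[1] if row[1] is not None else -1.0,
    reverse=True,
)
ranked = [row for row in ranked if row[1] is not None]
```

With these made-up rows, the cheaper model ranks first despite its lower score, which is exactly why the ratio should be read as an order-of-magnitude signal rather than a recommendation.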
A single total score misleads: being good at math does not mean being good at coding. Here the authoritative benchmarks are grouped by use case. Each scenario flags its key metric, then lists the performance of models with traceable scores. Scores share the same sources as the table above.
Scenario ↔ metric mapping: Coding → SWE-bench Verified · Scientific reasoning → GPQA · Math → AIME · General knowledge → MMLU-Pro. Comparable scores for vision / long context / agent tool calling are being collected (for gateway-side tool-call forwarding measurements, see Behavior Check).
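The scenario-to-metric mapping above can be expressed as a simple lookup. This is a sketch under stated assumptions: the mapping values come from the text, while the function name, the score layout, and the example scores are hypothetical.

```python
# Scenario -> key benchmark, as listed in the mapping above.
SCENARIO_METRIC = {
    "Coding": "SWE-bench Verified",
    "Scientific reasoning": "GPQA",
    "Math": "AIME",
    "General knowledge": "MMLU-Pro",
}


def rank_for_scenario(scenario, scores):
    """Rank models by the scenario's key metric, skipping models without a score.

    `scores` maps model name -> {benchmark name: score}; this layout is an
    assumption made for illustration.
    """
    metric = SCENARIO_METRIC[scenario]
    rows = [(name, s[metric]) for name, s in scores.items() if metric in s]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical scores, not real benchmark results.
example_scores = {
    "model-a": {"SWE-bench Verified": 60.0, "GPQA": 70.0},
    "model-b": {"SWE-bench Verified": 65.0},
}

coding_ranking = rank_for_scenario("Coding", example_scores)
```

Models lacking a traceable score for the scenario's metric are omitted rather than scored as zero, so a missing data point is never mistaken for poor performance.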