Model Benchmarks · Benchmark Suite

Same task, same budget: how much do models differ? This page brings together price-value measurements, authoritative benchmarks, and per-scenario capability in one benchmark suite. For the gateway layer (compliance / price / safety / stability), see Gateway Benchmarks.

Price-Value Measurements

Same task, compare cost · same budget, compare output · prices are official list prices, maintained in data/models.json

Pick a task or set custom input/output tokens to compute each model's cost and value in real time. Two tables below: the left asks "what does this task cost," the right asks "how many tokens does this budget buy."
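The "what does this task cost" calculation can be sketched as below. This is a minimal illustration, not the page's actual calculator: it assumes cost is input tokens times the input price plus output tokens times the output price, each per 1M tokens. The `ModelPrice` type, the model names, and the prices are hypothetical placeholders, not values from data/models.json.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical price record; the real schema of data/models.json may differ."""
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def task_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one task with the given token counts."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / TOKENS_PER_MILLION


# Placeholder prices for illustration only, not real list prices.
prices = [
    ModelPrice("model-a", 2.00, 8.00),
    ModelPrice("model-b", 0.50, 1.50),
]

# Cost of the same task (10k input tokens, 2k output tokens) on each model.
costs = {p.name: task_cost(p, input_tokens=10_000, output_tokens=2_000) for p in prices}

# The cheapest model is the one that would get the green highlight.
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    marker = " (cheapest)" if name == cheapest else ""
    print(f"{name}: ${cost:.4f}{marker}")
```

With these placeholder prices, the task costs $0.036 on model-a and $0.008 on model-b.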

⚠ This page requires a local server: npm run serve → http://localhost:8080/evals.html. When opened directly via file://, the browser blocks the data module and the calculator won't load.

① What this task costs

Model | Input $/1M | Output $/1M | Task cost

② How many tokens this budget buys

Model | Input affordable | Output affordable

Green highlight = cheapest / best value in that column. Prices are official list prices as of the page's asOf date and change as vendors update them. Local / self-hosted open-source models are priced at 0 (you pay for the compute instead), so the tokens they can buy are treated as ∞. To change prices or add models, edit data/models.json.
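The budget-to-tokens conversion, including the zero-price case, might look like the sketch below. It assumes each column spends the whole budget on one side (all input or all output), which is a guess about how the page computes the two columns. The function names are hypothetical.

```python
import math

TOKENS_PER_MILLION = 1_000_000


def affordable_tokens(budget_usd: float, price_per_million: float) -> float:
    """Tokens a budget buys at a given $/1M price.

    A price of 0 (local / self-hosted model) yields infinity, since the
    per-token list price no longer limits usage.
    """
    if price_per_million == 0:
        return math.inf
    return budget_usd / price_per_million * TOKENS_PER_MILLION


def format_tokens(tokens: float) -> str:
    """Render a token count for display, using the infinity sign for unbounded values."""
    return "∞" if math.isinf(tokens) else f"{int(tokens):,}"


# Placeholder prices for illustration only.
paid = affordable_tokens(budget_usd=10.0, price_per_million=2.0)
local = affordable_tokens(budget_usd=10.0, price_per_million=0.0)

print(format_tokens(paid))   # 5,000,000
print(format_tokens(local))  # ∞
```

Returning `math.inf` rather than raising on division by zero keeps local models sortable in the same column as paid ones.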

Authoritative Benchmarks

Hard scores from public evaluations · each with a source · missing scores recorded as "—", not invented

Authoritative scores come from each model's technical report and public leaderboards. The table below records traceable representative scores (different sources / collection dates, annotated individually); for a complete, real-time comparison, see the dedicated aggregator leaderboards (external links below the table). PRs with sources are welcome.

See more authoritative / complete leaderboards →

Quality per Dollar

Composite knowledge score (MMLU-Pro) ÷ output price = points bought per dollar; higher is better value

Combines the authoritative scores above with price to answer "for the same money, who has the highest knowledge density." Only lists models that have both an MMLU-Pro score and a price; local / free models are limited by self-hosted compute and listed separately.

Model | MMLU-Pro | Output $/1M | Knowledge points per dollar

A rough quality-per-dollar measure, for order-of-magnitude reference only: high quality at a low price ≠ right for your use case. Read it alongside the per-scenario benchmarks and the gateway stability / compliance results.
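As a rough sketch of the quality-per-dollar figure described above: divide the MMLU-Pro score by the output price per 1M tokens, and skip any model missing a score or a price. The row layout, model names, and numbers are hypothetical placeholders, not real benchmark scores or list prices.

```python
from typing import Optional


def points_per_dollar(mmlu_pro: Optional[float],
                      output_per_million: Optional[float]) -> Optional[float]:
    """MMLU-Pro points per dollar of output price ($/1M tokens).

    Returns None when the score or price is missing, or when the price is 0
    (local / free models are listed separately rather than ranked here).
    """
    if mmlu_pro is None or output_per_million is None:
        return None
    if output_per_million == 0:
        return None
    return mmlu_pro / output_per_million


# (name, MMLU-Pro score, output $/1M): placeholder values for illustration only.
rows = [
    ("model-a", 80.0, 8.00),
    ("model-b", 60.0, 1.50),
    ("model-c", None, 4.00),      # no published score: excluded, not invented
    ("local-model", 70.0, 0.0),   # free to call: listed separately
]

ranked = sorted(
    ((name, value) for name, score, price in rows
     if (value := points_per_dollar(score, price)) is not None),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.1f} points per dollar")
```

In this placeholder data the lower-scoring model-b ranks first (40.0 against 10.0), which is why the figure should not be read on its own.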

Per-Scenario Benchmarks

Pick a model by real use case: look at the column for the work you care about

A single total score misleads: good at math doesn't mean good at coding. Here authoritative benchmarks are grouped by use case — each scenario flags its key metric, then lists the performance of models with traceable scores. Scores share the same sources as the table above.

Scenario ↔ metric mapping: Coding → SWE-bench Verified · Scientific reasoning → GPQA · Math → AIME · General knowledge → MMLU-Pro. Comparable scores for vision / long context / agent tool calling are being collected (for gateway-side tool-call forwarding measurements, see Behavior Check).
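The scenario ↔ metric mapping above can be expressed as a lookup table, with models ranked on the key metric for the chosen scenario. This is a sketch only: the dictionary keys follow the mapping in the text, while the model names, scores, and the `rank_for_scenario` helper are hypothetical. Missing scores are stored as `None` (shown as "—" on the page) and are skipped rather than filled in.

```python
from typing import Dict, List, Optional, Tuple

# Key metric for each scenario, as listed in the mapping above.
SCENARIO_METRIC: Dict[str, str] = {
    "coding": "SWE-bench Verified",
    "scientific reasoning": "GPQA",
    "math": "AIME",
    "general knowledge": "MMLU-Pro",
}

# Placeholder scores for illustration only; None means no traceable score.
scores: Dict[str, Dict[str, Optional[float]]] = {
    "model-a": {"SWE-bench Verified": 50.0, "GPQA": 60.0, "AIME": None, "MMLU-Pro": 80.0},
    "model-b": {"SWE-bench Verified": 65.0, "GPQA": None, "AIME": 70.0},
}


def rank_for_scenario(
    scenario: str,
    model_scores: Dict[str, Dict[str, Optional[float]]],
) -> List[Tuple[str, float]]:
    """Rank models on the scenario's key metric, best first.

    Models without a score for that metric are left out of the ranking.
    """
    metric = SCENARIO_METRIC[scenario]
    rows = [
        (name, by_metric[metric])
        for name, by_metric in model_scores.items()
        if by_metric.get(metric) is not None
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for scenario in SCENARIO_METRIC:
    print(scenario, "->", rank_for_scenario(scenario, scores))
```

Ranking per scenario makes the earlier point concrete: in this placeholder data model-b leads on coding while only model-a has a general-knowledge score at all.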