LLM Gateway Bench · Benchmarking LLM Gateways

Benchmark Suite Overview

Two layers of evaluation, from "which model" to "which gateway"

Layer 1 · Model Benchmarks

Which model gives the best value?

Three views: cost for the same task and output for the same budget (value-for-money measurements) · public, authoritative benchmarks (hard scores with cited sources) · per-scenario capability (coding / math / science / knowledge).

Layer 2 · Gateway Benchmarks

Which gateway is trustworthy?

For the same model, relays and gateways differ in three ways: compliance and safety (behavior check) · whether the price is inflated (price matrix) · whether stability holds up (30-day time series). All three are measured black-box, not taken from the gateway's claims.

Method · Fully Open Source

Why trust it?

Probe scripts and decision thresholds live in the repository, and raw data is committed run by run, so every result is reproducible with your own key. There is no black-box weighted total score. The full analysis framework is in the knowledge base.

Gateway Leaderboard

Pick what matters most and the board re-ranks live by that dimension. There is no black-box total score, and every dimension uses a public definition.


Trust rating, three tiers. Direct Verified: passed cross-check verification. Claimed, Unverified: the gateway claims a source, but there is no cross-check evidence yet. Unverified: no claim, or the gateway refused testing. See the next section for details.

Trust & Compliance

Don't trust claims, look at behavior fingerprints; policy conclusions carry evidence links, and corrections via PR are welcome.

A channel's source is hard to prove directly, so this benchmark relies on a combination of behavior fingerprints: whether tool calls are stripped, whether streaming is genuine (per-chunk timing), whether usage is inflated (token counts recomputed locally and compared), and whether latency characteristics match the official endpoint. Multiple black-box-measurable signals form a composite picture, instead of taking the gateway's self-claims at face value. When conditions allow, a K2-style cross-check against the official API is added (a fixed request set comparing finish_reason F1 and schema validity rate). Data retention and operating entity are manually annotated from the original terms, with the annotation date attached.
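One possible reading of the two cross-check metrics is sketched below. The source does not say how the finish_reason F1 is computed, so this assumes a macro-averaged F1 that treats the official API's finish_reason as the reference label for each request in the fixed set. The names `macroF1` and `schemaValidityRate`, and the idea of checking required keys, are illustrative assumptions, not functions from the repository.

```javascript
// Hypothetical sketch: macro-averaged F1 of the gateway's finish_reason values
// against the official API's values for the same fixed request set.
function macroF1(official, gateway) {
  const labels = [...new Set([...official, ...gateway])];
  if (labels.length === 0) return 0;
  let sum = 0;
  for (const label of labels) {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < official.length; i++) {
      const ref = official[i] === label;
      const got = gateway[i] === label;
      if (ref && got) tp++;
      else if (!ref && got) fp++;
      else if (ref && !got) fn++;
    }
    const denom = 2 * tp + fp + fn;
    sum += denom === 0 ? 0 : (2 * tp) / denom;
  }
  return sum / labels.length;
}

// Share of structured outputs that parse as JSON objects and carry every
// required key (an assumed, simplified notion of "schema validity").
function schemaValidityRate(bodies, requiredKeys) {
  if (bodies.length === 0) return 0;
  let valid = 0;
  for (const body of bodies) {
    try {
      const parsed = JSON.parse(body);
      const isObject = parsed !== null && typeof parsed === "object";
      if (isObject && requiredKeys.every((k) => k in parsed)) valid++;
    } catch {
      // Unparseable output counts as invalid.
    }
  }
  return valid / bodies.length;
}
```

A gateway that silently rewrites `tool_calls` into `stop` would lower the F1 even when the text of the answers looks fine, which is why the two numbers are read together.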

Gateway	Channel Source	Cross-check	Prompt Retention	Used for Training	Operating Entity	Invoice	Evidence

Behavior Check

Black-box probing of "is it giving you the real thing": don't trust claims, measure it.

Every item is a black-box probe reproducible from a fixed script:

Model Echo: whether the response's model field matches the request (catches substitution)
Tool Calling: whether tool calls are stripped
Real Streaming: whether output is buffered and then dumped as fake streaming
CJK Integrity: whether Chinese output is corrupted (a quantization-degradation tell)
Context: whether markers in long text are silently truncated

Green = all probed models passed, red = some models failed, "—" = no decidable data yet.
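Three of these checks can be sketched as small pure functions. This is an illustration under stated assumptions, not the repository's probe code: the function names are hypothetical, the Model Echo check assumes that a trailing date suffix such as `-2024-07-18` still counts as a match, and the Context check assumes the probe asks the model to repeat markers planted in the long input.

```javascript
// Model Echo: compare the response's `model` field with the requested model.
// Assumption: a trailing "-YYYY-MM-DD" version suffix is not a substitution.
function modelEchoMatches(requested, responseModel) {
  if (typeof responseModel !== "string") return false;
  const withoutDate = responseModel.replace(/-\d{4}-\d{2}-\d{2}$/, "");
  return withoutDate === requested;
}

// Context: return the planted markers that the reply failed to repeat.
// A non-empty result suggests the long input was truncated somewhere.
function missingMarkers(markers, reply) {
  return markers.filter((m) => !reply.includes(m));
}

// Board cell: `results` holds one entry per probed model,
// true = passed, false = failed, null = not decidable.
function cellStatus(results) {
  const decided = results.filter((r) => r !== null);
  if (decided.length === 0) return "no-data"; // rendered as a dash on the board
  return decided.every(Boolean) ? "green" : "red";
}
```

Stripping only a date suffix, instead of accepting any prefix match, keeps `gpt-4o-mini` from passing as `gpt-4o`.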

Gateway	Model Echo	Tool Calling	Real Streaming	CJK Integrity	Context	Cache	Usage Fingerprint

These signals are not a presumption of guilt: a single failure may be network jitter or a limit of the model itself; what matters is consistent, repeated behavior. The Usage Fingerprint column shows characters per token. For the same model across gateways, an abnormally low value suggests token inflation and should be checked against the official baseline (the comparison lights up once the baseline is aligned).
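The characters-per-token fingerprint is simple arithmetic, sketched here with hypothetical names. The 0.8 tolerance is an assumption chosen for illustration; the actual decision thresholds are the ones written in the probe scripts.

```javascript
// Characters of returned text divided by the completion tokens the gateway
// billed. Counts Unicode code points so CJK text is not double-counted.
function charsPerToken(text, completionTokens) {
  if (!(completionTokens > 0)) return null; // usage missing: not decidable
  return [...text].length / completionTokens;
}

// Flag a gateway whose ratio falls well below the official baseline for the
// same model. `tolerance` is an assumed value, not the project's threshold.
function looksInflated(gatewayRatio, baselineRatio, tolerance = 0.8) {
  return gatewayRatio < baselineRatio * tolerance;
}
```

For example, if the official endpoint averages 4 characters per token and a gateway reports 2 for the same prompts, the gateway is billing roughly twice as many tokens for the same text.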

Pricing

USD / 1M tokens (input / output) · multiple = gateway price ÷ official price · official price from litellm

Price Index = the geometric mean of the gateway's multiples across all comparable models. Below 1.00 means cheaper than official; above 1.00 means more expensive. A suspiciously cheap gateway (<0.5×) usually means a reverse-engineered channel, so read the index alongside the trust rating.
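A minimal sketch of the index, assuming each multiple has already been computed as gateway price ÷ official price. The function name is hypothetical, and how input and output prices are combined into one multiple per model is not specified in the source, so that step is left out.

```javascript
// Geometric mean of price multiples, computed in log space so that
// one very expensive model does not dominate the way it would in an
// arithmetic mean. Non-positive or non-finite multiples are skipped.
function priceIndex(multiples) {
  const valid = multiples.filter((m) => Number.isFinite(m) && m > 0);
  if (valid.length === 0) return null;
  const meanLog = valid.reduce((sum, m) => sum + Math.log(m), 0) / valid.length;
  return Math.exp(meanLog);
}
```

The geometric mean treats a 0.5× model and a 2× model as cancelling out to 1.00, which matches the intuition of "half price here, double price there".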

Analysis Framework

How to judge whether a gateway is trustworthy: the methodology for reading the probe data

Behind every column on the board is a line of reasoning for making the call. The knowledge-base articles break down "is it the real model, is it overcharging, is it stable or about to disappear" into an actionable analysis framework, and explain how the platform measures and how you should read the results.


Self-Test Guide

Probe any gateway in three minutes with your own key

Not seeing the gateway you use on the board? Or want to verify for yourself whether it's giving you the real model? Add it to data/gateways.json and run the same probe scripts locally with your key — identical to the board, and reproducible.

# Requires Node ≥ 20; zero dependencies. Clone the repository:
git clone https://github.com/cuihuan/llm-gateway-bench && cd llm-gateway-bench

# Fastest: --url probes any OpenAI-compatible endpoint directly, no file edits, results to stdout:
PROBE_KEY=sk-... node probe/probe.mjs --url https://your-gateway.com --model gpt-4o-mini --samples 3

# To include the gateway in the board and the local dashboard, first add it to data/gateways.json, then run:
node probe/probe.mjs --gateway <id> --out data/results && npm run aggregate && npm run serve

The "behavior fingerprints" it measures

Whether tool calls are stripped · streaming authenticity (fake streaming shows TTFT ≈ total latency, with the whole output dumped at once) · usage recompute (abnormally low characters per token suggests inflation) · measured TTFT/throughput percentiles.
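The fake-streaming test and the percentile summary can be sketched as follows. Both functions are hypothetical illustrations: the 0.9 ratio threshold is an assumption, and the percentile uses the nearest-rank method, which may differ from what the probe scripts do.

```javascript
// `chunkTimesMs` holds the arrival time of each streamed chunk, in
// milliseconds since the request was sent. If the first chunk arrives
// almost as late as the last one, the output was buffered and dumped.
function streamingVerdict(chunkTimesMs, ratioThreshold = 0.9) {
  if (chunkTimesMs.length < 2) return null; // one chunk: not decidable
  const ttft = chunkTimesMs[0];
  const total = chunkTimesMs[chunkTimesMs.length - 1];
  if (!(total > 0)) return null;
  return ttft / total >= ratioThreshold ? "fake" : "real";
}

// Nearest-rank percentile over a list of measurements (for example TTFT).
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Reporting p50 and p95 instead of a mean keeps one slow outlier from hiding a gateway that is usually fast, or one fast sample from hiding a gateway that is usually slow.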

Why it's trustworthy

The probe prompts, sample counts, and decision thresholds are all written in the scripts, with a random string to defeat caching; raw results are committed run by run into data/results/, so anyone can reproduce the same conclusion with their own key. No black-box weighted total score.
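The cache-defeating random string might look like the sketch below. The function name and the prompt layout are assumptions for illustration; `Math.random` is used because the nonce only needs to be unique per request, not cryptographically secure.

```javascript
// Prefix each probe prompt with a fresh random token so that neither the
// gateway nor an upstream cache can answer from a previously stored reply.
function withNonce(prompt) {
  const nonce = Math.random().toString(36).slice(2, 12).padEnd(10, "0");
  return { nonce, prompt: `[run ${nonce}] ${prompt}` };
}

// Usage: two probes with the same base prompt produce different requests.
const first = withNonce("Reply with the single word OK.");
const second = withNonce("Reply with the single word OK.");
console.log(first.prompt !== second.prompt);
```

Recording the nonce alongside the raw result also makes it possible to match a committed result file to the exact request that produced it.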

Add your gateway to the board

Submit a PR editing data/gateways.json (fill in baseUrl / authEnv / probeModels). Once the maintainer adds the key in Secrets, the gateway automatically enters the 6-hourly probing and its data accumulates over time.
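Before opening the PR, an entry can be sanity-checked locally. The sketch below is an assumption about the entry's shape: only baseUrl, authEnv, and probeModels are named in the source, the `id` field is inferred from the `--gateway <id>` flag, and the overall layout of data/gateways.json is not shown here, so check the repository's existing entries before copying this.

```javascript
// Hypothetical example entry; every value is a placeholder.
const exampleEntry = {
  id: "my-gateway",                   // assumed: the id passed to --gateway
  baseUrl: "https://your-gateway.com",
  authEnv: "MY_GATEWAY_KEY",          // name of the env var holding the key
  probeModels: ["gpt-4o-mini"],
};

// Return a list of human-readable problems; an empty list means the three
// fields named in the guide are present and plausibly shaped.
function entryProblems(entry) {
  const problems = [];
  if (typeof entry.baseUrl !== "string" || !/^https?:\/\//.test(entry.baseUrl)) {
    problems.push("baseUrl must be an http(s) URL");
  }
  if (typeof entry.authEnv !== "string" || entry.authEnv === "") {
    problems.push("authEnv must name an environment variable");
  }
  if (!Array.isArray(entry.probeModels) || entry.probeModels.length === 0) {
    problems.push("probeModels must be a non-empty array");
  }
  return problems;
}
```

Note that authEnv holds the name of an environment variable, not the key itself, so no secret ever needs to be committed.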


Stability

Probed every 6 hours · errors classified by nature (429 ≠ 5xx ≠ timeout) · hourly profile reveals whether peak hours slow down
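Keeping the error classes separate matters because they point to different causes: a 429 is rate limiting, a 5xx is a failure on the gateway or upstream side, and a timeout is no answer at all. A minimal sketch, with hypothetical function names, category labels, and sample shape:

```javascript
// Classify one probe outcome. `status` is the HTTP status code, if any;
// `timedOut` is true when the request was aborted by the probe's timer.
function classifyOutcome({ status, timedOut }) {
  if (timedOut) return "timeout";
  if (status === 429) return "rate_limited";
  if (status >= 500 && status <= 599) return "server_error";
  if (status >= 200 && status <= 299) return "ok";
  return "other_error";
}

// Hourly profile: mean latency per UTC hour of day, keyed 0..23.
// Each sample is { ts: ISO-8601 timestamp, latencyMs: number }.
function hourlyMeanLatency(samples) {
  const buckets = new Map();
  for (const { ts, latencyMs } of samples) {
    const hour = new Date(ts).getUTCHours();
    const bucket = buckets.get(hour) ?? { sum: 0, n: 0 };
    bucket.sum += latencyMs;
    bucket.n += 1;
    buckets.set(hour, bucket);
  }
  const profile = {};
  for (const [hour, { sum, n }] of buckets) profile[hour] = sum / n;
  return profile;
}
```

With probes only every 6 hours, a single day fills at most four hourly buckets, so the profile becomes meaningful only once the data has accumulated over many days.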

Gateway Catalog

New gateway: submit data/gateways.json via PR to enter probing


Help Me Choose a Gateway

Check what matters to you and get a recommendation with reasons. Recommendations are aggregated by rank across the chosen dimensions, with every rank laid out and no hidden weighting.
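Rank aggregation without weights can be sketched as a plain rank sum. This is one plausible scheme, not necessarily the one the site uses: the function name, the metric keys (`priceIndex`, `p95LatencyMs`), and the tie handling (ties keep input order) are all assumptions.

```javascript
// For each chosen dimension, rank the gateways (1 = best), then order them
// by the sum of their ranks. Every per-dimension rank is kept in the result
// so the reasoning behind a recommendation stays visible.
function recommend(gateways, dimensions) {
  const rows = gateways.map((g) => ({ id: g.id, ranks: {}, rankSum: 0 }));
  for (const { key, higherIsBetter } of dimensions) {
    const sorted = [...gateways].sort((a, b) =>
      higherIsBetter ? b[key] - a[key] : a[key] - b[key]
    );
    sorted.forEach((g, i) => {
      const row = rows.find((r) => r.id === g.id);
      row.ranks[key] = i + 1;
      row.rankSum += i + 1;
    });
  }
  return rows.sort((a, b) => a.rankSum - b.rankSum);
}

// Usage with made-up numbers for three hypothetical gateways.
const ranked = recommend(
  [
    { id: "a", priceIndex: 0.9, p95LatencyMs: 800 },
    { id: "b", priceIndex: 1.2, p95LatencyMs: 400 },
    { id: "c", priceIndex: 1.5, p95LatencyMs: 900 },
  ],
  [
    { key: "priceIndex", higherIsBetter: false },
    { key: "p95LatencyMs", higherIsBetter: false },
  ]
);
console.log(ranked.map((r) => `${r.id}: ${JSON.stringify(r.ranks)}`).join("\n"));
```

Because only ranks are summed, a dimension cannot dominate through the size of its raw numbers, which is the property "no hidden weighting" asks for.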
