evaluation · methodology · benchmarking

Why your eval benchmark is lying to you: regional variation as methodology

Most public LLM eval benchmarks run from a single origin and report a single score per model. Running the same benchmark from 10 regions surfaces variance that single-origin testing hides — and the variance is larger than the reported confidence intervals.

· Hamza Rahim · 5 min read

There is a structural gap in most public LLM evaluations. HELM, MMLU, BBH, MT-Bench, Arena — all of them report model scores as if the model is a pure function from prompt to response. For a growing fraction of the eval surface, it is not. The model is a function of prompt and the inference-time environment it is served from, which includes region-dependent routing, cached prompt prefixes, provider-side policy layers, and occasionally model weight versions that differ by region.

When your eval runs from one origin, you are measuring one instantiation of the model's response distribution. Running the same eval from 10 origins reveals variance that is sometimes larger than the confidence intervals the benchmark itself reports.

HELM, for context, measures seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across dozens of scenarios — but the standard runs do not vary the request origin. That is a methodology choice, defensible for a research benchmark, but it is not what end-user experience looks like.

The variance we see

We reran a subset of a public eval benchmark (we're withholding the name pending a longer writeup that the benchmark maintainers have asked to coordinate on) against GPT-4o and Claude 3.5 Sonnet from 10 SquadProxy residential origins: US, UK, DE, FR, JP, NL, CA, SG, KR, AU.

Three findings worth sharing:

1. Accuracy variance across origins was ~2.3× the reported CI

For tasks with clean ground truth — factual QA, coding, arithmetic — per-origin accuracy varied by more than double the confidence interval reported by the benchmark. Not a huge absolute delta — 1–3 percentage points — but structurally meaningful for a benchmark that distinguishes model releases by 1-point deltas.
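The comparison behind this finding is simple to express in code. Below is a minimal sketch of the check, using purely illustrative per-origin accuracy numbers (these are hypothetical placeholders, not the figures from our run, and `reported_ci_halfwidth` is a stand-in for whatever CI the benchmark publishes):

```python
import statistics

# Hypothetical per-origin accuracy for one model on one task --
# illustrative placeholders only, NOT our measured results.
per_origin_accuracy = {
    "US": 0.842, "UK": 0.851, "DE": 0.838, "FR": 0.845, "JP": 0.829,
    "NL": 0.847, "CA": 0.844, "SG": 0.833, "KR": 0.826, "AU": 0.839,
}
reported_ci_halfwidth = 0.005  # the benchmark's published +/- half-width

scores = list(per_origin_accuracy.values())
spread = max(scores) - min(scores)   # origin-to-origin range
stdev = statistics.stdev(scores)     # origin-to-origin standard deviation

# The question: does cross-origin spread exceed the confidence
# interval the benchmark reports for its single-origin run?
ratio = spread / (2 * reported_ci_halfwidth)
print(f"range={spread:.3f}, stdev={stdev:.4f}, range/CI-width={ratio:.1f}x")
```

If the ratio comes out above 1, the leaderboard's error bars are understating the uncertainty an end user actually experiences.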

2. Toxicity and bias metrics showed larger, directional variance

The spread on the safety-metric side was larger and directional: origins in countries with stronger content regulation (EU, UK) saw lower toxicity scores on average, not because the model was refusing more, but because it was returning more hedged outputs whose classifier-scored toxicity was lower. This is a case where the metric is correlated with the regulation as much as with the model.

3. "Refusal" behaviour is the most origin-sensitive metric

The largest origin-to-origin deltas we saw were on refusal rates for policy-adjacent prompts. Single-origin testing of refusal rates is, based on what we measured, underpowered to the point of being uninformative for models where refusal varies materially by region.
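Measuring this per origin is mechanically straightforward once responses are logged with their origin. Here is a sketch; the string-matching refusal detector is a deliberately crude stand-in (a real pipeline would use a trained refusal classifier), and the sample data is hypothetical, included only to show the output shape:

```python
from collections import defaultdict

# Crude heuristic stand-in for a real refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to", "i won't")

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate_by_origin(records):
    """records: iterable of (origin_country, response_text) pairs."""
    counts = defaultdict(lambda: [0, 0])  # origin -> [refusals, total]
    for origin, response in records:
        counts[origin][1] += 1
        if looks_like_refusal(response):
            counts[origin][0] += 1
    return {o: refusals / total for o, (refusals, total) in counts.items()}

# Hypothetical toy data, just to show the shape of the result.
sample = [
    ("US", "Sure, here is how you would do that..."),
    ("US", "I can't help with that request."),
    ("DE", "I cannot assist with this."),
    ("DE", "I cannot provide that information."),
]
print(refusal_rate_by_origin(sample))
```

The per-origin rates are what you compare: if they differ by more than what your replicate count can resolve, single-origin refusal numbers are not telling you much.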

What this implies for leaderboards

We don't think public leaderboards should be required to run multi-origin tests — the infrastructure is non-trivial and the maintainers have limited resources. But we do think:

  1. Published scores should carry a methodology note about the single-origin limitation.
  2. Model cards that claim regional behavioural consistency should show evidence, not assert it.
  3. Procurement evaluators should run origin-diverse follow-up tests on any score delta that matters to their decision. The infrastructure cost is modest — a SquadProxy residential subscription and a day of engineer time gets you to 10 origins.

A reference methodology

For teams wanting to run origin-diverse eval:

  • Pick 5–10 origins. Minimum: US, UK, DE, JP, and one of SG / KR / AU. The important thing is hemisphere and policy-regime diversity.
  • Use residential, not datacenter. Datacenter IPs route through the provider's "cloud-developer" stack and don't reflect end-user experience.
  • Per-request rotation at the proxy gateway. You want per-request IP-layer freshness, not per-session stickiness.
  • 10+ replicates per prompt per origin. A single shot per origin can't separate origin effects from ordinary temperature variance.
  • Record: exit IP, origin country, timestamp, full response, provider-returned model version string if exposed.
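The steps above reduce to a job matrix plus a per-request log schema. A minimal sketch, with the actual proxied API call left out (how you route through your proxy gateway and which provider SDK you use are deployment-specific) — the field names here are our suggestions, not a standard:

```python
import itertools
import time
import uuid

ORIGINS = ["US", "UK", "DE", "FR", "JP", "NL", "CA", "SG", "KR", "AU"]
REPLICATES = 10  # 10+ replicates per prompt per origin

def build_jobs(prompts, origins=ORIGINS, replicates=REPLICATES):
    """Full cross of prompt x origin x replicate. In practice, shuffle
    the job order so origin effects don't correlate with time of day."""
    return [
        {"prompt_id": pid, "origin": origin, "replicate": rep}
        for pid, origin, rep in itertools.product(
            range(len(prompts)), origins, range(replicates)
        )
    ]

def make_record(job, exit_ip, response_text, model_version=None):
    """One log row per request -- every field the methodology calls for."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "exit_ip": exit_ip,
        "origin_country": job["origin"],
        "prompt_id": job["prompt_id"],
        "replicate": job["replicate"],
        "response": response_text,
        "model_version": model_version,  # provider-returned string, if exposed
    }

jobs = build_jobs(prompts=["q1", "q2"])
print(len(jobs))  # 2 prompts x 10 origins x 10 replicates = 200
```

With the full response and exit IP logged per request, both the accuracy-variance and refusal-rate analyses above fall out of the same dataset.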

What this is not

We want to be clear about two things:

  • We are not claiming providers are deliberately shipping different model weights by region. The available evidence doesn't support that claim at the level of frontier public APIs. What does happen — and is publicly documented — is regional policy layers, regional routing to different inference stacks, and regional content-moderation fine-tunes.
  • We are not claiming single-origin eval is useless. It is correct for a lot of what benchmarks try to measure. It is underpowered for the slice of behaviour that is region-dependent, and that slice is larger than the leaderboard conversation currently assumes.

Going forward

Eval methodology is due for a refresh. Region-diversity is one axis; time-diversity (same model, same origin, sampled across days) is another; model-version-diversity (same model ID, provider cache clear) is a third. SquadProxy powers the region-diversity axis. The others are methodology work that the benchmark community will have to pick up.

References

  • Liang et al., "Holistic Evaluation of Language Models" (arXiv:2211.09110) — HELM baseline paper
  • HELM Lite current results (crfm.stanford.edu/helm/lite)
  • A longer writeup with full numbers will follow — subscribe via the footer if you want it.
