Proxies as methodology for multilingual LLM benchmarks
Multilingual LLM evaluation that uses only US-cloud-origin requests under-reports regional content policy and geo-dependent response divergence. A proxy layer anchored to each benchmark language's primary country is methodology, not infrastructure.
Reeya Patel · 6 min read
Multilingual evaluation has a methodology gap that's older than
it should be: benchmarks like MMLU-ProX, FLORES, Aya, and the
newer multilingual safety suites are typically run from a single
origin region (the lab's AWS account, usually us-east-1) even
when the benchmark explicitly tests regional competence. This
produces results that measure the model's language competence
but not its geo-anchored competence — which, for any
production deployment, is what the operator actually needs.
Proxies anchored to each benchmark language's primary country aren't infrastructure in this context; they're part of the evaluation methodology. This post walks through why, and how to wire the layer into an eval harness without breaking reproducibility.
What a single-origin eval misses
Commercial LLM APIs apply regional policy in three ways that matter for multilingual evaluation:
- Content policy per origin region. The same request in the same language may receive different refusal patterns if the origin IP is in the target country vs. US-cloud. This is documented behaviour for some APIs and undocumented for others; either way it's measurable.
- Regional routing at the inference layer. Providers increasingly route inference to region-local POPs when the origin is in-region. The routing can pick different model-serving fleets, different safety checkpoints, or different content filters applied per deployment.
- Retrieval-augmented responses (where present). For providers that add retrieval, the retrieval corpus is often region-specific. US-origin requests to a Japanese-language prompt may pull English-language retrieval sources; JP-origin requests pull JP sources.
A benchmark that ignores these three factors is measuring "model behaviour in the lab's home region" and calling it "multilingual competence." The gap between those two things is sometimes small, sometimes very large, and the sign of the gap differs by language-region pair.
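The first of those factors is the easiest to turn into a number. A minimal sketch of a per-origin refusal-rate comparison, assuming responses expose plain completion text; the refusal marker phrases are illustrative placeholders, not any provider's actual wording:

```python
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "this request violates",
)

def is_refusal(completion: str) -> bool:
    # Crude template match; a production harness would use a trained
    # refusal classifier rather than substring matching.
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(completions: list[str]) -> float:
    return sum(is_refusal(c) for c in completions) / len(completions)

def refusal_divergence(us: list[str], regional: list[str]) -> float:
    # Positive means the regional origin refuses more often than US-cloud
    # on the same prompt set; one reportable number per language-region pair.
    return refusal_rate(regional) - refusal_rate(us)
```

This reduces the content-policy factor to a signed scalar per language-region pair, which is what you want in a results table.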
The proxy layer as an evaluation variable
The methodological move is to treat origin region as an evaluation variable and run the benchmark at least twice: once from US-cloud (the baseline that matches most published numbers), and once from an origin in the target language's primary region. The delta is the regional policy effect, and it's reportable.
For MMLU-ProX running across 29 languages, these are the target origins that matter, in our experience:
- fr — France or Canada (Quebec)
- de — Germany
- es — Spain or Mexico (they differ)
- pt — Portugal or Brazil (they differ substantially)
- ja — Japan
- ko — Korea
- zh — Taiwan or Singapore (mainland China access is a separate problem, covered below)
- ar — UAE or Saudi Arabia
- hi — India
The residential proxy page lists the pool specifics. For multilingual eval, residential is the right class — the model APIs classify datacenter origins differently even when the IP geolocation is correct.
A minimal multi-origin eval harness
```python
import asyncio  # for asyncio.run(run(rows)) at the call site
import os

import httpx

ORIGINS = {
    "fr": "fr",
    "de": "de",
    "ja": "jp",
    "ko": "kr",
    # ... map benchmark language → country code
}

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"
TOKEN = os.environ["PROVIDER_API_KEY"]  # env var name is illustrative


async def eval_prompt(prompt: str, lang: str, model: str) -> httpx.Response:
    country = ORIGINS[lang]
    async with httpx.AsyncClient(
        proxy=PROXY,  # httpx >= 0.26; older versions take proxies={"https://": PROXY}
        headers={
            "X-Squad-Class": "residential",
            "X-Squad-Country": country,
            "X-Squad-Session": "sticky-10m",
        },
        timeout=httpx.Timeout(120.0),
    ) as client:
        return await client.post(
            f"https://api.provider.example/v1/{model}/complete",
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {TOKEN}"},
        )


async def eval_prompt_direct(prompt: str, model: str) -> httpx.Response:
    # Baseline: identical request with no proxy, so the origin is the
    # lab's home region (US-cloud in most published setups).
    async with httpx.AsyncClient(timeout=httpx.Timeout(120.0)) as client:
        return await client.post(
            f"https://api.provider.example/v1/{model}/complete",
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {TOKEN}"},
        )


async def run(benchmark_rows):
    # Each row has (prompt, language, expected). Run each against both
    # US-cloud (direct) and the target-country origin; emit both scores.
    # score(...) is the benchmark's own scorer.
    results = []
    for row in benchmark_rows:
        us_resp = await eval_prompt_direct(row.prompt, row.model)
        regional_resp = await eval_prompt(row.prompt, row.language, row.model)
        us_score = score(us_resp, row.expected)
        regional_score = score(regional_resp, row.expected)
        results.append({
            "prompt_id": row.id,
            "language": row.language,
            "us_origin_score": us_score,
            "regional_origin_score": regional_score,
            "regional_delta": regional_score - us_score,
        })
    return results
```
The sticky-10m session window is deliberate: multi-turn
evaluation needs consistent IP across the conversation, and
10-minute stickiness covers most eval turns without locking in
a single IP long enough to hit provider-side rate limits.
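For multi-turn runs, stickiness only helps if every turn of a conversation carries the same session value. A sketch of a per-conversation header builder; the assumption that the gateway keys stickiness on a distinct `X-Squad-Session` value per conversation is ours, not documented behaviour:

```python
def sticky_headers(country: str, conversation_id: str) -> dict[str, str]:
    # One session value per conversation: all turns share an exit IP
    # within the sticky window, and a new conversation gets a fresh IP.
    # The "sticky-10m-<id>" value format is an assumption for illustration.
    return {
        "X-Squad-Class": "residential",
        "X-Squad-Country": country,
        "X-Squad-Session": f"sticky-10m-{conversation_id}",
    }
```

Attach these headers to one `httpx.AsyncClient` per conversation and reuse that client for every turn.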
What to do with the delta
The two-column output (US-origin score, regional-origin score) makes the regional policy effect legible. For publication purposes, report both. For safety evaluation specifically, the regional column is usually the more honest measurement — that's the context your users actually hit.
A delta in the ~5-10% range is within test-retest noise on the same origin. A delta above ~15% is real and worth investigating. Deltas over 30% are almost always a content-policy or retrieval-source difference; they're measuring the model-as-deployed, not the model-as-weights.
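Those thresholds translate directly into a triage function for the results table; a sketch, assuming scores on a 0-100 scale (the bucket names are our own):

```python
def classify_delta(us_score: float, regional_score: float) -> str:
    # Buckets follow the thresholds above: <=10 points is test-retest
    # noise, >15 is a real effect, >30 is almost always content-policy
    # or retrieval-source divergence; 10-15 warrants a rerun.
    delta = abs(regional_score - us_score)
    if delta > 30:
        return "policy-or-retrieval"
    if delta > 15:
        return "real-investigate"
    if delta <= 10:
        return "noise"
    return "ambiguous-rerun"
```

Running this over every language row turns the two-column output into a ranked list of language-region pairs worth a closer look.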
Edge cases we've hit
Mainland China targets. Access to Chinese-language model APIs from a mainland-China residential IP requires separate compliance infrastructure and isn't well-served by a Western-hosted proxy network. Benchmark Chinese-language competence from Taiwan or Singapore residentials instead; note the limitation in the write-up.
Language-region dissonance. "Spanish" is not a single target
origin. es-MX and es-ES measure different things because
the deployed model applies different safety stacks per region.
If the benchmark design is language-level, pick one target
origin per language and stick with it across the full benchmark
run for reproducibility; note the region selection.
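That one-origin-per-language rule is easy to enforce mechanically; a sketch, assuming the harness logs each (language, country) pair it actually routed through:

```python
def check_origin_pinning(used: list[tuple[str, str]]) -> dict[str, str]:
    # Fail fast if any language was evaluated from more than one
    # country within a single benchmark run.
    pinned: dict[str, str] = {}
    for lang, country in used:
        if pinned.setdefault(lang, country) != country:
            raise ValueError(
                f"{lang} evaluated from both {pinned[lang]} and {country}"
            )
    return pinned
```

The returned mapping doubles as the region-selection note the write-up needs.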
Model provider region routing that doesn't honour the origin. Some model APIs route all requests to a single global region regardless of client origin. In that case the origin experiment shows no delta — which is itself a finding: that provider doesn't apply regional policy at the routing layer for your workload. Report it.
Reproducibility
For a multi-origin eval to be reproducible by a reviewer, the published harness needs:
- Exact country codes used per language
- Exit class used (residential, ISP, etc.)
- Session stickiness window
- Timestamp of the eval run (regional policy drifts on the order of weeks)
- Provider-side model version pins where the API exposes them
Without these, a future re-run diverges in ways that look like regression but are actually infrastructure drift. The safety red-team use case covers the adjacent case of geo-anchored red-teaming methodology, which shares the same reproducibility constraints.
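The checklist above can be emitted as a machine-readable run manifest alongside the scores; a sketch with field names of our own choosing, not any standard:

```python
import json
from datetime import datetime, timezone

def run_manifest(origins: dict[str, str], exit_class: str,
                 stickiness: str, model_versions: dict[str, str]) -> str:
    # Everything a reviewer needs to re-run the eval: country codes per
    # language, exit class, session window, run timestamp, model pins.
    return json.dumps({
        "origins": origins,
        "exit_class": exit_class,
        "session_stickiness": stickiness,
        "run_started_utc": datetime.now(timezone.utc).isoformat(),
        "model_versions": model_versions,
    }, indent=2, sort_keys=True)
```

Check the manifest into the same repository as the published numbers so a future re-run can diff infrastructure drift from genuine regression.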
Cost shape
Multi-origin eval is bandwidth-cheap (prompts + completions are small) but concurrency-heavy during a benchmark run. A run across 29 languages at 100 concurrent per language completes in under an hour on our Team plan's 1000-concurrent ceiling; see the pricing page for how the plans scale. The Lab plan adds BGP-dedicated prefixes per country where a research publication needs to cite infrastructure stability for the eval run.
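The concurrency shape is a per-language cap rather than raw fan-out; a sketch, assuming an async `eval_one(row)` coroutine as defined by your harness:

```python
import asyncio

async def run_language(rows, eval_one, per_language_limit: int = 100):
    # Bound in-flight requests for one language; 29 languages at 100
    # concurrent each exceeds a 1000-concurrent ceiling, so run
    # languages in batches that fit under the plan's limit.
    sem = asyncio.Semaphore(per_language_limit)

    async def bounded(row):
        async with sem:
            return await eval_one(row)

    # gather preserves input order, so results line up with rows.
    return await asyncio.gather(*(bounded(r) for r in rows))
```

With prompts and completions this small, the semaphore limit, not bandwidth, is what sets the wall-clock time of a run.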
What this is not
Multi-origin eval is not a way to bypass content policy. If a model refuses a prompt from one origin and answers it from another, both answers are real data; the methodology is about measuring the difference, not about routing around it. Our AUP is explicit that circumvention of platform-level access controls is out of scope for the network, and the workflows this post describes are entirely within intended-use for commercial model APIs that don't block residential origins.
Further reading
- Proxy infrastructure for RAG pipelines — adjacent topic, RAG ingestion with origin stability
- Residential vs datacenter for AI workloads — the routing matrix across a full AI pipeline
- Regional bias across 40 countries: a ChatGPT benchmark — an earlier piece on the same measurement problem for a specific model