Evaluate LLMs from 10 countries to measure regional variation
GPT, Claude, Gemini, and open models respond differently depending on IP-layer geography. SquadProxy gives you evaluation origins across 10 countries on residential, ISP, and mobile exits so your eval methodology reflects real deployment conditions.
Why geography changes model output
Commercial LLM APIs apply region-dependent behaviour in at least three places:
- Content policy. Some topics — regional political figures, locally sensitive events, jurisdictional legal questions — receive different guardrail responses depending on request origin. Providers do not always document this; it surfaces only in evaluation.
- Routing and model variant. Large providers sometimes serve a regional variant of the same model ID. The weights may be identical but the inference stack, safety filter, and cached prompt prefix can vary.
- Implicit region anchoring. The same ambiguous prompt ("recommend a good lawyer for immigration") returns differently shaped answers when the request carries a US vs. UK vs. SG origin. The model infers locality from provider-attached metadata, not just prompt content.
For any AI team doing safety evaluation, regional bias measurement, or compliance documentation, running eval from a single origin under-samples reality.
Methodological requirements
A credible regional evaluation needs:
- IP-layer authenticity. Datacenter IPs are classified instantly by major API providers. Evaluation from AWS us-east-1 tells you about the policy that applies to "generic US-cloud developer traffic," not about what a Berlin-resident end-user sees. Residential or ISP exits are required for credible regional framing.
- Per-region replicates. A single request per region is not a measurement. Run 30–100 prompts per region across your eval set, and rotate exit IPs per request to avoid per-IP caching artifacts.
- Session hygiene. Don't carry session state across regions unless you are specifically testing stateful behaviour. Use per-request rotation at the SquadProxy gateway.
- Time diversity. Region-dependent routing can shift with provider-side load balancing. Sample across time of day in each region.
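The replicate and time-diversity requirements above can be pre-computed as a work schedule. A minimal sketch, assuming illustrative UTC sampling windows (`build_schedule` and `TIME_SLOTS` are our own, not a SquadProxy API):

```python
import itertools

COUNTRIES = ["us", "gb", "de", "fr", "jp", "nl", "ca", "sg", "kr", "au"]
TIME_SLOTS = ["00:00", "08:00", "16:00"]  # illustrative UTC windows

def build_schedule(prompt_ids):
    """Expand a prompt set into (prompt_id, country, time_slot) work
    items, spreading calls across time-of-day windows per region."""
    items = []
    for i, (pid, country) in enumerate(itertools.product(prompt_ids, COUNTRIES)):
        items.append((pid, country, TIME_SLOTS[i % len(TIME_SLOTS)]))
    return items
```

With a 30-prompt eval set this yields 300 work items — 30 per region, split across the three windows.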
A reference setup
```python
import os

import httpx

COUNTRIES = ["us", "gb", "de", "fr", "jp", "nl", "ca", "sg", "kr", "au"]
API_KEY = os.environ["MODEL_API_KEY"]  # your model provider key
PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"

def eval_prompt(prompt: str, model_endpoint: str, country: str) -> dict:
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "X-Squad-Class": "residential",
        "X-Squad-Country": country,
        "X-Squad-Session": "per-request",
    }
    # httpx >= 0.26 takes `proxy=`; older versions spell it `proxies=`
    with httpx.Client(proxy=PROXY, timeout=60) as client:
        resp = client.post(
            model_endpoint,
            json={"model": "...", "messages": [{"role": "user", "content": prompt}]},
            headers=headers,
        )
        resp.raise_for_status()
        return resp.json()
```
Record the exit IP, country, timestamp, and the full response. Regional-bias eval is a methodology problem; the proxy layer is prerequisite infrastructure.
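A minimal capture sketch — `record_result` and the field names are our own convention, not a SquadProxy API — appending one observation per JSON line so the run artifact stays replayable:

```python
import json
import time

def record_result(path, prompt_id, country, exit_ip, response, latency_ms):
    """Append one eval observation as a JSON line: exit IP, country,
    timestamp, and the full response, per the checklist above."""
    row = {
        "prompt_id": prompt_id,
        "country": country,
        "exit_ip": exit_ip,
        "timestamp": time.time(),
        "response": response,
        "latency_ms": latency_ms,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")
```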
Cellular-anchored variant
For workloads that target mobile-first deployments — voice AI, in-app agents — the eval origin needs to be cellular. Swap X-Squad-Class to 4g-mobile or 5g-mobile and the same country targeting applies. Expect higher latency and higher per-request cost; that overhead is non-optional if your downstream product runs on a mobile network.
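The class swap is a one-line header change; a sketch, reusing the header names from the residential example (`mobile_headers` is our own helper, not part of any SDK):

```python
def mobile_headers(country: str, generation: str = "4g") -> dict:
    """Build SquadProxy routing headers for a cellular exit;
    only X-Squad-Class differs from the residential setup."""
    if generation not in ("4g", "5g"):
        raise ValueError("generation must be '4g' or '5g'")
    return {
        "X-Squad-Class": f"{generation}-mobile",
        "X-Squad-Country": country,
        "X-Squad-Session": "per-request",
    }
```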
What a multi-origin eval run looks like in practice
A well-shaped multi-origin evaluation for a single model and a single 500-prompt benchmark across 10 origin regions comes to ~5,000 model calls. Under the Team plan's ceiling of 1,000 concurrent sessions, the run finishes in 20–60 minutes depending on per-call latency.
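The wall-time figure falls out of simple wave arithmetic. A back-of-envelope sketch — the 300 effective concurrency in the usage example assumes the per-origin cap of 30 from the rate-shape bullet, times 10 origins:

```python
def estimated_runtime_minutes(total_calls, concurrency, avg_call_seconds):
    """Model the run as sequential waves of `concurrency` calls,
    each wave lasting one average call (ceiling division for waves)."""
    waves = -(-total_calls // concurrency)
    return waves * avg_call_seconds / 60.0
```

5,000 calls at 300 effective concurrency and ~90 s per call comes out around 25 minutes, inside the 20–60 minute envelope.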
The operational shape:
- Prompt set — the benchmark, held constant across runs (MMLU-ProX, FLORES, HELM, a bespoke internal safety suite, etc.)
- Origin matrix — one row per (benchmark language, country origin, exit class). For MMLU-ProX at 29 languages, that's 29 country-language pairs plus the US-datacenter baseline column.
- Rate shape — each origin handles a fraction of the load; per-origin concurrency stays under 30 to avoid burning residential pool depth on a single eval.
- Capture — for each call, store (prompt_id, origin, country, exit_class, exit_ip, timestamp, response, latency_ms, score). This metadata is the artifact that supports later reproducibility.
- Replay — re-run any slice on demand. The exit_ip capture is informational only, since the residential pool turns over; the exit_class + country pair is the stable coordinate for replay.
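Building the origin matrix is a small amount of code. A sketch with a deliberately truncated language→country mapping (the full 29-language mapping is benchmark-specific; `origin_matrix` is our own helper):

```python
# Truncated illustration of a benchmark-language -> origin-country map
LANG_TO_COUNTRY = {"en": "us", "de": "de", "fr": "fr", "ja": "jp", "ko": "kr"}

def origin_matrix(lang_to_country, exit_class="residential"):
    """One row per (language, country, exit_class), prefixed with the
    US-datacenter baseline column."""
    rows = [("baseline", "us", "datacenter")]
    rows += [(lang, country, exit_class) for lang, country in lang_to_country.items()]
    return rows
```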
For the methodology side of this workflow, our multilingual LLM benchmark post covers the reproducibility requirements and the delta-threshold conventions.
Delta interpretation
The useful output of a multi-origin eval is not a single scalar score per region. It is a delta — how much does score change between the baseline origin (usually US-cloud) and the target origin (a country residential)?
Our working heuristics from running these evals at scale:
- Delta within ±5%: test-retest noise. Re-run to confirm. If the delta persists across two reruns, it's real but small.
- Delta between 5% and 15%: real but modest. Usually explained by content policy applying differently per region, or by retrieval-augmented responses pulling from region-local corpora.
- Delta above 15%: material. Worth a follow-up — usually either (a) explicit content policy that blocks or reshapes the response in one region and not another, or (b) a different safety checkpoint deployed per region. Both are worth documenting.
- Delta above 30%: usually means the model-as-deployed is different per region at the weights or safety stack level. This is reportable; providers don't always document it.
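The bands above translate directly into a triage function. A sketch, treating the delta as the relative score change against the baseline origin (`classify_delta` and the bucket labels are our own):

```python
def classify_delta(baseline_score, region_score):
    """Bucket a regional score into the interpretation bands:
    noise, modest, material, or likely-different-deployment."""
    delta_pct = abs(region_score - baseline_score) / baseline_score * 100
    if delta_pct <= 5:
        return "noise"
    if delta_pct <= 15:
        return "modest"
    if delta_pct <= 30:
        return "material"
    return "different-deployment"
```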
Publish the full matrix in your eval report, not just the summary stats. Reviewers and downstream users who look at the numbers need to see the shape, not just the headline.
Reproducibility requirements
For a multi-origin eval to stand up to academic or audit review, the publication needs to include:
- Exact country code used per benchmark language
- Exit class (residential / ISP / datacenter / 4G) per row
- Session stickiness window used
- Run window (start and end timestamps — regional routing drift happens on the order of weeks)
- Provider-side model version pin where the API exposes it
- Proxy vendor and pool-provenance posture
- Any rate-limit backoffs or retries, with counts
Without these, a future re-run diverges in ways that look like regression but are actually infrastructure drift.
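The checklist compresses naturally into a single machine-readable artifact committed alongside the scores. A sketch; the field names are our own convention, not a standard:

```python
import json
import time

def run_manifest(country_by_language, exit_class, session_window,
                 model_version=None, proxy_vendor="SquadProxy",
                 retry_count=0):
    """Serialize the reproducibility fields listed above; stamp the
    run start so routing drift over weeks is detectable on re-run."""
    manifest = {
        "country_by_language": country_by_language,
        "exit_class": exit_class,
        "session_stickiness": session_window,
        "run_started_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model_version": model_version,
        "proxy_vendor": proxy_vendor,
        "retry_count": retry_count,
    }
    return json.dumps(manifest, indent=2)
```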
Handling providers that don't route by origin
Not every model API applies regional policy at the routing layer. When an eval shows negligible delta across all country origins, the eval hasn't failed — it's produced a finding: this provider routes all requests to a single global region. That's useful information for procurement and safety teams; document it and move on.
Conversely, when an eval shows outsized delta on one specific country but not its neighbours, investigate the anomaly before publishing. The most common cause is a transient content-policy difference (e.g., a provider deploying a new safety filter to one region first). Re-run after 48-72 hours; if the delta persists, it's structural; if it disappears, it was a deployment transient.
Common anti-patterns
Single-origin evaluation, published as multi-origin. The benchmark covers 29 languages but all requests come from us-east-1. This is a language-competence measurement, not a regional measurement. Some labs publish this as multilingual and the distinction gets lost downstream.
Origin rotation within a single benchmark run. If different benchmark items use different origins because of pool-rotation timing, you can't interpret the result — the signal you measured is an average over uncontrolled origins. Pin origin per item and document it.
Insufficient per-region sample size. A single prompt per region is not a measurement. Run 30–100 per region at minimum, and report the variance alongside the mean.
Ignoring the exit class. Running ISP on some languages and residential on others produces a systematic bias — ISP classifies differently from residential at the target's content policy layer. Pin exit class for the full run; report it.
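The sample-size and variance points reduce to a small summary step. A sketch using the standard library (`region_summary` is our own helper):

```python
import statistics

def region_summary(scores_by_region, min_n=30):
    """Per-region mean and stdev, flagging regions below the
    30-replicate floor as underpowered."""
    summary = {}
    for region, scores in scores_by_region.items():
        summary[region] = {
            "n": len(scores),
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "underpowered": len(scores) < min_n,
        }
    return summary
```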
Pool and country coverage for eval
SquadProxy's 10-country focus covers the languages that matter for most production multilingual workloads: US (en), GB (en), DE (de), FR (fr), JP (ja), NL (nl), CA (en/fr), SG (en/zh), KR (ko), AU (en). For evaluation sets that need coverage outside these (hi, ar, pt-BR, es-MX, etc.), we recommend pairing SquadProxy with a broader-coverage vendor — our depth is in the 10 markets we've instrumented for ASN depth and eval-grade latency SLOs, not in thin coverage of 190 more.
For the specific benchmark-to-country mapping recommended for MMLU-ProX and similar multilingual suites, the multilingual benchmark post has the working defaults.
Cost shape
Evaluation is bandwidth-light and concurrency-heavy. A typical 29-language MMLU-ProX run uses under 2 GB of residential bandwidth total — the bills come from concurrency, not traffic. The Team plan's 1,000-concurrent ceiling covers most eval teams; the Lab plan's 3,000-concurrent ceiling covers continuous evaluation pipelines that run multiple benchmarks per day. See pricing.
Further reading
- Multilingual LLM benchmark methodology
- Residential vs datacenter routing matrix
- Safety and red-team testing — adjacent methodology, similar reproducibility needs
- Regional bias in ChatGPT across 40 countries — an earlier internal benchmark using this methodology
Pricing
Pricing for LLM evaluation across regions
Every plan carries every exit class — pick the one whose bandwidth envelope fits your workload.
Solo
For individual researchers running evaluation scripts and prototype RAG pipelines.
$149/month
or $1,430/year (save 20%)
50 GB residential · unlimited datacenter · 200 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 50 GB residential · unlimited datacenter
- ✓ 5 static ISP IPs · 5 GB 4G mobile
- ✓ 1 seat · 200 concurrent sessions
- ✓ Python + Node SDK + REST API
- ✓ Per-request metering (not time-based)
- ✓ Email support (24h response, business days)
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- Solo researchers
- Evaluation scripts
- Prototype RAG
Team
Most popular
For AI startups and mid-size labs splitting capacity between training and evaluation.
$699/month
or $6,710/year (save 20%)
500 GB residential · unlimited datacenter · 1,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 500 GB residential · unlimited datacenter
- ✓ 25 static ISP IPs · 25 GB 4G mobile
- ✓ 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
- ✓ City-level geo-routing + ASN targeting
- ✓ 99.9% uptime SLA
- ✓ Priority Slack support (4h response, business hours)
- ✓ Python + Node SDK + REST API + webhooks
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- AI startups
- Mid-size labs
- Model eval teams
Lab
For academic labs, eval consortia, and frontier model companies running sustained workloads.
$2,999/month
or $28,790/year (save 20%)
2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 countries on 4 continents
- ✓ 2 TB residential · unlimited datacenter
- ✓ 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
- ✓ 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
- ✓ Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
- ✓ 99.95% uptime SLA
- ✓ Dedicated Slack channel (1h response, business hours)
- ✓ Custom BGP prefix on request (additional fees apply)
- ✓ Overage: $2.50/GB residential · $5/GB mobile
Best for
- Academic labs
- Large eval consortia
- Frontier model companies
Enterprise
Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.
Custom pricing
Custom (from 5 TB/mo residential) · unlimited concurrent sessions
- ✓ Volume pricing from 5 TB/mo residential
- ✓ Dedicated BGP prefix + ASN announcement
- ✓ Unlimited concurrent sessions · unlimited seats
- ✓ 99.99% uptime SLA with financial credits
- ✓ Named Technical Account Manager + 24/7 on-call paging
- ✓ Custom AUP, DPA, on-site deployment option
- ✓ Research / academic discount (30–50% off Team or Lab)
- ✓ Annual contract · wire, ACH, USDC/USDT/BTC settlement
Best for
- Frontier labs
- Eval consortia
- Enterprise AI
All plans include 14-day refund, single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100%; traffic continues only if overage billing is enabled on your account.
Ship on a proxy network you can actually call your ops team about
Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.