
Residential vs datacenter proxy for AI workloads: a routing matrix

Most AI teams over-index on residential proxies and pay too much for coverage they don't need. The useful question isn't residential-vs-datacenter; it's which source class goes through which exit class. A practical routing matrix for training, RAG, and evaluation pipelines.

· Reeya Patel · 8 min read

The residential-vs-datacenter question is the wrong framing. Most AI teams who ask it end up either paying residential prices for workloads that datacenter would serve fine, or hitting silent-gap failures on the 10–15% of sources that geoblock cloud ASNs. Neither outcome is subtle; both are expensive.

The useful question is routing: for each source class in your pipeline, which exit class fits? The answer is almost always a mix, and the mix is more stable than most teams expect once you classify sources once.

The three numbers that decide everything

Every proxy decision reduces to three ratios:

  1. Success rate — what fraction of requests through this exit class return a useful response for this target class
  2. Cost per GB — what you pay for the bandwidth
  3. Latency — added round-trip over direct

For a training corpus pull from arXiv or Common Crawl through AWS us-east-1, datacenter delivers ~99% success, sub-$0.10/GB on a committed tier, and under 10ms of added latency. The same traffic through a residential pool costs $2–8/GB with no success-rate improvement — the target doesn't care about ASN — and adds 100–300ms of latency. That's paying more for a worse result, and it happens constantly.

The flip case: pulling a regional news site through AWS us-east-1 gets a 403 or a degraded "you're a bot" page, with success around 40%. The same request through a residential pool anchored to the target's country succeeds 95%+ at a reasonable rate. Sticking with datacenter there isn't frugal; it's a silent content gap in your corpus that only surfaces months later when a downstream eval shows holes.
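One way to make the comparison concrete: divide the price per GB by the success rate to get a cost per useful GB (latency is a separate penalty). A minimal sketch using the rough figures above, which are estimates from this article rather than measurements:

```python
# Effective cost per useful GB: price paid divided by the fraction of
# requests that return usable content. All figures are the rough
# estimates quoted above, not benchmarked numbers.
def cost_per_useful_gb(price_per_gb: float, success_rate: float) -> float:
    return price_per_gb / success_rate

# arXiv-class target: datacenter wins outright.
dc_arxiv = cost_per_useful_gb(0.10, 0.99)    # ~$0.10/GB
res_arxiv = cost_per_useful_gb(4.00, 0.99)   # ~$4.04/GB, no success gain

# Regional-news target: datacenter's sticker price is an illusion.
dc_news = cost_per_useful_gb(0.10, 0.40)     # $0.25/GB, plus a 60% content gap
res_news = cost_per_useful_gb(4.00, 0.95)    # ~$4.21/GB, but the corpus is complete
```

The cost-per-useful-GB lens still understates the datacenter problem on hostile targets, since the 60% of failed requests aren't just wasted spend but missing corpus.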

What each class actually is

Datacenter

IPs announced from cloud provider ASNs — AWS (AS16509), Google Cloud (AS15169), Azure (AS8075), or a handful of neutral IaaS providers. The ASN is the tell: the moment a target looks up your originating IP and sees a cloud ASN, it knows you're not a consumer. Most of the open web doesn't care: arXiv publishes bulk access endpoints precisely because bulk programmatic access from cloud is expected. GitHub's clone endpoints, Wikimedia dumps, and Common Crawl's index lookups all tolerate cloud-origin traffic at scale.

For the AI workloads that make up the vast majority of training corpus collection, this is the default and should stay the default. See the datacenter proxy page for the pool specifics.

Residential

IPs announced under real ISP ASNs — Comcast, Charter Spectrum, AT&T, Deutsche Telekom, Orange, KDDI. The IPs are assigned to real subscribers' home routers; the proxy pool routes your request through those devices (with the subscriber's consent, in a well-run pool). The ASN matches a normal consumer pattern, so targets that filter on "cloud vs consumer" don't bounce the request.

Residential is the right class for four narrower cases:

  1. The target geoblocks cloud ASNs (common on regional news, some government sites, some B2B SaaS with strict trust tiers)
  2. The workload is regional evaluation — testing how a commercial LLM responds when the request appears to come from a specific country or city, where the provider's policy layer classifies datacenter IPs differently
  3. You need training corpus diversity, specifically the long tail of regional content that cloud-origin crawlers already under-sampled in Common Crawl
  4. Competitive AI intelligence (scraping public model outputs, Gradio spaces, model cards) where platforms apply different rate limits to bulk cloud traffic

See the residential proxy page.

ISP

ISP proxies (sometimes "static residential") are a hybrid: datacenter-hosted IPs that announce under residential ASNs. You get datacenter speed and uptime with residential ASN classification. The tradeoff is they're static — no rotation — so they suit long-session workloads: multi-turn agent evaluation, authenticated RAG ingestion, cookie-persistent scraping.

Most AI teams don't need ISP as a separate class until they hit a specific case: an eval harness that runs 50 turns per session, a RAG source that requires login and maintains per-IP rate limits, or an agent benchmark that expects stable session state. See the ISP proxy page.

Mobile (4G / 5G)

Live carrier SIMs on Verizon, AT&T, T-Mobile, EE, Vodafone, DoCoMo. The IPs announce under CG-NAT cellular ranges — the same ranges that roughly 60% of mobile users worldwide share. For AI workloads the use case is narrow: cellular-anchored evaluation, where the hypothesis is that a model API or a target service returns a different response to cellular origin than to fixed-line origin. If you're not testing that hypothesis, mobile is overkill. See 4G mobile and 5G mobile.

The routing matrix

This is what a working AI data pipeline's proxy routing actually looks like. The source classes don't change often; the exit class per source is typically stable across many months of collection.

| Source class | Example targets | Primary exit | Fallback | Notes |
|---|---|---|---|---|
| Open-web archives | arXiv, Common Crawl, Wikimedia | Datacenter | | ~60% of corpus volume; cloud ASN expected |
| Public code / data | GitHub clones, HF datasets, Kaggle | Datacenter | ISP if auth'd | ~20% of volume; HF rate-limits per IP, ISP helps |
| Regional news | Le Monde, Le Soir, SCMP, etc. | Residential | | Geoblocks cloud; anchor residential to target country |
| Regional government | .gov, .gob, .gouv | Residential | Datacenter | Mixed — depends on jurisdiction |
| Commercial LLM APIs (eval) | OpenAI, Anthropic, Google | Residential | ISP for sessions | Regional eval requires authentic IP; ISP for multi-turn |
| Model platforms | HF Spaces, Gradio public, leaderboards | Datacenter | Residential on rate-limit | Most tolerate cloud; switch on 429 |
| Auth'd SaaS (RAG sources) | Notion exports, Slack archives, Confluence | ISP | | Static session essential |
| Mobile-anchored evals | Model APIs from cellular | 4G/5G | | Only when testing cellular-specific behavior |

A training-heavy pipeline pulling public corpus at TB scale will sit at ~80% datacenter / ~15% residential / ~5% ISP+mobile by volume. An evaluation-heavy pipeline running regional benchmarks inverts to ~30% datacenter / ~60% residential / ~10% ISP+mobile. Both are normal.
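Those mixes translate directly into a blended bandwidth rate once per-class prices are plugged in. A sketch with illustrative placeholder prices, not quoted rates:

```python
# Blended $/GB for the two pipeline shapes above.
# PRICE_PER_GB values are illustrative placeholders, not quoted rates.
PRICE_PER_GB = {"datacenter": 0.08, "residential": 4.00, "isp_mobile": 1.50}

def blended_rate(mix: dict[str, float]) -> float:
    # mix maps exit class -> share of total volume (shares sum to 1.0)
    return sum(PRICE_PER_GB[cls] * share for cls, share in mix.items())

training_mix = {"datacenter": 0.80, "residential": 0.15, "isp_mobile": 0.05}
eval_mix = {"datacenter": 0.30, "residential": 0.60, "isp_mobile": 0.10}

training_rate = blended_rate(training_mix)    # $0.739/GB blended
evaluation_rate = blended_rate(eval_mix)      # $2.574/GB blended
```

Even at these made-up prices the point holds: the eval-heavy pipeline pays several times the blended rate of the training-heavy one, because residential share dominates the cost, not total volume.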

A minimal per-source router

The pragmatic implementation is a source classifier at ingest time that tags each URL with an exit class. One endpoint, one gateway header, the routing decision stays in your pipeline.

import httpx

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"

SOURCE_CLASS = {
    "arxiv.org":            "datacenter",
    "github.com":           "datacenter",
    "data.commoncrawl.org": "datacenter",
    "huggingface.co":       "isp",         # rate-limited
    "lemonde.fr":           "residential", # geoblocks cloud
    "api.openai.com":       "residential", # regional eval
}

def fetch(url: str) -> httpx.Response:
    host = httpx.URL(url).host
    cls = SOURCE_CLASS.get(host, "datacenter")
    headers = {
        "X-Squad-Class": cls,
        "X-Squad-Session": "sticky-10m" if cls == "isp" else "per-request",
    }
    # proxy and http2 are client-level options in httpx; the top-level
    # httpx.get() accepts neither. http2=True needs the httpx[http2] extra.
    with httpx.Client(proxy=PROXY, http2=True, timeout=30) as client:
        return client.get(url, headers=headers)

The SOURCE_CLASS map grows with your corpus and stabilises once you've classified the top 200 hosts. Re-scrape runs pick up the map unchanged; cost per source is predictable.

Common mistakes

Using residential for everything because "it's more reliable." Residential is more reliable against a hostile target, not universally. Paying residential rates to pull arXiv through real home broadband is waste and adds latency that matters at corpus scale.

Using datacenter for everything because "it's cheaper." See the silent-gap failure mode described earlier. A quiet 5% content loss on regional sources is invisible until it surfaces as a multilingual-eval regression three releases later.

Mixing exit classes per source over time. A RAG index that sees the same URL from three origins over 18 months accumulates three near-duplicate documents with different canonicals. Pick an exit class per source and keep it for the life of the source in your corpus. We wrote more about this in proxy infrastructure for RAG pipelines.

Rotating mid-session. For any workload that holds state (logged-in scrape, multi-turn eval, agent benchmark), rotation breaks session cookies and confuses the target's rate limiter. Stick on ISP for those; rotate on residential/datacenter for everything else.
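Both mistakes above (mixing exit classes per source, rotating mid-session) have the same cheap fix: derive the session value deterministically from the source, so every run routes the same way. A sketch; the tag format is an assumption, not a documented gateway API:

```python
import hashlib

# Derive a stable session value per (host, account) pair, so today's
# run and a re-scrape next year send the same session tag to the
# gateway and keep the same exit. Tag format is illustrative only.
def session_tag(host: str, account: str = "default") -> str:
    digest = hashlib.sha256(f"{host}:{account}".encode()).hexdigest()[:12]
    return f"sticky-{digest}"
```

Because the tag is a pure function of the source, it needs no state store, and a RAG re-ingest 18 months later produces byte-identical routing without anyone remembering to configure it.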

Where to start

If you're standing up a collection pipeline from scratch, the minimum viable setup is datacenter default + residential on geoblock/eval + ISP on auth. You don't need mobile until you've verified that a specific target classifies cellular traffic differently.

For a cost estimate against your actual workload, the pricing page shows tiers sized around common splits — 50 GB residential + unlimited datacenter covers most single-researcher evaluation work, and 500 GB residential + unlimited datacenter covers most AI startup training + eval splits. The split is real: bandwidth on datacenter is cheap enough that we don't meter it; residential is where the volume metering lives because that's where the real unit cost sits.

The US country page has the ASN breakdown for the largest pool we operate. For any other country in our network, switch the URL — the shape is the same, the ASN list is what changes.

Ship on a proxy network you can actually call your ops team about

Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.