Proxies for RAG pipelines that stay consistent across ingestion
Datacenter throughput for open sources, residential authenticity where the source geoblocks cloud ASNs, ISP persistence where the source needs a stable session. Chosen per-source by your pipeline, unified at one gateway.
The RAG ingestion problem most pipelines underestimate
A RAG index is only as useful as its corpus is consistent. The failure modes we see in production:
1. Source-class drift
Most RAG pipelines start with a handful of sources and a flat scraping config. As the index grows past a few thousand sources, the sources split into three classes whether you notice or not: open-cloud-friendly (GitHub, arXiv, Wikimedia), cloud-filtered (regional press, enterprise wikis, some gov), and authentication-required (partner APIs, Slack/Notion exports, internal KBs). Running one proxy class against all three produces silent gaps — the cloud-filtered sources return empty or degraded content and your index just quietly has holes.
2. Canonical and content-type inconsistency
The same URL serves different content depending on Accept-Language, Accept-Encoding, mobile User-Agent, and sometimes the referring ASN. A RAG corpus scraped from mixed origins ends up storing the same document in three canonical variants, which embed as close neighbours and surface as duplicates at retrieval time. Running collection from a consistent origin (one region, one ASN class) per source stabilises the canonical.
3. Version drift
A RAG index that doesn't pin a source version treats 2023 content and 2026 content as equivalent neighbours in embedding space. For indexes that need to answer "what was true at time T," the collection pipeline has to capture and store the fetch timestamp and revisit policy at the source level. Proxies don't solve this for you, but a consistent origin makes the If-Modified-Since behaviour more predictable.
How to route RAG sources through SquadProxy
A typical pipeline configuration looks like:
- Datacenter (default) — US East edge for North-American sources, Frankfurt for European, Tokyo for APAC. 80% of RAG source volume by document count.
- Residential — for regional press, local government, and enterprise knowledge bases that geoblock cloud ASNs. ~15% of sources by count but disproportionate value for regional completeness.
- ISP — for sources that require login and maintain session-cookied rate limits. ~5% of sources.
Dedup at ingest time
Embed-time dedup is expensive and gets brittle at scale. Cheap wins at scrape time:
- Canonicalise URLs before you enqueue (strip tracking params, normalise trailing slashes, follow link rel="canonical").
- Hash document content (sha256 of the extracted text, not the HTML) and skip if seen. Store the hash-to-URL mapping for retrieval-time citation.
- MinHash-LSH across the top-k near-duplicates. For a corpus at the scale of tens of millions of documents this adds minutes, not hours, if you batch it.
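The canonicalise-and-hash steps can be sketched in a few lines. The tracking-param list and the in-memory seen-set are illustrative stand-ins for whatever store the pipeline actually uses:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative subset of tracking params worth stripping before enqueueing.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalise(url: str) -> str:
    """Strip tracking params and normalise the trailing slash."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))

# content hash -> first canonical URL, kept for retrieval-time citation
seen: dict[str, str] = {}

def should_ingest(url: str, extracted_text: str) -> bool:
    """Hash the extracted text (not the HTML) and skip exact duplicates."""
    digest = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen[digest] = canonicalise(url)
    return True
```

In a real pipeline the seen-set lives in the metadata store, not process memory, so re-scrapes and parallel workers share it.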
Integration
```python
import httpx

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"

def fetch(url: str, source_class: str) -> httpx.Response:
    headers = {
        "X-Squad-Class": source_class,  # datacenter | residential | isp
        "X-Squad-Country": "us",
        "X-Squad-Session": "per-request" if source_class == "datacenter" else "sticky-10m",
    }
    # proxy and http2 are Client options in httpx, not httpx.get() parameters;
    # http2=True also requires the httpx[http2] extra to be installed.
    with httpx.Client(proxy=PROXY, http2=True, timeout=30) as client:
        return client.get(url, headers=headers)
```
The X-Squad-Class header is the gateway hint. One endpoint, three
exit classes, your pipeline decides per source. That decision is the
entire value SquadProxy adds to the RAG stack.
Source-class routing in practice
A RAG ingestion pipeline that crosses the 10,000-sources mark sees the source catalogue stratify into a small number of classes, each with a stable right-exit choice. The routing table below is what a production setup we worked on in Q1 2026 settled on, and it has held for the twelve months since. The shape is more stable than most teams expect.
| Source class | Example targets | Exit class | Rotation | Why |
|---|---|---|---|---|
| Open archives | arXiv, GitHub, Wikimedia, Common Crawl | Datacenter | Per-request | Tolerates cloud ASN; rate limits on aggregate, not per-IP |
| Public datasets | Hugging Face, Kaggle, data.gov | ISP | Sticky-10m per dataset | HF rate-limits per-IP; sticky preserves LFS resume |
| Regional news | Le Monde, SCMP, Clarín, NYT | Residential | Per-request | Geoblocks cloud; anchor to target country |
| Regional government | .gouv, .gob, .go.jp | Residential | Sticky-10m | Some require cookie state for browse; geoblocks cloud |
| Enterprise wikis | Notion public, Confluence public | ISP | Sticky-60m | Session state across paginated browse |
| Auth'd SaaS sources | Notion exports, Slack archives | ISP | Sticky-60m | Login state must survive |
| Model hubs | HF Spaces, Gradio public | Datacenter | Per-request | Tolerates cloud; rate limits on aggregate |
| Partner APIs | Customer-specific, pre-contracted | Datacenter or ISP | Varies | Routing depends on the partner's setup |
The route assignments compile to a ~30-line mapping in the ingestion code. The rest of the pipeline doesn't care which class was used, as long as the same class was used consistently for a given source across re-scrapes.
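A mapping of that shape, sketched under the header convention from the integration snippet (X-Squad-Class and X-Squad-Session are the gateway hints; the source-class keys are illustrative, and the partner-API class is omitted because its routing varies per contract):

```python
# source class -> (exit class, rotation), compiled from the routing table
ROUTES: dict[str, tuple[str, str]] = {
    "open_archive":    ("datacenter",  "per-request"),
    "public_dataset":  ("isp",         "sticky-10m"),
    "regional_news":   ("residential", "per-request"),
    "regional_gov":    ("residential", "sticky-10m"),
    "enterprise_wiki": ("isp",         "sticky-60m"),
    "authd_saas":      ("isp",         "sticky-60m"),
    "model_hub":       ("datacenter",  "per-request"),
}

def route_headers(source_class: str) -> dict[str, str]:
    """Resolve a source class to gateway routing headers; unknown
    classes fall back to the datacenter default."""
    exit_class, session = ROUTES.get(source_class, ("datacenter", "per-request"))
    return {"X-Squad-Class": exit_class, "X-Squad-Session": session}
```

The point of compiling the table into one lookup is that re-scrapes of a source always resolve to the same exit class, which is what keeps the canonical stable.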
For the underlying reasoning on per-class tradeoffs, see the residential vs datacenter routing guide and the class-specific pages: residential, ISP, datacenter.
SquadProxy versioning for retrieval-over-time
RAG indexes used in production eventually face questions about document recency — "what did this source say in March?" The ingestion pipeline is the only place those timestamps can be captured cleanly.
A versioning layer we've seen work:
- Fetch-time stamp stored per document, alongside the content hash. One row per (source_url, fetch_timestamp) in the metadata store.
- Revisit policy defined per source class. Open archives can be cached for a week; regional news for a day; partner APIs on their documented cadence.
- Change detection via hash comparison on re-fetch. Only write a new row when the content hash differs. The pipeline ends up retaining ~3-10% of fetches as new versions; the rest are noops that confirm continuity.
- Retrieval-time filter exposed as "latest" (default) or "as-of date". Most retrieval workflows use "latest"; a research subset uses "as-of" for reproducibility of past eval runs.
The proxy layer doesn't solve this for you. What it does solve is the consistency of If-Modified-Since and ETag handling — with a consistent origin, conditional GETs work as intended and re-fetch cost stays low. With inconsistent origins, the target's cache layer returns 200 for the new origin even when the content hasn't changed, and your pipeline ingests spurious "new" versions. That's a noise problem that rotating origins make worse.
We wrote more on the infrastructure side of this in proxy infrastructure for RAG pipelines.
Vector DB choice interacts with the ingestion layer
Pinecone, Qdrant, Weaviate, and Milvus each impose different constraints on how the ingestion pipeline feeds them. The proxy layer doesn't change which one you pick, but it does change which failures surface where:
- Pinecone — hosted, scales serverless. Ingestion is upsert-heavy; the proxy layer's job is to deliver deduplicated, canonicalised content before embedding. Canonical collisions on Pinecone surface as near-neighbours in retrieval, which is confusing to debug.
- Qdrant — self-hosted options with bulk upsert. Proxy layer's job is the same, but at larger scale; Qdrant's bulk loader tolerates higher throughput, so the ingestion pipeline bottleneck moves upstream to the fetch layer.
- Weaviate — schema-first. Content-type inconsistency from mixed-origin scraping shows up as schema-property validation failures on ingest, which is useful — the failure is loud, not silent. But the pipeline has to handle the errors.
- Milvus — performance-first. Higher ingest throughput means the proxy layer has to sustain higher per-source throughput without degrading. ISP is the right class where HF or other rate-limited sources would otherwise bottleneck.
Header-based class routing on a single gateway (the SquadProxy shape) fits better with bulk-loaders in Qdrant and Milvus than per-class-endpoints would, because the ingestion worker doesn't have to reconfigure its proxy between dataset sources.
Common failure modes we've debugged
Silent content gaps on regional sources. The pipeline runs datacenter-default and hits 70% success on a regional news source. The team writes off the 30% as "source is flaky." Three months later an eval surfaces a regional competence gap traceable to the missing corpus content. Residential routing on that source class would have closed the gap at ingest.
Duplicate documents from rotated origins. Same URL fetched through three origins over six months, three near-duplicate embeddings in the index. Retrieval returns three low-similarity-but-not-identical results for a query that should return one. The fix is sticky-session ISP for canonical-sensitive sources, per-request rotation for the rest.
HF rate-limit on bulk dataset pulls. The ingestion pipeline pulls 10 datasets in parallel, each through per-request rotating residentials, and half timeout. The issue is residential being wrong for HF (see HF dataset guide) — sticky ISP per dataset is the right shape.
Vector DB schema drift from mixed origins. Documents from different origins return slightly different metadata (different Accept-Language default, different User-Agent-specific content) and the vector DB's schema validation rejects 2-3% of ingests. Fix: pin per-source origin and freeze the request shape.
Where to start
A minimal RAG pipeline on SquadProxy for a team onboarding in a week:
- Map your sources into the eight classes in the table above. Most pipelines fit without needing a new class.
- Configure one gateway endpoint with three active exit classes: datacenter, residential, ISP. Add mobile only if your workload specifically requires cellular.
- Ingest for one week, record per-source success rate.
- Re-classify any source with <90% success; move it to a different exit class and re-ingest.
- Pin the class map; treat it as a source-of-truth configuration for the life of the corpus.
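Steps 3 and 4 reduce to a small tally over the burn-in week; the threshold and function names here are illustrative:

```python
from collections import defaultdict

# source -> [successes, attempts], accumulated over the one-week burn-in
tallies: dict[str, list[int]] = defaultdict(lambda: [0, 0])

def record(source: str, ok: bool) -> None:
    """Count one fetch attempt against a source."""
    tallies[source][1] += 1
    if ok:
        tallies[source][0] += 1

def needs_reclassification(threshold: float = 0.90) -> list[str]:
    """Sources below the success threshold are candidates for a
    different exit class before the class map is pinned."""
    return sorted(
        source for source, (ok, total) in tallies.items()
        if total and ok / total < threshold
    )
```

Anything the function flags gets moved to a different exit class and re-ingested before the map is frozen.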
The Team plan covers this shape at under 500 GB of residential traffic per month of mixed ingestion. See pricing.
Further reading
- Residential vs datacenter routing matrix
- Proxies for Common Crawl
- Proxies for Hugging Face datasets
- Proxy infrastructure for RAG pipelines
- US country page for the North-American RAG source base
- Ethical residential proxies for AI research for provenance requirements on the upstream ingestion side
Pricing
Pricing for RAG data collection and indexing
Every plan carries every exit class — pick the one whose bandwidth envelope fits your workload.
Solo
For individual researchers running evaluation scripts and prototype RAG pipelines.
$149/month
or $1,430/year (save 20%)
50 GB residential · unlimited datacenter · 200 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 50 GB residential · unlimited datacenter
- ✓ 5 static ISP IPs · 5 GB 4G mobile
- ✓ 1 seat · 200 concurrent sessions
- ✓ Python + Node SDK + REST API
- ✓ Per-request metering (not time-based)
- ✓ Email support (24h response, business days)
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- Solo researchers
- Evaluation scripts
- Prototype RAG
Team
Most popular
For AI startups and mid-size labs splitting capacity between training and evaluation.
$699/month
or $6,710/year (save 20%)
500 GB residential · unlimited datacenter · 1,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 500 GB residential · unlimited datacenter
- ✓ 25 static ISP IPs · 25 GB 4G mobile
- ✓ 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
- ✓ City-level geo-routing + ASN targeting
- ✓ 99.9% uptime SLA
- ✓ Priority Slack support (4h response, business hours)
- ✓ Python + Node SDK + REST API + webhooks
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- AI startups
- Mid-size labs
- Model eval teams
Lab
For academic labs, eval consortia, and frontier model companies running sustained workloads.
$2,999/month
or $28,790/year (save 20%)
2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 countries on 4 continents
- ✓ 2 TB residential · unlimited datacenter
- ✓ 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
- ✓ 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
- ✓ Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
- ✓ 99.95% uptime SLA
- ✓ Dedicated Slack channel (1h response, business hours)
- ✓ Custom BGP prefix on request (additional fees apply)
- ✓ Overage: $2.50/GB residential · $5/GB mobile
Best for
- Academic labs
- Large eval consortia
- Frontier model companies
Enterprise
Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.
Custom pricing
Custom (from 5 TB/mo residential) · unlimited concurrent sessions
- ✓ Volume pricing from 5 TB/mo residential
- ✓ Dedicated BGP prefix + ASN announcement
- ✓ Unlimited concurrent sessions · unlimited seats
- ✓ 99.99% uptime SLA with financial credits
- ✓ Named Technical Account Manager + 24/7 on-call paging
- ✓ Custom AUP, DPA, on-site deployment option
- ✓ Research / academic discount (30–50% off Team or Lab)
- ✓ Annual contract · wire, ACH, USDC/USDT/BTC settlement
Best for
- Frontier labs
- Eval consortia
- Enterprise AI
All plans include 14-day refund, single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100%; traffic continues only if overage billing is enabled on your account.
Ship on a proxy network you can actually call your ops team about
Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.