Proxies for RAG pipelines that stay consistent across ingestion
Datacenter throughput for open sources, residential authenticity where the source geoblocks cloud ASNs, ISP persistence where the source needs a stable session. Chosen per-source by your pipeline, unified at one gateway.
The RAG ingestion problem most pipelines underestimate
A RAG index is only as useful as its corpus is consistent. The failure modes we see in production:
1. Source-class drift
Most RAG pipelines start with a handful of sources and a flat scraping config. As the index grows past a few thousand sources, the sources split into three classes whether you notice or not: open-cloud-friendly (GitHub, arXiv, Wikimedia), cloud-filtered (regional press, enterprise wikis, some gov), and authentication-required (partner APIs, Slack/Notion exports, internal KBs). Running one proxy class against all three produces silent gaps — the cloud-filtered sources return empty or degraded content and your index just quietly has holes.
2. Canonical and content-type inconsistency
The same URL serves different content depending on Accept-Language, Accept-Encoding, mobile User-Agent, and sometimes the referring ASN. A RAG corpus scraped from mixed origins ends up storing the same document in three canonical variants, which embed as close neighbours and surface as duplicates at retrieval time. Running collection from a consistent origin (one region, one ASN class) per source stabilises the canonical.
3. Version drift
A RAG index that doesn't pin a source version treats 2023 content and 2026 content as equivalent neighbours in embedding space. For indexes that need to answer "what was true at time T," the collection pipeline has to capture and store the fetch timestamp and revisit policy at the source level. Proxies don't solve this for you, but a consistent origin makes the If-Modified-Since behaviour more predictable.
How to route RAG sources through SquadProxy
A typical pipeline configuration looks like:
- Datacenter (default) — US East edge for North-American sources, Frankfurt for European, Tokyo for APAC. 80% of RAG source volume by document count.
- Residential — for regional press, local government, and enterprise knowledge bases that geoblock cloud ASNs. ~15% of sources by count but disproportionate value for regional completeness.
- ISP — for sources that require login and maintain session-cookied rate limits. ~5% of sources.
Dedup at ingest time
Embed-time dedup is expensive and gets brittle at scale. Cheap wins at scrape time:
- Canonicalise URLs before you enqueue (strip tracking params, normalise trailing slashes, follow link rel="canonical").
- Hash document content (sha256 of the extracted text, not the HTML) and skip if seen. Store the hash-to-URL mapping for retrieval-time citation.
- MinHash-LSH across the top-k near-duplicates. For a corpus at the scale of tens of millions of documents this adds minutes, not hours, if you batch it.
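The canonicalise-and-hash steps can be sketched in a few lines. The tracking-param list and the in-memory seen-set are illustrative stand-ins for whatever store the pipeline actually uses:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative subset of tracking params worth stripping before enqueueing.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalise(url: str) -> str:
    """Strip tracking params and normalise the trailing slash."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))

# content hash -> first canonical URL, kept for retrieval-time citation
seen: dict[str, str] = {}

def should_ingest(url: str, extracted_text: str) -> bool:
    """Hash the extracted text (not the HTML) and skip exact duplicates."""
    digest = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen[digest] = canonicalise(url)
    return True
```

In a real pipeline the seen-set lives in the metadata store, not process memory, so re-scrapes and parallel workers share it.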
Integration
```python
import httpx

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"

def fetch(url: str, source_class: str) -> httpx.Response:
    headers = {
        "X-Squad-Class": source_class,  # datacenter | residential | isp
        "X-Squad-Country": "us",
        "X-Squad-Session": "per-request" if source_class == "datacenter" else "sticky-10m",
    }
    # proxy and http2 are Client options in httpx, not httpx.get() parameters;
    # http2=True also requires the httpx[http2] extra to be installed.
    with httpx.Client(proxy=PROXY, http2=True, timeout=30) as client:
        return client.get(url, headers=headers)
```
The X-Squad-Class header is the gateway hint. One endpoint, three
exit classes, your pipeline decides per source. That decision is the
entire value SquadProxy adds to the RAG stack.
Source-class routing in practice
A RAG ingestion pipeline that crosses the 10,000-sources mark sees the source catalogue stratify into a small number of classes, each with a stable right-exit choice. The routing table below is what a production setup we worked on in Q1 2026 settled on, and it has held for the twelve months since. The shape is more stable than most teams expect.
| Source class | Example targets | Exit class | Rotation | Why |
|---|---|---|---|---|
| Open archives | arXiv, GitHub, Wikimedia, Common Crawl | Datacenter | Per-request | Tolerates cloud ASN; rate limits on aggregate, not per-IP |
| Public datasets | Hugging Face, Kaggle, data.gov | ISP | Sticky-10m per dataset | HF rate-limits per-IP; sticky preserves LFS resume |
| Regional news | Le Monde, SCMP, Clarín, NYT | Residential | Per-request | Geoblocks cloud; anchor to target country |
| Regional government | .gouv, .gob, .go.jp | Residential | Sticky-10m | Some require cookie state for browse; geoblocks cloud |
| Enterprise wikis | Notion public, Confluence public | ISP | Sticky-60m | Session state across paginated browse |
| Auth'd SaaS sources | Notion exports, Slack archives | ISP | Sticky-60m | Login state must survive |
| Model hubs | HF Spaces, Gradio public | Datacenter | Per-request | Tolerates cloud; rate limits on aggregate |
| Partner APIs | Customer-specific, pre-contracted | Datacenter or ISP | Varies | Routing depends on the partner's setup |
The route assignments compile to a ~30-line mapping in the ingestion code. The rest of the pipeline doesn't care which class was used, as long as the same class was used consistently for a given source across re-scrapes.
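A mapping of that shape, sketched under the header convention from the integration snippet (X-Squad-Class and X-Squad-Session are the gateway hints; the source-class keys are illustrative, and the partner-API class is omitted because its routing varies per contract):

```python
# source class -> (exit class, rotation), compiled from the routing table
ROUTES: dict[str, tuple[str, str]] = {
    "open_archive":    ("datacenter",  "per-request"),
    "public_dataset":  ("isp",         "sticky-10m"),
    "regional_news":   ("residential", "per-request"),
    "regional_gov":    ("residential", "sticky-10m"),
    "enterprise_wiki": ("isp",         "sticky-60m"),
    "authd_saas":      ("isp",         "sticky-60m"),
    "model_hub":       ("datacenter",  "per-request"),
}

def route_headers(source_class: str) -> dict[str, str]:
    """Resolve a source class to gateway routing headers; unknown
    classes fall back to the datacenter default."""
    exit_class, session = ROUTES.get(source_class, ("datacenter", "per-request"))
    return {"X-Squad-Class": exit_class, "X-Squad-Session": session}
```

The point of compiling the table into one lookup is that re-scrapes of a source always resolve to the same exit class, which is what keeps the canonical stable.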
For the underlying reasoning on per-class tradeoffs, see the residential vs datacenter routing guide and the class-specific pages: residential, ISP, datacenter.
SquadProxy versioning for retrieval-over-time
RAG indexes used in production eventually face questions about document recency — "what did this source say in March?" The ingestion pipeline is the only place those timestamps can be captured cleanly.
A versioning layer we've seen work:
- Fetch-time stamp stored per document, alongside the content hash. One row per (source_url, fetch_timestamp) in the metadata store.
- Revisit policy defined per source class. Open archives can be cached for a week; regional news for a day; partner APIs on their documented cadence.
- Change detection via hash comparison on re-fetch. Only write a new row when the content hash differs. The pipeline ends up retaining ~3-10% of fetches as new versions; the rest are noops that confirm continuity.
- Retrieval-time filter exposed as "latest" (default) or "as-of date". Most retrieval workflows use "latest"; a research subset uses "as-of" for reproducibility of past eval runs.
The proxy layer doesn't solve this for you. What it does solve is the consistency of If-Modified-Since and ETag handling — with a consistent origin, conditional GETs work as intended and re-fetch cost stays low. With inconsistent origins, the target's cache layer returns 200 for the new origin even when the content hasn't changed, and your pipeline ingests spurious "new" versions. That's a noise problem that rotating origins make worse.
We wrote more on the infrastructure side of this in proxy infrastructure for RAG pipelines.
Vector DB choice interacts with the ingestion layer
Pinecone, Qdrant, Weaviate, and Milvus each impose different constraints on how the ingestion pipeline feeds them. The proxy layer doesn't change which one you pick, but it does change which failures surface where:
- Pinecone — hosted, scales serverless. Ingestion is upsert-heavy; the proxy layer's job is to deliver deduplicated, canonicalised content before embedding. Canonical collisions on Pinecone surface as near-neighbours in retrieval, which is confusing to debug.
- Qdrant — self-hosted options with bulk upsert. Proxy layer's job is the same, but at larger scale; Qdrant's bulk loader tolerates higher throughput, so the ingestion pipeline bottleneck moves upstream to the fetch layer.
- Weaviate — schema-first. Content-type inconsistency from mixed-origin scraping shows up as schema-property validation failures on ingest, which is useful — the failure is loud, not silent. But the pipeline has to handle the errors.
- Milvus — performance-first. Higher ingest throughput means the proxy layer has to sustain higher per-source throughput without degrading. ISP is the right class where HF or other rate-limited sources would otherwise bottleneck.
Header-based class routing on a single gateway (the SquadProxy shape) fits better with bulk-loaders in Qdrant and Milvus than per-class-endpoints would, because the ingestion worker doesn't have to reconfigure its proxy between dataset sources.
Common failure modes we've debugged
Silent content gaps on regional sources. The pipeline runs datacenter-default and hits 70% success on a regional news source. The team writes off the 30% as "source is flaky." Three months later an eval surfaces a regional competence gap traceable to the missing corpus content. Residential routing on that source class would have closed the gap at ingest.
Duplicate documents from rotated origins. Same URL fetched through three origins over six months, three near-duplicate embeddings in the index. Retrieval returns three low-similarity-but-not-identical results for a query that should return one. The fix is sticky-session ISP for canonical-sensitive sources, per-request rotation for the rest.
HF rate-limit on bulk dataset pulls. The ingestion pipeline pulls 10 datasets in parallel, each through per-request rotating residentials, and half timeout. The issue is residential being wrong for HF (see HF dataset guide) — sticky ISP per dataset is the right shape.
Vector DB schema drift from mixed origins. Documents from different origins return slightly different metadata (different Accept-Language default, different User-Agent-specific content) and the vector DB's schema validation rejects 2-3% of ingests. Fix: pin per-source origin and freeze the request shape.
Where to start
A minimal RAG pipeline on SquadProxy for a team onboarding in a week:
- Map your sources into the eight classes in the table above. Most pipelines fit without needing a new class.
- Configure one gateway endpoint with three active exit classes: datacenter, residential, ISP. Add mobile only if your workload specifically requires cellular.
- Ingest for one week, record per-source success rate.
- Re-classify any source with <90% success; move it to a different exit class and re-ingest.
- Pin the class map; treat it as a source-of-truth configuration for the life of the corpus.
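Steps 3 and 4 reduce to a small tally over the burn-in week; the threshold and function names here are illustrative:

```python
from collections import defaultdict

# source -> [successes, attempts], accumulated over the one-week burn-in
tallies: dict[str, list[int]] = defaultdict(lambda: [0, 0])

def record(source: str, ok: bool) -> None:
    """Count one fetch attempt against a source."""
    tallies[source][1] += 1
    if ok:
        tallies[source][0] += 1

def needs_reclassification(threshold: float = 0.90) -> list[str]:
    """Sources below the success threshold are candidates for a
    different exit class before the class map is pinned."""
    return sorted(
        source for source, (ok, total) in tallies.items()
        if total and ok / total < threshold
    )
```

Anything the function flags gets moved to a different exit class and re-ingested before the map is frozen.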
The Team plan covers this shape at under 500 GB of residential traffic per month of mixed ingestion. See pricing.
Further reading
- Residential vs datacenter routing matrix
- Proxies for Common Crawl
- Proxies for Hugging Face datasets
- Proxy infrastructure for RAG pipelines
- US country page for the North-American RAG source base
- Ethical residential proxies for AI research for provenance requirements on the upstream ingestion side
Pricing
Pricing for RAG data collection and indexing
Every plan carries every exit class — pick the one whose bandwidth envelope fits your workload.
Solo
For individual researchers running evaluation scripts and prototype RAG pipelines.
$149/month
or $1,430/year (save 20%)
50 GB residential · unlimited datacenter · 200 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 50 GB residential · unlimited datacenter
- ✓ 5 static ISP IPs · 5 GB 4G mobile
- ✓ 1 seat · 200 concurrent sessions
- ✓ Python + Node SDK + REST API
- ✓ Per-request metering (not time-based)
- ✓ Email support (24h response, business days)
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- Solo researchers
- Evaluation scripts
- Prototype RAG
Team
Most popular
For AI startups and mid-size labs splitting capacity between training and evaluation.
$699/month
or $6,710/year (save 20%)
500 GB residential · unlimited datacenter · 1,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 500 GB residential · unlimited datacenter
- ✓ 25 static ISP IPs · 25 GB 4G mobile
- ✓ 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
- ✓ City-level geo-routing + ASN targeting
- ✓ 99.9% uptime SLA
- ✓ Priority Slack support (4h response, business hours)
- ✓ Python + Node SDK + REST API + webhooks
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- AI startups
- Mid-size labs
- Model eval teams
Lab
For academic labs, eval consortia, and frontier model companies running sustained workloads.
$2,999/month
or $28,790/year (save 20%)
2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 countries on 4 continents
- ✓ 2 TB residential · unlimited datacenter
- ✓ 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
- ✓ 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
- ✓ Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
- ✓ 99.95% uptime SLA
- ✓ Dedicated Slack channel (1h response, business hours)
- ✓ Custom BGP prefix on request (additional fees apply)
- ✓ Overage: $2.50/GB residential · $5/GB mobile
Best for
- Academic labs
- Large eval consortia
- Frontier model companies
Enterprise
Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.
Custom pricing
Custom (from 5 TB/mo residential) · unlimited concurrent sessions
- ✓ Volume pricing from 5 TB/mo residential
- ✓ Dedicated BGP prefix + ASN announcement
- ✓ Unlimited concurrent sessions · unlimited seats
- ✓ 99.99% uptime SLA with financial credits
- ✓ Named Technical Account Manager + 24/7 on-call paging
- ✓ Custom AUP, DPA, on-site deployment option
- ✓ Research / academic discount (30–50% off Team or Lab)
- ✓ Annual contract · wire, ACH, USDC/USDT/BTC settlement
Best for
- Frontier labs
- Eval consortia
- Enterprise AI
All plans include 14-day refund, single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100%; traffic continues only if overage billing is enabled on your account.
Ship on a proxy network you can actually call your ops team about
Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.