Choosing a proxy for LLM training data collection: criteria that actually matter
Listicles ranking "best proxies for AI" miss the criteria that AI engineers weigh in practice. An honest breakdown of the tradeoffs — pool provenance, ASN diversity, bandwidth economics, concurrency ceilings, and legal footprint — for teams collecting LLM training data at scale.
· Nathan Brecher · 6 min read
Most published comparisons of "the best proxies for AI" are affiliate listicles that rank providers on an unweighted mix of "success rate" and "country coverage" — metrics that optimise for the listicle author's conversion rate, not for the AI engineer's actual workload. This post lays out the criteria we'd use if we were evaluating proxy vendors for a training-data pipeline. SquadProxy operates in this space, so the framing is not neutral; the criteria are.
1. Pool provenance (the one that actually matters in 2026)
For AI training data specifically, provenance is the lead criterion. A model trained on a corpus collected through a proxy pool whose IPs were obtained via bundleware, SDK dark patterns, or children's-device compromises inherits a provenance problem that surfaces as:
- Publication-blocking at review time (ML venues increasingly ask for provenance documentation for the training corpus)
- Legal exposure under consumer-privacy regimes (CCPA/CPRA, GDPR) when the data-subject layer in the chain didn't meaningfully consent
- Reputational exposure when a disclosure about the proxy vendor's sourcing practices pulls the customer roster into the story
Relevant questions to ask a vendor:
- How are peers onboarded? (Opt-in SDK integration with informed consent, or pre-installed on low-trust apps?)
- What do peers get in exchange? (Value — features, rewards, ad-free tiers — or nothing?)
- Can peers leave? How quickly?
- Has the pool appeared in security research on residential proxy misuse?
The honest answer in 2026 is that only a handful of vendors have clean provenance here. Our residential pool documentation describes our approach (opt-in SDK, value-for-bandwidth, no stealth). Other vendors have published their own versions; the quality of the documentation is a useful signal in itself.
2. ASN diversity (not just pool size)
Headline pool-size numbers ("150M IPs!") are effectively marketing. The number that matters is ASN diversity within your target countries. A pool with 10M residential IPs concentrated across three US carriers is often more useful for AI workloads than a pool with 40M IPs spread across 200 smaller providers, because the three-carrier pool covers the ASNs that matter for regional content access while the long-tail pool has coverage gaps on the carriers that anchor regional residential traffic.
For a US workload, the useful ASN list starts with Comcast (AS7922), Charter Spectrum (AS20115), AT&T (AS7018), Verizon FiOS (AS701), Cox Communications (AS22773). A pool that covers those five at sustained depth is more useful than a pool with triple the aggregate IPs across a different distribution. The US country page lists the ASNs we actively scale against.
Ask vendors for per-country ASN coverage, not pool size.
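One way to operationalise that ask during a trial: sample exit IPs from the pool, resolve each to an ASN, and check that the anchor carriers hold real share rather than token presence. A minimal sketch with hardcoded illustrative counts — in practice you would resolve each exit IP against a routing-table dump or an IP-to-ASN service:

```python
from collections import Counter

# Hypothetical sample: ASNs observed across 1,000 exit IPs pulled from a
# vendor's US residential pool (counts are illustrative, not measured).
observed_asns = (
    ["AS7922"] * 410    # Comcast
    + ["AS20115"] * 240  # Charter Spectrum
    + ["AS7018"] * 180   # AT&T
    + ["AS701"] * 90     # Verizon
    + ["AS22773"] * 30   # Cox
    + ["AS398101"] * 50  # long tail
)

TARGET_ASNS = {"AS7922", "AS20115", "AS7018", "AS701", "AS22773"}

def asn_coverage(asns, targets, min_share=0.02):
    """Return each ASN's share of the sample, plus which target ASNs
    clear `min_share` — a rough proxy for 'sustained depth'."""
    counts = Counter(asns)
    total = len(asns)
    shares = {asn: n / total for asn, n in counts.items()}
    covered = {a for a in targets if shares.get(a, 0.0) >= min_share}
    return shares, covered

shares, covered = asn_coverage(observed_asns, TARGET_ASNS)
```

The `min_share` threshold is a judgment call; the point is to measure distribution, not to accept an aggregate IP count.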
3. Bandwidth economics (where the bill comes from)
Residential pricing ranges from $0.49/GB (budget) to $8-10/GB (premium). For AI training data, the realistic mid-tier is $2-4/GB. At 1 TB of residential pulls per month you're paying $2,000-4,000; at 10 TB you're paying $20,000-40,000.
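The arithmetic above is worth parameterising rather than re-deriving on every vendor call. A trivial projection helper, assuming decimal metering (1 TB = 1,000 GB — check whether a given vendor meters in GB or GiB):

```python
def monthly_cost_usd(tb_per_month: float, usd_per_gb: float) -> float:
    """Project a metered residential bill, assuming decimal units."""
    return tb_per_month * 1_000 * usd_per_gb

# Mid-tier residential at $2-4/GB, matching the ranges in the text:
low  = monthly_cost_usd(1, 2.0)   # 1 TB/month at $2/GB -> $2,000
high = monthly_cost_usd(10, 4.0)  # 10 TB/month at $4/GB -> $40,000
```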
The right-shape plan for an AI workload is:
- Residential: metered, priced per GB (because you're measuring what you actually pull)
- Datacenter: unlimited (because a metered datacenter plan at AI scale makes no sense; datacenter bandwidth is near-zero cost)
- ISP: per-IP allocation (the economics are IP-quality, not bandwidth)
- Mobile: metered, higher per-GB (carrier SIM bandwidth is expensive)
A vendor who charges residential-style pricing on their datacenter pool is mispricing against your workload. A vendor who charges unlimited-style pricing on their residential pool and then caps concurrency aggressively is pricing around the honest unit cost. Both are signals. Our pricing page maps plans against these four shapes explicitly.
4. Concurrency ceiling
For a training-data pipeline, concurrent connection ceiling is often the binding constraint, not bandwidth. A 200-concurrent plan is fine for evaluation work but stalls a training-corpus collector that needs to pull from a few thousand hosts in parallel.
Realistic ceilings for AI workloads:
- Solo research / evaluation scripts: 200 concurrent is comfortable
- Small-team training pipelines: 1,000 concurrent, with bursting headroom
- Large-team / continuous ingestion: 3,000+ concurrent
Higher concurrency only makes sense when paired with per-source throttling in your pipeline: unrestricted concurrency will melt targets you respect and get your traffic rate-limited downstream anyway.
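A sketch of that pairing in asyncio: one semaphore for the plan-wide ceiling, one per target host so no single source absorbs the whole pipeline. The limits and the request function are stand-ins (swap in aiohttp/httpx for real pulls):

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

async def run_pipeline(urls, do_request, plan_concurrency=1_000, per_host=8):
    """Per-source throttling: a global semaphore for the plan's
    concurrency ceiling, a lazily created semaphore per target host.
    Limits here are illustrative, not vendor-specified."""
    plan_slots = asyncio.Semaphore(plan_concurrency)
    host_slots = defaultdict(lambda: asyncio.Semaphore(per_host))

    async def fetch(url):
        host = urlsplit(url).hostname
        async with plan_slots, host_slots[host]:
            return await do_request(url)

    return await asyncio.gather(*(fetch(u) for u in urls))

# Demo with a stand-in request function (no network I/O).
async def fake_request(url):
    await asyncio.sleep(0.001)
    return url

urls = [f"https://example{i % 3}.test/page/{i}" for i in range(30)]
results = asyncio.run(run_pipeline(urls, fake_request))
```

The per-host cap is the number you tune per target; the plan-wide semaphore just keeps you under the vendor's ceiling.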
5. Protocol and session controls
The non-negotiables for AI workloads:
- HTTP, HTTPS, SOCKS5 on all exit classes
- Sticky sessions from 1 to 60 minutes, configurable per request
- City-level geo-targeting on the residential pool (AI evals that test regional bias don't care about country-level routing; they care about metro-level)
- ASN targeting on residential (for ASN-diversity requirements in training corpora)
- Header-based class switching (one endpoint, routing decision in your pipeline via a header like X-Squad-Class, not multiple gateway URLs)
Vendors that require different endpoints for different exit classes make the per-source routing architecture described in residential vs datacenter for AI workloads materially harder to implement. Header-based class switching is a pipeline-level cost reducer.
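A sketch of what that looks like from the pipeline side, assuming a single hypothetical gateway URL and the X-Squad-Class header named above (vendor specifics will differ):

```python
# Placeholder gateway; a real deployment would pull this from config.
GATEWAY = "http://user:pass@gateway.example:8080"

def route(exit_class: str) -> dict:
    """Build per-request routing kwargs for requests.get/httpx.get:
    one gateway for every exit class, class chosen by a header."""
    if exit_class not in {"residential", "datacenter", "isp", "mobile"}:
        raise ValueError(f"unknown exit class: {exit_class}")
    return {
        "proxies": {"http": GATEWAY, "https": GATEWAY},
        "headers": {"X-Squad-Class": exit_class},
        "timeout": 30,
    }

# Usage (network call elided):
# requests.get("https://target.example/page", **route("residential"))
# requests.get("https://target.example/sitemap.xml", **route("datacenter"))
```

The routing decision lives in pipeline code, not in endpoint configuration — which is exactly the property that makes per-source class selection cheap to change.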
6. Legal footprint and AUP
For AI training data collection specifically, the useful AUP clauses are:
- Explicit scoping around "publicly available" content (post-hiQ v. LinkedIn, what actually counts as public)
- Explicit prohibitions on credential-stuffing, scalping, and inventory-hoarding workloads (these raise the liability for all vendor customers, including yours)
- Explicit provisions around children-adjacent data (COPPA, state laws)
- Documented response path for DMCA / takedown notices that sometimes route through the proxy provider
A vendor whose AUP is three paragraphs of marketing language isn't offering real liability protection. Our AUP is the working version of what a serious AUP looks like.
7. Support path for research engineers
The practical test: can you email the vendor with "we're building a training-data pipeline, here's our architecture, what would you suggest?" and get back a technical answer from someone who has run pipelines?
Not: "thanks for reaching out, here's our pricing page."
This matters less than the first six but separates vendors who understand the use case from those who are selling a generic proxy product and calling it "AI-ready."
Putting it together
A working procurement shortlist for an AI team collecting training data at TB scale in 2026:
- Provenance that survives due diligence (written, not just marketing)
- ASN depth on the carriers that matter for your target countries
- Plan shape that matches workload shape (metered residential, unlimited datacenter)
- Concurrency ceiling 3-5x your peak pipeline requirement
- Header-based class switching on a single gateway
- AUP specific to AI data collection cases
- Engineer-level support path
If you're evaluating SquadProxy specifically against the above: see our use cases for the workload framing, pricing for the plan shape, and contact for the engineer conversation. If we're the wrong fit for your workload, we'll say so — being on a shortlist with the wrong vendor is more expensive than being off the shortlist.
Further reading
- Residential vs datacenter for AI workloads — the routing matrix behind criterion #5
- RAG data collection use case — procurement framing for RAG-focused teams
- LLM evaluation use case — procurement framing for evaluation-focused teams