ethics · provenance · training-data

Ethical residential proxies for AI research: why provenance is a methodology concern

If your training-data corpus passes through a residential proxy pool whose peers didn't meaningfully consent, that consent gap is now your provenance problem. A practical framing of why proxy-pool provenance matters for AI research specifically, and what to ask a vendor.

Reeya Patel · 6 min read

Research engineering teams publishing LLM work in 2026 face a question that didn't exist two years ago: can you document the provenance of the traffic path that collected your training corpus? The question is increasingly asked at review, in internal compliance processes, and by legal teams before a public release. "We used a commercial proxy provider" is no longer a sufficient answer for research that needs to stand up to scrutiny.

This post is about why, and what honest provenance looks like for a residential proxy pool used in AI data collection.

The chain of consent

Every residential proxy request routes through the following chain of consent:

  1. The proxy operator (the vendor) contracts with an application developer or consumer-app publisher who integrates the proxy SDK into their app.
  2. The app publisher exposes the SDK to end users — hopefully with informed consent, a clear opt-in, and a disclosed value exchange (SDK features, rewards, premium tier access).
  3. The end user agrees to share idle-device bandwidth in exchange for the disclosed value.
  4. The researcher (you) routes a request through that consenting end user's device via the proxy operator.

Each link in this chain has historically been a place where consent can degrade. The pattern that's produced every recent incident in residential-proxy security research is the same: bundleware installers, stealth SDKs embedded in "free VPN" apps, pre-installed device-farm compromises, or SDK integrations that bury the disclosure in a EULA paragraph 12 screens in.

For AI research publications, the problem is that your corpus provenance is only as clean as the weakest link in the chain. If your proxy vendor's SDK shipped in an app where users didn't meaningfully agree to resell their bandwidth, the data-subject consent isn't there — and that's a provenance problem for your corpus even though your own processing was clean.

Why this matters for AI research specifically

Three reasons the AI-research case is more acute than general web-scraping:

The publication bar has moved

Recent ML venues (NeurIPS, ICLR, ACL) have added or strengthened data documentation requirements. Datasheets for datasets, data statements, and model cards expect documented source chains. "Scraped via a commercial proxy" is increasingly insufficient — reviewers ask what the proxy pool's consent posture is.

Public disclosures compound

When a proxy vendor's sourcing practices become public — via security research, journalism, or a regulatory action — the disclosure pulls in every customer who used that vendor. A research team that published three papers using the vendor now has three papers whose provenance footnotes no longer say what the team thought they said.

The regulatory surface is widening

GDPR (in Europe), CCPA/CPRA (California), the EU AI Act (for AI system operators), and the growing list of state privacy regimes in the US all have consent provisions that reach through intermediaries. Compliance teams increasingly treat the proxy pool as a data processor, under contract terms that require data-subject consent documentation.

What clean provenance looks like

The shape of a residential proxy pool whose provenance survives scrutiny:

  1. Opt-in integration. SDK onboarding happens in an app where users explicitly agree to the bandwidth-sharing arrangement. The arrangement is readable in plain language; it's not buried in paragraph 47 of the EULA.
  2. Value disclosure. Users know what they get back — premium features, reward credits, ad removal, whatever. "Nothing" is not a consent regime; it's a theft regime with fewer consequences for the operator.
  3. Reversibility. Users can leave the program at any time; the SDK stops routing when they do.
  4. Exclusion zones. The pool does not include children's devices (COPPA-relevant), devices in categories where consent is meaningfully impaired (prison-tablet app deployments, for example), or devices in jurisdictions where the legal regime doesn't permit the arrangement.
  5. Auditable usage logs. When a request routes through a peer, the metadata (timestamp, destination category) is retained in a form that supports after-the-fact inquiry.
  6. Published policy, not just marketing. The vendor has a document that describes the above that you can cite in your own research's data statement.

Our residential pool documentation describes our specific approach; the shape above is the general version.

Questions to ask a vendor

If you're procuring a proxy network for AI research specifically, the questions that separate clean vendors from unclean ones:

  • Where do your residential peers come from? (Answer should name at least one SDK integration you can verify exists as an opt-in experience.)
  • Can I see the peer-side opt-in flow? (Vendor should show you the UI a peer sees.)
  • How do peers leave? (Answer should describe a one-click opt-out mechanism.)
  • Has the pool appeared in published security research? (If yes, what changed in response?)
  • Do you exclude specific device categories? (Expected answer: yes — children's devices, specific low-consent deployments.)
  • What do you retain in the peer-activity logs? (Answer should be operationally useful, not "nothing" — that makes after-the-fact inquiry impossible.)

A vendor who can't answer these, or whose answers don't survive five minutes of follow-up, is not suitable for publication-grade research.

Documenting provenance in your own work

For an AI research publication that used a commercial proxy network:

  • Name the vendor
  • Name the exit class (residential, ISP, datacenter, etc.)
  • Cite the vendor's published sourcing policy by URL and access date
  • Describe the routing pattern (per-source class mapping, etc.)
  • Note the countries touched
  • Note whether the pool policy excludes categories relevant to your research

This level of documentation takes a few paragraphs and materially reduces the chance that a future vendor disclosure contaminates your publication's provenance story.
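The bullet points above can also be kept as a small structured record alongside the dataset itself, so the same fields feed both the paper's data statement and internal compliance review. A minimal sketch in Python — the field names, vendor name, and URL here are illustrative placeholders, not a standard schema:

```python
from dataclasses import dataclass, asdict, field

@dataclass
class ProxyProvenance:
    """Hypothetical provenance record mirroring the checklist above."""
    vendor: str
    exit_class: str                  # "residential", "isp", "datacenter", etc.
    sourcing_policy_url: str
    policy_access_date: str          # ISO date the policy was retrieved
    routing_pattern: str             # e.g. per-source-class mapping
    countries_touched: list = field(default_factory=list)
    excluded_categories: list = field(default_factory=list)

record = ProxyProvenance(
    vendor="ExampleProxyCo",         # placeholder, not a real vendor
    exit_class="residential",
    sourcing_policy_url="https://example.com/sourcing-policy",
    policy_access_date="2026-01-15",
    routing_pattern="residential for consumer-surface sources only",
    countries_touched=["US", "DE", "JP"],
    excluded_categories=["children's devices", "low-consent deployments"],
)

# A plain dict serializes cleanly into a datasheet appendix or JSON sidecar.
print(asdict(record))
```

Keeping the record machine-readable means a future vendor disclosure can be checked against every published corpus in minutes rather than by rereading papers.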

The economics

Clean-provenance residential pools cost more per GB than unclean ones. The price ratio is typically 1.5-2x: clean pools price at $2-4/GB, unclean ones at $1-2/GB. For AI research work this is usually the right trade; a corpus you can stand behind is worth paying a premium for, especially when the corpus represents multiple person-months of engineering work upstream and downstream.
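As a rough sizing sketch of that premium, using the midpoints of the per-GB ranges quoted above (the volume figure is purely illustrative):

```python
# Midpoints of the per-GB ranges quoted above; actual pricing
# varies by vendor and volume.
CLEAN_RATE = 3.0    # $/GB, midpoint of $2-4
UNCLEAN_RATE = 1.5  # $/GB, midpoint of $1-2

def corpus_cost(volume_gb: float, rate_per_gb: float) -> float:
    """Collection cost for a given volume at a given per-GB rate."""
    return volume_gb * rate_per_gb

volume = 500  # GB routed through residential (illustrative)
clean = corpus_cost(volume, CLEAN_RATE)
unclean = corpus_cost(volume, UNCLEAN_RATE)
premium = clean - unclean

print(f"clean ${clean:.0f} vs unclean ${unclean:.0f}; premium ${premium:.0f}")
```

At these numbers the clean-pool premium is a few hundred dollars on a corpus representing person-months of engineering — small relative to the cost of a contaminated provenance story.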

Budget for the price premium explicitly at procurement time; don't let a low-bid residential pool sneak past compliance by looking only at per-GB numbers. Our pricing page sizes plans that assume clean-pool economics.

When datacenter is the honest answer

Not every AI workload actually needs residential. If your training corpus is primarily open-web content where datacenter origin works fine, running the entire collection through datacenter is the ethically simplest answer — the consent question doesn't arise because the IP doesn't route through an unrelated third party's device. The routing matrix in residential vs datacenter for AI workloads is relevant here: for ~80% of training-corpus volume, datacenter is correct and residential is unnecessary.

Use residential where you actually need the residential property. Don't use it where datacenter would work. The net result is a cleaner provenance posture and a smaller bill.
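The "residential only where needed" rule can be sketched as a simple routing decision, defaulting to datacenter. The source-category names below are hypothetical; the point is the default, not the taxonomy:

```python
# Map each source category to the cheapest exit class that works for it.
# The ~80% datacenter split from the post falls out of defaulting here;
# the specific category names are illustrative.
ROUTING = {
    "open_web": "datacenter",           # bulk of training-corpus volume
    "docs_and_apis": "datacenter",
    "consumer_surface": "residential",  # genuinely needs residential egress
}

def exit_class_for(source_category: str) -> str:
    # Default to datacenter: the peer-consent question doesn't arise there.
    return ROUTING.get(source_category, "datacenter")

print(exit_class_for("open_web"))          # datacenter
print(exit_class_for("consumer_surface"))  # residential
print(exit_class_for("unknown_source"))    # datacenter (safe default)
```

Making the default explicit in code means a new source class has to argue its way *into* the residential pool, which is the right direction for both provenance and cost.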
