Writing
Notes from the network
Long-form pieces on proxy selection, bot detection, scraping tradecraft, and the parts of the internet we spend our days routing traffic across.
6 min read · Nathan Brecher
Proxies for AI browser agents: the 2026 workload shape
AI browser agents — Claude Computer Use, ChatGPT Operator, Gemini-based web agents, and the long tail of open-source browser agents — hit the web from a new angle. The proxy layer below them has specific requirements that general-purpose proxy products don't meet. Here's what matters.
- ai-agents
- browser-automation
- proxy-strategy
4 min read · Nathan Brecher
Proxies for ChatGPT Operator: browser-agent configuration that works
ChatGPT Operator runs browser-based task execution for end users. Operator-style agents (Operator itself, open-source clones, custom GPTs with browsing) all share a proxy configuration shape. A working reference.
- ai-agents
- chatgpt-operator
- openai
4 min read · Nathan Brecher
Proxies for Claude Computer Use: session patterns and exit-class choice
Claude Computer Use operates browsers at screen-pixel level. The proxy layer below it needs specific configuration to keep agent sessions coherent and avoid anti-bot challenges. A working guide for production Computer Use deployments.
- ai-agents
- claude-computer-use
- anthropic
6 min read · Nathan Brecher
Choosing a proxy for LLM training data collection: criteria that actually matter
Listicles ranking "best proxies for AI" miss the criteria that AI engineers weigh in practice. An honest breakdown of the tradeoffs — pool provenance, ASN diversity, bandwidth economics, concurrency ceilings, and legal footprint — for teams collecting LLM training data at scale.
- proxy-strategy
- training-data
- buyer-guide
6 min read · Reeya Patel
Ethical residential proxies for AI research: why provenance is a methodology concern
If your training-data corpus passes through a residential proxy pool whose peers didn't meaningfully consent, their consent problem is now your provenance problem. A practical framing of why proxy-pool provenance matters for AI research specifically, and what to ask a vendor.
- ethics
- provenance
- training-data
6 min read · Nathan Brecher
Proxies for arXiv bulk download: OAI-PMH, S3, and the API — which needs which
arXiv publishes three access paths with different rate-limit behaviour, and only one of them benefits from proxies. A practical breakdown of when to use OAI-PMH metadata harvests, the S3 PDF mirror, and the arXiv API — and where proxies fit.
- arxiv
- training-data
- research
6 min read · Nathan Brecher
Proxies for Common Crawl: when you need one, when you don't, and how to route
Common Crawl publishes ~250 TB of web content per monthly snapshot and makes most of it freely accessible from S3. Proxies still have a role — but a narrower one than most scraping guides suggest. A working engineer's breakdown.
- common-crawl
- training-data
- infrastructure
6 min read · Nathan Brecher
Proxies for Hugging Face dataset downloads: when HF_HUB_DOWNLOAD_TIMEOUT won't save you
Hugging Face rate-limits aggressively per-IP. Raising the timeout doesn't help; the server has already decided. A practical guide to routing HF bulk pulls through a proxy layer without breaking LFS resumption.
- huggingface
- training-data
- infrastructure
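A taste of the configuration shape the post covers — a minimal sketch, assuming a hypothetical proxy gateway URL. `huggingface_hub` downloads ride on `requests`, which honours the standard proxy environment variables, so routing bulk pulls is mostly a matter of setting them before the first download call:

```python
import os

# Hypothetical proxy endpoint — swap in your own gateway.
PROXY_URL = "http://user:pass@proxy.example.com:8080"


def configure_hf_proxy(proxy_url: str = PROXY_URL) -> None:
    """Route Hugging Face Hub traffic through a proxy gateway."""
    os.environ["HTTPS_PROXY"] = proxy_url
    os.environ["HTTP_PROXY"] = proxy_url
    # Raising the timeout alone won't dodge per-IP rate limits,
    # but it stops slow proxy hops from aborting large transfers.
    os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "30"


def pull_dataset(repo_id: str) -> str:
    from huggingface_hub import snapshot_download

    configure_hf_proxy()
    # snapshot_download resumes partial files, so a dropped proxy
    # connection doesn't restart large LFS objects from zero.
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        max_workers=4,  # modest concurrency: don't hammer one exit IP
    )
```

The full post covers the part this sketch elides: rotating exit IPs per worker without invalidating resumable downloads.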
6 min read · Reeya Patel
Proxies as methodology for multilingual LLM benchmarks
Multilingual LLM evaluation that uses only US-cloud-origin requests under-reports regional content policy and geo-dependent response divergence. A proxy layer anchored to each benchmark language's primary country is methodology, not infrastructure.
- llm-evaluation
- benchmarking
- multilingual
8 min read · Reeya Patel
Residential vs datacenter proxy for AI workloads: a routing matrix
Most AI teams over-index on residential proxies and pay too much for coverage they don't need. The useful question isn't residential-vs-datacenter; it's which source class goes through which exit class. A practical routing matrix for training, RAG, and evaluation pipelines.
- infrastructure
- proxy-strategy
- training-data
- rag
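The matrix reduces to a small lookup in practice. A minimal sketch — the source classes and exit-class names here are illustrative, not the post's exact taxonomy:

```python
# Map each source class to the cheapest exit class its defences
# actually require. Over-provisioning residential IPs for sources
# that don't fingerprint is where most budgets leak.
ROUTING_MATRIX = {
    "open-data-mirror": "direct",      # S3 buckets, public mirrors: no proxy
    "rate-limited-api": "datacenter",  # per-IP limits, no fingerprinting
    "plain-web": "datacenter",         # static sites without bot defences
    "bot-protected": "residential",    # active fingerprinting / challenges
}


def exit_class(source_class: str) -> str:
    # Default to the safest (and priciest) exit when the source
    # class is unknown — misrouting costs more than over-paying once.
    return ROUTING_MATRIX.get(source_class, "residential")
```

The interesting engineering is in classifying sources automatically and demoting them to cheaper exits when challenges stop appearing.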
5 min read · Hamza Rahim
Why your eval benchmark is lying to you: regional variation as methodology
Most public LLM eval benchmarks run from a single origin and report a single score per model. Running the same benchmark from 10 regions surfaces variance that single-origin testing hides — and the variance is larger than the reported confidence intervals.
- evaluation
- methodology
- benchmarking
5 min read · Reeya Patel
Proxy infrastructure for RAG pipelines: latency, consistency, versioning
A RAG index is only as useful as its corpus is consistent and current. The proxy layer is where consistency and currency live or die. A practical guide to picking exit classes per source, handling latency under load, and versioning re-scrapes.
- rag
- infrastructure
- proxy-strategy
4 min read · Elena Novak
The hidden bias in Common Crawl sampling — and how to fix it from your side
Common Crawl is the default corpus backbone for open LLM training. Its sampling is not uniform, and the biases it introduces show up downstream in very specific ways. Here is what to look for and how to correct it in your own pipeline.
- training-data
- common-crawl
- corpus-bias
4 min read · Hamza Rahim
How much does geography actually change ChatGPT's answer? A 10-country test
We ran 800 prompts against GPT-4o from 10 country origins to measure how much the answer to the same question changes when the request's IP geography changes. The delta is smaller than we expected, larger than zero, and concentrated in a specific class of prompt.
- evaluation
- regional-bias
- methodology
4 min read · Imogen Reyes
Tokenization-aware dedup at scrape time, not after
Most training-corpus pipelines run MinHash dedup after collection finishes. Running it at scrape time with tokenizer-aware signatures saves terabytes and produces a cleaner corpus. Here is the approach that worked for us and why it matters.
- training-data
- deduplication
- corpus-engineering