Writing
Notes from the network
Long-form pieces on proxy selection, bot detection, scraping tradecraft, and the parts of the internet we spend our days routing traffic across.
6 min read · Nathan Brecher
Proxies for AI browser agents: the 2026 workload shape
AI browser agents — Claude Computer Use, ChatGPT Operator, Gemini-based web agents, and the long tail of open-source browser agents — hit the web from a new angle. The proxy layer below them has specific requirements that general-purpose proxy products don't meet. Here's what matters.
- ai-agents
- browser-automation
- proxy-strategy
4 min read · Nathan Brecher
Proxies for ChatGPT Operator: browser-agent configuration that works
ChatGPT Operator runs browser-based task execution for end users. Operator-style agents (Operator itself, open-source clones, custom GPTs with browsing) all share a proxy configuration shape. A working reference.
- ai-agents
- chatgpt-operator
- openai
4 min read · Nathan Brecher
Proxies for Claude Computer Use: session patterns and exit-class choice
Claude Computer Use operates browsers at screen-pixel level. The proxy layer below it needs specific configuration to keep agent sessions coherent and avoid anti-bot challenges. A working guide for production Computer Use deployments.
- ai-agents
- claude-computer-use
- anthropic
6 min read · Nathan Brecher
Choosing a proxy for LLM training data collection: criteria that actually matter
Listicles ranking "best proxies for AI" miss the criteria that AI engineers weigh in practice. An honest breakdown of the tradeoffs — pool provenance, ASN diversity, bandwidth economics, concurrency ceilings, and legal footprint — for teams collecting LLM training data at scale.
- proxy-strategy
- training-data
- buyer-guide
6 min read · Reeya Patel
Ethical residential proxies for AI research: why provenance is a methodology concern
If your training-data corpus passes through a residential proxy pool whose peers didn't meaningfully consent, their consent problem is now your provenance problem. A practical framing of why proxy-pool provenance matters for AI research specifically, and what to ask a vendor.
- ethics
- provenance
- training-data
6 min read · Nathan Brecher
Proxies for arXiv bulk download: OAI-PMH, S3, and the API — which needs which
arXiv publishes three access paths with different rate-limit behaviour, and only one of them benefits from proxies. A practical breakdown of when to use OAI-PMH metadata harvests, the S3 PDF mirror, and the arXiv API — and where proxies fit.
- arxiv
- training-data
- research
6 min read · Nathan Brecher
Proxies for Common Crawl: when you need one, when you don't, and how to route
Common Crawl publishes ~250 TB of web content per monthly snapshot and makes most of it freely accessible from S3. Proxies still have a role — but a narrower one than most scraping guides suggest. A working engineer's breakdown.
- common-crawl
- training-data
- infrastructure
6 min read · Nathan Brecher
Proxies for Hugging Face dataset downloads: when HF_HUB_DOWNLOAD_TIMEOUT won't save you
Hugging Face rate-limits aggressively per-IP. Raising the timeout doesn't help; the server has already decided. A practical guide to routing HF bulk pulls through a proxy layer without breaking LFS resumption.
- huggingface
- training-data
- infrastructure
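A taste of the configuration shape the post covers — a minimal sketch, assuming a hypothetical proxy gateway URL. `huggingface_hub` downloads ride on `requests`, which honours the standard proxy environment variables, so routing bulk pulls is mostly a matter of setting them before the first download call:

```python
import os

# Hypothetical proxy endpoint — swap in your own gateway.
PROXY_URL = "http://user:pass@proxy.example.com:8080"


def configure_hf_proxy(proxy_url: str = PROXY_URL) -> None:
    """Route Hugging Face Hub traffic through a proxy gateway."""
    os.environ["HTTPS_PROXY"] = proxy_url
    os.environ["HTTP_PROXY"] = proxy_url
    # Raising the timeout alone won't dodge per-IP rate limits,
    # but it stops slow proxy hops from aborting large transfers.
    os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "30"


def pull_dataset(repo_id: str) -> str:
    from huggingface_hub import snapshot_download

    configure_hf_proxy()
    # snapshot_download resumes partial files, so a dropped proxy
    # connection doesn't restart large LFS objects from zero.
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        max_workers=4,  # modest concurrency: don't hammer one exit IP
    )
```

The full post covers the part this sketch elides: rotating exit IPs per worker without invalidating resumable downloads.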
6 min read · Reeya Patel
Proxies as methodology for multilingual LLM benchmarks
Multilingual LLM evaluation that uses only US-cloud-origin requests under-reports regional content policy and geo-dependent response divergence. A proxy layer anchored to each benchmark language's primary country is methodology, not infrastructure.
- llm-evaluation
- benchmarking
- multilingual
8 min read · Reeya Patel
Residential vs datacenter proxy for AI workloads: a routing matrix
Most AI teams over-index on residential proxies and pay too much for coverage they don't need. The useful question isn't residential-vs-datacenter; it's which source class goes through which exit class. A practical routing matrix for training, RAG, and evaluation pipelines.
- infrastructure
- proxy-strategy
- training-data
- rag
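The matrix reduces to a small lookup in practice. A minimal sketch — the source classes and exit-class names here are illustrative, not the post's exact taxonomy:

```python
# Map each source class to the cheapest exit class its defences
# actually require. Over-provisioning residential IPs for sources
# that don't fingerprint is where most budgets leak.
ROUTING_MATRIX = {
    "open-data-mirror": "direct",      # S3 buckets, public mirrors: no proxy
    "rate-limited-api": "datacenter",  # per-IP limits, no fingerprinting
    "plain-web": "datacenter",         # static sites without bot defences
    "bot-protected": "residential",    # active fingerprinting / challenges
}


def exit_class(source_class: str) -> str:
    # Default to the safest (and priciest) exit when the source
    # class is unknown — misrouting costs more than over-paying once.
    return ROUTING_MATRIX.get(source_class, "residential")
```

The interesting engineering is in classifying sources automatically and demoting them to cheaper exits when challenges stop appearing.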
5 min read · Hamza Rahim
Why your eval benchmark is lying to you: regional variation as methodology
Most public LLM eval benchmarks run from a single origin and report a single score per model. Running the same benchmark from 10 regions surfaces variance that single-origin testing hides — and the variance is larger than the reported confidence intervals.
- evaluation
- methodology
- benchmarking
5 min read · Reeya Patel
Proxy infrastructure for RAG pipelines: latency, consistency, versioning
A RAG index is only as useful as its corpus is consistent and current. The proxy layer is where consistency and currency live or die. A practical guide to picking exit classes per source, handling latency under load, and versioning re-scrapes.
- rag
- infrastructure
- proxy-strategy
4 min read · Elena Novak
The hidden bias in Common Crawl sampling — and how to fix it from your side
Common Crawl is the default corpus backbone for open LLM training. Its sampling is not uniform, and the biases it introduces show up downstream in very specific ways. Here is what to look for and how to correct it in your own pipeline.
- training-data
- common-crawl
- corpus-bias
4 min read · Hamza Rahim
How much does geography actually change ChatGPT's answer? A 10-country test
We ran 800 prompts against GPT-4o from 10 country origins to measure how much the answer to the same question changes when the request's IP geography changes. The delta is smaller than we expected, larger than zero, and concentrated in a specific class of prompt.
- evaluation
- regional-bias
- methodology
4 min read · Imogen Reyes
Tokenization-aware dedup at scrape time, not after
Most training-corpus pipelines run MinHash dedup after collection finishes. Running it at scrape time with tokenizer-aware signatures saves terabytes and produces a cleaner corpus. Here is the approach that worked for us and why it matters.
- training-data
- deduplication
- corpus-engineering