Proxies for scraping the AI research surface at scale
arXiv publishes thousands of AI-relevant papers per month. HuggingFace hosts millions of models and datasets. Papers With Code, OpenReview, and leaderboard platforms change daily. SquadProxy gives you the infrastructure to keep that surface current.
The scraping surface for AI research
If you are building a research-intelligence product, a meta-eval harness, or a dataset provenance tool, you are scraping a reasonably stable set of sources:
- arXiv. The primary preprint server for ML, AI, NLP, CV, and several adjacent fields. Paper counts grow substantially year-over-year; AI-related sub-categories (cs.AI, cs.LG, cs.CL, cs.CV) alone see thousands of submissions per month.
- OpenReview. Conference review venues — NeurIPS, ICLR, ICML, ACL, EMNLP — run through OpenReview. The review traces and decision histories are public and valuable.
- Papers With Code. Paper-to-code-to-benchmark graph. Rate limits are real.
- HuggingFace. Model cards, dataset cards, Spaces, daily papers, leaderboards. API available for most of this but rate-limited; scraping fills the gaps.
- Leaderboards. LMSYS Arena, HELM, MMLU leaderboards, MTEB, Open LLM Leaderboard, and specialised vertical boards.
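Most of these sources expose plain HTTP APIs before any proxying enters the picture. As a minimal sketch, this builds a query URL for the arXiv API at export.arxiv.org (the endpoint and its search_query/start/max_results parameters are arXiv's documented API; the helper function name is ours):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(category: str, start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API query URL for one sub-category,
    newest submissions first. The response is an Atom feed."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Paginate by stepping `start` while respecting the per-IP spacing discussed below.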
What SquadProxy specifically helps with
- Scale for arXiv. arXiv permits bulk downloading via its AWS S3 buckets; otherwise HTTP rate limits apply. A SquadProxy datacenter exit in us-east-1 sits in the same region as those buckets, so bulk collection runs at line rate.
- Rate-limit distribution for HuggingFace. HuggingFace's rate limiter is per-IP. A datacenter pool with rotating exits distributes a large crawl across hundreds of IPs, keeping each well inside the per-IP ceiling.
- Geoblocks on OpenReview and some leaderboard mirrors. Some academic infrastructure geoblocks outside .edu/.ac.uk peers. Residential exits in the UK or Canada resolve this.
- Consistency. Running the whole benchmark-scraping pipeline through one gateway (with per-source class selection) means one auth system, one usage metric, one invoice.
Practical config
import httpx

# Hypothetical gateway endpoint — substitute your real SquadProxy credentials.
PROXY = "http://USER:PASS@gateway.squadproxy.example:8000"

SOURCES = {
    "arxiv": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "huggingface": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "openreview": {"class": "residential", "country": "us", "rotation": "sticky-10m"},
    "paperswithcode": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "lmsys-arena": {"class": "residential", "country": "us", "rotation": "per-request"},
}

def fetch(source: str, url: str) -> httpx.Response:
    cfg = SOURCES[source]
    headers = {
        "X-Squad-Class": cfg["class"],
        "X-Squad-Country": cfg["country"],
        "X-Squad-Session": cfg["rotation"],
        "User-Agent": "research-intel/1.0 (contact: you@yourlab.edu)",
    }
    # httpx >= 0.26 takes proxy=; older releases use proxies=
    return httpx.get(url, proxy=PROXY, headers=headers, timeout=45)
Set a descriptive User-Agent with contact info. Academic infrastructure responds well to identifiable research traffic and badly to generic browser-mimicking UAs that look like scraping for a commercial downstream product.
Respect the sources
arXiv, HuggingFace, OpenReview, and the leaderboard platforms are public goods. Scrape them sustainably: respect robots.txt, follow published rate-limit headers, and contribute back where possible (citations, bug reports, mirror donations). SquadProxy rate-caps these source hostnames at the gateway to keep customers from accidentally overwhelming them.
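The robots.txt half of that discipline fits in a few lines of standard library. A sketch, assuming the robots body has already been fetched once per host (`allowed` is a hypothetical helper name, not a SquadProxy API):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse an already-fetched robots.txt body and check one path
    against our crawler's User-Agent before queueing the request."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)
```

Cache the parsed rules per hostname; re-fetching robots.txt on every request is itself impolite.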
Benchmark-specific workflows
Research-intelligence workloads tend to cluster around a small set of benchmarks whose update cadence and access model differ enough that the proxy configuration has to change per target.
MMLU, MMLU-Pro, MMLU-ProX
Published on GitHub and HuggingFace under permissive licences. Bulk access via datasets.load_dataset("cais/mmlu") works through the HuggingFace Hub rate limiter described in our HF datasets post. For the multilingual variant (MMLU-ProX, 29 languages) the raw files live in a single HF repo, split per language; a sticky-ISP session on the full pull keeps LFS resumption consistent.
For reproducibility: pin the dataset commit hash at pull time and log it alongside your eval results. A year-later re-run needs to target the same commit, not "latest".
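One way to sketch that pin: `datasets.load_dataset` accepts a `revision` argument, so storing the commit sha in a small provenance record makes the year-later re-run target the same commit. The record shape here is illustrative:

```python
from datetime import datetime, timezone

def pin_record(repo_id: str, commit_sha: str) -> dict:
    """Provenance entry logged next to eval results. On a re-run, pass
    the stored sha back as the revision, e.g.:
        load_dataset("cais/mmlu", "all", revision=rec["revision"])
    """
    return {
        "repo_id": repo_id,
        "revision": commit_sha,
        "pulled_at": datetime.now(timezone.utc).isoformat(),
    }
```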
HELM (Holistic Evaluation of Language Models)
Published by Stanford CRFM. The leaderboard (crfm.stanford.edu/helm) updates on an irregular cadence. Scraping the leaderboard pages directly is straightforward; datacenter with polite spacing works. The actual scenario files are on GitHub and pull cleanly.
For teams running HELM-adjacent evals: the scenarios are licence-pinned; your pipeline should capture the licence string per scenario and respect the redistribution terms.
GPQA, GPQA-Diamond
Gated dataset on HuggingFace — requires acknowledgement of the licence via the HF Hub UI before download. Once the account-level acknowledgement is in place, pulls work through standard HF flow. No special proxy configuration needed beyond the HF dataset pattern.
For redistribution: GPQA's licence permits use for eval but restricts derivative-dataset publication. Document the chain of custody in your eval metadata.
LMSYS Chatbot Arena
Public leaderboard at chat.lmsys.org. Rate limits are soft but real — too many queries from one IP slow-rolls responses. Datacenter with a small rotating pool (10-20 IPs) plus 1-request-per-2-seconds spacing per IP is the polite pattern.
The underlying Arena battles dataset is published on HuggingFace with provenance controls; pull it through the HF flow.
Open LLM Leaderboard, MTEB, other HF-hosted boards
All go through the HF Hub. One configuration serves all of them. Rate-limit pressure here is per-IP on the HF CDN, not per-leaderboard, so the ISP pattern from the HF post applies directly.
Rate-limit map per target
Practical rates that hold without triggering throttling, from ~18 months of running benchmark-scraping customers:
| Target | Per-IP rate budget | Concurrency ceiling | Exit class | Session |
|---|---|---|---|---|
| arXiv API | 1 req / 3s | 20 IPs (~6.7 rps aggregate) | Datacenter | Per-request |
| arXiv S3 mirror | No practical limit | Bound by AWS quotas | Direct (no proxy) | — |
| HuggingFace Hub | 100-200 req/min per IP | 8 parallel datasets | ISP | Sticky-dataset |
| OpenReview | 1 req / 2s | 10 IPs (5 rps aggregate) | Residential or ISP | Sticky |
| Papers With Code | 1 req / 2s | 10 IPs | Datacenter | Per-request |
| LMSYS leaderboard | 1 req / 2s | 10-15 IPs | Datacenter | Per-request |
| Various HELM mirrors | 1 req / 5s | 5 IPs | Datacenter | Per-request |
| HuggingFace Spaces (public output scrape) | 60 req/min per IP | 5-8 IPs | Datacenter | Per-request |
These are sustained rates we run without visible degradation. Push above them and error rates climb, at which point the polite operator backs off. Under-running the limits is fine; over-running them eventually gets ASNs block-listed.
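When a target does start throttling (429s, 503s), the usual polite response is exponential back-off with a cap. A minimal sketch — the defaults are illustrative, and production code should add jitter so a fleet of workers doesn't retry in lockstep:

```python
def backoff_schedule(base: float = 2.0, cap: float = 300.0, retries: int = 6) -> list[float]:
    """Delays (seconds) for successive throttled responses:
    base * 2^i, capped so a long outage doesn't produce hour waits."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]
```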
Dataset licensing and redistribution
Research-intelligence pipelines frequently conflate three distinct compliance surfaces: (1) permission to scrape the source, (2) permission to redistribute the scraped data, and (3) permission to use the data for model training. For published benchmarks, the licences govern redistribution and use explicitly; the scraping permission is usually implicit (the platform publishes the data, scraping the canonical form doesn't differ from downloading it).
Common licences in this space:
- MIT / Apache 2 — unrestricted use and redistribution with attribution. Applies to most GitHub-hosted eval datasets.
- Creative Commons BY — same, formalised for dataset publication.
- Creative Commons BY-NC — redistribution OK, commercial use restricted. Some academic benchmarks fall here.
- Custom research-only — GPQA, some newer safety benchmarks. Explicit TOS acceptance required; redistribution restricted.
- Dataset-card "request access" — HF's gating. Acknowledgement of licence via account is part of the compliance chain.
Your benchmark-scraping pipeline should capture the source-side licence at ingest time and propagate it forward as a column in your eval metadata store. This matters when the output of your pipeline (leaderboard, analysis report, derivative dataset) ships to a third party who will ask "where did this come from and can I use it?"
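A sketch of that ingest-time licence gate. The identifiers are SPDX-style examples, not a complete or authoritative mapping — the dataset card remains the source of truth for any given benchmark:

```python
# Illustrative buckets only; extend from the actual dataset cards you ingest.
REDISTRIBUTABLE = {"mit", "apache-2.0", "cc-by-4.0"}
NON_COMMERCIAL = {"cc-by-nc-4.0"}
RESEARCH_ONLY = {"research-only"}

def redistribution_status(licence: str) -> str:
    """Classify a captured licence string for the eval metadata store."""
    key = licence.strip().lower()
    if key in REDISTRIBUTABLE:
        return "ok"
    if key in NON_COMMERCIAL:
        return "non-commercial-only"
    if key in RESEARCH_ONLY:
        return "restricted"
    return "unknown"  # unknown means "a human checks the card", not "ok"
```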
Reproducibility for published benchmark reports
A benchmark report that survives academic review needs to document its ingestion pipeline, not just its evaluation:
- Commit hash of each benchmark dataset at the time of eval
- Scrape timestamp for leaderboard-sourced data
- Proxy configuration (exit class, country, rotation) per target
- Rate discipline observed (requests per minute per IP)
- Licence acknowledgements for gated datasets
- Any corrections applied for known benchmark bugs (MMLU has several documented errata; your pipeline should note which patch state was used)
Without these, a year-later re-run diverges from your original in ways that look like regression but are actually pipeline drift. See the LLM evaluation use case and the multilingual benchmark post for the eval-side reproducibility requirements that this ingestion side supports.
Where to start
For a research-intelligence team onboarding in a week:
- Classify your benchmark targets into three buckets: API (arXiv, HF Hub), static scrape (OpenReview, leaderboards), gated download (GPQA, some HELM scenarios).
- One gateway, three exit classes active: datacenter, residential, ISP. Mobile not required unless your eval specifically tests mobile-origin bias.
- Polite User-Agent with institutional contact email. This matters for arXiv and OpenReview particularly; sites that see identifiable research traffic respond differently to rate-limit pressure than they do to anonymous bulk scraping.
- Capture licence and commit-hash metadata at ingest. Don't retrofit this; retrofitted provenance is always incomplete.
- Pin the source-to-exit-class map and treat it as the source-of-truth config for the life of the research programme.
The Team plan covers this shape for most research-intel pipelines; see pricing.
Further reading
- Proxies for arXiv bulk download
- Proxies for Hugging Face datasets
- LLM evaluation use case — downstream of benchmark scraping
- Competitive AI intelligence — adjacent workflow
- Residential vs datacenter routing matrix
Pricing
Pricing for benchmark and paper scraping
Every plan carries every exit class — pick the one whose bandwidth envelope fits your workload.
Solo
For individual researchers running evaluation scripts and prototype RAG pipelines.
$149/month
or $1,430/year (save 20%)
50 GB residential · unlimited datacenter · 200 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 50 GB residential · unlimited datacenter
- ✓ 5 static ISP IPs · 5 GB 4G mobile
- ✓ 1 seat · 200 concurrent sessions
- ✓ Python + Node SDK + REST API
- ✓ Per-request metering (not time-based)
- ✓ Email support (24h response, business days)
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- Solo researchers
- Evaluation scripts
- Prototype RAG
Team
Most popular
For AI startups and mid-size labs splitting capacity between training and evaluation.
$699/month
or $6,710/year (save 20%)
500 GB residential · unlimited datacenter · 1,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 500 GB residential · unlimited datacenter
- ✓ 25 static ISP IPs · 25 GB 4G mobile
- ✓ 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
- ✓ City-level geo-routing + ASN targeting
- ✓ 99.9% uptime SLA
- ✓ Priority Slack support (4h response, business hours)
- ✓ Python + Node SDK + REST API + webhooks
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- AI startups
- Mid-size labs
- Model eval teams
Lab
For academic labs, eval consortia, and frontier model companies running sustained workloads.
$2,999/month
or $28,790/year (save 20%)
2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 countries on 4 continents
- ✓ 2 TB residential · unlimited datacenter
- ✓ 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
- ✓ 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
- ✓ Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
- ✓ 99.95% uptime SLA
- ✓ Dedicated Slack channel (1h response, business hours)
- ✓ Custom BGP prefix on request (additional fees apply)
- ✓ Overage: $2.50/GB residential · $5/GB mobile
Best for
- Academic labs
- Large eval consortia
- Frontier model companies
Enterprise
Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.
Custom pricing
Custom (from 5 TB/mo residential) · unlimited concurrent sessions
- ✓ Volume pricing from 5 TB/mo residential
- ✓ Dedicated BGP prefix + ASN announcement
- ✓ Unlimited concurrent sessions · unlimited seats
- ✓ 99.99% uptime SLA with financial credits
- ✓ Named Technical Account Manager + 24/7 on-call paging
- ✓ Custom AUP, DPA, on-site deployment option
- ✓ Research / academic discount (30–50% off Team or Lab)
- ✓ Annual contract · wire, ACH, USDC/USDT/BTC settlement
Best for
- Frontier labs
- Eval consortia
- Enterprise AI
All plans include 14-day refund, single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100%; traffic continues only if overage billing is enabled on your account.
Ship on a proxy network you can actually call your ops team about
Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.