Proxies for scraping the AI research surface at scale
arXiv publishes thousands of AI-relevant papers per month. HuggingFace hosts millions of models and datasets. Papers With Code, OpenReview, and leaderboard platforms change daily. SquadProxy gives you the infrastructure to keep that surface current.
The scraping surface for AI research
If you are building a research-intelligence product, a meta-eval harness, or a dataset provenance tool, you are scraping a reasonably stable set of sources:
- arXiv. The primary preprint server for ML, AI, NLP, CV, and several adjacent fields. Paper counts grow substantially year-over-year; AI-related sub-categories (cs.AI, cs.LG, cs.CL, cs.CV) alone see thousands of submissions per month.
- OpenReview. Conference review venues — NeurIPS, ICLR, ICML, ACL, EMNLP — run through OpenReview. The review traces and decision histories are public and valuable.
- Papers With Code. Paper-to-code-to-benchmark graph. Rate limits are real.
- HuggingFace. Model cards, dataset cards, Spaces, daily papers, leaderboards. API available for most of this but rate-limited; scraping fills the gaps.
- Leaderboards. LMSYS Arena, HELM, MMLU leaderboards, MTEB, Open LLM Leaderboard, and specialised vertical boards.
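Most of these sources expose plain HTTP APIs before any proxying enters the picture. As a minimal sketch, this builds a query URL for the arXiv API at export.arxiv.org (the endpoint and its search_query/start/max_results parameters are arXiv's documented API; the helper function name is ours):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(category: str, start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API query URL for one sub-category,
    newest submissions first. The response is an Atom feed."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Paginate by stepping `start` while respecting the per-IP spacing discussed below.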
What SquadProxy specifically helps with
- Scale for arXiv. arXiv permits bulk downloading via its AWS S3 buckets; otherwise HTTP rate limits apply. A SquadProxy datacenter exit in us-east-1 sits in the same region as those buckets, so bulk collection runs at line rate.
- Rate-limit distribution for HuggingFace. HuggingFace's rate limiter is per-IP. A datacenter pool with rotating exits distributes a large crawl across hundreds of IPs, keeping each well inside the per-IP ceiling.
- Geoblocks on OpenReview and some leaderboard mirrors. Some academic infrastructure geoblocks outside .edu/.ac.uk peers. Residential exits in the UK or Canada resolve this.
- Consistency. Running the whole benchmark-scraping pipeline through one gateway (with per-source class selection) means one auth system, one usage metric, one invoice.
Practical config
import httpx

# Hypothetical gateway endpoint — substitute your real SquadProxy credentials.
PROXY = "http://USER:PASS@gateway.squadproxy.example:8000"

SOURCES = {
    "arxiv": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "huggingface": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "openreview": {"class": "residential", "country": "us", "rotation": "sticky-10m"},
    "paperswithcode": {"class": "datacenter", "country": "us", "rotation": "per-request"},
    "lmsys-arena": {"class": "residential", "country": "us", "rotation": "per-request"},
}

def fetch(source: str, url: str) -> httpx.Response:
    cfg = SOURCES[source]
    headers = {
        "X-Squad-Class": cfg["class"],
        "X-Squad-Country": cfg["country"],
        "X-Squad-Session": cfg["rotation"],
        "User-Agent": "research-intel/1.0 (contact: you@yourlab.edu)",
    }
    # httpx >= 0.26 takes proxy=; older releases use proxies=
    return httpx.get(url, proxy=PROXY, headers=headers, timeout=45)
Set a descriptive User-Agent with contact info. Academic infrastructure responds well to identifiable research traffic and badly to generic browser-mimicking UAs that look like scraping for a commercial downstream product.
Respect the sources
arXiv, HuggingFace, OpenReview, and the leaderboard platforms are public goods. Scrape them sustainably: respect robots.txt, follow published rate-limit headers, and contribute back where possible (citations, bug reports, mirror donations). SquadProxy rate-caps these source hostnames at the gateway to keep customers from accidentally overwhelming them.
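The robots.txt half of that discipline fits in a few lines of standard library. A sketch, assuming the robots body has already been fetched once per host (`allowed` is a hypothetical helper name, not a SquadProxy API):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse an already-fetched robots.txt body and check one path
    against our crawler's User-Agent before queueing the request."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)
```

Cache the parsed rules per hostname; re-fetching robots.txt on every request is itself impolite.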
Benchmark-specific workflows
Research-intelligence workloads tend to cluster around a small set of benchmarks whose update cadence and access model differ enough that the proxy configuration has to change per target.
MMLU, MMLU-Pro, MMLU-ProX
Published on GitHub and HuggingFace under permissive licences. Bulk access via datasets.load_dataset("cais/mmlu") works through the HuggingFace Hub rate limiter described in our HF datasets post. For the multilingual variant (MMLU-ProX, 29 languages) the raw files live in a single HF repo, split per language; a sticky-ISP session on the full pull keeps LFS resumption consistent.
For reproducibility: pin the dataset commit hash at pull time and log it alongside your eval results. A year-later re-run needs to target the same commit, not "latest".
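One way to sketch that pin: `datasets.load_dataset` accepts a `revision` argument, so storing the commit sha in a small provenance record makes the year-later re-run target the same commit. The record shape here is illustrative:

```python
from datetime import datetime, timezone

def pin_record(repo_id: str, commit_sha: str) -> dict:
    """Provenance entry logged next to eval results. On a re-run, pass
    the stored sha back as the revision, e.g.:
        load_dataset("cais/mmlu", "all", revision=rec["revision"])
    """
    return {
        "repo_id": repo_id,
        "revision": commit_sha,
        "pulled_at": datetime.now(timezone.utc).isoformat(),
    }
```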
HELM (Holistic Evaluation of Language Models)
Published by Stanford CRFM. The leaderboard (crfm.stanford.edu/helm) updates on an irregular cadence. Scraping the leaderboard pages directly is straightforward; datacenter with polite spacing works. The actual scenario files are on GitHub and pull cleanly.
For teams running HELM-adjacent evals: the scenarios are licence-pinned; your pipeline should capture the licence string per scenario and respect the redistribution terms.
GPQA, GPQA-Diamond
Gated dataset on HuggingFace — requires acknowledgement of the licence via the HF Hub UI before download. Once the account-level acknowledgement is in place, pulls work through standard HF flow. No special proxy configuration needed beyond the HF dataset pattern.
For redistribution: GPQA's licence permits use for eval but restricts derivative-dataset publication. Document the chain of custody in your eval metadata.
LMSYS Chatbot Arena
Public leaderboard at chat.lmsys.org. Rate limits are soft but real — too many queries from one IP slow-rolls responses. Datacenter with a small rotating pool (10-20 IPs) plus 1-request-per-2-seconds spacing per IP is the polite pattern.
The underlying Arena battles dataset is published on HuggingFace with provenance controls; pull it through the HF flow.
Open LLM Leaderboard, MTEB, other HF-hosted boards
All go through the HF Hub. One configuration serves all of them. Rate-limit pressure here is per-IP on the HF CDN, not per-leaderboard, so the ISP pattern from the HF post applies directly.
Rate-limit map per target
Practical rates that hold without triggering throttling, from ~18 months of running benchmark-scraping customers:
| Target | Per-IP rate budget | Concurrency ceiling | Exit class | Session |
|---|---|---|---|---|
| arXiv API | 1 req / 3s | 20 IPs (~6.7 rps aggregate) | Datacenter | Per-request |
| arXiv S3 mirror | No practical limit | Bound by AWS quotas | Direct (no proxy) | — |
| HuggingFace Hub | 100-200 req/min per IP | 8 parallel datasets | ISP | Sticky-dataset |
| OpenReview | 1 req / 2s | 10 IPs (5 rps aggregate) | Residential or ISP | Sticky |
| Papers With Code | 1 req / 2s | 10 IPs | Datacenter | Per-request |
| LMSYS leaderboard | 1 req / 2s | 10-15 IPs | Datacenter | Per-request |
| Various HELM mirrors | 1 req / 5s | 5 IPs | Datacenter | Per-request |
| HuggingFace Spaces (public output scrape) | 60 req/min per IP | 5-8 IPs | Datacenter | Per-request |
These are sustained rates we run without visible degradation. Push above them and error rates climb, at which point the polite operator backs off. Under-running the limits is fine; over-running them eventually gets ASNs block-listed.
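When a target does start throttling (429s, 503s), the usual polite response is exponential back-off with a cap. A minimal sketch — the defaults are illustrative, and production code should add jitter so a fleet of workers doesn't retry in lockstep:

```python
def backoff_schedule(base: float = 2.0, cap: float = 300.0, retries: int = 6) -> list[float]:
    """Delays (seconds) for successive throttled responses:
    base * 2^i, capped so a long outage doesn't produce hour waits."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]
```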
Dataset licensing and redistribution
Research-intelligence pipelines frequently conflate three distinct compliance surfaces: (1) permission to scrape the source, (2) permission to redistribute the scraped data, and (3) permission to use the data for model training. For published benchmarks, the licences govern redistribution and use explicitly; the scraping permission is usually implicit (the platform publishes the data, scraping the canonical form doesn't differ from downloading it).
Common licences in this space:
- MIT / Apache 2 — unrestricted use and redistribution with attribution. Applies to most GitHub-hosted eval datasets.
- Creative Commons BY — same, formalised for dataset publication.
- Creative Commons BY-NC — redistribution OK, commercial use restricted. Some academic benchmarks fall here.
- Custom research-only — GPQA, some newer safety benchmarks. Explicit TOS acceptance required; redistribution restricted.
- Dataset-card "request access" — HF's gating. Acknowledgement of licence via account is part of the compliance chain.
Your benchmark-scraping pipeline should capture the source-side licence at ingest time and propagate it forward as a column in your eval metadata store. This matters when the output of your pipeline (leaderboard, analysis report, derivative dataset) ships to a third party who will ask "where did this come from and can I use it?"
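A sketch of that ingest-time licence gate. The identifiers are SPDX-style examples, not a complete or authoritative mapping — the dataset card remains the source of truth for any given benchmark:

```python
# Illustrative buckets only; extend from the actual dataset cards you ingest.
REDISTRIBUTABLE = {"mit", "apache-2.0", "cc-by-4.0"}
NON_COMMERCIAL = {"cc-by-nc-4.0"}
RESEARCH_ONLY = {"research-only"}

def redistribution_status(licence: str) -> str:
    """Classify a captured licence string for the eval metadata store."""
    key = licence.strip().lower()
    if key in REDISTRIBUTABLE:
        return "ok"
    if key in NON_COMMERCIAL:
        return "non-commercial-only"
    if key in RESEARCH_ONLY:
        return "restricted"
    return "unknown"  # unknown means "a human checks the card", not "ok"
```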
Reproducibility for published benchmark reports
A benchmark report that survives academic review needs to document its ingestion pipeline, not just its evaluation:
- Commit hash of each benchmark dataset at the time of eval
- Scrape timestamp for leaderboard-sourced data
- Proxy configuration (exit class, country, rotation) per target
- Rate discipline observed (requests per minute per IP)
- Licence acknowledgements for gated datasets
- Any corrections applied for known benchmark bugs (MMLU has several documented errata; your pipeline should note which patch state was used)
Without these, a year-later re-run diverges from your original in ways that look like regression but are actually pipeline drift. See the LLM evaluation use case and the multilingual benchmark post for the eval-side reproducibility requirements that this ingestion side supports.
Where to start
For a research-intelligence team onboarding in a week:
- Classify your benchmark targets into three buckets: API (arXiv, HF Hub), static scrape (OpenReview, leaderboards), gated download (GPQA, some HELM scenarios).
- One gateway, three exit classes active: datacenter, residential, ISP. Mobile not required unless your eval specifically tests mobile-origin bias.
- Polite User-Agent with institutional contact email. This matters for arXiv and OpenReview particularly; sites that see identifiable research traffic respond differently to rate-limit pressure than they do to anonymous bulk scraping.
- Capture licence and commit-hash metadata at ingest. Don't retrofit this; retrofitted provenance is always incomplete.
- Pin the source-to-exit-class map and treat it as the source-of-truth config for the life of the research programme.
The Team plan covers this shape for most research-intel pipelines; see pricing.
Further reading
- Proxies for arXiv bulk download
- Proxies for Hugging Face datasets
- LLM evaluation use case — downstream of benchmark scraping
- Competitive AI intelligence — adjacent workflow
- Residential vs datacenter routing matrix
Pricing
Pricing for benchmark and paper scraping
Every plan carries every exit class — pick the one whose bandwidth envelope fits your workload.
Solo
For individual researchers running evaluation scripts and prototype RAG pipelines.
$149/month
or $1,430/year (save 20%)
50 GB residential · unlimited datacenter · 200 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 50 GB residential · unlimited datacenter
- ✓ 5 static ISP IPs · 5 GB 4G mobile
- ✓ 1 seat · 200 concurrent sessions
- ✓ Python + Node SDK + REST API
- ✓ Per-request metering (not time-based)
- ✓ Email support (24h response, business days)
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- Solo researchers
- Evaluation scripts
- Prototype RAG
Team
Most popular
For AI startups and mid-size labs splitting capacity between training and evaluation.
$699/month
or $6,710/year (save 20%)
500 GB residential · unlimited datacenter · 1,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 focus countries
- ✓ 500 GB residential · unlimited datacenter
- ✓ 25 static ISP IPs · 25 GB 4G mobile
- ✓ 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
- ✓ City-level geo-routing + ASN targeting
- ✓ 99.9% uptime SLA
- ✓ Priority Slack support (4h response, business hours)
- ✓ Python + Node SDK + REST API + webhooks
- ✓ Overage: $3/GB residential · $6/GB mobile
Best for
- AI startups
- Mid-size labs
- Model eval teams
Lab
For academic labs, eval consortia, and frontier model companies running sustained workloads.
$2,999/month
or $28,790/year (save 20%)
2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions
- ✓ Access to all 5 exit classes · 10 countries on 4 continents
- ✓ 2 TB residential · unlimited datacenter
- ✓ 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
- ✓ 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
- ✓ Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
- ✓ 99.95% uptime SLA
- ✓ Dedicated Slack channel (1h response, business hours)
- ✓ Custom BGP prefix on request (additional fees apply)
- ✓ Overage: $2.50/GB residential · $5/GB mobile
Best for
- Academic labs
- Large eval consortia
- Frontier model companies
Enterprise
Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.
Custom pricing
Custom (from 5 TB/mo residential) · unlimited concurrent sessions
- ✓ Volume pricing from 5 TB/mo residential
- ✓ Dedicated BGP prefix + ASN announcement
- ✓ Unlimited concurrent sessions · unlimited seats
- ✓ 99.99% uptime SLA with financial credits
- ✓ Named Technical Account Manager + 24/7 on-call paging
- ✓ Custom AUP, DPA, on-site deployment option
- ✓ Research / academic discount (30–50% off Team or Lab)
- ✓ Annual contract · wire, ACH, USDC/USDT/BTC settlement
Best for
- Frontier labs
- Eval consortia
- Enterprise AI
All plans include 14-day refund, single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100%; traffic continues only if overage billing is enabled on your account.
Ship on a proxy network you can actually call your ops team about
Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.