arxiv · training-data · research

Proxies for arXiv bulk download: OAI-PMH, S3, and the API — which needs which

arXiv publishes three access paths with different rate-limit behaviour, and only one of them benefits from proxies. A practical breakdown of when to use OAI-PMH metadata harvests, the S3 PDF mirror, and the arXiv API — and where proxies fit.

· Nathan Brecher · 6 min read

arXiv has three published access paths, and picking the right one saves most teams from needing a proxy at all. arXiv's operational request is clear: don't scrape the main site to download the corpus, and for bulk work use the documented paths. Most teams that hit rate-limit problems on arXiv are on the wrong path.

This post maps each access path to its rate-limit behaviour, and calls out the narrow cases where a proxy layer is actually the right tool.

The three paths

OAI-PMH (metadata harvest)

export.arxiv.org/oai2 — the protocol for harvesting metadata at scale. Returns structured XML with title, abstract, authors, categories, dates, and identifier. No full text.

Rate behaviour: a 5-second delay between requests is enforced by the server. Respectful clients use resumptionToken for pagination and crawl the full metadata set in a few hours. Polite and predictable.
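The resumptionToken loop described above can be sketched roughly as follows. This is a minimal sketch, not arXiv's reference client: `harvest`, `next_token`, and the injected `fetch` callable are hypothetical names, and `oai_dc` is used only because every OAI-PMH repository is required to support it.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(xml_text: str) -> Optional[str]:
    """Return the resumptionToken in an OAI-PMH response, or None on the last page."""
    root = ET.fromstring(xml_text)
    token = root.find(f".//{OAI_NS}resumptionToken")
    # A missing or empty element both mean the list is complete.
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(fetch: Callable[[dict], str], metadata_prefix: str = "oai_dc",
            delay: float = 5.0) -> Iterator[str]:
    """Yield each page of a ListRecords harvest, strictly one request at a time.

    `fetch` is a hypothetical callable that takes query parameters and
    returns the response body as text.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        page = fetch(params)
        yield page
        token = next_token(page)
        if token is None:
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
        time.sleep(delay)  # the 5-second spacing described above
```

In practice `fetch` would be a thin wrapper around an HTTP GET against the endpoint; injecting it keeps the pagination logic testable without network access.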

Proxy role here: none. The 5-second delay is global to the endpoint, not per-IP. A proxy pool doesn't parallelise around it; hammering the endpoint from multiple origins wastes proxy quota and risks getting your ASN blocklisted. Single-threaded OAI-PMH is correct.

S3 bulk (full-text)

arXiv mirrors all accepted papers to a requester-pays AWS S3 bucket (arxiv), as both PDF and LaTeX source. See the bulk data docs for the current bucket structure.

Rate behaviour: no per-IP limit. S3 requester-pays scales with your AWS account's ability to pay egress. Budget approximately $0.09/GB egress if pulling outside us-east-1; data transfer is free if pulling into us-east-1, though per-request charges still apply.
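The egress arithmetic can be sketched as a small helper. The $0.09/GB rate is the approximate figure quoted above, the function name is hypothetical, and per-request charges are ignored.

```python
def egress_cost_usd(size_gb: float, same_region: bool,
                    rate_per_gb: float = 0.09) -> float:
    """Rough data-transfer cost for pulling size_gb from the bucket.

    same_region=True models a collector running in us-east-1, where
    transfer is free; request charges are not included either way.
    """
    return 0.0 if same_region else size_gb * rate_per_gb


# A hypothetical 1,000 GB pull from outside us-east-1 costs roughly $90.
print(f"${egress_cost_usd(1000, same_region=False):.2f}")
```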

Proxy role here: none. Same logic as Common Crawl — see proxies for Common Crawl for the broader S3 access patterns. If you need a proxy in front of S3, your architecture is wrong; move the collector to AWS.

arXiv API

export.arxiv.org/api/query — the REST(-ish) search and fetch endpoint. Returns metadata plus links to PDFs; rate-limited per IP.
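The endpoint responds with an Atom feed, so pulling identifiers and PDF links out of a page might look like the sketch below. It assumes each entry carries a link whose type is application/pdf; `parse_entries` is a hypothetical helper name, and the feed shape should be checked against a live response before relying on it.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entries(feed_xml: str) -> list[dict]:
    """Extract id, title, and PDF link from each entry of an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        pdf = None
        for link in entry.findall(f"{ATOM}link"):
            # Assumption: the PDF link is marked by its MIME type.
            if link.get("type") == "application/pdf":
                pdf = link.get("href")
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Titles may wrap across lines; collapse the whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "pdf": pdf,
        })
    return entries
```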

Documented guidance is 1 request per 3 seconds per IP, with burst tolerance above that. In practice the limit is softer: sustained rates up to ~1 request per second hold, with occasional throttling. Going higher triggers the per-IP block.

Proxy role here: specifically for API-driven selection workloads. If you're running a targeted enumeration — "all papers in cs.LG from 2024 mentioning a specific term" — and need higher throughput than the single-origin rate, a small datacenter proxy pool is the right tool. See the datacenter proxy page.

The realistic AI-team use cases

Most AI teams touching arXiv fall into four patterns. Only one of them, targeted API enumeration, needs proxies against arXiv itself.

Pattern A: building a training corpus slice

You want cs.LG/cs.CL/stat.ML papers from the last 10 years, as plaintext. The correct path is OAI-PMH for metadata + S3 for full-text, direct from us-east-1. No proxy. Expect ~1 TB of PDFs, ~200 GB of extracted plaintext.

Pattern B: RAG source for a research assistant

You want a continuously-updated index of recent papers. OAI-PMH for the daily harvest (cron nightly) + S3 for the corresponding full-text pulls. Still no proxy. Storage pattern is append-only; re-indexing happens at ingest.
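A nightly job only needs the records changed since the previous run, which OAI-PMH expresses with the `from` argument. The sketch below builds that request URL; the function name and one-day lookback are assumptions, and `oai_dc` stands in for whichever metadata format you actually harvest.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

OAI_ENDPOINT = "https://export.arxiv.org/oai2"  # endpoint named in this post


def nightly_harvest_url(run_date: date, lookback_days: int = 1,
                        metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords URL covering records changed since the last run."""
    since = run_date - timedelta(days=lookback_days)
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since.isoformat(),  # OAI-PMH datestamps are YYYY-MM-DD
    }
    return f"{OAI_ENDPOINT}?{urlencode(params)}"


print(nightly_harvest_url(date.today()))
```

Widening `lookback_days` by a day gives overlap between runs, which is harmless if ingest deduplicates on the arXiv identifier.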

Pattern C: targeted scraping via the API

You need papers matching a specific query where OAI-PMH's category axes don't reach — full-text search, author networks, date-narrow slices that need iterative refinement. The API is the right tool; the rate limit is the constraint.

A rotating datacenter pool of 20–50 IPs lets an enumeration job finish in hours instead of weeks. Polite operation: per-IP spacing of 1 request per 3 seconds (so each origin stays within the documented per-IP guidance), rotation to distribute the load, and adaptive backoff on 429s. We run this configuration for a few research customers and haven't had ASN blocks in 18 months of operation.

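import random
import time

import httpx

# Gateway credentials are placeholders; substitute your own.
PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"
MAX_429_RETRIES = 5  # give up rather than hammer a throttled endpoint


def arxiv_search(query: str, max_results: int = 1000):
    """Yield raw Atom XML pages for a query, one page per request."""
    start = 0
    page_size = 100
    throttled = 0
    while start < max_results:
        resp = httpx.get(
            "https://export.arxiv.org/api/query",
            params={
                "search_query": query,
                "start": start,
                "max_results": page_size,
            },
            proxy=PROXY,  # httpx >= 0.26; older releases used proxies=
            headers={
                "X-Squad-Class": "datacenter",
                "User-Agent": "research-harvester/0.1 (team@example.org)",
            },
            timeout=30,
        )
        if resp.status_code == 429:
            throttled += 1
            if throttled > MAX_429_RETRIES:
                raise RuntimeError("repeated 429s; stop and review the workload")
            # Back off for 60-90 seconds, then retry the same page.
            time.sleep(60 + random.random() * 30)
            continue
        resp.raise_for_status()  # surface other HTTP errors instead of yielding them
        throttled = 0
        yield resp.text
        start += page_size
        time.sleep(3)  # polite per-origin spacing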

The 3-second sleep per request is deliberate — we're aggregating across many origins but keeping per-origin behaviour polite. That's the ethical line for a public research resource, not a technical limit.
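The "hours instead of weeks" claim is simple arithmetic: each origin contributes one request every 3 seconds, so aggregate throughput grows linearly with pool size. A rough estimator, with a hypothetical function name and an illustrative request count:

```python
def enumeration_hours(total_requests: int, pool_size: int,
                      seconds_per_request: float = 3.0) -> float:
    """Estimate wall-clock hours for a job at fixed per-origin spacing."""
    aggregate_rps = pool_size / seconds_per_request  # requests/second across the pool
    return total_requests / aggregate_rps / 3600


# Illustrative workload: 1,000,000 page requests.
for ips in (1, 20, 50):
    print(ips, "IPs:", round(enumeration_hours(1_000_000, ips), 1), "hours")
```

For that workload a single origin needs about 35 days, while 50 origins at the same per-IP spacing need about 17 hours. The estimate ignores backoff pauses and response latency, so treat it as a lower bound.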

Pattern D: re-fetching papers that dropped from S3

Occasionally papers are withdrawn and disappear from the S3 mirror. For a training-data re-fetch that needs the withdrawn versions (archived in the Wayback Machine or other preprint mirrors), proxies matter in the same way they do for Pattern 3 in proxies for Common Crawl: the secondary mirrors have their own rate-limit and bot-management rules.

What you lose by being impolite

arXiv's operators have blocked ASNs that hammered the API beyond the documented limits. The block is at the ASN layer, not the IP layer — once your cloud ASN is listed, no amount of proxy rotation within that ASN helps. A proxy pool is a horizontal scale tool within the documented rate budget, not a way to evade the budget entirely.

If you see repeated 429s that don't clear after a backoff, stop. The correct move is to email help@arxiv.org and describe the workload. They are reasonable about research use cases and will often whitelist an ASN for a specific project in exchange for a throttled, documented access pattern.
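One way to encode "back off, then stop" is a bounded retry schedule that raises once the 429s fail to clear. The sketch below is an assumption-laden illustration: the 60-second base, doubling factor, and four-retry cap are arbitrary choices, and `send` is a hypothetical callable returning a (status, body) pair.

```python
import random
import time
from typing import Callable, Iterator, Tuple


def backoff_delays(base: float = 60.0, factor: float = 2.0,
                   max_retries: int = 4, jitter: float = 0.5) -> Iterator[float]:
    """Yield a growing, jittered wait for each consecutive 429."""
    for attempt in range(max_retries):
        wait = base * factor ** attempt
        yield wait + random.uniform(0, jitter * wait)


def fetch_with_backoff(send: Callable[[], Tuple[int, str]],
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_kwargs) -> str:
    """Retry `send` on 429 with bounded backoff, then give up loudly.

    Any status other than 429 is returned to the caller unchanged.
    """
    status, body = send()
    for wait in backoff_delays(**backoff_kwargs):
        if status != 429:
            return body
        sleep(wait)
        status, body = send()
    if status != 429:
        return body
    # The throttle did not clear: stop the job instead of retrying forever.
    raise RuntimeError("429s did not clear after backoff; stop and contact arXiv")
```

Raising instead of looping forces the job to halt, which is the point at which emailing arXiv about the workload becomes the next step.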

Summary

| Path | Use case | Rate limit | Proxy needed |
| --- | --- | --- | --- |
| OAI-PMH | Metadata harvest | 5s global | No |
| S3 bulk | Full-text at scale | None (pay egress) | No |
| API | Targeted search/enum | 1 req / 3s / IP | Yes, for parallelism |
| Secondary mirrors | Withdrawn papers | Per-mirror | Case-by-case |

The single sentence to remember: proxies are a parallelism tool for the API path, nothing else. Everything else arXiv publishes is meant to be pulled directly.

For the routing picture across a full AI data pipeline — arXiv plus Common Crawl plus HuggingFace plus live re-fetch — residential vs datacenter for AI workloads has the complete matrix. Our pricing page sizes plans for the realistic concurrency a mixed pipeline needs.
