Proxies for arXiv bulk download: OAI-PMH, S3, and the API — which needs which
arXiv publishes three access paths with different rate-limit behaviour, and only one of them benefits from proxies. A practical breakdown of when to use OAI-PMH metadata harvests, the S3 PDF mirror, and the arXiv API — and where proxies fit.
· Nathan Brecher · 6 min read
arXiv has three published access paths, and picking the right one saves most teams from needing a proxy at all. arXiv's own operational request is clear: don't try to scrape the complete corpus, and if you do bulk work, use the documented paths. Most teams that hit rate-limit problems on arXiv are on the wrong path.
This post maps each access path to its rate-limit behaviour, and calls out the narrow cases where a proxy layer is actually the right tool.
The three paths
OAI-PMH (metadata harvest)
export.arxiv.org/oai2 — the protocol for harvesting metadata
at scale. Returns structured XML with title, abstract, authors,
categories, dates, and identifier. No full text.
Rate behaviour: a 5-second delay between requests is enforced
by the server. Respectful clients use resumptionToken for
pagination and crawl the full metadata set in a few hours. Polite
and predictable.
Proxy role here: none. The 5-second delay is global to the endpoint, not per-IP. A proxy pool doesn't parallelise around it; attempting to hammer with multiple origins wastes quota and gets your ASN block-listed. Single-threaded OAI-PMH is correct.
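A minimal sketch of the single-threaded harvest described above, using only the standard library. The endpoint is the one named in this post; `oai_dc` is the metadata format every OAI-PMH repository must support; the function names and the timeout are illustrative assumptions, not arXiv's reference client.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_URL = "https://export.arxiv.org/oai2"  # endpoint as named in this post
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DELAY_SECONDS = 5  # spacing between requests, per the text above


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one ListRecords page."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken element marks the last page.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token


def harvest(metadata_prefix="oai_dc"):
    """Yield record identifiers one page at a time, strictly sequentially."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = OAI_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=60) as resp:
            ids, token = parse_page(resp.read().decode("utf-8"))
        yield from ids
        if token is None:
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
        time.sleep(DELAY_SECONDS)
```

Keeping the parser separate from the network loop makes it easy to test against a saved response before pointing the harvester at the live endpoint.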
S3 bulk (full-text)
arXiv mirrors all accepted papers to an AWS S3 bucket
(arxiv), requester-pays, available as both PDF and LaTeX
source. See the bulk data docs
for the current bucket structure.
Rate behaviour: no per-IP limit. S3 requester-pays scales with
your AWS account's ability to pay egress. Budget approximately
$0.09/GB egress if pulling outside us-east-1; free if pulling
into us-east-1.
Proxy role here: none. Same logic as Common Crawl — see proxies for Common Crawl for the broader S3 access patterns. If you need a proxy in front of S3, your architecture is wrong; move the collector to AWS.
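A rough sketch of the budgeting arithmetic and a requester-pays fetch. The cost function uses only the figures quoted above and ignores per-request charges, which is a simplification. `fetch_object` assumes `boto3` is installed and AWS credentials are configured; the object key is left as a parameter because the bucket layout should be taken from the bulk data docs rather than guessed.

```python
def egress_cost_usd(size_gb, collector_region, rate_per_gb=0.09):
    """Estimate transfer cost from the figures quoted above (request fees ignored)."""
    if collector_region == "us-east-1":
        return 0.0  # pulling into us-east-1 carries no egress charge
    return size_gb * rate_per_gb


def fetch_object(key, dest_path, bucket="arxiv"):
    """Download one object from the requester-pays bucket to dest_path."""
    import boto3  # third-party; imported lazily so the estimate works without it

    s3 = boto3.client("s3", region_name="us-east-1")
    # RequestPayer acknowledges that the caller's AWS account is billed.
    s3.download_file(bucket, key, dest_path, ExtraArgs={"RequestPayer": "requester"})
```

For example, `egress_cost_usd(1000, "eu-west-1")` comes to about $90, which is the number to weigh against simply running the collector in us-east-1.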
arXiv API
export.arxiv.org/api/query — the REST(-ish) search and fetch
endpoint. Returns metadata plus links to PDFs; rate-limited per
IP.
Documented guidance is 1 request per 3 seconds per IP, with burst tolerance above that. In practice the limit is softer: sustained rates up to ~1 request per second hold, with occasional throttling. Going higher triggers the per-IP block.
Proxy role here: specifically for API-driven selection workloads. If you're running a targeted enumeration — "all papers in cs.LG from 2024 mentioning a specific term" — and need higher throughput than the single-origin rate, a small datacenter proxy pool is the right tool. See the datacenter proxy page.
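To judge whether a job really needs more than the single-origin rate, a back-of-envelope estimate helps. This sketch assumes every origin holds the documented 3-second spacing and that requests divide evenly across origins; the function name and the example request count are illustrative.

```python
import math


def wall_clock_hours(n_requests, origins=1, seconds_per_request=3.0):
    """Hours to issue n_requests when each origin sends one request per interval."""
    per_origin = math.ceil(n_requests / origins)
    return per_origin * seconds_per_request / 3600
```

A job of 500,000 requests takes roughly 417 hours (about 17 days) from one origin and a little over 8 hours across 50. If the single-origin figure is already acceptable, skip the pool.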
The realistic AI-team use cases
Most AI teams touching arXiv fall into four patterns. Only one of them needs proxies against arXiv itself; a second involves third-party mirrors, where the answer is case-by-case.
Pattern A: building a training corpus slice
You want cs.LG/cs.CL/stat.ML papers from the last 10 years, as
plaintext. The correct path is OAI-PMH for metadata + S3 for
full-text, direct from us-east-1. No proxy. Expect ~1 TB of
PDFs, ~200 GB of extracted plaintext.
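The slice selection for this pattern can be sketched as a filter over harvested metadata that decides which identifiers to pull from S3. It assumes each record exposes its categories as a space-separated string (as arXiv's own metadata format does) and a creation date; the function name and cutoff handling are illustrative.

```python
from datetime import date

WANTED = {"cs.LG", "cs.CL", "stat.ML"}  # categories named in this pattern


def in_slice(categories, created, cutoff):
    """True if the record is on or after the cutoff and lists any wanted category."""
    return created >= cutoff and bool(WANTED & set(categories.split()))
```

Matching on any listed category rather than only the primary one is a design choice: it catches cross-listed papers at the cost of a somewhat larger slice.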
Pattern B: RAG source for a research assistant
You want a continuously-updated index of recent papers. OAI-PMH for the daily harvest (cron nightly) + S3 for the corresponding full-text pulls. Still no proxy. Storage pattern is append-only; re-indexing happens at ingest.
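One way to sketch the nightly job's request is to harvest from the date of the last successful run rather than from "yesterday", so a missed night is picked up automatically. `from` is a standard OAI-PMH argument with day granularity; the function name and the choice of `oai_dc` are assumptions for illustration.

```python
from datetime import date


def incremental_params(last_success, metadata_prefix="oai_dc"):
    """Build ListRecords arguments covering everything since the last good run."""
    return {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # Inclusive, so the last day is re-read; dedupe on identifier at ingest.
        "from": last_success.isoformat(),
    }
```

The deliberate one-day overlap fits the append-only storage pattern as long as ingest deduplicates on the record identifier.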
Pattern C: targeted scraping via the API
You need papers matching a specific query where OAI-PMH's category axes don't reach — full-text search, author networks, date-narrow slices that need iterative refinement. The API is the right tool; the rate limit is the constraint.
A rotating datacenter pool of 20–50 IPs lets an enumeration job finish in hours instead of weeks. Polite operation means per-IP spacing of 1 request per 3 seconds (so each origin stays at the documented per-IP rate), rotation to distribute the load, and a jittered backoff on 429s. We run this configuration for a few research customers and haven't had ASN blocks in 18 months of operation.
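import random
import time

import httpx

# Gateway URL with placeholder credentials.
PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"


def arxiv_search(query: str, max_results: int = 1000):
    """Yield raw Atom XML pages for `query`, pausing 3 seconds between requests."""
    start = 0
    page_size = 100
    while start < max_results:
        resp = httpx.get(
            "https://export.arxiv.org/api/query",
            params={
                "search_query": query,
                "start": start,
                # Never ask for more than the caller's remaining budget.
                "max_results": min(page_size, max_results - start),
            },
            # httpx 0.26+ takes `proxy=`; older releases used `proxies=`.
            proxy=PROXY,
            headers={
                "X-Squad-Class": "datacenter",
                "User-Agent": "research-harvester/0.1 (team@example.org)",
            },
            timeout=30,
        )
        if resp.status_code == 429:
            # Throttled: wait 60-90 s (jittered), then retry the same page.
            time.sleep(60 + random.random() * 30)
            continue
        resp.raise_for_status()  # surface other errors instead of yielding them
        yield resp.text
        start += page_size
        time.sleep(3)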
The 3-second sleep per request is deliberate — we're aggregating across many origins but keeping per-origin behaviour polite. That's the ethical line for a public research resource, not a technical limit.
Pattern D: re-fetching papers that dropped from S3
Occasionally papers are withdrawn and disappear from the S3 mirror. For a training-data re-fetch that needs the withdrawn versions (archived in the Wayback Machine or other preprint mirrors), proxies matter in the same way they do for Pattern 3 in proxies for Common Crawl: the secondary mirrors have their own rate-limit and bot-management rules.
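For the Wayback Machine case, a lookup can be sketched against the Internet Archive's availability endpoint, which returns the closest archived snapshot for a URL. The `arxiv.org/abs/<id>` target and the helper names are assumptions for illustration; whether a usable snapshot exists for a given withdrawn paper is not guaranteed.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_URL = "https://archive.org/wayback/available"


def closest_snapshot(payload):
    """Return the snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(arxiv_id, timestamp=None):
    """Ask the availability endpoint for a snapshot of the paper's abstract page."""
    params = {"url": f"arxiv.org/abs/{arxiv_id}"}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD; nearest snapshot is returned
    url = AVAILABILITY_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

The same spacing and backoff discipline applies here as on arXiv: each mirror sets its own limits, so check them before looping over a long ID list.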
What you lose by being impolite
arXiv's operators have blocked ASNs that hammered the API beyond the documented limits. The block is at the ASN layer, not the IP layer — once your cloud ASN is listed, no amount of proxy rotation within that ASN helps. A proxy pool is a horizontal scale tool within the documented rate budget, not a way to evade the budget entirely.
If you see repeated 429s that don't clear after a backoff, stop.
The correct move is to email help@arxiv.org and describe the
workload. They are reasonable about research use cases and will
often whitelist an ASN for a specific project in exchange for a
throttled, documented access pattern.
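The "back off, then stop" rule can be sketched as a delay schedule with a hard ceiling on consecutive 429s. The base delay, cap, jitter fraction, and threshold of five are illustrative assumptions, not values arXiv publishes.

```python
import random


class TooManyThrottles(RuntimeError):
    """Raised when 429s persist; the job should halt and a human should follow up."""


def backoff_delay(consecutive_429s, base=60.0, cap=900.0, max_consecutive=5,
                  rng=random.random):
    """Seconds to wait after the Nth consecutive 429 (N starts at 1)."""
    if consecutive_429s >= max_consecutive:
        raise TooManyThrottles("429s did not clear after backoff; stop the job")
    # Double the wait on each consecutive throttle, up to the cap.
    delay = min(cap, base * 2 ** (consecutive_429s - 1))
    # Add up to 50% jitter so parallel workers do not retry in lockstep.
    return delay + rng() * delay * 0.5
```

The caller resets the counter to zero after any successful response and lets `TooManyThrottles` end the run instead of catching and retrying it.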
Summary
| Path | Use case | Rate limit | Proxy needed |
|---|---|---|---|
| OAI-PMH | Metadata harvest | 5s global | No |
| S3 bulk | Full-text at scale | None (pay egress) | No |
| API | Targeted search/enum | 1 req / 3s / IP | Yes, for parallelism |
| Secondary mirrors | Withdrawn papers | Per-mirror | Case-by-case |
The single sentence to remember: proxies are a parallelism tool for the API path, nothing else. Everything else arXiv publishes is meant to be pulled directly.
For the routing picture across a full AI data pipeline — arXiv plus Common Crawl plus HuggingFace plus live re-fetch — residential vs datacenter for AI workloads has the complete matrix. Our pricing page sizes plans for the realistic concurrency a mixed pipeline needs.