Proxies for Hugging Face dataset downloads: when HF_HUB_DOWNLOAD_TIMEOUT won't save you
Hugging Face rate-limits aggressively per-IP. Raising the timeout doesn't help; the server has already decided. A practical guide to routing HF bulk pulls through a proxy layer without breaking LFS resumption.
· Nathan Brecher · 6 min read
The first time a training-data pipeline tries to pull a 400 GB
dataset from Hugging Face is usually the first time the team
discovers that HF rate-limits per-IP. Raising
HF_HUB_DOWNLOAD_TIMEOUT looks like the fix because the client
error is a timeout, but the timeout is downstream: the HF edge
is slow-rolling or 429'ing your connection long before your
client gives up.
This post is the working configuration we use to pull bulk datasets and model weights at scale through a proxy layer without tripping the rate limiter, breaking resumable LFS downloads, or silently losing LFS pointer integrity.
What actually rate-limits
The HF CDN (fastly-fronted, Cloudflare on the LFS backends) applies limits at several layers:
- Per-IP request rate to huggingface.co — soft cap around a few hundred requests per minute before responses degrade.
- Per-IP bandwidth to the LFS endpoints — soft cap around 200–400 MB/s per origin IP. Enough for a single dataset pull from one worker; not enough for sustained parallel pulls of 10+ datasets.
- Auth'd account rate limit — documented in the HF account dashboard, raised for paying accounts. Worth knowing, but the IP layer bites first for unauth'd or lightly-auth'd use.
Relevant symptoms when you've hit these:
- requests.exceptions.ConnectionError mid-download
- git-lfs smudge hangs with no output
- snapshot_download completes but leaves zero-byte shards
- HTTP 429 inside the CDN response body (not the header)
Raising HF_HUB_DOWNLOAD_TIMEOUT does not help because the
server is responding — slowly, or with a limit — not failing to
respond.
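One way to tell a genuine client timeout apart from server-side throttling is to look at what the server is actually doing. A minimal triage sketch — the function name, thresholds, and verdict strings are ours, not anything HF defines:

```python
def classify_stall(status: int, mb_per_s: float, body_snippet: str = "") -> str:
    """Rough triage for a slow HF pull. Thresholds are illustrative."""
    # The CDN sometimes serves a 429 error page under a 200 status,
    # so check the body snippet as well as the status code.
    if status == 429 or "429" in body_snippet:
        return "rate-limited"   # raising client timeouts won't help
    if status >= 500:
        return "server-error"
    if mb_per_s < 5:
        return "slow-rolled"    # server is responding, just throttled
    return "healthy"
```

If the verdict is "rate-limited" or "slow-rolled", no client-side timeout setting will change the outcome; the origin IP has to change.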
The proxy shape that works
Hugging Face bulk pulls route cleanly through ISP proxies. Residential is overkill (HF doesn't geoblock cloud, and residential adds 100ms+ latency that hurts bandwidth). Datacenter works but hits the per-IP limit faster because HF knows the ASNs. ISP is the sweet spot: datacenter-class uptime and latency, with IPs announced under residential ASNs that HF's rate-limiter treats as independent origins.
For the pool itself, our ISP proxy page details the specifics.
Minimal configuration:
from huggingface_hub import snapshot_download

# Proxy configured per-request, not globally, so unrelated code
# paths (like pushing to the Hub) still go direct.
proxies = {
    "https": "http://USER:PASS@gateway.squadproxy.com:7777",
}

snapshot_download(
    repo_id="allenai/dolma",
    repo_type="dataset",
    local_dir="/data/dolma",
    proxies=proxies,
    etag_timeout=60,  # CDN can be slow to return ETag headers
    max_workers=8,    # keep per-IP concurrency under HF's soft cap
)
Two details that matter:
- max_workers between 4 and 8. Higher and you will hit the per-IP rate limit faster than the proxy can rotate around it, especially on large datasets where each worker pulls for tens of minutes at a time.
- etag_timeout=60 (default is 10). HF's CDN occasionally takes longer than 10s to return HEAD responses for large files; the default timeout produces a stream of "HEAD failed, retrying" log noise and spurious cache invalidations.
Multiple datasets in parallel
The other case where proxies matter: pulling 10+ datasets concurrently for a training mix. Each dataset hits its own LFS shards; aggregate bandwidth across all pulls bumps against the per-origin-IP cap.
import asyncio
import httpx
from huggingface_hub import HfFileSystem
REPOS = [
    "allenai/dolma",
    "cerebras/SlimPajama-627B",
    "togethercomputer/RedPajama-Data-V2",
    # ... more
]
# Each dataset pulls through a fresh ISP session, giving each
# its own apparent origin IP. The session is sticky within the
# pull so LFS resumption stays consistent.
async def pull_dataset(repo_id: str, session_id: str):
    fs = HfFileSystem()
    # Note: httpx >= 0.28 replaces the `proxies` mapping with a
    # single `proxy="http://..."` argument.
    async with httpx.AsyncClient(
        proxies={
            "https": "http://USER:PASS@gateway.squadproxy.com:7777",
        },
        headers={
            "X-Squad-Class": "isp",
            "X-Squad-Session": session_id,  # sticky per-dataset
        },
        timeout=httpx.Timeout(60.0),
    ) as client:
        # ... streaming download via fs.open + client
        pass

async def main():
    tasks = [
        pull_dataset(r, f"hf-{i}")
        for i, r in enumerate(REPOS)
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())
The X-Squad-Session header pins each dataset to a single
exit IP for its lifetime. This preserves LFS resume behaviour —
HF's LFS backend expects the same origin to resume a partial
download — while giving each parallel dataset a separate
origin.
LFS resume: the invisible failure mode
If a dataset pull rotates exit IPs mid-download, LFS resume
silently misbehaves: the server answers the resume request from
the new origin, but the partial file hash no longer matches the
server's expected state, and the client stitches together a
corrupted shard. git-lfs smudge won't warn you; the dataset
loads at training time and your loss curves start doing
strange things 48 hours into a run.
The sticky-session configuration above prevents this. For teams running the naive config (per-request rotation), we recommend either (a) switching to sticky ISP sessions per dataset or (b) disabling rotation entirely for HF pulls and accepting the per-IP bandwidth cap. The first option scales better.
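A corrupted shard can also be caught before it reaches training by hashing the local file and comparing against the sha256 recorded in the LFS pointer, which HfApi.get_paths_info exposes without downloading the blob. A hedged sketch — the .lfs.sha256 field name follows recent huggingface_hub versions:

```python
import hashlib
from huggingface_hub import HfApi

def file_sha256(path: str) -> str:
    # Stream the file in 1 MiB chunks so large shards don't load into RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_lfs_shard(repo_id: str, path_in_repo: str, local_path: str) -> bool:
    # get_paths_info returns LFS metadata (size, sha256) for LFS-tracked
    # files; .lfs is None for regular files.
    info = HfApi().get_paths_info(repo_id, [path_in_repo], repo_type="dataset")[0]
    if info.lfs is None:
        return True  # not an LFS blob, nothing to verify
    return file_sha256(local_path) == info.lfs.sha256
```

Running this over a sample of shards after each bulk pull is cheap insurance against the silent-resume failure above.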
Common mistakes we see
Raising HF_HUB_DOWNLOAD_TIMEOUT to huge values. This
masks the problem. The download doesn't fail, but it degrades
to a sustained low-throughput state because every shard is
being slow-rolled by the rate-limiter. A 400 GB dataset that
should pull in an hour takes three days.
Running HF pulls through residential proxies. Residential IPs rotate through real home networks that have narrower bandwidth than datacenter — pulling 400 GB through a residential IP will hit the subscriber's bandwidth cap and stall. HF isn't the target that needs residential.
Ignoring etag_timeout. The symptom is a pipeline that
"finishes" with missing shards because HEAD requests timed out
silently and the client moved on.
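A cheap post-download sweep catches this class of failure. A sketch — the helper name is ours, and it assumes shards land as regular files under local_dir:

```python
from pathlib import Path

def find_empty_shards(local_dir: str) -> list:
    # Zero-byte files are the fingerprint of a HEAD that timed out
    # before the body was ever requested.
    return sorted(
        p for p in Path(local_dir).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Run it after every snapshot_download and fail the pipeline if it returns anything.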
Mixing huggingface-cli download and snapshot_download
with different proxy configs. The CLI honors HTTPS_PROXY;
snapshot_download honors the proxies kwarg. If only one is
set, half your pulls go through the proxy and half go direct,
producing inconsistent rate-limit behaviour.
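One way to keep both code paths consistent is a single constant that feeds the environment (for huggingface-cli) and the kwarg (for snapshot_download). A sketch, reusing the placeholder gateway URL from above:

```python
import os

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"

def apply_proxy_everywhere() -> dict:
    # The env vars cover huggingface-cli and anything requests-based
    # that reads the environment; the returned dict is what you pass
    # as snapshot_download(..., proxies=...).
    os.environ["HTTPS_PROXY"] = PROXY
    os.environ["https_proxy"] = PROXY  # some tools only read lowercase
    return {"https": PROXY}
```

Call it once at pipeline startup so every pull, CLI or library, exits through the same gateway.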
When to just pay for a bigger HF tier
The proxy approach works for teams pulling a few hundred GB per day. At higher volumes — say a 5 TB-per-day scan of HF datasets — the economics favour a paid HF account with a raised per-account limit over a proxy layer, because the paid tier removes the per-IP ceiling entirely. Proxies are the right answer below that threshold; above it, they add complexity without saving money.
Where this fits
Hugging Face pulls are one slice of a typical AI training-data pipeline. Common Crawl and open-web fetches route differently — see residential vs datacenter for AI workloads for the full routing matrix and proxies for Common Crawl for the archive-specific patterns.
For teams integrating this pattern into a larger RAG ingestion or training pipeline, the RAG data collection use case covers the upstream architecture, and our pricing page sizes plans against the concurrency needed for sustained ISP workloads.