
Proxies for Hugging Face dataset downloads: when HF_HUB_DOWNLOAD_TIMEOUT won't save you

Hugging Face rate-limits aggressively per-IP. Raising the timeout doesn't help; the server has already decided. A practical guide to routing HF bulk pulls through a proxy layer without breaking LFS resumption.

· Nathan Brecher · 6 min read

The first time a training-data pipeline tries to pull a 400 GB dataset from Hugging Face is usually the first time the team discovers that HF rate-limits per-IP. Raising HF_HUB_DOWNLOAD_TIMEOUT looks like the fix because the client error is a timeout, but the timeout is downstream: the HF edge is slow-rolling or 429'ing your connection long before your client gives up.

This post is the working configuration we use to pull bulk datasets and model weights at scale through a proxy layer without tripping the rate limiter, breaking resumable LFS downloads, or silently losing LFS pointer integrity.

What actually rate-limits

The HF CDN (Fastly-fronted, Cloudflare on the LFS backends) applies limits at several layers:

  • Per-IP request rate to huggingface.co — soft cap around a few hundred requests per minute before responses degrade.
  • Per-IP bandwidth to the LFS endpoints — soft cap around 200–400 MB/s per origin IP. Enough for a single dataset pull from one worker; not enough for sustained parallel pulls of 10+ datasets.
  • Auth'd account rate limit — documented in the HF account dashboard, raised for paying accounts. Worth knowing, but the IP layer bites first for unauth'd or lightly-auth'd use.

Relevant symptoms when you've hit these:

  • requests.exceptions.ConnectionError mid-download
  • git-lfs smudge hangs with no output
  • snapshot_download completes but leaves zero-byte shards
  • an HTTP 429 message in the response body rather than the status line

Raising HF_HUB_DOWNLOAD_TIMEOUT does not help because the server is responding — slowly, or with a limit — not failing to respond.
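The zero-byte-shard symptom is the easiest one to check for programmatically before a dataset reaches training. A minimal post-pull sanity check (the helper name is ours):

```python
from pathlib import Path

def find_empty_shards(local_dir: str) -> list[str]:
    """Return paths of zero-byte files left behind by a rate-limited pull."""
    return [
        str(p)
        for p in Path(local_dir).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    ]
```

Run it after every pull and fail the pipeline if the list is non-empty; an empty shard caught here costs seconds, not a training run.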

The proxy shape that works

Hugging Face bulk pulls route cleanly through ISP proxies. Residential is overkill (HF doesn't geoblock cloud, and residential adds 100ms+ latency that hurts bandwidth). Datacenter works but hits the per-IP limit faster because HF knows the ASNs. ISP is the sweet spot: datacenter-class uptime and latency, with IPs announced under residential ASNs that HF's rate-limiter treats as independent origins.

For the pool itself, our ISP proxy page details the specifics.

Minimal configuration:

from huggingface_hub import snapshot_download

# Proxy configured per-request, not globally, so unrelated code
# paths (like pushing to the Hub) still go direct.
proxies = {
    "https": "http://USER:PASS@gateway.squadproxy.com:7777",
}

snapshot_download(
    repo_id="allenai/dolma",
    repo_type="dataset",
    local_dir="/data/dolma",
    proxies=proxies,
    etag_timeout=60,     # CDN can be slow to return ETag headers
    max_workers=8,       # keep per-IP concurrency under HF's soft cap
)

Two details that matter:

  1. max_workers between 4 and 8. Higher and you will hit the per-IP rate limit faster than the proxy can rotate around it, especially on large datasets where each worker pulls for tens of minutes at a time.
  2. etag_timeout=60 (default is 10). HF's CDN occasionally takes longer than 10s to return HEAD responses for large files; the default timeout produces a stream of "HEAD failed, retrying" log noise and spurious cache invalidations.
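Because snapshot_download skips files that have already completed, re-invoking it after a rate-limit drop is safe, so a blunt retry loop around the call is the simplest recovery strategy. A sketch (the helper and its defaults are ours; requests.exceptions.ConnectionError, the error from the symptoms list, is an OSError subclass, which is why OSError is the default catch):

```python
import time

def with_backoff(fn, retry_on=(OSError,), attempts=5, base=5.0):
    """Retry fn() on transient network errors with capped exponential
    backoff. Intended to wrap a snapshot_download call, which resumes
    past completed files on each retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # back off so the rate-limiter has room to reset
            time.sleep(min(60.0, base * 2 ** attempt))
```

Usage would be `with_backoff(lambda: snapshot_download(...))`; the cap at 60 s keeps a long dataset pull from stalling for minutes between attempts.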

Multiple datasets in parallel

The other case where proxies matter: pulling 10+ datasets concurrently for a training mix. Each dataset hits its own LFS shards; aggregate bandwidth across all pulls bumps against the per-origin-IP cap.

import asyncio
import httpx
from huggingface_hub import HfFileSystem

REPOS = [
    "allenai/dolma",
    "cerebras/SlimPajama-627B",
    "togethercomputer/RedPajama-Data-V2",
    # ... more
]

# Each dataset pulls through a fresh ISP session, giving each
# its own apparent origin IP. The session is sticky within the
# pull so LFS resumption stays consistent.
async def pull_dataset(repo_id: str, session_id: str):
    fs = HfFileSystem()
    async with httpx.AsyncClient(
        # httpx takes a single proxy URL; requests-style
        # {"https": ...} dicts are not valid here
        proxy="http://USER:PASS@gateway.squadproxy.com:7777",
        headers={
            "X-Squad-Class": "isp",
            "X-Squad-Session": session_id,   # sticky per-dataset
        },
        timeout=httpx.Timeout(60.0),
    ) as client:
        # ... streaming download via fs.open + client
        pass

async def main():
    tasks = [
        pull_dataset(r, f"hf-{i}")
        for i, r in enumerate(REPOS)
    ]
    await asyncio.gather(*tasks)

The X-Squad-Session header pins each dataset to a single exit IP for its lifetime. This preserves LFS resume behaviour — HF's LFS backend expects the same origin to resume a partial download — while giving each parallel dataset a separate origin.
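One wrinkle with the enumerate-based session IDs in the snippet above: they change meaning if REPOS is reordered between runs, which would re-pin a half-finished dataset to a different exit IP. Deriving the session ID from the repo name instead keeps the pin stable across restarts (a sketch; the naming scheme is ours, and it assumes the gateway maps a given session value to the same exit):

```python
import hashlib

def session_id_for(repo_id: str) -> str:
    """Stable per-dataset session ID, so a restarted pull re-pins
    to the same exit IP regardless of list order."""
    digest = hashlib.sha256(repo_id.encode()).hexdigest()[:12]
    return f"hf-{digest}"
```

Pass `session_id_for(repo_id)` where the snippet uses `f"hf-{i}"` and resumption survives a reshuffled or extended REPOS list.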

LFS resume: the invisible failure mode

If a dataset pull rotates exit IPs mid-download, LFS resume silently misbehaves: the server answers the resume request from the new origin, but the partial file hash no longer matches the server's expected state, and the client stitches together a corrupted shard. git-lfs smudge won't warn you; the dataset loads at training time and your loss curves start doing strange things 48 hours into a run.

The sticky-session configuration above prevents this. For teams running the naive config (per-request rotation), we recommend either (a) switching to sticky ISP sessions per dataset or (b) disabling rotation entirely for HF pulls and accepting the per-IP bandwidth cap. The first option scales better.
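Corruption from a broken resume is also cheap to detect after the fact, because every LFS pointer file records the blob's SHA-256 (`oid sha256:<hash>`) and size. A verification sketch (the helper name is ours):

```python
import hashlib

def verify_lfs_file(pointer_text: str, blob_path: str) -> bool:
    """Check a downloaded blob against the sha256 in its LFS pointer."""
    fields = dict(
        line.split(" ", 1) for line in pointer_text.strip().splitlines()
    )
    expected = fields["oid"].split(":", 1)[1]
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```

A hash pass over a freshly pulled dataset is far cheaper than discovering the corruption from a loss curve two days in.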

Common mistakes we see

Raising HF_HUB_DOWNLOAD_TIMEOUT to huge values. This masks the problem. The download doesn't fail, but it degrades to a sustained low-throughput state because every shard is being slow-rolled by the rate-limiter. A 400 GB dataset that should pull in an hour takes three days.

Running HF pulls through residential proxies. Residential IPs rotate through real home networks that have narrower bandwidth than datacenter — pulling 400 GB through a residential IP will hit the subscriber's bandwidth cap and stall. HF isn't the target that needs residential.

Ignoring etag_timeout. The symptom is a pipeline that "finishes" with missing shards because HEAD requests timed out silently and the client moved on.

Mixing huggingface-cli download and snapshot_download with different proxy configs. The CLI honors HTTPS_PROXY; snapshot_download honors the proxies kwarg. If only one is set, half your pulls go through the proxy and half go direct, producing inconsistent rate-limit behaviour.
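One way to avoid the split-brain config is a single source of truth for the proxy URL that feeds both paths (a sketch; PROXY_URL is a placeholder for your gateway credentials):

```python
import os

PROXY_URL = "http://USER:PASS@gateway.squadproxy.com:7777"

def hf_proxy_config() -> dict:
    """Set the env var huggingface-cli reads and return the dict
    snapshot_download's proxies kwarg expects, from one constant."""
    os.environ["HTTPS_PROXY"] = PROXY_URL
    return {"https": PROXY_URL}
```

Call it once at pipeline start; CLI invocations and library calls then exit through the same gateway and see consistent rate-limit behaviour.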

When to just pay for a bigger HF tier

The proxy approach works for teams pulling a few hundred GB per day. At higher volumes — a 5 TB-per-day scan of HF datasets — the economics favour a paid HF account with a raised per-account limit over a proxy layer. The paid tier removes the per-IP ceiling entirely. Proxies are the right answer below that threshold; above it, they add complexity without saving money.

Where this fits

Hugging Face pulls are one slice of a typical AI training-data pipeline. Common Crawl and open-web fetches route differently — see residential vs datacenter for AI workloads for the full routing matrix and proxies for Common Crawl for the archive-specific patterns.

For teams integrating this pattern into a larger RAG ingestion or training pipeline, the RAG data collection use case covers the upstream architecture, and our pricing page sizes plans against the concurrency needed for sustained ISP workloads.

Ship on a proxy network you can actually call your ops team about

Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.