Proxies for HuggingFace Inference

HuggingFace (HF) hosts inference for thousands of open-source models. Routing eval workloads through the HF inference surface with sensible rate distribution and regional anchoring keeps evals consistent and within HF's rate limits.

Updated 23 April 2026

Why proxy HuggingFace Inference

Four workload shapes regularly want proxy routing against the HF inference surface:

  1. Bulk evaluation across many open-source models. Testing how 50 different open-source models (Llama-derivatives, Mistral-derivatives, Qwen, DeepSeek, etc.) respond to the same prompt set. The HF Serverless Inference API makes this tractable but rate-limits per-account and per-IP; a proxy pool distributes the load.

  2. Open-source model regional behaviour. Requests to open-source models deployed on HF Inference Endpoints still travel the network path to the deployment region. Running the same eval from US, EU, and APAC origins measures how the deployed fleet behaves from each.

  3. HF Inference Endpoints (dedicated) warm-up / eval. For teams running dedicated HF Endpoints, the proxy layer provides a consistent test origin for evaluation before deployment.

  4. Serverless cold-start measurement. Measuring HF's serverless cold-start latency is itself a useful benchmark for understanding the open-source inference surface.

Important context: HF is rate-sensitive

HuggingFace's Inference API applies rate limits per IP on the free tier and per account on paid tiers. The pattern we cover in depth at Proxies for Hugging Face datasets applies equally to Inference: ISP exits with sticky sessions give inference calls the same stability they give LFS downloads.
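
Because the limits apply per IP or per account, a client that hits one is better off backing off than retrying immediately. The following is a minimal sketch of exponential backoff with jitter; it is not part of any SquadProxy or HuggingFace SDK. The name `call_with_backoff`, its parameters, and the choice of 429 and 503 as retryable status codes are assumptions for illustration. `send` is any callable you supply that performs one request and returns a `(status_code, body)` pair.

```python
import random
import time
from typing import Callable, Tuple

# Assumption: 429 signals rate limiting and 503 a model that is still loading.
RETRYABLE = {429, 503}


def call_with_backoff(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        # Exponential backoff with full jitter, capped at max_delay seconds.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        status, body = send()
    return status, body
```

The last status and body are returned even when every attempt was retryable, so the caller can decide whether to record the failure or rotate to a different session.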

Recommended configuration

import os

import httpx

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"
# Assumption: the HF access token is supplied via the HF_TOKEN environment variable.
HF_TOKEN = os.environ["HF_TOKEN"]

# Bulk eval across open models
models = ["meta-llama/Llama-3.1-70B-Instruct", "mistralai/Mistral-7B-v0.3", ...]

def eval_hf(prompt: str, model_id: str):
    # HF's inference endpoints
    url = f"https://api-inference.huggingface.co/models/{model_id}"
    return httpx.post(
        url,
        json={"inputs": prompt},
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "X-Squad-Class": "isp",
            "X-Squad-Session": "sticky-10m",
        },
        proxy=PROXY,  # httpx 0.26+; older releases took proxies= instead
        timeout=120,
    ).json()
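
For the bulk-eval case, the per-request function can be fanned out over every model and prompt pair with a bounded worker pool. This is a sketch under stated assumptions rather than a recommended harness: `run_bulk_eval` and its parameters are hypothetical names, `eval_fn` stands in for any function with the same `(prompt, model_id)` signature as `eval_hf`, and the default of 8 workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def run_bulk_eval(
    eval_fn: Callable[[str, str], object],
    prompts: List[str],
    model_ids: List[str],
    max_workers: int = 8,
) -> Dict[Tuple[str, str], object]:
    """Evaluate every (model_id, prompt) pair with at most max_workers in flight."""

    def one(pair: Tuple[str, str]):
        model_id, prompt = pair
        try:
            return pair, eval_fn(prompt, model_id)
        except Exception as exc:
            # Store the exception as the result so one failure does not stop the run.
            return pair, exc

    pairs = [(model_id, prompt) for model_id in model_ids for prompt in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, pairs))
```

Results are keyed by `(model_id, prompt)`, so failed pairs can be filtered with `isinstance(value, Exception)` and retried separately. Keep `max_workers` below your plan's concurrent-session limit.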

HF-specific eval notes

  • Cold-start effects — serverless endpoints cold-start on first request after idle. Eval that includes cold-start timing should document the warm-up state explicitly.
  • Open model eval corpus — HF hosts many derivative models; eval across 10-20 derivatives of a base model tells you about the derivation, not the base.
  • Dedicated Endpoints — for production workloads, the dedicated endpoint gives stable deployment; the proxy layer keeps eval-side traffic consistent.
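
One way to document the warm-up state is to time the first request separately from the ones that follow. The sketch below assumes the endpoint has been idle before the first call, which the code cannot verify; `measure_cold_vs_warm` and its parameters are hypothetical names, and `call` is any zero-argument function you supply that performs one inference request.

```python
import statistics
import time
from typing import Callable, Dict


def measure_cold_vs_warm(
    call: Callable[[], object],
    warm_requests: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time the first request separately from the warm requests that follow."""
    if warm_requests < 1:
        raise ValueError("warm_requests must be at least 1")

    def timed() -> float:
        start = clock()
        call()
        return clock() - start

    first = timed()  # cold only if the endpoint was idle beforehand
    warm = [timed() for _ in range(warm_requests)]
    warm_median = statistics.median(warm)
    return {
        "first_s": first,
        "warm_median_s": warm_median,
        "cold_start_overhead_s": first - warm_median,
    }
```

Reporting the first-request latency and the warm median side by side makes it explicit which figure an eval result refers to.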

Plans that fit

See pricing. HF inference evaluation is typically concurrency-heavy but bandwidth-light. The Team plan's 1,000 concurrent sessions are usually the right shape.
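
A rough way to check that shape against a plan is Little's law: sessions in flight are approximately the request rate multiplied by the mean request latency. The sketch below is illustrative arithmetic only. The function names are hypothetical, and the inputs should come from your own measurements rather than from the example figures.

```python
import math


def sessions_needed(requests_per_second: float, mean_latency_s: float) -> int:
    """Little's law: concurrent sessions ~= arrival rate * time in system."""
    return math.ceil(requests_per_second * mean_latency_s)


def monthly_transfer_gb(
    requests_per_day: float, bytes_per_request: float, days: int = 30
) -> float:
    """Total transfer in decimal gigabytes over the billing period."""
    return requests_per_day * bytes_per_request * days / 1e9
```

For example, 200 requests per second at a 4-second mean latency needs about 800 sessions in flight, which fits within 1,000 concurrent sessions.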

Pricing

Pricing — plans sized for HuggingFace Inference workloads

Every plan includes access to all 5 exit classes across our 10 focus countries — quotas vary by plan. The size you need scales with your eval cadence and concurrency.

Solo

For individual researchers running evaluation scripts and prototype RAG pipelines.

$149/month

or $1,430/year (save 20%)

50 GB residential · unlimited datacenter · 200 concurrent sessions

  • Access to all 5 exit classes · 10 focus countries
  • 50 GB residential · unlimited datacenter
  • 5 static ISP IPs · 5 GB 4G mobile
  • 1 seat · 200 concurrent sessions
  • Python + Node SDK + REST API
  • Per-request metering (not time-based)
  • Email support (24h response, business days)
  • Overage: $3/GB residential · $6/GB mobile

Best for

  • Solo researchers
  • Evaluation scripts
  • Prototype RAG

Team

Most popular

For AI startups and mid-size labs splitting capacity between training and evaluation.

$699/month

or $6,710/year (save 20%)

500 GB residential · unlimited datacenter · 1,000 concurrent sessions

  • Access to all 5 exit classes · 10 focus countries
  • 500 GB residential · unlimited datacenter
  • 25 static ISP IPs · 25 GB 4G mobile
  • 10 seats ($29/mo per extra seat) · 1,000 concurrent sessions
  • City-level geo-routing + ASN targeting
  • 99.9% uptime SLA
  • Priority Slack support (4h response, business hours)
  • Python + Node SDK + REST API + webhooks
  • Overage: $3/GB residential · $6/GB mobile

Best for

  • AI startups
  • Mid-size labs
  • Model eval teams

Lab

For academic labs, eval consortia, and frontier model companies running sustained workloads.

$2,999/month

or $28,790/year (save 20%)

2 TB residential · unlimited DC · 50 GB 4G + 20 GB 5G · 3,000 concurrent sessions

  • Access to all 5 exit classes · 10 countries on 4 continents
  • 2 TB residential · unlimited datacenter
  • 100 static ISP IPs · 50 GB 4G + 20 GB 5G mobile
  • 50 seats ($19/mo per extra seat) · 3,000 concurrent sessions
  • Dedicated gateway lane (bypasses shared-pool queues on us-east-1 + eu-west-1)
  • 99.95% uptime SLA
  • Dedicated Slack channel (1h response, business hours)
  • Custom BGP prefix on request (additional fees apply)
  • Overage: $2.50/GB residential · $5/GB mobile

Best for

  • Academic labs
  • Large eval consortia
  • Frontier model companies

Enterprise

Custom contracts with dedicated infrastructure, volume pricing, and research-grade SLAs.

Custom pricing

Custom (from 5 TB/mo residential) · unlimited concurrent sessions

  • Volume pricing from 5 TB/mo residential
  • Dedicated BGP prefix + ASN announcement
  • Unlimited concurrent sessions · unlimited seats
  • 99.99% uptime SLA with financial credits
  • Named Technical Account Manager + 24/7 on-call paging
  • Custom AUP, DPA, on-site deployment option
  • Research / academic discount (30–50% off Team or Lab)
  • Annual contract · wire, ACH, USDC/USDT/BTC settlement

Best for

  • Frontier labs
  • Eval consortia
  • Enterprise AI

All plans include a 14-day refund, a single endpoint with regional failover, HTTP(S) + SOCKS5 on every exit class, access to all 5 exit classes and all 10 focus countries, and Python + Node SDKs. Concurrent sessions = simultaneous TCP sessions through the gateway. Overage warnings fire at 80% and 100% of quota; traffic continues only if overage billing is enabled on your account.

Start routing HuggingFace Inference traffic through SquadProxy

Real ASNs, real edge capacity, and an engineer who answers your Slack the first time.