Common Crawl
Common Crawl is a non-profit that maintains an open repository of crawled web content. Each monthly snapshot captures ~2-3 billion pages, totalling around 250 TB, published in WARC format on AWS S3. It's the primary corpus backbone for most open-weights LLMs.
Definition
Common Crawl is a non-profit that operates one of the largest open web crawls in the world. Each monthly snapshot captures approximately 2-3 billion web pages, totalling around 250 TB of compressed content, and is published in WARC format in a requester-pays bucket on AWS S3.
The organisation has been operating monthly crawls since 2008; the cumulative corpus now runs to hundreds of billions of unique pages.
Why Common Crawl matters for AI
Nearly every open-weights frontier-scale model uses Common Crawl as a primary training source: Llama, Mistral, Falcon, BLOOM, and most of their derivatives draw on CC either directly or through CC-filtered datasets (C4, Dolma, RefinedWeb, SlimPajama, FineWeb-Edu).
CC's scale and permissiveness make it the default source for open training corpora. The tradeoff is that CC has sampling biases (documented in our CC sampling bias post), and every team that uses it inherits them.
Accessing Common Crawl
Three published access paths:
- S3 bulk pull: direct from the commoncrawl S3 bucket, requester-pays (you pay AWS egress if pulling outside us-east-1).
- CDX index server: for URL-level lookups without full WARC scan.
- Derivative mirrors: academic and community mirrors that republish subsets with varying post-processing.
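The CDX path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference client: the two hostnames, the "<crawl-id>-index" path shape, and the "filename", "offset" and "length" record fields reflect how the public index is commonly queried and should be checked against the current Common Crawl documentation. The crawl ID in the usage note is only a placeholder.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoints; verify against current Common Crawl docs.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def cdx_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a CDX lookup URL for one crawl, asking for JSON lines."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_cdx_lines(body: str) -> list[dict]:
    """Parse a response body holding one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def warc_range_request(record: dict) -> tuple[str, dict]:
    """Turn one index record into a URL plus Range header for a single WARC record."""
    start = int(record["offset"])            # index values arrive as strings
    end = start + int(record["length"]) - 1  # HTTP byte ranges are inclusive
    url = f"{DATA_HOST}/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}
```

In use, cdx_query_url("CC-MAIN-2024-10", "example.com/*") gives the lookup URL, the response body goes through parse_cdx_lines, and each record's URL and Range header can be passed to any HTTP client to pull one gzip-compressed WARC record instead of scanning the whole file.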
See our proxies-for-Common-Crawl post for when proxy routing fits into a CC-using pipeline.
CC and proxies
Most CC access doesn't need a proxy. The S3 bulk path is intended for cloud-hosted consumers and tolerates sustained bulk access. Proxies fit into CC workflows in three specific cases:
- Parallelising CDX index queries beyond the ~3 rps per-IP soft limit
- Accessing secondary mirrors behind Cloudflare geo-filtering
- Live re-fetching of URLs discovered via the CC index (where the framework from our routing matrix post applies)
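For the first case, each egress IP still has to stay under the per-IP limit, so parallelising usually means one pacer per proxy. A minimal sketch, assuming the ~3 rps figure above is the budget you want to respect; the Pacer class and its parameters are hypothetical names for illustration, not part of any Common Crawl or proxy SDK.

```python
import time


class Pacer:
    """Space calls at least 1/rate seconds apart (use one pacer per egress IP)."""

    def __init__(self, rate_per_sec: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_sec
        self.clock = clock                  # injectable for testing
        self.sleep = sleep                  # injectable for testing
        self.next_slot = float("-inf")      # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next free slot and return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_slot - now)
        if delay:
            self.sleep(delay)
        # Schedule the following slot from whichever is later: now or the slot just used.
        self.next_slot = max(now, self.next_slot) + self.interval
        return delay
```

A worker would create Pacer(3.0) for each proxy and call wait() before every CDX request sent through it, so aggregate throughput scales with the number of proxies while each IP stays near the soft limit.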