The hidden bias in Common Crawl sampling — and how to fix it from your side
Common Crawl is the default corpus backbone for open LLM training. Its sampling is not uniform, and the biases it introduces show up downstream in very specific ways. Here is what to look for and how to correct it in your own pipeline.
· Elena Novak · 4 min read
Common Crawl is the spine of most open LLM training corpora. Per Common Crawl's own statistics, each monthly crawl captures around 2–3 billion web pages (the March 2025 crawl was 2.74B pages / ~455 TiB uncompressed; the November 2025 crawl 2.29B / 378 TiB). The cumulative corpus sits at hundreds of billions of unique pages and continues to grow.
It is the single most common data source mentioned in open-source frontier-ish models (Llama, Mistral, Falcon, BLOOM, most RedPajama-lineage models, most Pile-derived corpora).
It also has sampling biases that every team using it inherits, and those biases propagate into models trained on it. Here are three that AI teams sometimes underestimate.
Bias 1: Host-level distribution is heavy-tailed
The Common Crawl crawl frontier prioritises hosts that are well-linked in the existing web graph. That sounds reasonable and is — for the hosts that exist in the graph. It systematically under-samples newer hosts, geographically-isolated hosts, and hosts that are linked primarily within local-language communities.
If your training corpus is pure Common Crawl, you inherit the graph bias. The practical implication is that a model trained on pure Common Crawl English content is heavy on US- and UK-hosted long-form content and under-weighted on, say, Australian academic content or Singaporean technical blogs.
Correction on your side
Supplement Common Crawl with targeted crawls on under-sampled host groups. Identify them by pulling the Common Crawl URL distribution by country-code TLD and hosting ASN, then comparing that to a reasonable "expected" distribution (e.g., TLD share weighted by population or by published web-presence statistics). Commission directed crawls for the deltas you care about.
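The comparison step is mechanical once you have the two distributions. A minimal sketch (the observed/expected shares below are made-up illustrative numbers, not real Common Crawl figures; in practice the observed shares come from the CC URL index):

```python
from collections import Counter
from urllib.parse import urlparse

def tld_shares(urls):
    """Share of URLs per TLD (last label of the hostname)."""
    tlds = Counter(urlparse(u).hostname.rsplit(".", 1)[-1] for u in urls)
    total = sum(tlds.values())
    return {tld: n / total for tld, n in tlds.items()}

def coverage_deltas(observed, expected):
    """expected - observed share per TLD; positive = under-sampled in CC."""
    return {tld: expected.get(tld, 0.0) - observed.get(tld, 0.0)
            for tld in set(observed) | set(expected)}

# Illustrative numbers only -- not real Common Crawl statistics
observed = {"com": 0.55, "uk": 0.06, "au": 0.01, "sg": 0.002}
expected = {"com": 0.50, "uk": 0.05, "au": 0.03, "sg": 0.01}

deltas = coverage_deltas(observed, expected)
# TLDs where the shortfall exceeds half a percentage point, worst first
under_sampled = sorted((t for t, d in deltas.items() if d > 0.005),
                       key=deltas.get, reverse=True)
```

With these toy numbers, `under_sampled` comes back as `["au", "sg"]` — those are the TLD groups worth a directed crawl. The 0.005 threshold is an arbitrary cut-off; tune it to the delta sizes that matter for your eval targets.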
Bias 2: Language detection is imperfect
Common Crawl publishes content-level language tags produced by a language identifier. For major languages (English, Chinese, Spanish), accuracy is very high. For lower-resource languages — Swahili, Tagalog, Welsh, Uzbek — the identifier confuses close siblings and mis-tags code-switched content. Some of the mis-tagged content ends up in the wrong language bucket during filtering.
The result: lower-resource language buckets in CC-derived corpora often contain a non-trivial fraction of off-language content, and are often undersized relative to their real web presence.
Correction on your side
Run a higher-quality language identifier over the raw text in CC-provided WARC files for the languages you care about. The fasttext-langdetect (lightweight) and newer transformer-based langID models will do better than the default on short and code-switched content. Publish the delta in your data card so downstream evaluators know the shape of your corpus.
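The re-check can be as simple as re-bucketing only where a stronger identifier confidently disagrees with the CC tag. A minimal sketch with a pluggable classifier — `identify` here stands in for whatever model you choose (fasttext-langdetect, a transformer langID), which we deliberately don't wire up:

```python
def rebucket(records, identify, min_conf=0.8):
    """Re-assign language buckets where a stronger identifier
    confidently disagrees with the Common Crawl tag.

    records:  iterable of (text, cc_lang) pairs
    identify: callable text -> (lang, confidence)
    Returns (buckets, moved): lang -> list of texts, plus the count
    of re-bucketed items. Low-confidence cases keep their CC tag
    rather than being guessed.
    """
    buckets = {}
    moved = 0
    for text, cc_lang in records:
        lang, conf = identify(text)
        if conf >= min_conf and lang != cc_lang:
            moved += 1          # a disagreement we trust: re-bucket
        else:
            lang = cc_lang      # keep the CC tag
        buckets.setdefault(lang, []).append(text)
    return buckets, moved
```

`moved / total` per language bucket is exactly the delta worth publishing in the data card: it tells downstream users how much of each bucket the default identifier had mis-filed.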
Bias 3: robots.txt compliance means coverage skews toward permissive sites
Common Crawl respects robots.txt. That is correct ethically and legally, and it is the right default. It also means that Common Crawl systematically under-samples sites that block crawlers — which, unhelpfully for AI research, tends to include:
- Major press publishers who block in response to AI training concerns (NYT, Reuters, many European outlets post-2023)
- A lot of commercial technical documentation
- Some academic journal landing pages
Correction on your side
Two paths:
- Accept the under-sample. Train on what's there. Note the exclusion in your data card.
- Directly license or partner. For teams whose eval depends on recent press coverage (news QA, fact-checking, time-localised reasoning), Common Crawl is not the right primary source for that content. Licensed feeds or partnerships are. This is increasingly how serious labs handle the gap.
We don't recommend the third path — scraping press content that has explicitly opted out. Our AUP prohibits using SquadProxy for this, and the legal exposure is real in several jurisdictions.
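Before choosing between those paths, it helps to size the gap: check how many of the domains your evals depend on disallow CCBot. A minimal sketch using Python's stdlib robots.txt parser, against a hypothetical publisher robots.txt (the domain and rules are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

def ccbot_allowed(robots_txt, url, agent="CCBot"):
    """Parse a robots.txt body and ask whether the given agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Hypothetical publisher robots.txt that singles out Common Crawl's agent
robots = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

ccbot_allowed(robots, "https://example-press.com/story")
```

With this robots.txt, `ccbot_allowed` returns False for CCBot and True for other agents. Fetch robots.txt for each domain on your eval-critical list and run this check; the resulting blocked fraction is the number that tells you whether "accept the under-sample" is viable or licensing is the only honest option.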
Putting it together
Common Crawl is still the right backbone for most training corpora. But treating it as a uniform sample of the web is a measurement mistake that compounds through your model. The corrections above cost something — directed crawls, better langID, licensed press — but they show up in downstream eval as consistent wins on lower-resource tasks.
References
- Common Crawl statistics (commoncrawl.github.io/cc-crawl-statistics)
- Common Crawl blog — March 2025 crawl archive announcement
- "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (Gao et al., arXiv:2101.00027) — still worth reading for the CC-filtering lineage
Related reading on SquadProxy
- Proxies for Common Crawl — the infrastructure side of pulling CC at scale
- Residential vs datacenter for AI workloads — choosing an exit class when re-fetching from CC-indexed URLs
- Tokenization-aware dedup at scrape time — the dedup counterpart to CC sampling correction