The hidden bias in Common Crawl sampling — and how to fix it from your side
Common Crawl is the default corpus backbone for open LLM training. Its sampling is not uniform, and the biases it introduces show up downstream in very specific ways. Here is what to look for and how to correct it in your own pipeline.
· Elena Novak · 4 min read
Common Crawl is the spine of most open LLM training corpora. Per Common Crawl's own statistics, each monthly crawl captures around 2–3 billion web pages (the March 2025 crawl was 2.74B pages / ~455 TiB uncompressed; the November 2025 crawl 2.29B / 378 TiB). The cumulative corpus sits at hundreds of billions of unique pages and continues to grow.
It is the single most common data source mentioned in open-source frontier-ish models (Llama, Mistral, Falcon, BLOOM, most RedPajama-lineage models, most Pile-derived corpora).
It also has sampling biases that every team using it inherits, and those biases propagate into models trained on it. Here are three that AI teams sometimes underestimate.
Bias 1: Host-level distribution is heavy-tailed
The Common Crawl crawl frontier prioritises hosts that are well-linked in the existing web graph. That sounds reasonable and is — for the hosts that exist in the graph. It systematically under-samples newer hosts, geographically-isolated hosts, and hosts that are linked primarily within local-language communities.
If your training corpus is pure Common Crawl, you inherit the graph bias. The practical implication is that a model trained on pure Common Crawl English content is heavy on US- and UK-hosted long-form content and under-weighted on, say, Australian academic content or Singaporean technical blogs.
Correction on your side
Supplement Common Crawl with targeted crawls on under-sampled host groups. Identify them by pulling the Common Crawl URL distribution by country-code TLD and hosting ASN, then comparing that to a reasonable "expected" distribution (e.g., TLD share weighted by population or by published web-presence statistics). Commission directed crawls for the deltas you care about.
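The comparison step is mechanical once you have the two distributions. A minimal sketch (the observed/expected shares below are made-up illustrative numbers, not real Common Crawl figures; in practice the observed shares come from the CC URL index):

```python
from collections import Counter
from urllib.parse import urlparse

def tld_shares(urls):
    """Share of URLs per TLD (last label of the hostname)."""
    tlds = Counter(urlparse(u).hostname.rsplit(".", 1)[-1] for u in urls)
    total = sum(tlds.values())
    return {tld: n / total for tld, n in tlds.items()}

def coverage_deltas(observed, expected):
    """expected - observed share per TLD; positive = under-sampled in CC."""
    return {tld: expected.get(tld, 0.0) - observed.get(tld, 0.0)
            for tld in set(observed) | set(expected)}

# Illustrative numbers only -- not real Common Crawl statistics
observed = {"com": 0.55, "uk": 0.06, "au": 0.01, "sg": 0.002}
expected = {"com": 0.50, "uk": 0.05, "au": 0.03, "sg": 0.01}

deltas = coverage_deltas(observed, expected)
# TLDs where the shortfall exceeds half a percentage point, worst first
under_sampled = sorted((t for t, d in deltas.items() if d > 0.005),
                       key=deltas.get, reverse=True)
```

With these toy numbers, `under_sampled` comes back as `["au", "sg"]` — those are the TLD groups worth a directed crawl. The 0.005 threshold is an arbitrary cut-off; tune it to the delta sizes that matter for your eval targets.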
Bias 2: Language detection is imperfect
Common Crawl publishes content-level language tags produced by a language identifier. For major languages (English, Chinese, Spanish), accuracy is very high. For lower-resource languages — Swahili, Tagalog, Welsh, Uzbek — the identifier confuses close siblings and mis-tags code-switched content. Some of the mis-tagged content ends up in the wrong language bucket during filtering.
The result: lower-resource language buckets in CC-derived corpora often contain a non-trivial fraction of off-language content, and are often undersized relative to their real web presence.
Correction on your side
Run a higher-quality language identifier over the raw text in CC-provided WARC files for the languages you care about. The fasttext-langdetect (lightweight) and newer transformer-based langID models will do better than the default on short and code-switched content. Publish the delta in your data card so downstream evaluators know the shape of your corpus.
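The re-check can be as simple as re-bucketing only where a stronger identifier confidently disagrees with the CC tag. A minimal sketch with a pluggable classifier — `identify` here stands in for whatever model you choose (fasttext-langdetect, a transformer langID), which we deliberately don't wire up:

```python
def rebucket(records, identify, min_conf=0.8):
    """Re-assign language buckets where a stronger identifier
    confidently disagrees with the Common Crawl tag.

    records:  iterable of (text, cc_lang) pairs
    identify: callable text -> (lang, confidence)
    Returns (buckets, moved): lang -> list of texts, plus the count
    of re-bucketed items. Low-confidence cases keep their CC tag
    rather than being guessed.
    """
    buckets = {}
    moved = 0
    for text, cc_lang in records:
        lang, conf = identify(text)
        if conf >= min_conf and lang != cc_lang:
            moved += 1          # a disagreement we trust: re-bucket
        else:
            lang = cc_lang      # keep the CC tag
        buckets.setdefault(lang, []).append(text)
    return buckets, moved
```

`moved / total` per language bucket is exactly the delta worth publishing in the data card: it tells downstream users how much of each bucket the default identifier had mis-filed.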
Bias 3: robots.txt compliance means coverage skews toward permissive sites
Common Crawl respects robots.txt. That is correct ethically and legally, and it is the right default. It also means that Common Crawl systematically under-samples sites that block crawlers — which, unhelpfully for AI research, tends to include:
- Major press publishers who block in response to AI training concerns (NYT, Reuters, many European outlets post-2023)
- A lot of commercial technical documentation
- Some academic journal landing pages
Correction on your side
Two paths:
- Accept the under-sample. Train on what's there. Note the exclusion in your data card.
- Directly license or partner. For teams whose eval depends on recent press coverage (news QA, fact-checking, time-localised reasoning), Common Crawl is not the right primary source for that content. Licensed feeds or partnerships are. This is increasingly how serious labs handle the gap.
We don't recommend the third path — scraping press content that has explicitly opted out. Our AUP prohibits using SquadProxy for this, and the legal exposure is real in several jurisdictions.
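Before choosing between those paths, it helps to size the gap: check how many of the domains your evals depend on disallow CCBot. A minimal sketch using Python's stdlib robots.txt parser, against a hypothetical publisher robots.txt (the domain and rules are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

def ccbot_allowed(robots_txt, url, agent="CCBot"):
    """Parse a robots.txt body and ask whether the given agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Hypothetical publisher robots.txt that singles out Common Crawl's agent
robots = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

ccbot_allowed(robots, "https://example-press.com/story")
```

With this robots.txt, `ccbot_allowed` returns False for CCBot and True for other agents. Fetch robots.txt for each domain on your eval-critical list and run this check; the resulting blocked fraction is the number that tells you whether "accept the under-sample" is viable or licensing is the only honest option.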
Putting it together
Common Crawl is still the right backbone for most training corpora. But treating it as a uniform sample of the web is a measurement mistake that compounds through your model. The corrections above cost something — directed crawls, better langID, licensed press — but they show up in downstream eval as consistent wins on lower-resource tasks.
References
- Common Crawl statistics (commoncrawl.github.io/cc-crawl-statistics)
- Common Crawl blog — March 2025 crawl archive announcement
- "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (Gao et al., arXiv:2101.00027) — still worth reading for the CC-filtering lineage
Related reading on SquadProxy
- Proxies for Common Crawl — the infrastructure side of pulling CC at scale
- Residential vs datacenter for AI workloads — choosing an exit class when re-fetching from CC-indexed URLs
- Tokenization-aware dedup at scrape time — the dedup counterpart to CC sampling correction