Imogen Reyes
Research engineer at SquadProxy focused on training-corpus engineering — deduplication, tokenisation-aware sampling, corpus-level bias measurement.
Five years in data engineering for large-scale ML training, with a focus on the deduplication and quality-filtering stages that determine how much of a corpus actually matters at training time.
Imogen works on the content side of the proxy/training-data interface: the parts of the pipeline that happen after the proxy returns bytes but before the tokeniser turns them into training examples. Most of the posts on SquadProxy about deduplication and corpus engineering reflect her framing.
Background
Imogen worked on the data-engineering side of pretraining for two models that shipped as open-weights between 2022 and 2024. The scar tissue is specifically around Common Crawl-derived corpora and the kind of sampling biases that only surface once a trained model evaluates badly on a specific benchmark and the team has to trace it back to the corpus.
Writing on SquadProxy
- Tokenization-aware dedup at scrape time, not after
- The hidden bias in Common Crawl sampling — and how to fix it
What she's working on
A longer-form writeup comparing the corpus-quality impact of per-source exit-class routing versus single-class collection. The specific question is whether the canonical-stability benefits described in the RAG infrastructure post produce measurable gains in training corpus quality. Preliminary results suggest yes; the full writeup is planned for Q3 2026.
Contact
Questions about dedup strategies, Common Crawl processing, or tokeniser-aware sampling are good fits. Email hello@squadproxy.com with "corpus" in the subject line.