MMLU-ProX
MMLU-ProX is a multilingual extension of MMLU-Pro that evaluates large language models across 29 languages using identical question content, translated and reviewed by experts. Released in 2025.
Definition
MMLU-ProX extends MMLU-Pro to 29 languages, evaluating large language models on 11,829 identical questions per language, translated via a semi-automatic process with expert review. A "lite" variant offers 658 questions per language for efficient evaluation.
Published in 2025 by a research collaboration (the canonical reference is arXiv:2503.10497). The benchmark is hosted on HuggingFace and is freely available for research use.
Why MMLU-ProX matters
Before MMLU-ProX, multilingual LLM evaluation was fragmented across benchmarks that tested different capabilities in different languages, making cross-linguistic comparison difficult. Because MMLU-ProX holds question content identical across languages, per-language capability is directly comparable.
A key finding from the original paper: LLM performance declines markedly on low-resource languages, with gaps of up to 24.3% between high-resource and low-resource performance. The benchmark is used to track this gap over time as models improve.
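Tracking that gap reduces to simple arithmetic over per-language accuracy. A minimal sketch, assuming you already have per-language scores; the language tiers and accuracy values below are illustrative placeholders, not the paper's numbers:

```python
# Illustrative per-language accuracies (placeholders, not reported results).
accuracy = {
    "en": 0.78, "de": 0.75, "zh": 0.74,  # higher-resource languages
    "sw": 0.55, "wo": 0.54,              # lower-resource languages
}
high_resource = ["en", "de", "zh"]
low_resource = ["sw", "wo"]

def mean(keys):
    return sum(accuracy[k] for k in keys) / len(keys)

# Gap = mean high-resource accuracy minus mean low-resource accuracy.
gap = mean(high_resource) - mean(low_resource)
print(f"high/low-resource gap: {gap:.1%}")  # 21.2% with these placeholders
```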
Running MMLU-ProX with geo-anchored origin
The benchmark specifies identical questions per language but does not specify request origin. Running MMLU-ProX with geo-anchored origin (each language evaluated from its primary country's residential pool) measures regional policy variation in addition to language capability. See our multilingual LLM benchmark post for the methodology framing.
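As a sketch of what geo-anchoring means in practice: route each language's requests through a residential exit in that language's primary country. Everything below is a hypothetical placeholder (the language-to-country mapping, the proxy gateway, the credential scheme, and the model endpoint); substitute your provider's and your model's actual interfaces.

```python
import requests

# Hypothetical language -> primary-country mapping (illustrative subset).
LANGUAGE_COUNTRY = {
    "de": "de",  # German   -> Germany
    "ja": "jp",  # Japanese -> Japan
    "sw": "ke",  # Swahili  -> Kenya
    "hi": "in",  # Hindi    -> India
}

def proxies_for(lang: str) -> dict:
    """Per-language proxy config. The gateway host, port, and the
    country-in-username convention are placeholders; substitute your
    provider's actual scheme."""
    country = LANGUAGE_COUNTRY[lang]
    proxy = f"http://user-country-{country}:PASSWORD@proxy.example.com:8000"
    return {"http": proxy, "https": proxy}

def ask_model(question: str, lang: str) -> str:
    """Send one benchmark question to a (hypothetical) model endpoint,
    exiting from the language's primary country."""
    resp = requests.post(
        "https://api.example-llm.invalid/v1/complete",  # placeholder endpoint
        json={"prompt": question},
        proxies=proxies_for(lang),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]
```

Because the question set is identical per language, any score difference under this setup decomposes into language capability plus whatever regional policy the request origin triggers.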
Pulling MMLU-ProX
The English base dataset is on HuggingFace under TIGER-Lab/MMLU-Pro, and the ProX extension lives under the associated organisation. Standard datasets.load_dataset() works; for large pulls through a proxy, see our HuggingFace dataset post.
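A minimal pull with the datasets library. The TIGER-Lab/MMLU-Pro repo ID is the published one; the MMLU-ProX repo ID and per-language config name below are assumptions, so check the dataset card linked from the paper for the exact identifiers:

```python
from datasets import load_dataset

# English base: MMLU-Pro.
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Multilingual extension: repo ID and language config are assumptions
# here; verify them on the MMLU-ProX dataset card before running.
mmlu_prox_de = load_dataset("li-lab/MMLU-ProX", "de", split="test")

print(len(mmlu_pro), "English questions")
print(len(mmlu_prox_de), "German questions")

# For pulls routed through a proxy, set HTTPS_PROXY in the environment
# before launching Python; huggingface_hub honours standard proxy vars.
```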