MMLU-ProX

MMLU-ProX is a multilingual extension of MMLU-Pro that evaluates large language models across 29 languages using identical question content, translated and reviewed by experts. It was released in 2025.

Definition

MMLU-ProX is a multilingual extension of MMLU-Pro that evaluates large language models across 29 languages using 11,829 identical questions per language, translated via a semi-automatic process with expert review. A "lite" variant offers 658 questions per language for efficient evaluation.
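To put those figures in perspective, the short sketch below multiplies out the per-language counts quoted above. The constant and function names are ours, chosen for illustration; only the three numbers come from the definition.

```python
# Figures quoted in the definition above.
LANGUAGES = 29
FULL_PER_LANGUAGE = 11_829
LITE_PER_LANGUAGE = 658


def total_questions(per_language: int, languages: int = LANGUAGES) -> int:
    """Number of prompts needed to cover every language once."""
    return per_language * languages


full_total = total_questions(FULL_PER_LANGUAGE)
lite_total = total_questions(LITE_PER_LANGUAGE)

# Fraction of the full per-language set that the lite variant keeps.
lite_share = LITE_PER_LANGUAGE / FULL_PER_LANGUAGE

print(f"full: {full_total:,} prompts, lite: {lite_total:,} prompts")
print(f"lite keeps {lite_share:.1%} of each language's questions")
```

A full pass over all 29 languages is therefore several hundred thousand model calls, which is the practical reason the lite variant exists.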

MMLU-ProX was published in 2025 by a research collaboration; the canonical reference is arXiv:2503.10497. The benchmark is hosted on HuggingFace and is freely available for research use.

Why MMLU-ProX matters

Before MMLU-ProX, multilingual LLM evaluation was fragmented across benchmarks that tested different capabilities in different languages, which made cross-linguistic comparison difficult. Because MMLU-ProX uses identical content in every language, per-language scores are directly comparable.

The key finding of the original paper is that LLM performance declines markedly on low-resource languages, with gaps of up to 24.3% between high-resource and low-resource languages. The benchmark can be used to track this gap over time as models improve.
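As a rough sketch of how such a gap might be computed from your own runs, the code below scores each language and then subtracts the mean accuracy of a low-resource group from that of a high-resource group. The function names, language codes, grouping, and accuracy values are all placeholders rather than results from the paper, and treating the gap as a difference in accuracy (percentage points) is our assumption about how the figure is defined.

```python
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple


def accuracy(records: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) answer pairs that match exactly."""
    pairs = list(records)
    if not pairs:
        raise ValueError("no records to score")
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def resource_gap(
    acc_by_lang: Dict[str, float],
    high: Sequence[str],
    low: Sequence[str],
) -> float:
    """Mean accuracy of the high-resource group minus the low-resource group."""
    high_mean = mean(acc_by_lang[lang] for lang in high)
    low_mean = mean(acc_by_lang[lang] for lang in low)
    return high_mean - low_mean


# Placeholder accuracies for illustration only, not measured values.
example_acc = {"en": 0.70, "fr": 0.66, "sw": 0.45, "yo": 0.41}
gap = resource_gap(example_acc, high=["en", "fr"], low=["sw", "yo"])
print(f"gap: {gap * 100:.1f} percentage points")
```

Because the questions are identical in every language, the two group means are computed over the same items, so the difference reflects language rather than question difficulty.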

Running MMLU-ProX with geo-anchored origin

The benchmark specifies identical questions per language but does not specify where requests originate. Running MMLU-ProX with geo-anchored origin, where each language is evaluated from a residential pool in its primary country, measures regional policy variation in addition to language capability. See our multilingual LLM benchmark post for the methodology framing.
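A minimal sketch of how a geo-anchored run might be planned follows. Everything specific here is an assumption for illustration: the language-to-country mapping (picking one "primary country" per language is a judgment call), the proxy host `proxy.example.com`, and the `-country-xx` username convention, which varies by provider. The code only builds a plan and makes no network requests.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical mapping; extend it to every language you evaluate.
PRIMARY_COUNTRY: Dict[str, str] = {
    "en": "US",
    "fr": "FR",
    "ja": "JP",
    "pt": "BR",
}


@dataclass(frozen=True)
class EvalRun:
    language: str
    country: str
    proxy_url: str


def proxy_url_for(
    country: str,
    host: str = "proxy.example.com",  # placeholder host
    port: int = 8000,                 # placeholder port
    user: str = "USER",
    password: str = "PASS",
) -> str:
    """Build a proxy URL that pins the exit country (hypothetical format)."""
    return f"http://{user}-country-{country.lower()}:{password}@{host}:{port}"


def plan_runs(languages: Sequence[str]) -> List[EvalRun]:
    """Pair each language with the proxy for its assumed primary country."""
    runs = []
    for lang in languages:
        if lang not in PRIMARY_COUNTRY:
            raise KeyError(f"no primary country configured for {lang!r}")
        country = PRIMARY_COUNTRY[lang]
        runs.append(EvalRun(lang, country, proxy_url_for(country)))
    return runs


for run in plan_runs(["fr", "ja"]):
    print(run.language, run.country)
```

Failing loudly on an unmapped language is deliberate: silently falling back to a default origin would mix geo-anchored and unanchored results in the same table.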

Pulling MMLU-ProX

The English base dataset is on HuggingFace under TIGER-Lab/MMLU-Pro, and the ProX extension is published under the associated organisation. The standard datasets.load_dataset() call works; for large pulls through a proxy, see our HuggingFace dataset post.
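The sketch below shows one way to wrap that call so a pull can optionally go through a proxy. It assumes the `datasets` package is installed and that the underlying HTTP client honours the `HTTPS_PROXY` and `HTTP_PROXY` environment variables, which is common but worth verifying for your installed versions. The helper names `https_proxy` and `load_split` are ours, and the repository ID, configuration name, and split name should be taken from the dataset page rather than from this example.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional

_PROXY_KEYS = ("HTTPS_PROXY", "HTTP_PROXY")


@contextmanager
def https_proxy(proxy_url: Optional[str]) -> Iterator[None]:
    """Temporarily point the proxy environment variables at proxy_url."""
    saved = {key: os.environ.get(key) for key in _PROXY_KEYS}
    try:
        if proxy_url:
            for key in _PROXY_KEYS:
                os.environ[key] = proxy_url
        yield
    finally:
        # Restore whatever was set before, including "not set at all".
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value


def load_split(
    repo_id: str,
    config: Optional[str] = None,
    split: str = "test",  # assumed split name; check the dataset card
    proxy_url: Optional[str] = None,
):
    """Load one split of a HuggingFace dataset, optionally via a proxy."""
    # Imported lazily so the helper above is usable without `datasets`.
    from datasets import load_dataset

    with https_proxy(proxy_url):
        return load_dataset(repo_id, config, split=split)
```

A call would look like `rows = load_split(repo_id, config, proxy_url=url)`, with `config` typically selecting the language when a dataset exposes one configuration per language.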
