Hamza Rahim
Research engineer at SquadProxy focused on LLM evaluation, regional bias measurement, and specifically the comparative benchmarks that test whether model APIs behave consistently across origin regions.
He spent four years as an eval engineer at a frontier-adjacent lab, working on safety and reliability benchmarks, before joining SquadProxy in mid-2025.
Hamza runs the evaluation benchmarks that SquadProxy occasionally publishes. The 10-country ChatGPT benchmark post is his; a follow-up covering Claude and Gemini is in progress, targeting Q3 2026.
Background
Hamza worked on safety and reliability evaluation at a frontier-adjacent lab from 2021 to 2025, with a particular focus on multi-turn and geography-varying evaluation methodologies. He joined SquadProxy to do the cross-provider benchmark work the lab couldn't do internally (because it would have been viewed as competitive intelligence rather than safety).
Writing on SquadProxy
- How much does geography actually change ChatGPT's answer? A 10-country test
- Why your eval benchmark is lying to you: regional variation as methodology
What he's working on
The expanded benchmark — same 10-country setup, run against GPT-4o, Claude Opus 4.7, Gemini 2.0 Pro, and 2-3 open-source SOTA models. Publication target Q3 2026 with full prompt set and response-level transcripts (subject to review for sensitive outputs).
Contact
Hamza handles questions about evaluation methodology, benchmark design, and the kind of "here's what we measured" queries that show up on social media. Email hello@squadproxy.com with "eval" in the subject line.
Writing by Hamza Rahim
25 Mar 2026
Why your eval benchmark is lying to you: regional variation as methodology
Most public LLM eval benchmarks run from a single origin and report a single score per model. Running the same benchmark from 10 regions surfaces variance that single-origin testing hides — and the variance is larger than the reported confidence intervals.
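The core claim — that cross-region spread can exceed a single-origin confidence interval — can be sketched with a toy comparison. All scores, region codes, and the CI half-width below are invented for illustration; none come from the benchmark itself:

```python
import statistics

# Hypothetical per-region accuracy for one model on one prompt set,
# run once from each of 10 origin regions.
region_scores = {
    "us": 0.84, "de": 0.83, "jp": 0.81, "br": 0.78, "in": 0.80,
    "gb": 0.84, "au": 0.83, "ng": 0.76, "sg": 0.82, "fr": 0.83,
}

# A single-origin benchmark reports one score plus a CI derived from
# resampling prompts -- say the US run with a +/- 0.015 half-width.
single_origin_score = region_scores["us"]
reported_ci_halfwidth = 0.015  # hypothetical

# Cross-region spread: how far do other origins land from that score?
spread = max(region_scores.values()) - min(region_scores.values())
stdev = statistics.pstdev(region_scores.values())

print(f"reported CI half-width: {reported_ci_halfwidth:.3f}")
print(f"cross-region range:     {spread:.3f}")
print(f"cross-region stdev:     {stdev:.3f}")
# Here the regional range (0.08) dwarfs the prompt-resampling CI
# (0.015): the single-origin interval understates real uncertainty.
```

The prompt-resampling CI only captures sampling noise over prompts; it says nothing about variance introduced by the request's origin, which is exactly the axis single-origin testing holds fixed.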
11 Dec 2025
How much does geography actually change ChatGPT's answer? A 10-country test
We ran 800 prompts against GPT-4o from 10 country origins to measure how much the answer to the same question changes when the request's IP geography changes. The delta is smaller than we expected, larger than zero, and concentrated in a specific class of prompt.
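The shape of that measurement — same prompt, several origins, score the divergence between answers — can be sketched as follows. Token Jaccard distance stands in here for whatever similarity metric the real benchmark uses, and the responses are invented:

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 minus token-set overlap; 0.0 = identical, 1.0 = disjoint."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

# responses[prompt_id][region] -> answer text (hypothetical data)
responses = {
    "p1": {"us": "Paris is the capital of France.",
           "jp": "Paris is the capital of France.",
           "ng": "Paris is the capital of France."},
    "p2": {"us": "It depends on local regulations in your country.",
           "jp": "Consult a licensed professional in Japan first.",
           "ng": "Rules vary; check your national guidelines."},
}

def max_cross_region_delta(answers: dict[str, str]) -> float:
    """Largest pairwise distance between any two regions' answers."""
    regions = list(answers)
    return max(
        (jaccard_distance(answers[r1], answers[r2])
         for i, r1 in enumerate(regions) for r2 in regions[i + 1:]),
        default=0.0,
    )

deltas = {pid: max_cross_region_delta(ans) for pid, ans in responses.items()}
for pid, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{pid}: max cross-region delta = {d:.2f}")
```

Aggregating per-prompt deltas like these is what lets a benchmark say the variation is "concentrated in a specific class of prompt": factual prompts like `p1` cluster near zero while advice-style prompts like `p2` carry most of the spread.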