Hamza Rahim
Research engineer at SquadProxy focused on LLM evaluation, regional bias measurement, and specifically the comparative benchmarks that test whether model APIs behave consistently across origin regions.
He spent four years as an eval engineer at a frontier-adjacent lab, working on safety and reliability benchmarks, before joining SquadProxy in mid-2025.
Hamza runs the evaluation benchmarks that SquadProxy occasionally publishes. The 10-country ChatGPT benchmark post is his; a follow-up covering Claude and Gemini is in progress, targeting Q3 2026.
Background
Hamza worked on safety and reliability evaluation at a frontier-adjacent lab from 2021 to 2025, with a particular focus on multi-turn and geography-varying evaluation methodologies. He joined SquadProxy to do the cross-provider benchmark work the lab couldn't do internally (because it would have been viewed as competitive intelligence rather than safety).
Writing on SquadProxy
- How much does geography actually change ChatGPT's answer? A 10-country test
- Why your eval benchmark is lying to you: regional variation as methodology
What he's working on
The expanded benchmark — same 10-country setup, run against GPT-4o, Claude Opus 4.7, Gemini 2.0 Pro, and 2-3 open-source SOTA models. Publication target Q3 2026 with full prompt set and response-level transcripts (subject to review for sensitive outputs).
Contact
Hamza handles questions about evaluation methodology, benchmark design, and the kind of "here's what we measured" queries that show up on social media. Email hello@squadproxy.com with "eval" in the subject line.
Writing by Hamza Rahim
25 Mar 2026
Why your eval benchmark is lying to you: regional variation as methodology
Most public LLM eval benchmarks run from a single origin and report a single score per model. Running the same benchmark from 10 regions surfaces variance that single-origin testing hides — and the variance is larger than the reported confidence intervals.
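The core claim — that cross-region spread can exceed a single-origin confidence interval — can be sketched with a toy comparison. All scores, region codes, and the CI half-width below are invented for illustration; none come from the benchmark itself:

```python
import statistics

# Hypothetical per-region accuracy for one model on one prompt set,
# run once from each of 10 origin regions.
region_scores = {
    "us": 0.84, "de": 0.83, "jp": 0.81, "br": 0.78, "in": 0.80,
    "gb": 0.84, "au": 0.83, "ng": 0.76, "sg": 0.82, "fr": 0.83,
}

# A single-origin benchmark reports one score plus a CI derived from
# resampling prompts -- say the US run with a +/- 0.015 half-width.
single_origin_score = region_scores["us"]
reported_ci_halfwidth = 0.015  # hypothetical

# Cross-region spread: how far do other origins land from that score?
spread = max(region_scores.values()) - min(region_scores.values())
stdev = statistics.pstdev(region_scores.values())

print(f"reported CI half-width: {reported_ci_halfwidth:.3f}")
print(f"cross-region range:     {spread:.3f}")
print(f"cross-region stdev:     {stdev:.3f}")
# Here the regional range (0.08) dwarfs the prompt-resampling CI
# (0.015): the single-origin interval understates real uncertainty.
```

The prompt-resampling CI only captures sampling noise over prompts; it says nothing about variance introduced by the request's origin, which is exactly the axis single-origin testing holds fixed.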
11 Dec 2025
How much does geography actually change ChatGPT's answer? A 10-country test
We ran 800 prompts against GPT-4o from 10 country origins to measure how much the answer to the same question changes when the request's IP geography changes. The delta is smaller than we expected, larger than zero, and concentrated in a specific class of prompt.
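The shape of that measurement — same prompt, several origins, score the divergence between answers — can be sketched as follows. Token Jaccard distance stands in here for whatever similarity metric the real benchmark uses, and the responses are invented:

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 minus token-set overlap; 0.0 = identical, 1.0 = disjoint."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

# responses[prompt_id][region] -> answer text (hypothetical data)
responses = {
    "p1": {"us": "Paris is the capital of France.",
           "jp": "Paris is the capital of France.",
           "ng": "Paris is the capital of France."},
    "p2": {"us": "It depends on local regulations in your country.",
           "jp": "Consult a licensed professional in Japan first.",
           "ng": "Rules vary; check your national guidelines."},
}

def max_cross_region_delta(answers: dict[str, str]) -> float:
    """Largest pairwise distance between any two regions' answers."""
    regions = list(answers)
    return max(
        (jaccard_distance(answers[r1], answers[r2])
         for i, r1 in enumerate(regions) for r2 in regions[i + 1:]),
        default=0.0,
    )

deltas = {pid: max_cross_region_delta(ans) for pid, ans in responses.items()}
for pid, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{pid}: max cross-region delta = {d:.2f}")
```

Aggregating per-prompt deltas like these is what lets a benchmark say the variation is "concentrated in a specific class of prompt": factual prompts like `p1` cluster near zero while advice-style prompts like `p2` carry most of the spread.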