How to Game AI Benchmarks
Ever marveled at the latest LLM benchmark scores?
But what if I told you those impressive numbers might be smoke and mirrors?
Gaming these tests is easier than you might think (and seems to be happening more and more).
Here are some tricks companies use to inflate their AI’s performance:
Paraphrasing the test set: Some companies have managed to beat GPT-4 with much smaller models by training on test questions rewritten in different formats or languages. This kind of paraphrase-based contamination can add over 10 points on benchmarks like MMLU, GSM8K, and HumanEval (a minimal sketch of how it works follows this list).
Overfitting to test distributions: Clever firms generate new questions that look superficially different but share the same solution templates as the test set. This slips past contamination detectors while still giving their AI an unfair advantage.
Prompt engineering to fool detectors: Since the prompts used to generate training data are kept private, companies can tune them specifically to sidestep public contamination-detection tools.
Turbocharging at test time: Some companies use tricks like having the AI double-check its work (self-reflection), asking it the same question multiple times and picking the most common answer (majority voting), or making it explore several candidate solutions before deciding (Tree of Thoughts). These methods are like giving the AI multiple attempts at each question during the test. They can boost scores, but they don't make the model any smarter; they just spend more compute per question to land on the right answer. A short majority-voting sketch also follows below.
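To make the paraphrasing trick concrete, here is a minimal sketch of how test items could be laundered into training data. Everything here is illustrative: `call_llm` is a hypothetical placeholder for whatever model API does the rewriting, and the JSONL field names are made up.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call; plug in any provider."""
    raise NotImplementedError

REWRITE_PROMPT = (
    "Rewrite the following question so it tests exactly the same knowledge "
    "but shares as little wording as possible with the original. Feel free "
    "to change the format, variable names, or even the language.\n\n{question}"
)

def paraphrase_test_set(benchmark_path: str, out_path: str) -> None:
    """Turn benchmark items into 'fresh' training examples via paraphrase.

    Simple n-gram overlap checks against the original test set won't flag
    these rewrites, yet a model trained on them has effectively seen the
    answers in advance.
    """
    with open(benchmark_path) as src, open(out_path, "w") as dst:
        for line in src:
            item = json.loads(line)  # expects {"question": ..., "answer": ...}
            rewritten = call_llm(REWRITE_PROMPT.format(question=item["question"]))
            dst.write(json.dumps({"prompt": rewritten, "answer": item["answer"]}) + "\n")
```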
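And here is roughly what the test-time turbocharging looks like, using majority voting (also called self-consistency) as the example. The `ask_model` callable is a stand-in for a real model call, and the sample count of 16 is arbitrary.

```python
from collections import Counter
from typing import Callable

def majority_vote(ask_model: Callable[[str], str], question: str, n_samples: int = 16) -> str:
    """Ask the model the same question n_samples times and return the most
    frequent answer (self-consistency / majority voting)."""
    answers = [ask_model(question).strip() for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```

Nothing about the model changed; the score moves because each question now gets 16 tries.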
So, how can you separate the AI wheat from the chaff?
Look for evaluations that are harder to game, such as:
Elo-style ratings from platforms like LMSYS Chatbot Arena, where real users interact with the models and vote on their answers (a simplified rating update is sketched after this list).
Private evaluations from trusted third parties using secret, well-curated test sets.
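For context on how arena leaderboards turn those votes into a ranking, here is a simplified pairwise Elo update. The real Chatbot Arena pipeline is more involved, so treat this as the intuition rather than their implementation; the K-factor of 32 is just a conventional default.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One pairwise Elo update.

    score_a is 1.0 if model A won the user's vote, 0.0 if it lost,
    and 0.5 for a tie. k (the K-factor) controls how fast ratings move.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000; A wins one head-to-head vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```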
However, even these methods aren’t foolproof.
Leading LLM companies run constant A/B tests on public arenas, and conflicts of interest can arise with evaluation providers who also sell datasets (!).
The takeaway?
Take AI benchmark scores with a hefty grain of salt. Real-world performance and user feedback are often more reliable indicators of an AI’s true capabilities.
How do you think we can create more trustworthy AI evaluations?
Cheers, Craig