As AI models excel in benchmarks, the focus shifts to human evaluation
Artificial intelligence has come a long way, and much of that progress has been driven by automated accuracy tests designed to approximate human knowledge. Benchmarks such as the General Language Understanding Evaluation (GLUE) and Massive Multitask Language Understanding (MMLU) have long been the gold standard for measuring how well AI models understand a range of topics. But as the models grow more capable, those benchmarks are starting to look outdated, and evaluating AI may require a more human touch.
The Shift Towards Human Evaluation
The idea of involving humans more directly in AI evaluation isn't new; industry experts have been discussing it for a while. Michael Gerstenhaber of Anthropic, the company behind the Claude family of language models, said at a Bloomberg AI conference that the industry has "saturated the benchmarks." In other words, AI models are acing these tests, but that doesn't necessarily mean they're ready for real-world applications.
That sentiment is echoed in a recent paper in The New England Journal of Medicine, whose authors argue that human involvement is crucial for benchmarks, especially in fields like medical AI. Traditional tests such as MedQA, developed at MIT, are now passed easily by AI models, yet they don't reflect the demands of actual clinical practice. The authors suggest evaluation methods like role-playing, similar to how doctors are trained, to better assess what AI can actually do.
Human Feedback in AI Development
Human oversight has always been a key part of AI development. When OpenAI built ChatGPT in 2022, it used a method called reinforcement learning from human feedback (RLHF), in which humans repeatedly graded the model's outputs to improve them. Now, companies like OpenAI and Google are emphasizing human evaluations even more.
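For readers curious what that grading step looks like in code, here is a minimal, illustrative sketch, not OpenAI's actual pipeline: a toy reward model is trained on pairs of responses where a human preferred one over the other, and its scores become the signal a later reinforcement-learning step optimizes the chatbot against. All names, sizes, and data below are made up for illustration.

```python
# Illustrative sketch of the human-preference step behind RLHF.
# A small "reward model" learns to score responses that humans preferred
# higher than the ones they rejected. Not OpenAI's code; names and sizes
# are arbitrary.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a (pre-computed) response embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

# Stand-ins for embeddings of responses a human compared side by side:
# `chosen` was preferred, `rejected` was not.
chosen = torch.randn(64, 32)
rejected = torch.randn(64, 32)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Pairwise (Bradley-Terry style) loss: push the preferred response's
    # reward above the rejected one's.
    loss = -nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores new outputs, and a reinforcement-
# learning algorithm (commonly PPO) tunes the chatbot to maximize that score.
```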
For example, when Google recently introduced its open-weight Gemma 3 model, it highlighted not just automated scores but human preference ratings, ranking the model against rivals using Elo scores, the head-to-head rating system long used to rank chess players and competitive athletes. Similarly, when OpenAI launched GPT-4.5, it emphasized how human reviewers perceived the model's output rather than relying on automated benchmarks alone.
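Elo ratings come from head-to-head matchups: each time a human rater prefers one model's answer over another's, the preferred model gains rating points and the other loses them, with the size of the swing depending on how surprising the result was. Below is a rough sketch of the standard Elo update rule; the K-factor and starting ratings are arbitrary examples, not values Google uses.

```python
# Rough sketch of the standard Elo update used for head-to-head rankings.
# The K-factor and ratings are arbitrary examples.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one comparison; points lost by one side go to the other."""
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Example: a rater prefers model A's answer even though A was rated lower.
a, b = update_elo(1500, 1520, a_won=True)
print(round(a), round(b))  # A gains points; B loses the same amount
```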
New Benchmarks with Human Involvement
As new benchmarks are developed to replace the old ones, human participation is becoming a central feature. OpenAI's o3 model was the first language model to surpass human scores on a test of abstract reasoning. To keep such tests challenging, François Chollet of Google's AI unit, who created the benchmark, conducted a live study involving more than 400 people to calibrate the difficulty of new tasks.
This blending of automated benchmarks with human judgment points to a future in which AI model development involves more human interaction. Whether that leads to true artificial general intelligence is uncertain, but it certainly opens up new possibilities for how models are trained and evaluated.
Incorporating human evaluations into AI development isn't just about making models smarter; it's about making them more aligned with human needs and values. As AI continues to evolve, keeping humans in the loop will likely become even more essential.