As AI models excel in benchmarks, the focus shifts to human evaluation

Artificial intelligence has come a long way, and much of that progress has been driven by automated accuracy tests designed to measure human-like knowledge. Benchmarks such as the General Language Understanding Evaluation (GLUE) and Massive Multitask Language Understanding (MMLU) have long been the gold standard for gauging how well AI models understand a range of topics. But as models grow more capable, these benchmarks are starting to feel outdated, and it seems evaluating AI may need a more human touch.

The Shift Towards Human Evaluation

The idea of involving humans more directly in AI evaluation isn't new, and industry experts have been discussing it for a while. Michael Gerstenhaber of Anthropic, the company behind the Claude family of language models, said at a Bloomberg AI conference that we've "saturated the benchmarks." In other words, AI models are acing these tests, but that doesn't necessarily mean they're ready for real-world applications.

That sentiment is echoed in a recent paper in The New England Journal of Medicine. The authors argue that human involvement is crucial in benchmarking, especially for medical AI. Traditional tests, like MIT's MedQA benchmark of medical licensing-exam questions, are now easily passed by AI models, yet they don't reflect practical clinical needs. The authors suggest methods like role-playing, similar to how doctors are trained, to better assess AI capabilities.

Human Feedback in AI Development

Human oversight has always been a key part of AI development. When OpenAI built ChatGPT in 2022, it used a method called reinforcement learning from human feedback (RLHF), in which humans repeatedly graded AI outputs to improve them. Now, companies like OpenAI and Google are emphasizing human evaluations even more.
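
To make that grading step concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model in RLHF. It is an illustration of the general technique, not OpenAI's actual code, and the reward values are hypothetical.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    # It is small when the reward model scores the human-preferred answer
    # higher, and large when the model disagrees with the human labeler.
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# A labeler preferred answer A over answer B; the reward model currently
# scores them 1.2 and 0.4 (hypothetical numbers).
print(preference_loss(1.2, 0.4))  # ~0.37: model agrees with the human
print(preference_loss(0.4, 1.2))  # ~1.17: model disagrees, larger penalty
```

Training on many such comparisons pushes the reward model to rank outputs the way human graders do, and that reward signal is then used to fine-tune the language model itself.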

For example, when Google recently introduced its open model Gemma 3, it highlighted not just automated scores but also human preference ratings, reporting Elo scores from head-to-head comparisons, the same rating system used to rank chess players. Similarly, when OpenAI launched GPT-4.5, it focused on how human reviewers perceived the model's output rather than on automated benchmarks alone.
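
Those Elo ratings are built up from many head-to-head human votes. As a rough illustration of the mechanics (not Google's evaluation pipeline), the standard chess formula below shows how a single vote shifts two models' ratings; all numbers are hypothetical.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    # Expected score for A follows the logistic curve on the rating gap;
    # each rating then moves by K times (actual result - expected result).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A rater prefers model A's answer over model B's; both start at 1500.
print(elo_update(1500.0, 1500.0, a_won=True))  # A gains 16 points, B loses 16
```

Upsets against higher-rated models move the ratings more, which is what lets a leaderboard converge on a stable ranking after enough votes.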

New Benchmarks with Human Involvement

As new benchmarks are developed to replace the saturated ones, human participation is becoming a central feature. OpenAI's o3 was the first language model to surpass the human baseline on ARC-AGI, a test of abstract reasoning. To make sure the next generation of tasks stays challenging, François Chollet, the benchmark's creator, ran a live study with more than 400 participants to gauge the difficulty of candidate tasks.
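
A study like that can calibrate which tasks belong in a benchmark. The sketch below shows one plausible filtering rule, keeping tasks that most humans solve but the model fails; the data and the 50 percent threshold are invented for illustration, not taken from Chollet's study.

```python
# Hypothetical task records: how often human participants solved each
# task, and whether the model under test solved it. All data invented.
tasks = [
    {"id": "t1", "human_solve_rate": 0.92, "model_solved": False},
    {"id": "t2", "human_solve_rate": 0.15, "model_solved": False},  # too hard for humans too
    {"id": "t3", "human_solve_rate": 0.88, "model_solved": True},   # no longer discriminative
]

# Keep tasks most people can solve (fair for humans) but the model
# cannot (still challenging for AI) -- one simple calibration rule.
calibrated = [t["id"] for t in tasks
              if t["human_solve_rate"] >= 0.5 and not t["model_solved"]]
print(calibrated)  # ['t1']
```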

This blending of automated benchmarks with human input points to a future in which AI model development involves more human interaction. Whether that leads to true artificial general intelligence is uncertain, but it certainly opens up new possibilities for AI training and evaluation.

Incorporating human evaluations into AI development isn't just about making models smarter; it's about making them more aligned with human needs and values. As AI continues to evolve, keeping humans in the loop will likely become even more essential.
