Large Language Models Benchmarks

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

Medical Xpress

Multilingual benchmark evaluates how well AI interprets clinical text and health records in nine languages

Researchers at Mass General Brigham recently developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient care text, including language ...

ascopubs.org

RadOncRAG: A Novel Retrieval-Augmented Generation Framework Improves Large Language Model Benchmark Performance in Radiation Oncology

Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...

13d on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT-5.5 model on key benchmarks

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

Artificial intelligence models show massive gaps on traditional human intelligence tests

How to Build Custom LLM Benchmarks for Your AI Applications