Large Language Models Benchmarks

Towards domain-adapted large language models for water and wastewater management: methods, datasets and benchmarking

Large language models (LLMs) have shown significant promise for water and wastewater management. However, current foundation models are not yet reliable. This Perspective outlines a pathway for ...

7d on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT 5.5 model on key benchmarks

Chinese startup Z.ai has launched GLM-5.2, a powerful AI model for complex coding projects. This new large language model ...

7d on MSN

Multilingual benchmark evaluates how well AI interprets clinical text and health records in nine languages

Researchers at Mass General Brigham recently developed BRIDGE, a multilingual benchmark that evaluates how well large ...

Nature

Benchmarking large language model-based agent systems for clinical decision tasks

Clinical decision-making entails complex, data-intensive, and often uncertain judgments, resulting in excessive workload and exceeding the cognitive limits of many clinicians. For more than two ...

Japanese AI startup Sakana launches Fugu, claims it beats banned Anthropic's Claude Fable 5 in coding benchmarks

Japanese AI startup Sakana has launched Fugu, a new AI model family that the company says outperforms Anthropic's Claude ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock ...

1mon

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

Frontier AI models corrupt 25% of document content in multi-step workflows — rewriting rather than deleting, which makes the ...

SiliconANGLE

Elon Musk’s xAI sets AI benchmark records with new reasoning-optimized Grok 4 model

Elon Musk’s xAI Holdings Corp. has debuted a new large language model, Grok 4, that’s optimized for reasoning tasks such as generating code. The LLM’s late Wednesday launch followed a turbulent week ...

STAT

OpenAI leaps into health care with AI benchmark to evaluate models

OpenAI on Monday released a large dataset for evaluating how well large language models answer questions related to health care. Experts lauded the open-source data and detailed evaluation rubrics, ...

News-Medical.Net

Leading AI models ace many vaccine questions but falter on clinical rules

A multilingual benchmark of 1,886 vaccine-related questions found that large language models answered most items accurately ...
