Creating a Leaderboard Python

Cut your coding agent’s cost with Sonar Vortex

New benchmarks show semantic code graphs helping coding agents find change locations faster and complete updates more ...

16d

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

VibeThinker-3B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting the debate over AI scaling, benchmark gaming and small-model reasoning.

2d on MSN

Florida Python Challenge returns after last year’s record removal

Video from previous story: FWC announces winners of the 2025 Florida Python Challenge. TAMPA, Fla. (WFLA) - In just about a ...

GitHub

Autoresearch for weather dycores.

Autoresearch for weather dycores. Contribute to khzhao/dynamaxx development by creating an account on GitHub.

GitHub

Speedrun.com API V2 wrapper

Speedrun.com's official API (aka APIv1) is not actively maintained, and it both misses a large number of modern features (including various social connections on user profiles) and several unaddressed ...

USENIX

Package Hallucinations: How LLMs Can Invent Vulnerabilities

We used the HumanEval leaderboard to filter the best performing models at the time our research started, which you can see in Figure 3. Note that this project began in February of 2024 and was first ...
