MIT OpenCourseWare Math

Benchmarking cloud-based and locally deployed LLMs on university-level mathematical reasoning

CLIMB-80 is a benchmark dataset and evaluation framework for comparing cloud-based language models (ChatGPT, Claude) against locally deployed open-source models (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B) ...
