CLIMB-80 is a benchmark dataset and evaluation framework for comparing cloud-based language models (ChatGPT, Claude) against locally deployed open-source models (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B) ...