The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical ...