Performance of a large language model on the reasoning tasks of a physician


[Submitted on 14 Dec 2024 (v1), last revised 19 May 2025 (this version, v2)]


Abstract: A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

Submission history

From: Arjun Manrai [view email]
[v1] Sat, 14 Dec 2024 14:46:18 UTC (1,601 KB)
[v2] Mon, 19 May 2025 14:26:07 UTC (1,996 KB)
