Science

Study Title
Performance of a large language model on the reasoning tasks of a physician
Publication
Science.org
Author(s)

Peter G. Brodeur, Thomas A. Buckley, Zahir Kanjee, Ethan Goh, Evelyn Bin Ling, Priyank Jain, Stephanie Cabral, Raja-Elie Abdulnour, Adrian D. Haimovich, Jason A. Freed, Andrew Olson, Daniel J. Morgan, Jason Hom, Robert Gallo, Liam G. McCoy, Haadi Mombini, Christopher Lucas, Misha Fotoohi, Matthew Gwiazdon, Daniele Restifo, Daniel Restrepo, Eric Horvitz, Jonathan Chen, Arjun K. Manrai, Adam Rodman

Abstract

More than 65 years ago, complex clinical diagnostic reasoning cases were introduced as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. In this study, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases across five experiments, with a baseline of hundreds of physicians. We then report a real-world study comparing human expert and artificial intelligence (AI) second opinions in randomly selected patients in the emergency room of a major tertiary academic medical center. In all experiments, the LLM outperformed physician baselines and showed continued improvement over prior generations of AI clinical decision support. Our study suggests that LLMs have eclipsed most benchmarks of clinical reasoning, motivating the urgent need for prospective trials.

Date
May 4, 2026