Part I: What the Harvard ER Study Says About o1 Beating Doctors at Diagnosis, Why It Means Differential Diagnosis Just Stopped Being a Scarce Cognitive Asset, and Where the Money Goes Next

Thoughts on Healthcare Markets & Technology Podcast

May 05, 2026

A Harvard team just published a study in Science showing o1 outperformed ER physicians at diagnosis. The popular take is “AI beat doctors.” The popular take misses the most important finding in the paper. Thread.

Setup: 76 real ED cases from a Boston academic medical center. o1 and physicians both given triage info first, then progressively more workup data. Each stage: produce a differential, rank the candidates. Physicians blinded to model outputs in the comparison arm.

The headline number: o1 hit around 67% accuracy at triage. Physicians were at 50-55%. By full workup, both converged above 80%. So yes, the gap is real. But the gap is biggest exactly where information is sparsest, and that detail matters a lot.

Why it matters: at triage, the cognitive task is generating a wide net of plausible diagnoses. Humans are systematically narrow-net generators. The clinical literature calls this premature closure. Researchers estimate roughly 12 million Americans experience diagnostic error per year. This study is a controlled demo of why.

Important caveat that got buried: this is text in, text out. No imaging interpretation. No real-time lab handling. No conversational diagnosis. The inputs are chart vignettes, closer to chart review than clinical care. Anyone calling physicians obsolete based on this has not read the methods.

Now the finding almost nobody covered. The human-plus-AI condition did not outperform AI alone. Every health AI pitch deck assumes physician plus machine performs better than either alone. That is the entire copilot premise. This paper puts that assumption on the back foot empirically.

The mechanism: automation bias plus anchoring. Physicians used the model’s ranked list as a starting point. They accepted incorrect rankings too often. They discounted correct AI suggestions that conflicted with their own prior. Net result: roughly a wash. Radiology has seen this same pattern for a decade.
