Harvard/Science Study: OpenAI o1 Outperforms Physicians on Emergency Room Diagnosis, 67% vs 56% Accuracy
A peer-reviewed study published in Science by researchers at Harvard Medical School, Beth Israel Deaconess Medical Center, and Stanford found that OpenAI's o1 reasoning model identified the exact or a very close diagnosis in 67% of real emergency room triage cases, more than 10 percentage points higher than two internal medicine attending physicians given identical text-based patient data. In one clinical reasoning task, o1 achieved a perfect score on 98% of cases versus 35% for physicians. The study drew on 76 real ER patients and is among the most rigorous head-to-head comparisons of LLM and physician diagnostic reasoning to date. The authors explicitly called for urgent prospective randomized trials before clinical deployment, noting that the study's controlled, text-only format does not replicate the full complexity of clinical encounters, including physical examination, emotional context, and physician-patient communication. Publication in Science (the world's second most-cited journal) and the study's Harvard provenance lend the findings considerable credibility and are expected to accelerate regulatory debate about standards for clinical deployment of AI.
Sources
- T1 Science / AAAS Official western
- T2 STAT News Major western
- T2 Science News Major western