Harvard Study Finds AI Outperforms Human Doctors in Emergency Room Triage
A landmark study published in Science reveals that an advanced AI model diagnosed emergency room patients more accurately than attending physicians, particularly in high-pressure triage situations with limited information.
By Factlen Editorial Team
- Clinical AI Researchers
- Argue that AI has eclipsed traditional benchmarks and must now be evaluated through rigorous, real-world clinical trials like any new medical intervention.
- Practicing Physicians
- View the technology not as a replacement, but as a critical 'second opinion' partner in a triadic care model to catch errors in high-pressure environments.
- Healthcare Technologists
- Emphasize the rapid pace of LLM improvement and its potential to democratize access to expert-level diagnostic reasoning globally.
What's not represented
- Patients subjected to AI triage
- Medical malpractice insurers
Why this matters
Diagnostic errors in fast-paced emergency rooms cost lives and billions of dollars annually. By demonstrating that AI can reliably parse messy, real-time patient data to catch diagnoses that doctors miss, this research paves the way for a 'second opinion' system that could dramatically reduce medical errors and improve patient outcomes.
Key points
- A Harvard study found OpenAI's o1 model outperformed human doctors in emergency room triage.
- The AI identified the exact or a very close diagnosis in 67% of cases at initial triage, compared to 50% and 55% for the two attending physicians.
- The models were tested on raw, unstructured electronic health records from 76 real patients.
- Researchers are calling for rigorous clinical trials to evaluate AI as a 'second opinion' tool in hospitals.
In the chaotic, high-stakes environment of a hospital emergency room, the earliest moments of triage dictate the trajectory of a patient's survival. For decades, the gold standard of care has relied entirely on the rapid cognitive processing of human physicians working with fragmented information. Now, a landmark study published in the journal Science reveals that artificial intelligence has crossed a critical threshold in clinical reasoning. Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center demonstrated that OpenAI's o1 reasoning model significantly outperformed human attending physicians in diagnosing patients during the most critical, information-poor stages of emergency room triage. The findings mark a profound shift in medical technology, suggesting that AI is no longer just an administrative tool, but a highly capable diagnostic partner.[1][2][4]
To test the true capabilities of the AI, the research team designed an experiment that mirrored the messy reality of modern medicine. They selected 76 real-world patient cases from the emergency department at Beth Israel Deaconess Medical Center in Boston. Rather than feeding the AI neatly organized clinical summaries, the researchers provided the models with the exact same raw, unstructured electronic health records that the human doctors faced. This included vital signs, demographic data, and brief, often hastily written notes from triage nurses. The goal was to see if the AI could sift through the noise and identify the signal without the benefit of a physical examination.[2][5][7]
The results at the initial triage stage—when urgency is highest and available data is lowest—were striking. The AI model identified the exact or a very close diagnosis in 67 percent of the cases. In contrast, the two human attending physicians, operating under the same blinded conditions, achieved accuracy rates of only 50 and 55 percent. The AI's advantage stemmed from its ability to instantly process vast amounts of unstructured text and weigh multiple diagnostic probabilities simultaneously, effectively bypassing the cognitive biases and fatigue that can hinder human decision-making in a crowded emergency ward.[1][5][8]

As more clinical information became available later in the patient encounter, the performance gap between the machine and the humans narrowed, though the AI maintained a slight edge. With richer detail, the AI's diagnostic accuracy rose to 82 percent, while the human doctors improved to a range of 70 to 79 percent. While this secondary difference was not deemed statistically significant, it underscored the AI's unique value proposition: the system is most advantageous exactly when doctors are most vulnerable to error—during the initial, chaotic intake process where rapid decisions must be made with minimal context.[1][5][8]
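For readers who want a sense of how much uncertainty a 76-case sample carries, the reported percentages can be turned into approximate case counts and confidence intervals. The following is a minimal sketch, not the study's own analysis: the counts are reconstructed by rounding the published percentages (an assumption), the labels "Physician 1" and "Physician 2" are hypothetical, and the Wilson score interval is a standard textbook method chosen here for illustration.

```python
import math

N_CASES = 76  # number of patient cases reported in the article

# Accuracy rates as reported in the article. The labels are illustrative,
# and the case counts derived from them below are approximations.
REPORTED_RATES = {
    "AI, initial triage": 0.67,
    "Physician 1, initial triage": 0.50,
    "Physician 2, initial triage": 0.55,
    "AI, later stage": 0.82,
    "Physicians (low end), later stage": 0.70,
    "Physicians (high end), later stage": 0.79,
}


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (roughly 95% when z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def summarize(rates, n):
    """Map each label to (approximate correct cases, lower bound, upper bound)."""
    rows = {}
    for label, rate in rates.items():
        correct = round(rate * n)  # assumption: count recovered by rounding
        low, high = wilson_interval(correct, n)
        rows[label] = (correct, low, high)
    return rows


if __name__ == "__main__":
    for label, (correct, low, high) in summarize(REPORTED_RATES, N_CASES).items():
        print(f"{label}: about {correct}/{N_CASES} cases, "
              f"interval {low:.0%} to {high:.0%}")
```

Because the AI and the physicians were scored on the same patients, a formal comparison would use a paired method rather than side-by-side intervals. The sketch only illustrates how wide the uncertainty is for any single percentage drawn from 76 cases.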
Beyond simply naming the disease, the study tested the AI on "management reasoning," a highly complex clinical task that involves developing long-term treatment plans, recommending antibiotic regimens, and navigating sensitive goals-of-care conversations. In a separate evaluation involving five complex clinical case studies, the AI was pitted against a larger cohort of 46 human doctors who were allowed to use conventional resources like search engines. The AI achieved a median score of 89 percent for its treatment plans, far above the human experts' median score of 34 percent.[1][3][7]

Despite the sweeping victory for the algorithmic models, the researchers were quick to dispel the notion of a looming robotic takeover of hospital wards. Dr. Adam Rodman, a lead author of the study and a physician at Beth Israel, emphasized that the technology is not designed to replace doctors. Instead, he envisions a "triadic care model" involving the doctor, the patient, and the AI system working in concert. In this framework, the AI acts as an ever-vigilant second opinion, passively monitoring electronic health records to flag missed diagnostic opportunities or suggest alternative testing pathways before a human error can result in patient harm.[4][7]
The unprecedented performance of the o1 model has prompted the study's authors to call for a fundamental shift in how medical AI is evaluated. Historically, AI models have been tested using multiple-choice medical licensing exams, a metric that fails to capture the nuance of real-world clinical practice. Arjun Manrai, an assistant professor of biomedical informatics at Harvard Medical School, argued that medical AI is now mature enough to be subjected to the same rigorous, prospective clinical trials required for new pharmaceutical drugs. Only through controlled deployment in active care settings can the medical community fully understand the safety profile and operational impact of these tools.[2][6]

The study, while groundbreaking, acknowledged several key limitations that must be addressed before widespread adoption. The AI models were evaluated solely on text-based inputs, meaning they did not interpret non-text data such as X-rays, MRI scans, or the subtle physical cues a doctor observes during an in-person examination. Furthermore, researchers warned that while an AI might correctly identify the top diagnosis, it could simultaneously recommend unnecessary or overly aggressive testing that exposes patients to financial or physical harm. Consequently, human oversight remains the ultimate baseline for ensuring patient safety as the healthcare industry navigates this profound technological transition.[2][6][8]
How we got here
1950s
The first standards are created to train and evaluate doctors, which later become the benchmark for early medical software.
2023–2024
Large language models begin passing the US Medical Licensing Examination, though mostly on clean, multiple-choice questions.
Late 2024
OpenAI releases the o1 reasoning model, introducing step-by-step logical processing capabilities.
April 30, 2026
Harvard Medical School and Beth Israel Deaconess Medical Center publish their landmark study in Science, reporting that an AI model outperformed attending physicians on real-world triage data.
Viewpoints in depth
Clinical AI Researchers
Advocating for a shift from multiple-choice benchmarks to real-world clinical trials.
For years, the gold standard for testing medical AI has been the US Medical Licensing Examination—a multiple-choice test that rewards rote memorization over clinical intuition. Researchers argue that models have now eclipsed these rudimentary benchmarks. Because AI can now navigate the messy, unstructured reality of actual patient charts, the academic community insists that these systems must be evaluated exactly like new pharmaceutical drugs: through rigorous, prospective clinical trials that measure actual patient outcomes in active hospital wards.
Practicing Physicians
Embracing AI as a safety net rather than a replacement for human expertise.
Frontline doctors are largely rejecting the narrative that AI will render them obsolete. Instead, they view the technology as a critical tool to combat the cognitive fatigue that plagues overcrowded emergency rooms. By adopting a 'triadic care model,' physicians hope to use AI as a passive, always-on second opinion that scans health records in the background, flagging missed symptoms or suggesting alternative diagnoses before a human error can result in a catastrophic medical failure.
Patient Safety Advocates
Focusing on the potential to drastically reduce diagnostic errors in high-pressure environments.
Diagnostic errors are a leading cause of preventable death and injury in modern healthcare, particularly during the chaotic intake process of emergency triage. Advocates point to the AI's 67 percent accuracy rate in information-poor scenarios as a massive leap forward for patient safety. However, they also caution that AI systems must be carefully monitored to ensure they do not recommend overly aggressive, unnecessary, or financially ruinous testing regimens simply because a rare disease is statistically possible.
What we don't know
- How the AI models would perform when forced to interpret non-text data, such as X-rays, MRI scans, or physical patient cues.
- Whether the AI's diagnostic accuracy will hold up in rural or under-resourced hospitals with different demographic profiles and less comprehensive electronic health records.
- How medical malpractice liability will be handled if a doctor overrides a correct AI diagnosis, or if an AI system recommends a harmful treatment plan.
Key terms
- Large Language Model (LLM)
- An artificial intelligence system trained on vast amounts of text, capable of understanding and generating human-like language and reasoning.
- Triage
- The process of quickly examining patients who are taken to a hospital to decide which ones are the most seriously ill and must be treated first.
- Management Reasoning
- The complex clinical process of deciding the next steps in patient care, including treatment plans, medication regimens, and end-of-life discussions.
- Electronic Health Record (EHR)
- A digital version of a patient's paper chart, containing medical history, diagnoses, medications, and treatment plans.
- Triadic Care Model
- A proposed healthcare framework where medical decisions are made collaboratively by the doctor, the patient, and an artificial intelligence system.
Frequently asked
Will AI replace emergency room doctors?
No. Researchers emphasize that AI is designed to act as a "second opinion" to catch errors and assist with data processing, not to practice medicine autonomously or replace human physical examinations.
How much better was the AI at diagnosing patients?
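During the initial triage stage with limited information, the AI identified the exact or a very close diagnosis 67% of the time, compared to 50% and 55% for the two attending physicians. Once more clinical information was available, the AI reached 82% versus 70% to 79% for the physicians, a difference that was not statistically significant.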
Did the AI have an unfair advantage in the study?
No. The AI and the human doctors were given the exact same raw, unstructured electronic health records, and neither was allowed to perform a physical examination of the patient.
What is management reasoning and how did the AI perform?
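Management reasoning involves deciding the next steps in patient care, such as treatment plans and medication regimens. Across five complex case studies, the AI achieved a median score of 89% on these tasks, compared to a median of 34% for 46 human doctors who were allowed to use conventional resources like search engines.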
Sources
[1] The Guardian (Practicing Physicians): AI outperforms doctors in emergency room tasks, new Harvard study shows
[2] Harvard University (Clinical AI Researchers): Study Suggests AI Is Good Enough at Diagnosing Complex Medical Cases To Warrant Clinical Testing
[3] Inc. Magazine (Healthcare Technologists): A new peer-reviewed study found AI diagnosed emergency patients more accurately than human doctors
[4] Harvard Magazine (Practicing Physicians): AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows
[5] Algorithm Times (Clinical AI Researchers): Harvard study: OpenAI's o1 model beats doctors in ER triage
[6] Tech Echelon (Healthcare Technologists): OpenAI's o1 model outperforms physicians in emergency room diagnoses
[7] Indian Express (Practicing Physicians): AI models outperform human doctors in emergency room diagnosis: Harvard study
[8] India Times (Healthcare Technologists): How AI outperformed doctors in a landmark Harvard study