AI Models Match or Beat Doctors in Complex Medical Reasoning, Study Shows

A new study from Harvard Medical School and Beth Israel Deaconess Medical Center has found that large language models (LLMs) can outperform physicians across a range of clinical reasoning tasks, including emergency department decision-making, diagnosis, and management planning. The research, published in a leading medical journal, tested OpenAI's o1-preview model against human doctors on multiple benchmarks and found the AI consistently matched or exceeded human performance.

How the AI Was Tested

The researchers evaluated o1-preview, a reasoning-focused model released in 2024, alongside GPT-4o, using a variety of clinical cases. These included published case conferences and real-world emergency department records. The AI was asked to generate likely diagnoses and recommend next steps at different stages of patient care, from triage to admission.

In one experiment, the model was given only the information available at each stage of a standard emergency department visit. The largest performance gap between AI and human physicians occurred during triage, when patient data is most limited. As more information became available, both AI and human diagnostic accuracy improved, but the AI maintained a lead.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, co-senior author and professor at Harvard Medical School. “However, this does not mean AI will necessarily improve care – how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice.”

Implications for European Healthcare

The findings arrive as European health systems grapple with rising demand, workforce shortages, and budget constraints. Countries like Germany, France, and the Netherlands are already piloting AI tools for radiology and pathology, but this study suggests broader applications in emergency medicine and primary care could be on the horizon. The European Union's AI Act, which classifies medical AI as high-risk, will require such systems to undergo conformity assessments before deployment.

“Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling,” said co-first author Peter Brodeur, a clinical fellow at Beth Israel Deaconess. He cautioned that high benchmark scores do not guarantee safe real-world performance.

The researchers noted that AI might reduce diagnostic errors, which the World Health Organization estimates affect one in ten patients globally. In Europe, diagnostic delays contribute significantly to adverse outcomes, particularly in under-resourced regions like the Balkans and parts of Eastern Europe. However, the study also highlighted risks: “A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” Brodeur said.

Limitations and Next Steps

The study has several limitations. It focused on the o1-preview model, which has since been superseded by OpenAI's o3. The research measured model performance, not clinical outcomes, and did not account for human factors such as patient communication, ethical judgment, and contextual awareness, all of which are critical in real-world care.

The authors called for prospective trials to evaluate AI in clinical settings and for health systems to invest in computing infrastructure and regulatory frameworks. “Humans should be the ultimate baseline when it comes to evaluating performance and safety,” Brodeur added.

As European policymakers consider integrating AI into healthcare, this study provides both promise and caution. The technology's ability to handle complex reasoning tasks could complement human expertise, but only if deployed with rigorous oversight. For now, the message is clear: AI can think like a doctor, but it is not yet ready to replace one.
