795,000 Americans die or are permanently disabled by diagnostic error every year. That figure comes from the Agency for Healthcare Research and Quality and Johns Hopkins Medicine. It represents doctors — trained, certified, experienced human professionals — getting the diagnosis wrong at a rate of roughly 11.1% across all conditions.

We call this acceptable. We’ve built entire systems around it — second opinions, morbidity reviews, checklists, escalation protocols. We don’t ban doctors from practising. We design for their error rate.

Then an AI makes a mistake, and the headline writes itself.


For the rest of us: what is an error rate?

An error rate is simply the proportion of times a system gets the wrong answer. A 5% error rate means one in twenty outputs is incorrect. An 11.1% diagnostic error rate means roughly one in nine diagnoses is wrong.
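As a minimal sketch of that arithmetic, the snippet below computes an error rate from counts and converts a rate into the "one in N" form used above. The function names are illustrative and not taken from any cited source.

```python
def error_rate(wrong: int, total: int) -> float:
    """Proportion of outputs that were incorrect."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wrong <= total:
        raise ValueError("wrong must be between 0 and total")
    return wrong / total


def one_in_n(rate: float) -> float:
    """Express a rate as 'one in N', where N = 1 / rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return 1 / rate


print(error_rate(5, 100))       # 0.05
print(round(one_in_n(0.05)))    # 20
print(round(one_in_n(0.111)))   # 9
```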

No system — human or machine — has a zero error rate. The question is never whether errors occur. It’s what kinds of errors, at what frequency, with what consequences, and whether the system around the actor is designed to catch them.

We’ve spent a century designing systems around human fallibility in high-stakes environments. We’ve barely started doing the same for AI.


The human baseline

In medicine, the average diagnostic error rate across conditions is 11.1% — but the range is extreme. Heart attack: 1.5%. Spinal abscess: 62%. In outpatient settings specifically, studies estimate a 5% error rate, corresponding to approximately 12 million misdiagnosed adults in the US every year.

In hospitals, a study of 2,428 patients across 29 institutions found that 23% of those who died or were transferred to an ICU had experienced a diagnostic error in the care leading up to that event.

In aviation, human factors account for up to 80% of all accidents. Pilot error alone contributes to around 53% of commercial crashes. This figure has been known for decades. The response was not to ground all aircraft. The response was crew resource management (CRM) training, crew redundancy, automated flight envelopes, black boxes, and mandatory incident reporting. The industry designed around the error rate.

These are not fringe numbers. They are the published baseline of human performance in high-stakes domains. We accept them because the alternative — no doctors, no pilots — is worse.


What AI’s error rates actually look like

In emergency medicine, a study published in JMIR (2024) found that GPT-4 outperformed resident physicians on diagnostic accuracy for internal medicine emergencies. Not matched — outperformed.

In cardiac monitoring, one AI system achieved a false-negative rate of 0.3%, compared to 4.4% for trained technicians — a 15x improvement on one of the most consequential error types.

In operative reporting, AI-generated reports showed 87.3% accuracy against a surgeon baseline of 72.8%. A 14.5 percentage-point reduction in error rate, on documentation that directly affects patient care.
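The cardiac and operative-report comparisons use two different yardsticks: a fold reduction in a rate, and a percentage-point gap in accuracy. The sketch below reproduces both from the figures quoted above. It assumes, as the text does, that error rate is simply 100 minus accuracy; the helper names are illustrative.

```python
def fold_reduction(baseline_rate: float, new_rate: float) -> float:
    """How many times smaller the new rate is than the baseline."""
    if new_rate <= 0:
        raise ValueError("new_rate must be positive")
    return baseline_rate / new_rate


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction of the baseline rate that was eliminated."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Cardiac monitoring false-negative rates (percent), as quoted above.
fold = fold_reduction(4.4, 0.3)

# Operative report accuracy (percent), as quoted above.
points = 87.3 - 72.8
surgeon_error = 100 - 72.8   # assumption: error rate = 100 - accuracy
ai_error = 100 - 87.3
rel = relative_reduction(surgeon_error, ai_error)

print(f"{fold:.1f}x fewer false negatives")        # 14.7x fewer false negatives
print(f"{points:.1f} percentage points")           # 14.5 percentage points
print(f"{rel:.0%} relative reduction in errors")   # 53% relative reduction in errors
```

The same 14.5-point gap is a reduction of roughly half in relative terms, which is why it helps to state which measure is being used.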

In radiology, a 2025 Frontiers in Radiology meta-analysis found that GPT-4 accuracy rose from 57.5% in 2023 to 70.9% in 2025, a gain of 13.4 percentage points in two years.

None of this means AI is ready to practise medicine unsupervised. One cardiology comparison showed AI at 52.6% vs physicians at 47.4%, a real but narrow margin. A Swedish primary care study found GPT-4 fell short on complex cases. Error distributions differ by domain and case complexity in ways that matter.

But the data does not support the assumption that AI is uniquely error-prone compared to humans. In several domains, it’s already more accurate. In others, it’s comparable. In some, it still lags.


The Bainbridge problem

Lisanne Bainbridge published “Ironies of Automation” in 1983. It became one of the most cited papers in human factors. Her core observation: automating a process doesn’t remove human error from the system. It relocates it.

When systems are automated, humans lose the practice that keeps their skills sharp. Attention degrades rapidly — research suggests humans cannot maintain effective vigilance on low-event processes for more than about 30 minutes. And when the automated system fails, the human who steps in is less capable of handling it than they would have been without automation, precisely because the automation trained them out of the loop.

The same dynamic applies to AI. Deploying AI without designing for its error distribution doesn’t eliminate errors — it changes their shape. If an AI handles 95% of cases correctly and humans don’t review the outputs, the 5% that fail may go undetected longer than they would have under a fully human process.

The solution isn’t to avoid AI. It’s to build the AI-equivalent of aviation’s error-mitigation stack: logging, sampling, escalation, override protocols, and systematic measurement of where and how the AI fails.
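A minimal sketch of what part of such a stack could look like in code is shown below: every output is logged, low-confidence outputs are escalated, and a random sample of the rest is sent for human audit. All names here (`route`, `ReviewQueue`, the thresholds) are hypothetical, and the 5% sample rate and 0.8 confidence cut-off are placeholder assumptions, not recommendations.

```python
import logging
import random
from dataclasses import dataclass, field

logger = logging.getLogger("ai_audit")  # hypothetical logger name


@dataclass
class ReviewQueue:
    """Cases waiting for a human reviewer, with the reason they were queued."""
    items: list = field(default_factory=list)


def route(case_id, prediction, confidence, queue, *,
          sample_rate=0.05, escalate_below=0.8, rng=random):
    """Log every AI output, then decide whether a human sees it.

    Returns "escalated", "sampled", or "auto".
    """
    # Logging: keep a record of every output, reviewed or not.
    logger.info("case=%s prediction=%s confidence=%.2f",
                case_id, prediction, confidence)

    # Escalation: low-confidence outputs always go to a human.
    if confidence < escalate_below:
        queue.items.append((case_id, "escalated"))
        return "escalated"

    # Sampling: audit a random slice of confident outputs so that
    # confident-but-wrong answers can still be measured.
    if rng.random() < sample_rate:
        queue.items.append((case_id, "sampled"))
        return "sampled"

    return "auto"


queue = ReviewQueue()
print(route("case-001", "normal", 0.55, queue))  # escalated
```

The random sample is the piece that addresses the undetected 5%: without it, confident errors are never reviewed and the true error rate is never measured.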


Why the 100% expectation is the wrong frame

Demanding that AI be error-free before deployment is a frame that, applied consistently, would require us to shut down hospitals.

The more productive question is: compared to what? If an AI system can reduce diagnostic error rates from 11% to 6%, that is a net benefit even if the AI makes different types of errors than the human would. If an AI-assisted cardiac monitoring workflow misses 15x fewer arrhythmias than the manual baseline, the question isn’t whether the AI is perfect. It’s whether the system as a whole produces better outcomes.

The sociotechnical argument — developed by Bainbridge and extended in decades of human factors research — is that we shouldn’t be comparing AI against a perfect standard. We should be comparing AI-in-a-system against human-in-a-system, measuring outcomes at the level of the whole process, not the individual actor.
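One way to make the system-level comparison concrete is to count the errors that survive review rather than the errors the actor makes. The sketch below assumes a deliberately simple model: residual error equals the actor's error rate multiplied by the share of errors that review fails to catch. The 11% and 6% rates echo the hypothetical above; the catch rates are invented purely for illustration and do not come from any study cited here.

```python
def residual_error_rate(actor_error_rate: float, review_catch_rate: float) -> float:
    """Share of cases where an error is made and also slips past review.

    Simplifying assumption: review catches a fixed fraction of errors,
    regardless of what kind of error it is.
    """
    for name, value in (("actor_error_rate", actor_error_rate),
                        ("review_catch_rate", review_catch_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return actor_error_rate * (1.0 - review_catch_rate)


# Hypothetical catch rates, chosen only to illustrate the comparison.
human_in_system = residual_error_rate(0.11, review_catch_rate=0.50)
ai_unreviewed = residual_error_rate(0.06, review_catch_rate=0.00)
ai_in_system = residual_error_rate(0.06, review_catch_rate=0.50)

print(f"human with review: {human_in_system:.3f}")  # 0.055
print(f"AI without review: {ai_unreviewed:.3f}")    # 0.060
print(f"AI with review:    {ai_in_system:.3f}")     # 0.030
```

Under these made-up numbers, the more accurate actor with no review leaves more errors in the system than the less accurate actor with review, which is why the comparison has to be made at the level of the whole process.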

For enterprise AI deployment, this reframe is practical, not philosophical. It changes what you measure (process outcomes, not model accuracy in isolation), what you build (feedback loops, not just APIs), and what governance looks like (systematic error tracking, not binary go/no-go decisions).


What realistic expectations look like

Expecting zero errors from AI agents is neither scientifically grounded nor strategically useful. The honest expectation is:

AI will make errors. Those errors will have a specific distribution — certain domains, certain case types, certain conditions — that is measurable if you instrument your system. That distribution will likely improve over time. And in many domains, the error rate is already lower than the human baseline it’s being compared against.
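Measuring that distribution can start very simply. The sketch below assumes you already have human-reviewed outcomes tagged with a segment label (a domain, case type, or condition) and tallies the error rate per segment. The record format and segment names are hypothetical.

```python
from collections import defaultdict


def error_rates_by_segment(records):
    """Compute the error rate for each segment.

    records: iterable of (segment, was_correct) pairs, where was_correct
    is the verdict of a human review of the AI output.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for segment, was_correct in records:
        totals[segment] += 1
        if not was_correct:
            wrong[segment] += 1
    return {segment: wrong[segment] / totals[segment] for segment in totals}


# Toy reviewed sample; real data would come from the audit log.
reviewed = [
    ("routine", True), ("routine", True), ("routine", True), ("routine", False),
    ("complex", True), ("complex", False),
]
rates = error_rates_by_segment(reviewed)
print(rates)  # {'routine': 0.25, 'complex': 0.5}
```

A single headline accuracy figure would hide the gap between the two segments; the per-segment view shows where review effort should be concentrated.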

The organisations that handle this well will be the ones that treat AI error rates the way aviation treats human error rates: as a known engineering parameter to design around, not a failure of the technology.


References

  • AHRQ / Johns Hopkins Medicine, diagnostic error impact report (2023) — 795,000 Americans harmed annually
  • NIH PMC, “The incidence of diagnostic error in medicine” — 11.1% average rate
  • Famularo et al., outpatient diagnostic error rate: ~5%, ~12 million US adults/year
  • FAA human factors data — 80% aviation accidents attributed to human factors; pilot error ~53%
  • JMIR (2024), GPT-4 vs ED physicians — diagnostic accuracy comparison
  • Scientific Reports (2024), GPT-4o vs ophthalmologists — glaucoma diagnosis
  • Frontiers in Radiology (2025), meta-analysis — GPT-4 accuracy 57.5% (2023) to 70.9% (2025)
  • AI operative reports vs surgeon reports — 87.3% vs 72.8% accuracy
  • Bainbridge, L., “Ironies of Automation” (1983) — Automatica, Vol. 19, pp. 775–779