
AI在诊断病人方面比医生更出色(哈佛医学院)
AI Better At Diagnosing Patients Than Doctors

(Source: Science, April 30, 2026; Harvard Medical School)

A large language model (LLM) outperformed physicians across many clinical reasoning tasks, including making emergency-room decisions based on the available information, identifying likely diagnoses, and choosing the next steps in management, a team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center reported April 30 in Science.

一个大型语言模型(LLM)在多项任务中表现优于医生,包括根据可用信息做出急诊室决策、识别可能的诊断以及选择后续管理步骤,由哈佛医学院和贝斯以色列女执事医疗中心的医生与计算机科学家组成的团队于 4 月 30 日在《科学》杂志上报告称。

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said co-senior author Arjun (Raj) Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at HMS and founding deputy editor of NEJM AI.

“我们几乎用每一项基准测试检验了这个 AI 模型,其表现既超越了此前的模型,也超越了我们设定的医生基准线,” 共同资深作者、哈佛医学院布拉瓦特尼克研究所生物医学信息学助理教授、《NEJM AI》创刊副主编阿尔琼(拉吉)·曼莱表示。

The results make the case that medical AI is ready to be studied the same way as all new medical interventions: through carefully controlled, rigorous, prospective clinical trials in real care settings.

研究结果有力地证明,医疗人工智能已具备条件,可以像所有新型医疗干预手段一样接受同等标准的研究:通过在实际医疗环境中进行严格受控、严谨周密的前瞻性临床试验。

Manrai noted that these trials are necessary to evaluate whether, how, and where such tools should be deployed in clinical care as aids to human practitioners.

曼莱指出,要评估此类工具是否应该、以何种方式以及在哪些环节作为人类从业者的辅助手段部署到临床诊疗中,这些试验必不可少。

The model’s performance also suggests that longstanding ways of testing medical AI may no longer capture the abilities of current systems — pointing to a possible turning point for the field.

该模型的性能还表明,长期用于测试医疗人工智能的方法可能已无法准确评估现有系统的能力——这标志着该领域可能正迎来一个转折点。

“Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent, and we can’t track progress anymore because we’re already at the ceiling,” said co-first author Peter Brodeur, HMS clinical fellow in medicine at Beth Israel Deaconess.

“模型的能力正日益增强。过去我们通过多项选择题来评估模型;现在它们几乎总是能拿到接近 100% 的分数,我们已经达到了天花板,因此无法再追踪进展了,” 共同第一作者、贝斯以色列女执事医疗中心的哈佛医学院临床研究员彼得·布罗德尔表示。

Testing medical AI in the real world
在现实世界中测试医疗 AI

Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared how an AI system performed against hundreds of clinicians. The comparisons included case study diagnostic challenges, reasoning exercises, and real emergency department cases.

研究人员借鉴了上世纪 50 年代为培训和评估医生所制定的标准,将人工智能系统的表现与数百名临床医生进行了对比。对比内容包括案例研究诊断挑战、推理练习以及真实的急诊科病例。

In one experiment, the team tasked the LLM with evaluating patients at various points in a standard emergency department setting, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point — drawn directly from actual electronic health records — and asked to generate likely diagnoses and recommend what should happen next.

在一次实验中,研究团队让大语言模型在一个标准急诊科环境中,从早期分诊到后续的入院决策等不同时间点对患者进行评估。在每个阶段,模型仅获得该时间点可用的信息——这些信息直接来自真实的电子健康记录——并被要求生成可能的诊断结果,并建议下一步应采取的措施。

“To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse,” said co-first author Thomas Buckley, a Harvard Kenneth C. Griffin Graduate School of Arts and Sciences doctoral student and Dunleavy Fellow in HMS’ AI in Medicine PhD track and member of the Manrai Lab.

“为了更准确地评估模型在实际临床环境中的表现,我们需要在患者病程早期、临床数据尚不充分时进行性能测试,”该论文的共同第一作者托马斯·巴克利表示。他是哈佛大学肯尼斯·C·格里芬文理研究生院的博士生,也是哈佛医学院医学人工智能博士培养方向的邓利维学者和曼莱实验室的成员。

Unlike in prior studies, the team did not smooth out the messiness of real‑world care before testing the model; the emergency department cases were presented exactly as they appeared in the electronic health record.

与先前的研究不同,该团队在测试模型前并未对现实世界医疗护理的复杂性进行简化处理;急诊科病例完全按照电子健康记录中的原始样貌呈现。

“We didn’t pre‑process the data at all,” said co-senior author Adam Rodman, HMS assistant professor of medicine at Beth Israel Deaconess, director of AI programs for the Carl J. Shapiro Center for Education and Research, and associate editor of NEJM AI.

“我们完全没有对数据进行预处理,” 共同资深作者、哈佛医学院贝斯以色列女执事医疗中心医学助理教授、卡尔·J·夏皮罗教育与研究中心人工智能项目主任、《NEJM AI》副主编亚当·罗德曼表示。

At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy — a result that surprised even the researchers.

在现实世界急诊科的早期决策点,该模型在诊断准确性上达到或超过了主治医师的水平——这一结果甚至让研究人员都感到惊讶。

“I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened,” Rodman said.

“我原以为这会是个有趣的实验,但不会那么有效。结果完全不是那么回事,” 罗德曼说。

The researchers emphasized that their results do not suggest that AI systems are ready to practice medicine autonomously or that physicians can be removed from the diagnostic process.

研究人员强调,他们的研究结果并不表明人工智能系统已准备好自主行医,也不意味着医生可以被排除在诊断过程之外。

“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” Brodeur said. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”

“模型可能把首位诊断判断正确,但同时建议不必要的检查,从而让患者面临受到伤害的风险,” 布罗德尔说。“在评估性能和安全性时,人类应当成为最终的基准。”

Researchers have just completed one of the largest-yet studies comparing artificial intelligence and physicians across a wide range of clinical reasoning tasks, evaluating whether an AI system could do what physicians do every day: review a messy patient chart and decide what to do next.

A large language model (LLM) outperformed physicians across many of these tasks, including making emergency-room decisions based on the available information, identifying likely diagnoses, and choosing the next steps in management, a team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center reported April 30 in Science.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said co-senior author Arjun (Raj) Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at HMS and founding deputy editor of NEJM AI.

The results make the case that medical AI is ready to be studied the same way as all new medical interventions: through carefully controlled, rigorous, prospective clinical trials in real care settings.

Manrai noted that these trials are necessary to evaluate whether, how, and where such tools should be deployed in clinical care as aids to human practitioners.

The model’s performance also suggests that longstanding ways of testing medical AI may no longer capture the abilities of current systems — pointing to a possible turning point for the field.

“Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent, and we can’t track progress anymore because we’re already at the ceiling,” said co-first author Peter Brodeur, HMS clinical fellow in medicine at Beth Israel Deaconess.

Testing medical AI in the real world

Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared how an AI system performed against hundreds of clinicians. The comparisons included case study diagnostic challenges, reasoning exercises, and real emergency department cases.

In one experiment, the team tasked the LLM with evaluating patients at various points in a standard emergency department setting, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point — drawn directly from actual electronic health records — and asked to generate likely diagnoses and recommend what should happen next.

“To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse,” said co-first author Thomas Buckley, a Harvard Kenneth C. Griffin Graduate School of Arts and Sciences doctoral student and Dunleavy Fellow in HMS’ AI in Medicine PhD track and member of the Manrai Lab.

Unlike in prior studies, the team did not smooth out the messiness of real‑world care before testing the model; the emergency department cases were presented exactly as they appeared in the electronic health record.

“We didn’t pre‑process the data at all,” said co-senior author Adam Rodman, HMS assistant professor of medicine at Beth Israel Deaconess, director of AI programs for the Carl J. Shapiro Center for Education and Research, and associate editor of NEJM AI.

At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy — a result that surprised even the researchers.

“I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened,” Rodman said.

The researchers emphasized that their results do not suggest that AI systems are ready to practice medicine autonomously or that physicians can be removed from the diagnostic process.

“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” Brodeur said. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”

diagnose [ˈdaɪəɡnəʊz] v. 诊断
make a diagnosis 作出诊断
Doctors must diagnose patients accurately.
医生必须准确诊断病人。

clinical [ˈklɪnɪkl] adj. 临床的
clinical trial 临床试验
The drug is undergoing clinical trials.
该药正在进行临床试验。

evaluate [ɪˈvæljueɪt] v. 评估
evaluate performance 评估表现
We need to evaluate the new system.
我们需要评估新系统。

outperform [ˌaʊtpəˈfɔːm] v. 表现优于
outperform competitors 表现优于竞争者
The AI system outperformed physicians.
该人工智能系统表现优于医生。

decision [dɪˈsɪʒn] n. 决定
make a decision 作出决定
She made a quick decision.
她迅速作出了决定。

management [ˈmænɪdʒmənt] n. 管理
management strategy 管理策略
Good management improves efficiency.
良好的管理提高效率。

rigorous [ˈrɪɡərəs] adj. 严格的
rigorous testing 严格测试
The study requires rigorous testing.
该研究需要严格测试。

prospective [prəˈspektɪv] adj. 前瞻性的;预期的;未来的
prospective study 前瞻性研究
They conducted a prospective study.
他们进行了前瞻性研究。

deploy [dɪˈplɔɪ] v. 部署;应用
deploy technology 部署技术
Hospitals deploy AI tools.
医院部署人工智能工具。

capable [ˈkeɪpəbl] adj. 有能力的
capable of doing 能够做某事
The model is capable of complex reasoning.
该模型能够进行复杂推理。

accuracy [ˈækjərəsi] n. 准确性
diagnostic accuracy 诊断准确性
The test showed high accuracy.
测试显示出高准确性。

autonomous [ɔːˈtɒnəməs] adj. 自主的
autonomous system 自主系统
The car is not fully autonomous.
这辆车并非完全自主。

expose [ɪkˈspəʊz] v. 暴露;使遭受
expose to risk 使暴露于风险
Unnecessary tests may expose patients to harm.
不必要的检查可能使病人遭受伤害。

baseline [ˈbeɪslaɪn] n. 基线;基准
baseline performance 基准表现
Physicians serve as the baseline for evaluation.
医生作为评估的基准。

funding [ˈfʌndɪŋ] n. 资金;拨款
research funding 研究资金
The project received federal funding.
该项目获得了联邦资金。

disclosure [dɪsˈkləʊʒə] n. 披露
financial disclosure 财务披露
The company made full disclosure.
公司进行了全面披露。

1. A study suggests that…
说明:用于引出研究结论或观点。
例句:A study suggests that AI is good enough to diagnose complex medical cases.
研究表明人工智能足以诊断复杂的医疗案例。

2. …is good enough to…
说明:用于表达某事物的能力或水平已达到要求。
例句:The system is good enough to warrant clinical testing.
该系统已足够好,值得进行临床测试。

3. …warrant(s) further testing/attention
说明:用于强调某事物值得进一步研究或关注。
例句:The findings warrant further clinical testing.
研究结果值得进一步临床测试。

4. …outperformed… in terms of…
说明:用于比较优势,常见于学术写作。
例句:AI outperformed physicians in terms of diagnostic accuracy.
人工智能在诊断准确性方面优于医生。

5. …raises questions about…
说明:用于引出讨论或批判性思考。
例句:The study raises questions about the role of AI in healthcare.
该研究引发了关于人工智能在医疗中的作用的疑问。

6. …requires rigorous evaluation before…
说明:用于强调某事在实施前需要严格评估。
例句:The technology requires rigorous evaluation before deployment.
该技术在应用前需要严格评估。

7. …is capable of…
说明:用于说明能力或潜力。
例句:The model is capable of handling complex reasoning.
该模型能够处理复杂推理。

8. …may expose patients to…
说明:用于表达潜在风险。
例句:Unnecessary tests may expose patients to harm.
不必要的检查可能使病人遭受伤害。

9. …serves as a baseline for…
说明:用于说明比较或评估的基准。
例句:Physicians serve as a baseline for evaluation.
医生作为评估的基准。

10. …received funding from…
说明:用于说明研究或项目的资金来源。
例句:The project received funding from federal agencies.
该项目获得了联邦机构的资助。

以下为雅思写作模版句型:

11. Research has demonstrated that…
说明:用于引出研究结果。
例句:Research has demonstrated that AI can improve diagnostic accuracy.
研究表明人工智能可以提高诊断准确性。

12. It is widely acknowledged that…
说明:用于引出普遍观点。
例句:It is widely acknowledged that technology plays a crucial role in modern healthcare.
人们普遍认为科技在现代医疗中发挥着关键作用。

13. There is growing concern that…
说明:用于引出问题或担忧。
例句:There is growing concern that reliance on AI may reduce human judgment.
人们越来越担心依赖人工智能可能削弱人的判断力。

14. …has the potential to…
说明:用于说明某事物的潜力。
例句:AI has the potential to revolutionize medical diagnosis.
人工智能有可能彻底改变医学诊断。

15. …should be subject to rigorous evaluation before…
说明:用于强调实施前的严格审查。
例句:New technologies should be subject to rigorous evaluation before being widely adopted.
新技术在广泛应用前应接受严格评估。

16. …offers both opportunities and challenges…
说明:用于平衡论证。
例句:AI offers both opportunities and challenges in healthcare.
人工智能在医疗领域既带来机遇也带来挑战。

17. …raises ethical questions regarding…
说明:用于引出伦理问题。
例句:The use of AI raises ethical questions regarding patient privacy.
人工智能的使用引发了关于患者隐私的伦理问题。

18. …is increasingly being deployed in…
说明:用于说明趋势。
例句:AI is increasingly being deployed in hospitals worldwide.
人工智能正越来越多地应用于全球医院。

19. …plays a vital role in…
说明:用于强调重要性。
例句:Technology plays a vital role in improving healthcare efficiency.
科技在提高医疗效率方面发挥着重要作用。

20. …should not replace but complement…
说明:用于提出平衡观点。
例句:AI should not replace but complement human doctors.
人工智能不应取代而应当补充人类医生。

本文结构

引言段 (开篇引入话题)
A study suggests that artificial intelligence is good enough to diagnose complex medical cases.
研究表明人工智能足以诊断复杂的医疗案例。

It is widely acknowledged that technology plays a crucial role in modern healthcare.
人们普遍认为科技在现代医疗中发挥着关键作用。

主体段一 (提出优势与潜力)
AI outperformed physicians in terms of diagnostic accuracy, which raises questions about the role of human doctors.
人工智能在诊断准确性方面优于医生,这引发了关于人类医生角色的疑问。

AI has the potential to revolutionize medical diagnosis, offering both opportunities and challenges.
人工智能有可能彻底改变医学诊断,既带来机遇也带来挑战。

主体段二 (提出风险与限制)
Unnecessary tests may expose patients to harm, so the technology requires rigorous evaluation before deployment.
不必要的检查可能使病人遭受伤害,因此该技术在应用前需要严格评估。

The use of AI raises ethical questions regarding patient privacy, which should not be ignored.
人工智能的使用引发了关于患者隐私的伦理问题,这不应被忽视。

结论段 (总结与平衡观点)
AI is increasingly being deployed in hospitals worldwide, but it should not replace but complement human doctors.
人工智能正越来越多地应用于全球医院,但它不应取代而应当补充人类医生。

This balance ensures that technology serves as a baseline for improvement while maintaining human judgment.
这种平衡保证了科技作为改进的基准,同时保有人类的判断力。

Question 1 (Main idea)
What is the main point of the article?
A. AI has already replaced doctors in clinical practice.
B. An AI model performed well enough in clinical tasks to justify clinical testing.
C. Physicians performed much better than AI in emergency care.
D. The study shows AI has no limitations in medicine.

Question 2 (Detail)
According to the article, in which clinical situation did the AI match or exceed attending physicians in diagnostic accuracy?
A. Initial triage when information was sparse.
B. Long‑term outpatient follow‑up.
C. Surgical procedures requiring hands‑on skill.
D. Visual assessment of a patient’s appearance.

Question 3 (Detail)
Which of the following did the researchers compare the AI against?
A. Only multiple‑choice benchmark tests.
B. Only simulated textbook cases.
C. Hundreds of clinicians across case studies, reasoning exercises, and real ER cases.
D. Non‑medical crowdworkers.

Question 4 (Inference)
Which inference is best supported by the article?
A. The study proves AI should immediately replace human clinicians.
B. The findings indicate AI merits rigorous, prospective clinical trials before wide deployment.
C. The study shows no need for human oversight in AI‑assisted care.
D. The AI was tested on visual and bedside examinations.

Question 5 (Vocabulary in context)
In the phrase “prospective clinical trials,” the word prospective most nearly means:
A. retrospective; looking back.
B. accidental; unplanned.
C. casual; informal.
D. forward‑looking; planned in advance.

答案:1-5 BACBD

解题思路与技巧(每题)
Q1 (主旨题技巧)
Why B? The article’s repeated emphasis is that the LLM outperformed physicians on many tasks and therefore warrants clinical testing—this is the central claim. Tip: Scan title, opening and concluding paragraphs for the author’s main claim.

Q2 (细节题技巧)
Why A? The article states that at the early decision points in the real emergency department cases, when clinical data is sparse, the model matched or exceeded attending physicians in diagnostic accuracy. Tip: Look for comparison phrases like “matched or exceeded” to locate key details.

Q3 (细节题技巧)
Why C? Methods described include comparisons with hundreds of clinicians across case studies, reasoning exercises, and real ER cases. Tip: Match multiple elements in the option to the passage description.

Q4 (推断题技巧)
Why B? The authors call for carefully controlled, rigorous, prospective clinical trials, implying further testing is needed before deployment—not immediate replacement. Tip: Distinguish between what is explicitly recommended and extreme conclusions.

Q5 (词义猜测技巧)
Why D? In clinical research, prospective trials are planned forward‑looking studies (opposite of retrospective). Tip: Use the scientific context to infer technical meanings.