- Research Briefing
- Published: 18 July 2025
Nature Medicine (2025)
Through systematic analysis of patient cases, we evaluated the clinical utility of open-source large language models (LLMs), such as the DeepSeek models, for implementation in medical applications. Their performance on clinical decision-making tasks was comparable to that of the proprietary model GPT-4o and partly better than that of Gemini-2.0 Flash Thinking Experimental.
References
Quer, G. & Topol, E. J. The potential for large language models to transform cardiovascular medicine. Lancet Digit. Health 6, e767–e771 (2024). A review article that presents opportunities and limitations of artificial intelligence models in the field of cardiovascular medicine.
de Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit. Health 6, e441–e443 (2024). A comment on the challenge of validating LLMs in healthcare, suggesting general, task-specific and clinical validation.
Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI https://doi.org/10.1056/Aira2400038 (2024). A review article that presents bioethical principles to promote the responsible use of LLMs, enabling their use ethically, equitably and effectively in medicine.
Sandmann, S. et al. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024). A benchmarking article showing the potential and shortcomings of commercial LLMs for clinical decisions.
Hou, G. & Lian, Q. Benchmarking of commercial large language models: ChatGPT, Mistral, and Llama. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-4376810/v1 (2024). A benchmarking article presenting a critical look at LLMs, showing the need for ongoing evaluations and the potential of hybrid models (that is, combining LLMs and existing systems).
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is a summary of: Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. https://doi.org/10.1038/s41591-025-03727-2 (2025).
About this article
Cite this article
Open-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nat Med (2025). https://doi.org/10.1038/s41591-025-03850-0
DOI: https://doi.org/10.1038/s41591-025-03850-0