Open-source LLM DeepSeek on a par with proprietary models in clinical decision making

  • Research Briefing
  • Published: 18 July 2025

Nature Medicine (2025)

Through a systematic analysis of patient cases, we evaluated the clinical utility of open-source large language models (LLMs), such as the DeepSeek models, for use in medical applications. Their performance on clinical decision-making tasks was comparable to that of the proprietary model GPT-4o and partly better than that of Gemini-2.0 Flash Thinking Experimental.

References

  1. Quer, G. & Topol, E. J. The potential for large language models to transform cardiovascular medicine. Lancet Digit. Health 6, e767–e771 (2024). A review article that presents opportunities and limitations of artificial intelligence models in the field of cardiovascular medicine.

  2. de Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit. Health 6, e441–e443 (2024). A comment on the challenge of validating LLMs in healthcare, suggesting general, task-specific and clinical validation.

  3. Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI https://doi.org/10.1056/Aira2400038 (2024). A review article that presents bioethical principles to promote the responsible use of LLMs, enabling their use ethically, equitably and effectively in medicine.

  4. Sandmann, S. et al. Systematic analysis of ChatGPT, Google Search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024). A benchmarking article showing the potential and shortcomings of commercial LLMs for clinical decisions.

  5. Hou, G. & Lian, Q. Benchmarking of commercial large language models: ChatGPT, Mistral, and Llama. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-4376810/v1 (2024). A benchmarking article presenting a critical look at LLMs, showing the need for ongoing evaluations and the potential of hybrid models (that is, combining LLMs and existing systems).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a summary of: Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. https://doi.org/10.1038/s41591-025-03727-2 (2025).

About this article

Cite this article

Open-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nat Med (2025). https://doi.org/10.1038/s41591-025-03850-0

  • Published: 18 July 2025

  • DOI: https://doi.org/10.1038/s41591-025-03850-0
