Open-source LLM DeepSeek on a par with proprietary models in clinical decision making

  • Research Briefing
  • Published: 18 July 2025

Nature Medicine (2025)

Through a systematic analysis of patient cases, we evaluated the clinical utility of open-source large language models (LLMs), such as the DeepSeek models, for use in medical applications. Their performance on clinical decision-making tasks was comparable to that of the proprietary model GPT-4o and partly better than that of Gemini-2.0 Flash Thinking Experimental.

References

  1. Quer, G. & Topol, E. J. The potential for large language models to transform cardiovascular medicine. Lancet Digit. Health 6, e767–e771 (2024). A review article that presents opportunities and limitations of artificial intelligence models in the field of cardiovascular medicine.

  2. de Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit. Health 6, e441–e443 (2024). A comment on the challenge of validating LLMs in healthcare, suggesting general, task-specific and clinical validation.

  3. Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI https://doi.org/10.1056/Aira2400038 (2024). A review article that presents bioethical principles to promote the responsible use of LLMs, enabling their use ethically, equitably and effectively in medicine.

  4. Sandmann, S. et al. Systematic analysis of ChatGPT, Google Search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024). A benchmarking article showing the potential and shortcomings of commercial LLMs for clinical decisions.

  5. Hou, G. & Lian, Q. Benchmarking of commercial large language models: ChatGPT, Mistral, and Llama. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-4376810/v1 (2024). A benchmarking article presenting a critical look at LLMs, showing the need for ongoing evaluations and the potential of hybrid models (that is, combining LLMs and existing systems).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a summary of: Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. https://doi.org/10.1038/s41591-025-03727-2 (2025).

About this article

Cite this article

Open-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nat Med (2025). https://doi.org/10.1038/s41591-025-03850-0

  • Published: 18 July 2025

  • DOI: https://doi.org/10.1038/s41591-025-03850-0
