A large language model from Google Research, designed for the medical domain.
Med-PaLM is a large language model (LLM) designed to provide high-quality answers to medical questions.
Med-PaLM harnesses the power of Google’s large language models, which we have aligned to the medical domain with a set of carefully curated medical expert demonstrations. Our first version of Med-PaLM, preprinted in late 2022, was the first AI system to surpass the pass mark on US Medical Licensing Exam (USMLE) style questions. Med-PaLM also generates accurate, helpful long-form answers to consumer health questions, as judged by panels of physicians and users.
We introduced our latest model, Med-PaLM 2, at our annual health event The Check Up. Med-PaLM 2 achieves an accuracy of 85.4% on USMLE questions. This performance is on par with “expert” test takers, and is an 18 percentage point leap over our own state-of-the-art result of 67.4% from Med-PaLM. We will soon release a preprint for Med-PaLM 2. In the coming months, Med-PaLM 2 will also be made available to a select group of Google Cloud customers for limited testing, to explore use cases and share feedback, as we investigate safe, responsible, and meaningful ways to use this technology.
Med-PaLM 2 reached 85.4% accuracy on the medical exam benchmark in research
Progress in AI over the last decade has enabled it to play an increasingly important role in healthcare and medicine. Breakthroughs such as the Transformer have enabled LLMs and other large models to scale to billions of parameters – such as PaLM – letting generative AI move beyond the limited pattern-spotting of earlier AIs and into the creation of novel expressions of content, from speech to scientific modeling.
Developing AI that can answer medical questions accurately has been a long-standing challenge, with several research advances over the past few decades. While the topic is broad, answering USMLE-style questions has recently emerged as a popular benchmark for evaluating medical question answering performance.
In a USMLE-style question, you are presented with a vignette containing a description of the patient, symptoms, and medications. The goal is to select the right multiple-choice option.
Answering the question accurately requires the reader to understand symptoms, examine findings from a patient’s tests, perform complex reasoning about the likely diagnosis, and ultimately, pick the right answer for what disease, test, or treatment is most appropriate. In short, a combination of medical comprehension, knowledge retrieval, and reasoning is necessary to do well. It takes years of training for clinicians to be able to accurately and consistently answer these questions.
The generation capabilities of large language models also enable them to produce long-form answers to consumer medical questions. However, ensuring model responses are accurate, safe, and helpful has been a crucial research challenge, especially in this safety-critical domain.
We assessed Med-PaLM and Med-PaLM 2 against a benchmark we call ‘MultiMedQA’, which combines seven question answering datasets spanning professional medical exams, medical research, and consumer queries. Med-PaLM was the first AI system to obtain a passing score on USMLE questions from the MedQA dataset, with an accuracy of 67.4%. Med-PaLM 2 improves on this further with state-of-the-art performance of 85.4%, matching expert test-takers.
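As a rough illustration of how multiple-choice accuracy figures like these can be computed and compared, here is a minimal Python sketch. It is not the evaluation code used in this research: the function names and the five toy questions are hypothetical, and only the 67.4% and 85.4% figures come from the text above. The sketch also separates the absolute gain (in percentage points) from the relative gain, since the two are easy to confuse.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def gains(old_pct, new_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - old_pct
    relative = absolute / old_pct * 100
    return round(absolute, 1), round(relative, 1)


# Hypothetical toy data: selected options vs. the answer key for five questions.
toy_predictions = ["A", "C", "B", "D", "A"]
toy_answers = ["A", "C", "D", "D", "B"]
toy_accuracy = accuracy(toy_predictions, toy_answers)

# Accuracies reported in the text: Med-PaLM 67.4%, Med-PaLM 2 85.4%.
point_gain, relative_gain = gains(67.4, 85.4)

print(f"toy accuracy: {toy_accuracy:.1%}")
print(f"gain: {point_gain} percentage points ({relative_gain}% relative)")
```

Under these assumptions, the difference between the two reported accuracies is 18.0 percentage points, which corresponds to a relative improvement of about 26.7% over the earlier score.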
Importantly, in this work we go beyond multiple-choice accuracy to measure and improve model capabilities in medical question answering. Our model’s long-form answers were tested against 14 criteria — including scientific factuality, precision, medical consensus, reasoning, bias, and likelihood of possible harm — which were evaluated by clinicians and non-clinicians from a range of backgrounds and countries. Both Med-PaLM and Med-PaLM 2 perform encouragingly across three datasets of consumer medical questions.
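To make the idea of criterion-based human evaluation more concrete, the sketch below shows one simple way ratings could be summarized per criterion. This is an assumption for illustration only, not the rating protocol or analysis used in this research: the data layout, the function name `favorable_rate_by_criterion`, and the toy ratings are all hypothetical, and real evaluations typically involve richer rating scales and statistical analysis.

```python
from collections import defaultdict


def favorable_rate_by_criterion(ratings):
    """Summarize (criterion, is_favorable) pairs as a favorable fraction per criterion."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for criterion, is_favorable in ratings:
        totals[criterion] += 1
        favorable[criterion] += int(is_favorable)
    return {criterion: favorable[criterion] / totals[criterion] for criterion in totals}


# Hypothetical toy ratings: each tuple is one rater's judgment of one answer.
toy_ratings = [
    ("scientific factuality", True),
    ("scientific factuality", True),
    ("scientific factuality", False),
    ("likelihood of possible harm", True),
    ("likelihood of possible harm", True),
]

rates = favorable_rate_by_criterion(toy_ratings)
for criterion, rate in rates.items():
    print(f"{criterion}: {rate:.0%} favorable")
```

Reporting a separate rate for each criterion, rather than a single pooled score, keeps a weakness on one axis (for example, possible harm) from being hidden by strength on another.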
*Examples only. Med-PaLM 2 is currently being evaluated to ensure safe and responsible use.
The practice of medicine is inherently multi-modal and incorporates information from images, electronic health records, sensors, wearables, genomics and more. We believe AI systems that leverage these data at scale using self-supervised learning with careful consideration of privacy, safety, fairness and ethics will be the foundation of the next generation of learning health systems that scale world-class healthcare to everyone.
Building on a vision-language model developed by our robotics team called “PaLM-E”, we designed a multimodal version of Med-PaLM 2. This system can synthesize and communicate information from images such as chest X-rays and mammograms to help doctors provide better patient care. Within scope are several modalities alongside language: dermatology, retina, radiology (3D and 2D), pathology, health records and genomics. We’re excited to explore how this technology can benefit clinicians in the future.
While Med-PaLM 2 reached state-of-the-art performance on several multiple-choice medical question answering benchmarks, and our human evaluation shows answers compare favorably to physician answers across several clinically important axes, we know that more work needs to be done to ensure it is safely and effectively deployed.
Careful consideration will need to be given to the ethical deployment of this technology, including rigorous quality assessment when it is used in different clinical settings, with guardrails to mitigate the risks in such settings. For example, the potential harms of using an LLM for diagnosing or treating an illness are much greater than those of using an LLM for information about a disease or medication. Additional research will be needed to assess LLMs used in healthcare for homogenization and amplification of biases, and for security vulnerabilities inherited from base models.
We dive into many important areas for further research in our Med-PaLM preprint, and are excited to develop some of these areas in our forthcoming preprint for Med-PaLM 2.
In the immediate future, we will continue to advance our research on Med-PaLM 2, improving the model while evaluating Med-PaLM 2 across many axes such as safety, bias, and helpfulness.
Med-PaLM 2 will be made available in coming months to a select group of Google Cloud customers for limited testing, to explore use cases and share feedback, as we investigate safe, responsible, and meaningful ways to use this technology.
While this is exciting progress, there’s still a lot of work to be done to make sure this technology can work in real-world settings. Through our evaluation of Med-PaLM 2, we know that it isn’t ready for widespread adoption and does not yet meet our product excellence standards. We look forward to working with researchers and the global medical community to close these gaps and understand how this technology can help improve health delivery.
Scientific American: AI Chatbots Can Diagnose Medical Conditions at Home. How Good Are They?
CNBC: Google’s working on an updated version of its medical A.I. that can answer health questions
Med Page Today: Google AI Performs at 'Expert' Level on U.S. Medical Licensing Exam
New Scientist: Google's AI is best yet at answering medical and health questions
The Economist: A bioethicist and a professor of medicine on regulating AI in healthcare
Advisory Board: Are AI doctors on the horizon?
STAT: Google will let health care customers test its generative AI model, ramping up rivalry with GPT-4
MobiHealthNews: Google to offer limited access to medical LLM
Forbes: How Tech Leaders Compete In The Battle Of Healthcare AI
Med-PaLM research:
Karan Singhal*, Shekoofeh Azizi*, Tao Tu*, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Yu-Han Liu, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam**, and Vivek Natarajan**
* - equal contributions
** - equal leadership
Additional contributors:
Renee Wong, Kavita Kulkarni, Rory Sayres, Amy Wang, Mike Schaekermann, Sami Lachgar, Lauren Winer, Anna Iurchenko, Will Vaughan, Julie Wang, Ellery Wulczyn, Le Hou, Kevin Clark, Jonas Kemp, Jimmy Hu, Yuan Liu, Jonathan Krause, John Guilyard.
We thank Michael Howell, Cameron Chen, Basil Mustafa, David Fleet, Fayruz Kibria, Gordon Turner, Lisa Lehmann, Ivor Horn, Maggie Shiels, Shravya Shetty, Jukka Zitting, Evan Rapoport, Lucy Marples, Viknesh Sounderajah, Ali Connell, Jan Freyberg, Cian Hughes, Brett Hatfield, Gary Parakkal, Sudhanshu Sharma, Megan Jones-Bell, Susan Thomas, Martin Ho, Sushant Prakash, Bradley Green, Ewa Dominowska, Frederick Liu, Laura Culp, and Xuezhi Wang for their assistance, insights, and feedback during our research.
We are also grateful to Yossi Matias, Karen DeSalvo, Zoubin Ghahramani, James Manyika, and Jeff Dean for their support throughout this project.
If you are interested in exploring Med-PaLM via the Trusted Tester Program, please reach out to your Google Cloud sales representative. If you are a research-focused organization (e.g., academic medical institution) interested in a novel research partnership with the Med-PaLM team, please fill out this form.