Med-PaLM

A large language model from Google Research, designed for the medical domain.

Introduction

Med-PaLM is a large language model (LLM) designed to provide high-quality answers to medical questions.

Med-PaLM harnesses the power of Google’s large language models, which we have aligned to the medical domain with a set of carefully curated medical expert demonstrations. Our first version of Med-PaLM, preprinted in late 2022, was the first AI system to surpass the pass mark on US Medical Licensing Examination (USMLE) style questions. Med-PaLM also generates accurate, helpful long-form answers to consumer health questions, as judged by panels of physicians and users.

We introduced our latest model, Med-PaLM 2, at our annual health event, The Check Up. Med-PaLM 2 achieves an accuracy of 85.4% on USMLE questions. This performance is on par with “expert” test takers, and is an 18 percentage point leap over our own state-of-the-art results from Med-PaLM. We will soon release a preprint for Med-PaLM 2. In the coming months, Med-PaLM 2 will also be made available to a select group of Google Cloud customers for limited testing, to explore use cases and share feedback, as we investigate safe, responsible, and meaningful ways to use this technology.

Med-PaLM 2 reached 85.4% accuracy on the medical exam benchmark in research

Medical question–answering: a grand challenge for AI

Progress in AI over the last decade has enabled it to play an increasingly important role in healthcare and medicine. Breakthroughs such as the Transformer have enabled LLMs such as PaLM, along with other large models, to scale to billions of parameters, letting generative AI move beyond the limited pattern-spotting of earlier AI systems and into the creation of novel expressions of content, from speech to scientific modeling.

Developing AI that can answer medical questions accurately is a long-standing challenge that has seen several research advances over the past few decades. While the topic is broad, answering USMLE-style questions has recently emerged as a popular benchmark for evaluating medical question answering performance.

A 32-year-old woman comes to the physician because of fatigue, breast tenderness, increased urinary frequency, and intermittent nausea for 2 weeks. Her last menstrual period was 7 weeks ago. She has a history of a seizure disorder treated with carbamazepine. Physical examination shows no abnormalities. A urine pregnancy test is positive. The child is at greatest risk of developing which of the following complications?

A. Renal dysplasia
B. Meningocele
C. Sensorineural hearing loss
D. Vaginal clear cell carcinoma

Above is an example USMLE-style question. The test taker is presented with a vignette describing the patient, their symptoms, and their medications, and must select the correct multiple-choice option.

Answering the question accurately requires the reader to understand symptoms, examine findings from a patient’s tests, perform complex reasoning about the likely diagnosis, and ultimately, pick the right answer for what disease, test, or treatment is most appropriate. In short, a combination of medical comprehension, knowledge retrieval, and reasoning is necessary to do well. It takes years of training for clinicians to be able to accurately and consistently answer these questions.
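As a rough illustration of how multiple-choice performance on questions like this can be scored, the sketch below represents each question as a small record and computes accuracy as the fraction of predictions that match the answer key. The class name, field names, and toy items are hypothetical and chosen for illustration; this is not the evaluation code used for Med-PaLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceQuestion:
    """One exam-style item (hypothetical structure, for illustration only)."""
    vignette: str            # patient description and question stem
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the keyed (correct) option


def accuracy(questions, predictions):
    """Return the fraction of questions whose predicted letter matches the key."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(q.answer == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Toy items with made-up answer keys; they are not real exam questions.
placeholder_options = {"A": "option A", "B": "option B", "C": "option C", "D": "option D"}
toy_questions = [
    MultipleChoiceQuestion("toy vignette 1", placeholder_options, "B"),
    MultipleChoiceQuestion("toy vignette 2", placeholder_options, "C"),
    MultipleChoiceQuestion("toy vignette 3", placeholder_options, "D"),
    MultipleChoiceQuestion("toy vignette 4", placeholder_options, "A"),
]
toy_predictions = ["B", "C", "A", "A"]  # hypothetical model outputs

print(f"accuracy = {accuracy(toy_questions, toy_predictions):.0%}")
```

Here three of the four toy predictions match the key, so the reported accuracy is 75%.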

The generation capabilities of large language models also enable them to produce long-form answers to consumer medical questions. However, ensuring model responses are accurate, safe, and helpful has been a crucial research challenge, especially in this safety-critical domain.

Evaluating answer quality

We assessed Med-PaLM and Med-PaLM 2 against a benchmark we call ‘MultiMedQA’, which combines seven question answering datasets spanning professional medical exams, medical research, and consumer queries. Med-PaLM was the first AI system to obtain a passing score on USMLE questions from the MedQA dataset, with an accuracy of 67.4%. Med-PaLM 2 improves on this further with state-of-the-art performance of 85.4%, matching expert test-takers.
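To make the size of that improvement concrete, the short calculation below separates the absolute gain (in percentage points) from the relative gain (as a percentage of the earlier score). Only the two accuracy figures come from the text above; the variable names are illustrative.

```python
# Accuracy figures quoted in the text, in percent.
med_palm_accuracy = 67.4
med_palm_2_accuracy = 85.4

# Absolute improvement, measured in percentage points.
absolute_gain = med_palm_2_accuracy - med_palm_accuracy

# Relative improvement, measured as a percentage of the Med-PaLM score.
relative_gain = absolute_gain / med_palm_accuracy * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gain works out to 18.0 percentage points, which corresponds to a relative gain of roughly 26.7% over the Med-PaLM score.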

Importantly, in this work we go beyond multiple-choice accuracy to measure and improve model capabilities in medical question answering. Our model’s long-form answers were tested against 14 criteria — including scientific factuality, precision, medical consensus, reasoning, bias, and likelihood of possible harm — which were evaluated by clinicians and non-clinicians from a range of backgrounds and countries. Both Med-PaLM and Med-PaLM 2 perform encouragingly across three datasets of consumer medical questions.
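One simple way to summarize this kind of multi-criterion human evaluation is to compute, for each criterion, the share of rated answers that received a favorable judgment. The sketch below assumes ratings arrive as (criterion, favorable) pairs; the function name, data layout, and toy ratings are hypothetical and do not reflect the actual study data or analysis.

```python
from collections import defaultdict


def favorable_rate_by_criterion(ratings):
    """Map each criterion to the fraction of its ratings that were favorable.

    `ratings` is an iterable of (criterion, is_favorable) pairs.
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for criterion, is_favorable in ratings:
        totals[criterion] += 1
        favorable[criterion] += bool(is_favorable)
    return {name: favorable[name] / totals[name] for name in totals}


# Made-up ratings for three answers on two criteria, for illustration only.
toy_ratings = [
    ("scientific consensus", True),
    ("scientific consensus", True),
    ("scientific consensus", True),
    ("missing information", True),
    ("missing information", False),
    ("missing information", True),
]

rates = favorable_rate_by_criterion(toy_ratings)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} favorable")
```

A per-criterion breakdown like this keeps a weakness on one axis (for example, omitted information) visible instead of averaging it away into a single score.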

Check out how Med-PaLM 2 answers medical questions

How long does it take for newborn jaundice to go away?

Clinician answer
Urinary incontinence can be cured in some cases, but this depends on its cause. If it's due to an underlying urinary tract infection or constipation, treating these issues will immediately stop the incontinence. If it's due to muscle weakness, such as that caused by childbirth, strengthening exercises to build muscle tone will help and can even cure it completely, over time. If it's due to an underlying nerve problem, damage to the spinal cord, or a structural abnormality, this is usually much harder to treat and will need to be assessed on a case-by-case basis.

Med-PaLM 2 answer
Incontinence means the accidental loss of urine or stool. It can be caused by a variety of factors, such as weak muscles, nerve damage, or an overactive bladder. In some cases, incontinence can be cured with treatment, such as medication or surgery. However, in many cases, incontinence is a chronic condition that can be managed but not cured. Treatment may involve lifestyle changes, such as avoiding certain foods and drinks, bladder training, and using absorbent pads or garments. If you are experiencing incontinence, it is important to see a doctor for evaluation and treatment.

Clinician review of Med-PaLM 2 answer
Reflects clinical and scientific consensus
Low likelihood of harm
Correct reading comprehension
Correct knowledge recall
Correct reasoning
Only includes relevant content
No missing important information
No demographic biases

What foods to avoid if you have rosacea?

Clinician answer
Some foods and drinks that dermatologists recommend you avoid include dairy, alcohol, spicy foods, hot drinks and foods with histamine (citrus fruit, legumes, tomatoes, nuts and chocolate) or cinnamon in them.

Med-PaLM 2 answer
Rosacea is a chronic skin condition that causes redness and swelling of the face. It can also cause small, pus-filled bumps to develop on the skin. The exact cause of rosacea is unknown, but certain factors can trigger or worsen the condition. Some of the foods that can trigger rosacea include: spicy foods, alcohol, hot beverages, and foods that contain histamines (such as aged cheeses, cured meats, and fermented foods). If you have rosacea, it is important to avoid these foods and any other foods that you know trigger your symptoms. You may also want to keep a food diary to track the foods that you eat and how they affect your rosacea. This can help you identify the specific foods that are causing your symptoms and make it easier to avoid them.

Clinician review of Med-PaLM 2 answer
Reflects clinical and scientific consensus
Low likelihood of harm
Correct reading comprehension
Correct knowledge recall
Correct reasoning
Only includes relevant content
Missing some important information, but with low clinical significance
No demographic biases

How long does it take for newborn jaundice to go away?

Clinician answer
Diverticula are small outpouchings in the inner lining of the large intestine that are common as one gets older. Diverticulosis is the medical term for this condition if the outpouching does not result in symptoms. If the diverticula become inflamed or infected, it is called diverticulitis. Constipation, diarrhoea, fever, and stomach discomfort can all be symptoms. Differentiating between diverticulitis and diverticulosis can be done using a CT scan.

Med-PaLM 2 answer
CT scans are often used to diagnose diverticulitis and can also detect diverticulosis. CT scans provide detailed images of the inside of the body, and can show the presence of diverticula (small pouches that can form in the lining of the intestine). In diverticulitis, these pouches can become inflamed and infected, and a CT scan can show the extent of the inflammation and infection. CT scans can also be used to rule out other possible causes of the symptoms, such as appendicitis or a bowel obstruction. If you are experiencing symptoms such as abdominal pain, fever, or changes in bowel habits, your doctor may order a CT scan to diagnose the cause.

Clinician review of Med-PaLM 2 answer
Reflects clinical and scientific consensus
Low likelihood of harm
Correct reading comprehension
Correct knowledge recall
Correct reasoning
Only includes relevant content
No missing important information
No demographic biases

*Examples only. Med-PaLM 2 is currently being evaluated to ensure safe and responsible use.

Extending Med-PaLM 2 Beyond Language

The practice of medicine is inherently multimodal and incorporates information from images, electronic health records, sensors, wearables, genomics, and more. We believe AI systems that leverage these data at scale using self-supervised learning, with careful consideration of privacy, safety, fairness, and ethics, will be the foundation of the next generation of learning health systems that scale world-class healthcare to everyone.

Building on a vision-language model developed by our robotics team called “PaLM-E”, we designed a multimodal version of Med-PaLM 2. This system can synthesize and communicate information from images such as chest X-rays and mammograms to help doctors provide better patient care. Within scope are several modalities alongside language: dermatology, retina, radiology (3D and 2D), pathology, health records, and genomics. We’re excited to explore how this technology can benefit clinicians in the future.

Limitations

While Med-PaLM 2 reached state-of-the-art performance on several multiple-choice medical question answering benchmarks, and our human evaluation shows answers compare favorably to physician answers across several clinically important axes, we know that more work needs to be done to ensure it is safely and effectively deployed.

Careful consideration will need to be given to the ethical deployment of this technology, including rigorous quality assessment in each clinical setting where it is used and guardrails to mitigate the risks specific to those settings. For example, the potential harms of using an LLM to diagnose or treat an illness are much greater than those of using an LLM to provide information about a disease or medication. Additional research will be needed to assess LLMs used in healthcare for homogenization and amplification of biases, and for security vulnerabilities inherited from base models.

We dive into many important areas for further research in our Med-PaLM preprint, and are excited to develop some of these areas in our forthcoming preprint for Med-PaLM 2.

Next steps

In the immediate future, we will continue to advance our research on Med-PaLM 2, improving the model while evaluating Med-PaLM 2 across many axes such as safety, bias, and helpfulness.

Med-PaLM 2 will be made available in coming months to a select group of Google Cloud customers for limited testing, to explore use cases and share feedback, as we investigate safe, responsible, and meaningful ways to use this technology.

Conclusions

While this is exciting progress, there’s still a lot of work to be done to make sure this technology can work in real-world settings. Through our evaluation of Med-PaLM 2, we know that it isn’t ready for widespread adoption and does not yet meet our product excellence standards. We look forward to working with researchers and the global medical community to close these gaps and understand how this technology can help improve health delivery.

Acknowledgements

Med-PaLM research: 

Karan Singhal*, Shekoofeh Azizi*, Tao Tu*, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Yu-Han Liu, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam**, and Vivek Natarajan**

* - equal contributions

** - equal leadership

Additional contributors: 

Renee Wong, Kavita Kulkarni, Rory Sayres, Amy Wang, Mike Schaekermann, Sami Lachgar, Lauren Winer, Anna Iurchenko, Will Vaughan, Julie Wang, Ellery Wulczyn, Le Hou, Kevin Clark, Jonas Kemp, Jimmy Hu, Yuan Liu, Jonathan Krause, John Guilyard.

We thank Michael Howell, Cameron Chen, Basil Mustafa, David Fleet, Fayruz Kibria, Gordon Turner, Lisa Lehmann, Ivor Horn, Maggie Shiels, Shravya Shetty, Jukka Zitting, Evan Rapoport, Lucy Marples, Viknesh Sounderajah, Ali Connell, Jan Freyberg, Cian Hughes, Brett Hatfield, Gary Parakkal, Sudhanshu Sharma, Megan Jones-Bell, Susan Thomas, Martin Ho, Sushant Prakash, Bradley Green, Ewa Dominowska, Frederick Liu, Laura Culp, and Xuezhi Wang for their assistance, insights, and feedback during our research. 

We are also grateful to Yossi Matias, Karen DeSalvo, Zoubin Ghahramani, James Manyika, and Jeff Dean for their support throughout this project.

Contact Us

If you are interested in exploring Med-PaLM via the Trusted Tester Program, please reach out to your Google Cloud sales representative. If you are a research-focused organization (e.g., academic medical institution) interested in a novel research partnership with the Med-PaLM team, please fill out this form.