2023 Year in Review
Rajpurkar Lab

A quest for Generalist Medical AI, 24 studies advancing foundations and applications, and openings to join our efforts.

Pranav Rajpurkar
Dec 18, 2023

As 2023 wraps up, I’m eager to share research highlights from our team’s latest medical AI advances. We’ve been on a mission to develop artificial intelligence that can match exceptional doctors in both specialty tasks and flexible clinical reasoning. The major progress we made this year brings us closer to that goal.

Our primary 2023 focus has been developing Generalist Medical AI (GMAI) systems. As introduced in our Nature perspective with Eric Topol, Michael Moor, and colleagues, GMAI systems are designed to closely resemble doctors in their ability to reason through a wide range of medical tasks, incorporate multiple data modalities, and communicate in natural language. GMAI stands in stark contrast to the current generation of medical AI models. Our New England Journal of Medicine review with Matt Lungren notes that despite advances in AI for medical image interpretation, today’s models can handle only a limited set of interpretation tasks and rely heavily on curated data that have been specifically labeled and categorized. Rapid progress in multimodal large language models represents a major opportunity to develop GMAI models that can tackle the full spectrum of image-interpretation tasks and more.

Below, I highlight three concrete sets of advances we made this year in each of two areas: Advancing Foundations of Medical AI and Powering Clinical Applications.

Advancing Foundations of Medical AI

Tackling Data Challenges – Progress in medical AI requires addressing key data challenges: limited availability, inconsistent methodology across modalities, and lack of patient diversity. Medical imaging datasets often lack sufficient labeled examples to enable robust model development. This year, we developed learning strategies that reduce reliance on extensive data annotation (Ke et al., Mishra et al.). We introduced an Image-Graph Contrastive Learning framework for label-efficient learning (Khanna et al.), in which we paired chest X-rays with structured report graphs derived from radiology notes and demonstrated improved few-shot disease detection. Beyond data availability, the inconsistency in methodologies applied across different medical data modalities slows progress. To address this, in an effort co-led with Alex Tamkin, we developed BenchMD (Wantlin et al.), a benchmark that measures the progress of unified methods, including model architectures and self-supervised pre-training techniques, across medical imaging and sensor data modalities. A further barrier is that public datasets are limited and unrepresentative; we pioneered the MAIDA initiative for coordinated global sharing of medical images, assembling heterogeneous datasets to evaluate model fairness and safety (Saenz et al., Lancet Digital Health).
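For the technically curious, here is a minimal sketch of an image-graph contrastive objective, assuming a generic image encoder and a graph encoder over report graphs; the loss below is a standard symmetric InfoNCE, not the exact Khanna et al. implementation:

```python
import torch
import torch.nn.functional as F

def image_graph_contrastive_loss(image_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired image and report-graph embeddings.

    image_emb, graph_emb: (batch, dim) tensors from an image encoder and a
    graph encoder (e.g., a GNN over RadGraph-style report graphs).
    Matched X-ray/report pairs sit on the diagonal of the similarity matrix.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = image_emb @ graph_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull each X-ray toward its own report graph, and vice versa.
    loss_i2g = F.cross_entropy(logits, targets)
    loss_g2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2g + loss_g2i) / 2
```

The payoff of training against report graphs rather than raw report text is label efficiency: the structured supervision can support few-shot disease detection without per-image annotation.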

Safeguarding Translation Into Clinical Settings – In our review (Han et al.), we identified 84 randomized trials evaluating the impact of medical AI in clinical use, demonstrating global progress in clinical translation across medical specialties. This year, we made progress on frameworks for safely integrating AI systems into clinical practice while addressing real-world limitations. An important open question is whether expert-level AI reliably improves the performance of its clinical users. In collaboration with MIT economists Nikhil Agarwal, Tobias Salz, and Alex Moehring, we performed a large-scale study with more than 100 radiologists evaluating the effects of medical AI on decision making, formalizing the factors behind radiologists’ lack of improvement despite using an AI that was more accurate than two-thirds of them (Agarwal et al.). In another study, we used real-world usage data from a deployed radiology AI diagnostic model to develop a decision pipeline (Sanchez et al., Cell Reports Medicine) that supports the diagnostic model with an ecosystem of models integrating disagreement prediction, clinical significance categorization, and prediction quality modeling to reduce clinician burden. Finally, in a perspective (Saenz et al., NPJ Digital Med), we explored deploying medical AI models autonomously, without clinician oversight, summarizing considerations around liability, regulation, and cost.
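To make the ecosystem-of-models idea concrete, here is a toy sketch of how supporting models might gate a diagnostic model’s outputs; the component scores, thresholds, and routing rules are hypothetical illustrations, not the actual Sanchez et al. pipeline:

```python
from dataclasses import dataclass

@dataclass
class SupportingScores:
    disagreement_prob: float      # predicted chance a clinician would disagree
    clinically_significant: bool  # would the finding change management?
    prediction_quality: float     # estimated reliability of this prediction

def route_case(ai_positive: bool, scores: SupportingScores,
               disagree_thresh: float = 0.3, quality_thresh: float = 0.7) -> str:
    """Toy routing logic: surface only the cases worth a clinician's attention."""
    if not scores.clinically_significant:
        return "log_only"             # would not change management; don't interrupt
    if scores.prediction_quality < quality_thresh:
        return "clinician_review"     # the model is unsure; defer to a human
    if scores.disagreement_prob > disagree_thresh:
        return "clinician_review"     # likely human-AI conflict; flag it
    return "auto_report" if ai_positive else "auto_clear"
```

The design choice being illustrated is that the diagnostic model itself never changes; the surrounding models decide when its output is trustworthy enough to act on without adding to clinician workload.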

Auditing Generative AI for Medicine – This has been a year of rapid advances in generative AI, with large language models (LLMs) able to generate coherent text and latent diffusion models able to create realistic synthetic images. We see strong potential for these technologies in medicine when they are developed responsibly. With Roxana Daneshjou at Stanford, we assessed the ability of LLMs, including GPT-4, to conduct medical diagnoses through conversational interactions (Johri et al.). Our analysis exposed limitations in how LLMs integrate conversational details, providing insights for further development. We were also part of collaborative efforts led by (1) Arjun Manrai’s lab at Harvard DBMI, analyzing latent diffusion models’ ability to generate synthetic images of skin disease (Sagers et al.); (2) Javier Alvarez-Valle’s group at Microsoft Research, assessing GPT-4 on radiology report analysis tasks (Liu et al., EMNLP); and (3) Michael Moor in Jure Leskovec’s group at Stanford, developing Med-Flamingo (Moor et al., ML4H), the first few-shot learner for multimodal medical tasks. Through these diverse evaluations, we are gathering key insights, both promises and current limitations, into translating these powerful technologies to clinical medicine.

Powering Clinical Applications

Radiology Report Generation – This year, we made significant advances in developing systems that accurately generate full radiology reports from medical images, and in proposing metrics for the clinical evaluation of such systems. We introduced X-REM (Jeong et al.), a retrieval-based radiology report generation module that outperforms previous approaches on both natural language and clinical evaluation metrics by capturing the fine-grained interaction between an image and its corresponding report text. Because evaluating the clinical accuracy of generated reports is challenging, we proposed new automated metrics, RadGraph F1 and RadCliQ, that correlate more strongly with radiologists’ judgments (Yu et al., Patterns). We also created datasets to enable robust clinical evaluation: the Radiology Report Expert Evaluation (ReXVal) dataset, containing radiologist annotations of error types, and the ReFiSco dataset, in which radiologists assess, score, and edit AI-generated reports.
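At its core, RadGraph F1 compares the clinical entities (and relations) extracted from a generated report against those extracted from the reference report. A minimal entity-level sketch, with hand-built entity sets standing in for a RadGraph-style extractor, might look like:

```python
def radgraph_style_f1(pred_entities: set, ref_entities: set) -> float:
    """Entity-overlap F1 between entities extracted from a generated report
    and from the radiologist's reference report. Entities here are assumed
    to be (token, label) pairs from a RadGraph-style extractor."""
    if not pred_entities and not ref_entities:
        return 1.0
    tp = len(pred_entities & ref_entities)
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(ref_entities) if ref_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: ("effusion", "OBS-DP") = an observation, definitely present
pred = {("effusion", "OBS-DP"), ("left", "ANAT")}
ref = {("effusion", "OBS-DP"), ("pleural", "ANAT"), ("left", "ANAT")}
print(radgraph_style_f1(pred, ref))  # 0.8
```

Scoring overlap in extracted clinical entities, rather than surface n-grams as BLEU-style metrics do, is what lets such metrics track clinical accuracy more closely.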

We also addressed an important shortcoming of existing report generation methods: they conflate report content (e.g., findings) with style choices (e.g., formatting), which can reduce clinical accuracy. We introduced a style-aware approach that first extracts content from an image using RadGraph, our graph representation for capturing clinical concepts and relationships in reports, and then renders that content in a style matching radiologist preferences (Yan et al.). Finally, this year we extended RadGraph to RadGraph2 (Khanna et al.) to better model disease progression across a patient’s longitudinal exams over time. Together, these advances bring generated radiology reports closer to expert quality in accuracy, evaluation, and adaptability to radiologist needs and preferences.
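As a toy illustration of the content/style separation, the same extracted findings can be rendered in different reporting styles; everything below (the content tuples, the two styles) is illustrative and not the Yan et al. interface:

```python
# Stage 1 output (stubbed): findings as structured content, style-free.
CONTENT = [("pleural effusion", "present", "left")]

def render(content, style):
    """Stage 2: verbalize the same findings in a target reporting style."""
    lines = []
    for finding, status, location in content:
        if style == "terse":
            lines.append(f"{location} {finding}: {status}.")
        else:  # narrative style
            lines.append(f"A {finding} is {status} in the {location} hemithorax.")
    return " ".join(lines)

print(render(CONTENT, "terse"))      # left pleural effusion: present.
print(render(CONTENT, "narrative"))  # A pleural effusion is present in the left hemithorax.
```

Because the content graph is fixed before styling, changing the presentation cannot silently add or drop findings, which is the clinical-accuracy benefit of the separation.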

Precision Diagnostics and Treatment – We have worked on developing interpretable AI methods for diagnostics and treatment selection in cancer. For example, in pancreatic ductal adenocarcinoma (PDAC), we collaborated with Valar Labs on an AI approach that extracts, from histologic features, a signature predictive of disease-specific survival in patients receiving adjuvant gemcitabine chemotherapy (Nimgaonkar et al., Cell Rep Med). We showed that the signature is associated with outcomes following adjuvant gemcitabine but not in untreated patients, suggesting that it specifically predicts treatment-related outcomes rather than being generally prognostic. The imaging analysis pipeline shows promise for developing clinically actionable biomarkers where few currently exist to guide PDAC therapy.
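The predictive-versus-prognostic distinction is typically tested with a treatment-by-signature interaction term in a survival model. Here is a self-contained sketch using the lifelines library on synthetic data; the cohort, effect sizes, and column names are invented for illustration and are not the study’s actual analysis:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic stand-in cohort (illustrative only; not the study's data).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "signature": rng.normal(size=n),            # AI-derived histologic score
    "gemcitabine": rng.integers(0, 2, size=n),  # 1 = adjuvant gemcitabine
})
# Simulate survival where the signature matters only under treatment.
hazard = np.exp(0.8 * df["signature"] * df["gemcitabine"])
df["time"] = rng.exponential(24 / hazard)       # months of follow-up
df["event"] = 1                                 # all deaths observed (toy setup)
df["signature_x_treatment"] = df["signature"] * df["gemcitabine"]

# A significant interaction coefficient means the signature stratifies
# outcomes specifically in treated patients (predictive), rather than in
# everyone (merely prognostic).
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()
```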

Additionally, in collaboration with pathology colleagues at Stanford, in an effort co-led with Sebastian Fernandez-Pol, we introduced LymphoML (Shankar et al., ML4H), an interpretable machine learning approach that identifies morphologic features associated with lymphoma subtypes in H&E-stained tissues. The method rivals pathologists’ accuracy in lymphoma subclassification while providing explanations through its learned visual features.

Emergency Department Monitoring – In close collaboration with David Kim in the Stanford Emergency Department (ED), our team aims to develop flexible multimodal AI systems for patient monitoring. Combining modalities like waveforms, vital signs, medications, and imaging reports offers an opportunity to greatly improve monitoring of patients during their ED stay and at home. We developed multimodal machine learning approaches that combine continuous physiologic monitoring data with standard triage information to predict adverse events like tachycardia, hypotension, and hypoxia in emergency department patients (Sundrani et al., NPJ Digital Med).
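As a rough sketch of this kind of fusion, one might summarize the monitoring time series with a recurrent encoder and concatenate it with embedded triage features; the architecture, dimensions, and names below are illustrative assumptions, not the Sundrani et al. model:

```python
import torch
import torch.nn as nn

class EDEventPredictor(nn.Module):
    """Late-fusion sketch: continuous physiologic monitoring + triage features
    -> probabilities of adverse events (tachycardia, hypotension, hypoxia)."""

    def __init__(self, n_vitals=4, n_triage=16, hidden=64, n_events=3):
        super().__init__()
        # GRU summarizes the monitoring time series (batch, time, n_vitals).
        self.waveform_encoder = nn.GRU(n_vitals, hidden, batch_first=True)
        # MLP embeds static triage data (age, complaint codes, initial vitals...).
        self.triage_encoder = nn.Sequential(nn.Linear(n_triage, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_events)

    def forward(self, vitals_seq, triage):
        _, h = self.waveform_encoder(vitals_seq)    # h: (1, batch, hidden)
        fused = torch.cat([h.squeeze(0), self.triage_encoder(triage)], dim=-1)
        return torch.sigmoid(self.head(fused))      # per-event probabilities

model = EDEventPredictor()
probs = model(torch.randn(8, 120, 4), torch.randn(8, 16))  # 2h of minute-level vitals
```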

To spur further progress in robust multimodal ED AI, we introduced the Multimodal Clinical Benchmark for Emergency Care (MC-BEC) (Chen et al., NeurIPS D&B). MC-BEC is designed to encourage the development and evaluation of robust, clinically useful foundation models for multimodal data analysis in emergency medicine. It focuses evaluation on key facets (performance across modalities, tolerance to missing modalities, and bias/fairness) to assess in depth how models use heterogeneous data sources, and it emphasizes multitask learning across predictions like decompensation, disposition, and revisits over different timescales. Additionally, with Vijay Reddi and colleagues, we proposed ClinAIOps (Chen et al., Nature BME), a concept for integrating continuous monitoring with AI-guided treatment adjustment in critical care settings. ClinAIOps uses feedback loops that enable providers, patients, and algorithms to collaboratively adjust care plans hour-to-hour as conditions evolve.
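To make the missing-modality facet concrete, an evaluation harness can re-score a model with each modality masked in turn; the modality names and model interface below are assumptions for illustration, not the MC-BEC API:

```python
from sklearn.metrics import roc_auc_score

MODALITIES = ["vitals", "waveforms", "meds", "imaging_reports"]  # illustrative

def missing_modality_audit(model, batch, labels):
    """Score a multimodal model with each modality masked in turn.

    model(batch) -> predicted probabilities; batch maps modality name -> data.
    A robust model should degrade gracefully per dropped input, not collapse.
    """
    results = {"full": roc_auc_score(labels, model(batch))}
    for mod in MODALITIES:
        masked = dict(batch)
        masked[mod] = None  # model is expected to tolerate absent inputs
        results[f"without_{mod}"] = roc_auc_score(labels, model(masked))
    return results
```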

The Road Ahead

Since the release of our Nature publication introducing GMAI in April 2023, there have been rapid demonstrations of promising GMAI capabilities by talented research groups across medical domains. Teams have showcased systems integrating multimodal data and flexible reasoning to address needs from radiology to pathology and beyond. Notable examples include RadFM from Weidi Xie’s lab at Shanghai Jiao Tong University; Med-PaLM M from the team of Vivek Natarajan & Alan Karthikesalingam at Google; PathChat for computational pathology from Faisal Mahmood’s lab at Harvard; RO-LLaMA from Jong Chul Ye’s lab at KAIST; MAIRA-1 from Javier Alvarez-Valle’s team at Microsoft Research; and VisionFM from the teams of Wu Yuan & Ningli Wang at The Chinese University of Hong Kong & Beijing Tongren Hospital. These rapid advances spotlight the enthusiasm around generalist AI’s potential to transform medicine.

Even more meaningful opportunities await in 2024 as we continue partnering with the community to advance cutting-edge medical AI for patient benefit.

Acknowledgements

The work covered in this post included contributions of current and former lab + Medical AI Bootcamp members, including Oishi Banerjee, Shreya Johri, Emma Chen, Agustina Saenz, Jaehwan Jeong, Kathy Yu, Ryan Han, Wendy Erselius, Heather Viana, Benjamin Yan, Alex Ke, Kathryn Wantlin, Julian Acosta, Vivek Shankar, Xiaoli Yang, Sameer Sundrani, Julie Chen, Aman Kansal, Kostas Tingos, Morgan Sanchez, Sameer Khanna, Adam Dejl, Kyle Alford, Chenwei Wu, Mark Endo, Rayan Krishnan, Vrishab Krishna, Daniel Michael, Ruochen Liu, David Kuo, Subathra Adithan, Stephen Kwak, Katherine Tian, Andrew Li, Sina Hartung, Fang Cao, Raja Narayan, Farah Dadabhoy, Veeral Mehta, Benjamin Tran, Henrik Marklund.

I have had the privilege of working with many collaborators and friends on the work shared above, including Eric Topol, David Kim, Matt Lungren, Nikhil Agarwal, Tobias Salz, Alex Moehring, Roxana Daneshjou, Vijay Reddi, Sebastian Fernandez-Pol, Andrew Ng, Curt Langlotz, Michael Moor, Yasodha Natkunam, Eduardo Reis, Vasantha Venugopal, Chloe O'Connell, Steven Truong, Kibo Yoon, Quoc Truong, Hanh Duong, Cyril Zakka, Ian Pan, Andy Tsai, Eduardo Fonseca, Fardad Behzadi, Juan Calle, Daniel Schlessinger, Shannon Wongvibulsin, Zhuo Cai, Vivek Nimgaonkar, Viswesh Krishna, Ekin Tiu, Anirudh Joshi, Damir Vrabac, David Osayande, Michael Pohlen, Rajat Mittal, Christy Jestin, Tom Jin, Luke Sagers, James Diao, Raj Manrai, Marinka Zitnik, Jennifer Sullivan, Cassandra Perry, Susanne Churchill, Zak Kohane, Shvetank Prakash, Zach Harned, Mars Huang, Harlan Krumholz, Michael Abràmoff, Ozan Oktay, Javier Alvarez-Valle, Errol Colak, Adewole Adamson, John Ioannidis, Laura Heacock, Geoffrey Tison, and Zahra Shakeri.

Join Our Efforts

As we continue working towards realizing the potential of Generalist Medical AI (GMAI), we have openings across multiple levels to join our efforts:

Postdoctoral Fellows: We have openings for postdocs with expertise in vision-language models, LLMs, multimodal learning, self-supervised learning, and related areas to advance GMAI capabilities. Apply here.

Clinician Scientists: Clinicians can collaborate with us through the Medical AI Bootcamp to guide grounded AI development.

Harvard/Stanford Students: Students can get involved in medical AI research through the Medical AI Bootcamp.

MAIDA Manager: We have an opening for a Relationship Development Manager to lead coordination efforts for the global MAIDA medical imaging data sharing initiative.