MAST: Medical AI Superintelligence Test
Introducing MAST, our vision for a suite of realistic clinical benchmarks to evaluate the real-world performance of medical AI systems.
First, Do NOHARM is the foundational benchmark of the MAST suite. It establishes a new framework to assess the clinical safety and accuracy of AI-generated medical recommendations.
Models
Compare model performance on a variety of metrics
Overall performance across Safety, Completeness, and Restraint (harmonic mean)
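The page states only that the overall score is a harmonic mean of Safety, Completeness, and Restraint. The sketch below assumes an unweighted harmonic mean of three scores on a common 0-1 scale, H = 3 / (1/S + 1/C + 1/R). The function name `harmonic_mean` and the example scores are hypothetical and illustrative, not benchmark results. The sketch shows why this choice matters: unlike an arithmetic mean, the harmonic mean is pulled toward the weakest metric, so a model cannot offset poor safety with high restraint.

```python
from math import isclose


def harmonic_mean(scores: dict[str, float]) -> float:
    """Unweighted harmonic mean of non-negative metric scores.

    Assumption: all scores share one scale (e.g. 0-1). If any score is 0,
    the harmonic mean is defined as 0, so we return 0.0 directly.
    """
    values = list(scores.values())
    if not values:
        raise ValueError("need at least one score")
    if any(v < 0 for v in values):
        raise ValueError("scores must be non-negative")
    if any(v == 0 for v in values):
        return 0.0
    # n divided by the sum of reciprocals
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical scores for one model (illustrative values only).
example = {"Safety": 0.8, "Completeness": 0.6, "Restraint": 0.9}
overall = harmonic_mean(example)
arithmetic = sum(example.values()) / len(example)
print(f"harmonic={overall:.4f} arithmetic={arithmetic:.4f}")
```

For these inputs the harmonic mean (about 0.745) falls below the arithmetic mean (about 0.767). A lopsided profile such as Safety 0.2 with the other two metrics at 1.0 drops to 3/7, roughly 0.43, which is far below its arithmetic mean of about 0.73.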
The NOHARM Benchmark
Benchmark overview
- About
- NOHARM is a physician-validated medical benchmark, grounded in real medical cases, that evaluates the accuracy and safety of AI-generated medical recommendations. The current version covers 100 cases across 10 specialties and includes 12,747 specialist annotations of beneficial and harmful medical actions that could be taken in those cases. The project is led and supported by the ARISE AI Research Network, based at Stanford and Harvard.
- Motivation
- As physicians, we hold "do no harm" as one of our core principles. As AI technologies are rapidly integrated into medicine, how can we evaluate the harms these technologies may cause? How do these models perform compared with each other and, importantly, compared with us?
- Study
- For details, see our study.
- Submissions
- Please see the MAST GitHub Repository for information and instructions on participating.
- Contact
- Reach out to our team.
Study Authors
David Wu (dwu@mgh.harvard.edu), Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Ernest Y. Lee, Allen Shih, Vartan Pahalyants, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Adam Rodman, Jonathan H. Chen, Ethan Goh