NOHARM

MAST: Medical AI Superintelligence Test

Introducing MAST, our vision for a suite of realistic clinical benchmarks to evaluate real-world performance of medical AI systems.

First, Do NOHARM is the foundational benchmark of the MAST suite, and establishes a new framework to assess clinical safety and accuracy in AI-generated medical recommendations.

Models

Compare model performance on a variety of metrics

Higher is better

Best

Selected

Worst

TEAM CONFIGURATION

Metric Explorer

Analyze multiple metrics

Hover for model details

The NOHARM Benchmark

Benchmark overview

About
NOHARM is a physician-validated medical benchmark to evaluate the accuracy and safety of AI-generated medical recommendations, grounded in real medical cases. The current version covers 10 specialties across 100 cases, and includes 12,747 specialist annotations on beneficial and harmful medical actions that can be taken in the 100 cases. This project is led and supported by the ARISE AI Research Network, based at Stanford and Harvard.
Motivation
As physicians, one of our core principles is to do no harm. With the rapid integration of AI technologies into medicine, how can we evaluate the harm of technologies? How do we evaluate how these models perform, compared to each other, and importantly, to ourselves?
Study
For details, see our study.
Submissions
An automated submission portal is in the works. In the meanwhile, please contact us if you are interested benchmarking your model and inclusion in the leaderboard.

Study Authors

David Wu (dwu@mgh.harvard.edu), Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Ernest Y. Lee, Allen Shih, Vartan Pahalyants, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Adam Rodman, Jonathan H. Chen, Ethan Goh