NOHARM

MAST: Medical AI Superintelligence Test

Introducing MAST, our vision for a suite of realistic clinical benchmarks to evaluate real-world performance of medical AI systems.

First, Do NOHARM is the foundational benchmark of the MAST suite and establishes a new framework for assessing the clinical safety and accuracy of AI-generated medical recommendations.

Models

Compare model performance on a variety of metrics

Overall performance across Safety, Completeness, and Restraint (harmonic mean). Higher is better.
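
As a rough illustration of how a harmonic mean combines the three component scores, here is a minimal sketch. It assumes each score lies in [0, 1] and that the three are weighted equally. The page does not state either detail, and the function name `overall_score` is hypothetical rather than taken from the benchmark's code.

```python
def overall_score(safety: float, completeness: float, restraint: float) -> float:
    """Unweighted harmonic mean of three scores (assumed to lie in [0, 1])."""
    scores = (safety, completeness, restraint)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores are assumed to lie in [0, 1]")
    # A zero in any component drives the harmonic mean to zero.
    if any(s == 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


if __name__ == "__main__":
    # One weak component pulls the result well below the arithmetic mean (0.7).
    print(round(overall_score(0.9, 0.9, 0.3), 2))
```

Unlike an arithmetic mean, the harmonic mean is dominated by the lowest component, so under these assumptions a model cannot offset poor Safety with high Completeness or Restraint.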

Metric Explorer

Analyze multiple metrics

The NOHARM Benchmark

Benchmark overview

About
NOHARM is a physician-validated medical benchmark for evaluating the accuracy and safety of AI-generated medical recommendations, grounded in real medical cases. The current version covers 100 cases across 10 specialties and includes 12,747 specialist annotations on the beneficial and harmful medical actions that can be taken in those cases. This project is led and supported by the ARISE AI Research Network, based at Stanford and Harvard.
Motivation
As physicians, one of our core principles is to do no harm. With the rapid integration of AI technologies into medicine, how can we evaluate the harm these technologies may cause? How do these models perform compared to each other and, importantly, to ourselves?
Study
For details, see our study.
Submissions
Please see the MAST GitHub Repository for information and instructions on participating.

Study Authors

David Wu (dwu@mgh.harvard.edu), Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Ernest Y. Lee, Allen Shih, Vartan Pahalyants, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Adam Rodman, Jonathan H. Chen, Ethan Goh