This study evaluates the performance of multimodal AI models in medical diagnostics using the NEJM Image Challenge dataset, comparing their accuracy to human collective intelligence.
1️⃣ Anthropic's Claude 3 models showed the highest accuracy, surpassing average human performance by about 10%.
2️⃣ Human collective intelligence achieved a 90.8% accuracy rate, outperforming all AI models.
3️⃣ GPT-4 Vision Preview was selective, often responding to easier questions with smaller images and longer texts.
4️⃣ OpenAI's GPT-4 Vision Preview answered only 76% of the cases, while the other models responded to all queries.
5️⃣ The study highlights the potential and current limitations of multimodal AI in clinical diagnostics.
6️⃣ Ethical and reliability concerns arise from integrating multimodal AI into medical diagnostics.
7️⃣ The EU AI Act emphasizes the need for transparency, robustness, and human oversight in high-risk AI systems, including medical AI.
✍🏻 Robert Kaczmarczyk, Theresa Isabelle Wilhelm, Dr. med. Ron Martin, B.Sc., Dr. med. Jonas Roos. Evaluating multimodal AI in medical diagnostics. npj Digital Medicine. 2024. DOI: 10.1038/s41746-024-01208-3
#Accuracy is the primary metric reported for such systems, but accuracy can be #inflated, intentionally or not: by training and testing on the same or overlapping data sets, or by evaluating on images that closely resemble the training data. What other metrics do we need to examine, and what underlying #assumptions do we need to understand, before accepting the reported #performance of these #AI models?
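One concrete way to see the comment's point: on an imbalanced diagnostic test set, raw accuracy can look strong while sensitivity is poor. The sketch below uses entirely made-up confusion-matrix counts (not figures from the study) to show why balanced accuracy and per-class recall are worth checking alongside accuracy.

```python
# Minimal sketch: why raw accuracy can mislead on an imbalanced
# diagnostic test set. All counts below are hypothetical.

def metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity, specificity, balanced accuracy."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    sens = tp / (tp + fn)          # recall on the diseased class
    spec = tn / (tn + fp)          # recall on the healthy class
    return acc, sens, spec, (sens + spec) / 2

# Hypothetical imbalanced set: 950 healthy, 50 diseased. A model that
# labels almost everything "healthy" still scores high raw accuracy.
acc, sens, spec, bal = metrics(tp=5, fn=45, fp=10, tn=940)
print(f"accuracy={acc:.3f}  sensitivity={sens:.3f}  "
      f"specificity={spec:.3f}  balanced={bal:.3f}")
# accuracy is ~0.945 even though sensitivity is only 0.100
```

The same caution applies to selective answering: if a model skips hard cases (as GPT-4 Vision Preview did for 24% of them), accuracy computed only over answered cases is not comparable to accuracy over the full set.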
Human collective intelligence will be a vital benchmark as we pursue the automation of key tasks. What remains to be seen is how policymakers will view risk in clinical settings, and how malpractice insurance will evolve in this pursuit.
Thanks for sharing :)
Thank you for sharing!
Interesting AI calibration!
Very informative
Impressive study! It’s fascinating to see multimodal AI models pushing boundaries in medical diagnostics.
Generative AI Engineer and Consultant | Machine Learning Engineer | Ph.D. Biomedical Engineering
This is an exciting coincidence, I was designing this study in my head yesterday as I was working with multimodal models for a different application! The sobering observation I had while reading this is that it is already out of date. Claude 3.5 is out; Gemini 1.5 Pro and Flash are multimodal by design and were not evaluated here; and OpenAI has already released GPT-4 omni (GPT-4o), which is natively multimodal. And that's just the major players; smaller labs have released many other open- and closed-source models, like LLaVA and others. This is not to disparage the authors' work, but just to remind everyone that the field moves very quickly, so take every metric you read with a large handful of salt. By the time you read it, it's probably already incorrect.