Happy to share our latest study, which explores the accuracy of LLMs in answering medical knowledge questions.
The study compared the medical accuracy of 24,500 QA responses from OpenAI’s GPT-4, Anthropic’s Claude3-Opus, and human medical experts.
The comparison used questions based on objective medical knowledge drawn from Kahun Medical’s Knowledge Graph.
To share some of the key findings:
📌 Claude3 edged out GPT-4 on accuracy, but both fell well short of human medical experts and of the objective medical knowledge baseline. Both LLMs answered about a third of the questions incorrectly, and GPT-4 got almost half of the numerical-answer questions wrong.
📌 Each LLM generated different outputs on a prompt-by-prompt basis, and the same QA prompt could produce vastly different results across the two models.
📌 The study included an “I do not know” option to reflect situations where a physician must admit uncertainty. Answer rates differed between the LLMs (Numeric: Claude3 63.66%, GPT-4 96.4%; Semantic: Claude3 94.62%, GPT-4 98.31%). However, the correlation between accuracy and answer rate was insignificant for both LLMs, suggesting their ability to admit a lack of knowledge is questionable. This indicates that without prior knowledge of both the medical field and the model, the trustworthiness of LLMs is doubtful.
Read the study here: https://lnkd.in/d7cAKPfg
See the MedCity News story: https://lnkd.in/dxcCjmzH
Eden Avnat, Michal Levy, Daniel Hershtain, Elia Y., Daniel Ben Joya, MD, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev and Raja-Elie Abdulnour