AI’s Chat Abilities Fall Short in Medical Diagnosis
Recent research has shown that advanced artificial intelligence models, despite excelling on professional medical exams, struggle with a critical aspect of healthcare: communicating with patients to gather essential medical information. Models that perform impressively on standardized tests falter in the kind of dynamic, real-time conversation that patient care requires.
“Large language models demonstrate strong results on multiple-choice tests, but their effectiveness significantly diminishes during open-ended discussions,” remarked an expert from Harvard University. The limitation became apparent when researchers assessed clinical AI models’ reasoning using simulated doctor-patient conversations built from 2,000 medical cases, drawn primarily from professional US medical board exams.
“Simulating patient interactions is a valuable approach to evaluate medical history-taking skills, which are vital in clinical practice and cannot be adequately assessed through traditional case vignettes,” stated another expert at Harvard University. The newly established evaluation benchmark, known as CRAFT-MD, reflects real-world scenarios in which patients may struggle to pinpoint crucial information and may need specific prompts to reveal significant details.
The CRAFT-MD benchmark uses AI itself for the evaluation. One AI model acts as a “patient,” holding conversations with the clinical AI under assessment. The same model also helps grade the outcomes by comparing the clinical AI’s diagnoses with the correct answer for each medical case, with human medical experts verifying the grading to ensure accuracy.
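The study’s actual harness is not shown here, but a minimal sketch of how such a conversational evaluation could be wired together may help. Everything below is an illustrative assumption rather than the CRAFT-MD implementation: the ChatModel callable stands in for whatever LLM API is used, and the prompts, turn limit, and “DIAGNOSIS:” convention are invented for the example.

```python
"""Minimal sketch of a conversational evaluation loop: a "patient" model
answers questions drawn from a case vignette, a "clinical" model asks
questions and proposes a diagnosis, and a grader model checks that
diagnosis against the answer key. All names here are illustrative."""
from dataclasses import dataclass
from typing import Callable, Dict, List

# A chat model is assumed to be any callable mapping a message list
# (role/content dicts) to a text reply, e.g. a thin wrapper around an LLM API.
ChatModel = Callable[[List[Dict[str, str]]], str]


@dataclass
class MedicalCase:
    vignette: str            # full case description the patient agent draws on
    correct_diagnosis: str   # answer key for grading


def run_conversation(case: MedicalCase, clinician: ChatModel,
                     patient: ChatModel, max_turns: int = 10) -> str:
    """Let the clinical model interview the patient agent, then return
    its final diagnosis string (empty if none is offered in time)."""
    clinician_msgs = [{"role": "system",
                       "content": "You are a doctor. Ask the patient questions, "
                                  "then reply 'DIAGNOSIS: <condition>' when ready."}]
    patient_msgs = [{"role": "system",
                     "content": "You are a patient. Answer only what is asked, "
                                "based on this case description: " + case.vignette}]
    for _ in range(max_turns):
        question = clinician(clinician_msgs)
        if question.startswith("DIAGNOSIS:"):
            return question.split("DIAGNOSIS:", 1)[1].strip()
        # Relay the clinician's question to the patient agent and its answer back.
        clinician_msgs.append({"role": "assistant", "content": question})
        patient_msgs.append({"role": "user", "content": question})
        answer = patient(patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        clinician_msgs.append({"role": "user", "content": answer})
    return ""


def grade(case: MedicalCase, diagnosis: str, grader: ChatModel) -> bool:
    """Ask a grader model whether the diagnosis matches the answer key;
    in the study, this automated grading was also checked by human experts."""
    verdict = grader([{"role": "user",
                       "content": f"Does '{diagnosis}' match "
                                  f"'{case.correct_diagnosis}'? Answer yes or no."}])
    return verdict.strip().lower().startswith("yes")
```

Splitting the patient, clinician, and grader roles into separate model calls mirrors the division of labour described above, with human experts acting as the final check on the automated grading.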
Experiments with four leading AI models, including two of the most advanced, showed significantly lower performance on the conversational assessments than on diagnosis from written case summaries. The top model’s diagnostic accuracy was an impressive 82 percent when it worked from structured case summaries; accuracy dropped to just under 49 percent without the structured options and fell further, to 26 percent, when the model had to rely on simulated patient conversations.
The findings also indicated that these tools often fail to collect a complete medical history: even the highest-performing model evaluated gathered the pertinent information in only 71 percent of simulated patient conversations. And even when the AI did extract the relevant medical history, it did not consistently reach the correct diagnosis.
Such simulations offer a significantly more effective measure of AI clinical reasoning than traditional medical exams, according to experts from a prominent research institute in California.
Even if an AI model were eventually to master this benchmark and reach consistently accurate diagnoses from simulated interactions, that would not mean it could outperform human physicians. Medical practice, as the experts note, is often “messy”: it involves complex patient interactions, coordination with healthcare teams, and an understanding of the social and systemic factors that shape local healthcare.
“Strong performance on this benchmark would indicate that AI could serve as a valuable tool in clinical settings, but it is not likely to replace the nuanced judgement of seasoned physicians,” they concluded.