With advancements in artificial intelligence, the healthcare industry is rapidly exploring new technologies to support clinical decision-making. One such technology is GPT-4 Vision (GPT-4V), a multimodal large language model (LLM) designed to process both text and images, including radiologic images. While it holds immense promise, concerns about its reliability have emerged. A recent study published in European Radiology revealed both potential benefits and serious limitations that could impact the adoption of GPT-4V in clinical settings.
Potential Benefits of GPT-4V in Radiology
GPT-4V’s capacity to combine computer vision with language-based reasoning offers new opportunities for assisting radiologists in interpreting complex imaging studies. As a model trained on vast amounts of data, GPT-4V could theoretically streamline workflows by summarising radiologic findings, aiding in report generation, and offering clinical decision support. The study found that GPT-4V’s performance improved markedly when provided with clinical context: diagnostic accuracy rose from 8.3% without context to 63.6% with it. This indicates that while GPT-4V is far from perfect, it has potential in supportive roles, particularly when clinical information is available to guide its interpretation.
The tool’s success in diagnosing certain conditions highlights its potential to assist in everyday radiology tasks. For instance, it performed best on radiographic and angiographic studies, where clear, unequivocal findings are easier to interpret. This suggests that GPT-4V could be used to pre-screen images and flag potential concerns for radiologists, allowing them to focus on more complex cases. By leveraging this capability, radiologists may be able to reduce their workload and increase efficiency in clinical settings, provided that appropriate oversight mechanisms are in place to catch errors.
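To make the pre-screening idea concrete, the sketch below shows roughly how such a workflow might be wired up. It is illustrative only, not the study’s method: it assumes the OpenAI Python SDK (v1.x) with an API key in the environment, and the model name, prompt wording and prescreen_image helper are all hypothetical choices.

```python
# Hypothetical pre-screening sketch: send a radiograph plus clinical context
# to a GPT-4V-class model and ask for draft findings a radiologist can review.
# Assumes the OpenAI Python SDK (v1.x); model name and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prescreen_image(image_path: str, clinical_context: str) -> str:
    """Return the model's draft findings for radiologist review."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a GPT-4V-class multimodal model
        messages=[{
            "role": "user",
            "content": [
                # Supplying clinical context matters: the study reports
                # accuracy of 8.3% without it versus 63.6% with it.
                {"type": "text",
                 "text": f"Clinical context: {clinical_context}\n"
                         "List any notable findings to flag for review."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The output is a draft to be verified by a radiologist, never a final read.
print(prescreen_image("chest_xray.png", "45-year-old with acute dyspnoea"))
```

Given the fabrication rates the study reports, any output from such a step would need explicit radiologist sign-off rather than flowing automatically into the report.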
Challenges and Limitations in Diagnostic Accuracy
While GPT-4V shows promise, its diagnostic accuracy without clinical context was alarmingly low, at just 8.3%. This performance reflects a significant obstacle to using GPT-4V autonomously in real-world medical settings. Even with clinical context, the system still misinterpreted many images, fabricated findings and misidentified imaging modalities. In some instances, for example, the model identified the wrong anatomical region or mistook one imaging modality for another, such as confusing computed tomography (CT) images with radiographs. These mistakes underscore the inherent limitations of GPT-4V’s current design, particularly its propensity to rely more heavily on the accompanying text than on the image itself.
Another major concern is the model’s consistency over time. When asked to re-read the same images after 30 and 90 days, its accuracy dropped by up to 30%, raising questions about the reliability of GPT-4V over prolonged use. This decline suggests that any real-world application of GPT-4V in radiology would require constant monitoring and periodic reassessment to ensure it continues to perform as expected. This inconsistency, combined with its tendency to fabricate findings in nearly two-thirds of its responses, demonstrates that GPT-4V is far from ready for deployment in critical medical applications without significant improvements.
Ethical and Safety Concerns
The deployment of GPT-4V in radiologic settings raises significant ethical and safety concerns. In its current form, the model’s performance can put patient safety at risk, particularly if healthcare providers without radiological expertise rely on it for clinical decisions. The study’s finding that GPT-4V fabricated imaging findings in nearly 63% of its contextualised readings suggests a troubling lack of reliability. These fabricated findings could mislead healthcare professionals, potentially resulting in incorrect diagnoses, inappropriate treatments, or delayed patient care.
Furthermore, the risk of “automation bias” looms large in discussions about AI in healthcare. As GPT-4V and other AI tools become more integrated into clinical workflows, clinicians may come to over-rely on these systems, trusting their outputs without sufficient scrutiny. This over-reliance could erode clinical expertise, as radiologists begin to defer to AI-generated conclusions rather than their own judgement and experience. Privacy concerns also arise when AI models like GPT-4V process sensitive patient data; robust measures must be taken to ensure patient confidentiality and prevent data breaches.
Given these challenges, it is clear that GPT-4V should not be used in isolation. Instead, it may serve as a supplemental tool, offering suggestions and insights while leaving final decision-making to trained radiologists. Even when its reads are accurate, the model’s diagnostic reasoning requires human oversight to ensure that conclusions rest on sound medical principles rather than fabricated findings.
GPT-4V offers both potential and risks in the field of radiology. Its markedly better accuracy when given clinical context, particularly in radiographic and angiographic studies, demonstrates its promise as a supplementary tool for radiologists. However, significant limitations, such as low diagnostic accuracy without context, frequent fabrication of findings and inconsistent performance over time, must be addressed before GPT-4V can be widely adopted in clinical practice. Additionally, the ethical and safety concerns surrounding its use highlight the need for careful oversight and regulation to prevent automation bias and ensure patient safety. Collaboration between developers, healthcare professionals and regulatory bodies will be critical to ensuring these tools contribute meaningfully to patient care without compromising safety or quality.
Source: European Radiology
Image Credit: iStock