Large language models are being assessed for radiology workflow tasks that depend on short clinical requests, including imaging protocol selection. An original investigation published in European Radiology Experimental evaluated whether forcing model outputs into predefined structured fields improves radiology request form processing. The work compared five models (GPT-5-Thinking, Gemini 2.5 Pro and three open-weight models) on 100 anonymised request forms for CT and MRI. Each model processed the same cases under both structured and unconstrained prompting. Outputs were assessed against a reference standard set by two board-certified radiologists in consensus. Performance was also compared with that of a first-year radiology resident, a third-year resident and the first-year resident working with GPT-5-Thinking support.
Prompting Effects Varied by Model
The request forms preserved routine clinical complexity. They included brief wording, abbreviations, typographical errors, underspecified information and ambiguity. They covered CT and MRI cases across several clinical areas, including cardiovascular, oncological, musculoskeletal and emergency indications. No patient-identifying information was included or provided to the models.
Structured prompting required the models to return predefined answers for modality, anatomical region, contrast strategy and urgency. The same output also included a clearer version of the clinical indication. Unconstrained prompting did not impose predefined categories or a fixed output format, so model responses required manual assessment against the same clinical reference.
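The idea of a constrained output can be illustrated with a minimal Python sketch. It assumes the model replies with a JSON object; the field names, the `ALLOWED` category lists and the `parse_structured_output` helper are hypothetical and are not taken from the study, which is only described here as using predefined answers for modality, anatomical region, contrast strategy and urgency plus a rewritten indication.

```python
import json
from dataclasses import dataclass

# Hypothetical category lists for illustration only. CT and MRI come from the
# article; the other options are assumptions, not the study's actual choices.
ALLOWED = {
    "modality": {"CT", "MRI"},
    "region": {"head", "chest", "abdomen", "spine", "extremity"},
    "contrast": {"none", "with contrast"},
    "urgency": {"routine", "urgent", "emergency"},
}


@dataclass(frozen=True)
class ProtocolSuggestion:
    modality: str
    region: str
    contrast: str
    urgency: str
    indication: str  # free-text, clearer rewrite of the clinical indication


def parse_structured_output(raw: str) -> ProtocolSuggestion:
    """Parse a JSON reply and reject any value outside the predefined categories."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        value = data.get(field)
        # Check the type first so unhashable values (lists, dicts) fail cleanly.
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field}={value!r} is not one of {sorted(allowed)}")
    indication = data.get("indication")
    if not isinstance(indication, str) or not indication.strip():
        raise ValueError("indication must be a non-empty string")
    return ProtocolSuggestion(
        modality=data["modality"],
        region=data["region"],
        contrast=data["contrast"],
        urgency=data["urgency"],
        indication=indication.strip(),
    )
```

A reply that passes this check can be compared with the reference field by field without manual reading, whereas an unconstrained reply has no such guarantee and needs a human assessor.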
The effect of structure was not consistent across models. GPT-5-Thinking performed best when it was not constrained, with all protocol elements correct in 76% of cases. Its performance fell when structured output rules were imposed. Gemini 2.5 Pro followed the opposite pattern, improving from 53% with unconstrained prompting to 66% with structured prompting. The three open-weight models changed little when structured prompting was used, with their all-correct results remaining close to 40%.
Structured prompts therefore cannot be treated as a universal upgrade. They may improve one model, restrict another and make little difference to others.
Accuracy Was Strongest for Basic Protocol Choices
The models generally performed better on modality and anatomical region than on more complex protocol decisions. These categories showed smaller differences between proprietary and open-weight models, especially when outputs were constrained to a predefined format. Contrast phase selection also showed limited separation between the model groups. GPT-5-Thinking and Gemini 2.5 Pro reached the strongest contrast results under their better-performing prompting conditions.
Urgency was the most difficult category across the evaluation. The best proprietary result for urgency came from GPT-5-Thinking with structured prompting. The best open-weight result came from medgemma-27b-it with structured prompting. Even so, urgency remained less reliable than modality, anatomical region and contrast phase. Any mismatch in urgency was treated as incorrect, even when other protocol elements matched the reference.
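The relationship between per-category accuracy and the stricter all-correct endpoint can be sketched as follows. This is a minimal illustration, assuming that each case is represented as a dictionary with four hypothetical keys and that "all correct" means every one of those four fields matches the reference; the `score` function and the data layout are not from the study.

```python
# Hypothetical field names for the four protocol elements described in the article.
FIELDS = ("modality", "region", "contrast", "urgency")


def score(predictions, references):
    """Return (per-field accuracy dict, all-correct accuracy) as fractions of cases."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    pairs = list(zip(predictions, references))
    # Per-field accuracy: share of cases where that single field matches.
    per_field = {
        field: sum(pred[field] == ref[field] for pred, ref in pairs) / n
        for field in FIELDS
    }
    # All-correct accuracy: a case counts only if every field matches,
    # so a single urgency mismatch makes the whole case incorrect.
    all_correct = sum(
        all(pred[field] == ref[field] for field in FIELDS) for pred, ref in pairs
    ) / n
    return per_field, all_correct
```

Because one wrong field fails the whole case, the all-correct figure can never exceed the weakest per-field accuracy, which is why a difficult category such as urgency pulls the combined result down.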
The most consistent strength across all models was the rewriting of clinical indications. More than 90% of reformulated indications were judged clearer and more clinically precise than the original request wording. The rewritten indications introduced no factual inaccuracies or hallucinations.
Decision Support Helped a Junior Resident
The resident comparison suggested that LLMs may be most useful as support for less experienced radiology staff. The first-year resident had lower all-correct accuracy than all tested LLMs, apart from one open-weight model. The third-year resident performed at a level comparable to the strongest LLM result and slightly exceeded it in all-correct accuracy.
When the first-year resident repeated the task with GPT-5-Thinking support, performance improved across all evaluated categories. The largest change appeared in the all-correct category, where accuracy rose from 19% to 65%. This assisted result approached the third-year resident’s performance, although it did not exceed it.
Deployment context remains important. Proprietary models produced the highest overall accuracy, but their current licensing and deployment arrangements typically require execution on external company servers. This creates data protection and governance concerns for clinical use. Open-weight models performed less strongly in the combined all-correct endpoint, but their licensing and deployment characteristics allow local operation on institution-managed hardware within secure clinical networks. That makes them relevant where data locality and system control are priorities.
Structured output constraints produced model-specific effects in radiology protocol selection. Gemini 2.5 Pro improved with structured prompting, GPT-5-Thinking performed better without it and open-weight models changed little. Proprietary models achieved the highest combined accuracy, while open-weight models offered clearer advantages for local deployment and governance. The clearest practical benefit involved junior resident support, where GPT-5-Thinking assistance markedly improved first-year resident performance and brought it closer to the level of a more experienced resident.
Source: European Radiology Experimental
References:
Bahaaeldin M, Nowak S, Seidel O et al. (2026) Influence of structured output constraints on GPT-5-Thinking, Gemini 2.5 Pro, and open-weight LLMs for radiology protocol selection. Eur Radiol Exp; 10, 42.