Emergency departments managing traumatic brain injury (TBI) must rapidly identify patients at highest risk of in-hospital death using limited early information. Widely used clinical measures can support triage but often struggle to capture the full complexity of presentation, particularly when decisions depend on subtle combinations of physiological findings and neurological signs. Machine learning (ML) models offer an alternative route to risk stratification yet can be operationally demanding to build, validate and maintain. Large language models (LLMs) such as GPT are increasingly discussed for clinical support, but their behaviour on structured risk prediction depends heavily on how they are instructed. A head-to-head evaluation using one TBI dataset compared a support vector machine (SVM) baseline with multiple GPT prompting approaches, focusing on discrimination performance and the practical trade-offs between sensitivity and specificity.
One Cohort, One Feature Set, Multiple Modelling Paths
The data came from an emergency department TBI cohort drawn from three hospitals over a ten-year period, including 18,249 adult patients. Records were split into a training set and a testing set using a 7:3 ratio. Each patient entry contained 12 structured clinical variables extracted from electronic medical records, covering demographics, vital signs and neurological assessment. Variables included age, sex and body mass index (BMI), a triage category and measures such as heart rate, respiratory rate and body temperature. Neurological status was captured using the Glasgow Coma Scale (GCS) alongside bilateral pupil size and light reflex responses. The dataset was anonymised, and the work proceeded under institutional review board oversight with informed consent waived given the retrospective design.
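The 7:3 partition described above can be sketched as follows. This is an illustrative reconstruction, not the study's code: the DataFrame, column names and the stratification choice are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_cohort(df: pd.DataFrame, label: str = "in_hospital_death"):
    """Split a TBI cohort into training and testing sets at a 7:3 ratio.

    `df` is assumed to hold the 12 structured clinical variables plus a
    binary in-hospital mortality label (column name hypothetical).
    """
    X = df.drop(columns=[label])
    y = df[label]
    # test_size=0.3 gives the 7:3 split; stratifying keeps the mortality
    # rate similar in both partitions (an assumption, not stated in the study)
    return train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
```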
The SVM model acted as the ML benchmark, selected because it had previously produced the strongest area under the receiver operating characteristic curve (ROC-AUC) among multiple ML approaches on the same cohort. The SVM was evaluated on the testing set using the established setup. GPT was used differently: it did not undergo training or fine-tuning on the dataset and instead generated probability outputs during inference. Several prompting strategies were applied to the same testing records, allowing direct comparison of how instruction style influenced predictions from identical inputs.
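A minimal sketch of such an SVM benchmark, producing the probability outputs needed for ROC analysis, might look like the following. The scaler, kernel and hyperparameters here are illustrative defaults, not the study's actual configuration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def fit_svm_benchmark(X_train, y_train):
    # probability=True enables per-patient mortality probabilities,
    # which ROC-AUC evaluation requires
    model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
    model.fit(X_train, y_train)
    return model

def evaluate_auc(model, X_test, y_test) -> float:
    probs = model.predict_proba(X_test)[:, 1]  # estimated P(death)
    return roc_auc_score(y_test, probs)
```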
Prompting Choices Drive Divergent Error Profiles
Four GPT configurations were evaluated: a zero-shot approach, a few-shot approach, a few-shot approach combined with chain-of-thought (CoT) prompting and a CoT-only approach. The prompts were designed to frame GPT as a senior clinician and to elicit a mortality probability for each patient case. Few-shot variants provided exemplars drawn from confirmed death cases, while CoT prompts aimed to structure the reasoning process before producing a final probability. These outputs were then converted into high-risk versus low-risk classifications using a probability threshold.
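The final conversion step is a simple cut-off rule, as in this sketch (0.5 in the first evaluation, with tuned values explored later):

```python
def classify(probabilities, threshold=0.5):
    """Map elicited mortality probabilities to high-risk (1) / low-risk (0) labels."""
    return [1 if p >= threshold else 0 for p in probabilities]
```

Because every downstream sensitivity and specificity figure depends on this single parameter, the same set of probabilities can yield very different error profiles as the threshold moves.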
The SVM achieved strong discrimination, with an ROC-AUC of 0.920. At a fixed classification threshold of 0.5, the GPT outputs showed that headline accuracy figures could hide clinically important imbalances. The few-shot GPT configuration delivered high overall accuracy but comparatively low sensitivity, indicating many fatal cases were classified as survival at that threshold. The few-shot plus CoT strategy moved strongly in the opposite direction, producing very high sensitivity but substantially reduced specificity, consistent with a more aggressive approach that flagged many patients as high risk. The CoT-only configuration skewed toward survival classification, reaching very high specificity while sensitivity was the lowest among the GPT strategies. Zero-shot GPT sat between these extremes, with discrimination close to the best GPT results but with a more moderate sensitivity profile than the specificity-leaning CoT-only approach.
These patterns underline that prompt design did not merely shift performance marginally; it altered the operating characteristics of the model. In practical terms, some prompt styles behaved like a cautious screen that rarely labelled mortality risk unless signals were very strong, while others behaved like an alarm that captured most fatal cases but at the cost of many false positives.
Threshold Calibration Narrows the Gap with ML Baselines
A second evaluation adjusted the probability thresholds to balance sensitivity and specificity by minimising the absolute difference between them. Under this calibration, all GPT strategies improved their balance, reducing the stark asymmetries seen at the 0.5 cut-off. The SVM remained the most robust overall, maintaining an ROC-AUC of 0.920 alongside a balanced sensitivity and specificity profile at its tuned threshold.
Among the GPT approaches, few-shot prompting produced the closest alignment with the SVM baseline after threshold adjustment, reaching an ROC-AUC of 0.919 at a tuned threshold around 0.26. Zero-shot GPT also achieved a balanced trade-off at a slightly higher threshold, though with a lower ROC-AUC than the SVM and few-shot GPT. The few-shot plus CoT strategy could also be balanced with threshold tuning, but its ROC-AUC remained below the stronger approaches. The CoT-only configuration required a notably low threshold, around 0.064, to counteract its tendency to output probabilities biased toward survival, and it delivered the lowest discrimination among the compared strategies.
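The calibration criterion, minimising the absolute difference between sensitivity and specificity, can be sketched as a simple scan over candidate thresholds. This is a generic reconstruction of the idea under assumed details (grid choice, tie-breaking), not the study's implementation.

```python
import numpy as np

def balanced_threshold(y_true, probs):
    """Pick the cut-off minimising |sensitivity - specificity| on held-out data."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    grid = np.unique(probs)  # candidate cut-offs taken from observed scores
    best_t, best_gap = 0.5, float("inf")
    for t in grid:
        pred = (probs >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return float(best_t)
```

For a model whose probabilities are biased toward survival, as with the CoT-only configuration, this procedure naturally lands on a very low cut-off, which matches the tuned threshold of around 0.064 reported above.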
To assess stability, the same prompts were executed five times across independent sessions. Variation across runs was small, and the reporting highlighted low standard deviations across metrics, suggesting that the prompting setups used here produced consistent outputs rather than highly volatile behaviour.
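The stability check amounts to repeating the same prompt over independent sessions and summarising the per-metric spread; a minimal sketch, with illustrative values rather than the study's figures:

```python
import statistics

def run_stability(metric_per_run):
    """Return mean and standard deviation of a metric across repeated runs."""
    return statistics.mean(metric_per_run), statistics.stdev(metric_per_run)
```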
A single emergency TBI dataset produced markedly different GPT performance profiles depending on prompt design and threshold setting. The SVM baseline delivered strong discrimination and a balanced sensitivity-specificity trade-off, while GPT approaches ranged from sensitivity-heavy to specificity-heavy behaviour when a fixed threshold was used. Threshold calibration reduced these extremes and brought few-shot GPT closest to the ML benchmark on ROC-AUC, while CoT-only prompting showed the weakest discrimination and required substantial threshold adjustment to achieve balance. For clinical leaders and decision-makers, the results point to a clear operational message: when LLMs are applied to structured risk prediction, prompt engineering and decision thresholds are not cosmetic choices but core governance parameters that shape clinical usefulness and error risk.
Source: International Journal of Medical Informatics