Pancreatic cancer care depends on treatment planning that integrates imaging, pathology and clinical history within multidisciplinary tumour boards. These boards bring together surgery, medical oncology, radiology, pathology and gastroenterology to reach consensus-based recommendations for newly diagnosed cases. In the evaluated setting, tumour board decisions were treated as the gold standard and were grouped into three pathways: surgical resection, neoadjuvant therapy and palliative therapy. LLaMA 3.3 with 70 billion parameters was assessed against these real-world decisions using fully anonymised clinical texts arranged in a standard order of clinical history, imaging reports and pathology findings. The analysis used 42 first-diagnosis pancreatic cancer cases with complete documentation and compared four prompting strategies: zero-shot, advanced zero-shot, chain-of-thought and few-shot prompting. Overall, the model aligned with tumour board decisions in many surgical and palliative cases, but its performance broke down when the task involved identifying neoadjuvant candidates.

 

How the Evaluation Was Structured

The evaluation focused only on first-diagnosis cases discussed in a real-world tumour board. Each patient was assigned to one of three predefined treatment options. Surgical resection covered patients who were considered resectable under NCCN criteria, meaning no arterial tumour contact and no unreconstructable venous involvement, and who were medically fit for surgery. The neoadjuvant category included anatomical borderline resectability as well as high-risk biological features and specific patient conditions. It also included locally advanced pancreatic cancer cases treated with induction chemotherapy when the therapeutic pathway still aimed at potential downstaging and secondary resection. Palliative therapy applied to patients with proven distant metastases, permanently unresectable locally advanced disease or poor performance status, when treatment intent was life-prolonging or symptom-controlling rather than curative.

 

The model was deployed within institutional infrastructure and received fully anonymised clinical documentation. Final tumour board decisions and wording that referred to already initiated or planned downstream treatment were removed to avoid label leakage. No fine-tuning was performed. Instead, the comparison tested different prompt designs. Zero-shot and advanced zero-shot prompting used temperature 0.0 and top_p 0.9. Chain-of-thought prompting sampled seven independent reasoning traces with temperature 0.7 and top_p 0.95, then used the majority label as the prediction. Few-shot prompting used four examples per class (12 in total) and was evaluated on the remaining 30 cases. The prompts forced the model to choose exactly one label: surgical resection, neoadjuvant therapy or palliative therapy.
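
The majority-vote step used for chain-of-thought prompting can be sketched as follows. This is a minimal illustration rather than the study's implementation: the label strings, the `majority_label` helper and the handling of ties and unusable outputs are all assumptions, since the article does not describe them.

```python
from collections import Counter

# Label strings are illustrative; the article names the three classes but
# does not give the exact output format the model was asked to produce.
LABELS = ("surgical resection", "neoadjuvant therapy", "palliative therapy")


def majority_label(votes):
    """Return the most frequent valid label, or None if there is no clear winner."""
    counts = Counter(v for v in votes if v in LABELS)
    if not counts:
        return None  # no trace produced a usable label
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the article does not say how ties were resolved
    return ranked[0][0]


# Seven hypothetical trace outputs for one case (not study data).
traces = [
    "surgical resection", "surgical resection", "palliative therapy",
    "surgical resection", "neoadjuvant therapy", "surgical resection",
    "palliative therapy",
]
print(majority_label(traces))
```

With seven votes spread over three labels, a 3-3-1 split is possible, so the sketch returns `None` instead of silently picking one of the tied labels.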

 

Performance Patterns Across Prompting Strategies

The strongest global results came from advanced zero-shot and chain-of-thought prompting. Both reached an overall accuracy of 0.786 and a micro-averaged F1 score of 0.786, outperforming the basic zero-shot setting, which reached an accuracy of 0.667. Few-shot prompting produced the lowest overall accuracy at 0.567, measured on the reduced set of cases that remained once the few-shot examples had been set aside. In pairwise comparisons on the shared 30-case subset, advanced zero-shot and chain-of-thought both reached 86.7% accuracy, while few-shot reached 56.7%. The differences between advanced zero-shot and few-shot, and between chain-of-thought and few-shot, were statistically significant.
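
Accuracy and micro-averaged F1 share the value 0.786 for a structural reason: when every case receives exactly one label, each wrong prediction counts once as a false positive and once as a false negative, so micro-averaged precision, recall and F1 all reduce to accuracy. With 42 cases, 0.786 would correspond to 33 correct predictions. The sketch below shows the identity on a small set of made-up labels; the helper names and the toy data are assumptions for illustration, not study data.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases where the predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this class, reference says otherwise
            elif t == label:
                fn += 1  # reference is this class, prediction missed it
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: S = surgery, N = neoadjuvant, P = palliative.
labels = ["S", "N", "P"]
y_true = ["S", "S", "N", "P", "P"]
y_pred = ["S", "S", "S", "P", "N"]

print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred, labels))
```

The identity holds only for single-label classification. Macro-averaged or per-class scores can diverge widely from accuracy, which is why the class-wise results matter here.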

 

Class-wise analysis showed that these stronger overall scores did not reflect balanced performance across treatment categories. In the zero-shot condition, surgical cases were recovered without omission, while palliative cases were classified robustly. Neoadjuvant cases were rarely selected. Under advanced zero-shot prompting, surgical performance improved further and palliative classification remained strong, but neoadjuvant performance collapsed completely, with both precision and recall at 0.000. Chain-of-thought reproduced the same pattern. Few-shot prompting reversed that behaviour. Neoadjuvant recall rose to 1.000, but precision fell to 0.200 and surgical recall dropped sharply. Palliative predictions retained perfect precision under few-shot prompting, yet overall accuracy declined because many cases from other categories were redirected into the neoadjuvant class.
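
The few-shot figures for the neoadjuvant class can be checked with simple arithmetic. The counts below are a reconstruction under stated assumptions, not numbers given in the article: if four of the seven neoadjuvant cases served as few-shot examples, three remain in the evaluation set; recall of 1.000 then implies three true positives, and precision of 0.200 implies twelve cases from other classes labelled as neoadjuvant. For the collapsed settings, the number of false positives is unknown and is set to zero here purely for illustration.

```python
def precision_recall(tp, fp, fn):
    """Per-class precision and recall; an undefined ratio is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed counts for the neoadjuvant class under few-shot prompting:
# 3 remaining neoadjuvant cases all found, 12 other cases pulled into the class.
few_shot_precision, few_shot_recall = precision_recall(tp=3, fp=12, fn=0)

# Assumed counts when no neoadjuvant case is recovered: all 7 are missed.
collapsed_precision, collapsed_recall = precision_recall(tp=0, fp=0, fn=7)

print(few_shot_precision, few_shot_recall)
print(collapsed_precision, collapsed_recall)
```

Under these assumptions, perfect recall comes at the price of four wrong neoadjuvant calls for every correct one, which matches the described drop in surgical recall and overall accuracy.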

 

Where the Errors Occurred

The confusion matrices showed that errors clustered around the neoadjuvant label. In the basic zero-shot setting, six of seven neoadjuvant cases were reassigned, most often to surgery and sometimes to palliation. Palliative cases were also often confused with surgery, while surgical cases were recovered without omission. Advanced zero-shot reduced noise in the majority classes but suppressed neoadjuvant decisions altogether. None of the seven neoadjuvant cases were correctly identified. Most were reassigned to palliative therapy, with the remainder shifted to surgery. This pattern improved global accuracy because surgical and palliative cases were classified with high fidelity, but it left an important blind spot.

 

Few-shot prompting changed the direction of the error. The model became willing to predict neoadjuvant therapy and captured all neoadjuvant cases in the reduced evaluation set, yet many surgical and palliative cases were incorrectly moved into that category. This overcorrection reduced surgical recall and overall performance. Chain-of-thought again mirrored advanced zero-shot by stabilising majority-class decisions without recovering any neoadjuvant cases.

 

Detailed review of discordant advanced zero-shot cases identified recurring reasoning failures. The model often defaulted to palliative therapy when vascular involvement was extensive but distant metastases were absent or equivocal, missing the board’s use of neoadjuvant treatment as a bridge to potential resection. When the model preferred immediate surgery, it relied too heavily on limited vessel contact and did not match the board’s assessment of negative margin likelihood or vascular reconstruction feasibility. Indeterminate extra-pancreatic lesions were sometimes treated as definitive metastases, leading to premature palliative recommendations. In one case, the model did not commit to any predefined treatment path and instead produced a diagnostic answer.

 

The evaluation showed that LLaMA 3.3 could approximate tumour board decisions in pancreatic oncology when cases were clearly surgical or palliative, especially under advanced zero-shot and chain-of-thought prompting. That apparent success masked a clinically important limitation. The same strategies failed to identify any neoadjuvant cases, while few-shot prompting improved detection of that group only by introducing substantial misclassification elsewhere. The main weaknesses emerged around vascular encasement, equivocal metastatic findings and the recognition of neoadjuvant therapy as a curative-intent pathway. These findings support careful oversight and targeted adaptation before any clinical consideration of large language models in pancreatic tumour board decision-making.

 

Source: BMC Medical Informatics and Decision Making


References:

Mergen M, Busch F, Schwarberg B et al. (2026) AI-assisted tumor board decision-making in pancreatic oncology. BMC Med Inform Decis Mak: In Press.


