Artificial intelligence–generated content (AIGC) is advancing rapidly across nuclear medicine imaging (NMI), promising software-based gains in denoising, motion correction, attenuation correction and cross-modality translation. These approaches may streamline workflows, reduce radiation exposure and improve quantitative accuracy. Alongside these benefits, a critical risk has emerged: AIGC can fabricate realistic yet false image content that misrepresents anatomy or function, undermining diagnostic confidence and clinical safety. A domain-specific, shared framework is needed to name, detect and mitigate these failures. The DREAM report addresses this need by proposing a focused definition, illustrating representative failure modes, outlining multi-level evaluation approaches and mapping root causes to practical safeguards so that AIGC in NMI can be deployed more safely.
What Hallucinations Mean in Nuclear Medicine
Definitions of hallucination vary widely across the literature. For NMI, the DREAM report recommends a narrow, operational meaning: AI-fabricated abnormalities or artifacts that look plausible and realistic yet are factually false, deviating from anatomic or functional truth or lacking support from measurement when ground-truth images are unavailable. This scope excludes general artifacts from traditional workflows and distinguishes hallucinations from other AI errors, such as lesion omission or uniform intensity shifts, which are treated as illusions rather than fabrications.
Representative risks span common AIGC tasks in NMI. During image enhancement, visually impressive SPECT or PET denoising can introduce false perfusion or lesion-like signals. In AI-based attenuation correction, synthetic maps derived from emission data may embed subtle but consequential false structures despite good visual agreement with references. Cross-modality translation is particularly vulnerable when attempting to infer functional abnormalities from structural data or vice versa, because pathophysiology may precede or bypass visible morphologic change. While such synthesis holds value for PET/MRI attenuation correction or dataset augmentation, its direct diagnostic substitution remains prone to misleading fabrications. By anchoring the definition in fabricated, realistic-looking abnormalities, the framework targets the errors most likely to deceive readers and propagate clinical harm.
How to Measure and Monitor Hallucinations
Robust deployment requires dedicated detection and evaluation beyond conventional image quality metrics. Image-based approaches include the hallucination index, which compares AIGC output against a zero-hallucination reference constructed to match signal-to-noise, and radiomics analyses that probe whether clinically relevant features in regions of interest remain statistically consistent with references. Both can reveal subtle divergences that plain visual scoring may miss, though they may also capture non-hallucinatory discrepancies and need tailoring to isolate fabrications.
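To make the reference-based idea concrete, the Python sketch below scores region-wise discrepancy between an AIGC output and a noise-matched, hallucination-free reference using simple first-order statistics. The function name, the statistics chosen and the synthetic data are illustrative assumptions, not the published hallucination index or a validated radiomics pipeline.

```python
import numpy as np

def hallucination_score(ai_img, ref_img, roi_masks):
    """Illustrative region-wise discrepancy between an AIGC output and a
    noise-matched reference. A real hallucination index would use a more
    principled statistical distance; this is a minimal sketch."""
    scores = {}
    for name, mask in roi_masks.items():
        ai_vals, ref_vals = ai_img[mask], ref_img[mask]
        # Normalised gaps in mean and spread within the region of interest
        mean_gap = abs(ai_vals.mean() - ref_vals.mean()) / (abs(ref_vals.mean()) + 1e-8)
        std_gap = abs(ai_vals.std() - ref_vals.std()) / (ref_vals.std() + 1e-8)
        scores[name] = mean_gap + std_gap
    return scores

# Synthetic example: the AI output contains a fabricated focal "lesion"
rng = np.random.default_rng(0)
ref = rng.normal(1.0, 0.1, size=(64, 64, 64))
ai = ref.copy()
ai[30:34, 30:34, 30:34] += 0.8
roi = {"myocardium": np.zeros_like(ref, dtype=bool)}
roi["myocardium"][28:36, 28:36, 28:36] = True
print(hallucination_score(ai, ref, roi))
```

In practice, such region-wise checks would be tuned per task so that ordinary noise differences are not flagged as fabrications.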
When paired references are unavailable, dataset-level strategies become useful. Neural hallucination detection quantifies deviations in feature space relative to a calibration bank (sketched below), while no-gold-standard evaluation adapts quantitative imaging methodology to compare precision across models without assuming ground truth, acknowledging that it may reflect general error rather than hallucination alone. Clinically focused assessment remains essential: downstream task performance, expert Likert scoring augmented with bounding boxes and concise descriptors, and sampled case review to balance feasibility with granularity. Automation can ease the burden, yet NMI lacks benchmark datasets annotated specifically for hallucinations, limiting the training of reliable detectors. Building multi-institutional, expandable repositories with standardised criteria would enable scalable, clinically aligned monitoring.
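As a rough illustration of the calibration-bank idea, the sketch below measures how far a test image's feature vector sits from a bank of features extracted from images judged hallucination-free, using a Mahalanobis-style distance. The feature extractor, dimensionality and any decision threshold are assumptions for illustration only.

```python
import numpy as np

def feature_deviation(test_feats, calib_feats):
    """Distance of a test image's feature vector from a calibration bank
    of reference features (Mahalanobis-style). Hypothetical stand-in for
    the feature-space deviation the report describes."""
    mu = calib_feats.mean(axis=0)
    cov = np.cov(calib_feats, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])          # regularise for invertibility
    inv_cov = np.linalg.inv(cov)
    diff = test_feats - mu
    return float(np.sqrt(diff @ inv_cov @ diff))

# Calibration bank of 200 reference feature vectors (16-dimensional here)
rng = np.random.default_rng(1)
bank = rng.normal(0.0, 1.0, size=(200, 16))
in_dist = rng.normal(0.0, 1.0, size=16)          # consistent with the bank
shifted = rng.normal(3.0, 1.0, size=16)          # suspect case drifting away
print(feature_deviation(in_dist, bank), feature_deviation(shifted, bank))
```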
Regulatory and postmarket perspectives also matter. Cleared commercial tools exist, and draft device guidance recognises that erroneous outputs erode reliability and trust, advocating lifecycle approaches and rigorous validation. Routine clinical monitoring is not universally mandated, creating a gap that professional initiatives seek to address. Within such frameworks, hallucinations warrant explicit tracking, including thresholds that balance dose-reduction claims against fabrication risk, so visual gains do not mask inaccuracies.
Where Hallucinations Come From and How to Reduce Them
Hallucinations arise when the learned mapping from source to target images diverges from the true relationship. Data, learning and model factors each contribute, and mitigations should match the cause. Domain shift is a major driver: if training distributions overrepresent specific patterns or underrepresent rare pathologies, models may hallucinate familiar features or misbehave on out-of-distribution cases. Mitigations include clearly defining intended use and limits, improving data quality, quantity and diversity across scanners, protocols and populations, leveraging federated learning and applying domain adaptation when broad datasets are not feasible. Transfer learning and continuous updates can strike a balance between generalisation and specialisation, while retrieval-augmented workflows face constraints in NMI due to limited structured visual knowledge sources.
Data nondeterminism introduces aleatoric uncertainty from acquisition noise and ill-posed inverse problems, yielding one-to-many plausible outputs. Better acquisition, systematic cleaning and rigorous preprocessing can reduce variability, though practical constraints remain. Even with strong data, input perturbations or suboptimal prompts can trigger failures; structured prompts that encode organs, noise levels or anatomic expectations have improved fidelity in denoising and translation tasks (see the sketch below).
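A structured prompt of this kind may be no more than a small, well-defined metadata record passed to the model alongside the image. The schema below is purely hypothetical and is meant only to show the sort of organ, noise and anatomy hints the report describes, not a published interface.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DenoisingPrompt:
    """Hypothetical structured prompt for an AIGC denoising model;
    field names are assumptions, not a published schema."""
    organ: str
    tracer: str
    acquired_counts: str          # e.g. "low-dose, ~25% of standard"
    noise_level: str              # coarse descriptor the model was trained with
    anatomic_expectations: list   # structures that must not be altered

prompt = DenoisingPrompt(
    organ="myocardium",
    tracer="99mTc-sestamibi",
    acquired_counts="low-dose, ~25% of standard",
    noise_level="high",
    anatomic_expectations=["left ventricular wall", "papillary muscles"],
)
# The serialised prompt would accompany the image at inference, constraining
# the solution space and discouraging fabricated structure.
print(json.dumps(asdict(prompt), indent=2))
```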
From a learning perspective, underspecification means multiple models may meet validation targets yet differ in faithfulness. Ensemble or feature averaging across runs can suppress spurious signals at computational cost (see the sketch after this paragraph). Human-in-the-loop alignment allows experts to steer models toward plausible solutions, complemented by automated fact-checking layers that flag suspect content based on rules, heuristics or learned detectors. At the model level, limited perceptual understanding can be addressed by integrating auxiliary priors and constraints. Multimodal conditioning with demographic and disease-specific biomarkers has preserved pathological features in PET synthesis. Anatomically and metabolically informed diffusion models, or task-specific loss functions aligned with clinical endpoints such as perfusion defect detection, have reduced fabrications by guiding feature extraction toward medically relevant structure and function.
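The following sketch shows one way ensemble averaging across independently trained or independently seeded runs could suppress a run-specific fabrication while flagging voxels where the runs disagree. The disagreement threshold and the synthetic example are assumptions, not parameters from the report.

```python
import numpy as np

def ensemble_average(outputs, voxelwise_std_thresh=None):
    """Average outputs from several independent runs of a generative model.
    Spurious, run-specific signals tend to cancel, consistent structure is
    preserved, and high voxel-wise disagreement can be flagged for review."""
    stack = np.stack(outputs, axis=0)
    mean_img = stack.mean(axis=0)
    std_img = stack.std(axis=0)
    flag = std_img > voxelwise_std_thresh if voxelwise_std_thresh is not None else None
    return mean_img, std_img, flag

# Five runs agree everywhere except one fabricated hotspot in a single run
rng = np.random.default_rng(2)
base = rng.normal(1.0, 0.05, size=(32, 32))
runs = [base + rng.normal(0, 0.05, size=base.shape) for _ in range(5)]
runs[0][10:12, 10:12] += 1.0
avg, disagreement, suspect = ensemble_average(runs, voxelwise_std_thresh=0.2)
print(suspect.sum(), "voxels flagged for review")
```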
AIGC is reshaping NMI workflows yet introduces a distinctive safety risk when models fabricate realistic but false abnormalities. The DREAM report frames a practical, NMI-specific definition, demonstrates where fabrications emerge and recommends layered evaluation spanning image statistics, dataset-level analysis and clinically grounded assessment, with an emphasis on building annotated benchmarks. Mitigation should target data diversity and stability, address underspecification through averaging and expert alignment and embed anatomic, functional and task-based priors into model design. With explicit monitoring and thoughtful safeguards, healthcare organisations can realise AIGC efficiency gains while constraining hallucination risk in routine practice.
Source: Journal of Nuclear Medicine