Large volumes of clinical knowledge remain locked in free-text records, limiting reuse for care and research. Unstructured materials such as procedure notes, progress reports and radiology records hold substantial information yet are costly to mine manually. A generative artificial intelligence approach using open-source large language models was developed to convert right heart catheterisation (RHC) procedure notes into computable data, with safeguards to minimise errors and hallucinations. Development drew on 220 RHC notes, and validation used a further 200 notes from a health system registry. Through preprocessing, structured prompts, layered validation and a targeted retry mechanism, the pipeline emphasised precision and reliability while running locally to preserve privacy. 

 

Modular Pipeline with Preprocessing and Structured Outputs 

The pipeline followed two serial steps that mirror clinician workflow. Step 1 performed multiclass categorisation, assigning text segments to the testing conditions recorded in RHC documentation: room air, oxygen supplementation and nitric oxide supplementation. Step 2 extracted numerical values into a predefined structure. Each step combined an Engineered Preload Framework (schemas, preprocessed notes and prompts) with an LLM module and a post-processing validation layer. 
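The two serial steps can be sketched as follows. In the actual pipeline each step is an LLM call driven by engineered prompts; the keyword rules and regex below are illustrative stand-ins only, and the note text is a fabricated toy example.

```python
import re

# Toy sketch of the two serial steps; keyword matching and regex extraction
# stand in for the pipeline's two LLM modules.
CONDITIONS = ("room_air", "oxygen", "nitric_oxide")

def categorise(note_text):
    """Step 1: assign each line to a testing condition (keyword stand-in)."""
    segments = {c: [] for c in CONDITIONS}
    for line in note_text.splitlines():
        lower = line.lower()
        if "nitric" in lower:
            segments["nitric_oxide"].append(line)
        elif "oxygen" in lower:
            segments["oxygen"].append(line)
        elif line.strip():
            segments["room_air"].append(line)
    return segments

def extract(segments):
    """Step 2: pull numeric values from each condition's lines (regex stand-in)."""
    return {cond: [float(m) for line in lines for m in re.findall(r"\d+\.?\d*", line)]
            for cond, lines in segments.items()}

note = "Room air mPAP 38\nOn oxygen mPAP 31\nOn nitric oxide mPAP 27"
print(extract(categorise(note)))
```

Keeping the steps serial means a categorisation error surfaces before extraction runs, which is what lets the downstream validation layers attribute mistakes to the right stage.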

 

Automated preprocessing removed irrelevant sections, standardised common shorthand and harmonised phrasing without altering underlying content. For haemodynamic strings, shorthand was expanded so downstream inference encountered consistent field names and units. A detailed JSON schema defined the target outputs. Two identifiers, Medical Record Number and Report Number, anchored each record. For every testing condition, 13 conditional fields were specified, allowing extraction of up to 39 numerical values per note when present. 
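The target structure can be sketched as a nested record. The two identifiers come from the study; the haemodynamic field names below are illustrative examples, not the authors' full 13-field list.

```python
# Sketch of the target record structure; field names are illustrative
# examples of the 13 conditional fields, not the study's actual schema.
CONDITIONS = ["room_air", "oxygen", "nitric_oxide"]
EXAMPLE_FIELDS = ["mPAP", "PCWP", "CO", "PVR", "TPG"]  # 13 in the real schema

record_template = {
    "medical_record_number": None,  # identifier 1
    "report_number": None,          # identifier 2
    # One block of conditional fields per testing condition; absent
    # measurements stay null rather than being fabricated.
    **{cond: {field: None for field in EXAMPLE_FIELDS} for cond in CONDITIONS},
}

# With all 13 fields specified per condition, a single note can yield
# up to 3 * 13 = 39 numerical values.
print(len(CONDITIONS) * 13)  # 39
```

Null defaults matter here: they make "no value present" an explicit, valid state, which underpins the pipeline's refusal to fabricate values for sparse notes.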

 

Prompt templates were refined with pulmonary vascular disease expertise and prompting techniques including meta prompting, personification and self-reflection. A dedicated Working Space structured internal reasoning into think, plan and execute phases, aiding detection of dispersed cues that signal testing conditions. After comparisons of instruction following and medical knowledge, Llama 3.1 Nemotron 70B was used for categorisation and Llama 3 70B for extraction. Inference parameters prioritised determinism and factual grounding, with temperature at 0.1 and top-p at 0.85. An 8,192-token context window provided input buffer, a fixed seed supported repeatability and batch size was one note with parallel batches in flight. The system ran via vLLM on a local GPU server. Development used Python 3.10 with statistical analysis in RStudio. The modular design supports adaptation to other note types by updating schemas, prompts and validation rules. 
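The reported inference settings map onto a small configuration. Kwarg names follow vLLM's `SamplingParams` and `LLM` APIs; the seed value and Hugging Face model identifier are illustrative assumptions, not taken from the paper.

```python
# Inference configuration mirroring the reported settings. Kwarg names follow
# vLLM's SamplingParams/LLM APIs; seed and model id are illustrative only.
sampling_config = dict(
    temperature=0.1,  # near-deterministic decoding
    top_p=0.85,       # nucleus sampling cap favouring factual grounding
    seed=1234,        # fixed seed for run-to-run repeatability (illustrative)
)
engine_config = dict(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # categorisation model
    max_model_len=8192,  # context-window input buffer
)
# On the local GPU server these would be unpacked into
# vllm.SamplingParams(**sampling_config) and vllm.LLM(**engine_config),
# with one note per batch and parallel batches in flight.
print(sampling_config["top_p"], engine_config["max_model_len"])  # 0.85 8192
```

Low temperature plus a fixed seed is what makes the ten-run reproducibility analysis later in the article meaningful: residual variation then reflects the model, not the sampler.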

 

Validation, Targeted Retry and Reproducibility 

Layered guardrails detected and corrected errors. JSON validation checked structural conformity; Pydantic enforcement then verified variable presence and type and, in Step 2, flagged physiologically invalid values that often reflected typographical issues. A source-text validation stage cross-referenced every extracted value against the original note to ensure direct textual grounding. 
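The three layers can be sketched in dependency-free Python (the study used Pydantic for the type layer). The range limits, field names and note text here are illustrative only, not the pipeline's actual constraints.

```python
import json
import re

# Dependency-free sketch of the three guardrail layers; the study used
# Pydantic for layer 2, and these range limits are illustrative only.
RANGES = {"mPAP": (5, 120), "CO": (1.0, 15.0)}  # illustrative plausibility bounds

def validate(raw_output, note_text):
    errors = []
    # Layer 1: structural conformity (output must parse as JSON).
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    # Layer 2: presence, type and physiological plausibility.
    for field, (lo, hi) in RANGES.items():
        value = data.get(field)
        if value is None:
            continue  # conditional fields may legitimately be absent
        if not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number, got {type(value).__name__}")
        elif not lo <= value <= hi:
            errors.append(f"{field}: {value} outside plausible range {lo}-{hi}")
    # Layer 3: source-text grounding - every value must appear in the note.
    for field, value in data.items():
        if isinstance(value, (int, float)):
            pattern = r"\b" + re.escape(f"{value:g}") + r"\b"
            if not re.search(pattern, note_text):
                errors.append(f"{field}: {value} not found in source text")
    return data, errors

note = "RHC on room air: mPAP 38 mmHg, CO 4.2 L/min."
data, errs = validate('{"mPAP": 38, "CO": 4.2}', note)
print(errs)  # []
```

Layer 3 is the anti-hallucination check: a value that cannot be located verbatim in the note is rejected even if it is structurally valid and physiologically plausible.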

 

A targeted retry mechanism simulated human quality assurance. When validation identified issues, feedback guided the LLM to reprocess the note with awareness of prior mistakes. Observations indicated that a maximum of two retries was sufficient for most recoverable errors, improving robustness without few-shot examples or fine-tuning. 
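A feedback-aware retry loop of this kind can be sketched as below; `run_llm` and `validate` are hypothetical stubs standing in for the pipeline's LLM call and guardrail layers, not the authors' code.

```python
import json

# Sketch of the targeted retry mechanism; the cap of two retries follows
# the study's observation, the stubs below are illustrative only.
MAX_RETRIES = 2

def extract_with_retry(note, run_llm, validate):
    feedback = None
    for attempt in range(1 + MAX_RETRIES):
        output = run_llm(note, feedback)     # feedback-aware reprocessing
        data, errors = validate(output, note)
        if not errors:
            return data, attempt
        # Pass the validator's findings back so the next attempt is aware
        # of prior mistakes, mimicking human quality assurance.
        feedback = "Previous output had these problems: " + "; ".join(errors)
    return None, 1 + MAX_RETRIES  # unresolved: the note would be excluded

# Toy demonstration: the first attempt emits malformed JSON, the retry succeeds.
def run_llm(note, feedback):
    return "not-json" if feedback is None else '{"mPAP": 38}'

def validate(output, note):
    try:
        return json.loads(output), []
    except json.JSONDecodeError:
        return None, ["output was not valid JSON"]

data, attempts = extract_with_retry("note text", run_llm, validate)
print(data, attempts)  # {'mPAP': 38} 1
```

Returning `None` after the cap rather than raising is deliberate in this sketch: unrecoverable notes are flagged for exclusion, which matches the nine exclusions reported in the validation cohort.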

 


 

In the validation cohort, 200 notes were processed and 191 were included in the final analysis after automated checks, with nine excluded due to unresolved output errors. Across both steps, 15 notes initially triggered validation errors, of which six were corrected by the retry loop. In Step 1, four testing-condition misplacements occurred and two were fixed on retry. In Step 2, 11 errors were detected and four were corrected on retry. Hallucination was rare and resistant to self-correction, with only one of four hallucination cases resolved within two attempts. For out-of-range values identified by pipeline constraints, half were adjusted into acceptable ranges after feedback. 

 

Reproducibility was assessed by running the pipeline ten times on the validation dataset and calculating Krippendorff’s alpha for fields with sufficient non-null variability. Alpha values ranged from 0.80 to 0.98 for all but one field, Room Air TPG, which showed tentative reliability at 0.76. Raw consistency rates complemented these results by capturing stability regardless of data sparsity. 
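Krippendorff's alpha requires a dedicated implementation, but the complementary raw consistency rate is simple: the fraction of (note, field) cells that are identical across all repeated runs. A minimal sketch, with fabricated toy run outputs:

```python
# Sketch of the raw consistency rate across repeated runs; the run
# outputs below are fabricated toy values for illustration.
def raw_consistency(runs):
    """runs: list of dicts mapping (note_id, field) -> extracted value."""
    keys = set().union(*runs)
    stable = sum(len({run.get(k) for run in runs}) == 1 for k in keys)
    return stable / len(keys)

runs = [
    {("n1", "mPAP"): 38, ("n1", "CO"): 4.2},
    {("n1", "mPAP"): 38, ("n1", "CO"): 4.2},
    {("n1", "mPAP"): 38, ("n1", "CO"): 4.3},  # one unstable cell
]
print(raw_consistency(runs))  # 0.5
```

Unlike alpha, this rate stays defined for sparse fields with little non-null variability, which is why the study reports both.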

 

Performance, Error Patterns and Constraints 

At the note level, 172 of 191 analysed notes were completely accurate, yielding 90% accuracy. Precision reached 99% for correctly identified values and recall was 85% for collecting available values, resulting in an F1 score of 91.5%. Missed values were the most common residual error, affecting 10 notes, while hallucinations were least frequent with one affected note. Misplacements persisted in seven notes, each with a single misplaced field. 
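The headline figures are internally consistent, as a quick arithmetic check shows: F1 is the harmonic mean of precision and recall.

```python
# Worked check of the reported metrics: note-level accuracy from the
# counts, and F1 as the harmonic mean of precision and recall.
accurate_notes, analysed_notes = 172, 191
precision, recall = 0.99, 0.85

accuracy = accurate_notes / analysed_notes
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy * 100))  # 90
print(round(f1 * 100, 1))     # 91.5
```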

 

Performance stratified by data availability followed expected patterns. Notes with minimal quantitative content posed a risk of fabrication, yet the pipeline avoided generating values where none existed and achieved 100% accuracy in the first quartile. In the fourth quartile with the greatest number of extractable fields, missed values rose to 17% of notes, reflecting a bias toward avoiding false positives. Pulmonary Vascular Resistance was the most frequently missed field. Right Ventricular Diastolic Pressure was the most commonly misplaced field. These patterns aligned with an emphasis on clinician trust by prioritising true positives and suppressing spurious outputs. 

 

System design balanced accuracy, efficiency and governance. Running open-source LLMs locally mitigated drift risk and supported consistent behaviour while maintaining data privacy. Zero-shot prompting avoided the cost and complexity of fine-tuning. Limitations centred on scope and annotation. The dataset originated from a multi-hospital system with diverse documentation styles, yet generalisability beyond that environment remains to be tested. Ground truth extraction by a single pulmonary vascular disease expert introduces potential single-rater bias and precludes assessment of inter-rater reliability. Some notes were excluded when errors could not be resolved despite retries, indicating room for improvement in error recovery. Preprocessing depends on institution-specific conventions that may require tuning for broader deployment. 

 

A structured GenAI pipeline using open-source LLMs converted unstructured RHC notes into a computable dataset with layered validation, source-text grounding and targeted retries. In a held-out cohort, results showed 90% note-level accuracy, 99% precision, 85% recall and an F1 score of 91.5% with minimal hallucination. The approach handled heterogeneous formats, emphasised reproducibility and operated in a privacy-preserving local environment. With modular components and a deliberate focus on precision, the pipeline offers a path to reduce manual data abstraction, support research efficiency and enable timely use of clinical information in decision-making, while highlighting the value of broader validation and enhanced annotation strategies. 

 

Source: JAMIA Open 

Image Credit: Freepik

 


References:

Dao N, Quesada L, Hassan SM et al. (2025) Generative artificial intelligence for automated data extraction from unstructured medical text. JAMIA Open, 8(5): ooaf097.




Generative AI transforms unstructured RHC notes into accurate, validated clinical data while ensuring privacy and precision.