Clinical notes remain central to patient care because they support communication across clinicians and create a record of progress from inpatient to outpatient settings. Yet note sectioning remains difficult when formats vary between clinicians and when important information sits inside semi-structured or unstructured free text. Manual extraction of current clinical information is labour-intensive, prone to error and unsuitable for large-scale analysis. MedSlice addresses that problem by automating extraction of three clinically relevant sections of oncology progress notes: History of Present Illness, Interval History, and Assessment and Plan. Because documentation practices varied, History of Present Illness and Interval History were combined into a single label, recent clinical history, leaving two extraction targets. The method was designed as a scalable pipeline that runs on local or cloud hardware, while also addressing the privacy, cost and accessibility constraints that can limit the use of proprietary language models in clinical environments.
Building a Targeted Sectioning Pipeline
The development process used clinical notes from three oncology groups: breast, gastrointestinal and neurological. Two nurse practitioners annotated the notes, first coding an initial gastrointestinal subset to support familiarisation and codebook development, then independently coding 653 notes for spans linked to recent clinical history and assessment and plan. Inter-rater reliability was measured with the Jaccard Index. When agreement exceeded 80%, the combined annotations became the final label; notes below that threshold were re-coded through group discussion with a third-party adjudicator. A further 494 notes were then single-coded using the finalised codebook, producing a dataset of 1147 clinical notes.
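For intuition, the agreement check can be expressed as a Jaccard Index over character offsets, as in the minimal Python sketch below. Treating each annotation as a set of character positions is one plausible reading; the exact granularity and the helper names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of span-level inter-rater agreement via the Jaccard Index.
# Offset-level granularity is an assumption; the paper does not specify it.

def span_to_chars(spans):
    """Expand (start, end) spans into a set of character positions."""
    return {i for start, end in spans for i in range(start, end)}

def jaccard(spans_a, spans_b):
    """Jaccard Index between two annotators' span sets for one note."""
    a, b = span_to_chars(spans_a), span_to_chars(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: perfect agreement
    return len(a & b) / len(a | b)

# Notes at or below the 80% agreement threshold go to adjudication.
annotator_1 = [(120, 480)]  # hypothetical "recent clinical history" span
annotator_2 = [(100, 470)]
needs_adjudication = jaccard(annotator_1, annotator_2) <= 0.80
```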
The dataset included 487 breast notes, 465 gastrointestinal notes and 195 neurological notes from 433 unique patients. Physicians authored most notes, followed by nurse practitioners and physician assistants. Average note length exceeded 1500 tokens across all groups, with the gastrointestinal notes showing the highest average length. Recent clinical history appeared in 86.6% of all notes and assessment and plan in 87.2%.
The pipeline compared several approaches. Rule-based baselines included SecTag and MedSpaCy. A Clinical Longformer with a 4096-token context window was also trained to predict the start and end positions of target sequences. Five large language models were evaluated for section identification: GPT-4o, GPT-4o mini, Llama 3.2 1B Instruct, Llama 3.2 3B Instruct and Llama 3.1 8B Instruct. OpenAI models ran on a HIPAA-compliant endpoint, while Meta models ran on a virtual machine with an 8192-token context window. Fine-tuning and inference took place on a HIPAA-secure virtual machine equipped with an A100 40 GB GPU.
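As a rough illustration of the extractive baseline, the sketch below frames section extraction as Longformer question answering that predicts start and end token positions. The public yikuan8/Clinical-Longformer checkpoint and the QA framing are assumptions rather than the authors' exact setup, and the freshly initialised QA head would still need training on the annotated spans before its outputs mean anything.

```python
# Hedged sketch: Longformer with a QA-style head predicting the start/end
# token positions of a target section within a 4096-token context.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL = "yikuan8/Clinical-Longformer"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)  # QA head randomly initialised

note = ("Interval History: Patient tolerated cycle three well. "
        "Assessment and Plan: Continue current regimen, follow up in 3 weeks.")
inputs = tokenizer("assessment and plan", note,
                   truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# After fine-tuning on annotated spans, the argmax positions would bound
# the predicted section; untrained, they are placeholders.
start = out.start_logits.argmax()
end = out.end_logits.argmax()
predicted_span = tokenizer.decode(inputs["input_ids"][0, start:end + 1])
```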
Fine-Tuning, Matching and Performance
Supervised fine-tuning used the Unsloth library and rank-stabilised LoRA. Training parameters were set at rsLoRA rank and alpha of 16, five epochs, batch size of 2 and a learning rate of 2e-4. The fine-tuning dataset consisted of 487 breast cancer centre notes, with no patient overlap with the test set. The process took one hour for the largest model, Llama 3.1 8B, and twenty minutes for the smallest, Llama 3.2 1B.
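A configuration along these lines reproduces the reported hyperparameters in Unsloth. The checkpoint name, prompt format and dataset wiring below are placeholders rather than the authors' published code.

```python
# Sketch of rsLoRA fine-tuning with the reported settings:
# rank/alpha 16, five epochs, batch size 2, learning rate 2e-4.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a 4-bit base model; the checkpoint name is an assumed stand-in.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach rank-stabilised LoRA adapters with the reported rank and alpha.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    use_rslora=True,  # rank-stabilised scaling: alpha / sqrt(r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder example; the real training set held 487 breast-centre notes.
train_dataset = Dataset.from_list([
    {"text": "### Note:\n<progress note>\n### Assessment and Plan:\n<target span>"},
])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        num_train_epochs=5,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```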
The evaluation pipeline then processed model outputs for each target section. Using vLLM for inference, the system generated the first five words and last five words of each predicted span. These 5-grams were matched back to the source note. Because exact matches were uncommon, fuzzy matching was added using a sliding window of 5-grams and Levenshtein distance. Matches above 80% similarity were treated as valid, allowing predicted spans to be aligned with the original note text.
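Conceptually, the fuzzy step can be sketched as a sliding 5-word window scored with a normalised Levenshtein similarity. The use of rapidfuzz and the function name below are assumptions, since the paper does not name its string-matching implementation.

```python
# Sketch of fuzzy boundary matching: slide a 5-word window over the note
# and accept the window most similar to the model's boundary 5-gram,
# provided similarity exceeds the 80% threshold.
from rapidfuzz.distance import Levenshtein

def locate_boundary(note_words, boundary_5gram, threshold=0.80):
    """Return the start index of the best-matching window, or None if no
    window clears the similarity threshold."""
    target = " ".join(boundary_5gram)
    best_idx, best_score = None, threshold
    for i in range(len(note_words) - 4):
        window = " ".join(note_words[i:i + 5])
        score = Levenshtein.normalized_similarity(window, target)  # 0.0-1.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

note = "Interval History : Patient tolerated cycle three well .".split()
# Model output often differs slightly from the source, e.g. "History:" vs "History :"
start = locate_boundary(note, ["Interval", "History:", "Patient", "tolerated", "cycle"])
```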
Performance was assessed through precision, recall and F1 score. Each model ran three times, followed by bootstrapping with 1000 iterations per run to generate 3000 metric sets. Statistical testing used the Friedman test, with post-hoc pairwise comparisons via the Wilcoxon signed-rank test and a Bonferroni-adjusted alpha of 0.01.
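The testing scheme can be illustrated with SciPy on synthetic bootstrap samples; the numbers below are placeholders shaped like the study's design (3 runs × 1000 iterations per model), not its data.

```python
# Sketch of the significance-testing scheme: Friedman test across models,
# then pairwise Wilcoxon signed-rank tests at the adjusted alpha of 0.01.
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# 3000 paired bootstrap F1 estimates per model; means loosely echo the
# reported scores but the samples themselves are synthetic.
scores = {
    "llama-3.1-8b-ft": rng.normal(0.89, 0.02, 3000),
    "gpt-4o":          rng.normal(0.78, 0.02, 3000),
    "gpt-4o-mini":     rng.normal(0.68, 0.02, 3000),
}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman test: chi2={stat:.1f}, p={p:.3g}")

ALPHA = 0.01  # Bonferroni-adjusted significance threshold, as reported
for a, b in combinations(scores, 2):
    _, p = wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: p={p:.3g}, significant={p < ALPHA}")
```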
Results showed clear gains from supervised fine-tuning. SecTag reached an F1 score of 0.30 for assessment and plan but did not generate a valid output for recent clinical history. MedSpaCy achieved an average F1 score of 0.19 across both labels, while Clinical Longformer reached 0.62. Among the language models, the strongest performance came from fine-tuned Llama 3.1 8B, with F1 scores of 0.89 for recent clinical history and 0.94 for assessment and plan. Fine-tuned Llama 3.2 3B also performed strongly, with F1 scores of 0.88 and 0.92 respectively. GPT-4o scored 0.78 for recent clinical history and 0.79 for assessment and plan, while GPT-4o mini scored 0.68 and 0.72. Fine-tuned Llama 3.1 8B therefore outscored GPT-4o by 11 to 15 percentage points across the two labels.
Robustness, Limits and Practical Value
Internal validity testing used gastrointestinal and neurological notes from the same institution, drawn from patient populations distinct from the breast centre cohort used for training. External validity testing applied the best-performing model to 50 breast cancer progress notes from UCSF, all annotated with the validated codebook. On this external set, Llama 3.1 8B achieved F1 scores of 0.82 for recent clinical history and 0.87 for assessment and plan, indicating that the pipeline retained strong performance beyond the training environment.
The practical implications extend beyond note sectioning itself. Extracting only the relevant sections before downstream analysis reduces the input size passed to larger and more resource-intensive language models. That lowers computational demand and energy use while preserving task performance. The approach also supports deployment on local or cloud-based systems, giving institutions a privacy-conscious and cost-effective alternative to proprietary models. Smaller fine-tuned models trained on fewer than 500 notes still delivered high-quality outputs, which broadens access for teams working under tighter resource constraints.
Limitations remain. Error analysis of the top-performing model showed overprediction and underprediction in sections with ambiguous or inconsistent boundaries. The study focused on notes written by physicians, nurse practitioners and physician assistants, and did not evaluate notes from other clinical staff such as physical therapists, occupational therapists or nutritionists. All analysed notes came from academic medical centres, so note-style variability across other hospital settings was not assessed.
MedSlice delivers a robust method for segmenting clinically relevant sections from oncology progress notes using fine-tuned open-source language models. Across internal and external testing, Llama 3.1 8B outperformed proprietary alternatives and maintained strong results on notes from another institution. The pipeline combines curated annotation, supervised fine-tuning and fuzzy matching to extract recent clinical history and assessment and plan with high accuracy. By supporting local or cloud deployment, lowering computational demand and reducing reliance on proprietary systems, the method offers a scalable route for clinical documentation analysis across diverse healthcare settings.
Source: JAMIA Open
References:
Davis J, Sounack T, Sciacca K et al. (2026) MedSlice: fine-tuned large language models for secure clinical note sectioning. JAMIA Open, 9(1):ooaf179.