Systemic mastocytosis (SM) presents with heterogeneous, multisystem symptoms that can complicate recognition and delay diagnosis. Routine clinical documentation captures these signals unevenly, particularly in free-text notes that sit outside structured fields. A retrospective analysis within an integrated health system developed and validated a rule-based natural language processing (NLP) approach to extract 23 potentially SM-related symptoms from unstructured electronic health records (EHR). The method was trained and tested on manually annotated notes, then deployed at scale across SM, chronic spontaneous urticaria (CSU) and matched controls. High precision and recall for most symptoms, along with distinct note-level symptom patterns, indicate that NLP can surface clinically relevant information that is often missed by structured data alone.
Algorithm Development and Cohort Design
The analysis drew on EHR data from Kaiser Permanente Southern California (KPSC) covering 2008–2023. Patients with a manually confirmed SM diagnosis (n=135) were matched 1:2:3 by age, sex and index date to patients with CSU (n=270) and to individuals without SM or CSU (n=405). Notes dated within one year before or after each index date were extracted and included provider narratives and patient–provider communications. Encounter types and note categories unlikely to contain relevant symptoms were excluded. Pre-processing standardised the text through lowercasing, sentence splitting and tokenisation, together with abbreviation normalisation and misspelling correction.
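The pre-processing steps named above can be illustrated with a minimal Python sketch. The abbreviation and misspelling maps, the function names and the naive sentence splitter are all assumptions made for illustration; the study's actual lists and tooling are not described here.

```python
import re

# Hypothetical lookup tables; the study's real lists are not given in this article.
ABBREVIATIONS = {"abd": "abdominal", "n/v": "nausea and vomiting"}
MISSPELLINGS = {"diarhea": "diarrhoea", "flusing": "flushing"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep word-like tokens, including slashed forms such as n/v."""
    return re.findall(r"[a-z0-9]+(?:/[a-z0-9]+)*", sentence.lower())


def normalize(tokens):
    """Correct known misspellings, then expand known abbreviations."""
    out = []
    for tok in tokens:
        tok = MISSPELLINGS.get(tok, tok)
        out.extend(ABBREVIATIONS.get(tok, tok).split())
    return out


def preprocess(note):
    """Return one list of normalised tokens per sentence in the note."""
    return [normalize(tokenize(s)) for s in split_sentences(note)]


sentences = preprocess("Pt reports Abd pain. Denies diarhea or N/V.")
```

A production pipeline would likely need a clinical sentence splitter, since periods in abbreviations and dosages break simple punctuation rules.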
Symptom coverage spanned 23 terms across cutaneous, gastrointestinal, neuropsychiatric, musculoskeletal, systemic and severe allergic reaction domains. Candidate terms and phrases were assembled from prior work and the Unified Medical Language System, then refined through clinician review and corpus-trained Word2Vec models to add synonyms, morphological variants and common misspellings. The training set comprised 339 patients and 57 495 notes. Validation used 818 double-annotated notes from 16 additional SM patients containing at least one symptom keyword. Interrater agreement on the validation set was high, with percent agreement ranging from 97.8% to 100% and kappa coefficients from 0.764 to 1.0.
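Percent agreement and kappa can diverge when a symptom is rare, because kappa discounts the agreement expected by chance. The sketch below assumes Cohen's kappa for two annotators with binary labels (the article does not state which kappa variant was used), and the label vectors are invented for illustration.

```python
def percent_agreement(a, b):
    """Share of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations (1 = symptom present, 0 = absent), not study data.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 90% of notes, yet kappa is roughly 0.74 because most of that agreement comes from the many notes both marked as absent.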
Rule-based classification operated at sentence level, combining symptom phrase detection with contextual cues for negation, uncertainty, temporal references, non-patient mentions and precautionary language. Sentences were classified as present or absent for each symptom, and a note was marked positive if any sentence indicated presence. Iterative development across ten training batches tuned the rules until overall sensitivity and positive predictive value (PPV) each exceeded the predefined threshold of 90%.
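A sentence-level classifier with note-level aggregation of this kind might look like the following Python sketch. The symptom patterns and context cues are hypothetical stand-ins, and the rule that any cue anywhere in a sentence suppresses the mention is a deliberate simplification rather than the study's actual logic.

```python
import re

# Hypothetical symptom lexicon; the study covered 23 symptoms with far richer phrase lists.
SYMPTOM_PATTERNS = {
    "flushing": re.compile(r"\b(flushing|flushed)\b"),
    "diarrhoea": re.compile(r"\b(diarrhoea|diarrhea|loose stools?)\b"),
}

# Hypothetical context cues, one pattern per cue type described in the text.
EXCLUSION_CUES = [
    re.compile(r"\b(no|denies|denied|without|negative for)\b"),      # negation
    re.compile(r"\b(possible|possibly|rule out|may have)\b"),        # uncertainty
    re.compile(r"\b(history of|years ago|in the past)\b"),           # temporal
    re.compile(r"\b(mother|father|sister|brother|family history)\b"),  # non-patient
    re.compile(r"\b(may cause|side effects?|call if|return if)\b"),  # precautionary
]


def classify_sentence(sentence):
    """Return the symptoms counted as present in one sentence."""
    s = sentence.lower()
    if any(cue.search(s) for cue in EXCLUSION_CUES):
        return set()  # simplification: any cue suppresses every mention in the sentence
    return {name for name, pat in SYMPTOM_PATTERNS.items() if pat.search(s)}


def classify_note(sentences):
    """A note is positive for a symptom if any sentence indicates presence."""
    present = set()
    for sentence in sentences:
        present |= classify_sentence(sentence)
    return present


note = [
    "Patient reports flushing after meals.",
    "Denies diarrhoea.",
    "Mother has flushing.",
]
found = classify_note(note)
```

Here only the first sentence counts, so the note is positive for flushing alone. A more careful rule set would restrict each cue to a window around the symptom phrase instead of the whole sentence.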
Performance and Validation Outcomes
Validation against adjudicated annotations showed strong performance for most symptoms. PPV exceeded 90% for most symptoms, with notable exceptions where ambiguity was common. Epigastric or abdominal bloating had a PPV of 76.47% and swelling 80.43%. Spots, lesions or hives reached 84.69% and flushing or redness 88.89%. Sensitivity exceeded 92% across assessed symptoms. F1 scores were greater than 0.9 for nearly all targets, with epigastric or abdominal bloating at 0.87 and swelling at 0.89. Burning, brain fog or difficulty concentrating, and syncope could not be evaluated because the validation set contained no confirmed positive cases.
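For reference, PPV, sensitivity and F1 all follow from note-level counts of true positives, false positives and false negatives. The sketch below uses hypothetical counts that are not taken from the study.

```python
def validation_metrics(tp, fp, fn):
    """Compute PPV (precision), sensitivity (recall) and F1 from confusion counts."""
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else float("nan")
    return ppv, sensitivity, f1


# Hypothetical counts for one symptom: 45 correct hits, 5 false alarms, 3 misses.
ppv, sensitivity, f1 = validation_metrics(tp=45, fp=5, fn=3)
```

When a symptom has no confirmed positive notes, sensitivity has a zero denominator and is undefined, which is why such symptoms cannot be evaluated.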
Discrepancy review highlighted recurrent sources of error. False positives often occurred when the rules did not fully exclude internal organ findings, historical mentions or precautionary text such as medication side-effects and generic warnings. False negatives were most frequently linked to missed keyword variants, misclassified negations or over-exclusion of instructional language. These patterns underscore the need for ongoing curation of lexicons and context rules when deploying symptom extraction in real-world documentation.
The implementation dataset encompassed 118 252 notes from 810 patients. Across the full set, the average number of sentences per note was 9.9, in line with 10.0 in the training set. The validation set’s focus on symptom-containing notes produced a higher average of 24.9 sentences per note, reflecting the richer context in which symptom language appears.
Real-World Symptom Documentation Patterns
Applying the final algorithm to the full cohort revealed distinct documentation profiles. At least one target symptom appeared in 15.9% of all notes. By group, proportions were highest in CSU (19.6%), followed by SM (15.5%) and non-SM/non-CSU controls (11.0%). Category-level frequencies showed cutaneous symptoms as most common overall, particularly in CSU notes, where cutaneous mentions reached 13.29%. In contrast, SM notes more frequently contained gastrointestinal symptoms at 6.77% and systemic symptoms at 3.90%.
At the level of specific symptoms, spots, lesions or hives were documented in 9.69% of CSU notes and 3.39% of SM notes. Swelling and diarrhoea were also documented more often in SM and CSU notes than in control notes. Neuropsychiatric symptoms, including depression, anxiety and headache, were noted across all groups, while musculoskeletal symptoms were least frequent overall. Documentation density differed as well: SM notes were more likely to record multiple distinct symptoms, with 0.67% of SM notes containing five or more, compared with 0.30% in CSU and 0.14% in controls. These note-level patterns indicate greater symptom complexity in SM within the observation window.
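Note-level proportions like these reduce to counting distinct symptoms per note. A minimal Python sketch, assuming each note has already been mapped to the set of symptoms found in it (the function name and sample data are hypothetical):

```python
def share_with_at_least(note_symptoms, k):
    """Fraction of notes that contain at least k distinct symptoms."""
    if not note_symptoms:
        return 0.0
    hits = sum(len(set(symptoms)) >= k for symptoms in note_symptoms)
    return hits / len(note_symptoms)


# Hypothetical extraction output for four notes, not study data.
notes = [
    {"flushing", "diarrhoea", "headache", "swelling", "fatigue"},
    {"headache"},
    set(),
    {"flushing", "swelling"},
]

any_symptom = share_with_at_least(notes, k=1)
five_or_more = share_with_at_least(notes, k=5)
```

Running the same function per cohort with `k=1` and `k=5` would give the kind of group comparison reported above.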
A rule-based NLP approach reliably identified potentially SM-related symptoms from unstructured EHR narratives and characterised meaningful documentation patterns across SM, CSU and matched controls. High note-level precision and sensitivity for most symptoms, combined with rigorous manual validation, suggest practical utility for surfacing real-world symptom signals that may support earlier recognition and inform care planning. Important caveats remain, including dependence on what is documented, incomplete coverage of rare linguistic variants and limited generalisability without site-specific adaptation. Nonetheless, structured extraction from free text offers a scalable path to monitor symptom burden, refine diagnostic awareness and enable data-driven strategies for rare conditions in routine practice.
Source: JAMIA Open