Electronic health records (EHRs) are increasingly used to support disease risk prediction, screening strategies and prevention planning. However, models trained on structured EHR data often perform poorly when applied outside the health system in which they were developed. Differences in coding practices, clinical workflows and reporting standards limit portability, even when data are harmonised using a common data model (CDM) such as the OMOP Common Data Model. Mapping to a CDM is resource-intensive and does not fully eliminate semantic variation. A deep learning framework known as Generalizable Risk Assessment with Semantic Projection (GRASP) addresses these challenges by embedding medical concepts into a shared semantic space using large language model (LLM) representations and applying a transformer network to longitudinal patient histories to predict multiple disease risks across healthcare systems.
Limits of Conventional Cross-System EHR Modelling
Risk prediction based on structured EHR data depends heavily on how diagnoses, procedures and medications are coded. Even within a shared CDM, similar clinical concepts may be represented using different codes, while some concepts appear rarely or only in specific settings. Models based on fixed ontologies or co-occurrence statistics can capture partial similarity, but they are constrained by the vocabularies available during training and struggle to adapt as coding systems evolve.
These limitations are particularly relevant for smaller organisations that lack the resources to develop and maintain local prediction models. Full data harmonisation remains costly, and retraining models for each new setting is often impractical. GRASP is designed to address this gap by focusing on semantic alignment rather than explicit one-to-one code mapping. The approach aims to support transfer across datasets by representing medical concepts according to their meaning rather than their local coding frequency, while also allowing previously unseen concepts to be incorporated during deployment.
Semantic Projection and Multi-Outcome Risk Modelling
GRASP maps OMOP vocabulary concepts into semantic embeddings generated from clinical concept descriptions using an LLM, specifically OpenAI text-embedding-3-large. These embeddings are stored in a lookup table, allowing patient histories to be encoded without repeated LLM calls at inference time. The embedding step does not involve individual-level data, enabling use in secure environments without sharing patient records or requiring continuous external connectivity.
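The lookup-table idea can be illustrated with a minimal sketch in Python. `toy_embed` is a hypothetical stand-in for an embedding model such as text-embedding-3-large. It hashes words into a small vector, so it captures word overlap only, not meaning. The concept IDs and descriptions are also made up. The sketch shows the workflow: embeddings are computed once from concept descriptions, and patient histories are then encoded by table lookup alone.

```python
import math
import zlib

EMBED_DIM = 8  # toy size; real LLM embeddings are far larger


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an LLM embedding call (NOT text-embedding-3-large).

    Hashes each lower-case word into one of EMBED_DIM buckets and
    L2-normalises the counts, so it reflects word overlap only.
    """
    vec = [0.0] * EMBED_DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % EMBED_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_lookup(descriptions: dict[int, str]) -> dict[int, list[float]]:
    """Embed each concept description once, offline; no patient data is involved."""
    return {concept_id: toy_embed(text) for concept_id, text in descriptions.items()}


def encode_history(history: list[int], lookup: dict[int, list[float]]) -> list[list[float]]:
    """Encode a patient's concept IDs by table lookup only (no embedding calls).

    IDs missing from the table are skipped here. A new concept could instead be
    embedded from its description and added to the table.
    """
    return [lookup[c] for c in history if c in lookup]


# Hypothetical concept IDs and descriptions, for illustration only.
lookup = build_lookup({
    101: "Type 2 diabetes mellitus",
    102: "Essential hypertension",
    103: "Metformin 500 MG Oral Tablet",
})
seq = encode_history([101, 103, 999], lookup)  # 999 is not in the table
print(len(seq), len(seq[0]))  # -> 2 8
```

In a full pipeline, a sequence like `seq` would presumably be fed to the transformer alongside age and sex. Only the offline `build_lookup` step would ever need to contact an embedding service.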
Patient histories represented by embedded concepts are processed using a multi-layer transformer neural network. The model jointly predicts time-to-event risk for 22 endpoints, including 21 diseases and all-cause mortality. Predictors include age, sex and observed OMOP-mapped concepts related to diagnoses, procedures and drug prescriptions. A two-year washout period after baseline is applied to reduce the influence of conditions closely related to the predicted outcomes. Training uses a Cox proportional hazards loss, and endpoint-specific risk scores are combined with age and sex within a Cox model to estimate time-to-first-event risk.
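The training objective can be sketched as the negative Cox log partial likelihood. This is a simplified illustration built on assumptions the summary does not state. It uses one loss term per endpoint, summed with equal weights, and the Breslow handling of tied event times. The hypothetical `absolute_risk` helper, with made-up coefficients, shows how a Cox model on score, age and sex turns a linear predictor into a risk by a fixed horizon, via 1 - S0(t) ** exp(linear predictor).

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cox_loss(risk, time, event):
    """Negative Cox log partial likelihood for one endpoint, averaged over events.

    risk:  predicted log-hazard score per individual
    time:  follow-up time to the event or to censoring
    event: 1 if the endpoint occurred, 0 if censored
    Tied event times use the Breslow approximation (each event sees the full risk set).
    """
    n_events = sum(event)
    if n_events == 0:
        return 0.0  # no events, so no information for this endpoint
    loss = 0.0
    for r_i, t_i, e_i in zip(risk, time, event):
        if e_i:
            # Risk set: everyone still under observation at t_i (including i).
            risk_set = [r_j for r_j, t_j in zip(risk, time) if t_j >= t_i]
            loss -= r_i - _log_sum_exp(risk_set)
    return loss / n_events


def multi_endpoint_loss(risks, times, events):
    """Joint loss: one Cox term per endpoint, summed (equal weights assumed)."""
    return sum(cox_loss(r, t, e) for r, t, e in zip(risks, times, events))


def absolute_risk(baseline_survival, score, age, sex, coefs):
    """Risk of a first event by horizon t from a Cox model on (score, age, sex).

    baseline_survival: S0(t), the baseline survival at the chosen horizon
    coefs: fitted log-hazard ratios for (score, age, sex); hypothetical here
    """
    linear_predictor = coefs[0] * score + coefs[1] * age + coefs[2] * sex
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# Two individuals with equal scores: one has the event at t=1, one is censored at t=2.
print(cox_loss([0.0, 0.0], [1.0, 2.0], [1, 0]))  # log(2), about 0.693
```

Ranking the individual who has the event earlier as higher risk lowers the loss. This ranking-based objective is what makes the Cox loss a natural fit for time-to-event training with censored follow-up.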
Model development was conducted using UK Biobank data from 391,921 individuals. External evaluation was performed in FinnGen with 253,991 individuals and in the Mount Sinai Health System dataset with 386,755 individuals. Average observation periods extended up to 12 years in UK Biobank, 26 years in FinnGen and 6 years in Mount Sinai, with mean follow-up durations of 11, 10 and 6 years respectively. The average number of unique concepts per individual was 29 in UK Biobank, 39 in FinnGen and 19 in Mount Sinai.
External Validation and Cross-Model Transfer
In UK Biobank cross-validation, GRASP, random-embedding transformers and XGBoost all outperformed an age-and-sex-only baseline. Average improvements in C-index over the baseline were 0.081 for GRASP, 0.069 for random embeddings and 0.078 for XGBoost. Within this internal setting, GRASP did not significantly outperform the other approaches for individual outcomes, suggesting that its main advantage lies in external transfer rather than within-cohort optimisation.
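The C-index measures how often a model assigns higher risk to the individual who experiences the event earlier. Below is a plain implementation of Harrell's C-index. The summary does not state which concordance estimator the study used, so treat this as an illustration of what a C-index difference refers to, not a reproduction of the reported figures. The data are invented.

```python
def harrell_c_index(risk, time, event):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when i had the event and j was still
    under observation afterwards (time[j] > time[i]). It is concordant
    when i also has the higher predicted risk; tied predictions count 0.5.
    Pairs with tied times are ignored in this simple version.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored individuals cannot anchor a pair
        for j in range(len(risk)):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Invented follow-up data: time in years, event = 1 if the disease occurred.
time = [2, 4, 6, 8, 10]
event = [1, 1, 0, 1, 0]
baseline_risk = [0.3, 0.5, 0.4, 0.2, 0.1]  # e.g. an age-and-sex-only model
model_risk = [0.9, 0.7, 0.4, 0.3, 0.1]     # e.g. a model that also uses history

c_base = harrell_c_index(baseline_risk, time, event)
c_model = harrell_c_index(model_risk, time, event)
print(c_base, c_model, c_model - c_base)  # -> 0.75 1.0 0.25
```

A C-index of 0.5 corresponds to random ranking. An average gain of 0.081 over the baseline therefore means that, across endpoints, noticeably more comparable pairs are ordered correctly once patient histories are added.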
When applied to external OMOP-mapped datasets without additional training, GRASP showed higher average performance than a comparable transformer with random embeddings in FinnGen and Mount Sinai. It also outperformed XGBoost in both settings. Statistically significant improvements were observed for 12 of the 22 outcomes in FinnGen and 5 of the 22 outcomes in Mount Sinai, while performance was similar for the remaining endpoints. Asthma, chronic kidney disease and heart failure showed consistent gains across both external datasets.
A further evaluation examined transfer across coding systems without explicit ontology mapping. When trained in UK Biobank and evaluated in Mount Sinai using OMOP-mapped condition concepts, GRASP achieved an average improvement in C-index of 0.056 over the age-and-sex baseline. When evaluated using only ICD-10-CM codes in Mount Sinai, the average improvement was still 0.036, with performance reduced relative to the OMOP-mapped evaluation for 9 of the 22 outcomes. In this setting, conventional models such as XGBoost could not be transferred at all, because their input features are tied to the codes seen during training.
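The reason a fixed-vocabulary model cannot accept ICD-10-CM codes while a semantic model can is sketched below. A one-hot encoding built on training codes, as a tree model would use, gives an all-zero row for any unseen code. Embedding the code's description instead places it near related training concepts. `lexical_embed` is a deliberately crude, hypothetical stand-in for the LLM embedding (a bag of words), and the training concept IDs are made up. E11.9 is the ICD-10-CM code for type 2 diabetes mellitus without complications. The sketch presumes that GRASP embeds ICD-10-CM descriptions into the same space as OMOP concepts, which the summary implies but does not spell out.

```python
import math
from collections import Counter

# Hypothetical training vocabulary (made-up IDs) with concept descriptions.
TRAIN_CONCEPTS = {
    1: "type 2 diabetes mellitus",
    2: "essential hypertension",
}


def one_hot(code, vocabulary):
    """Fixed-vocabulary encoding: a code not seen in training maps to all zeros."""
    return [1 if code == v else 0 for v in vocabulary]


def lexical_embed(text):
    """Hypothetical stand-in for an LLM embedding: a bag-of-words count vector.

    Unlike an LLM embedding, it only reflects shared words, not shared meaning.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# An ICD-10-CM code never seen during training.
icd_code, icd_desc = "E11.9", "Type 2 diabetes mellitus without complications"

print(one_hot(icd_code, list(TRAIN_CONCEPTS)))  # -> [0, 0]: no usable signal

sims = {
    cid: round(cosine(lexical_embed(icd_desc), lexical_embed(desc)), 3)
    for cid, desc in TRAIN_CONCEPTS.items()
}
print(sims)  # -> {1: 0.816, 2: 0.0}
```

An LLM embedding would also place descriptions close together when they share meaning but not wording, which bag-of-words similarity cannot do. That is the property the semantic projection relies on.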
Additional analyses showed that GRASP maintained stronger performance than comparison models at smaller training sample sizes and that enriching concept descriptions with additional ontology relationships did not materially affect results. Alternative biomedical embedding methods yielded similar improvements over random embeddings, and recalibration using age, sex and predicted risk produced comparable calibration across datasets.
GRASP demonstrates that language-based semantic representations of medical concepts can improve the portability of EHR-based risk prediction across healthcare systems and coding frameworks. By reducing reliance on explicit harmonisation and supporting transfer to external datasets, the approach offers a practical pathway for deploying predictive models in heterogeneous clinical environments. Reported limitations include the absence of detailed temporal sequencing, evaluation limited to three high-income settings, potential inherited biases from LLMs and the need for recalibration when applied to new populations. The findings highlight the role of semantic embedding strategies in improving cross-border risk prediction while maintaining operational feasibility.
Source: npj digital medicine