Clinically important oncology data can exist in source electronic health records yet fail to reach downstream datasets. A recent analysis published in JAMIA Open examined data transfer from electronic health records into cancer registry records and FHIR extracts in 30 non-small cell lung cancer patients. Demographic data showed strong agreement across systems, but biomarker information often failed to appear after exchange, raising concerns for precision oncology, real-world evidence and clinical AI.
Data Pathways Reveal Uneven Information Transfer
The comparison covered three clinical data systems: Epic electronic health record documentation; a FHIR endpoint using Draft Standard for Trial Use 2 (DSTU2); and Cancer Registry data containing structured and unstructured variables. Two mutually exclusive patient cohorts allowed separate checks of exchange fidelity and real-world data availability. One cohort included 10 patients selected because all six positive non-small cell lung cancer biomarkers appeared in electronic health record documentation. A second cohort included 20 patients randomly selected from the registry using ICD-10 code C34.x, regardless of biomarker completeness.
The variables covered date of birth, sex, race, ethnicity, smoking status and biomarker status. Manual extraction from electronic health records used chart search functions and review of clinical documentation, including internal and external pathology reports and embedded PDF files from commercial test panels. Registry data came through Microsoft Excel from the hospital-site cancer registrar. FHIR data came through XML using Patient, Condition, DiagnosticReport and Observation resources.
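As a rough illustration of what checking a FHIR XML extract for biomarker results could look like, the sketch below parses a small bundle and lists which of the six biomarkers have an Observation. The sample XML, the use of `code/text` to carry the biomarker name and the function names are all assumptions made for illustration; the study does not describe how its extract was structured or queried.

```python
import xml.etree.ElementTree as ET

# FHIR XML resources live in this namespace.
FHIR_NS = {"f": "http://hl7.org/fhir"}

# Hypothetical, hand-written bundle for illustration only; not study data.
SAMPLE_XML = """
<Bundle xmlns="http://hl7.org/fhir">
  <entry>
    <resource>
      <Observation>
        <code><text value="ALK"/></code>
        <valueString value="Negative"/>
      </Observation>
    </resource>
  </entry>
  <entry>
    <resource>
      <Observation>
        <code><text value="PD-L1"/></code>
        <valueString value="Positive"/>
      </Observation>
    </resource>
  </entry>
</Bundle>
"""

# The six biomarkers named in the article.
EXPECTED_BIOMARKERS = ["EGFR", "KRAS", "BRAF", "ALK", "ROS1", "PD-L1"]


def biomarkers_in_bundle(xml_text):
    """Return {biomarker name: result} for every Observation with both fields."""
    root = ET.fromstring(xml_text.strip())
    found = {}
    for obs in root.iter("{http://hl7.org/fhir}Observation"):
        name = obs.find("f:code/f:text", FHIR_NS)
        value = obs.find("f:valueString", FHIR_NS)
        if name is not None and value is not None:
            # FHIR XML stores primitive values in the "value" attribute.
            found[name.get("value")] = value.get("value")
    return found


def missing_biomarkers(xml_text, expected=EXPECTED_BIOMARKERS):
    """List expected biomarkers that have no Observation in the bundle."""
    found = biomarkers_in_bundle(xml_text)
    return [b for b in expected if b not in found]
```

In a real extract, biomarker names would more likely be carried as coded values than as free text, so the lookup would need a mapping from codes to markers.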
Demographic variables showed complete capture and high concordance across electronic health record documentation, the Cancer Registry and FHIR export. Date of birth, sex and race reached near-perfect agreement. Minor differences appeared in ethnicity coding, including “Not Hispanic or Latino” in FHIR and “Non-Spanish” in electronic health record and registry data. Smoking status had one discordance, with “former smoking” in electronic health record data and “current smoker” in registry data.
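The ethnicity example shows why labels usually need to be harmonised before concordance is scored. The sketch below maps the two labels quoted above to one shared category and then compares them. The mapping table, the category name `NOT_HISPANIC` and the function names are hypothetical, and treating "Non-Spanish" as equivalent to "Not Hispanic or Latino" is an assumption rather than a rule taken from the study.

```python
# Hypothetical harmonisation table; the equivalence is an assumption.
ETHNICITY_MAP = {
    "not hispanic or latino": "NOT_HISPANIC",  # label seen in FHIR
    "non-spanish": "NOT_HISPANIC",             # label seen in EHR and registry
}


def normalise_ethnicity(value):
    """Map a source label to a shared category, flagging unmapped labels."""
    key = value.strip().lower()
    return ETHNICITY_MAP.get(key, "UNMAPPED:" + key)


def concordant(value_a, value_b):
    """Two labels agree when they normalise to the same category."""
    return normalise_ethnicity(value_a) == normalise_ethnicity(value_b)
```

A plain string comparison of the two labels would report a mismatch, while `concordant("Not Hispanic or Latino", "Non-Spanish")` reports agreement. Unmapped labels are kept visible rather than silently dropped, so they can be reviewed by hand.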
Biomarker Data Falls Out of Downstream Systems
The first cohort exposed substantial biomarker loss once information moved beyond the electronic health record. All 10 patients had complete documentation for the six biomarkers in the source record, yet FHIR extracts missed EGFR, KRAS and BRAF values for every patient. ALK and ROS1 appeared in only two patients, leaving 80% missingness for each marker. PD-L1 appeared in one patient, leaving 90% missingness.
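The missingness figures follow directly from the counts reported for this 10-patient cohort. A minimal sketch of the arithmetic, with variable and function names chosen here for illustration:

```python
COHORT_SIZE = 10  # patients with all six biomarkers documented in the EHR

# Number of patients whose result appeared in the FHIR extract, per marker,
# as reported in the article.
fhir_captured = {
    "EGFR": 0,
    "KRAS": 0,
    "BRAF": 0,
    "ALK": 2,
    "ROS1": 2,
    "PD-L1": 1,
}


def missingness_pct(captured, total):
    """Percentage of patients whose documented result did not come through."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - captured) / total


fhir_missingness = {
    marker: missingness_pct(count, COHORT_SIZE)
    for marker, count in fhir_captured.items()
}
```

Here `fhir_missingness` gives 100% for EGFR, KRAS and BRAF, 80% for ALK and ROS1, and 90% for PD-L1, matching the figures above.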
Cancer Registry structured fields captured biomarker information more variably. Missingness ranged from 20% to 100%, with better concordance for registry-required variables. EGFR and ALK had higher agreement with electronic health record documentation, while biomarkers not required during the period showed greater missingness. Manual review of the registry’s unstructured Lab Text field recovered between 10% and 50% of missing biomarkers, depending on the marker.
The second cohort showed similar concerns in a more variable real-world sample. Electronic health record documentation differed by test type, reflecting variation in biomarker availability in routine practice. Compared with the electronic health record, FHIR-exchanged data still showed at least 80% missingness for documented biomarkers across all test types. Registry-required EGFR and ALK data reached 100% concordance with the electronic health record. Non-required biomarkers showed variable capture, and some results appeared only in unstructured registry text.
These patterns create a clear distinction between data that sit in discrete, required fields and data that rely on extraction from complex clinical documentation. Demographics moved consistently through the systems, while biomarker data suffered marked loss despite availability in clinical records.
Interoperability Depends on Standards, Structure and Validation
The contrast between demographic and biomarker concordance suggests that extraction and exchange success depends on system architecture and reporting requirements as much as on whether the information exists in the underlying record. Demographics sit in discrete electronic health record fields and downstream systems require them, supporting near-perfect concordance. EGFR and ALK reached full registry concordance in the randomly selected cohort, with 2021 Commission on Cancer and SEER requirements likely linked to stronger capture.
Semantic differences also affected exchanged data. Electronic health record terminologies followed internal standards, while FHIR and registry systems used other industry standards and value sets. Minor variation in ethnicity coding may look limited, but similar conflicts in medications, therapeutic classes, dosing or laboratory reference ranges can create downstream errors and bias. Data lags can also change concordance. A patient can appear as a former smoker in electronic health record and FHIR data but as a current smoker in the registry when updates have not reached registry extraction.
The limitations remain important. The analysis used one clinical site and has limited generalisability to other settings and populations. Clinical workflows, standards of care, integration levels and data exchange methods vary across centres. Manual chart review provided the reference standard for electronic health record source data quality, but it requires considerable resources. Large language models or natural language processing may offer more scalable approaches, although heterogeneity in formats, laboratory reporting and terminology mapping currently limits full automation of biomarker extraction.
Oncology real-world data can lose clinically important biomarker information as records move through FHIR and registry pipelines. Direct electronic health record documentation contained the most complete clinical information, while downstream sources showed critical gaps. The results reinforce the need to understand data pipelines before relying on extracted real-world data for research, clinical decision support, regulatory use or AI development. Until stronger standards and validated methods address extraction completeness, research using electronic health record-derived datasets needs verification against clinical documentation or clear acknowledgement of information loss and possible systematic bias.
Source: JAMIA Open
References:
App SJ, Meyer AM, Silkensen S et al. (2026) Follow the data: tracking data quality and completeness in oncology real-world data. JAMIA Open, 9(3): ooag052.