Extracting structured data from unstructured pathology reports remains a significant challenge in healthcare, particularly in oncology, where timely and accurate information is critical for diagnosis, treatment and research. Breast cancer pathology reports, typically composed in narrative free-text form, contain essential diagnostic and prognostic data but require extensive manual processing to become clinically actionable. The variability in formatting, language and content across different institutions adds to this complexity, hindering standardisation and scalability.
To address this issue, a recent study used generative AI, specifically GPT-3.5 integrated within a Streamlit web application, to automate the extraction and structuring of such data. Applied to breast cancer reports from Taipei Medical University Hospital, the system achieved 99.61% accuracy, a substantial step forward in the application of AI to clinical documentation and biomedical informatics.
Automating Data Extraction: System Design and Functionality
At the core of this project was a web-based application that used GPT-3.5 to extract structured data from breast cancer pathology reports. Streamlit was selected as the development platform because it supports rapid creation and deployment of generative AI applications. The system was built as a single-page application, with libraries such as streamlit and pandas handling the interface and data processing. Coding and deployment were simplified by using GitHub Codespaces, which provided a development environment accessible directly from a web browser, removing the need for local IDE installations or complex deployment pipelines.
The back end of the system processed uploaded Excel files containing free-text pathology reports, converting them into structured data through strategically engineered prompts sent to the OpenAI API. GPT-3.5 was chosen over alternative language models like BERT and BioBERT due to its enhanced ability to interpret and generate coherent, contextually rich medical text without requiring extensive domain-specific training or fine-tuning. Prompt engineering was a key component of this process. Prompts were designed and refined iteratively, tailored to the specific terminology and diagnostic patterns found in breast cancer pathology reports. Each prompt was evaluated based on the relevance and precision of the output, and adjustments were made dynamically to maximise accuracy.
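The prompt-and-parse step described above can be illustrated with a minimal sketch. The study's actual prompts and output schema are not given in this article, so the field names in `FIELDS`, the prompt wording and the `call_model` callable (a stand-in for whatever function sends text to the OpenAI API) are all assumptions made for illustration.

```python
import json
from typing import Callable

# Hypothetical field list for illustration; the study's real schema is not given here.
FIELDS = ["laterality", "tumour_size_mm", "histological_type", "er_status"]


def build_prompt(report_text: str) -> str:
    """Compose an instruction that asks for one JSON object with fixed keys."""
    keys = ", ".join(f'"{name}"' for name in FIELDS)
    return (
        "You are extracting data from a breast cancer pathology report.\n"
        f"Return a single JSON object with exactly these keys: {keys}.\n"
        "Use null when the report does not state a value.\n\n"
        f"Report:\n{report_text}"
    )


def extract_fields(report_text: str, call_model: Callable[[str], str]) -> dict:
    """Send the prompt through `call_model` and keep only the expected keys."""
    raw = call_model(build_prompt(report_text))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Missing or unparseable values become None so every row has the same columns.
    return {name: parsed.get(name) for name in FIELDS}
```

Passing the model call in as a function keeps the parsing logic testable without network access; in a real deployment `call_model` would wrap the API request.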
Further attention was paid to API management, including the handling of rate limits and the implementation of secure access via environment variables. The application incorporated robust error handling to ensure data integrity and minimise operational disruptions during processing. Collectively, these design elements enabled a highly functional and reliable system capable of parsing complex medical text with minimal human intervention.
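As a rough illustration of the API management described here, the sketch below reads a key from an environment variable and retries a request with exponential backoff when a rate limit is hit. It is not the study's code: `RateLimitError` is a hypothetical stand-in for the error a client library would raise, and the variable name, attempt count and delays are assumed values.

```python
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error a client library raises."""


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment so it never appears in source code."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


def with_retries(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, waiting 1x, 2x, 4x ... base_delay after rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Only rate-limit errors are retried; any other exception propagates immediately, so a malformed request fails fast instead of being repeated.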
Clinical Relevance and Accuracy of Extracted Information
The AI prototype was designed to extract four key categories of information from the reports: macroscopic data, microscopic data, ancillary studies and pathological staging. In the macroscopic category, the system analysed specimen laterality, tumour site and the presence of associated tissues such as skin or skeletal muscle. It successfully identified right or left breast specimens in core biopsy reports and documented various dimensions and tissue compositions.
Microscopic data extraction covered focality, tumour size, histological classification and grade. This included distinguishing between single and multiple tumour foci, calculating the size of the largest invasive tumour and categorising histological subtypes and carcinoma in situ presence. These outputs were critical for understanding tumour pathology and guiding clinical decisions.
Ancillary studies focused on hormone receptor and protein expression statuses, specifically ER, PR and HER2, which are essential for planning targeted therapies in breast cancer. The system’s ability to extract these markers with high precision was validated through manual review.
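This article does not say how agreement with manual review was scored. One common definition, shown here purely as an assumption, is field-level accuracy: the share of manually reviewed fields for which the extracted value matches exactly. The function name and the dictionary-per-report layout are illustrative choices.

```python
from typing import Dict, List


def field_accuracy(predicted: List[Dict], reference: List[Dict]) -> float:
    """Fraction of reference fields whose extracted value matches manual review."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same reports")
    total = 0
    correct = 0
    for pred_row, ref_row in zip(predicted, reference):
        # Score every field the reviewer recorded; a missing prediction counts as wrong.
        for field, expected in ref_row.items():
            total += 1
            if pred_row.get(field) == expected:
                correct += 1
    if total == 0:
        raise ValueError("no fields to compare")
    return correct / total
```

For example, two reports with two reviewed fields each and a single mismatch would score 3 of 4 fields, or 0.75.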
Pathological staging data, encompassing tumour size (pT), lymph node involvement (pN) and presence of metastasis (pM), was also reliably extracted. These staging components play a crucial role in determining prognosis and therapeutic strategy.
Despite the overall high accuracy, some limitations were observed. Ambiguous language in reports, such as fractional notations or compound anatomical descriptors, occasionally led to misinterpretations. Additionally, the restricted output categories within the application contributed to classification errors, such as interpreting ‘adenocarcinoma’ too narrowly. Refining the system to handle broader output expressions and more varied report phrasing would improve future accuracy.
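One way to soften the restricted-category problem noted above is to map free-text labels to a controlled vocabulary while keeping anything unmapped for human review, instead of forcing it into the nearest category. The sketch below is an assumption about how that could look, not the study's implementation; the vocabulary in `HISTOLOGY_CATEGORIES` and the fallback labels are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical controlled vocabulary; the application's real categories are not listed here.
HISTOLOGY_CATEGORIES = {
    "invasive ductal carcinoma": "invasive ductal carcinoma",
    "idc": "invasive ductal carcinoma",
    "invasive lobular carcinoma": "invasive lobular carcinoma",
    "ilc": "invasive lobular carcinoma",
}


def normalise_histology(raw: Optional[str]) -> Tuple[str, Optional[str]]:
    """Map a free-text label to a category, keeping unmapped text for review."""
    if raw is None or not raw.strip():
        return ("not stated", None)
    key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    if key in HISTOLOGY_CATEGORIES:
        return (HISTOLOGY_CATEGORIES[key], None)
    # Do not force a term such as "adenocarcinoma" into the nearest category.
    return ("other (needs review)", raw.strip())
```

Returning the original text alongside the fallback label means a reviewer can see exactly what the report said and extend the vocabulary when a new term recurs.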
Operational Efficiency, Ethical Considerations and Future Directions
In addition to achieving technical accuracy, the AI-driven application offered significant efficiency gains compared to traditional manual processing. By automating data extraction from narrative pathology reports, the system substantially reduced the time and resources required for clinical documentation. This improvement has the potential to support pathologists by alleviating workload and improving turnaround times in diagnostic workflows.
The research adhered to strict ethical standards: it used de-identified data collected with informed consent and was approved by an institutional review board. The authors noted that ethical deployment of AI systems in clinical settings requires transparency, ongoing monitoring and alignment with data protection regulations such as HIPAA and GDPR. Clinical integration must be supported by intuitive interfaces, clinician training and robust feedback mechanisms to ensure safety and reliability.
The study also acknowledged its limitations. The dataset consisted of 33 reports from a single healthcare system, focusing solely on breast cancer. While this specificity helped standardise the data, it limited the generalisability of the findings. Future work aims to expand the dataset to include pathology reports from other cancer types and institutions, facilitating broader validation. The adoption of international data standards like HL7 and SNOMED could further enhance interoperability and scalability.
The integration of generative AI in the analysis of breast cancer pathology reports offers a compelling model for transforming unstructured medical data into structured, clinically useful information. By combining GPT-3.5 with a Streamlit web application, the research team achieved a 99.61% accuracy rate in data extraction, significantly surpassing traditional methods reliant on manual processing or rule-based NLP models. This approach not only enhances efficiency but also contributes to improved data accessibility and reliability in oncology. While the study focused narrowly on breast cancer pathology within a single institution, its results provide a solid foundation for broader applications of AI in clinical documentation. Future research will extend these findings across more diverse datasets and explore the use of multimodal AI to further improve the scope and precision of medical text analysis.
Source: Journal of Medical Systems