Radiologists carry a substantial administrative workload, and diagnostic coding is a routine but high-impact part of it. ICD-10 coding is required for reimbursement in many health systems and is a frequent source of billing errors. In Germany alone, a large volume of hospital bills was flagged for review in 2023, and earlier reporting attributed a considerable share of those reviews to coding errors, with many of the reviewed bills found to be incorrect. Large language models have been explored as tools to reduce this burden, but earlier evaluations of older ChatGPT versions produced mixed results across specialties and data types. ChatGPT-5 has been described as having enhanced reasoning and analytical abilities, raising interest in whether it can support radiology-specific ICD-10 coding with measurable gains in efficiency.

 

Moderate Accuracy on a Large Multilingual Radiology Dataset

Performance was assessed using 2,738 fictitious radiology reports drawn from the PARROT database, which spans multiple imaging modalities and 13 languages. ChatGPT-5 assigned each report a single most relevant ICD-10 code, which was then compared with the predefined database reference label. Exact-code concordance was reached in just over half of all reports, indicating moderate agreement when full ICD-10 specificity was required. Agreement was stronger at broader hierarchical levels: the first character matched in most cases, and the middle characters also aligned in a large proportion of reports. Concordance fell when the most specific code elements were compared, reflecting the difficulty of consistently selecting highly granular codes.
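
The hierarchical comparison can be illustrated with a short sketch. The minimal Python example below computes level-wise concordance for ICD-10 code pairs; the function name and the sample code pairs are illustrative assumptions, not taken from the study.

from statistics import mean

def concordance_levels(predicted: str, reference: str) -> dict:
    """Compare two ICD-10 codes at increasing levels of specificity."""
    p = predicted.replace(".", "").upper()
    r = reference.replace(".", "").upper()
    return {
        "first_character": p[:1] == r[:1],  # broadest, chapter-level letter
        "category": p[:3] == r[:3],         # first three characters
        "full_code": p == r,                # exact-code concordance
    }

# Illustrative (predicted, reference) pairs, not study data.
pairs = [("C34.1", "C34.9"), ("I10", "I10"), ("J18.9", "J18.9")]
for level in ("first_character", "category", "full_code"):
    rate = mean(concordance_levels(p, r)[level] for p, r in pairs)
    print(f"{level}: {rate:.0%}")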

 

A random subset of mismatches was reviewed by a blinded adjudicator who did not know which code came from ChatGPT-5. In this review, the ChatGPT-derived code was more often judged the better fit than the database label. This suggests that divergence from the reference label did not necessarily indicate poorer coding quality, and that predefined labels may not always reflect the most accurate choice for the report content.

 

Language Effects Reduced Concordance and Increased Coding Time

The same PARROT reports were also coded in their original non-English versions to evaluate multilingual performance. Coding in native languages reduced concordance across the ICD-10 hierarchy compared with coding the English translations. Agreement at the broadest character level remained high but dropped slightly, while the decline was more apparent at the middle characters and the most specific final characters.

 

Coding speed also differed by language. Median coding time was slightly faster for English reports than for the non-English versions, and the spread of times was wider when reports were coded in languages other than English. These findings indicate that multilingual input introduced additional friction even though overall coding remained rapid. Blinded adjudication of the non-English coding was not performed because of limited language coverage within the study team, so it could not be determined whether the lower concordance reflected reduced clinical correctness or merely disagreement with the database reference labels.

 

PET/CT Results Showed Large Time Savings with Comparable Coding

A separate PET/CT cohort was used to assess both coding accuracy and efficiency against a manual baseline. One hundred fictitious PET/CT reports were created using standardised institutional text modules for PET/CT with contrast-enhanced diagnostic CT, designed to reflect diagnoses and incidental findings typically encountered at a tertiary university hospital centre in Berlin. Reports were coded manually by an experienced clinician and separately coded with ChatGPT-5 assistance. Both coders were allowed to use search engines and the official ICD-10 library.
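
To make the ChatGPT-assisted step concrete, the sketch below shows one plausible way to request a single ICD-10 code per report using the OpenAI Python client. The model identifier, prompt wording and output handling are assumptions for illustration; the study's exact setup is not described here.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def suggest_icd10(report_text: str) -> str:
    """Ask the model for the single most relevant ICD-10 code (sketch only)."""
    response = client.chat.completions.create(
        model="gpt-5",  # assumed model identifier, not confirmed by the study
        messages=[
            {"role": "system",
             "content": ("Return only the single most relevant ICD-10 code "
                         "for the following radiology report.")},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()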

 

In this cohort, the two approaches produced the same full ICD-10 code in most reports, while a smaller proportion differed. Agreement was particularly high at broader code levels and remained lower at the most specific characters. Discordant cases were reviewed by a blinded adjudicator, who more often favoured the ChatGPT-based code than the manual code, though this difference was not statistically significant.

 

Time savings were pronounced. Manual coding required substantially more time per report, while ChatGPT-assisted coding reduced the median time to only a few seconds. The median time saved per report was over two minutes, with some cases showing larger savings. Efficiency data were not available for the PARROT cohort because manual coding times were not provided for those database reports.
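
The efficiency metric reported here, a median of per-report time differences, can be reproduced in a few lines. The timings below are invented purely to show the calculation; none of the numbers come from the study.

from statistics import median

# Illustrative per-report coding times in seconds, not study data.
manual_seconds = [160, 210, 145, 300, 180]
assisted_seconds = [6, 5, 9, 7, 6]

# The median time saved is the median of the paired differences,
# which is not the same as the difference of the two medians.
saved = [m - a for m, a in zip(manual_seconds, assisted_seconds)]
print(f"Median time saved per report: {median(saved)} s")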

 

ChatGPT-5 achieved moderate exact-code concordance with reference labels on a large radiology dataset, with stronger alignment at broader ICD-10 hierarchy levels than at full code specificity. Blinded review of mismatched codes suggested that ChatGPT-derived choices were often the better fit than the database labels, supporting its potential as a coding aid rather than a simple replication tool. In PET/CT reporting, ChatGPT-assisted coding matched manual coding outcomes while dramatically reducing coding time. Multilingual coding lowered concordance and slightly increased coding time, indicating that language context can influence performance and warrants further evaluation. These results strengthen the case for time-saving ICD-10 coding assistance, alongside the need for careful oversight and data-secure deployment in clinical environments.

 

Source: International Journal of Medical Informatics

Image Credit: iStock


References:

Ruhwedel T, Rogasch JMM, Dahlke PM et al. (2026) Less time coding, more time caring: performance evaluation of ChatGPT-5 for ICD-10 coding of radiology reports. International Journal of Medical Informatics, 210:106296.


