Clinical teams face growing volumes of notes, letters and test results stored across electronic health records. Summaries generated by artificial intelligence promise to condense this information into a format that is easier to use at the point of care. The challenge is how to assess the quality of these summaries at speed and scale without sacrificing clinical standards. A medical large language model acting as a judge offers a practical route. Using a validated rubric tailored to provider needs, automated judging reached physician-level agreement while operating far faster and at lower cost. Results held across multiple specialties and transferred to a separate evaluation task, pointing to a scalable way to support evaluation of clinical documentation tools.

 

Clinically Grounded Rubric and Real-World Material 

The evaluation centred on a provider-focused rubric, the Provider Documentation Summarization Quality Instrument (PDSQI-9), developed to reflect issues common in clinical summarisation. It examines key aspects clinicians care about, including factual accuracy, relevance, organisation, clarity, succinctness, synthesis, citation practice and avoidance of stigmatising language. The summaries under review were built from real notes, with each patient represented by several prior encounters, and were scored by a panel of physicians. The material covered a range of specialties grouped into Primary Care, Surgical Care, Emergency or Urgent Care, Neurology or Neurosurgery, and Other Specialty Care. To enable development and testing without leakage, the corpus was split into a larger development portion and a smaller held-out set with comparable characteristics, such as note length and token counts. 

 


 

Automated judges were given the original notes, the candidate summary and the rubric, then asked to score across the same dimensions as the clinicians. Agreement with physician medians was measured with standard reliability statistics drawn from clinical and social science evaluation. Additional analyses explored whether the automated judge could serve as an extra reviewer, or stand in for one clinician, without changing overall panel reliability. Generalisability was tested on a diagnosis-focused benchmark that uses a different rubric, providing a check on transfer beyond the initial dataset.
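To make the setup concrete, the sketch below shows one way such a rubric-driven judge could be wired up: a prompt pairing the source notes and candidate summary with rubric attributes, a call to a judge model, and a chance-corrected agreement check against physician medians. The attribute names, prompt wording, model identifier and OpenAI-compatible client are illustrative assumptions; the study's actual prompts, judge models and reliability statistics are not reproduced here.

```python
# Hypothetical sketch of a rubric-driven LLM judge; not the study's actual prompts or models.
import json

from openai import OpenAI                      # assumes an OpenAI-compatible endpoint
from sklearn.metrics import cohen_kappa_score  # one common chance-corrected agreement statistic

# Illustrative subset of rubric attributes, approximated from the article's description of PDSQI-9.
ATTRIBUTES = ["accuracy", "organisation", "succinctness", "synthesis", "citation"]

client = OpenAI()  # API key is read from the environment


def judge_summary(source_notes: str, summary: str, model: str = "gpt-4o") -> dict:
    """Ask a judge model to rate one summary on each rubric attribute (1-5 scale)."""
    prompt = (
        "You are evaluating a clinical summary against the source notes.\n"
        f"Rate each attribute from 1 (poor) to 5 (excellent): {', '.join(ATTRIBUTES)}.\n"
        'Return only JSON, e.g. {"accuracy": 4, ...}.\n\n'
        f"SOURCE NOTES:\n{source_notes}\n\nSUMMARY:\n{summary}"
    )
    response = client.chat.completions.create(
        model=model,  # placeholder name; the study compared several judge models
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)


def agreement(judge_scores: list[int], physician_medians: list[int]) -> float:
    """Linearly weighted kappa between judge scores and physician medians across cases."""
    return cohen_kappa_score(judge_scores, physician_medians, weights="linear")
```

In practice the returned JSON would need validation, and the agreement analysis would be run per attribute across the full held-out set rather than on a single list of scores.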

 

Reliability With Less Delay and Lower Cost 

A reasoning model emerged as the leading single automated judge, matching physician medians closely on overall quality and on the rubric’s component scores. Agreement on the primary metric reached the level typically associated with strong consistency across raters, and performance remained high even without elaborate prompting. Other models, including a non-reasoning system and open-source options, also achieved clinician-level alignment, though with a modest gap to the top performer. Smaller open-source models improved meaningfully after parameter-efficient training and preference optimisation but still showed issues with stability and score granularity that limited parity with larger systems. 

 

Time and cost differences were substantial. The leading automated judge produced a complete evaluation in under half a minute on average, while human reviewers took around ten minutes to complete the same task. Per-case costs for the automated approach were small compared with estimates based on physician consulting time. Multi-agent judging, which orchestrated several model reviewers and a coordinator to mimic panel deliberation, added both time and expense. It brought score distributions closer to the spread seen among clinicians but did not surpass the single reasoning judge on overall agreement. 
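The panel-style arrangement can be pictured as a thin orchestration layer over the single judge sketched earlier: several judge models score independently and a coordinator reconciles them. In the minimal version below the coordinator is simply a per-attribute median; the study's coordinator ran actual deliberation rounds, and the model list is a placeholder.

```python
# Illustrative multi-agent judging loop; a per-attribute median stands in for coordinated deliberation.
from statistics import median


def multi_agent_judge(source_notes: str, summary: str, judge_models: list[str]) -> dict:
    """Score a summary with several judge models, then reconcile the panel per attribute."""
    panel = [judge_summary(source_notes, summary, model=m) for m in judge_models]
    return {attr: median(scores[attr] for scores in panel) for attr in ATTRIBUTES}
```

Each extra panel member adds another full model call per case, which is where the additional inference time and cost arise.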

 

Transfer to the separate diagnosis benchmark sustained high agreement for the top automated judge without any additional training. Treating the model as either an added rater or a substitute did not materially change reliability for the panel as a whole, reinforcing its value as a scalable adjunct to expert review in settings where clinician time is constrained. 

 

Model Behaviours and Limits 

Reasoning models aligned more closely with clinicians on dimensions that demand careful synthesis of details scattered across long notes. These strengths were most evident in citation discipline, the organisation of key facts and the ability to weigh mixed evidence when a summary combined accurate and incomplete elements. Differences by provider specialty, note length or summary length were small, suggesting the approach did not favour particular clinical contexts within the tested range. 

 

The multi-agent setup achieved agreement levels close to the leading single judge and showed a balanced pattern of positive and negative deviations from clinician medians. However, its consensus process occasionally lifted scores prematurely on individual attributes, whereas the single reasoning judge more often settled on discriminating mid-scale ratings. Among open-source models, mixture-of-experts systems started strong without extra training and showed limited gains with further optimisation, hinting at a ceiling effect. In contrast, smaller dense models benefited from preference-based methods but continued to exhibit polarised scoring and occasional formatting issues that complicate deployment. 

 

Operational considerations also differed. Closed-source systems ran in a compliant cloud environment, while open-source models were deployed on-premise with high-end accelerators. Training open-source judges required notable compute, especially for larger architectures, and multi-agent workflows increased inference load because multiple evaluators and rounds of discussion were involved. These factors matter for organisations weighing total cost of ownership alongside reliability. 

 

An automated medical LLM judge can evaluate clinical summaries with agreement comparable to physician reviewers while cutting review time and cost. Reasoning models were most consistent with clinicians on attributes that rely on synthesis and structured judgement, and the approach transferred to a related task without retraining. Multi-agent judging offered reviewer-like variability but did not exceed the best single judge and added complexity. For healthcare organisations assessing summarisation tools, incorporating an automated, rubric-driven judge can help prioritise expert review, standardise scoring across specialties and accelerate the safe adoption of summarisation in documentation workflows. 

 

Source: npj Digital Medicine 

Image Credit: iStock


References:

Croxford E, Gao Y, First E et al. (2025) Evaluating clinical AI summaries with large language models as judges. npj Digit Med, 8:640.


