Growing interest in deploying large language models (LLMs) for diagnosis, treatment planning and patient management has intensified the need to scrutinise how these systems reason, not just what answers they produce. A new benchmark and evaluation framework address persistent gaps by focusing on the transparency and fidelity of clinical reasoning at step level rather than relying on surface text similarity or headline accuracy. The work assembles expert-annotated rationales across diverse clinical domains and calibrates an automated judge against these references to mirror expert judgement while remaining scalable and efficient. The approach reveals where models reason well, where they fall short and why prediction accuracy alone can mislead.

 


 

Rigorous Benchmark and Evaluation Framework

The benchmark, MedThink-Bench, comprises 500 complex medical question–answer pairs spanning ten domains: Pathology; Disease Diagnosis; Treatment; Pharmacology; Diagnostic Workup; Public Health; Policy and Ethics; Prognosis; Anatomy and Physiology; and Discharge. Questions were drawn from public medical QA sources, then manually curated to emphasise multi-step reasoning. A team of medical experts produced fine-grained, stepwise rationales through consensus, yielding reference trajectories that reflect real clinical logic and enable evaluation that goes beyond lexical overlap.

 

Building on this resource, the LLM-w-Rationale framework operationalises a reference-based LLM-as-a-Judge. For each question, the judge model receives the full model-generated rationale and checks whether it supports each expert step. Scores reflect the proportion of expert steps adequately covered, accommodating differences in length and phrasing while preserving fidelity to expert logic. This step-level approach contrasts with reference-free judging that estimates required steps on the fly and with text-similarity metrics that cannot capture logical structure or factual soundness in clinical reasoning.
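The core scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are invented here, and the LLM judge is replaced by a toy substring check purely so the example runs standalone.

```python
from typing import Callable, List

def rationale_score(expert_steps: List[str],
                    model_rationale: str,
                    judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of expert reference steps that the judge
    deems supported by the model-generated rationale (illustrative names)."""
    if not expert_steps:
        return 0.0
    covered = sum(judge(step, model_rationale) for step in expert_steps)
    return covered / len(expert_steps)

def toy_judge(step: str, rationale: str) -> bool:
    # Toy stand-in for the LLM judge: naive substring containment.
    # The real judge would assess semantic support, not exact wording.
    return step.lower() in rationale.lower()

steps = ["fever and productive cough suggest pneumonia",
         "chest X-ray confirms consolidation",
         "start empirical antibiotics"]
rationale = ("Fever and productive cough suggest pneumonia; "
             "chest X-ray confirms consolidation.")
print(rationale_score(steps, rationale, toy_judge))  # 2 of 3 steps covered
```

Because the score is the proportion of expert steps covered, a verbose rationale is not rewarded for length, and a terse one is not penalised, provided each reference step is supported.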

 

Correlation, Discrimination and Robustness

Across twelve LLMs, reasoning scores from LLM-w-Rationale show strong alignment with expert evaluation, with Pearson correlations up to 0.87 and Kendall’s tau of 0.88 for model ranking. In contrast, commonly used metrics such as BLEU, ROUGE-L, METEOR, BLEURT and BERTScore display weak or inconsistent correlations with expert assessments. Visual comparisons of sample-level scores indicate that reference-based judging tracks expert scoring closely, whereas similarity metrics and reference-free judging diverge.
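The two agreement statistics reported above measure different things: Pearson correlation tracks linear agreement between automated and expert scores, while Kendall's tau checks whether the two evaluators rank models in the same order. A stdlib-only sketch with made-up per-model scores (the numbers below are illustrative, not the paper's data):

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / total pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += 1 if prod > 0 else -1 if prod < 0 else 0
    return s / (n * (n - 1) / 2)

# Hypothetical per-model reasoning scores: automated judge vs. expert panel.
auto   = [0.82, 0.74, 0.61, 0.55, 0.48]
expert = [0.85, 0.70, 0.65, 0.52, 0.50]
print(round(pearson(auto, expert), 3))
print(kendall_tau(auto, expert))  # identical rankings give tau = 1.0
```

A high tau with a lower Pearson value would mean the judge ranks models correctly even where its absolute scores drift from the experts', which is often sufficient for model selection.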

 

Discriminative validity was tested by stratifying samples into low, medium and high human-rated quality. LLM-w-Rationale consistently distinguishes among these strata with low p-values, while text-similarity metrics and reference-free judging often fail to separate quality tiers. Sensitivity analyses further demonstrate robustness. Performance remains stable across different judge models that follow instructions reliably and across prompt variants with similar semantics. Together, these findings indicate that grounding the judge in expert trajectories reduces bias, improves sensitivity to reasoning errors and stabilises outcomes across operational choices.

 

Implications For Model Choice and Efficiency

Benchmarking reveals domain-specific strengths and notable gaps. Reasoning performance can diverge from prediction accuracy, highlighting that correct multiple-choice answers may mask flawed logic and that incorrect answers can contain partially valid reasoning. Comparative results show that some smaller or open models can outperform larger proprietary systems in reasoning quality in certain domains, underscoring the value of domain-aware evaluation rather than relying on aggregate accuracy.

 

Efficiency analyses point to practical gains. On the same dataset, automated reference-based judging completes assessments far faster than human experts, requiring a small fraction of the total evaluation time while preserving high concordance. Error-detection analysis against expert ground truth yields strong precision, recall and F1 scores across models, indicating that the framework not only scores quality but also detects reasoning faults with consistency. Because MedThink-Bench compiles questions from public datasets, a contamination check was conducted. While some models show higher contamination rates for QA content, reasoning performance measured against expert rationales remains largely stable when evaluated on an uncontaminated subset, suggesting minimal influence of leakage on the reasoning assessment itself.
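The error-detection metrics mentioned above follow the standard definitions: precision is the share of judge-flagged errors that experts confirm, recall is the share of expert-annotated errors the judge catches, and F1 is their harmonic mean. A minimal sketch with hypothetical step identifiers:

```python
def precision_recall_f1(flagged: set, actual: set):
    """Compare judge-flagged reasoning errors against expert ground truth.
    Returns (precision, recall, F1); inputs are sets of step identifiers."""
    tp = len(flagged & actual)  # errors both the judge and experts marked
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

flagged = {"step2", "step4"}            # illustrative judge output
actual  = {"step2", "step4", "step5"}   # illustrative expert annotation
print(precision_recall_f1(flagged, actual))
```

High precision with lower recall, as in this toy case, would mean the judge's flags are trustworthy but it misses some expert-identified faults; the article reports strong values on all three measures.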

 

A step-level, reference-grounded evaluation anchored in expert reasoning provides a more faithful picture of medical capability than text similarity or accuracy alone. MedThink-Bench and LLM-w-Rationale enable scalable assessment that aligns with expert judgement, discriminates between quality levels and remains robust across judge models and prompts. For healthcare decision-makers, the findings support using rationale-based evaluation to select models for domain-specific tasks, to identify gaps in clinical logic and to prioritise safe, responsible integration into clinical workflows without relying solely on headline accuracy.

 

Source: npj Digital Medicine

Image Credit: iStock


References:

Zhou S, Xie W, Li J et al. (2025) Automating expert-level medical reasoning evaluation of large language models. npj Digit Med: In Press.



