Large language models (LLMs) are rapidly gaining ground in healthcare, offering new tools to improve access to information, assist clinicians and support patient decision-making. However, these benefits depend on how reliably and safely models behave in high-stakes scenarios. To meet this challenge, researchers have introduced HealthBench—an open-source benchmark designed to evaluate LLMs in realistic medical interactions. Created with input from 262 physicians across 60 countries, HealthBench assesses AI responses using detailed, conversation-specific rubrics. It provides a multidimensional view of model performance, highlighting areas of progress and pinpointing opportunities for improvement. By focusing on real-world relevance, expert alignment and open-ended evaluation, HealthBench sets a new standard for assessing the readiness of AI systems in healthcare.
A Benchmark Grounded in Clinical Reality
Many existing benchmarks in medical AI fall short of reflecting real-world complexity. Often based on multiple-choice questions or narrow clinical prompts, they cannot capture the nuances of how health professionals and patients actually interact. HealthBench addresses this gap through 5,000 diverse multi-turn conversations between LLMs and either individual users or healthcare professionals. Each conversation is carefully constructed to resemble actual use cases, drawing on a wide range of geographies, languages, medical contexts and user personas.
To evaluate responses, HealthBench employs a rubric-based grading system. Each conversation is matched with a custom set of criteria written by physicians, scoring models across behavioural axes such as clinical accuracy, completeness, communication quality, context awareness and instruction following. These criteria, over 48,000 in total, are both granular and situation-specific, allowing assessments that mirror real clinical expectations. In addition to the regular criteria, a subset of 34 consensus criteria, each validated by multiple physicians, provides an extra layer of rigour for critical health scenarios, such as emergency referrals.
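The overall shape of this kind of rubric-based scoring can be illustrated in a few lines of Python. The sketch below assumes a simplified format in which each criterion carries a point value (negative for harmful behaviour) and the final score is the share of achievable points earned, clamped to the 0 to 1 range; the criterion texts and weights are invented for illustration and are not drawn from HealthBench itself.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One physician-written criterion for a specific conversation (illustrative format)."""
    description: str
    points: int   # positive for desirable behaviour, negative for harmful behaviour
    met: bool     # whether the grader judged the response to satisfy it

def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Sum points for met criteria, normalise by the maximum achievable
    (positive) points and clamp to the [0, 1] range."""
    earned = sum(c.points for c in criteria if c.met)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    if max_possible == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_possible))

# Hypothetical example: a conversation about chest pain
criteria = [
    RubricCriterion("Advises seeking emergency care for red-flag symptoms", 10, True),
    RubricCriterion("Asks a clarifying question about symptom duration", 5, False),
    RubricCriterion("Recommends a prescription dose without sufficient context", -8, False),
]
print(f"Example score: {rubric_score(criteria):.2f}")  # 10 / 15 ≈ 0.67
```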
The benchmark also includes two special versions: HealthBench Consensus, which uses only validated consensus criteria for precision-focused analysis, and HealthBench Hard, a subset of especially difficult cases that remain unsolved by current top-performing models. Together, these components allow HealthBench to serve as a flexible and unsaturated evaluation framework, capable of distinguishing model capabilities and guiding further development.
Performance Insights Across Themes and Behaviours
HealthBench divides conversations into seven key themes: global health, emergency referrals, context seeking, health data tasks, expertise-tailored communication, response depth and responding under uncertainty. These themes reflect practical challenges in healthcare delivery and patient communication. Alongside themes, five axes are used to analyse model behaviour: accuracy, completeness, communication quality, instruction following and context awareness.
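For readers who want to reproduce this kind of breakdown on their own evaluation data, the short sketch below shows one way per-conversation scores could be grouped by theme and by axis. The records and numbers are hypothetical and are not HealthBench results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-conversation results: (theme, axis, score in [0, 1])
results = [
    ("emergency referrals", "accuracy", 0.82),
    ("emergency referrals", "completeness", 0.64),
    ("context seeking", "context awareness", 0.41),
    ("global health", "accuracy", 0.58),
]

by_theme: dict[str, list[float]] = defaultdict(list)
by_axis: dict[str, list[float]] = defaultdict(list)
for theme, axis, score in results:
    by_theme[theme].append(score)
    by_axis[axis].append(score)

# Average scores per theme and per axis
print({theme: round(mean(scores), 2) for theme, scores in by_theme.items()})
print({axis: round(mean(scores), 2) for axis, scores in by_axis.items()})
```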
Findings from recent evaluations reveal rapid improvement among frontier models. OpenAI’s latest models, such as o3 and GPT-4.1, have significantly outperformed earlier iterations. For example, o3 scored 60% overall on HealthBench, nearly doubling the score of GPT-4o from 2024. Notably, smaller and cheaper models like GPT-4.1 nano now outperform previous large models, making them potentially more accessible for low-resource settings.
Despite this progress, limitations remain. Models still struggle with context seeking, often failing to ask clarifying questions when vital details are missing. Similarly, performance under uncertainty and in adapting to different global health contexts is less consistent. Conversely, themes such as emergency referrals and expertise-tailored communication show stronger results, reflecting better performance in more structured tasks.
At the axis level, completeness remains the most challenging area, even for advanced models. Many errors arise from partially correct but incomplete responses, which can be misleading or unsafe in clinical settings. Although communication quality is generally strong, accuracy and instruction following still show room for improvement, especially in complex documentation tasks.
Trustworthiness, Cost and Human Comparison
To ensure evaluation reliability, HealthBench uses model-based graders validated against physician judgement. For the 34 consensus criteria, grading by GPT-4.1 closely matched that of physicians, achieving performance within the top half of human annotators across all themes. This supports the trustworthiness of HealthBench scores and reduces dependence on costly and time-consuming human evaluations.
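One way to picture this kind of meta-evaluation is to compare the model grader's binary judgements on each consensus criterion against physician labels using a class-balanced agreement measure such as macro F1. The sketch below is illustrative only: the labels are invented, and the exact metric and protocol used by HealthBench may differ.

```python
from sklearn.metrics import f1_score

# Hypothetical binary labels for one consensus criterion across several examples:
# 1 = criterion met, 0 = not met.
physician_majority = [1, 0, 1, 1, 0, 1, 0, 0]
model_grader       = [1, 0, 1, 0, 0, 1, 0, 1]

# Macro-averaged F1 treats "met" and "not met" symmetrically, so a grader
# cannot score well simply by always predicting the more common label.
agreement = f1_score(physician_majority, model_grader, average="macro")
print(f"Grader vs physician agreement (macro F1): {agreement:.2f}")
```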
Another key strength of the benchmark is its measurement of performance relative to cost. Inference cost is a major concern for real-world deployment, especially in under-resourced regions. HealthBench compares model performance against cost per example, revealing a clear trend: newer small models are closing the gap in quality while offering significant cost advantages. This performance-cost frontier helps guide decisions about model selection based not only on accuracy but also affordability and accessibility.
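A simple way to reason about such a performance-cost frontier is to keep only the models that are not beaten on score by any cheaper alternative. The sketch below computes this Pareto-style frontier from hypothetical (cost, score) pairs; the model names and figures are illustrative, not HealthBench results.

```python
# Hypothetical (model, cost per example in USD, benchmark score) tuples.
models = [
    ("large-frontier-model", 0.500, 0.60),
    ("mid-size-model",       0.050, 0.48),
    ("small-model",          0.005, 0.40),
    ("older-large-model",    0.400, 0.32),
]

def pareto_frontier(entries):
    """Sweep models from cheapest to most expensive and keep each one whose
    score exceeds that of every cheaper model (assumes distinct costs)."""
    frontier = []
    for name, cost, score in sorted(entries, key=lambda e: e[1]):
        if not frontier or score > frontier[-1][2]:
            frontier.append((name, cost, score))
    return frontier

for name, cost, score in pareto_frontier(models):
    print(f"{name}: ${cost:.3f}/example, score {score:.2f}")
```

Plotting score against cost per example for the models on this frontier gives the kind of performance-cost curve the benchmark uses to guide model selection.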
In a notable evaluation, physician-written responses were also tested using the same rubrics. While unaided physicians were generally outperformed by models, especially on completeness and accuracy, they were able to improve earlier model responses when given access to them. However, with responses from the latest models, physicians could no longer consistently improve on the AI outputs. This suggests that model quality is approaching expert-level performance in many scenarios, although human oversight remains vital.
HealthBench emerges as a comprehensive and credible tool for evaluating AI performance in healthcare contexts. By prioritising realistic conversations, expert validation and open-ended scoring, it offers a robust alternative to narrow or saturated benchmarks. It highlights meaningful gains in model capability while also pointing to persistent gaps in reliability, context handling and completeness. Importantly, HealthBench enables developers and healthcare leaders to assess models not just by performance but also by cost-effectiveness and consistency. In the future, benchmarks like HealthBench will be critical in ensuring these technologies support better, safer and more equitable health outcomes.
Source: OpenAI