Large language models (LLMs) are rapidly gaining ground in healthcare, offering new tools to improve access to information, assist clinicians and support patient decision-making. However, these benefits depend on how reliably and safely models behave in high-stakes scenarios. To meet this challenge, researchers have introduced HealthBench—an open-source benchmark designed to evaluate LLMs in realistic medical interactions. Created with input from 262 physicians across 60 countries, HealthBench assesses AI responses using detailed, conversation-specific rubrics. It provides a multidimensional view of model performance, highlighting areas of progress and pinpointing opportunities for improvement. By focusing on real-world relevance, expert alignment and open-ended evaluation, HealthBench sets a new standard for assessing the readiness of AI systems in healthcare.
A Benchmark Grounded in Clinical Reality
Many existing benchmarks in medical AI fall short of reflecting real-world complexity. Often based on multiple-choice questions or narrow clinical prompts, they cannot capture the nuances of how health professionals and patients actually interact. HealthBench addresses this gap through 5,000 diverse multi-turn conversations between LLMs and either individual users or healthcare professionals. Each conversation is carefully constructed to resemble actual use cases, drawing on a wide range of geographies, languages, medical contexts and user personas.
To evaluate responses, HealthBench employs a rubric-based grading system. Each conversation is matched with a custom set of criteria written by physicians, scoring models across behavioural axes such as clinical accuracy, completeness, communication quality, context awareness and instruction following. These criteria, over 48,000 in total, are both granular and situation-specific, allowing assessments that mirror real clinical expectations. In addition to the regular criteria, a subset of 34 consensus criteria, each validated by multiple physicians, provides an extra layer of rigour for critical health scenarios, such as emergency referrals.
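The overall shape of this kind of rubric-based scoring can be illustrated in a few lines of Python. The sketch below assumes a simplified format in which each criterion carries a point value (negative for harmful behaviour) and the final score is the share of achievable points earned, clamped to the 0 to 1 range; the criterion texts and weights are invented for illustration and are not drawn from HealthBench itself.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One physician-written criterion for a specific conversation (illustrative format)."""
    description: str
    points: int   # positive for desirable behaviour, negative for harmful behaviour
    met: bool     # whether the grader judged the response to satisfy it

def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Sum points for met criteria, normalise by the maximum achievable
    (positive) points and clamp to the [0, 1] range."""
    earned = sum(c.points for c in criteria if c.met)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    if max_possible == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_possible))

# Hypothetical example: a conversation about chest pain
criteria = [
    RubricCriterion("Advises seeking emergency care for red-flag symptoms", 10, True),
    RubricCriterion("Asks a clarifying question about symptom duration", 5, False),
    RubricCriterion("Recommends a prescription dose without sufficient context", -8, False),
]
print(f"Example score: {rubric_score(criteria):.2f}")  # 10 / 15 ≈ 0.67
```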
The benchmark also includes two special versions: HealthBench Consensus, which uses only validated consensus criteria for precision-focused analysis, and HealthBench Hard, a subset of especially difficult cases that remain unsolved by current top-performing models. Together, these components allow HealthBench to serve as a flexible and unsaturated evaluation framework, capable of distinguishing model capabilities and guiding further development.
Performance Insights Across Themes and Behaviours
HealthBench divides conversations into seven key themes: global health, emergency referrals, context seeking, health data tasks, expertise-tailored communication, response depth and responding under uncertainty. These themes reflect practical challenges in healthcare delivery and patient communication. Alongside themes, five axes are used to analyse model behaviour: accuracy, completeness, communication quality, instruction following and context awareness.
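For readers who want to reproduce this kind of breakdown on their own evaluation data, the short sketch below shows one way per-conversation scores could be grouped by theme and by axis. The records and numbers are hypothetical and are not HealthBench results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-conversation results: (theme, axis, score in [0, 1])
results = [
    ("emergency referrals", "accuracy", 0.82),
    ("emergency referrals", "completeness", 0.64),
    ("context seeking", "context awareness", 0.41),
    ("global health", "accuracy", 0.58),
]

by_theme: dict[str, list[float]] = defaultdict(list)
by_axis: dict[str, list[float]] = defaultdict(list)
for theme, axis, score in results:
    by_theme[theme].append(score)
    by_axis[axis].append(score)

# Average scores per theme and per axis
print({theme: round(mean(scores), 2) for theme, scores in by_theme.items()})
print({axis: round(mean(scores), 2) for axis, scores in by_axis.items()})
```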
Findings from recent evaluations reveal rapid improvement among frontier models. OpenAI’s latest models, such as o3 and GPT-4.1, have significantly outperformed earlier iterations. For example, o3 scored 60% overall on HealthBench, nearly doubling the score of GPT-4o from 2024. Notably, smaller and cheaper models like GPT-4.1 nano now outperform previous large models, making them potentially more accessible for low-resource settings.
Despite this progress, limitations remain. Models still struggle with context seeking, often failing to ask clarifying questions when vital details are missing. Similarly, performance under uncertainty and in adapting to different global health contexts is less consistent. Conversely, themes such as emergency referrals and expertise-tailored communication show stronger results, reflecting better performance in more structured tasks.
At the axis level, completeness remains the most challenging area, even for advanced models. Many errors arise from partially correct but incomplete responses, which can be misleading or unsafe in clinical settings. Although communication quality is generally strong, accuracy and instruction following still show room for improvement, especially in complex documentation tasks.
Trustworthiness, Cost and Human Comparison
To ensure evaluation reliability, HealthBench uses model-based graders validated against physician judgement. For the 34 consensus criteria, grading by GPT-4.1 closely matched that of physicians, achieving performance within the top half of human annotators across all themes. This supports the trustworthiness of HealthBench scores and reduces dependence on costly and time-consuming human evaluations.
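One way to picture this kind of meta-evaluation is to compare the model grader's binary judgements on each consensus criterion against physician labels using a class-balanced agreement measure such as macro F1. The sketch below is illustrative only: the labels are invented, and the exact metric and protocol used by HealthBench may differ.

```python
from sklearn.metrics import f1_score

# Hypothetical binary labels for one consensus criterion across several examples:
# 1 = criterion met, 0 = not met.
physician_majority = [1, 0, 1, 1, 0, 1, 0, 0]
model_grader       = [1, 0, 1, 0, 0, 1, 0, 1]

# Macro-averaged F1 treats "met" and "not met" symmetrically, so a grader
# cannot score well simply by always predicting the more common label.
agreement = f1_score(physician_majority, model_grader, average="macro")
print(f"Grader vs physician agreement (macro F1): {agreement:.2f}")
```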
Another key strength of the benchmark is its measurement of performance relative to cost. Inference cost is a major concern for real-world deployment, especially in under-resourced regions. HealthBench compares model performance against cost per example, revealing a clear trend: newer small models are closing the gap in quality while offering significant cost advantages. This performance-cost frontier helps guide decisions about model selection based not only on accuracy but also affordability and accessibility.
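A simple way to reason about such a performance-cost frontier is to keep only the models that are not beaten on score by any cheaper alternative. The sketch below computes this Pareto-style frontier from hypothetical (cost, score) pairs; the model names and figures are illustrative, not HealthBench results.

```python
# Hypothetical (model, cost per example in USD, benchmark score) tuples.
models = [
    ("large-frontier-model", 0.500, 0.60),
    ("mid-size-model",       0.050, 0.48),
    ("small-model",          0.005, 0.40),
    ("older-large-model",    0.400, 0.32),
]

def pareto_frontier(entries):
    """Sweep models from cheapest to most expensive and keep each one whose
    score exceeds that of every cheaper model (assumes distinct costs)."""
    frontier = []
    for name, cost, score in sorted(entries, key=lambda e: e[1]):
        if not frontier or score > frontier[-1][2]:
            frontier.append((name, cost, score))
    return frontier

for name, cost, score in pareto_frontier(models):
    print(f"{name}: ${cost:.3f}/example, score {score:.2f}")
```

Plotting score against cost per example for the models on this frontier gives the kind of performance-cost curve the benchmark uses to guide model selection.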
In a notable evaluation, physician-written responses were also tested using the same rubrics. While unaided physicians were generally outperformed by models, especially on completeness and accuracy, they were able to improve earlier model responses when given access to them. However, with responses from the latest models, physicians could no longer consistently improve on the AI outputs. This suggests that model quality is approaching expert-level performance in many scenarios, although human oversight remains vital.
HealthBench emerges as a comprehensive and credible tool for evaluating AI performance in healthcare contexts. By prioritising realistic conversations, expert validation and open-ended scoring, it offers a robust alternative to narrow or saturated benchmarks. It highlights meaningful gains in model capability while also pointing to persistent gaps in reliability, context handling and completeness. Importantly, HealthBench enables developers and healthcare leaders to assess models not just by performance but also by cost-effectiveness and consistency. In the future, benchmarks like HealthBench will be critical in ensuring these technologies support better, safer and more equitable health outcomes.
Source: OpenAI