Large language models (LLMs) are increasingly explored in mental health where speech and narrative often mirror symptom burden. A German BERT-based model was fine-tuned to estimate item-level scores on the Montgomery-Åsberg Depression Rating Scale (MADRS) using language alone from structured interviews. The approach predicted continuous scores for nine items and benchmarked performance against a mean regression baseline and the unfine-tuned base model. Reported mean absolute error (MAE) ranged from 0.7 to 1.0 across items, with flexible accuracies of 79–88% within a ±1 tolerance and a 75.38% reduction in errors relative to the base model. These findings suggest a lightweight, task-adapted LLM can approximate clinician ratings and support symptom assessment and monitoring, especially where standardised digital tools can extend access.
Item-Level Predictions from Clinical and Synthetic Interviews
Interviews combined real clinical material with synthetic data to mitigate score imbalance. In total, 126 interviews were prepared, including 65 patient transcripts and 61 synthetic interviews, yielding 1,242 item-level samples across nine items. The first MADRS item, Apparent Sadness, was excluded because it requires nonverbal cues. Audio from videotaped sessions underwent speaker diarisation and automatic transcription, followed by manual review to ensure accuracy. Each interview was segmented into item-specific text units aligned to clinician-assigned scores from 0 to 6, creating a direct mapping between symptom language and severity labels.
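As a concrete illustration of this mapping, the sketch below shows one way the item-level samples could be represented; the class and field names are assumptions for illustration, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ItemSample:
    """One item-level training example: transcript segment plus clinician score."""
    interview_id: str
    item_name: str   # e.g. "Inner Tension" (Apparent Sadness is excluded)
    text: str        # item-specific transcript segment
    score: float     # clinician-assigned rating on the 0-6 MADRS scale

def segment_interview(interview_id, item_segments, item_scores):
    """Pair each item-specific segment with its clinician score; skip unrated items."""
    return [
        ItemSample(interview_id, item, text, float(item_scores[item]))
        for item, text in item_segments.items()
        if item in item_scores
    ]
```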
The architecture used a shared BERT encoder with nine separate regression heads, one per MADRS item, to reflect symptom-specific scoring. Fivefold cross-validation was applied at the item level. Evaluation considered MAE, strict accuracy after rounding predictions and a flexible accuracy criterion within ±1 of the true label to reflect clinically acceptable deviations. The fine-tuned model outperformed a mean regression baseline that predicted per-item averages and surpassed the unadapted base model, which defaulted to predicting zero and failed to differentiate severity.
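A minimal PyTorch sketch of this architecture is shown below, assuming the publicly available bert-base-german-cased checkpoint and simple [CLS] pooling; the head routing and other details are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MadrsItemRegressor(nn.Module):
    """Shared BERT encoder with one linear regression head per MADRS item."""

    def __init__(self, model_name="bert-base-german-cased", n_items=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_items)])

    def forward(self, input_ids, attention_mask, item_idx):
        # Encode the item-specific segment and pool via the [CLS] token.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        # Route each sample through the head matching its MADRS item index
        # (item_idx is a list of Python ints, one per sample in the batch).
        preds = [self.heads[i](cls[j]) for j, i in enumerate(item_idx)]
        return torch.cat(preds)   # continuous severity estimates, one per sample
```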
Performance varied by item but remained consistently strong. Under the flexible criterion, accuracies were 79–88%, and MAE ranged from 0.7 for Inner Tension to 1.0 for Emotional Numbness. Compared with the baseline predictor, the average MAE reduction was 0.9 points. Error analysis showed the fewest flexible errors for Inner Tension and the most for Loss of Appetite; under strict exact-match criteria, Inner Tension again showed the fewest errors, while Emotional Numbness showed the most. Confusion matrices displayed a clear diagonal pattern, indicating that adjacent severity levels could be separated from language features alone.
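For clarity, the strict and flexible criteria could be computed roughly as follows; this is a sketch that assumes predictions are rounded and clipped to the 0–6 range, and the authors' exact formulation may differ.

```python
import numpy as np

def evaluate_item(y_true, y_pred):
    """Per-item MAE, strict exact-match accuracy and flexible (+/-1) accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rounded = np.clip(np.rint(y_pred), 0, 6)   # map continuous outputs to 0-6 scores
    return {
        "mae": float(np.mean(np.abs(y_true - y_pred))),
        "strict_acc": float(np.mean(rounded == y_true)),
        "flexible_acc": float(np.mean(np.abs(rounded - y_true) <= 1)),
    }

# A prediction of 3.4 for a true score of 3 is correct under both criteria;
# 4.2 (rounded to 4) counts only under the flexible +/-1 criterion.
print(evaluate_item([3, 3], [3.4, 4.2]))
```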
Performance Gains, Learning Dynamics and Evaluation Choices
Fine-tuning delivered marked gains over the unfine-tuned base model. Using the flexible evaluation, overall errors fell by 75.38%; under strict exact-score matching, errors declined by 30.29%. These improvements underscore the value of adapting an LLM to clinical language with item-level supervision. The flexible criterion captured clinically acceptable deviations that mirror known variability among trained raters, making near-correct predictions easier to interpret in practice.
Learning-curve analysis examined how performance scaled with data. Flexible accuracy rose quickly as training fractions increased, with notable gains up to roughly 50–80% of the available data, then plateaued. This indicates early efficiency in learning the linguistic cues relevant to MADRS scoring, with diminishing returns thereafter. An analysis restricted to real interviews assessed generalisability without synthetic augmentation and complemented the main analysis, which prioritised balanced score distributions. Together, these results point to a model that rapidly captures core patterns yet may require larger datasets, additional modalities or architectural changes for further advances.
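An illustrative version of such a learning-curve experiment might look like the sketch below; it is not the study's code, and `train_and_eval` is a hypothetical callable standing in for the full fine-tuning and evaluation pipeline.

```python
import numpy as np

def learning_curve(samples, train_and_eval,
                   fractions=(0.2, 0.35, 0.5, 0.65, 0.8, 1.0), seed=0):
    """Train on growing fractions of the data and record a metric for each."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))          # fixed shuffle so subsets are nested
    results = {}
    for frac in fractions:
        n = max(1, int(frac * len(samples)))
        subset = [samples[i] for i in order[:n]]   # growing training subset
        results[frac] = train_and_eval(subset)     # e.g. flexible accuracy on a held-out split
    return results
```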
Methodologically, the pipeline established a repeatable path from audio to item-level text with verified transcripts, then to model fine-tuning and evaluation. Accuracy and MAE were computed per item and averaged across folds, while confusion matrices aggregated predictions for each item across splits. Comparisons to the mean regression baseline contextualised absolute error reductions, and zero-shot evaluation of the base model confirmed that task-specific adaptation was essential for item-level severity discrimination.
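A rough sketch of that aggregation step is given below, assuming rounded predictions and 7×7 confusion matrices for the 0–6 score range; the data layout is hypothetical.

```python
import numpy as np
from collections import defaultdict

def aggregate_folds(fold_results):
    """fold_results: list of per-fold dicts mapping item name -> (y_true, y_pred)."""
    maes = defaultdict(list)
    confusions = defaultdict(lambda: np.zeros((7, 7), dtype=int))  # rows: true, cols: predicted
    for fold in fold_results:
        for item, (y_true, y_pred) in fold.items():
            y_true = np.asarray(y_true, dtype=int)
            y_pred = np.asarray(y_pred, dtype=float)
            rounded = np.clip(np.rint(y_pred), 0, 6).astype(int)
            maes[item].append(float(np.mean(np.abs(y_true - y_pred))))
            for t, p in zip(y_true, rounded):
                confusions[item][t, p] += 1        # accumulate predictions across splits
    mean_mae = {item: float(np.mean(v)) for item, v in maes.items()}
    return mean_mae, dict(confusions)
```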
Implementation Scope, Constraints and Use Considerations
Interviews were conducted in German or Swiss German using standard MADRS prompts, with open narratives rated by trained clinicians. Ratings were assigned independently and discussed; if disagreements exceeded defined thresholds, consensus ratings were used for training and evaluation. Apparent Sadness was excluded because nonverbal information lies outside the language-only scope. The data pipeline included diarisation, transcription with a large-vocabulary system suited to Swiss German, manual correction and segmentation by item. Fine-tuning used a BERT-base-German-cased encoder with nine linear regression heads trained end-to-end using mean squared error loss within a fivefold regime.
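A condensed sketch of that training setup follows; the optimiser, learning rate and batch layout are assumptions, and `MadrsItemRegressor` refers to the architecture sketch earlier in this piece.

```python
import torch
from torch.utils.data import DataLoader

def fine_tune(model, loader: DataLoader, epochs=3, lr=2e-5, device="cpu"):
    """End-to-end fine-tuning of the shared encoder and item heads with MSE loss."""
    model.to(device).train()
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:   # batch: tokenised text, item indices, clinician scores
            optimiser.zero_grad()
            preds = model(batch["input_ids"].to(device),
                          batch["attention_mask"].to(device),
                          batch["item_idx"].tolist())
            loss = loss_fn(preds, batch["score"].float().to(device))
            loss.backward()
            optimiser.step()
    return model
```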
Several constraints inform interpretation and deployment. Repeated interviews from the same participants were treated as independent samples and were not grouped during fold assignment, which may introduce limited dependence, although the sessions occurred weeks apart and reflected different clinical states. Individual rater scores before consensus were not consistently stored, preventing formal inter-rater reliability reporting. Synthetic interviews supported class balance but may not fully capture the variability of clinical language and could lower baseline difficulty.
Reliance on language alone excludes nonverbal cues that influence certain items, and the flexible ±1 criterion, though clinically motivated, can still mask meaningful misclassification in specific contexts. The encoder's pretraining data may bias it towards westernised populations, and performance was not stratified by demographic factors. The plateau in the learning curve suggests that future gains may rely on larger datasets, multimodal inputs or revised architectures.
Despite these limitations, the item-level design aligns outputs directly with a widely used clinical instrument, aiding interpretability for longitudinal tracking, research workflows and routine monitoring. Model transparency at the item level can support targeted follow-up when specific domains worsen, and consistent mapping to standard scores may simplify integration into existing documentation practices.
A fine-tuned, German BERT-based model produced item-level MADRS severity estimates from language alone, achieving MAE between 0.7 and 1.0, flexible accuracies of 79–88%, and substantially fewer errors than unfine-tuned baselines. The outputs mirror clinician practice at the level of distinct symptom domains and offer standardised, interpretable scores for repeated assessment. While constraints include synthetic data use, lack of nonverbal cues and limited rater-level metrics, the results indicate a feasible, clinician-aligned route for symptom assessment and monitoring. The approach suggests practical support for decision-making and follow-up, with further improvements likely from broader datasets, multimodal integration and validation across diverse settings.
Source: npj Digital Medicine