Large language models (LLMs) are increasingly explored in mental health where speech and narrative often mirror symptom burden. A German BERT-based model was fine-tuned to estimate item-level scores on the Montgomery-Åsberg Depression Rating Scale (MADRS) using language alone from structured interviews. The approach predicted continuous scores for nine items and benchmarked performance against a mean regression baseline and the unfine-tuned base model. Reported mean absolute error (MAE) ranged from 0.7 to 1.0 across items, with flexible accuracies of 79–88% within a ±1 tolerance and a 75.38% reduction in errors relative to the base model. These findings suggest a lightweight, task-adapted LLM can approximate clinician ratings and support symptom assessment and monitoring, especially where standardised digital tools can extend access.
Item-Level Predictions from Clinical and Synthetic Interviews
Interviews combined real clinical material with synthetic data to mitigate score imbalance. In total, 126 interviews were prepared, including 65 patient transcripts and 61 synthetic interviews, yielding 1,242 item-level samples across nine items. The first MADRS item, Apparent Sadness, was excluded because it requires nonverbal cues. Audio from videotaped sessions underwent speaker diarisation and automatic transcription, followed by manual review to ensure accuracy. Each interview was segmented into item-specific text units aligned to clinician-assigned scores from 0 to 6, creating a direct mapping between symptom language and severity labels.
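As a concrete illustration of this mapping, the sketch below shows one way the item-level samples could be represented; the class and field names are assumptions for illustration, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ItemSample:
    """One item-level training example: transcript segment plus clinician score."""
    interview_id: str
    item_name: str   # e.g. "Inner Tension" (Apparent Sadness is excluded)
    text: str        # item-specific transcript segment
    score: float     # clinician-assigned rating on the 0-6 MADRS scale

def segment_interview(interview_id, item_segments, item_scores):
    """Pair each item-specific segment with its clinician score; skip unrated items."""
    return [
        ItemSample(interview_id, item, text, float(item_scores[item]))
        for item, text in item_segments.items()
        if item in item_scores
    ]
```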
The architecture used a shared BERT encoder with nine separate regression heads, one per MADRS item, to reflect symptom-specific scoring. Fivefold cross-validation was applied at the item level. Evaluation considered MAE, strict accuracy after rounding predictions and a flexible accuracy criterion within ±1 of the true label to reflect clinically acceptable deviations. The fine-tuned model outperformed a mean regression baseline that predicted per-item averages and surpassed the unadapted base model, which defaulted to predicting zero and failed to differentiate severity.
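A minimal PyTorch sketch of this architecture is shown below, assuming the publicly available bert-base-german-cased checkpoint and simple [CLS] pooling; the head routing and other details are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MadrsItemRegressor(nn.Module):
    """Shared BERT encoder with one linear regression head per MADRS item."""

    def __init__(self, model_name="bert-base-german-cased", n_items=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_items)])

    def forward(self, input_ids, attention_mask, item_idx):
        # Encode the item-specific segment and pool via the [CLS] token.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        # Route each sample through the head matching its MADRS item index
        # (item_idx is a list of Python ints, one per sample in the batch).
        preds = [self.heads[i](cls[j]) for j, i in enumerate(item_idx)]
        return torch.cat(preds)   # continuous severity estimates, one per sample
```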
Performance varied by item but remained consistently strong. Under the flexible criterion, accuracies were 79–88%, and MAE ranged from 0.7 for Inner Tension to 1.0 for Emotional Numbness. Compared with the baseline predictor, the average MAE reduction was 0.9 points. Error analysis showed the fewest flexible errors for Inner Tension and the most for Loss of Appetite; under strict exact-match criteria, Inner Tension again showed the fewest errors, while Emotional Numbness showed the most. Confusion matrices displayed a clear diagonal pattern, indicating that adjacent severity levels could be separated from language features alone.
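For clarity, the strict and flexible criteria could be computed roughly as follows; this is a sketch that assumes predictions are rounded and clipped to the 0–6 range, and the authors' exact formulation may differ.

```python
import numpy as np

def evaluate_item(y_true, y_pred):
    """Per-item MAE, strict exact-match accuracy and flexible (+/-1) accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rounded = np.clip(np.rint(y_pred), 0, 6)   # map continuous outputs to 0-6 scores
    return {
        "mae": float(np.mean(np.abs(y_true - y_pred))),
        "strict_acc": float(np.mean(rounded == y_true)),
        "flexible_acc": float(np.mean(np.abs(rounded - y_true) <= 1)),
    }

# A prediction of 3.4 for a true score of 3 is correct under both criteria;
# 4.2 (rounded to 4) counts only under the flexible +/-1 criterion.
print(evaluate_item([3, 3], [3.4, 4.2]))
```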
Performance Gains, Learning Dynamics and Evaluation Choices
Fine-tuning delivered marked gains over the unfine-tuned base model. Using the flexible evaluation, overall errors fell by 75.38%; under strict exact-score matching, errors declined by 30.29%. These improvements underscore the value of adapting an LLM to clinical language with item-level supervision. The flexible criterion captured clinically acceptable deviations that mirror known variability among trained raters, making near-correct predictions easier to interpret in practice.
Learning-curve analysis examined how performance scaled with data. Flexible accuracy rose quickly as training fractions increased, with notable gains up to roughly 50–80% of the available data, then plateaued. This indicates early efficiency in learning the linguistic cues relevant to MADRS scoring, with diminishing returns thereafter. An analysis restricted to real interviews assessed generalisability without synthetic augmentation and complemented the main analysis, which prioritised balanced score distributions. Together, these results point to a model that rapidly captures core patterns yet may require larger datasets, additional modalities or architectural changes for further advances.
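An illustrative version of such a learning-curve experiment might look like the sketch below; it is not the study's code, and `train_and_eval` is a hypothetical callable standing in for the full fine-tuning and evaluation pipeline.

```python
import numpy as np

def learning_curve(samples, train_and_eval,
                   fractions=(0.2, 0.35, 0.5, 0.65, 0.8, 1.0), seed=0):
    """Train on growing fractions of the data and record a metric for each."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))          # fixed shuffle so subsets are nested
    results = {}
    for frac in fractions:
        n = max(1, int(frac * len(samples)))
        subset = [samples[i] for i in order[:n]]   # growing training subset
        results[frac] = train_and_eval(subset)     # e.g. flexible accuracy on a held-out split
    return results
```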
Methodologically, the pipeline established a repeatable path from audio to item-level text with verified transcripts, then to model fine-tuning and evaluation. Accuracy and MAE were computed per item and averaged across folds, while confusion matrices aggregated predictions for each item across splits. Comparisons to the mean regression baseline contextualised absolute error reductions, and zero-shot evaluation of the base model confirmed that task-specific adaptation was essential for item-level severity discrimination.
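A rough sketch of that aggregation step is given below, assuming rounded predictions and 7×7 confusion matrices for the 0–6 score range; the data layout is hypothetical.

```python
import numpy as np
from collections import defaultdict

def aggregate_folds(fold_results):
    """fold_results: list of per-fold dicts mapping item name -> (y_true, y_pred)."""
    maes = defaultdict(list)
    confusions = defaultdict(lambda: np.zeros((7, 7), dtype=int))  # rows: true, cols: predicted
    for fold in fold_results:
        for item, (y_true, y_pred) in fold.items():
            y_true = np.asarray(y_true, dtype=int)
            y_pred = np.asarray(y_pred, dtype=float)
            rounded = np.clip(np.rint(y_pred), 0, 6).astype(int)
            maes[item].append(float(np.mean(np.abs(y_true - y_pred))))
            for t, p in zip(y_true, rounded):
                confusions[item][t, p] += 1        # accumulate predictions across splits
    mean_mae = {item: float(np.mean(v)) for item, v in maes.items()}
    return mean_mae, dict(confusions)
```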
Implementation Scope, Constraints and Use Considerations
Interviews were conducted in German or Swiss German using standard MADRS prompts, with open narratives rated by trained clinicians. Ratings were assigned independently and discussed; if disagreements exceeded defined thresholds, consensus ratings were used for training and evaluation. Apparent Sadness was excluded because nonverbal information lies outside the language-only scope. The data pipeline included diarisation, transcription with a large-vocabulary system suited to Swiss German, manual correction and segmentation by item. Fine-tuning used a BERT-base-German-cased encoder with nine linear regression heads trained end-to-end using mean squared error loss within a fivefold regime.
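A condensed sketch of that training setup follows; the optimiser, learning rate and batch layout are assumptions, and `MadrsItemRegressor` refers to the architecture sketch earlier in this piece.

```python
import torch
from torch.utils.data import DataLoader

def fine_tune(model, loader: DataLoader, epochs=3, lr=2e-5, device="cpu"):
    """End-to-end fine-tuning of the shared encoder and item heads with MSE loss."""
    model.to(device).train()
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:   # batch: tokenised text, item indices, clinician scores
            optimiser.zero_grad()
            preds = model(batch["input_ids"].to(device),
                          batch["attention_mask"].to(device),
                          batch["item_idx"].tolist())
            loss = loss_fn(preds, batch["score"].float().to(device))
            loss.backward()
            optimiser.step()
    return model
```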
Several constraints inform interpretation and deployment. Repeated interviews from the same participants were treated as independent samples and were not grouped during fold assignment, which may introduce limited dependence, although the sessions occurred weeks apart and reflected different clinical states. Individual rater scores before consensus were not consistently stored, preventing formal inter-rater reliability reporting. Synthetic interviews supported class balance but may not fully capture the variability of clinical language and could lower baseline difficulty.
Reliance on language alone excludes nonverbal cues that influence certain items, and the flexible ±1 criterion, though clinically motivated, can still mask meaningful misclassification in specific contexts. The encoder's pretraining data may bias it towards westernised populations, and performance was not stratified by demographic factors. The plateau in the learning curve suggests that future gains may rely on larger datasets, multimodal inputs or revised architectures.
Despite these limitations, the item-level design aligns outputs directly with a widely used clinical instrument, aiding interpretability for longitudinal tracking, research workflows and routine monitoring. Model transparency at the item level can support targeted follow-up when specific domains worsen, and consistent mapping to standard scores may simplify integration into existing documentation practices.
A fine-tuned, German BERT-based model produced item-level MADRS severity estimates from language alone, achieving MAE between 0.7 and 1.0, flexible accuracies of 79–88%, and substantially fewer errors than unfine-tuned baselines. The outputs mirror clinician practice at the level of distinct symptom domains and offer standardised, interpretable scores for repeated assessment. While constraints include synthetic data use, lack of nonverbal cues and limited rater-level metrics, the results indicate a feasible, clinician-aligned route for symptom assessment and monitoring. The approach suggests practical support for decision-making and follow-up, with further improvements likely from broader datasets, multimodal integration and validation across diverse settings.
Source: npj Digital Medicine