Tongue movement patterns provide clinically relevant information for speech articulation, speech motor control and speech-related disorders. Ultrasound offers a safe, non-invasive and cost-effective way to visualise tongue motion during speech, but reliable contour extraction remains difficult because image quality varies and ultrasound frames often contain noise, blur and weak tissue boundaries. A recent study accepted for publication in the IEEE Journal of Biomedical and Health Informatics presents UltraUNet, a lightweight model designed to segment tongue contours in ultrasound tongue imaging. The approach addresses the need for fast image processing while maintaining accuracy across different imaging conditions. Its performance was assessed against established segmentation models and across datasets covering varied devices, participant groups and linguistic settings.
A Model Designed for Real-Time Use
UltraUNet is based on a compact encoder-decoder structure adapted from the UNet family of segmentation models. Its design focuses on reducing computational load while preserving the information needed to identify tongue contours in ultrasound images. Instead of using a heavier architecture throughout the network, UltraUNet applies selected feature refinement only in deeper layers, where more abstract image information is processed. This approach aims to support accuracy without adding unnecessary processing demands.
The model also uses Group Normalization in deeper encoder layers to improve training stability in small-batch settings, which are common in medical image segmentation. Skip connections rely on summation rather than concatenation to limit memory use and processing overhead. The decoder reconstructs the final segmentation mask while avoiding extra normalisation steps that could slow inference.
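The saving from summation-based skip connections can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the channel width of 64, the 3x3 kernel and the helper names `conv_params` and `fused_channels` are all assumptions chosen for illustration. It counts the parameters of the first decoder convolution that follows each kind of skip connection.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def fused_channels(decoder_channels, encoder_channels, mode):
    """Channel count of the feature map after the skip connection."""
    if mode == "sum":
        # Element-wise addition needs matching shapes and keeps the width.
        if decoder_channels != encoder_channels:
            raise ValueError("summation requires equal channel counts")
        return decoder_channels
    if mode == "concat":
        # Concatenation stacks both feature maps along the channel axis.
        return decoder_channels + encoder_channels
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical decoder stage with 64 channels (not a figure from the paper).
WIDTH = 64
sum_cost = conv_params(fused_channels(WIDTH, WIDTH, "sum"), WIDTH)
concat_cost = conv_params(fused_channels(WIDTH, WIDTH, "concat"), WIDTH)
```

Under these assumptions the convolution after a concatenated skip has roughly twice as many weights as the one after a summed skip, and the fused feature map held in memory is twice as wide.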
Training includes ultrasound-specific augmentation to expose the model to different image conditions. A denoising model generates additional training examples with varied noise characteristics. Other augmentation methods simulate changes in orientation, speckle noise and blur. Images are standardised before processing, and histogram matching is used in cross-dataset evaluations to reduce differences in brightness and intensity distributions across ultrasound systems.
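Histogram matching, as used in the cross-dataset evaluations, can be sketched in a few lines. This is a minimal pure-Python version of the general technique, assuming 8-bit greyscale images flattened into lists of integers. The function names are hypothetical and the paper's own implementation may differ.

```python
from bisect import bisect_left


def cumulative_distribution(pixels, levels=256):
    """Fraction of pixels at or below each intensity level."""
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running / len(pixels))
    return cdf


def match_histogram(source, reference, levels=256):
    """Remap source intensities so their distribution follows the reference."""
    source_cdf = cumulative_distribution(source, levels)
    reference_cdf = cumulative_distribution(reference, levels)
    # For each source level, pick the lowest reference level whose
    # cumulative fraction reaches the source level's cumulative fraction.
    lookup = [
        min(bisect_left(reference_cdf, source_cdf[level]), levels - 1)
        for level in range(levels)
    ]
    return [lookup[value] for value in source]


# Toy example: a dark frame remapped onto a brighter reference frame.
dark_frame = [0, 0, 1, 2]
bright_reference = [10, 10, 20, 30]
matched = match_histogram(dark_frame, bright_reference)
```

In this toy case the dark frame takes on the intensity levels of the brighter reference, which is the effect wanted when frames from different ultrasound systems are fed to one model.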
Testing Across Varied Imaging Conditions
The evaluation uses multiple ultrasound tongue imaging datasets with different participant groups, languages and acquisition settings. Two datasets are used for training and evaluation, while additional datasets are reserved for testing. These include data from Mandarin-speaking adults, Cantonese-speaking participants, typically developing English-speaking children, children with speech sound disorders, children with cleft lip and palate, and a single-speaker corpus with synchronised lip video.
The datasets differ in imaging quality, resolution, pixel intensity patterns and clinical or linguistic characteristics. This diversity is important because models trained on one ultrasound setup may perform less reliably when applied to data from another device or population. UltraUNet is therefore evaluated not only on unseen images from the same dataset but also on datasets that were not used during training.
All images use manually verified pixel-level annotations as reference contours. The evaluation compares UltraUNet with several established and lightweight segmentation models, including UNet, Attention UNet, IrisNet, Mobile UNet, SegNet, Squeeze UNet, Swin UNet, USEnet and wUNet. Performance is assessed through contour precision and segmentation overlap, capturing both the closeness of the predicted contour and the overall agreement between predicted and reference masks.
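The two kinds of measure described above can be made concrete with a small sketch. The article does not name the exact metrics, so the choices here are assumptions: the Dice coefficient stands in for segmentation overlap, and a symmetric mean nearest-point distance stands in for contour precision. Masks are nested lists of 0 and 1, and contours are lists of (x, y) points.

```python
import math


def dice_overlap(predicted_mask, reference_mask):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|) for binary masks."""
    predicted = [v for row in predicted_mask for v in row]
    reference = [v for row in reference_mask for v in row]
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_contour_distance(predicted_points, reference_points):
    """Symmetric mean distance from each contour point to the other contour."""
    def one_way(points, targets):
        return sum(
            min(math.dist(p, q) for q in targets) for p in points
        ) / len(points)

    forward = one_way(predicted_points, reference_points)
    backward = one_way(reference_points, predicted_points)
    return 0.5 * (forward + backward)


# Toy data: the prediction covers one extra pixel and one shifted point.
overlap = dice_overlap([[0, 1, 1], [0, 0, 0]], [[0, 1, 0], [0, 0, 0]])
distance = mean_contour_distance([(0, 0), (1, 0)], [(0, 0), (1, 1)])
```

A higher overlap and a lower distance both indicate better agreement. Reporting the two together matters for thin structures such as tongue contours, where a small positional shift can reduce overlap sharply while the distance remains small.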
Accuracy Balanced with Processing Speed
In single-dataset evaluation, UltraUNet achieves the strongest average balance between segmentation overlap and contour precision among the compared models. Its results are close to or better than larger UNet-based models while requiring substantially fewer computational resources. The model reaches 250 frames per second during inference, exceeding the speed of UNet, Attention UNet and other comparator models in the evaluation.
The full processing pipeline, which includes image preparation and model inference, also maintains high frame rates during continuous operation. This is important for applications that require immediate visual feedback, such as ultrasound-guided speech therapy and real-time analysis of tongue movement. The model’s design therefore supports both computational efficiency and practical use in time-sensitive settings.
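Throughput figures of this kind are usually obtained by timing a loop over many frames. The sketch below shows one way to do that, with warm-up iterations excluded from the measurement. The `pipeline` function is a hypothetical placeholder for image preparation plus inference, not the UltraUNet model, and the `clock` parameter is an assumption added so the timer can be swapped out.

```python
import time


def measure_fps(process_frame, frames, warmup=5, clock=time.perf_counter):
    """Average frames per second of process_frame over the given frames."""
    # Warm-up runs are not timed, so one-off start-up costs are excluded.
    for frame in frames[:warmup]:
        process_frame(frame)
    start = clock()
    for frame in frames:
        process_frame(frame)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return len(frames) / elapsed


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def pipeline(frame):
    """Hypothetical stand-in: scale intensities, then threshold."""
    prepared = [value / 255.0 for value in frame]
    return [1 if value > 0.5 else 0 for value in prepared]
```

Calling `measure_fps(pipeline, frames)` on a list of frames returns the measured rate. For reference, `frame_budget_ms(250)` gives 4.0, so a rate of 250 frames per second leaves 4 ms for each frame.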
Cross-dataset testing further shows that UltraUNet generalises well to unseen data. When trained on one dataset and tested on others, it records the highest average segmentation overlap in one training configuration and the strongest overall results in another. Statistical testing shows significant improvements over selected baseline models across the cross-dataset evaluations. The ablation results also show that removing individual architectural components weakens performance, supporting the combined design of selective refinement, deeper-layer normalisation and efficient skip connections.
UltraUNet offers a lightweight approach to real-time tongue contour segmentation in ultrasound imaging. Its architecture reduces computational overhead while maintaining accurate contour extraction across varied datasets and imaging conditions. The combination of targeted augmentation, efficient model design and cross-dataset evaluation supports its use in speech research, speech motor disorder analysis and clinical workflows involving ultrasound tongue imaging. Future development may extend the model through spatiotemporal analysis, broader hardware testing and further refinement for real-world deployment.
Source: IEEE Journal of Biomedical and Health Informatics
References:
Myrgyyassov A, Song Z, Sun Y et al. (2026) UltraUNet: Real-Time Ultrasound Tongue Segmentation for Diverse Linguistic and Imaging Conditions. IEEE Journal of Biomedical and Health Informatics: Early Access.