SpatialNet with Binaural Loss Function for Correcting Binaural Signal Matching Outputs under Head Rotations
Abstract
Binaural reproduction is gaining increasing attention with the rise of devices such as virtual reality headsets, smart glasses, and head-tracked headphones. Achieving accurate binaural signals with these systems is challenging, as they often employ arbitrary microphone arrays with limited spatial resolution. The Binaural Signal Matching with Magnitude Least-Squares (BSM-MagLS) method was developed to address limitations of earlier BSM formulations, improving reproduction at high frequencies and under head rotation. However, its accuracy still degrades as head rotation increases, producing spatial and timbral artifacts, particularly when the virtual listener's ear moves farther from the nearest microphones. In this work, we propose integrating deep learning with BSM-MagLS to mitigate these degradations. A post-processing framework based on the SpatialNet network is employed, leveraging its ability to process spatial information effectively, and guided by both a signal-level loss and a perceptually motivated binaural loss derived from a theoretical model of human binaural hearing. The approach is investigated in a simulation study with a six-microphone semicircular array, showing robust performance across head rotations. These findings are further examined in a listening experiment spanning different reverberant acoustic environments and head rotations, demonstrating that the proposed framework effectively mitigates BSM-MagLS degradations and provides robust correction across substantial head rotations.
Summary
This paper addresses the challenge of maintaining accurate binaural audio reproduction in devices such as VR headsets and smart glasses that rely on microphone arrays with limited spatial resolution. The Binaural Signal Matching with Magnitude Least-Squares (BSM-MagLS) method improves binaural reproduction, especially at high frequencies and under head rotation; however, its performance degrades as head rotation increases, leading to spatial and timbral artifacts. The authors propose a deep-learning-based post-processing framework using the SpatialNet network to correct these degradations. SpatialNet is well suited to processing spatial information and is trained with both a signal-level loss and a perceptually motivated binaural loss derived from a model of human binaural hearing. The approach is evaluated through simulations and a listening experiment using a six-microphone semicircular array. The results demonstrate that the framework effectively mitigates BSM-MagLS degradations and provides robust correction across substantial head rotations, and the listening experiment, conducted in different reverberant acoustic environments, confirms the subjective improvement in binaural quality. The paper thus introduces a novel application of SpatialNet together with an auditory-filter-based binaural loss for enhancing binaural reproduction under head rotations, contributing to improved spatial audio experiences in a range of applications.
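To make the underlying formulation concrete, the following is a schematic statement of BSM and its MagLS variant, using notation common in the BSM literature; the symbols and the cutoff frequency are illustrative, not the paper's exact definitions. For each ear, a filter vector is sought that matches the array response to the (head-rotation-compensated) HRTF over a set of candidate directions, and above a cutoff only magnitudes are matched:

```latex
% Least-squares BSM for the left ear: c is the filter vector, v the array
% steering vector, h^l the rotated-target HRTF, over Q candidate directions.
\mathbf{c}^{l}(f) = \arg\min_{\mathbf{c}} \sum_{q=1}^{Q}
    \left| \mathbf{c}^{H}\mathbf{v}(\theta_q, f) - h^{l}(\theta_q, f) \right|^{2}

% MagLS variant above a cutoff f_c: the phase constraint is relaxed and only
% the magnitude response is matched, improving high-frequency reproduction.
\mathbf{c}^{l}_{\mathrm{MagLS}}(f) = \arg\min_{\mathbf{c}} \sum_{q=1}^{Q}
    \left( \left| \mathbf{c}^{H}\mathbf{v}(\theta_q, f) \right|
         - \left| h^{l}(\theta_q, f) \right| \right)^{2},
    \qquad f > f_{c}

% The binaural estimate is \hat{p}^{l}(f) = \mathbf{c}^{l}(f)^{H}\mathbf{x}(f),
% where \mathbf{x}(f) stacks the microphone signals; head rotation enters by
% rotating h^{l} while the array geometry stays fixed.
```

This view also explains the failure mode the paper targets: as the rotation angle grows, the rotated HRTF targets correspond to directions the fixed array samples poorly, so the matching error grows and a learned correction becomes necessary.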
Key Insights
- Novel Application of SpatialNet: The paper applies the SpatialNet architecture, originally designed for speech separation, to the correction of binaural audio distortions caused by head rotations.
- Auditory Filter-Based Binaural Loss: A perceptually motivated binaural loss function is introduced, derived from a theoretical model of human binaural hearing built on auditory filters. Used to train the neural network, it outperforms traditional STFT-based losses (an illustrative sketch of such a loss follows this list).
- Robustness to Head Rotations: The proposed method remains robust under substantial head rotations (up to 90 degrees in the listening experiment), maintaining high perceptual quality compared to the uncorrected BSM-MagLS output.
- Performance Improvement: The proposed SpatialNet-AUD achieved mean scores only 0.5 points below the reference at a head rotation of 60 degrees, a difference that was not statistically significant (p = .887). At 90 degrees, the mean difference from the reference was 0.8 points, again not statistically significant (p = .821).
- Limitations: The study restricts head rotations to the azimuthal plane and considers rightward rotations only. The simulations also used a relatively simple shoebox room model and HRTFs from the Cologne database.
- Connections to Related Work: The paper builds on the BSM-MagLS method and addresses its limitations under head rotations. It contrasts its approach with signal-dependent methods that rely on accurate parameter estimation, highlighting the benefits of a signal-independent formulation combined with deep learning.
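For readers who want a starting point, here is a minimal sketch of an auditory-filter-based binaural loss in PyTorch. It is an assumption-laden illustration, not the paper's implementation: the gammatone filterbank, band count, ILD-based spatial term, and weighting are placeholder choices standing in for the paper's binaural hearing model.

```python
# Minimal sketch of an auditory-filter-based binaural loss (PyTorch).
# Assumptions: a gammatone filterbank models the auditory filters, and the
# spatial term compares interaural level differences (ILDs) per band.
import math
import torch
import torch.nn.functional as F

def erb_space(f_min: float, f_max: float, n_bands: int) -> torch.Tensor:
    """Centre frequencies spaced on the ERB-rate scale (Glasberg & Moore)."""
    ear_q, min_bw = 9.26449, 24.7
    lo = math.log(1.0 + f_min / (ear_q * min_bw))
    hi = math.log(1.0 + f_max / (ear_q * min_bw))
    pts = torch.linspace(lo, hi, n_bands)
    return ear_q * min_bw * (torch.exp(pts) - 1.0)

def gammatone_kernels(fs: int, n_bands: int = 32, f_min: float = 50.0,
                      dur: float = 0.032) -> torch.Tensor:
    """4th-order gammatone FIR kernels, shape (n_bands, 1, taps)."""
    cf = erb_space(f_min, 0.9 * fs / 2, n_bands)
    t = torch.arange(int(dur * fs), dtype=torch.float32) / fs
    erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)            # bandwidth per band
    env = t.pow(3) * torch.exp(-2 * math.pi * 1.019 * erb[:, None] * t)
    g = env * torch.cos(2 * math.pi * cf[:, None] * t)
    g = g / g.norm(dim=1, keepdim=True)                # unit-energy bands
    return g.unsqueeze(1)

def binaural_auditory_loss(pred: torch.Tensor, target: torch.Tensor,
                           fs: int, ild_weight: float = 0.1) -> torch.Tensor:
    """pred, target: (batch, 2, time) binaural signals; returns a scalar."""
    k = gammatone_kernels(fs).to(device=pred.device, dtype=pred.dtype)

    def analyse(x: torch.Tensor) -> torch.Tensor:
        b, ears, n = x.shape
        y = F.conv1d(x.reshape(b * ears, 1, n), k, padding=k.shape[-1] // 2)
        return y.reshape(b, ears, k.shape[0], -1)      # (batch, ear, band, time)

    yp, yt = analyse(pred), analyse(target)
    signal_term = (yp - yt).pow(2).mean()              # per-band waveform error

    def ild(y: torch.Tensor) -> torch.Tensor:
        energy = y.pow(2).mean(dim=-1) + 1e-8          # (batch, ear, band)
        return 10.0 * torch.log10(energy[:, 0] / energy[:, 1])

    spatial_term = (ild(yp) - ild(yt)).abs().mean()    # interaural cue error
    return signal_term + ild_weight * spatial_term
```

In training, such a term would typically be combined with a signal-level loss on the network output, e.g. `loss = F.l1_loss(pred, target) + binaural_auditory_loss(pred, target, fs=48000)`; the relative weighting controls the trade-off between waveform fidelity and interaural-cue accuracy.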
Practical Implications
- Improved VR/AR Audio: The research has direct implications for binaural audio reproduction in virtual and augmented reality, enabling more immersive and realistic spatial audio experiences.
- Head-Tracked Headphones: The proposed method can be implemented in head-tracked headphones to compensate for listener head rotations and maintain accurate spatial audio rendering.
- Practitioner Tool: Engineers and developers can use the proposed SpatialNet-based framework and the auditory-filter-based binaural loss function to enhance the performance of binaural reproduction systems built on microphone arrays.
- Future Research Directions: The paper suggests extending the method to diverse HRTFs, broader and bidirectional head-rotation scenarios, more complex real-world acoustic environments, and arbitrary microphone configurations. Further research could explore task-specific neural architectures and larger-scale listening tests.