Speech Processing and Multi-Speaker Recognition

Report on Current Developments in Speech Processing and Multi-Speaker Recognition

General Direction of the Field

The field of speech processing and multi-speaker recognition is rapidly evolving, with a strong focus on addressing the complexities of real-world scenarios such as meetings, cocktail parties, and other environments with multiple simultaneous speakers and far-field conditions. Recent advances are characterized by a shift toward more robust and generalized solutions that handle these conditions without relying heavily on specific hardware configurations such as multi-channel microphone arrays.

One of the key trends is the development of large-scale datasets that simulate these complex environments, enabling researchers to train and evaluate models on realistic data; LibriheavyMix, for example, provides 20,000 hours of single-channel, reverberant multi-talker speech for separation, ASR, and speaker diarization. Such datasets are crucial for advancing the technologies needed to answer "Who said What and When" in multi-talker, reverberant settings.
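
To make the simulation recipe concrete, the sketch below builds a single-channel reverberant two-speaker mixture by convolving each utterance with a room impulse response, offsetting the second speaker, and mixing at a target signal-to-interference ratio. The decaying-noise impulse response and the overlap/SIR parameters are illustrative assumptions, not the pipeline of any specific dataset.

```python
# Minimal sketch of simulating a reverberant multi-talker mixture.
import numpy as np

def synthetic_rir(sr=16000, rt60=0.3, length_s=0.5, seed=0):
    """Exponentially decaying noise as a stand-in for a measured room impulse response."""
    rng = np.random.default_rng(seed)
    n = int(sr * length_s)
    decay = np.exp(-6.9 * np.arange(n) / (rt60 * sr))   # ~60 dB drop over rt60 seconds
    return rng.standard_normal(n) * decay

def mix_two_speakers(s1, s2, sr=16000, overlap_ratio=0.5, sir_db=0.0, seed=0):
    """Reverberate both sources, shift the second speaker, and mix at a target SIR."""
    r1 = np.convolve(s1, synthetic_rir(sr, seed=seed))
    r2 = np.convolve(s2, synthetic_rir(sr, seed=seed + 1))
    offset = int(len(r1) * (1.0 - overlap_ratio))        # partial overlap of the two talkers
    mix = np.zeros(max(len(r1), offset + len(r2)))
    # Scale the interfering speaker to the desired signal-to-interference ratio.
    gain = np.sqrt(np.mean(r1**2) / (np.mean(r2**2) * 10 ** (sir_db / 10)))
    mix[:len(r1)] += r1
    mix[offset:offset + len(r2)] += gain * r2
    return mix

# Toy usage with random signals standing in for real utterances.
rng = np.random.default_rng(1)
mixture = mix_two_speakers(rng.standard_normal(16000 * 3), rng.standard_normal(16000 * 2))
```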

Another significant trend is the integration of advanced neural network architectures, particularly transformers, into speech processing tasks. Transformers are being leveraged for their ability to capture long-range dependencies and complex patterns in speech signals, leading to improvements in tasks such as target speaker extraction and automatic speech recognition (ASR). Transformer-based models are also being paired with adversarial training and multi-scale discriminators to enhance the perceptual quality of extracted speech.
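
As a rough illustration of the conditional-transformer pattern described above, the sketch below encodes the mixture with a learned filterbank, adds a projected enrollment (speaker) embedding, estimates a mask with a transformer encoder, and decodes the masked representation back to a waveform. This is a generic layout, not the architecture of Spectron or any other cited system; all layer sizes are arbitrary assumptions, and the adversarial refinement stage (multi-scale discriminators) is omitted.

```python
# Generic sketch of transformer-based target speaker extraction.
import torch
import torch.nn as nn

class ConditionalTransformerTSE(nn.Module):
    def __init__(self, feat_dim=256, spk_dim=256, n_layers=4, n_heads=4, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=kernel, stride=stride)
        self.spk_proj = nn.Linear(spk_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=4 * feat_dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=kernel, stride=stride)

    def forward(self, mixture, spk_emb):
        # mixture: (batch, samples); spk_emb: (batch, spk_dim) from an enrollment utterance.
        feats = self.encoder(mixture.unsqueeze(1)).transpose(1, 2)   # (B, T, F)
        feats = feats + self.spk_proj(spk_emb).unsqueeze(1)          # additive speaker conditioning
        hidden = self.transformer(feats)
        masked = feats * self.mask(hidden)                           # mask the encoded mixture
        return self.decoder(masked.transpose(1, 2)).squeeze(1)       # (B, samples)

model = ConditionalTransformerTSE()
estimate = model(torch.randn(2, 16000), torch.randn(2, 256))
```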

The field is also witnessing a growing interest in serialized output training (SOT) for multi-speaker ASR, which concatenates the transcriptions of all speakers into a single token stream and thereby avoids fixed per-speaker output branches. Innovations in this area include overlapped encoding separation and serialized speech information guidance, which improve the performance of ASR systems under heavily overlapped conditions.
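
The core of SOT is how multi-speaker references are serialized: transcriptions are ordered by start time and joined with a speaker-change token, so a standard single-output ASR model can be trained on overlapped speech. The minimal sketch below illustrates this serialization; the token name and word-level tokenization are simplifying assumptions.

```python
# Minimal illustration of serialized output training (SOT) reference construction.
SC = "<sc>"  # speaker-change token

def serialize_references(segments):
    """segments: list of (start_time_s, transcript) for each speaker's utterance."""
    ordered = sorted(segments, key=lambda seg: seg[0])   # first-in, first-out ordering
    tokens = []
    for i, (_, text) in enumerate(ordered):
        if i > 0:
            tokens.append(SC)                            # mark the change of speaker
        tokens.extend(text.lower().split())
    return tokens

print(serialize_references([(1.2, "yes I agree"), (0.0, "what do you think")]))
# ['what', 'do', 'you', 'think', '<sc>', 'yes', 'i', 'agree']
```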

Additionally, there is a notable emphasis on developing systems that can detect and counteract deepfake audio, particularly in the context of singing voice deepfake detection. These systems are being designed to be robust against various adversarial conditions and are often based on ensemble methods that combine multiple models to improve detection accuracy.
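
One common way such ensembles are assembled is score-level fusion: each subsystem produces a detection score per trial, and the scores are combined with a weighted average before thresholding. The sketch below illustrates only this fusion step; the weights, threshold, and score convention are placeholders rather than values from the cited systems.

```python
# Hedged sketch of score-level fusion for an audio deepfake detection ensemble.
import numpy as np

def fuse_scores(scores_per_model, weights=None):
    """scores_per_model: (n_models, n_trials) array of per-model detection scores."""
    scores = np.asarray(scores_per_model, dtype=float)
    w = (np.full(scores.shape[0], 1.0 / scores.shape[0])
         if weights is None else np.asarray(weights, dtype=float))
    return w @ scores                                    # (n_trials,) fused scores

fused = fuse_scores([[0.9, 0.2, 0.7],    # e.g. scores from one foundation-model detector
                     [0.8, 0.1, 0.6]])   # e.g. scores from a second, complementary model
is_deepfake = fused < 0.5                # assumed convention: lower score = spoofed audio
```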

Noteworthy Innovations

  • Transformer-based Target Speaker Extraction: A novel approach that introduces additional objectives for speaker embedding consistency and waveform encoder invertibility, significantly outperforming existing methods.

  • Neuro-Guided Speaker Extraction: Utilizes EEG signals to guide the extraction of attended speech from monaural mixtures, demonstrating superior performance over baseline models (see the cross-modal attention sketch after this list).

  • Universal Speaker Embedding-Free Target Speaker Extraction: Introduces a framework that eliminates the need for speaker embeddings, achieving state-of-the-art performance on standard benchmarks.
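
As an illustration of the cross-modal attention idea referenced in the neuro-guided extraction bullet above, the sketch below lets EEG-derived features attend over the encoded speech mixture so that frames aligned with the listener's attention are emphasized. The query/key roles, feature dimensions, and single-layer setup are assumptions for illustration, not the NeuroSpex design.

```python
# Illustrative cross-modal attention between EEG features and encoded speech.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, speech_dim=256, eeg_dim=64, n_heads=4):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_dim, speech_dim)
        self.attn = nn.MultiheadAttention(speech_dim, n_heads, batch_first=True)

    def forward(self, speech_feats, eeg_feats):
        # speech_feats: (B, T_speech, speech_dim); eeg_feats: (B, T_eeg, eeg_dim)
        q = self.eeg_proj(eeg_feats)                     # bring EEG into the speech feature space
        fused, _ = self.attn(query=q, key=speech_feats, value=speech_feats)
        return fused                                     # (B, T_eeg, speech_dim)

module = CrossModalAttention()
out = module(torch.randn(2, 200, 256), torch.randn(2, 50, 64))
```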

Sources

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

USTC-KXDIGIT System Description for ASVspoof5 Challenge

Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction