Audio and Speech Processing

Report on Recent Developments in Audio and Speech Processing

General Trends and Innovations

The field of audio and speech processing is seeing significant advances, particularly in deepfake detection, keyword spotting, and multimodal misinformation detection. Recent work is characterized by a shift towards more robust, generalizable models that can handle a wide variety of conditions and threats, including unseen or novel attacks.

  1. Deepfake Detection: There is a growing emphasis on developing systems that can effectively detect deepfake audio generated by advanced Audio Language Models (ALMs). Innovations in feature extraction and adversarial training are being explored to enhance the robustness and generalization of deepfake detection models.

  2. Keyword Spotting (KWS): The focus is on creating ultra-low-power KWS systems that can be incrementally trained and personalized post-deployment. Adversarial training and self-learning frameworks are being utilized to improve the accuracy and adaptability of KWS models, especially in diverse and noisy environments.

  3. Multimodal Misinformation Detection: With the rise of deepfake technology, there is an increasing need for comprehensive multimodal frameworks that can detect misinformation across various modalities, including audio, video, text, and images. Research is exploring the role of audio in these frameworks and the importance of modality alignment.

  4. Robust Speaker Verification: Efforts are being made to harden speaker verification systems against both noise and spoofing attacks. Novel frameworks that combine noise disentanglement, adversarial training, and feature-robust loss functions are being developed to learn noise-independent embedding spaces that still preserve speaker identity.
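The common thread in points 2 and 4, adversarial training toward domain-invariant features, can be sketched in plain Python. This is an illustrative toy, not code from any of the cited systems: a logistic domain discriminator learns to separate two feature "domains", and the features are then pushed *up* the discriminator's loss gradient, which is the sign flip at the heart of gradient-reversal-style disentanglement.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)

# Toy 1-D "features": one domain (e.g. clean speech) clusters near +1,
# the other (e.g. noisy speech) near -1.
clean = [random.gauss(1.0, 0.1) for _ in range(50)]
noisy = [random.gauss(-1.0, 0.1) for _ in range(50)]

def domain_loss(w, b, clean, noisy):
    """Mean binary cross-entropy of the logistic domain discriminator."""
    total = 0.0
    for f, y in [(f, 1) for f in clean] + [(f, 0) for f in noisy]:
        p = sigmoid(w * f + b)
        total -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
    return total / (len(clean) + len(noisy))

# Step 1: train the discriminator to tell the two domains apart.
w, b = 0.0, 0.0
for _ in range(300):
    gw = gb = 0.0
    for f, y in [(f, 1) for f in clean] + [(f, 0) for f in noisy]:
        p = sigmoid(w * f + b)
        gw += (p - y) * f
        gb += p - y
    n = len(clean) + len(noisy)
    w -= 0.5 * gw / n
    b -= 0.5 * gb / n

before = domain_loss(w, b, clean, noisy)

# Step 2: adversarial update of the features themselves. Each feature
# moves up the domain-loss gradient (dL/df = (p - y) * w), the reversed
# direction relative to ordinary training, so the frozen discriminator
# finds the two domains harder to separate.
for _ in range(10):
    clean = [f + 0.5 * (sigmoid(w * f + b) - 1) * w for f in clean]
    noisy = [f + 0.5 * (sigmoid(w * f + b) - 0) * w for f in noisy]

after = domain_loss(w, b, clean, noisy)
```

After the reversed-gradient updates the two clusters drift toward each other and the discriminator's loss rises; in a real system that same signal is backpropagated into the feature extractor end to end.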

Noteworthy Developments

  • Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting: This approach adds an adversarial loss when training KWS models on synthetic (TTS) speech, yielding significant accuracy improvements on real speech data even without any real positive examples.

  • Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?: The study finds that the latest codec-trained countermeasures can effectively detect ALM-based deepfake audio, pointing to promising directions for future research.
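The modality-alignment theme from trend 3 above is commonly realized with a contrastive objective. The following pure-Python sketch of a CLIP-style InfoNCE loss is an assumed, generic formulation for illustration, not code from any of the cited works:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(audio_embs, text_embs, temperature=0.1):
    """Audio-to-text contrastive loss: the matched transcript (same
    index) should score higher than every mismatched one."""
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        sims = [cosine(audio_embs[i], t) / temperature for t in text_embs]
        m = max(sims)  # stabilized log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]  # cross-entropy with target index i
    return total / n

# Toy 2-D embeddings: in the aligned case, audio i matches text i.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]],
                   [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]],
                   [[0.1, 0.9], [0.9, 0.1]])
```

Matched audio/text pairs sit on the diagonal of the similarity matrix, so a well-aligned embedding space yields a much lower loss than a misaligned one; in practice a symmetric text-to-audio term is usually added as well.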

These developments highlight the dynamic and innovative nature of the field, pushing the boundaries of what is possible in audio and speech processing.

Sources

SZU-AFS Antispoofing System for the ASVspoof 5 Challenge

Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review

ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

ICSD: An Open-source Dataset for Infant Cry and Snoring Detection

A Noval Feature via Color Quantisation for Fake Audio Detection

BUT Systems and Analyses for the ASVspoof 5 Challenge

An Improved Phase Coding Audio Steganography Algorithm

A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Exploring the Role of Audio in Multimodal Misinformation Detection

Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting