Efficient and Adaptable Models in Automatic Speech Recognition

The field of Automatic Speech Recognition (ASR) is shifting toward more efficient and adaptable models, particularly for low-resource and multilingual scenarios. Architectural innovations, such as 1D-Convolutional Neural Networks (1D-CNNs) for speaker identification on minimal datasets, are achieving high accuracy with modest resource budgets, pointing to practical deployments. Fusing discrete and self-augmented representations is also improving ASR performance, striking a balance between computational efficiency and accuracy. Integrating speaker attribution into ASR models without extensive fine-tuning is gaining traction and shows robustness across diverse datasets. Work on detecting the gender-based violence victim condition from speech in a speaker-agnostic setting highlights the potential of AI in mental-health assessment. Continual-learning approaches that leverage gradient episodic memory are being explored to prevent catastrophic forgetting in ASR systems. Lastly, multimodal paraphrase supervision is being used to improve conversational ASR in multiple languages, indicating a move toward more context-aware and versatile ASR technologies.
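The gradient-episodic-memory idea mentioned above can be illustrated with a small sketch. The full GEM method solves a quadratic program over constraints from all past tasks; the version below is a simplified single-constraint projection (closer in spirit to averaged-GEM variants), written in plain NumPy. All names and shapes are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np

def gem_project(grad, mem_grads):
    """Adjust the current-task gradient so the update does not increase
    loss on episodic-memory examples from earlier tasks.

    grad:      gradient on the current batch, shape (d,)
    mem_grads: iterable of gradients computed on memory batches, each (d,)

    Simplification: constraints are enforced by sequential projection
    rather than by solving GEM's quadratic program.
    """
    g = grad.astype(float).copy()
    for g_mem in mem_grads:
        dot = g @ g_mem
        if dot < 0:  # proposed update conflicts with a past task
            # remove the conflicting component of g along g_mem
            g = g - (dot / (g_mem @ g_mem)) * g_mem
    return g

# toy example: current gradient conflicts with one memory gradient
g = np.array([1.0, -2.0])
g_mem = np.array([1.0, 1.0])
g_proj = gem_project(g, [g_mem])
print(g_proj @ g_mem)  # no longer negative after projection
```

After projection the update is non-increasing on the memory batch (its inner product with the memory gradient is clipped to zero or above), which is the mechanism GEM uses to mitigate catastrophic forgetting.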

Noteworthy papers include one introducing a lightweight 1D-CNN that reaches high speaker-identification accuracy on minimal datasets, and another presenting a novel fusion mechanism for discrete representations that significantly improves ASR performance.
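To make the lightweight 1D-CNN idea concrete, here is a minimal forward-pass sketch of such a speaker-identification model in plain NumPy: a single valid 1-D convolution over acoustic feature frames, ReLU, global average pooling over time, and a linear classifier. The layer sizes (13 MFCC channels, 8 filters, 4 speakers) are illustrative assumptions and do not reproduce the cited paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, stride=1):
    """Valid 1-D convolution. x: (in_ch, time); kernels: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = kernels.shape
    t_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_ch, t_out))
    for o in range(out_ch):
        for t in range(t_out):
            out[o, t] = np.sum(kernels[o] * x[:, t * stride:t * stride + k])
    return out

def speaker_logits(features, w_conv, w_out):
    h = np.maximum(conv1d(features, w_conv), 0.0)  # ReLU activation
    pooled = h.mean(axis=1)                        # global average pool over time
    return pooled @ w_out                          # per-speaker logits

# toy shapes: 13 MFCC channels x 100 frames, 8 filters of width 5, 4 speakers
x = rng.standard_normal((13, 100))
w_conv = rng.standard_normal((8, 13, 5)) * 0.1
w_out = rng.standard_normal((8, 4)) * 0.1
logits = speaker_logits(x, w_conv, w_out)
print(logits.shape)  # (4,)
```

Global pooling over time is what keeps such a model small and utterance-length-agnostic: the parameter count depends only on channel and filter sizes, not on the input duration, which suits constrained-resource settings.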

Sources

From Statistical Methods to Pre-Trained Models: A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language

Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting

How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Continual Learning in Machine Speech Chain Using Gradient Episodic Memory

AMPS: ASR with Multimodal Paraphrase Supervision

Multiple Choice Learning for Efficient Speech Separation with Many Speakers
