Automatic Speech Recognition (ASR) and Related Fields

Report on Current Developments in Automatic Speech Recognition (ASR) and Related Fields

General Trends and Innovations

Recent advances in Automatic Speech Recognition (ASR) and related areas are marked by a shift towards more sophisticated, integrated models that leverage both discrete and continuous representations of speech. The focus is increasingly on models that handle complex tasks such as multi-speaker recognition, code-switching, and cross-linguistic variation, while remaining resource-efficient and adaptable to varied data conditions.

One of the key directions in the field is the exploration of optimal transport-based methods for cross-modal knowledge transfer, which aim to align linguistic and acoustic features more effectively by preserving temporal order. This approach has shown promising results in improving ASR performance, particularly in languages with complex phonetic structures.
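The core idea can be illustrated with a small sketch: entropy-regularized optimal transport aligns acoustic frames to linguistic tokens, with an added cost term that biases the transport plan toward a monotone, order-preserving alignment. This is an illustrative reconstruction rather than the TOT-CAKT method itself; the cosine feature cost, the quadratic order prior, and the weight `lam` are assumptions.

```python
# Sketch of temporal-order-aware optimal transport alignment between acoustic
# frame embeddings and linguistic token embeddings (illustrative, not TOT-CAKT).
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan with uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    K = np.exp(-cost / epsilon)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # transport plan, shape (n, m)

def temporal_ot_alignment(acoustic, linguistic, lam=1.0):
    """Soft alignment of acoustic frames (T, d) to linguistic tokens (L, d)."""
    A = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    B = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    feat_cost = 1.0 - A @ B.T                        # cosine distance, (T, L)

    # Temporal-order prior (assumed form): penalize couplings far from the
    # monotone diagonal, so frame i prefers tokens near position i * L / T.
    T, L = feat_cost.shape
    ti = np.arange(T)[:, None] / max(T - 1, 1)
    tj = np.arange(L)[None, :] / max(L - 1, 1)
    order_cost = (ti - tj) ** 2

    return sinkhorn(feat_cost + lam * order_cost)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    plan = temporal_ot_alignment(rng.normal(size=(50, 16)),
                                 rng.normal(size=(12, 16)))
    print(plan.shape, plan.sum())                    # (50, 12), sums to ~1.0
```

The order prior simply makes off-diagonal couplings expensive, so the learned transport plan stays close to a left-to-right alignment while the feature cost still decides the exact mapping.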

Another notable trend is the integration of Language Identification (LID) mechanisms within Mixture of Experts (MoE) models. These models are designed to handle the intricacies of code-switching by dynamically routing information to the most appropriate expert networks, thereby enhancing the model's ability to process diverse linguistic inputs. This approach not only improves recognition accuracy but also maintains the efficiency of MoE models during inference.
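A minimal sketch of the idea follows: a frame-level LID head produces a language posterior that gates a small set of language-specific feed-forward experts. The expert layout, gating formulation, and dimensions are assumptions for illustration, not the Collaborative-MoE architecture from the paper.

```python
# Illustrative LID-guided Mixture-of-Experts routing for code-switching ASR.
import torch
import torch.nn as nn

class LIDGatedMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_langs=2):
        super().__init__()
        # One feed-forward expert per language (e.g. Mandarin and English).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_langs)
        ])
        # Frame-level language-identification head drives the router.
        self.lid_head = nn.Linear(d_model, n_langs)

    def forward(self, x):
        # x: (batch, time, d_model) encoder states
        lid_logits = self.lid_head(x)                  # (B, T, n_langs)
        gate = lid_logits.softmax(dim=-1)              # soft language posterior per frame
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, n_langs)
        y = (expert_out * gate.unsqueeze(2)).sum(-1)   # LID-weighted mixture of experts
        return y, lid_logits                           # lid_logits can feed an auxiliary LID loss

moe = LIDGatedMoE()
frames = torch.randn(4, 100, 256)
out, lid = moe(frames)
print(out.shape, lid.shape)  # torch.Size([4, 100, 256]) torch.Size([4, 100, 2])
```

Because the router is tied to an LID objective rather than learned from scratch, the gate tends to send each frame to the expert matching its language, which is the intuition behind the collaborative routing described above.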

The field is also witnessing a convergence between traditional Recursive Neural Networks (RvNNs) and Transformer architectures. Recent developments have shown that models like Continuous Recursive Neural Networks (CRvNN) and Neural Data Routers (NDR) can bridge the gap between these two paradigms, offering improved performance in tasks that require strong structural inductive biases.
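One concrete mechanism in this design space is the copy gate used by NDR-style models: at each layer, a position either keeps (copies) its previous representation or accepts an attention-based update, which encourages step-wise, routing-like computation. The sketch below shows that gating pattern in a generic Transformer-style layer; the gate parameterization and sizes are assumptions, not the published NDR or CRvNN implementations.

```python
# Illustrative copy-gated Transformer layer in the spirit of Neural Data Routers.
import torch
import torch.nn as nn

class CopyGatedLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, 1)   # per-position copy gate

    def forward(self, x):
        # Standard pre-norm attention + feed-forward update.
        h = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        update = h + self.ff(self.norm2(h))
        # Gate near 0 -> copy the old state; gate near 1 -> accept the update.
        g = torch.sigmoid(self.gate(update))
        return g * update + (1.0 - g) * x

layer = CopyGatedLayer()
tokens = torch.randn(2, 10, 128)
print(layer(tokens).shape)  # torch.Size([2, 10, 128])
```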

In the realm of speech tokenization, there is a growing emphasis on creating tokenizers that are aware of the underlying language models (LMs). This approach aims to reduce the mismatch between tokenization and LM usage, leading to more effective and integrated models for various speech-related tasks.
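The training signal can be sketched as follows: a quantizer turns speech-encoder features into discrete units, an adapter maps them into the embedding space of a frozen pre-trained LM, and the quantizer is optimized so that the LM's next-unit prediction loss is low. This is an assumed, simplified interface for illustration, not the LAST recipe; the codebook size, straight-through estimator, and stand-in LM are all assumptions.

```python
# Illustrative language-model-aware speech tokenization objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMAwareTokenizer(nn.Module):
    """Vector-quantizes speech-encoder features into discrete units for a frozen LM."""
    def __init__(self, d_speech=512, codebook_size=500, d_lm=768):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d_speech)  # discrete speech units
        self.to_lm = nn.Linear(d_speech, d_lm)                 # adapter into the LM space

    def forward(self, feats):
        # feats: (B, T, d_speech) from a frozen speech encoder (e.g. a HuBERT-like model)
        w = self.codebook.weight
        dist = (feats.pow(2).sum(-1, keepdim=True)
                - 2 * feats @ w.T
                + w.pow(2).sum(-1))                            # squared distances, (B, T, K)
        codes = dist.argmin(-1)                                # hard unit ids
        quantized = self.codebook(codes)
        # Straight-through estimator; a codebook/commitment loss (omitted here)
        # would normally also be needed to update the codebook entries.
        quantized = feats + (quantized - feats).detach()
        return codes, self.to_lm(quantized)

def lm_aware_loss(tokenizer, frozen_lm_body, lm_head, feats):
    """Next-unit prediction loss of a frozen LM run on the tokenizer's outputs."""
    codes, lm_inputs = tokenizer(feats)
    hidden = frozen_lm_body(lm_inputs)        # frozen transformer body (stand-in interface)
    logits = lm_head(hidden[:, :-1])          # predict the next discrete unit
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           codes[:, 1:].reshape(-1))

if __name__ == "__main__":
    tok = LMAwareTokenizer()
    # Stand-ins for a frozen pre-trained LM body and output head.
    body = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
    head = nn.Linear(768, 500)
    print(lm_aware_loss(tok, body, head, torch.randn(2, 40, 512)).item())
```

The key design choice is that the gradient of the LM's loss flows back into the tokenizer, so the discrete units are shaped by how useful they are to the downstream LM rather than chosen independently of it.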

Noteworthy Papers

  • Comparing Discrete and Continuous Space LLMs for Speech Recognition: This paper provides a comprehensive comparison of speech representations in LLM-based ASR, achieving a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech.

  • Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR: The proposed TOT-CAKT model significantly improves ASR performance by preserving temporal order in cross-modal alignment, outperforming several state-of-the-art models.

  • Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model: The Collaborative-MoE model demonstrates significant performance enhancements in code-switching ASR, maintaining efficient inference capabilities without additional pre-training.

  • LAST: Language Model Aware Speech Tokenization: This study introduces a novel approach to training speech tokenizers that are integrated with pre-trained textual LMs, outperforming conventional methods in both spoken language modeling and speech-to-text tasks.

Sources

Comparing Discrete and Continuous Space LLMs for Speech Recognition

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR

Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model

On the Design Space Between Transformers and Recursive Neural Nets

STAB: Speech Tokenizer Assessment Benchmark

Probing self-attention in self-supervised speech models for cross-linguistic differences

Quantification of stylistic differences in human- and ASR-produced transcripts of African American English

LAST: Language Model Aware Speech Tokenization

N-gram Prediction and Word Difference Representations for Language Modeling