Speech Recognition and Language Models

Report on Current Developments in Speech Recognition and Language Models

General Direction of the Field

Recent advances in speech recognition and large language models (LLMs) are pushing the boundaries of what is possible with audio-based AI systems. The focus is increasingly shifting towards low-resource languages, real-time interaction, and multi-speaker scenarios, and innovations in integrating LLMs with automatic speech recognition (ASR) systems are producing more efficient, accurate, and versatile solutions.

One of the key trends is the use of pseudo-labeling to improve ASR for low-resource languages. In this approach, a seed model transcribes large volumes of unlabeled audio, and the resulting pseudo-labels are filtered and used to augment the existing training data. This not only improves performance on in-domain benchmarks but also preserves robustness on out-of-domain data, making it a promising direction for advancing ASR in underserved languages.
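
Below is a minimal sketch of the idea, assuming a confidence-filtered pipeline; `transcribe`, the confidence threshold, and all data names are illustrative stand-ins rather than any paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class PseudoLabel:
    audio_path: str
    text: str
    confidence: float

def transcribe(audio_path):
    """Stand-in for a seed ASR model. A real system would return the decoded
    hypothesis plus a confidence such as the length-normalised log-probability
    of the beam-search output."""
    return "placeholder transcript", 0.9  # illustrative only

def build_pseudo_labeled_set(unlabeled_audio, threshold=0.8):
    """Keep only hypotheses the seed model is confident about."""
    kept = []
    for path in unlabeled_audio:
        text, conf = transcribe(path)
        if conf >= threshold and text.strip():
            kept.append(PseudoLabel(path, text, conf))
    return kept

# The filtered pseudo-labels are then mixed with the gold training data, e.g.
# train_set = gold_labeled + build_pseudo_labeled_set(crawled_audio)
print(build_pseudo_labeled_set(["clip_001.wav", "clip_002.wav"]))
```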

Another significant development is the adoption of transformers in speech recognition systems. Transformers, with their ability to capture long-range dependencies, are being extensively explored for speech processing tasks. The topological-lingualism perspective, which considers both the architectural topology of these models and the linguistic settings they serve, is emerging as a comprehensive framework for understanding and improving transformer-based ASR systems.
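
As a reminder of the core mechanism, here is a toy scaled dot-product self-attention over a sequence of acoustic frames; it is a single unlearned head in plain NumPy for illustration, not any particular ASR architecture.

```python
import numpy as np

def self_attention(x):
    """x: (frames, dim) acoustic features -> contextualised frames."""
    d = x.shape[-1]
    q = k = v = x                              # single head, no learned weights
    scores = q @ k.T / np.sqrt(d)              # every frame attends to every frame
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

frames = np.random.randn(100, 16)              # 100 frames of 16-dim features
out = self_attention(frames)                   # frame 0 can draw on frame 99
```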

Real-time interaction is also a focal point, with models like Mini-Omni demonstrating conversation with near-human fluency. These models accept audio input directly, reason over it, and generate speech output in real time, eliminating the need for a separate text-to-speech stage and reducing latency.
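
The sketch below illustrates the interleaved decoding loop such models rely on: each step yields a text token and a chunk of audio codec tokens, so playback starts before the response is complete. The model class and token names are invented stand-ins, not Mini-Omni's released interface.

```python
class ToySpeechLM:
    """Stand-in model: one text token plus a chunk of audio codec tokens per step."""

    def __init__(self, n_steps=3):
        self.n_steps = n_steps

    def step(self, t):
        text_token = f"<txt_{t}>"
        audio_tokens = [f"<aud_{t}_{k}>" for k in range(4)]
        return text_token, audio_tokens

def stream_response(model, play):
    # Text and audio advance in lock-step: the model "talks while thinking",
    # and audio plays back as soon as each chunk of tokens is decoded.
    for t in range(model.n_steps):
        text_tok, audio_toks = model.step(t)
        play(audio_toks)                 # no separate TTS stage -> low latency
        print("internal text:", text_tok)

stream_response(ToySpeechLM(), play=lambda toks: print("playing:", toks))
```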

The coupling of ASR systems with large language models is another area of rapid progress. SALSA, for example, advances the ASR and LLM decoders synchronously, yielding significant word error rate (WER) reductions for low-resource languages. The approach addresses tokenizer mismatches between the two models and offers a more efficient training process than traditional fusion methods.
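
A minimal way to picture synchronous decoding is two scorers stepped in lock-step, with their log-probabilities combined at each step, as in the toy below. This is a simplified, shallow-fusion-style illustration; SALSA's actual coupling, which also bridges mismatched tokenizers, is more involved, and all objects here are stubs.

```python
import math

VOCAB = ["hello", "world", "</s>"]

def toy_scorer(bias):
    """Stub decoder: log-probs over VOCAB, nudged by `bias`; prefers to stop
    as the prefix grows, standing in for a real autoregressive model."""
    def score(prefix):
        logits = {t: bias.get(t, 0.0) for t in VOCAB}
        logits["</s>"] += len(prefix)
        z = math.log(sum(math.exp(v) for v in logits.values()))
        return {t: v - z for t, v in logits.items()}
    return score

def synchronous_greedy_decode(asr_score, llm_score, weight=0.3, max_len=10):
    """Advance both decoders together, combining scores log-linearly each step."""
    prefix = []
    for _ in range(max_len):
        a, l = asr_score(prefix), llm_score(prefix)
        combined = {t: a[t] + weight * l[t] for t in VOCAB}
        best = max(combined, key=combined.get)
        if best == "</s>":
            break
        prefix.append(best)
    return prefix

asr = toy_scorer({"hello": 2.0, "</s>": 1.0})   # acoustic evidence
llm = toy_scorer({"world": 2.0, "</s>": 1.5})   # linguistic prior
print(synchronous_greedy_decode(asr, llm))       # -> ['hello']
```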

Zero-shot spoken language understanding (SLU) is also gaining traction, with models like WHISMA showing robust performance across a range of zero-shot settings. These models couple a speech encoder with a large language model and fine-tune the combination on a broad mix of SLU datasets, achieving state-of-the-art slot filling and generalisation to unseen domains.
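
The pattern can be sketched as follows: speech embeddings are projected into the LLM's input space and prepended to a natural-language description of the slot schema, so slot types unseen in training can simply be described at inference time. The prompt format and names below are assumptions for illustration, not WHISMA's exact prompt.

```python
def build_slot_prompt(slot_schema):
    """Describe the requested slots in plain language -- the zero-shot interface:
    unseen slot types only need a description, not new training data."""
    lines = [f"- {name}: {desc}" for name, desc in slot_schema.items()]
    return (
        "Extract the following slots from the spoken utterance.\n"
        + "\n".join(lines)
        + "\nAnswer as JSON with one key per slot (null if absent)."
    )

prompt = build_slot_prompt({
    "departure_city": "city the speaker is travelling from",
    "arrival_city": "city the speaker is travelling to",
})
# A speech-LLM would consume [projected speech embeddings] + prompt tokens and
# generate the JSON answer directly, with no intermediate transcript required.
print(prompt)
```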

Finally, advances in multi-talker ASR are being driven by the integration of large language models, which are better equipped to handle overlapping speech and long contexts, improving performance in conversational scenarios. LLM-based serialized output training (SOT), in which an LLM serves as the decoder emitting all speakers' transcripts serially, is proving to be a powerful approach, achieving state-of-the-art results on both simulated and real-world datasets.
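
Here is a minimal sketch of how SOT targets are built, assuming the standard first-in-first-out convention from the SOT literature: transcripts of overlapping speakers are sorted by start time and joined with a speaker-change token. The `<sc>` token name and the data are illustrative.

```python
SC = " <sc> "  # speaker-change token separating speakers in the serialized target

def serialize_transcripts(segments):
    """segments: (start_time_sec, transcript) pairs for one mixed utterance.
    Sort first-in-first-out by start time, then join with the <sc> token,
    so a single decoder can emit all speakers serially."""
    ordered = sorted(segments, key=lambda s: s[0])
    return SC.join(text for _, text in ordered)

mixture = [
    (1.2, "can you hear me"),
    (0.0, "good morning everyone"),
]
print(serialize_transcripts(mixture))
# -> "good morning everyone <sc> can you hear me"
```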

Noteworthy Papers

  • Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling: Introduces a robust pseudo-labeling framework for low-resource languages, validated on a new benchmark, IndicYT.
  • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming: Presents the first fully end-to-end, open-source model for real-time speech interaction, enabling conversation with near-human fluency.
  • WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding: Demonstrates significant improvements in zero-shot slot filling and generalisation to unseen domains, introducing a new benchmark, SLU-GLUE.
  • Advancing Multi-talker ASR Performance with Large Language Models: Achieves state-of-the-art performance in multi-talker ASR, outperforming traditional methods on both simulated and real-world datasets.

Sources

Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling

Speech Recognition Transformers: Topological-lingualism Perspective

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

SALSA: Speedy ASR-LLM Synchronous Aggregation

WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Advancing Multi-talker ASR Performance with Large Language Models