Text-to-Speech (TTS) Research

Report on Current Developments in Text-to-Speech (TTS) Research

General Direction of the Field

The field of Text-to-Speech (TTS) synthesis is shifting towards more efficient, scalable, and multilingual models, driven by the need to address the limitations of existing systems and to extend high-quality speech synthesis to under-resourced languages. Recent work focuses on non-autoregressive models, which promise higher generation efficiency and robustness than traditional autoregressive approaches. These models are designed to handle the complexities of speech synthesis without compromising naturalness or intelligibility, even when trained on large-scale multilingual datasets.

One of the key innovations is the development of models that do not require precise alignment information between text and speech, thereby simplifying the training process and improving the flexibility of the system. These models often employ a two-stage approach, where the first stage predicts semantic tokens from text, and the second stage generates acoustic tokens conditioned on these semantic tokens. This decoupling of semantic and acoustic modeling allows for more efficient and parallelizable generation processes.
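The non-autoregressive generation step described above can be sketched as iterative mask-predict decoding: start from a fully masked token sequence and, over a fixed number of parallel steps, commit the most confident predictions while re-masking the rest. The sketch below is a toy illustration under stated assumptions; `toy_predictor` is a random stand-in for a learned transformer, and the linear unmasking schedule is an assumption for clarity, not the schedule of any specific paper.

```python
import numpy as np

MASK = -1  # sentinel value for a masked position

def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a learned model. A real system would return, per position,
    a predicted acoustic token and a confidence, conditioned on text or
    semantic tokens. Here both are random for illustration."""
    n = len(tokens)
    preds = rng.integers(0, vocab_size, size=n)
    conf = rng.random(n)
    return preds, conf

def mask_predict_decode(length, vocab_size, steps=4, seed=0):
    """Iterative parallel decoding: begin fully masked, then at each step
    unmask the most confident predictions, following a linear schedule
    that leaves fewer positions masked after every step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK, dtype=int)
    for step in range(steps):
        preds, conf = toy_predictor(tokens, vocab_size, rng)
        # number of positions that should remain masked after this step
        keep_masked = int(length * (1 - (step + 1) / steps))
        idx = np.where(tokens == MASK)[0]
        # rank still-masked positions by confidence; commit the top ones
        order = idx[np.argsort(-conf[idx])]
        n_unmask = len(idx) - keep_masked
        for i in order[:n_unmask]:
            tokens[i] = preds[i]
    return tokens
```

Because every position is predicted in parallel at each step, the number of model calls is the (small, fixed) number of steps rather than the sequence length, which is the efficiency argument for this family of models.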

Another significant trend is the exploration of multilingual training strategies, particularly for low-resource languages. Researchers are investigating the use of cross-lingual transfer learning and multilingual pre-training to enhance the performance of TTS models in languages with limited data. This approach leverages data from multiple languages to improve the intelligibility and naturalness of synthesized speech in target low-resource languages, thereby democratizing access to high-quality TTS technology.
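The transfer-learning recipe above (pre-train on pooled high-resource data, then warm-start fine-tuning on a small low-resource set) can be illustrated with a deliberately tiny model. The sketch below uses a linear regressor trained by gradient descent as a hypothetical stand-in for a TTS acoustic model; it shows only the warm-start mechanics, not any actual TTS architecture.

```python
import numpy as np

def train_linear(X, y, w_init, lr=0.05, steps=50):
    """Plain gradient descent on mean-squared error, starting from w_init.
    Warm-starting from pre-trained weights is the transfer-learning step."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2.0 / len(X) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)            # shared structure across "languages"
X_hi = rng.normal(size=(400, 5)); y_hi = X_hi @ w_true   # high-resource data
X_lo = rng.normal(size=(10, 5));  y_lo = X_lo @ w_true   # low-resource data

# Pre-train on pooled high-resource data, then fine-tune briefly on the
# small low-resource set; compare against training from scratch.
w_pre = train_linear(X_hi, y_hi, np.zeros(5), steps=200)
w_finetuned = train_linear(X_lo, y_lo, w_pre, steps=5)
w_scratch = train_linear(X_lo, y_lo, np.zeros(5), steps=5)
```

With the same five fine-tuning steps, the warm-started model ends up with much lower error on the low-resource data than the model trained from scratch, which is the intuition behind cross-lingual transfer: the pre-trained parameters already encode structure shared across languages.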

Efficiency in speech synthesis is also a major focus, with the introduction of novel codecs that compress speech sequences into shorter, multi-stream discrete semantic sequences. These codecs aim to reduce the computational complexity and improve the efficiency of language model-based TTS systems, enabling faster and more scalable speech generation.
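A common mechanism behind multi-stream discrete codecs is residual vector quantization: each stream quantizes the residual left over by the previous streams, so a frame is represented by a few small codebook indices instead of a raw vector. The sketch below is a minimal illustration with random codebooks, not the codec of any cited paper (SoCodec additionally orders streams by semantic importance, which is not modeled here).

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: stream k picks the codeword nearest to the residual
    left by streams 0..k-1, yielding one index per stream."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        i = int(np.argmin(dists))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))
```

Each stream adds a refinement on top of the previous ones, so reconstruction error shrinks (or at worst stays flat) as streams are added, while the representation per frame stays a handful of integers; this is what lets a language model operate on short discrete sequences instead of raw audio.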

Finally, there is a growing emphasis on the development of TTS systems for under-resourced and dialectally diverse languages, such as Taiwanese Hakka. These systems not only advance the technological capabilities of TTS but also contribute to language preservation and revitalization efforts by providing tools for generating high-quality, dialect-specific speech.

Noteworthy Papers

  • MaskGCT: Introduces a fully non-autoregressive TTS model that achieves superior performance in quality and efficiency, setting a new benchmark for zero-shot TTS systems.
  • VoxHakka: Represents a significant advancement in TTS for under-resourced languages, demonstrating high naturalness and accuracy in synthesizing Taiwanese Hakka speech across multiple dialects.

Sources

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

A multilingual training strategy for low resource Text to Speech

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation