Advancements in Speech Processing and Synthesis

The field of speech processing and synthesis is advancing rapidly, driven by new deep learning architectures and training techniques. One notable trend is the development of more efficient and scalable models, including recurrent neural network (RNN)-based architectures, which are achieving state-of-the-art results in text-to-speech (TTS) synthesis and voice conversion. Another area of focus is the evaluation of speech-to-text systems, with benchmark studies highlighting the strengths and weaknesses of commercial and open-source offerings. There is also growing interest in controllable and expressive speech generation, with models designed to incorporate high-level context and user feedback, as well as in methods for disentangling speech feature representations and aligning speech tokens with text transcriptions, both of which are facilitating progress in spoken language modeling.

Noteworthy papers in this area include RWKVTTS, which introduces an RNN-based (RWKV-7) architecture for TTS applications, and TASTE, which proposes a method for text-aligned speech tokenization and embedding. Overall, these advancements are paving the way for more accessible, versatile, and human-like speech interfaces, with potential applications in human-computer interaction, content creation, and assistive technologies.
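
As a concrete illustration of what aligning speech tokens with text involves, the sketch below shows the general idea in its simplest form: frame-level speech features are mean-pooled into one embedding per transcribed word using word timestamps, so the resulting speech token sequence lines up one-to-one with the text. This is a minimal sketch of the concept, not the TASTE implementation; the function name, shapes, and frame rate are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the TASTE method) of text-aligned speech
# tokenization: pool frame-level speech encoder outputs into one embedding per
# text token using word-level timestamps from a forced aligner or ASR system.
import numpy as np

def align_speech_to_text(frame_features: np.ndarray,
                         frame_rate_hz: float,
                         word_spans: list[tuple[str, float, float]]) -> dict[str, np.ndarray]:
    """Mean-pool frame-level features inside each word's (start, end) interval.

    frame_features: (num_frames, feature_dim) speech encoder outputs.
    frame_rate_hz:  frames per second of the speech encoder (e.g. 50 Hz).
    word_spans:     list of (word, start_sec, end_sec) alignments.
    Returns one pooled embedding per word, keyed by "index:word".
    """
    aligned = {}
    for i, (word, start, end) in enumerate(word_spans):
        lo = int(round(start * frame_rate_hz))
        hi = max(lo + 1, int(round(end * frame_rate_hz)))  # keep at least one frame
        aligned[f"{i}:{word}"] = frame_features[lo:hi].mean(axis=0)
    return aligned

# Toy usage: 100 frames of 8-dim features at 50 Hz, two aligned words.
feats = np.random.randn(100, 8)
spans = [("hello", 0.0, 0.8), ("world", 0.8, 2.0)]
tokens = align_speech_to_text(feats, 50.0, spans)
print({k: v.shape for k, v in tokens.items()})  # each word maps to one (8,) embedding
```

The per-word embeddings can then be interleaved with or conditioned on the text tokens of a language model, which is the kind of text-speech alignment that spoken language modeling work in this area relies on.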

Sources

RWKVTTS: Yet another TTS based on RWKV-7

Determined blind source separation via modeling adjacent frequency band correlations in speech signals

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

Evaluating Speech-to-Text Systems with PennSound

AVENet: Disentangling Features by Approximating Average Features for Voice Conversion

Controllable Automatic Foley Artist

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
