Speech and Emotion Synthesis

Report on Current Developments in Speech and Emotion Synthesis Research

General Trends and Innovations

Recent advances in speech and emotion synthesis are pushing the boundaries of what is possible in digital human interaction and auditory experience. A notable trend is the integration of cognitive and psychological theories into computational models, enabling more human-like and contextually appropriate outputs. This approach is particularly evident in the development of "non-phonorealistic" rendering techniques, where vocal imitations are generated by driving a simulated model of the human vocal tract and incorporating communicative reasoning. This method not only aligns with human intuitions but also broadens the scope of auditory depiction in computer graphics.
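To make the vocal-tract idea concrete, the sketch below uses a classic source-filter approximation rather than the paper's actual model: a glottal-like pulse train is filtered through a cascade of formant resonators. The formant frequencies, bandwidths, and pitch are assumed, textbook-style values for a rough /a/-like vowel.

```python
# Minimal source-filter sketch of vocal-tract-style synthesis (illustrative only,
# not the paper's model): a glottal-like pulse train is passed through a cascade
# of formant resonators to produce a vowel-like sound.
import numpy as np
from scipy.signal import lfilter

def resonator(freq_hz, bandwidth_hz, fs):
    """Two-pole resonator approximating a single vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    b = [1.0 - r]                                   # rough gain scaling
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    return b, a

def synth_vowel(f0=120.0, formants=((730, 90), (1090, 110), (2440, 120)),
                duration=0.5, fs=16000):
    """Synthesize a crude /a/-like vowel: impulse-train source + formant filters."""
    n = int(duration * fs)
    source = np.zeros(n)
    source[::int(fs / f0)] = 1.0                    # glottal pulse train (very simplified)
    signal = source
    for freq, bw in formants:                       # cascade the formant resonators
        b, a = resonator(freq, bw, fs)
        signal = lfilter(b, a, signal)
    return signal / (np.max(np.abs(signal)) + 1e-9)

audio = synth_vowel()                               # 0.5 s of a crude vowel at 16 kHz
```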

Another significant direction is the adaptation and enhancement of text-to-speech (TTS) systems to handle complex linguistic and emotional nuances. Researchers are increasingly focusing on cross-lingual and multilingual capabilities, addressing the challenge of generating accurate and expressive speech across diverse languages. This includes models that generate Mandarin talking-head video directly from audio, leveraging large-scale datasets and feature-embedding techniques to capture the complexities of Mandarin lip movements.
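As a rough illustration of the audio-driven side of such pipelines (not JoyHallo's actual architecture), the sketch below extracts frame-level speech embeddings with a pretrained self-supervised model and hands them to a lip-motion decoder; the checkpoint name and the LipMotionDecoder module are assumptions for illustration only.

```python
# Illustrative sketch: self-supervised audio embeddings as the driving signal for
# audio-to-video generation. The wav2vec2 checkpoint is an assumed stand-in and
# LipMotionDecoder is a hypothetical module, not part of any real library.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000 * 3)                   # 3 s of placeholder 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # (1, T, 768): roughly one embedding per 20 ms of audio
    audio_features = encoder(**inputs).last_hidden_state

# A real system would condition a video generator on these features, e.g.:
# lip_frames = LipMotionDecoder()(audio_features)   # hypothetical decoder
```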

The use of diffusion models and self-supervised learning (SSL) in speech-guided MRI video generation is also gaining traction. These models are capable of producing real-time MRI videos of the vocal tract during speech, offering valuable insights into speech production and articulator motion. This technology has potential applications in second language learning systems and the creation of speaking characters in video games and animations.
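The sketch below illustrates the general pattern of speech-conditioned diffusion rather than the Speech2rtMRI architecture itself: a small denoiser predicts the noise added to MRI-like frames while being modulated, FiLM-style, by self-supervised speech features. The network shape, feature dimension, and noise schedule are all assumptions chosen for brevity.

```python
# Toy speech-conditioned denoiser and one DDPM-style training step (illustrative,
# not the Speech2rtMRI model): SSL speech features modulate the hidden activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechConditionedDenoiser(nn.Module):
    def __init__(self, speech_dim=768, hidden=64):
        super().__init__()
        self.film = nn.Linear(speech_dim + 1, 2 * hidden)   # scale/shift from speech + timestep
        self.in_conv = nn.Conv2d(1, hidden, 3, padding=1)
        self.mid_conv = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.out_conv = nn.Conv2d(hidden, 1, 3, padding=1)

    def forward(self, noisy_frame, speech_feat, t):
        # FiLM-style conditioning on speech features and the diffusion timestep.
        cond = torch.cat([speech_feat, t.float().unsqueeze(-1) / 1000.0], dim=-1)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = F.silu(self.in_conv(noisy_frame))
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = F.silu(self.mid_conv(h))
        return self.out_conv(h)

# One training step on toy data: add noise to frames, predict it back.
model = SpeechConditionedDenoiser()
frames = torch.randn(4, 1, 64, 64)                  # placeholder rtMRI frames
speech = torch.randn(4, 768)                        # placeholder SSL speech embeddings
t = torch.randint(0, 1000, (4,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # simple cosine schedule
noise = torch.randn_like(frames)
noisy = (alpha_bar.view(-1, 1, 1, 1).sqrt() * frames
         + (1 - alpha_bar).view(-1, 1, 1, 1).sqrt() * noise)

loss = F.mse_loss(model(noisy, speech, t), noise)
loss.backward()
```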

Emotional dimension control in TTS systems is another area of innovation. Researchers are developing frameworks that can synthesize a broad spectrum of human emotions by controlling the pleasure, arousal, and dominance (PAD) dimensions. These models combine findings from psychological research with SSL features to generate emotionally expressive speech without requiring large emotional speech datasets, enhancing both the naturalness and the diversity of emotional styles in synthesized speech.
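A minimal sketch of dimensional conditioning follows, assuming (this is not taken from the paper) that a three-dimensional pleasure/arousal/dominance vector is projected into the model dimension and added to the token embeddings of a language-model-style TTS backbone.

```python
# Illustrative PAD conditioning for a language-model-style TTS backbone; the
# module and dimensions are assumptions, not the paper's actual framework.
import torch
import torch.nn as nn

class PADConditioner(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(3, d_model), nn.Tanh(), nn.Linear(d_model, d_model)
        )

    def forward(self, token_emb, pad_vector):
        # token_emb: (batch, seq_len, d_model); pad_vector: (batch, 3) in [-1, 1]
        return token_emb + self.proj(pad_vector).unsqueeze(1)

# Example: steer synthesis toward high arousal, moderate pleasure, low dominance.
conditioner = PADConditioner()
tokens = torch.randn(1, 50, 512)                    # placeholder text-token embeddings
pad = torch.tensor([[0.4, 0.9, -0.3]])              # (pleasure, arousal, dominance)
conditioned = conditioner(tokens, pad)              # fed to the TTS decoder downstream
```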

Noteworthy Papers

  • "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation: This paper introduces a method that aligns with human intuitions by incorporating cognitive theories of communication into vocal imitation models, significantly advancing auditory depiction in computer graphics.

  • Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video: The use of diffusion models and SSL to generate real-time MRI videos of the vocal tract during speech is a groundbreaking approach with wide-ranging applications in speech-production research and digital character creation.

  • Emotional Dimension Control in Language Model-Based Text-to-Speech: The proposed TTS framework that controls emotional dimensions without requiring emotional speech data during training represents a significant leap in synthesizing diverse emotional styles in speech.

Sources

Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation

JoyHallo: Digital human model for Mandarin

Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech

Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

Cross-lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
