Advancements in Audio-Visual Generation and Dubbing

Research on audio-visual generation and dubbing is advancing rapidly, with a focus on improving the quality and realism of generated audio and video. Recent work centers on tightening the semantic and temporal alignment between the visual and audio domains to produce more realistic and engaging outputs. Approaches include multi-stage generative models and chain-of-thought-like guidance, which break professional audio generation into step-by-step reasoning. Noteworthy papers in this area include:

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization, which introduces a multi-stage generative framework with Chain-of-Thought-like guidance learning.
MoCha: Towards Movie-Grade Talking Character Synthesis, which proposes a speech-video window attention mechanism to generate talking character animations directly from speech and text.
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance, which uses multimodal Chain-of-Thought reasoning to understand dubbing styles and fine-grained attributes.
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models, which extends Neural Codec Language Models to incorporate video features and ensure time-synchronized and expressively aligned speech synthesis.
