Multi-Modal Generation and Control in Audio-Visual Research

The field of audio-visual research is moving toward more sophisticated multi-modal generation and control. Recent work focuses on improving the quality and realism of generated speech, audio, and video, and on tightening control and synchronization across modalities. Notable innovations include end-to-end frameworks, conditional flow matching, and audio-visual fusion modules that produce more realistic and coherent outputs, with direct implications for applications such as talking head generation, human-computer interaction, and virtual avatars. Noteworthy papers include DeepAudio-V1, an end-to-end multi-modal framework for simultaneous speech and audio generation from video; OmniTalker, a unified framework for real-time text-driven talking head generation with in-context audio-visual style replication; GAITGen, which generates realistic gait sequences conditioned on specified pathology severity levels; FreeInv, a nearly free-lunch method for improving DDIM inversion; and FlowMotion and ACTalker, which present new approaches to text-driven human motion synthesis and audio-visual controlled talking head video generation, respectively.
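Conditional flow matching, named above as one of the recurring techniques, trains a network to regress the velocity field that transports noise samples to data along simple interpolation paths, conditioned on an external signal such as text or audio features. The sketch below illustrates the core training objective under generic assumptions; the model architecture, feature dimensions, and conditioning embedding are placeholders for illustration and are not details taken from FlowMotion, DeepAudio-V1, or the other papers.

```python
# Minimal sketch of a conditional flow matching training step (illustrative only).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity predictor v_theta(x_t, t, c); real systems use transformers or U-Nets."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """One conditional flow matching step: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)        # noise sample
    t = torch.rand(x1.size(0), 1)    # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the linear interpolation path
    target_v = x1 - x0               # constant velocity along that path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# Usage: x1 is a batch of target motion/audio features, cond a conditioning embedding.
model = VelocityNet(dim=64, cond_dim=32)
x1, cond = torch.randn(8, 64), torch.randn(8, 32)
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```

At inference, generation amounts to integrating the learned velocity field from noise to data (e.g., with a simple Euler solver), which is what makes flow matching attractive for fast, controllable motion and audio synthesis.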

Sources

DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain

FreeInv: Free Lunch for Improving DDIM Inversion

FlowMotion: Target-Predictive Flow Matching for Realistic Text-Driven Human Motion Generation

OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
