Audio-Driven Talking Head Synthesis: Recent Advances

Advances in Audio-Driven Talking Head Synthesis

Audio-driven talking head synthesis has advanced significantly in recent months, particularly in realistic motion generation, identity preservation, and real-time performance. Diffusion models and transformer architectures now enable more precise control over facial expressions and head movements, yielding more natural and temporally coherent video. Multi-scale modeling and explicit motion spaces have addressed earlier limitations in inference speed and fine-grained control, making these methods more viable for interactive applications such as AI assistants.
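
To make the idea of an explicit motion space concrete, the sketch below shows one way such a decoupling can be set up: an audio-conditioned diffusion model is trained on compact motion codes (expression/pose latents), while frame rendering is left to a separate module that is omitted here. This is a minimal illustrative sketch, not the architecture of any specific paper; the module names, dimensions, and noise schedule are assumptions.

```python
# Minimal sketch (not any particular paper's method): audio-conditioned diffusion
# over an explicit motion latent space. The denoiser predicts noise on compact
# motion codes; a separate renderer (omitted) would turn denoised codes into frames.
# All names, dimensions, and the noise schedule below are illustrative assumptions.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=64, audio_dim=128, hidden=256, layers=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # Raw timestep index fed through an MLP; a real model would likely use
        # sinusoidal timestep embeddings instead.
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim), t: (B,)
        h = self.in_proj(noisy_motion) + self.audio_proj(audio_feat)
        h = h + self.time_embed(t[:, None].float())[:, None, :]  # broadcast over frames
        return self.out_proj(self.backbone(h))  # predicted noise

def ddpm_training_step(model, motion, audio_feat, alphas_cumprod):
    """One standard DDPM training step applied to motion latents."""
    b = motion.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (b,), device=motion.device)
    a = alphas_cumprod[t][:, None, None]
    noise = torch.randn_like(motion)
    noisy = a.sqrt() * motion + (1 - a).sqrt() * noise
    pred = model(noisy, audio_feat, t)
    return torch.mean((pred - noise) ** 2)

if __name__ == "__main__":
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    model = MotionDenoiser()
    motion = torch.randn(2, 25, 64)   # 25 frames of motion codes per clip
    audio = torch.randn(2, 25, 128)   # aligned audio features (e.g. from a pretrained encoder)
    loss = ddpm_training_step(model, motion, audio, alphas_cumprod)
    loss.backward()
    print(f"diffusion loss: {loss.item():.4f}")
```

Because the diffusion model operates on low-dimensional motion codes rather than pixels, sampling is cheap enough for real-time pipelines, and the motion codes themselves expose handles for fine-grained control.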

Another notable trend is the shift toward personalized, lifelong avatar modeling, in which avatars are constructed and animated over extended periods and capture gradual changes in identity and appearance. This approach not only enhances the realism of the avatars but also opens up possibilities for long-term virtual interactions.

On the privacy and security front, there has been a focus on facial expression recognition systems that protect identity information while still accurately capturing expressions. Striking this balance is crucial for video-based facial expression recognition applications that must not compromise user privacy.

Noteworthy Papers

  • GaussianSpeech: Introduces a novel approach to synthesizing high-fidelity animation sequences of personalized 3D human head avatars from spoken audio, achieving state-of-the-art performance in real-time rendering.
  • Ditto: A diffusion-based framework that enables controllable real-time talking head synthesis, significantly outperforming existing methods in motion control and real-time performance.
  • LokiTalk: Enhances NeRF-based talking heads with lifelike facial dynamics and improved training efficiency, delivering superior high-fidelity results.
  • FLOAT: An audio-driven talking portrait method that generates motion via flow matching in a learned motion latent space, outperforming prior methods in visual quality, motion fidelity, and efficiency (see the flow-matching sketch after this list).
  • MEMO: An end-to-end audio-driven portrait animation approach that generates identity-consistent and expressive talking videos, outperforming state-of-the-art methods in overall quality and emotion alignment.
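
As a companion to the FLOAT entry above, here is a minimal conditional flow-matching (rectified-flow) training objective on motion latents. It is a generic illustration of the technique named in the paper's title, not the paper's actual model; the toy velocity network and all dimensions are assumptions.

```python
# Minimal sketch of a conditional flow-matching objective on motion latents,
# in the spirit of "generative motion latent flow matching". This is a generic
# rectified-flow loss, not FLOAT's exact architecture; the velocity network
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity field v(x_t, t | audio) over motion latents."""
    def __init__(self, motion_dim=64, audio_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, audio_feat, t):
        # x_t: (B, motion_dim), audio_feat: (B, audio_dim), t: (B, 1)
        return self.net(torch.cat([x_t, audio_feat, t], dim=-1))

def flow_matching_loss(model, motion, audio_feat):
    """Rectified-flow target along the straight path: v* = x1 - x0."""
    noise = torch.randn_like(motion)      # x0 ~ N(0, I)
    t = torch.rand(motion.size(0), 1)     # uniform time in [0, 1]
    x_t = (1 - t) * noise + t * motion    # linear interpolation between noise and data
    target_v = motion - noise             # constant velocity along the straight path
    pred_v = model(x_t, audio_feat, t)
    return torch.mean((pred_v - target_v) ** 2)

if __name__ == "__main__":
    model = VelocityNet()
    motion = torch.randn(8, 64)   # per-frame motion latents
    audio = torch.randn(8, 128)   # aligned audio features
    loss = flow_matching_loss(model, motion, audio)
    loss.backward()
    print(f"flow-matching loss: {loss.item():.4f}")
```

At inference time, integrating the learned velocity field from noise to data typically needs far fewer steps than ancestral diffusion sampling, which is one reason flow matching is attractive for efficient talking-portrait generation.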

Sources

GaussianSpeech: Audio-Driven Gaussian Avatars

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis

Facial Expression Recognition with Controlled Privacy Preservation and Feature Compensation

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

ControlFace: Harnessing Facial Parametric Control for Face Rigging

It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

WEM-GAN: Wavelet transform based facial expression manipulation

Continual Learning of Personalized Generative Face Models with Experience Replay

SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
