Advances in Audio-Driven Human Motion Generation

The field of audio-driven human motion generation is advancing rapidly, with a focus on creating more realistic and expressive movements. Researchers are exploring new architectures and techniques to improve the quality and diversity of generated motions, including diffusion models and transformer variants such as recurrent embedded transformers. These innovations have yielded significant gains in generating coherent, natural-looking human movements, such as gestures and body language, synchronized with speech. Noteworthy papers in this area include DIDiffGes, which achieves real-time gesture generation from speech with high quality and expressiveness, and ReCoM, which presents a framework for generating high-fidelity, generalizable human body motions synchronized with speech, achieving state-of-the-art performance across metrics.
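To make the diffusion-based approach concrete, the sketch below shows how audio-conditioned motion sampling typically works at a high level: a noisy pose sequence is iteratively denoised, with audio features injected at every step so the resulting motion tracks the speech. This is a minimal toy illustration, not the architecture of any paper listed here; the linear `predict_noise` stands in for what would be a trained transformer or similar network, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D_MOTION, D_AUDIO = 8, 6, 4        # frames, pose dims, audio-feature dims
STEPS = 50
betas = np.linspace(1e-4, 0.02, STEPS)  # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Stand-in for a trained denoiser's weights (illustrative only).
W_x = rng.normal(scale=0.1, size=(D_MOTION, D_MOTION))
W_a = rng.normal(scale=0.1, size=(D_AUDIO, D_MOTION))

def predict_noise(x_t, audio, t):
    # A real model would be a transformer over the frame axis; a linear
    # map is enough here to show the audio-conditioning pathway.
    return x_t @ W_x + audio @ W_a

def sample_motion(audio):
    # DDPM-style ancestral sampling: start from Gaussian noise and
    # denoise step by step, conditioned on the audio features.
    x = rng.normal(size=(T, D_MOTION))
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, audio, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:  # add sampling noise except at the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

audio_feats = rng.normal(size=(T, D_AUDIO))   # e.g. per-frame speech features
motion = sample_motion(audio_feats)
print(motion.shape)  # one pose vector per audio frame
```

Because the audio features enter the noise prediction at every denoising step, different speech inputs steer the sampler toward different motions, which is the core mechanism behind speech-synchronized gesture generation in diffusion-based systems.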

Sources

DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Audio-driven Gesture Generation via Deviation Feature in the Latent Space

ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer
