Human Motion Generation

Report on Current Developments in Human Motion Generation Research

General Trends and Innovations

The field of human motion generation has seen significant advances, largely driven by a paradigm shift inspired by the success of Large Language Models (LLMs). Researchers are increasingly focused on developing Large Motion Models (LMMs) to address the complexities of generating diverse and realistic human motions. This shift is characterized by a growing emphasis on scaling both data and model size, leveraging synthetic data, and exploring novel evaluation metrics that better assess model performance.

One of the key innovations in this area is the introduction of large-scale, high-quality motion datasets, which are crucial for training more versatile and generalizable models. These datasets often feature multimodal data, including detailed text descriptions, to enhance the alignment between textual instructions and generated motions. The use of synthetic data and pseudo labels has also become prominent, helping to mitigate the high costs associated with acquiring real-world motion data.
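To make the pseudo-labeling idea concrete, the sketch below is purely illustrative: the sample structure, the captioner interface, and every name in it are assumptions rather than details taken from any cited paper. It shows how a pretrained motion captioner could supply the text side of a paired sample when no human annotation exists.

```python
# Minimal sketch (assumptions throughout): a paired text-motion sample and a
# pseudo-labeling step where a pretrained captioner supplies the caption for
# motions that lack human annotations.
from dataclasses import dataclass
import numpy as np

@dataclass
class TextMotionSample:
    motion: np.ndarray   # (num_frames, num_joints, 3) joint positions
    caption: str         # natural-language description of the motion
    is_pseudo: bool      # True if the caption was machine-generated

def pseudo_label(motion: np.ndarray, captioner) -> TextMotionSample:
    """Attach a machine-generated caption to an unlabeled motion clip.

    `captioner` is assumed to be any callable mapping a motion array to a
    string, e.g. a motion-captioning model of the kind trained in LaMP.
    """
    return TextMotionSample(motion=motion, caption=captioner(motion), is_pseudo=True)

# Usage with a stand-in captioner:
dummy_captioner = lambda m: "a person walks forward and waves"
sample = pseudo_label(np.zeros((120, 22, 3)), dummy_captioner)
print(sample.caption, sample.is_pseudo)
```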

Another notable trend is the integration of hierarchical audio-motion embedding and diffusion interpolation techniques to improve the fidelity and synchronization of generated co-speech gestures. These methods aim to address the limitations of existing generative models, particularly in handling audio-motion misalignment and visual artifacts.
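A rough sketch of this retrieval-plus-interpolation idea follows. The embedding shapes and function names are assumptions, and a simple linear cross-fade stands in for the diffusion-based interpolation used in the actual work.

```python
# Illustrative sketch only: retrieve a motion clip by audio-motion embedding
# similarity, then smooth the seam between consecutive clips. A linear
# cross-fade is a placeholder for diffusion-based transition generation.
import numpy as np

def retrieve_clip(audio_emb: np.ndarray, motion_embs: np.ndarray, clips: list) -> np.ndarray:
    """Pick the motion clip whose embedding has the highest cosine similarity to the audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    m = motion_embs / np.linalg.norm(motion_embs, axis=1, keepdims=True)
    return clips[int(np.argmax(m @ a))]

def blend_transition(prev_clip: np.ndarray, next_clip: np.ndarray, n: int = 8) -> np.ndarray:
    """Cross-fade the last/first n frames of consecutive clips (stand-in for diffusion interpolation)."""
    w = np.linspace(0.0, 1.0, n)[:, None, None]
    transition = (1 - w) * prev_clip[-n:] + w * next_clip[:n]
    return np.concatenate([prev_clip[:-n], transition, next_clip[n:]], axis=0)
```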

The unification of text, music, and motion generation within a single multimodal framework is also gaining traction. Models like UniMuMo are designed to take arbitrary inputs from these modalities and generate outputs across all three, addressing the lack of time-synchronized data by aligning unpaired music and motion data based on rhythmic patterns.
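The sketch below illustrates one way such rhythm-based pairing could work, under the assumptions that motion "beats" can be read off local minima of overall joint speed and that music beat times come from an external beat tracker; UniMuMo's actual alignment procedure may differ.

```python
# Hedged sketch of rhythm-based pairing of unpaired music and motion.
# Music beat times (in seconds) are assumed to be given by a beat tracker.
import numpy as np

def motion_beats(joints: np.ndarray, fps: float) -> np.ndarray:
    """Estimate motion beats as local minima of overall joint speed."""
    speed = np.linalg.norm(np.diff(joints, axis=0), axis=(1, 2))
    idx = [i for i in range(1, len(speed) - 1) if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]]
    return np.array(idx) / fps  # beat times in seconds

def warp_to_music(joints: np.ndarray, fps: float, music_beats: np.ndarray) -> np.ndarray:
    """Piecewise-linearly retime the motion so its beats land on the music beats."""
    m_beats = motion_beats(joints, fps)
    k = min(len(m_beats), len(music_beats))
    src = np.concatenate(([0.0], m_beats[:k]))   # motion beat times
    dst = np.concatenate(([0.0], music_beats[:k]))  # target (music) beat times
    t_out = np.arange(int(dst[-1] * fps)) / fps  # new timeline
    t_src = np.interp(t_out, dst, src)           # map each output time back to a source time
    frames = np.clip((t_src * fps).astype(int), 0, len(joints) - 1)
    return joints[frames]
```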

Moreover, there is a growing focus on developing versatile motion language models capable of handling multi-turn interactive scenarios. These models, such as VIM, integrate language and motion modalities to understand, generate, and control interactive motions in conversational contexts, leveraging synthetic datasets to address the scarcity of real-world data.
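As a hedged illustration of how discrete motion tokens might be interleaved with text in a multi-turn prompt, the sketch below assumes motion has already been quantized into codebook indices by a VQ tokenizer; the token format and conversation schema are invented here, not VIM's actual interface.

```python
# Rough sketch (assumptions throughout): serialize VQ motion codes as special
# tokens and flatten a multi-turn conversation into one sequence.
from typing import List

MOTION_BOS, MOTION_EOS = "<motion>", "</motion>"

def motion_to_tokens(code_indices: List[int]) -> str:
    """Serialize VQ codebook indices as special tokens, e.g. <m_17>."""
    return MOTION_BOS + "".join(f"<m_{i}>" for i in code_indices) + MOTION_EOS

def build_prompt(turns: List[dict]) -> str:
    """Flatten a conversation into one token sequence the model can attend over.

    Each turn is {"role": "user"|"assistant", "text": str, "motion": list[int] | None}.
    """
    parts = []
    for t in turns:
        segment = t["text"]
        if t.get("motion"):
            segment += " " + motion_to_tokens(t["motion"])
        parts.append(f'{t["role"]}: {segment}')
    return "\n".join(parts)

# Example: the user refers back to a previously generated motion.
history = [
    {"role": "user", "text": "Show me a short wave.", "motion": None},
    {"role": "assistant", "text": "Here is a wave.", "motion": [17, 42, 42, 3]},
    {"role": "user", "text": "Now do it while walking forward.", "motion": None},
]
print(build_prompt(history))
```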

Noteworthy Papers

  1. MotionBase: Introduces a million-level motion generation benchmark, emphasizing the importance of scaling data and model size.
  2. TANGO: Enhances co-speech gesture video reenactment with hierarchical audio-motion embedding and diffusion interpolation.
  3. UniMuMo: Unifies text, music, and motion generation within a single multimodal framework, reducing computational demands.
  4. VIM: Integrates language and motion modalities for versatile interactive motion synthesis in multi-turn contexts.
  5. M^3Bench: Proposes a benchmark for whole-body motion generation in mobile manipulation tasks, highlighting the need for more adaptive models.
  6. LaMP: Introduces a Language-Motion Pretraining model, advancing text-to-motion generation, motion-text retrieval, and motion captioning.
  7. TextToon: Generates real-time toonified avatars from single videos, enhancing stylization and real-time animation capabilities.
  8. PedGen: Learns diverse pedestrian movements from web videos with noisy labels, incorporating context factors for realistic generation.
  9. Hallo2: Extends latent diffusion-based models for long-duration and high-resolution audio-driven portrait image animation.

Sources

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

UniMuMo: Unified Text, Music and Motion Generation

Versatile Motion Language Models for Multi-Turn Interactive Agents

Towards a GENEA Leaderboard -- an Extended, Living Benchmark for Evaluating and Advancing Conversational Motion Synthesis

M^3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

TextToon: Real-Time Text Toonify Head Avatar from Single Video

Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
