Generative Models in Human Motion and Image Editing

Report on Current Developments in the Research Area

General Direction of the Field

Recent advances in this area focus predominantly on leveraging generative models, particularly diffusion models, for complex tasks in human motion analysis, image editing, and sequence generation. The field is shifting toward more controllable and interpretable models, with a strong emphasis on real-time interactivity and semantic consistency. Progress is driven by the integration of large language models (LLMs) and advanced neural network architectures, which enable more sophisticated and efficient solutions.

One of the key trends is the development of models that can generate synchronized text and motion, which is crucial for applications like sign language transcription and action segmentation. These models are designed to control attention mechanisms within transformers to ensure that text generation is aligned with motion sequences, thereby enhancing interpretability and accuracy.
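To make the idea concrete, the sketch below illustrates one simple way attention can be constrained for time-aligned captioning; it is an assumption-laden illustration, not the mechanism from the cited paper. The function name `windowed_cross_attention`, the `centers` alignment, and the window `width` are all hypothetical.

```python
# Minimal sketch (assumed, not the paper's exact mechanism): cross-attention whose
# weights are masked to a per-token temporal window over the motion frames, so each
# generated word can only attend to the motion segment it is meant to describe.
import torch
import torch.nn.functional as F

def windowed_cross_attention(text_q, motion_kv, centers, width=8):
    """text_q: (T_text, d) queries from the text decoder.
    motion_kv: (T_motion, d) encoded motion frames (keys = values here).
    centers: (T_text,) motion-frame index each token is expected to align with.
    width: half-size of the allowed temporal window (assumed hyperparameter)."""
    d = text_q.size(-1)
    scores = text_q @ motion_kv.T / d**0.5              # (T_text, T_motion)
    frames = torch.arange(motion_kv.size(0))
    # Mask out motion frames outside each token's temporal window.
    mask = (frames[None, :] - centers[:, None]).abs() > width
    scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)                     # time-localized weights
    return attn @ motion_kv, attn

# Toy usage: 5 text tokens attending over 40 motion frames.
q = torch.randn(5, 64)
kv = torch.randn(40, 64)
centers = torch.tensor([4, 12, 20, 28, 36])
context, attn = windowed_cross_attention(q, kv, centers)
```

Restricting each token's attention to a local window is what makes the resulting attention maps interpretable as word-to-segment alignments.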

Another significant trend is the improvement of interactive image editing techniques. Researchers are exploring ways to enhance the speed and precision of drag-based image editing, making it more suitable for real-time applications. This is being achieved through the design of optimization-free pipelines that rely on optical flow and diffusion models to accurately reflect user interactions while maintaining image content.
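A minimal sketch of such a pipeline is shown below, assuming the drag is given as a source and a target point: the drag vector is spread into a dense flow field, the image is warped by that flow, and a diffusion model would then clean up the warped result. The helpers `drag_to_dense_flow` and `warp` are illustrative, and `refine_with_diffusion` is a hypothetical placeholder for any image-to-image diffusion model; this is not InstantDrag's actual pipeline.

```python
# Hedged sketch of an optimization-free drag edit: propagate one drag vector into a
# dense flow field, warp the image with it, then hand the coarse edit to a diffusion
# refiner (placeholder). All names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def drag_to_dense_flow(h, w, src, dst, sigma=30.0):
    """Spread the drag vector (dst - src) over the image with a Gaussian falloff."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dist2 = (ys - src[1])**2 + (xs - src[0])**2
    weight = torch.exp(-dist2 / (2 * sigma**2))          # (h, w)
    flow = torch.stack([(dst[0] - src[0]) * weight,       # x displacement
                        (dst[1] - src[1]) * weight])      # y displacement
    return flow                                           # (2, h, w), in pixels

def warp(image, flow):
    """Backward-warp `image` (1, C, H, W) by `flow` (2, H, W) given in pixels."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    # Each output pixel samples from the location its content came from.
    grid_x = (xs - flow[0]) / (w - 1) * 2 - 1
    grid_y = (ys - flow[1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)
    return F.grid_sample(image, grid, align_corners=True)

image = torch.rand(1, 3, 256, 256)
flow = drag_to_dense_flow(256, 256, src=(100.0, 120.0), dst=(140.0, 120.0))
coarse_edit = warp(image, flow)
# result = refine_with_diffusion(coarse_edit)  # hypothetical diffusion cleanup step
```

Because no per-image optimization loop is involved, latency is dominated by a single flow estimate and a single diffusion pass, which is what makes near-real-time interaction plausible.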

The field is also seeing advancements in the generation of complex 3D human motions. Techniques are being developed to decompose complex actions into simpler movements and then recompose them using diffusion models, allowing for the synthesis of realistic animations for unseen action classes. This approach leverages the knowledge of human motion contained in GPT models and can be integrated with any pre-trained diffusion model.
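The sketch below illustrates the temporal half of this decompose-then-recompose idea under simplifying assumptions: each simple sub-action is sampled from a pre-trained text-to-motion model (represented here by a hypothetical `sample_motion` stub) and consecutive clips are crossfaded into one sequence. The spatial composition over body parts is omitted, and the decomposition itself would come from an LLM prompt rather than the hard-coded list shown.

```python
# Hedged sketch of temporal composition: sample each simple sub-action from a
# pre-trained text-to-motion diffusion model (stubbed out here), then crossfade
# consecutive clips into one longer sequence. Illustrative only.
import torch

def sample_motion(prompt: str, frames: int = 60, joints: int = 22) -> torch.Tensor:
    """Placeholder for a pre-trained diffusion sampler; returns (frames, joints, 3)."""
    return torch.randn(frames, joints, 3)

def crossfade(a: torch.Tensor, b: torch.Tensor, overlap: int = 10) -> torch.Tensor:
    """Linearly blend the last `overlap` frames of `a` with the first of `b`."""
    w = torch.linspace(0, 1, overlap).view(-1, 1, 1)
    blended = (1 - w) * a[-overlap:] + w * b[:overlap]
    return torch.cat([a[:-overlap], blended, b[overlap:]], dim=0)

# A complex action decomposed (e.g., by an LLM) into simpler movements that a
# pre-trained text-to-motion model is more likely to have seen during training.
sub_actions = ["bend down and grasp", "stand up holding object", "raise arms to shelf"]
clips = [sample_motion(p) for p in sub_actions]

sequence = clips[0]
for clip in clips[1:]:
    sequence = crossfade(sequence, clip)
print(sequence.shape)  # (160, 22, 3) after two 10-frame crossfades of 60-frame clips
```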

Noteworthy Innovations

  1. Transformer with Controlled Attention for Synchronous Motion Captioning: This approach introduces mechanisms to control attention distributions in transformers, enabling time-aligned text generation synchronized with human motion sequences.

  2. InstantDrag: This optimization-free pipeline enhances interactivity and speed in drag-based image editing, requiring only an image and a drag instruction as input.

  3. DreamMover: This framework leverages diffusion models for image interpolation with large motion, ensuring semantic consistency by fusing information in high-level and low-level spaces.

  4. RNAdiffusion: This latent diffusion model for RNA sequence generation optimizes RNA sequences for higher rewards, holding promise for studies on RNA sequence-function relationships and therapeutic RNA design.

  5. MacDiff: This unified skeleton modeling framework leverages diffusion models for effective skeleton representation learning, enhancing fine-tuning performance in scenarios with scarce labeled data.

  6. MotionCom: This training-free motion-aware diffusion-based image composition method enables automatic and seamless integration of target objects into new scenes with dynamically coherent results.

  7. BAD: This bidirectional autoregressive diffusion model unifies the strengths of autoregressive and mask-based generative models, outperforming existing models in text-to-motion generation.

  8. MoRAG: This multi-fusion retrieval-augmented generation strategy enhances motion diffusion models through an improved motion retrieval process, leading to better motion generation.

  9. PoseDiffusion: This generative framework, combining a diffusion model with graph convolutional networks, generates pose skeletons from text with improved stability and diversity, outperforming existing state-of-the-art algorithms.

  10. Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models: This method synthesizes realistic 3D human motions for unseen action classes by decomposing complex actions into simpler movements and recomposing them using diffusion models.

Sources

Transformer with Controlled Attention for Synchronous Motion Captioning

InstantDrag: Improving Interactivity in Drag-based Image Editing

DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Latent Diffusion Models for Controllable RNA Sequence Generation

MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows

MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior

BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion

GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models
