Advancements in Multimodal Audio-Video Generation

The field of multimodal generation, particularly the synthesis of audio and video from textual descriptions, is advancing rapidly. Researchers are increasingly focusing on generating temporally synchronized audio and video simultaneously, moving away from cascaded pipelines in which one modality is produced first and information is lost when conditioning the other. Innovations in joint training frameworks and the integration of semantic guidance are improving both the quality and the alignment of generated content. There is also a notable emphasis on modeling the relations between audio events and on leveraging large language models to generate symbolic music from text. Together, these developments improve the fidelity and synchronization of generated media while expanding models' ability to handle novel tasks without additional training.
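To make the cascaded-versus-joint distinction concrete, below is a minimal, illustrative PyTorch sketch of joint denoising: two modality-specific transformer streams exchange information through cross-attention under a single shared diffusion timestep, rather than generating video first and synthesizing audio from it afterwards. This is a generic sketch of the idea, not the architecture of any paper listed here; all class names, dimensions, and parameters are assumptions.

```python
# Illustrative joint audio-video denoiser (hypothetical names and shapes).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)

    def forward(self, x, other):
        # Attend within the modality, then to the other modality's tokens.
        h = self.norm_self(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm_cross(x), other, other)[0]
        return x + self.ff(self.norm_ff(x))

class JointDenoiser(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 4):
        super().__init__()
        self.audio_blocks = nn.ModuleList([CrossModalBlock(dim) for _ in range(depth)])
        self.video_blocks = nn.ModuleList([CrossModalBlock(dim) for _ in range(depth)])
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, audio_tokens, video_tokens, t):
        # A single shared timestep embedding conditions both streams, keeping
        # their denoising trajectories temporally coupled instead of cascaded.
        temb = self.time_embed(t.float().view(-1, 1, 1))
        a, v = audio_tokens + temb, video_tokens + temb
        for a_blk, v_blk in zip(self.audio_blocks, self.video_blocks):
            a, v = a_blk(a, v), v_blk(v, a)
        return a, v  # per-modality noise (or velocity) predictions

# One joint denoising step on dummy latents: batch of 2, 32 audio / 16 video tokens.
model = JointDenoiser()
eps_a, eps_v = model(torch.randn(2, 32, 512), torch.randn(2, 16, 512), torch.tensor([10, 10]))
```

Because both streams see the same timestep and attend to each other at every layer, synchronization is enforced throughout sampling rather than patched on after one modality is already fixed.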

Noteworthy Papers

  • SyncFlow: Introduces a dual-diffusion-transformer architecture for joint audio-video generation, demonstrating superior audio quality and synchronization.
  • MMAudio: Proposes a multimodal joint training framework for video-to-audio synthesis, achieving state-of-the-art results in audio quality and synchronization.
  • RiTTA: Focuses on modeling audio event relations in text-to-audio generation, introducing a comprehensive benchmark and finetuning framework.
  • Text2midi: Leverages large language models to generate MIDI files from textual descriptions, streamlining the music creation process (a MIDI-rendering sketch follows this list).
  • Smooth-Foley: Enhances video-to-audio generation with semantic guidance, improving audio-video alignment and sound quality.
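As a concrete illustration of the symbolic-music direction, the sketch below assumes a language model has already mapped a caption to a list of (pitch, duration-in-beats) events and shows only how such events could be rendered to a standard MIDI file with the mido library. The event format and the rendering function are illustrative assumptions, not Text2midi's actual tokenization or API.

```python
# Hypothetical rendering stage for a caption-to-MIDI pipeline.
import mido

def events_to_midi(events, path="generated.mid", tempo_bpm=120, ticks_per_beat=480):
    """Write a list of (pitch, duration_in_beats) events to a single-track MIDI file."""
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage("set_tempo", tempo=mido.bpm2tempo(tempo_bpm)))
    for pitch, beats in events:
        ticks = int(beats * ticks_per_beat)
        track.append(mido.Message("note_on", note=pitch, velocity=80, time=0))
        track.append(mido.Message("note_off", note=pitch, velocity=0, time=ticks))
    mid.save(path)

# e.g. a C-major arpeggio that a caption-conditioned model might emit
events_to_midi([(60, 0.5), (64, 0.5), (67, 0.5), (72, 1.0)])
```

The symbolic intermediate is what makes this route attractive: the language model only has to produce discrete note events, and standard MIDI tooling handles timing and playback.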

Sources

SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

RiTTA: Modeling Event Relations in Text-to-Audio Generation

Text2midi: Generating Symbolic Music from Captions

Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
