Audio-Visual Generation and Sound Morphing

Report on Current Developments in Audio-Visual Generation and Sound Morphing

General Direction of the Field

Recent work in audio-visual generation and sound morphing is marked by a shift toward more efficient, scalable, and perceptually accurate models. Researchers are increasingly focused on frameworks that not only improve the quality of generated audio and video but also address the computational cost of processing high-dimensional data. The integration of diffusion models, transformers, and large language models (LLMs) is a common theme, enabling more sophisticated and controllable generation.

One key trend is the optimization of model architectures to reduce parameter counts and memory consumption without compromising performance. This is particularly evident in models that use temporal-aware masking strategies and redundant-feature removal to improve efficiency. There is also a growing emphasis on multi-modal learning, in which audio and visual data are processed jointly to produce more coherent and contextually relevant outputs.
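As a minimal illustration of the masking idea, the sketch below biases token masking toward temporally redundant latent tokens. The frame-to-frame-change scoring heuristic, the keep_ratio value, and the tensor shapes are assumptions for illustration only, not the MDSGen recipe.

```python
import numpy as np

def temporal_aware_mask(tokens, keep_ratio=0.3, seed=0):
    """Keep the most temporally informative latent tokens; mask the rest.

    tokens: array of shape (T, D) -- T time steps of D-dimensional latents.
    Returns (kept_tokens, keep_idx, mask_idx).
    """
    rng = np.random.default_rng(seed)
    # Frame-to-frame change as a crude redundancy score: tokens that barely
    # change over time are treated as redundant and masked preferentially.
    change = np.abs(np.diff(tokens, axis=0, prepend=tokens[:1])).mean(axis=1)
    scores = change + 1e-3 * rng.random(len(change))  # break ties randomly
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    mask_idx = np.setdiff1d(np.arange(len(tokens)), keep_idx)
    return tokens[keep_idx], keep_idx, mask_idx

# Toy usage: 128 time steps of 64-dimensional audio latents.
latents = np.random.randn(128, 64).astype(np.float32)
kept, keep_idx, mask_idx = temporal_aware_mask(latents, keep_ratio=0.3)
print(kept.shape, mask_idx.shape)  # (38, 64) (90,)
```

Masking more of the redundant tokens shrinks the sequence the transformer must process, which is one route to the lower memory use and faster inference described above.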

Another notable direction is the exploration of unpaired and unlabelled data for training generative models, which opens up new possibilities for applications where paired datasets are scarce or non-existent. This approach allows for more flexible and scalable solutions, as demonstrated by models that can generate background music for videos without relying on paired audio-visual data.
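One way to realize such an unpaired pipeline is to bridge video and music through text: tags inferred from the video condition a music generator trained only on music data. The sketch below shows that structure with clearly hypothetical stand-ins (VideoTagger, TextToMusic); it mirrors the spirit of SONIQUE-style systems rather than any specific implementation.

```python
from typing import List

# Hypothetical stand-ins (not real APIs): a video tagger and a text-
# conditioned music generator trained only on music/tag pairs, so no
# paired audio-visual data is ever required.

class VideoTagger:
    def tags(self, video_path: str) -> List[str]:
        # Placeholder: a real system would caption / tag sampled frames
        # and map the captions to musical descriptors.
        return ["calm", "outdoor scene", "slow tempo", "acoustic"]

class TextToMusic:
    def generate(self, tags: List[str], duration_s: float) -> bytes:
        # Placeholder: a real model would return a waveform conditioned
        # on the tag prompt for the requested duration.
        return b"\x00" * int(duration_s)  # dummy payload

def background_music_for(video_path: str, duration_s: float = 30.0) -> bytes:
    """Bridge video and music through text tags instead of paired data."""
    tags = VideoTagger().tags(video_path)
    return TextToMusic().generate(tags, duration_s)

print(len(background_music_for("clip.mp4", duration_s=15.0)))  # 15
```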

Perceptual uniformity in sound morphing is also gaining attention, with researchers developing methods that ensure smoother transitions and more consistent transformations between intermediate sounds. This is achieved by explicitly mapping morph factors to perceptual stimuli, leading to more natural and intuitive sound morphing experiences.
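A minimal sketch of the perceptual-uniformity idea follows: sample a morph trajectory at many raw interpolation factors, measure the cumulative perceptual change along it, and invert that curve so equal steps in the requested morph factor produce roughly equal perceptual steps. The log-spectral distance proxy and the waveform cross-fade below are illustrative assumptions; SoundMorpher works with diffusion-model representations and a proper perceptual metric.

```python
import numpy as np

def perceptual_distance(a, b):
    """Toy perceptual proxy: L2 distance between log-magnitude spectra."""
    A = np.log1p(np.abs(np.fft.rfft(a)))
    B = np.log1p(np.abs(np.fft.rfft(b)))
    return float(np.linalg.norm(A - B))

def uniform_morph_factors(x0, x1, morph, n_candidates=51, n_targets=11):
    """Remap raw interpolation factors so perceptual change is ~linear.

    morph(x0, x1, alpha) produces an intermediate sound for raw factor alpha.
    Returns one raw alpha per evenly spaced perceptual target in [0, 1].
    """
    alphas = np.linspace(0.0, 1.0, n_candidates)
    sounds = [morph(x0, x1, a) for a in alphas]
    # Cumulative perceptual path length along the raw-alpha trajectory.
    steps = [perceptual_distance(sounds[i], sounds[i + 1])
             for i in range(n_candidates - 1)]
    cum = np.concatenate([[0.0], np.cumsum(steps)])
    cum /= cum[-1]
    targets = np.linspace(0.0, 1.0, n_targets)
    # Invert the monotone cumulative curve: perceptual position -> raw alpha.
    return np.interp(targets, cum, alphas)

# Toy example: the "morph" is a plain waveform cross-fade between two tones.
t = np.linspace(0, 1, 16000, endpoint=False)
x0 = np.sin(2 * np.pi * 220 * t)
x1 = np.sin(2 * np.pi * 880 * t)
crossfade = lambda a, b, alpha: (1 - alpha) * a + alpha * b
print(np.round(uniform_morph_factors(x0, x1, crossfade), 3))
```

The returned factors are spaced unevenly in raw alpha but evenly in the perceptual proxy, which is what makes a sequence of intermediate sounds feel like a steady transition.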

Noteworthy Innovations

  • MDSGen: Introduces a novel framework for efficient open-domain sound generation, achieving high accuracy with significantly fewer parameters and faster inference times.
  • SoundMorpher: Develops a perceptually uniform sound morphing method using diffusion models, ensuring smoother transitions and consistent transformations.
  • MM-LDM: Proposes a multi-modal latent diffusion model for sounding video generation, achieving state-of-the-art results with improved quality and efficiency.
  • Audio-Agent: Leverages LLMs for high-quality audio generation and editing, offering a comprehensive solution for text-to-audio and video-to-audio tasks.
  • SONIQUE: Enables customizable background music generation for videos using unpaired data, providing a flexible and scalable solution.
  • SRC-gAudio: Introduces a model for sampling-rate-controlled audio generation, demonstrating improvements in audio quality across various metrics.
  • Language-Guided Joint Audio-Visual Editing: Proposes a diffusion-based framework for joint audio-visual editing, achieving superior results in language-based content generation.

Sources

MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data

SRC-gAudio: Sampling-Rate-Controlled Audio Generation

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
