Music Information Retrieval and Audio Generation

Report on Recent Developments in the Field of Music Information Retrieval and Audio Generation

General Trends and Innovations

The field of Music Information Retrieval (MIR) and audio generation is shifting toward more sophisticated, fine-grained control over audio content and style. Recent advances are characterized by large, diverse datasets that serve a wide range of tasks, from singer diarization to singing voice synthesis (SVS). These datasets are growing not only in size but also in quality, with meticulous annotations that support more accurate and nuanced models.

A key direction is the creation of specialized corpora that capture the characteristics of specific musical genres or styles. These corpora address the limitations of existing datasets, which often lack diversity in languages, singing techniques, and realistic music scores. Such datasets are paving the way for more controllable and personalized singing tasks, enabling models to generate high-quality singing voices with unseen timbres and styles.

Another notable trend is the integration of natural language descriptions (NLDs) into audio generation frameworks. NLDs allow far more precise specification of the content and style of generated audio than the coarse text captions used in traditional Text-to-Audio (TTA) models, improving both the quality and the controllability of the output.
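To illustrate the kind of fine-grained control an NLD affords over a coarse caption, consider parsing a structured description into per-event control tokens. The clause format and field names below are hypothetical, chosen for illustration; they are not AudioComposer's actual interface:

```python
import re

def parse_nld(description: str):
    """Parse a toy natural-language description into per-event controls.

    Expected clause form (hypothetical):
    "<content> in a <style> style from <t0>s to <t1>s".
    A coarse TTA caption collapses everything into one prompt; here each
    clause becomes a separate (content, style, start, end) control token
    that a generator could condition on independently.
    """
    pattern = re.compile(
        r"(?P<content>[\w\s]+?) in a (?P<style>[\w\s]+?) style "
        r"from (?P<start>[\d.]+)s to (?P<end>[\d.]+)s"
    )
    return [
        {
            "content": m.group("content").strip(),
            "style": m.group("style").strip(),
            "start": float(m.group("start")),
            "end": float(m.group("end")),
        }
        for m in pattern.finditer(description)
    ]

controls = parse_nld(
    "a church bell rings in a reverberant style from 0.0s to 2.5s, "
    "birdsong in a bright style from 2.5s to 6.0s"
)
```

Each resulting control token pins content, style, and timing for one event, which is precisely the granularity a single free-form caption cannot guarantee.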

The field is also seeing advances in unified models that handle multiple tasks simultaneously. Models that transcribe lyrics and notes while aligning them are particularly useful for singing voice synthesis, since they remove the need for pre-processed data and directly address the difficulty of keeping lyrics and notes in sync.
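A core subproblem such a unified model must solve is aligning a lyric sequence to a note sequence. As a minimal sketch of the idea (a generic DTW-style dynamic program over onset times, not SongTrans's actual method), a monotonic one-note-per-syllable alignment can be computed as follows:

```python
def align(syllable_onsets, note_onsets):
    """Monotonically assign each syllable onset to one note onset.

    cost[i][j] = best total onset distance when syllable i is assigned to
    note j and earlier syllables use strictly earlier notes (notes may be
    skipped, e.g. melisma). Returns one note index per syllable.
    """
    n, m = len(syllable_onsets), len(note_onsets)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    for j in range(m):
        cost[0][j] = abs(syllable_onsets[0] - note_onsets[j])
    for i in range(1, n):
        best, best_k = INF, -1
        for j in range(i, m):
            # Running prefix-min over cost[i-1][0..j-1] keeps this O(n*m).
            if cost[i - 1][j - 1] < best:
                best, best_k = cost[i - 1][j - 1], j - 1
            cost[i][j] = abs(syllable_onsets[i] - note_onsets[j]) + best
            back[i][j] = best_k
    # Backtrack from the cheapest note for the final syllable.
    j = min(range(m), key=lambda c: cost[n - 1][c])
    path = [0] * n
    for i in range(n - 1, -1, -1):
        path[i] = j
        j = back[i][j]
    return path
```

For example, `align([0.0, 1.0, 2.1], [0.05, 0.5, 1.02, 2.0, 3.0])` skips the second and last notes, matching each syllable to its nearest note while preserving order.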

Noteworthy Papers

  • AudioComposer: Introduces a novel TTA generation framework that leverages NLDs for fine-grained audio generation, surpassing state-of-the-art models in quality and controllability.
  • GTSinger: Presents a large, high-quality singing corpus with diverse languages and singing techniques, facilitating advanced singing tasks and benchmarks.
  • StyleSinger 2: Pioneers zero-shot SVS with style transfer and multi-level style control, outperforming baseline models in synthesis quality and style controllability.

Sources

FruitsMusic: A Real-World Corpus of Japanese Idol-Group Songs

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

SongTrans: An unified song transcription and alignment method for lyrics and notes

StyleSinger 2: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
