Singing Voice Conversion (SVC)

Report on Recent Developments in Singing Voice Conversion (SVC)

General Direction of the Field

The field of Singing Voice Conversion (SVC) is shifting toward greater robustness, fidelity, and versatility. Researchers are increasingly focusing on methods that can handle noisy input conditions, accelerate training, and integrate advanced pre-trained models to improve the quality and naturalness of converted singing voices. Self-supervised learning (SSL) models, adversarial training, and discrete representations are becoming prominent, offering new paradigms that promise to advance the state of the art in SVC.

One of the key trends is the development of noise-robust SVC systems that can effectively process and convert singing voices even in the presence of background noise or music. This is particularly important for real-world applications where the source audio may not be perfectly clean. Techniques that incorporate adversarial learning and robust feature extractors, such as HuBERT-based melody extractors, are being explored to mitigate the impact of noise and improve the similarity and naturalness of the converted voices.
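A common ingredient of such noise-robust training is augmenting clean singing data with background noise or music at controlled signal-to-noise ratios, so the model learns features that survive realistic corruption. The sketch below is a generic SNR-controlled mixing utility, not code from RobustSVC itself; the function name and interface are illustrative.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix background noise into a clean waveform at a target SNR (dB).

    Both inputs are 1-D float arrays; `noise` is tiled/cropped to match
    the clean signal's length, then scaled so that
    10 * log10(P_clean / P_noise) equals `snr_db`.
    (Illustrative augmentation step for noise-robust SVC training.)
    """
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silence
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Training on mixtures drawn over a range of SNRs (e.g. 0 to 20 dB) is what forces the content/melody extractor to become insensitive to the interference.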

Another notable direction is the pursuit of high-fidelity singing voice generation with faster training times. Researchers are experimenting with novel neural vocoders that combine differentiable digital signal processing with adversarial training to achieve high-quality voice synthesis with significantly reduced training steps. These advancements are crucial for making SVC more accessible and practical for real-time applications.
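The differentiable-DSP component these vocoders build on is typically a harmonic oscillator bank: the network predicts a fundamental frequency contour and per-harmonic amplitudes, and the audio is synthesized as a sum of sinusoids. The following is a minimal numpy sketch of that synthesis step with fixed (non-learned) controls; it is not the InstructSing implementation.

```python
import numpy as np

def harmonic_synth(f0, amps, sr=16000):
    """DDSP-style harmonic oscillator bank (numpy sketch).

    f0:   per-sample fundamental frequency in Hz, shape (T,)
    amps: per-sample amplitude of each harmonic, shape (T, K)
    Returns the summed sinusoids, shape (T,).

    In a real DDSP vocoder these controls are predicted by a neural
    network and the whole graph is trained end-to-end with adversarial
    losses; here the controls are simply given.
    """
    T, K = amps.shape
    k = np.arange(1, K + 1)                          # harmonic numbers 1..K
    phase = 2 * np.pi * np.cumsum(f0) / sr           # instantaneous phase
    audible = (f0[:, None] * k[None, :]) < sr / 2    # mute aliased harmonics
    return np.sum(audible * amps * np.sin(phase[:, None] * k[None, :]), axis=1)
```

Because the oscillator encodes strong priors about pitched audio, the adversarial discriminator only has to correct fine spectral detail, which is one reason such hybrids converge in far fewer training steps than purely neural vocoders.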

The integration of pretrained audio models, whether through continuous features or discrete tokens, is also gaining traction. These models make pipelines more versatile, enabling adaptable data processing workflows and multi-format inputs. This trend is exemplified by toolkits that provide comprehensive solutions for singing voice synthesis (SVS), including automatic error detection and correction, and perception-based auto-evaluation modules.

Finally, there is growing interest in zero-shot SVC methods that perform voice conversion without paired training data. These approaches leverage clustering-based phoneme representations to separate content, timbre, and singing style, thereby enabling precise manipulation of voice characteristics. This direction is particularly promising for scenarios where large-scale paired datasets are not available.
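The core of such clustering-based representations is quantizing continuous SSL frame features into discrete, phoneme-like units: each frame is replaced by the index of its nearest cluster centroid, which tends to preserve content while discarding speaker timbre. A minimal k-means sketch (with a simple farthest-point initialization; all names are illustrative, not from the cited paper):

```python
import numpy as np

def kmeans_units(feats, n_units=4, n_iter=20):
    """Quantize continuous frame features into discrete unit IDs.

    feats: (T, D) array of SSL-style frame features.
    Returns (unit_ids, centroids): each frame mapped to its nearest
    centroid index, a crude 'clustering-based phoneme representation'.
    """
    # Farthest-point initialization: spread the initial centroids out.
    centroids = [feats[0]]
    for _ in range(n_units - 1):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in centroids], axis=0)
        centroids.append(feats[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    # Standard Lloyd iterations: assign frames, then recompute means.
    for _ in range(n_iter):
        d = np.linalg.norm(feats[:, None] - centroids[None], axis=-1)
        ids = d.argmin(axis=1)
        for u in range(n_units):
            if np.any(ids == u):
                centroids[u] = feats[ids == u].mean(axis=0)
    return ids, centroids
```

In a full zero-shot SVC system the discrete unit sequence carries the content, while timbre and singing style are supplied by separate encoders at conversion time.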

Noteworthy Innovations

  • RobustSVC: Introduces a noise-robust SVC framework using HuBERT-based melody extraction and adversarial training, significantly improving similarity and naturalness in noisy conditions.
  • InstructSing: Proposes a high-fidelity neural vocoder that converges faster while maintaining quality, achieving comparable performance to state-of-the-art methods with only a fraction of the training steps.
  • Zero-Shot Sing Voice Conversion: Develops a zero-shot SVC method based on clustering-based phoneme representations, enhancing sound quality and timbre accuracy without paired training data.

Sources

RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations