Advancements in Audio and Music Processing: A Focus on Machine Learning Models and Frameworks

Recent developments in audio and music processing research highlight a significant shift toward advanced machine learning models and frameworks for complex tasks such as audio classification, music source separation, pitch estimation, and symbolic music analysis. Innovations in graph neural networks (GNNs), unsupervised learning, and self-supervised learning (SSL) are at the forefront, offering new ways to capture higher-order relationships in audio data, improve performance with limited labeled data, and deepen the understanding of musical structure and emotion. The introduction of comprehensive datasets and novel frameworks for music generation and analysis further underscores the field's move toward more generalized and controllable solutions. These advances not only push the boundaries of what is possible in audio and music processing but also open new avenues for interdisciplinary research, bridging technology, art, and the biomedical sciences.

Noteworthy Papers

  • LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging: Introduces a graph-based model that outperforms Transformer-based models on audio classification tasks, especially when extensive pretraining data is unavailable (a generic message-passing sketch follows this list).
  • MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation: Proposes a framework that significantly improves both music source separation and pitch estimation by tackling the scarcity of labeled data and the difficulty of optimizing the two tasks jointly (see the multi-task loss sketch after this list).
  • Unsupervised Speech Segmentation: A General Approach Using Speech Language Models: Offers a novel unsupervised method for speech segmentation that handles multiple acoustic-semantic style changes, outperforming traditional methods.
  • Guitar-TECHS: An Electric Guitar Dataset Covering Techniques, Musical Excerpts, Chords and Scales Using a Diverse Array of Hardware: Introduces a comprehensive dataset that advances data-driven guitar research by providing a wide spectrum of audio inputs and recording qualities.
  • MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge: Bridges speech technology and biomedical research by demonstrating the feasibility of automated autism spectrum disorder (ASD) detection in mice through vocalization analysis.
  • Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition: Proposes a novel approach to improve phonetic discrimination in dysarthric speech recognition, achieving significant word error rate reductions.
  • Evaluating Interval-based Tokenization for Pitch Representation in Symbolic Music Analysis: Introduces a framework for interval-based tokenizations that improves model performance and explainability in symbolic music analysis tasks (a minimal tokenization sketch follows this list).
  • Planing It by Ear: Convolutional Neural Networks for Acoustic Anomaly Detection in Industrial Wood Planers: Explores deep convolutional autoencoders for acoustic anomaly detection, showing superior performance in real-life industrial settings (see the autoencoder sketch after this list).
  • Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels: Proposes a novel framework that maintains high GNN performance in low-label settings for audio deepfake detection, demonstrating strong cross-domain generalization.
  • Music Tagging with Classifier Group Chains: Introduces a method that models the interplay of music tags, improving tagging performance by considering conditional dependencies among tags (a classifier-chain sketch follows this list).
  • Music and art: a study in cross-modal interpretation: Investigates the effect of music on the experience of viewing art, proposing guidelines for using music to enhance art appreciation.
  • Towards Early Prediction of Self-Supervised Speech Model Performance: Proposes unsupervised methods for early prediction of SSL speech model performance, reducing the need for GPU hours and labeled data.
  • Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing: Highlights the robustness of speech-pretrained SSL models for bioacoustics, suggesting extensive fine-tuning may not be necessary for optimal performance.
  • Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks: Evaluates deep learning algorithms for automatic vocal tract segmentation from 3D MRI, aiming to reduce manual segmentation effort.
  • Sanidha: A Studio Quality Multi-Modal Dataset for Carnatic Music: Introduces a novel dataset for Carnatic music, improving source separation models' performance through fine-tuning.
  • Estimating Musical Surprisal in Audio: Investigates information content as a proxy for musical surprisal in audio, showing it correlates with human perception of surprise and complexity (see the information-content sketch after this list).
  • Decoding Musical Evolution Through Network Science: Uses network science to analyze musical complexity, revealing trends toward simplification and homogenization in modern genres (a transition-network sketch follows this list).
  • XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework: Presents a framework for generating emotionally controllable and high-quality symbolic music, significantly outperforming current state-of-the-art methods.
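
A few of the techniques above lend themselves to short illustrative sketches. For the graph-based audio modeling behind LHGNN, the snippet below shows only the generic idea: build a k-nearest-neighbour graph over spectrogram patch embeddings and apply one GCN-style message-passing layer. All shapes and names are my own assumptions, not the paper's architecture.

```python
import numpy as np

def knn_adjacency(x, k=8):
    """Symmetric k-NN adjacency over row vectors of x (n_patches, dim)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # no self-neighbours
    a = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]          # k nearest per patch
    rows = np.repeat(np.arange(len(x)), k)
    a[rows, idx.ravel()] = 1.0
    return np.maximum(a, a.T)                   # make the graph undirected

def gcn_layer(x, a, w):
    """One GCN-style propagation: H = ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    a_hat = a + np.eye(len(a))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)

# toy example: 64 spectrogram patches with 32-dim embeddings
rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 32))
adj = knn_adjacency(patches, k=8)
hidden = gcn_layer(patches, adj, rng.normal(size=(32, 16)) * 0.1)
clip_embedding = hidden.mean(axis=0)            # pooled representation for a classifier
```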
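For MAJL-style joint learning, a common pattern is a shared encoder with task-specific heads trained under a weighted multi-task loss. The sketch below assumes a spectral-mask separation head, a framewise pitch classifier, and a balancing weight lam; this is a minimal stand-in, not the paper's actual objective or architecture.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Shared encoder with separate heads for separation and pitch.
    Purely illustrative; MAJL is model-agnostic, so any backbone fits."""
    def __init__(self, n_bins=513, n_pitches=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.sep_head = nn.Sequential(nn.Linear(256, n_bins), nn.Sigmoid())  # spectral mask
        self.pitch_head = nn.Linear(256, n_pitches)                          # pitch logits

    def forward(self, mix_spec):
        h = self.encoder(mix_spec)
        return self.sep_head(h) * mix_spec, self.pitch_head(h)

def joint_loss(sep_pred, sep_target, pitch_logits, pitch_target, lam=0.5):
    """L = L_sep + lam * L_pitch; lam is an assumed balancing weight."""
    l_sep = nn.functional.l1_loss(sep_pred, sep_target)
    l_pitch = nn.functional.cross_entropy(pitch_logits, pitch_target)
    return l_sep + lam * l_pitch

# toy batch: 4 frames of a 513-bin magnitude spectrogram
mix = torch.rand(4, 513)
sep_pred, pitch_logits = JointModel()(mix)
loss = joint_loss(sep_pred, torch.rand(4, 513), pitch_logits, torch.randint(0, 128, (4,)))
loss.backward()
```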
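Interval-based tokenization, as evaluated in the symbolic-music paper above, replaces absolute pitches with pitch differences so that transposed melodies map to nearly identical token sequences. A minimal sketch, with token names of my own invention rather than the paper's vocabulary:

```python
def pitch_tokens(midi_pitches):
    """Absolute-pitch tokens, e.g. [60, 64, 67] -> ['P60', 'P64', 'P67']."""
    return [f"P{p}" for p in midi_pitches]

def interval_tokens(midi_pitches):
    """Interval tokens: the first note is kept absolute as an anchor, the
    rest are signed semitone steps. Transposing the melody changes only
    the anchor token, so downstream models see the same intervals."""
    if not midi_pitches:
        return []
    toks = [f"P{midi_pitches[0]}"]
    toks += [f"I{b - a:+d}" for a, b in zip(midi_pitches, midi_pitches[1:])]
    return toks

melody = [60, 64, 67, 65, 64]                    # C4 E4 G4 F4 E4
print(interval_tokens(melody))                   # ['P60', 'I+4', 'I+3', 'I-2', 'I-1']
print(interval_tokens([p + 5 for p in melody]))  # same intervals, new anchor
```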
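For acoustic anomaly detection in the wood-planer setting, the standard recipe is to train a convolutional autoencoder on normal machine sounds only and flag clips with high reconstruction error. The sketch below assumes 1x64x64 mel-spectrogram patches and a simple mean-plus-two-sigma threshold; both are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpecAutoencoder(nn.Module):
    """Convolutional autoencoder over mel-spectrogram patches."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),            # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),           # 32 -> 16
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # 32 -> 64
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model, x):
    """Per-clip reconstruction error; anomalies reconstruct poorly because
    the autoencoder was fit to normal sounds only."""
    with torch.no_grad():
        err = (model(x) - x) ** 2
    return err.mean(dim=(1, 2, 3))

model = SpecAutoencoder()
batch = torch.rand(8, 1, 64, 64)                      # stand-in for mel patches
scores = anomaly_score(model, batch)
flagged = scores > scores.mean() + 2 * scores.std()   # assumed simple threshold
```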
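Classifier chains, the basic mechanism behind the music-tagging paper, let each tag classifier condition on the predictions for earlier tags instead of treating tags independently. scikit-learn ships a vanilla implementation, shown below on toy data; the grouping strategy is the paper's own contribution and is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# toy data: 200 clips with 20 audio features and 4 binary tags
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
Y = rng.integers(0, 2, size=(200, 4))   # e.g. rock / acoustic / vocal / live
Y[:, 1] = Y[:, 0]                       # inject a tag dependency the chain can exploit

# each link in the chain sees X plus the predicted labels of earlier links,
# so conditional dependencies among tags are modeled instead of ignored
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2, 3])
chain.fit(X, Y)
print(chain.predict(X[:3]))
```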
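Musical surprisal via information content reduces to IC(x_t) = -log2 p(x_t | x_&lt;t) under some sequence model. The sketch below uses a Laplace-smoothed bigram model over quantized frames as a deliberately simple stand-in for the learned audio model the paper relies on.

```python
import numpy as np
from collections import Counter

def bigram_ic(symbols, vocab_size):
    """IC(x_t) = -log2 p(x_t | x_{t-1}) under a Laplace-smoothed bigram
    model fit in-sample; high IC marks surprising events."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    ctx_counts = Counter(symbols[:-1])
    ics = []
    for prev, cur in zip(symbols, symbols[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (ctx_counts[prev] + vocab_size)
        ics.append(-np.log2(p))
    return np.array(ics)

# quantize a toy "audio feature" stream into 16 discrete symbols
rng = np.random.default_rng(0)
frames = rng.normal(size=1000).cumsum()   # smooth, slowly drifting signal
cuts = np.quantile(frames, np.linspace(0, 1, 17)[1:-1])
symbols = np.digitize(frames, cuts).tolist()
ic = bigram_ic(symbols, vocab_size=16)
print("mean IC:", ic.mean(), "most surprising frame:", int(ic.argmax()) + 1)
```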
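Finally, one network-science view of melodic complexity builds a directed note-transition graph and measures how spread out each note's outgoing transitions are. The graph construction and entropy metric below are illustrative choices, not the exact measures used in the paper.

```python
import math
import networkx as nx

def transition_graph(notes):
    """Directed graph whose edge weights count note-to-note transitions."""
    g = nx.DiGraph()
    for a, b in zip(notes, notes[1:]):
        w = g.edges[a, b]["weight"] + 1 if g.has_edge(a, b) else 1
        g.add_edge(a, b, weight=w)
    return g

def transition_entropy(g):
    """Average outgoing-transition entropy in bits; lower values suggest
    more repetitive, predictable melodies."""
    ents = []
    for node in g.nodes:
        weights = [d["weight"] for _, _, d in g.out_edges(node, data=True)]
        total = sum(weights)
        if total:
            ents.append(-sum(w / total * math.log2(w / total) for w in weights))
    return sum(ents) / len(ents) if ents else 0.0

simple = ["C", "D", "C", "D", "C", "D", "C"]
varied = ["C", "E", "G", "A", "F", "D", "B", "C", "G", "E"]
print(transition_entropy(transition_graph(simple)))   # 0.0: one transition per note
print(transition_entropy(transition_graph(varied)))   # higher: more branching
```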

Sources

LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation

Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

Guitar-TECHS: An Electric Guitar Dataset Covering Techniques, Musical Excerpts, Chords and Scales Using a Diverse Array of Hardware

MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition

Evaluating Interval-based Tokenization for Pitch Representation in Symbolic Music Analysis

Planing It by Ear: Convolutional Neural Networks for Acoustic Anomaly Detection in Industrial Wood Planers

Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels

Music Tagging with Classifier Group Chains

Music and art: a study in cross-modal interpretation

Towards Early Prediction of Self-Supervised Speech Model Performance

Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing

Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks

Sanidha: A Studio Quality Multi-Modal Dataset for Carnatic Music

Estimating Musical Surprisal in Audio

Decoding Musical Evolution Through Network Science

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
