Date: Thursday 15th - Friday 16th of October 2021
Time Zone: UTC+01 (UK summer time)
The MIP-Frontiers project (New Frontiers in Music Information Processing) has supported 15 PhD students at four universities working with a range of industry and cultural partners. This final workshop allows the fellows to present the main results of their research over the last three years. We have also invited several distinguished researchers in the field to give keynote presentations.
|Simon Dixon, Queen Mary University of London|
|Domna Paschalidou, European Research Executive Agency|
|9:15||Data-Driven Musical Version Identification: Accuracy, Scalability, and Bias Perspectives|
|Furkan Yesiler, Universitat Pompeu Fabra|
|Deep Learning Methods for Musical Instrument Separation and Recognition|
|Carlos Lordelo, Queen Mary University of London & DoReMIR|
|10:30||Informed Audio Source Separation with Deep Learning in Limited Data Settings|
|Kilian Schulze-Forster, Telecom Paris|
|Deep Neural Networks for Automatic Lyrics Transcription|
|Emir Demirel, Queen Mary University of London|
|11:45||Data-Driven Approaches for Query by Vocal Percussion|
|Alejandro Delgado, Queen Mary University of London & Roli|
|14:00||Keynote 1: Applications of Machine Intelligence in Computational Acoustics|
|Augusto Sarti, Politecnico di Milano|
|15:00||Towards Neural Context-Aware Performance-Score Synchronization|
|Ruchit Agrawal, Queen Mary University of London|
|Autonomous and Robust Live Tracking of Complete Opera Performances|
|Charles Brazier, Johannes Kepler University Linz|
|16:15||Large-Scale Multi-Modal Music Search and Retrieval without Symbolic Representation|
|Luís Carvalho, Johannes Kepler University Linz|
|Investigating the Behaviour of Audio Classification Models through Adversarial Attacks and Gradient based Interpretability Methods|
|Vinod Subramanian, Queen Mary University of London|
|9:00||Keynote 2: Source Separation Metrics: What are they really capturing?|
|Rachel Bittner, Spotify|
|10:00||Exploration of Music Collections with Audio Embeddings|
|Philip Tovstogan, Universitat Pompeu Fabra|
|Audio Auto-tagging as Proxy for Contextual Music Recommendation|
|Karim Ibrahim, Telecom Paris|
|11:15||Deep Learning Methods for Music Style Transfer|
|Ondřej Cífka, Telecom Paris|
|Exploring Generative Adversarial Networks for Controllable Musical Audio Synthesis|
|Javier Nistal, Telecom Paris & Sony CSL|
|14:00||Keynote 3: Generative Modeling Meets Deep Learning: A Modern Statistical Approach to Music Signal Analysis|
|Kazuyoshi Yoshii, Kyoto University|
|15:00||Neuro-Steered Music Source Separation|
|Giorgia Cantisani, Telecom Paris|
|Automatic Characterization and Generation of Music Loops and Instrument Samples for Electronic Music Production|
|António Ramires, Universitat Pompeu Fabra|
|Domna Paschalidou, European Research Executive Agency|
Please remember to download the Zoom app. If you have any issues, please email firstname.lastname@example.org
Information retrieval and machine learning have grown to the point of playing a leading role in all aspects of sound analysis. There are some research areas, however, in which the potential of information retrieval techniques is only now beginning to have an impact on the research community. One of these areas is musical acoustics. In this area, in fact, machine intelligence and information retrieval have been widely used only for timbral analysis. Machine learning is now beginning to play a relevant role also in the analysis of vibrational and acoustic properties of musical instruments. With this talk I would like to offer an overview of some emerging methodologies for vibrational, acoustic and timbral analysis based on machine intelligence.
The task of source separation has relied on what has become a standard set of metrics for the past 15 years. These metrics have been the basis for evaluating the latest “state-of-the-art”, and for determining when design choices are good or bad. Yet a number of articles have shown that these metrics are limited in how well they correlate with human perception. This talk gives a history of the metrics, their evolution, implementations, and what we use today. Then, we dive into the mathematical properties of these metrics and give illustrative examples. We conclude with a short survey of alternative metrics from the literature, and criteria that we might consider when defining a new standard.
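The abstract does not name the metrics it discusses. As a rough illustration only, the sketch below assumes the plain energy-ratio form of the signal-to-distortion ratio (SDR) and a scale-invariant variant (SI-SDR), both commonly found in the source separation literature. The function names and the test signal are hypothetical, and this is not the full BSS Eval decomposition. It shows one mathematical property such metrics can have: plain SDR penalizes a simple gain change, while the scale-invariant variant does not.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR: project the estimate onto the reference first."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Hypothetical test signal: a 440 Hz sine plus a little white noise.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
source = np.sin(2 * np.pi * 440.0 * t)
rng = np.random.default_rng(0)
estimate = source + 0.1 * rng.standard_normal(t.shape)
half_gain = 0.5 * estimate  # same estimate, played back 6 dB quieter

print("SDR    full gain:", sdr_db(source, estimate))
print("SDR    half gain:", sdr_db(source, half_gain))
print("SI-SDR full gain:", si_sdr_db(source, estimate))
print("SI-SDR half gain:", si_sdr_db(source, half_gain))
```

Halving the gain of the estimate lowers the plain SDR noticeably, although a listener would hear the same separation quality at a lower volume; the scale-invariant value stays the same.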
I will present a modern statistical approach to automatic audio-to-score transcription based on an effective combination of an acoustic model, a language model, and an inference model. These models are implemented with classical probabilistic models (e.g., HMM) or with deep neural networks (e.g., LSTM) for richer expression capabilities if necessary. As subtasks of music transcription based on this approach, I will introduce music structure analysis and automatic transcription of singing voice, chords, keys, and drums for popular music. I will also introduce the state-of-the-art automatic piano transcription system that can yield decent symbolic piano scores.
This thesis aims at developing audio-based musical version identification (VI) systems for industry-scale corpora. To employ such systems in industrial use cases, they must demonstrate high performance on large-scale corpora while not favoring certain musicians or tracks above others. Therefore, the three main aspects we address in this thesis are accuracy, scalability, and algorithmic bias of VI systems.
We first propose a deep learning-based model that incorporates domain knowledge in its network architecture and training strategy. We design explicit modules to handle common transformations between musical versions (e.g., key, tempo, and structure). We then take two main directions to further improve our model. Firstly, we investigate embedding distillation techniques to reduce the size of the embeddings produced by our model, which reduces the requirements for data storage and, more importantly, retrieval time.
Secondly, we experiment with data-driven fusion methods to combine information from models that process harmonic and melodic information, which greatly enhances identification accuracy. After exploring potential improvements in accuracy and scalability, we analyze the algorithmic biases of our systems and point out the impact such systems may have on various stakeholders in the music ecosystem (e.g., musicians, composers) when used in an industrial context. We conclude our research by analyzing the performance of our proposed systems on two industrial use cases, in collaboration with a broadcast monitoring company.
Overall, our work addresses the research challenges of the next generation of VI systems. We show the feasibility of developing systems that are both accurate and scalable at the same time by carefully integrating domain knowledge into data-driven workflows. We believe that our contributions will accelerate the integration of VI systems into industrial scenarios, and, thus, the impact of VI research on musicians and listeners will be more prominent than ever.
Automatically transcribing polyphonic music to a score is a challenging task and one of the most discussed topics in the Music Information Retrieval (MIR) community. In particular, when analysing recordings with multiple instruments, the transcription process becomes even more complex, because not only should each note have its pitch and duration properly estimated, but the information regarding the timbre of sounds should also be correctly processed. It is mandatory to have a way of recognising the instrument that played each note and to associate each sound with the correct voice in the final staff notation. Moreover, when dealing with signals with pitched and non-pitched sounds (drum kits), drum detection and classification is often performed independently because the sound characteristics of drum instruments differ in many aspects from those of the pitched instruments that constitute the melodic and harmonic nature of music.
With the final objective of allowing a more general multi-instrument automatic music transcription, where not only harmonic but also percussive instruments could be present in the signal, in this project we propose and investigate in depth deep-learning-based harmonic-percussive source separation and instrument recognition methods. We show that musically motivated architectures can improve the performance of each of those tasks. Using the Tap & Fiddle Dataset, a dataset curated as part of this project containing 28 stereo recordings of traditional Scandinavian fiddle tunes with accompanying foot-tapping, we also show that unsupervised domain adaptation methods can help in cases where no labelled data is available for specific instruments. Regarding instrument recognition, we investigate the pitch streaming task and propose novel deep-learning-based methods that can perform the task when any multi-pitch estimations are provided as input to the system.
Audio source separation is the task of estimating the individual signals of several sound sources when only their mixture can be observed. State-of-the-art performance for musical mixtures is achieved by Deep Neural Networks (DNN) trained in a supervised way. They require large and diverse datasets of mixtures along with the target source signals in isolation. However, it is difficult and costly to obtain such datasets because music recordings are subject to copyright restrictions and isolated instrument recordings may not always exist.
In this dissertation, we explore the usage of prior knowledge for deep learning based source separation in order to overcome data limitations.
First, we focus on a supervised setting with only a small amount of available training data. We investigate to what extent singing voice separation can be improved when it is informed by lyrics transcripts. To this end, a novel deep learning model for informed source separation is proposed. It aligns text and audio during the separation using a novel monotonic attention mechanism. The lyrics alignment performance is competitive with state-of-the-art methods while a smaller amount of training data is used. We find that exploiting aligned phonemes can improve singing voice separation, but precise alignments and accurate transcripts are required.
Finally, we consider a scenario where only mixtures but no isolated source signals are available for training. We propose a novel unsupervised deep learning approach to source separation. It exploits information about the sources’ fundamental frequencies (F0). The method integrates domain knowledge in the form of parametric source models into the DNN.
Experimental evaluation shows that the proposed method outperforms F0-informed learning-free methods based on non-negative matrix factorization and an F0-informed supervised deep learning baseline. Moreover, the proposed method is extremely data-efficient. It makes powerful deep learning based source separation usable in domains where labeled training data is expensive or non-existent.
Lyrics are an essential building block for the representation, understanding and appreciation of singing performances. Therefore, the automatic retrieval of lyrics from singing performances, or automatic lyrics transcription, has a number of potential industrial applications, though the performance of such systems has not yet reached a level suitable for industrial use. In our project, we develop the first automatic lyrics transcriber module that is integrated with an automatic music transcription system to be used in large scale applications. To achieve this, we propose a number of novel methods for an improved lyrics transcriber, such as a compact multistreaming neural network architecture, cross-domain training, a singing-adapted pronunciation dictionary and music-informed silence modeling. In addition, we introduce a new evaluation set for this task that is much larger than the existing benchmark test sets. Finally, we provide a quantitative comparison between the state-of-the-art DNN-HMM and end-to-end methods within this context. Our best performing model sets the new state-of-the-art in lyrics transcription, and it is going to be included as a novel feature of the new ScoreCloud - Songwriter app, which is planned to be released in the upcoming months. We will conclude our talk with an initial demo of the industrial application of the automatic lyrics transcription technology.
The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Query by Vocal Percussion (QVP) is a subfield in Music Information Retrieval (MIR) that explores techniques to query percussive sounds using vocal imitations as input, usually plosive consonant sounds. In this way, fully automated QVP systems can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. This project focuses on applying data-driven approaches to the two most important tasks in QVP. On the one hand, the task of Drum Sample Retrieval by Vocal Imitation (DSRVI) aims at picking different-sounding samples by timbral similarity with a given vocal imitation. This is a problem of correspondence between two acoustic spaces, the one for the real drum samples and the one for the vocal imitations, and thus the main objective is to learn the set of audio features that best link them. On the other hand, the task of Vocal Percussion Transcription (VPT) works by identifying distinct vocal percussion utterances that trigger individual drum samples. This problem, in contrast, is one of correspondence between a sound and a label (classification), and the relevant set of audio features is the one that best separates all classes, independently of how the triggered drum samples sound. In this study, we try to give robust solutions to these two problems using recent deep learning techniques so that music producers can have a more pleasant experience when searching for sounds and composing beats.
Music synchronization aims at providing a way to navigate among multiple representations of music in a unified manner, lending itself to a myriad of domains such as music education, performance, enhanced listening, automatic accompaniment and so on. This project focuses on improved music synchronization in real life settings. This entails developing robust alignment methods which have significant domain coverage and can adapt to the setting they are being employed in. This project develops synchronization methods applicable to both audio-to-audio and audio-to-score alignment and addresses important challenges that limit traditional alignment algorithms.
This project addresses the challenging task of tracking complete opera performances in real-time along with their respective scores. So far, existing approaches have proven their efficiency at tracking full orchestral works with accuracy and robustness. However, these approaches fail at tracking operas. Such trackers must not only deal with a continuous musical recording, but also have to consider a complex mixture of polyphonic music and singing voice, acting sounds, interludes, breaks, and also applause from the audience. All those factors interfere with the tracking process and tend to provoke failures where the tracker is lost in the score. To address this issue, we propose to develop new methods to integrate different extra-musical knowledge sources (e.g. acoustic event detection, speech-specific features, acoustic model) into state-of-the-art music score following algorithms to achieve robust and accurate tracking of live opera performances in real conditions.
The goal of this project is to propose methods for the automatic structuring and cross-linking of large multi-modal music collections, with focus on audio recordings and sheet music images, and without the need for symbolic representation. These methods should support tasks such as the retrieval of one modality based on another one, identification of different versions of the same material, and piece identification in unknown recordings. We have identified two main directions for this research. First, we focus on how to learn better audio-to-sheet music correspondences, following recent advances in deep neural networks. Our approach consists of learning similarities between short snippets of audio spectrograms and staff-wise unrolled sheet music pages. Second, we build upon the learned correspondences and investigate how to best exploit them for identification and retrieval tasks on real multimodal archives of music, aiming for fast and scalable methods.
In this presentation we will explore different methods that help us understand how audio classification models work. The two main methods are adversarial attacks and interpretability. Adversarial attacks allow us to perturb the input to fool the classifier, while interpretability methods use the gradients of the classifier to show which parts of the input are important for the model prediction. We focus on motivating our research in the audio domain, tackling challenges unique to audio, and exploring directions that the community must take to mature the field.
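To make the two ideas concrete, here is a minimal sketch on a toy model. It assumes a logistic-regression "classifier" with random weights standing in for an audio model, and uses the fast gradient sign method (FGSM) as one well-known adversarial attack; the abstract does not say which attacks or interpretability methods the talk covers. All names, sizes, and values below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(np.dot(w, x) + b)

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this has the closed form (p - y) * w.
    """
    return (predict(w, b, x) - y) * w

def fgsm(w, b, x, y, epsilon):
    """FGSM: step of size epsilon along the sign of the input gradient."""
    return x + epsilon * np.sign(input_gradient(w, b, x, y))

rng = np.random.default_rng(0)
w = rng.standard_normal(16)  # toy weights over 16 hypothetical input features
b = 0.0
x = 0.2 * np.sign(w)         # an input the model confidently assigns to class 1
y = 1.0

x_adv = fgsm(w, b, x, y, epsilon=0.3)
saliency = np.abs(input_gradient(w, b, x, y))  # gradient-based importance map

print("clean prediction:      ", predict(w, b, x))
print("adversarial prediction:", predict(w, b, x_adv))
print("most influential input feature:", int(np.argmax(saliency)))
```

The same input gradient serves both purposes: its sign gives the attack direction, and its magnitude gives a simple saliency map over the input features.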
Music recommendation systems are an integral part of modern music streaming services. To balance user retention and diversity of recommendations, most industrial systems utilize an exploit-versus-explore model. In this thesis, we focus on music exploration as opposed to exploitation, as this area of research is less developed and better suited to the academic environment than to industry, which mostly focuses on improving exploitation performance. We propose a novel approach to music exploration that utilizes visualization of music in a continuous semantic latent space instead of browsing through artists, genres or moods.
We release MTG-Jamendo - a new open-source auto-tagging dataset that provides full audio tracks under Creative Commons licenses, with tags categorized into genres, moods and themes, and instruments that are useful for research on music exploration. We utilize state-of-the-art deep auto-tagging systems to perform the evaluation of the dataset.
We present a novel web interface to visualize music collections using the audio embeddings extracted from music tracks. The system allows exploring the relationship between music tracks from multiple perspectives and on different levels (segments vs full tracks), displaying embedding spaces extracted by music auto-tagging models, trained using different architectures and datasets, coupled with various 2D projection algorithms. We conduct a user study to analyze the appropriateness of different embeddings and visualizations on the music collections, particularly for playlist creation and music library navigation and rediscovery. Our results show that such an interface provides a good alternative to standard hierarchical library organization by metadata. Furthermore, we provide the analysis of the participants’ preference for different audio embeddings and visualization algorithms.
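The abstract does not list the projection algorithms used in the interface. As a hedged illustration of what a 2D projection of audio embeddings involves, the sketch below assumes PCA (computed through an SVD) and uses random 32-dimensional vectors as stand-ins for real track embeddings; the function name and the data are hypothetical.

```python
import numpy as np

def pca_2d(embeddings):
    """Project row-vector embeddings onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # center each dimension
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

# Stand-in data: two hypothetical groups of 32-dimensional track embeddings.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 32))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(50, 32))
points = pca_2d(np.vstack([cluster_a, cluster_b]))

print(points.shape)  # one (x, y) coordinate per track
print("group A mean x:", points[:50, 0].mean())
print("group B mean x:", points[50:, 0].mean())
```

Each track then becomes a point on a 2D map, and tracks with similar embeddings end up close together. Non-linear projections would be used in the same way but can produce quite different layouts.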
The exponential growth in the volume of online services and user data has changed how we interact with various service providers, and how we explore and select new products. Hence, there is a growing need for methods to recommend the appropriate items for each user. In the case of music, it is more important to recommend the right items at the right time. It has been well documented that the context, i.e. the listening situation of the users, strongly influences their listening preferences. Hence, there has been increasing attention towards developing context-aware systems. State-of-the-art approaches are sequence-based models aiming at predicting the tracks in the next session using available contextual information. However, these approaches lack interpretability and serve as a hit-or-miss with no room for user involvement. Additionally, few previous approaches have studied how the audio content relates to these situational influences, and even fewer have made use of the audio content in providing contextual recommendations. Hence, these approaches suffer from both a lack of interpretability and the cold-start problem.
In music, composers, arrangers, performers and producers often adapt existing pieces to different contexts and audiences. Recently, deep learning methods have enabled transforming musical material in a data-driven manner, setting the ground for tools which could partially automate this process. The research performed in this area so far has focused largely on conversion between a small set of musical genres or instrument timbres, and on tasks that involve completing a partial arrangement in a desired style. The focus of this thesis, on the other hand, is on a family of tasks which we refer to as (one-shot) music style transfer, where the goal is to transfer the style of one musical piece or fragment onto another. We propose two specific tasks in this direction: (1) accompaniment style transfer for symbolic music representations (i.e. digital scores or MIDI files), and (2) timbre transfer for audio recordings. For each of these tasks, we propose novel methods based on deep learning, as well as evaluation protocols. Additionally, we present a broader contribution related to the processing of sequences (music or otherwise) using Transformer neural networks.
In the first part of this work, we focus on supervised methods for symbolic music accompaniment style transfer, aiming to transform a given piece by generating a new accompaniment for it in the style of another piece. The method we have developed is based on supervised sequence-to-sequence learning using recurrent neural networks (RNNs) and leverages a synthetic parallel (pairwise aligned) dataset generated for this purpose using existing accompaniment generation software. We propose a set of objective metrics to evaluate the performance on this new task and we show that the system is successful in generating an accompaniment in the desired style while following the harmonic structure of the input.
In the second part, we investigate a more basic question: the role of positional encodings in music generation using Transformers. In particular, we propose stochastic positional encoding (SPE), a novel form of positional encoding capturing relative positions while being compatible with a recently proposed family of efficient Transformers. The main theoretical contribution of this work is to draw a connection between positional encoding and cross-covariances of correlated stochastic processes. We demonstrate that SPE allows for better extrapolation beyond the training sequence length than the commonly used absolute positional encoding.
Finally, in the third part, we turn from symbolic music to audio and address the problem of timbre transfer. Specifically, we are interested in transferring the timbre of an audio recording of a single musical instrument onto another such recording while preserving the pitch content of the latter. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. As in the first part, we design a set of objective metrics for the task. We show that the proposed method is able to outperform existing ones.
In this thesis, we study Generative Adversarial Networks (GANs) for musical audio synthesis. We explore various sources of conditional information in order to shape the synthesized sounds according to high-level features. We also address a fundamental problem that arises when applying image-based GAN architectures to the audio domain: the generation of sounds with variable duration.
In this project, we address the challenge of integrating brain-computer interface (BCI) and music technologies in the specific application of music source separation, which is the task of isolating individual sound sources that are mixed in the audio recording of a musical piece. This problem has been investigated for decades, but never considering BCI as a possible way to guide and inform separation systems. Specifically, we explored how the neural activity characterized by electroencephalographic signals (EEG) reflects information about the attended instrument and how we can use it to inform a source separation system.
First, we studied the problem of EEG-based auditory attention decoding of a target instrument in polyphonic music, showing that the EEG tracks musically relevant features which are highly correlated with the time-frequency representation of the attended source and only weakly correlated with the unattended one. Second, we leveraged this “contrast” to inform an unsupervised source separation model based on a novel non-negative matrix factorisation (NMF) variant, named contrastive-NMF (C-NMF) and automatically separate the attended source.
Unsupervised NMF represents a powerful approach in applications with little or no training data, as is the case when neural recordings are involved. Indeed, the available music-related EEG datasets are still costly and time-consuming to acquire, precluding the possibility of tackling the problem with fully supervised deep learning approaches. Thus, we explored alternative learning strategies to alleviate this problem. Specifically, we propose to adapt a state-of-the-art music source separation model to a specific mixture using the time activations of the sources derived from the user’s neural activity. This paradigm can be referred to as one-shot, as the adaptation acts on the target song instance only. We conducted an extensive evaluation of both proposed systems on the MAD-EEG dataset, which was specifically assembled for this study, obtaining encouraging results, especially in difficult cases where non-informed models struggle.
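For readers unfamiliar with NMF, the following sketch shows plain non-negative matrix factorisation with the standard multiplicative updates for the Euclidean cost. It is not the contrastive-NMF variant described above: the EEG-driven contrastive term is not shown, and the matrix sizes, rank, and synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x frames) as V ~ W @ H.

    W holds spectral templates (freq x rank) and H their time activations
    (rank x frames). Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic "magnitude spectrogram" built from two hypothetical sources.
rng = np.random.default_rng(1)
true_W = rng.random((20, 2))
true_H = rng.random((2, 40))
V = true_W @ true_H

W, H = nmf(V, rank=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", relative_error)
```

In a separation setting, each column of `W` together with the matching row of `H` gives the contribution of one component to the mixture spectrogram; an informed variant adds extra terms or constraints that steer some activations towards the target source.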
Repurposing audio material to create new music - also known as sampling - was at the foundation of electronic music and is a fundamental component of this practice. Loops are audio excerpts, usually of short duration, that can be played repeatedly in a seamless manner. These loops can serve as the basis for songs that music makers can combine, cut and rearrange and have been extensively used in Electronic Dance Music (EDM) tracks. Similarly, the so-called “one-shot sounds” are smaller musical constructs that are not meant to be looped but that are also typically used in EDM production. These might be sound effects, drum sounds or even melodic phrases. Both loops and one-shot sounds have been made available for amateur and professional music makers since the early days of electronic music. Currently, large-scale databases of audio offer huge collections of audio material for users to work with. Significant research has focused on easing the navigation of one-shot sounds in these databases, either through similarity search, clustering, high-level description or recommendation of sounds. Loops, however, have not yet been given such attention.
In our work, we address two fundamental methods for navigating sounds: characterization and generation. Characterizing loops and one-shots in terms of their instruments or instrumentation (e.g. drums, harmony, melody) allows organizing unstructured collections and faster retrieval for music-making. Generation enables the creation of new sounds which are not present in the database through interpolation or modification of the existing material. To achieve this, we employ deep-learning-based data-driven methodologies for classification (e.g. Convolutional Neural Networks) and generation (Wave-U-Net and Generative Adversarial Networks).