Highlights of ISMIR 2018

MIP-Frontiers ESRs have compiled an overview of current trends and analysed how popular deep learning and open science are in MIR research

Posted by MIP-Frontiers on November 15, 2018 · 10 mins read

The International Society for Music Information Retrieval (ISMIR) conference is the biggest conference in MIR, where all MIR-related topics are addressed. In this post, we provide an overview of ISMIR 2018. For each session we also report, as an interesting statistic, the number of papers that use deep learning of any kind and the number of papers that publish their source code in open access. Over all of the sessions, the percentages are about 51.5% and 54.4% respectively (53 and 56 of 103 papers).

This summary is meant to show the broader trends and challenges in topics that the MIR community is interested in, and to introduce people new to MIR to ongoing research. The most obvious change across almost all disciplines is the growing popularity of deep learning, so we’ll avoid repeating that too often. We’ve organised our summary around the different sessions at ISMIR. Given the diversity of research at ISMIR, this won’t do justice to every paper, but we hope to provide some insight into the more popular research areas. All papers are available online.

Session A - Musical objects

The session on musical objects mainly focused on tasks like music transcription, key labeling, chord recognition, and tempo estimation.

There were some interesting papers that used traditional or novel approaches that do not rely on deep neural networks (DNNs). These can show comparable or better performance than DNN-based approaches, especially when large datasets or other resources are not available. This is an exciting direction for future work on musical objects. Another exciting direction is taking user information and practical conditions into account alongside advances in technology. A challenge shared by several papers was the lack of specific, common and robust evaluation metrics for fairly comparing approaches to subjective tasks, which are hard to evaluate with traditional metrics.

Deep learning: 12/18. Open source: 9/18

Session B - Generation, visual

Music generation is a very exciting field because many of its tasks aim to enable creativity using artificial intelligence. The papers in the session focused on tasks at the symbolic level, chord level, song level (playlist generation) and so on. While all of these come under the broader umbrella of music generation, in practice the tasks have very little in common. The next step for the ISMIR community in music generation is to develop robust evaluation techniques to create a baseline for good music generation algorithms.

The visual part of the session covered optical music recognition (OMR), which applies computer vision to sheet music in order to convert it into a digital format. The OMR community is well established, and this year’s work focused on improving recognition by handling handwritten music, more musical symbols, mensural notation and so on.

Deep learning: 10/17. Open source: 7/17

Session C - Source separation, symbolic, emotion

Source separation is a very popular task in the MIR community. With more powerful computers and the success of deep learning, many deep models have been developed for source separation. Encoder-decoder DNN architectures have had great success thanks to their ability to extract both local and global features and to project them back to a higher-resolution space. Usually, these deep models work in the frequency domain and treat spectrograms as images. As a consequence, the important phase information is lost, which leads to artifacts. A novelty at this ISMIR was Wave-U-Net, a source separation DNN that works directly in the time domain and so does not lose phase information.
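To make the phase problem concrete, here is a minimal sketch of what happens when only the magnitude of a spectrum is kept. It is a toy illustration on a single frame using a naive DFT from the standard library, not Wave-U-Net or any method from the papers. The test signal and the choice to rebuild with zero phase are our own assumptions, picked to make the effect obvious; practical spectrogram-based systems often reuse the phase of the mixture instead.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for a short toy frame)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


# Hypothetical test frame: two sinusoids with non-zero phase offsets.
n = 64
signal = [
    math.sin(2 * math.pi * 3 * t / n + 1.0) + 0.5 * math.sin(2 * math.pi * 7 * t / n + 2.5)
    for t in range(n)
]

spectrum = dft(signal)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]

# Rebuilding from magnitude AND phase recovers the frame.
with_phase = idft([m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)])
# Rebuilding from magnitude alone (phase forced to zero) does not.
without_phase = idft(magnitude)

err_with_phase = rms_error(signal, with_phase)
err_without_phase = rms_error(signal, without_phase)
print(f"RMS error with phase: {err_with_phase:.2e}, without phase: {err_without_phase:.2f}")
```

The first error is at the level of floating-point noise, while the second is a sizeable fraction of the signal amplitude, which is the kind of distortion a time-domain model avoids by never discarding the phase in the first place.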

Recent research shows that music clearly induces emotions in listeners, but emotions are too complex to be easily categorized and measured. However, many physiological signals, e.g. heart rate variability, skin conductance and EEG, reflect the effects that emotions have on our bodies. Some works categorized emotions using a continuous valence-arousal space.

Thanks to the creation of datasets of this kind, recent research is focusing on finding links between physiological signals and music. By understanding the link between emotions and music through these biological signals, music recommendation systems can be greatly improved.
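As a rough illustration of how a continuous valence-arousal space can be turned into categories, the sketch below maps a point to one of four quadrants. The [-1, 1] range, the quadrant labels and the example track annotations are all our own illustrative assumptions, not values taken from any paper in the session.

```python
def emotion_quadrant(valence, arousal):
    """Map a point in valence-arousal space to a coarse quadrant label.

    Both coordinates are assumed to lie in [-1, 1]; the label names are
    illustrative only.
    """
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal are assumed to lie in [-1, 1]")
    if valence >= 0:
        return "happy/excited" if arousal >= 0 else "calm/relaxed"
    return "angry/tense" if arousal >= 0 else "sad/depressed"


# Hypothetical per-track annotations: (valence, arousal).
annotations = {"track_a": (0.7, 0.6), "track_b": (-0.5, -0.4)}
labels = {name: emotion_quadrant(v, a) for name, (v, a) in annotations.items()}
print(labels)
```

Keeping the continuous coordinates and only deriving labels when needed avoids committing to a fixed set of emotion categories too early.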

Deep learning: 8/17. Open source: 12/17

Session D - Corpora and voice

Any music-related research quickly runs into the problem of data: most music is copyrighted, so it becomes almost impossible to have well-labeled, freely available data. This problem has become more glaring with the advent of supervised deep learning, which requires large amounts of data. This session introduced many labeled datasets that researchers can use for several MIR-related tasks such as automatic music transcription, chord recognition, tempo estimation and singing voice detection. Most datasets came with baseline algorithms. In addition to the datasets presented, there were also studies that applied corpus-based research, mostly to jazz and Western classical music.

The second part of the session covered studies on singing voice detection and transcription. These studies tackled problems in current singing voice systems from various points of view. Overall, they emphasized the need to consider multi-cultural, contextual and psychological information in the music when designing architectures for singing voice signals.

Deep learning: 5/17. Open source: 11/17

Session E - Timbre, tagging, similarity, patterns and alignment

Two big subtopics in this session were instrument recognition and identifying patterns in music. In instrument recognition, particular attention was given to cultural context; specifically, features used for Western instruments were found not to perform as well for non-Western instruments, and vice versa. One of the new tasks addressed was identifying the specific time periods in which an instrument is present in the music, and the results were impressive for a new task. There was also some research on telling apart very similar instruments, e.g. tenor vs baritone saxophone, which is much harder than distinguishing dissimilar instruments. One of the applications was an instrument-based music exploration demo, which proved interesting and innovative.

As for pattern identification, there was evidence of techniques from natural language processing (e.g. n-grams and skip-grams) being applied successfully to music, which suggests that music can be approached from a linguistic perspective. One application of pattern identification, specifically using rhythm features, is cover song identification, in particular for folk songs.
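To give a feel for how n-grams and skip-grams carry over from text to music, the sketch below treats a melody as a sequence of pitch intervals and counts short patterns in it. The melody, the use of MIDI pitch numbers and the interval representation are our own assumptions for illustration; the papers in the session may use different representations, such as rhythm features.

```python
from collections import Counter


def intervals(pitches):
    """Convert a pitch sequence into successive intervals (in semitones)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def ngrams(sequence, n):
    """All contiguous subsequences of length n."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def skip_bigrams(sequence, max_skip):
    """Ordered pairs of items with at most max_skip items skipped between them."""
    pairs = []
    for i in range(len(sequence)):
        for j in range(i + 1, min(i + max_skip + 2, len(sequence))):
            pairs.append((sequence[i], sequence[j]))
    return pairs


# Hypothetical melody as MIDI pitches: C D E D C.
melody = [60, 62, 64, 62, 60]
steps = intervals(melody)  # [2, 2, -2, -2]

bigram_counts = Counter(ngrams(steps, 2))
skip_counts = Counter(skip_bigrams(steps, max_skip=1))
print(bigram_counts)
print(skip_counts)
```

Working on intervals rather than absolute pitches makes the patterns transposition-invariant, and allowing skips lets a pattern still match when an ornament or passing note is inserted.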

Some papers tried to address the semantic gap between the features that are useful for understanding music and the ones that work well for MIR tasks. Some papers showed a connection between psychological features such as aggressiveness or sadness and audio features such as energy, valence, etc. The main problem with this kind of research is that evaluation by human subjects is subjective and consumes a lot of resources.

From a more practical perspective, there were several papers on the disambiguation of large-scale music catalogs. This is a big problem for streaming services because people use a popular artist’s name to submit much lower-quality music, which then gets exposure because of the name.

Deep learning: 10/17. Open source: 12/17

Session F - Machine and human learning of music

The session began with several papers studying user behavior. Rather than stressing the human learning aspect, these works were more focused on users’ interaction with music and music services, and the effects of this interaction. There was also some focus on making better predictions of user behavior. The data were collected from music services, as well as from human subject experiments.

Two studies tackled the problem of style transfer in music in a data-driven way, focusing on symbolic music for the first time. Although the evaluation metrics for this task are not well defined, it is apparent that the results are not yet satisfying. There was also some research on the visualization and exploration of musical data.

The rest of the session covered a diverse range of machine learning topics. Notably, the paper introducing a deep reinforcement learning approach to score following received the Best Paper Award at ISMIR 2018.

Deep learning: 8/17. Open source: 5/17
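
As a quick check, the overall shares can be recomputed from the per-session counts listed above. The snippet below simply adds up those counts; the dictionary layout and function name are our own.

```python
# (deep learning papers, open-source papers, total papers) per session,
# copied from the counts reported at the end of each session summary.
SESSIONS = {
    "A": (12, 9, 18),
    "B": (10, 7, 17),
    "C": (8, 12, 17),
    "D": (5, 11, 17),
    "E": (10, 12, 17),
    "F": (8, 5, 17),
}


def overall_shares(sessions):
    """Return (deep learning share, open-source share, total paper count)."""
    deep = sum(d for d, _, _ in sessions.values())
    open_source = sum(o for _, o, _ in sessions.values())
    total = sum(t for _, _, t in sessions.values())
    return deep / total, open_source / total, total


deep_share, open_share, total_papers = overall_shares(SESSIONS)
print(f"{total_papers} papers: deep learning {deep_share:.1%}, open source {open_share:.1%}")
# prints "103 papers: deep learning 51.5%, open source 54.4%"
```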