As my final blog post for the MIP-Frontiers project, I have chosen to give my two cents on the future of musical version identification (VI) research and technologies. The topics that I elaborate on below are all included in our survey article and my dissertation. My goal here is to give a brief summary of a selected set of future directions mentioned in those two works.
As opposed to the categories my colleagues and I used in our previous work, I will now attempt to categorize the potential strategies for VI research into two groups: short-term and medium-term strategies. For a more detailed perspective on the issues below and for some other future work, our survey article would be a great place to start.
Considering that the main use case of VI systems is digital rights management, the short-term strategies I mention below focus more on developing reliable systems that can satisfy the current needs of the relevant stakeholders. Assuming that an interest from the industry may give a quick boost to the speed of VI research and development, unleashing the full potential of such systems would be beneficial to the overall VI ecosystem.
Historically, VI systems have used harmonic and melodic information from audio signals by extracting chroma and dominant melody features, respectively. The experimental results have shown that systems that use such information can demonstrate reliable performances for many mainstream use cases. Apart from the fact that harmonic and melodic characteristics are expected to show a certain degree of similarity between versions, another reason why those features were the popular choice among VI researchers is that they are relatively easy to compute and process.
In the last year, VI research has witnessed an important development in going beyond the harmonic and melodic characteristics. Vaglio et al. used a system that extracts lyrics from audio signals and proposed to use them for VI. Since it is a very intuitive idea, it may surprise people that it hadn’t been tried before. However, the main obstacle was having a reliable system that can extract lyrics directly from the audio signals. Luckily, the research community has come up with interesting solutions to this problem, and developing VI systems that use lyrics in their workflow is a very viable and promising idea today.
Another interesting and not-yet-tested idea is to use a music classification system to obtain some information about the audio track that may help the VI system. Such an “auto-tagging” system may detect the genre of a track, or whether it has vocals or is completely instrumental. Such side information can be used to steer the VI system on the right path. Exploring potential synergies between VI and other MIR tasks would absolutely be a worthy effort.
One of the major topics of my dissertation was the scalability of VI systems. To be taken seriously by industrial stakeholders, this is an important aspect to take into account. However, these days, information retrieval and machine learning research in general seem to put enough emphasis on this issue. Therefore, I think VI research just needs to be aware of the general tendencies in the related research fields, rather than devoting a considerable amount of effort to this aspect.
Historically, VI systems have always been designed to process and compare entire tracks or long segments of audio (e.g., 1-2 minutes). Although this is desirable behavior for many of the VI use cases, querying short audio segments (e.g., 5-15 seconds) may facilitate exploring some new commercial use cases that so far have not been studied. Although it sounds like a straightforward extension to the current capabilities of VI systems, the wide range of musical characteristics that can differ between versions makes it drastically more difficult to solve this problem using only a short snippet of audio. In fact, even humans may require a bit more than just a few seconds when they are asked to decide whether two tracks are versions of each other. Therefore, although intriguing, this use case requires serious effort before it yields reliable systems.
VI systems try to return the most similar item for each query independently, and because of this, the relations among a group of items are often discarded. However, by considering the versions that originate from the same musical work as members of the same set, certain postprocessing operations can be designed. Such operations may take advantage of the cases where VI systems are more confident in order to revise decisions in cases where the systems can’t point to a clear solution. Such postprocessing operations are generally low-cost but highly effective.
The points I explain below relate more to some of the current issues with VI research. Although such issues are mostly conceptual in nature and may not directly affect the performance of VI models, keeping them in sight while working on the short-term strategies outlined above may result in a healthier and more sustainable development of VI technologies.
Throughout my research, I have repeatedly mentioned that the main use case of VI systems is digital rights management and detecting cases of copyright infringement. An important point for achieving this goal is to correctly define what constitutes a copyright infringement. For this, we have to think about several questions: “what is a musical work?”, “what is a composition copyright?”, “what commonalities do two tracks need to share so that they can be considered as versions?”, etc. To study VI in any depth, one must answer such questions.
In our research, we have favored a quite permissive definition: a version can be considered as any reinterpretation of an existing musical work. Although such a definition allows us to continue our research from an academic perspective, it may not be practical for industrial applications. In the music ecosystem, two tracks are often considered versions not because of their musical similarities but because of agreements between rightsholders. This complicates our efforts to define versions and requires us to consult legal entities. Therefore, to design and evaluate VI systems that can seamlessly be used in industrial use cases, one must revise the definitions of musical version and musical work so that they comply with the legal ones.
Apart from the discrepancies between academic and legal definitions of versions, we must think about how to define versions in non-Western musical traditions. The assumptions that we make while designing our systems are likely to fail when applied to music from other cultures.
Lastly, one may need to consider musical tracks that are far from general tonal conventions. In Western music, we have a good idea of how similar two tracks should be to be considered as versions of each other; however, what about ambient music or soundscapes? Or cases where the identity of a track depends heavily on its timbral properties rather than its harmonic or melodic ones? Although these are mostly edge cases that VI systems may not encounter frequently, thinking about the extent of musical content those systems should handle can be useful in the future.
Because they are examples of information retrieval applications, VI systems are mostly evaluated using metrics from other information retrieval frameworks. Although this is convenient from an academic perspective where we mostly use carefully curated datasets, potential issues in industrial datasets may cause these metrics to behave in unexpected ways. One such issue is the presence of near-duplicates, which may drastically change the outcome of evaluation metrics like precision and recall (see our survey article). Beyond such unexpected behavior, some current metrics may not correspond to how such systems are evaluated in industrial contexts. Therefore, a careful study on designing or choosing evaluation metrics for VI may result in future benefits.
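To make the near-duplicate issue concrete, here is a toy sketch using average precision, a common ranking metric that combines precision and recall. The track identifiers and rankings are invented for illustration and are not results from any real system; the assumption is that near-duplicates of a relevant version are themselves labeled as relevant.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list for one query."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


# Curated case: two true versions (v1 is easy to find, v2 is harder).
clean_ranking = ["v1", "x", "v2", "y"]
clean_relevant = {"v1", "v2"}

# Same system, but the corpus now holds two near-duplicates of v1
# (hypothetical ids v1_dup1 and v1_dup2), which are trivially retrieved.
dup_ranking = ["v1", "v1_dup1", "v1_dup2", "x", "v2", "y"]
dup_relevant = {"v1", "v1_dup1", "v1_dup2", "v2"}

ap_clean = average_precision(clean_ranking, clean_relevant)
ap_dup = average_precision(dup_ranking, dup_relevant)
print(round(ap_clean, 3), round(ap_dup, 3))
```

In this toy setup the score rises when near-duplicates are present, even though the system is no better at retrieving the harder version `v2`. Near-duplicates of non-relevant tracks would push the score in the opposite direction.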
The behavior of VI systems on different genres has been underexplored. It is quite intuitive that systems that process only harmonic or melodic information cannot perform equally well on a wide range of genres; however, only a handful of previous studies have explicitly tried to quantify any potential discrepancies. Developing systems that perform well for most popular genres first requires an analysis of the drawbacks of current systems. By detecting the cases where the current systems perform badly, specific measures can be developed in order to obtain high-performance VI systems.
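A first step toward quantifying such discrepancies could be as simple as breaking an existing metric down by genre. The sketch below does this for top-1 accuracy (whether the first returned item is a true version of the query). The function name, the input format, and the placeholder genre labels are assumptions for illustration, and the toy values are not measurements.

```python
from collections import defaultdict


def top1_accuracy_by_genre(results):
    """Group top-1 outcomes by genre.

    results: iterable of (genre, top1_is_version) pairs, one per query.
    Returns a dict mapping each genre to its fraction of correct top-1 results.
    """
    counts = defaultdict(lambda: [0, 0])  # genre -> [hits, total queries]
    for genre, hit in results:
        counts[genre][0] += int(hit)
        counts[genre][1] += 1
    return {genre: hits / total for genre, (hits, total) in counts.items()}


# Placeholder outcomes for two unnamed genres (not real measurements).
toy_results = [
    ("genre_a", True), ("genre_a", True), ("genre_a", True), ("genre_a", False),
    ("genre_b", True), ("genre_b", False),
]
per_genre = top1_accuracy_by_genre(toy_results)
print(per_genre)
```

Reporting the number of queries per genre alongside each score would help avoid over-interpreting genres with very few examples.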