During ISMIR 2019, we had the chance to talk with Sergio Oramas. Sergio is a researcher in Music Data Science, specializing in Recommender Systems, Natural Language Processing, and Deep Learning with audio and text. He received his PhD in 2017 at the Music Technology Group of Pompeu Fabra University in Barcelona. He currently works as a Research Scientist at Pandora and is also an Assistant Professor of Machine Learning at Pompeu Fabra University. He has more than 20 publications in top-tier peer-reviewed conferences and journals, and holds a B.S. in Computer Science, a B.A. in Musicology, and a Major in Jazz Composition. He also composes for films and produces his own songs, and he sings and plays guitar, timple (a kind of ukulele from the Canary Islands), and synthesizers.
Thanks, it is nice to be here. About me, I did my PhD at Pompeu Fabra University with Xavier Serra. My PhD was about Natural Language Processing (NLP) applied to music recommendation and classification. I started with the long-tail problem, where most tracks are never discovered or listened to. My motivation was that I have a band and we needed to reach an audience. But how can a band find the right audience? So I decided to work with text. I found text more interesting than audio, especially because there is a lot of text about music on the internet that is not being used. So I started learning more about NLP and how to work with text. Afterward, I went to Pandora for an internship, where I started working with audio instead, and I learned more about audio and deep learning there. Then I found that text is interesting but audio is even more interesting, especially with the recent progress in deep learning. Finally, I decided to work with both and move to multi-modal approaches, to get the best out of both worlds.
It is nice. After my PhD, I started working for Pandora remotely from Barcelona, together with two other researchers also working remotely from Europe. Sometimes I have meetings at night with the people working in California, but it is not a problem. Moving to California was complicated, so I stayed in Barcelona and agreed to work remotely.
Currently, I am working on voice queries, that is, understanding what the user is asking for in a voice command. We try to understand the request and identify the entities and/or tags in it, so we can find the right item in the catalog. It is not so simple, because different words can be used to describe the same thing. Also, tags sometimes overlap with entities, i.e. some artists or albums have the same name as a tag, and we need to identify both. I do not work much with audio at the moment, though. The queries are first transcribed into text, which is someone else’s work, and then I work with the text to identify the tags and entities in the query.
Yes, there are different commands. I am working with the “play music” command, where the request can be a track, a tag, or a context. For example, people can ask for music for a certain activity, a location, or even some demographics. The most complicated part is that tags overlap a lot with entity names (e.g. artist or track names), which is harder to solve.
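To make the tag/entity overlap concrete, here is a minimal sketch in Python of the kind of ambiguity he describes. The catalog, tag vocabulary, and matching logic are invented for illustration, not Pandora’s actual system: a single transcribed query term can surface both an entity reading and a tag reading, leaving the choice to a downstream ranker.

```python
# Invented sketch of entity/tag lookup on a transcribed voice query.
# Catalog, tag vocabulary, and matching are illustrative only.

CATALOG_ENTITIES = {
    "chill": {"type": "artist", "id": 42},       # an artist literally named "Chill"
    "the beatles": {"type": "artist", "id": 7},
}

TAG_VOCABULARY = {
    "chill": "mood/chill",          # same surface form as the artist above
    "workout": "activity/workout",
    "jazz": "genre/jazz",
}

def interpret(query: str) -> list:
    """Return every entity and tag reading of the query's object.

    Ambiguous terms such as "chill" yield both an entity and a tag
    candidate; a downstream ranker would choose between them.
    """
    text = query.lower().strip()
    if text.startswith("play "):
        text = text[len("play "):]
    candidates = []
    if text in CATALOG_ENTITIES:
        candidates.append(("entity", CATALOG_ENTITIES[text]))
    if text in TAG_VOCABULARY:
        candidates.append(("tag", TAG_VOCABULARY[text]))
    return candidates

print(interpret("Play chill"))
# -> [('entity', {'type': 'artist', 'id': 42}), ('tag', 'mood/chill')]
```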
Not yet. It is hard to get accepted at NLP conferences when you work on music, especially when it is an industrial application. And at ISMIR, I have never found much interaction around NLP work. So I am thinking of starting a workshop on NLP applied to MIR, which would be an interesting venue for people working on both topics to get together.
During my internship, I worked mainly on the cold-start problem using audio. It is easy to recommend tracks that people are already listening to, but it is harder to recommend the rest of the catalog. Some of this work is already going into production, but I am no longer working on it, so I do not know its current state.
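As a rough illustration of the idea (a generic content-based approach, not the specific model used at Pandora), a long-tail track with no listening data can still be recommended by embedding its audio and ranking it by similarity to tracks the user already likes. The embeddings below are random placeholders standing in for the output of a trained audio model.

```python
import numpy as np

# Sketch of content-based cold-start recommendation: tracks with no
# listening data are embedded from audio alone, then recommended by
# proximity to tracks the user already likes.

rng = np.random.default_rng(0)
n_tracks, dim = 1000, 128
audio_embeddings = rng.normal(size=(n_tracks, dim))   # one vector per track
audio_embeddings /= np.linalg.norm(audio_embeddings, axis=1, keepdims=True)

liked_track_ids = [3, 17, 256]              # tracks the user already knows
long_tail_ids = np.arange(500, n_tracks)    # tracks with no play counts

# User taste = mean of the liked tracks' audio embeddings.
taste = audio_embeddings[liked_track_ids].mean(axis=0)
taste /= np.linalg.norm(taste)

# Rank long-tail tracks by cosine similarity to the taste vector.
scores = audio_embeddings[long_tail_ids] @ taste
top = long_tail_ids[np.argsort(-scores)[:10]]
print("cold-start recommendations:", top)
```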
One interesting problem is multi-stakeholder recommendation. Currently, we recommend based only on what the user likes and dislikes, but we should also consider other stakeholders, such as the artist or the record label. Instead of optimizing metrics for user satisfaction alone, we could also optimize metrics for those other stakeholders, which is interesting work.
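A toy way to picture this (the weighting scheme and data are invented for illustration) is a greedy re-ranker that blends the usual user-relevance score with an artist-exposure term, so that a single artist or label does not dominate the list.

```python
# Toy multi-stakeholder re-ranking: blend user relevance with an
# artist-exposure term. Scores, weights, and data are illustrative only.

candidates = [
    # (track_id, artist_id, user_relevance_score)
    ("t1", "a1", 0.95),
    ("t2", "a1", 0.90),
    ("t3", "a2", 0.70),
    ("t4", "a3", 0.65),
]

alpha = 0.7  # weight on user satisfaction vs. artist exposure

def rerank(candidates, alpha):
    exposure = {}            # how often each artist already appears
    ranked = []
    pool = list(candidates)
    while pool:
        def score(item):
            _, artist, relevance = item
            # Penalize artists already well represented in the list.
            novelty = 1.0 / (1 + exposure.get(artist, 0))
            return alpha * relevance + (1 - alpha) * novelty
        best = max(pool, key=score)
        pool.remove(best)
        exposure[best[1]] = exposure.get(best[1], 0) + 1
        ranked.append(best)
    return ranked

for track, artist, relevance in rerank(candidates, alpha):
    print(track, artist, relevance)
```

With alpha near 1 the list reduces to pure user relevance; lowering it spreads exposure across more artists, which is one simple way to trade user satisfaction against stakeholder metrics.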
Maybe the artist side is harder to define, but for genres, for example, you may want to balance recommendations across different genres. I am not very sure, because I have not worked on it yet, but it is an interesting problem.
Yes, that’s true. But why are people listening to popular artists? Maybe it is partly because we reinforced that behavior somehow. So it needs more study for sure.
Ah, that is totally true. Even as a consumer, I care more about listening to new music, so I put on the radio or recommendations, and it is true that you listen to the tracks without having much information about them. But someone said something relevant recently: we need to give the user some control over the recommendations. For example, give them the option to select different modes, such as an “explore mode”, “more obscure tracks by this artist”, “tracks similar to this artist”, or just “popular tracks”. We also need more transparency about the recommendations, so the user can understand why things are being recommended and get more involved in the process and the music listening experience.
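One way such modes could work under the hood (a hypothetical sketch; the mode names, weights, and scores are made up) is as user-selectable trade-offs between similarity to a seed artist, track popularity, and chance. The active weights also give a simple handle on transparency, since they explain why a track was picked.

```python
import random

# Hypothetical sketch of user-selectable recommendation modes: each
# mode is a different trade-off between similarity to a seed artist,
# track popularity, and randomness. All values are made up.

MODES = {
    "popular": {"similarity": 0.3, "popularity": 0.7, "random": 0.0},
    "similar": {"similarity": 0.9, "popularity": 0.1, "random": 0.0},
    "obscure": {"similarity": 0.7, "popularity": -0.4, "random": 0.0},  # penalize hits
    "explore": {"similarity": 0.1, "popularity": 0.0, "random": 0.8},   # mostly chance
}

def score(track, mode, rng):
    w = MODES[mode]
    return (w["similarity"] * track["sim"]
            + w["popularity"] * track["pop"]
            + w["random"] * rng.random())

tracks = [
    {"id": "t1", "sim": 0.9, "pop": 0.95},  # well known and very similar
    {"id": "t2", "sim": 0.8, "pop": 0.05},  # similar deep cut
    {"id": "t3", "sim": 0.3, "pop": 0.10},  # obscure and loosely related
]

rng = random.Random(0)
for mode in MODES:
    best = max(tracks, key=lambda t: score(t, mode, rng))
    # The active weights double as a simple explanation of the pick.
    print(f"{mode}: {best['id']} (weights={MODES[mode]})")
```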
Thank you, same to you.