Deep beers

A chat with Philippe Esling

Posted by Javier Nistal on January 04, 2020 · 22 mins read

It is my pleasure today to share with you my chat with Philippe Esling, I would say one of the most eccentric and creative researchers you can meet today in the MIR community. Besides being an acknowledged synthesis geek, a fan of Noam Chomsky and Trotskyism, and a fervent defender of procrastination, Philippe is an associate professor and researcher at IRCAM, where he leads the ACIDS (Artificial Creative Intelligence and Data Science) research group. He also teaches computer science and mathematics at Sorbonne University and machine learning in the ATIAM Masters. He received an MSc in Acoustics, Signal Processing, and Computer Science in 2009 and a Ph.D. on multiobjective time-series matching in 2012. Afterward, he did a post-doc in ecological monitoring and meta-genetics in the Department of Genetics and Evolution at the University of Geneva in 2012. In this short period, he authored and co-authored over 15 peer-reviewed papers in prestigious journals such as ACM Computing Surveys, Proceedings of the National Academy of Sciences, IEEE TASLP, and Nucleic Acids Research. He received a young researcher award for his work in audio querying in 2011 and a Ph.D. award for his work in multiobjective time series data mining in 2013. His contributions to the field of computational music creativity have been of high relevance. Among other things, he participated in the development of Orchids, the first computer-aided orchestration software, which is used by a broad community of composers. His latest research on A.I.-driven audio generation focuses on learning musically meaningful latent representations from data. Recent papers produced by his team allow regularizing the topology of these latent spaces based on perceptual aspects and performing timbre style transfer between instruments. His contributions have not only advanced the state of the art in the field but have also materialized into creative tools used by artists to perform pieces played at renowned venues.

From the descriptions I found online, I see that you have a pretty interdisciplinary background. Can you tell me a little bit about it?

So basically, I have a BSc in Mathematics and Computer Science. Actually, I started doing pure mathematics for the first years of uni, and then I went to a computer science school and did a mixture of both. After that, I ended up doing an MSc in distributed processing, which was purely computer science. Next, I went to IRCAM to do the Acoustics and Signal Processing master’s. Following that, I did a Ph.D. in multiobjective time series matching, more related to data mining. Afterward, I went to work on genetics, which was really fun! I wanted to try something different. It was a very cool project focused on meta-genomics. In particular, I studied ancient DNA material recovered from the deep sea for environmental monitoring. Meta-genomics means that you look at not only how the genome expresses itself but how genetic communities behave. The cool thing, for instance, is that if you dig a hole in the sea floor, you can find traces of DNA, and every 10 cm, you go back around 1k or 10k years, depending on the sediments. So then, you can read it back in time (back-to-the-future kind of stuff!). In particular, for this project, I was involved in a somewhat geeky task, dealing with the approximately 100Gb/h of information generated by the genetic sampling and sequencers! Impressive, right? Anyway, then I came back to Paris, and I got a professor position super quickly, which was really lucky. I was hired at Sorbonne University and IRCAM to work on A.I., first focused on doing statistical inference and Markov models of symbolic music. To be honest, I found it kind of a weird thing to do symbolic analysis when you talk about creativity. Right now, I don’t see a future in symbolic, but that’s another story…

So you’ve gone through research in time-series, meta-genomics, and music. But which one really came first?

Music was always there somehow. I’ve always been a modular synthesis nerd! However, in the beginning, I wanted to do something related to hardware and distributed computing, i.e., the behavior of chaotic-type computation. Then, I found out about IRCAM, and I said to myself: “Wow! I can do maths, computing, and modular synthesis, that’s awesome!” It was kind of a revelation for me. Back then, the meta-genomics thing was more of a change of scenery. The work I did was fascinating, although, at the same time, it was very overwhelming. In biology, there can be a lot of pressure and competition for publishing; you cannot imagine! Even though I felt happy about doing something innovative and ecological, which could help the environment, this drive for publishing and getting funding disappointed me so much… These and other obscure experiences I had made me reconsider my situation and go back to music.

Have you ever thought of combining both interests (genetics and music) into a single project?

Actually, yeah! I did this weird thing that got rejected from every single conference, which I called The molecular clock of music. I don’t know if you’ve heard about the theory of phylogenetics. It’s a very important theory that establishes how species are related to each other through an evolutionary tree. So if you take the genetic sequences of two species, look at what are called the conserved regions and the consensus sequence, and compute the amount of discrepancy and dissimilarity between them, you can infer when the two species were the same. So, for example, at one point, monkeys and humans diverged from the same species and gradually became more separated. But monkeys and humans were still related a long time ago with, I don’t know… ducks?! So from these measures, one can infer this tree, which gives the distance between species and the time it took for them to split. So I thought: how about if we applied the same principle to music? Like taking the symbolic sequence of structures, and then trying to infer when two genres started. Then, one could go back in time and try to infer the genre that originated these, which may be lost. The theory is pretty funky, although pure speculation, and I did not have anything to back it up, which is the reason why it got rejected from all conferences.
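To make the molecular-clock reasoning above a bit more concrete, here is a minimal sketch in Python. It is not the method from the rejected paper (the interview gives no details of it); it only illustrates the textbook idea of counting mismatches between two aligned sequences, correcting that count for repeated changes at the same position, and dividing by an assumed constant rate of change. The Jukes-Cantor correction, the `rate` parameter, and all function names are assumptions chosen for illustration.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(p, n_states=4):
    """Expected changes per position, corrected for repeated changes.

    Assumes every symbol is equally likely to turn into any other one
    (the Jukes-Cantor model); n_states is the alphabet size (4 for DNA).
    """
    limit = (n_states - 1) / n_states
    if p >= limit:
        raise ValueError("sequences are too divergent for this correction")
    return -limit * math.log(1.0 - p / limit)


def divergence_time(seq_a, seq_b, rate, n_states=4):
    """Estimated time since the two lineages split.

    rate is an assumed constant number of changes per position per unit
    of time; the factor 2 accounts for both lineages changing independently.
    """
    d = jukes_cantor_distance(p_distance(seq_a, seq_b), n_states)
    return d / (2.0 * rate)


if __name__ == "__main__":
    # Two toy aligned sequences that differ at one position out of eight.
    print(divergence_time("ACGTACGT", "ACGTACGA", rate=1e-3))
```

For a musical analogue, one could imagine replacing the DNA letters with symbolic tokens (chords, rhythmic cells) and setting `n_states` to the size of that alphabet, although whether a constant rate of change holds for musical genres is exactly the speculative part.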

Going back to your work at IRCAM. You lead the ACIDS project, focused on developing creative tools for making music. Can you tell me a little bit more in-depth about the ongoing research?

So the ACIDS project is a little bit chaotic… When I go to conferences, I have to pretend that we have structure, a specific plan, etc. Actually, it is people shooting in all directions. I find it beautiful! All projects do have the common goal of trying to develop interactive tools for musicians. It is what we like to call co-creativity. We try to build technologies that are right in the middle between traditional tools, which may have very complex interfaces, and something where you simply push a button and it generates music. Much of the research out there on artificial creativity actually does this: it generates scores or audio. But that is entirely useless for musicians and composers. In our group, we try to build tools that are controllable, useful, and designed around the desires of specific artists. This last point can actually be very hard. In many cases, artists do not know what they really want, or we don’t understand them, so the artist comes back saying: “that’s exactly not what I was looking for!” It can be a bit of a painful process. For instance, recently, I met this fantastic German composer, Alexander Schubert. He composes these weird pieces for violins that are played in a way that makes them sound kind of dubstep-like. We talked about starting a collaboration, and he asked for a noise generator. This seemed a bit counterintuitive. Who the hell wants something that spits out noise? This is the worst thing you can ask of a machine learning model! However, he was asking for noise that could be musically interesting, and that’s very exciting! So we started building a model that generated sounds like those from scratching the body of a violin, tapping the strings or, for instance, putting a nail on the strings; basically, all types of sounds that are musically interesting while still being very close to noise.
We had something almost working, and then he came and said: “if you want, I can record more of these.” We thought: “cool, more data!” He ended up sending us a one-hour recording of him using a drill on the violin… This guy is a genius!

Are you talking about changing the corpus (using one that contains these noise-like sounds)? Or were you interested in finding a way to encourage generative models to learn representations from which to generate these noisy, but at the same time musically interesting, sounds without them explicitly existing in the data?

That’s actually my biggest question right now and a fascinating topic in artificial creativity. I think I might have a solution for that, but it’s gonna take me at least a year or so just to formulate it correctly. Right now, what we are doing is more akin to finding areas of dense probability from which we can create things that, of course, are not part of the dataset, i.e., not exactly replicated, but are variations of it. I would say that we are generating new content, yes, but it’s true that those are not the regions of the space that are actually interesting. That’s why we usually try to go around this question. Creativity is supposed to be defined, though we do not yet know what it means; we just assume it must be some nice interpolation in some regions of an unknown latent space. So right now, what we are doing is trying to alleviate this problem with this idea of co-creativity, where the generation of novel material gives at least a new sense of how the musician in front could interact with it; basically, having models that are responsive to different ways of interaction. For instance, I find it pretty useless to say: “ok, we are gonna build a model for audio synthesis conditioned on pitch and dynamics, etc…” What’s the difference with a traditional synthesizer? Nothing!
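For readers unfamiliar with the phrase “interpolation in a latent space”, the following minimal Python sketch shows what it usually means in practice: taking two latent codes and walking in a straight line between them. It is only an illustration, not code from the ACIDS group; the function name and the toy two-dimensional vectors are assumptions, and the trained decoder that would turn each point into sound is deliberately left out.

```python
def interpolate_latents(z_start, z_end, n_steps):
    """Return n_steps points on the straight line from z_start to z_end.

    Each latent code is a plain list of floats. In a real system every
    point would be passed to a trained decoder (not shown here) to
    obtain one sound per step.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same dimension")
    if n_steps < 2:
        raise ValueError("need at least the two end points")
    path = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # goes from 0.0 to 1.0 inclusive
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


if __name__ == "__main__":
    # Toy 2-D latent codes standing in for two encoded sounds.
    for point in interpolate_latents([0.0, 2.0], [1.0, 0.0], 3):
        print(point)
```

Every point on such a path stays between codes the model has already seen, which is one way to read the concern above that these are not necessarily the interesting regions of the space.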

Well, you could argue that these models can learn rather abstract and musical parameters from data, instead of the low-level technical controls that conventional synthesizers offer.

Oh yeah, absolutely. That is the exciting aspect of these models. However, you will get stuck on a different problem: the identifiability of the latent space. I encountered this problem with one of my recent works, the Flow-based synthesizer. It worked pretty well, although while playing with it, I can tell you, I was completely lost in my latent space. I ended up racking my brain over what each control was doing. My goal was to build an expert disentangled space, but I ended up with a black-box synthesizer with controls whose function you don’t know. It was very frustrating for me, since I know synthesis, and it seemed weird as an interaction protocol to not know what the parameters meant. But when I gave it to my girlfriend, for instance, she very much enjoyed playing with it. It is interesting how this byproduct arises from complex systems with simple controls.

I guess this makes sense. For synthesis geeks, complexity can somehow be part of the end goal? Doing something that nobody else understands…

That’s true. Each person will have a different formulation of art and creativity. For someone like me, who knows synthesis, having this thing where you don’t know what’s going on can be frustrating. It is a little bit the same with the researcher who thinks about creativity as being a pure object of knowledge. In contrast, most artists are doing some nonsense thing that eventually ends up being cool. This is beautiful! What I said about Alexander when he drilled the violins… Somehow that can be more creative. And that’s a bit of the problem I have with orthodox creativity research, because it focuses on modeling something which I think is supposed to be completely erratic. And for me, intellectually, this is the thing that prevents me from sleeping these days: are we really doing something creative? Because we are using expectations, meme machines, and we say it’s creative because it’s doing something new. And I’m not entirely sure about that…

How do you see the future generation of music technologies? You mentioned that you don’t have a very structured plan, but do you have a specific vision about your work?

I have the attention span of a fly. That’s a problem, I guess. I’m not very proficient at staying on one single question and keep on changing projects. To be honest, I don’t understand people who have been working on, let’s say, source separation for twenty years. That’d be mental for me! How can you spend twenty years looking at how to improve your model by 2%? Sometimes you need new pursuits. In our field, we have the liberty of sometimes saying that we are doing music and other times that we are doing science. I find all these people who strictly stick to the science part surprising… But anyway, if I had to mention a vision, I’d say the personalization and democratization of music creation. Many people want to D.J., produce music, and so on. However, they may run into a lot of technological and knowledge barriers. We have gradually pushed the boundaries of how we can create while removing more and more of the music-theoretical and technological aspects of it. This expert knowledge is getting less necessary in modern music. So I can think of machines that will allow you to speed up your creation workflow while adapting to your music taste on the fly. I think this will be the next big step: how to enable this personalization of music.

I think that these technologies are reaching a pretty good stage of maturity for being integrated into commercial tools. However, I find it surprising that in the industry, traditional audio companies such as Native Instruments, Steinberg, Waves, etc., do not seem to be contributing much to this wave nor integrating any of these technologies into their products. On the other hand, companies that had nothing to do with audio (Google, Facebook, Nvidia) or new startups (Accusonus, Jukedeck, Landr) are the ones taking the lead on this. What’s your opinion about this?

I’m not gonna tell you what I think about companies… But for me, it’s always a problem when money is involved. For instance, you cite Facebook, Google, Nvidia, etc. They are not doing it for the money; they already have loads! They do it for fun. Even though they are huge companies, they are doing it in a purely joyful manner, which enables them to build groundbreaking stuff. In contrast to these companies, I was at Native Instruments H.Q. for a meetup some time ago. The guys are super talented, but they kind of dissolved the whole research team. They make much more money by just selling hardware. In the short term, research is kind of useless, so basically, they won’t spend energy and effort on something which they cannot exploit immediately. On the other hand, I think it’s beautiful to see that startups are trying to make a difference and be game-changers. However, most of these startups will grow to the point at which bigger companies absorb them. Not because their technology is gonna be groundbreaking, but just for the sake of letting them die, fearful of how well it works and how costly it would be to adopt it. In general, I think that money spoils everything somehow.

To wrap it up. If you could go back to your Ph.D. days, what would you tell young Philippe Esling?

I would tell him to enjoy himself more. I’m now starting to understand that if you really want to make a change, you must have lots of failures. People, in general, get absorbed by low-level success, in part because they have this imposition from the environment. I’m not sure if I could have done anything about it, but I would say to myself this: “if you want to revolutionize the world with your work, maybe you have to fail your Ph.D. thesis…” Which is a piece of terrible advice, right? But we should stop being so concerned about small-scale issues. Today, I think that one needs a lot of time and free thinking before arriving at an idea that is sufficiently important and well-formulated to get to the really fascinating problems. Just scraping at little things doesn’t bring a revolution overnight. Sometimes, the things one deems important are just a shadow of what is really important. Spending some time to take a step back and observe, without necessarily being proactive, is more important than anything. I would say that in order to formulate a question that can have a real impact, one has to be somehow procrastinating. To get to the interesting space of ideas, you need to take your time to see the bigger picture. When doing my Ph.D., I was always worried about producing, producing, and producing. Which is sound advice when you have to publish, finish your thesis in three years, etc. You have all these obligations, but doing something great means taking your time. For instance, the guy who solved the Poincaré conjecture spent several years without publishing anything of relevance and without having an apparent impact. Many would have said that he was a terrible researcher. But then, years after his Ph.D., he comes up, apparently from nowhere, with the proof of one of the Millennium Prize maths problems!


Philippe Esling and Carlos Agon. “Multiobjective time series matching for audio classification and retrieval.” In IEEE Transactions on Audio, Speech, and Language Processing 21.10 (2013): 2057-2072.

Philippe Esling, Naotake Masuda, Adrien Bardet, Romeo Despres and Axel Chemla–Romeu-Santos. “Universal audio synthesizer control with normalizing flows.” In Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019.

Philippe Esling, Axel Chemla–Romeu-Santos and Adrien Bitton. “Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics” (2018).