Hi, everyone. I'm Mateusz. I'm based in Warsaw, Poland. I'm an assistant professor at the Warsaw University of Technology, which is this lovely building over here. I've also worked at Apple in London, and I've worked at some other cool companies as well, but for the past 10 years or so, I've also been a professional musician. My primary instrument is drums, but I also play a little bit of synthesizers, and I do a little bit of music production. And over the years, I've been very fortunate to play with some amazing Polish artists. I've played hip-hop with Grubson. I've been a part of The Dumplings, a very Polish name indeed, which is a popular synth-pop band. I've played very sad and very soulful music with Marek Dyjak, and I've played very cheerful music with Majka Jeżowska, who is somewhat of a children's music legend in Poland. And the thing with children is that they grow up and start attending music festivals, and this is where you end up. Along with Michał Miłczarek and Bartek Łuczkiewicz, I also have a jazz trio that has been very fortunate to tour internationally quite a bit. I've also played with some terrible bands, and I've had shows canceled because zero tickets were sold, so I've had the full spectrum of experiences of being a musician.

And I work in this interdisciplinary research area that combines music with artificial intelligence, digital signal processing, computer science, and the whole kitchen sink of other disciplines. It's known as MIR, which stands for Music Information Retrieval. The goals of MIR are processing music in useful ways, generating new, previously unheard musical content, and in the process, hopefully, enhancing our perception of music.

So let's jump right in and see, first of all, how we can interact with music using Python. This is a pedalboard: a device that you plug your guitar or synthesizer into in order to modify its sound. Each of these little pedals is a different effect, so you might have an echo, a compressor, or a reverb. And Pedalboard is also the name of an awesome audio processing library developed by Spotify. It emulates many popular effects, and as of a couple of weeks ago it also supports third-party VST plugins, which is just awesome, and it's actually very easy to use. And by very easy to use, I mean you just import what you need, you create an instance of a pedalboard, you instantiate the effects with the parameters you need, and you run your audio through the board. It's as simple as that (there's a rough sketch of this below), so the next time you're creating a presentation and recording your voice, you can give your voice that little bit of magic using Python, which is just great.

Another library is called Pyo, and this time it's a low-level digital signal processing library. It also has a synthesis engine, so you can implement your own synthesizer and actually play melodies with this library, and what's even cooler is that it's very fast, so it's suitable for live performances. A couple of EuroPython conferences ago, there was this amazing presentation by Matthieu Amiguet, who has an ensemble that plays traditional, medieval, and Renaissance music on traditional instruments, which is amazing on its own. But what's even more interesting is that they use Pyo scripts which are running during the show.
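Before going further, here is a minimal sketch of the Pedalboard workflow described a moment ago. The file names are placeholders, so swap in your own audio:

```python
from pedalboard import Pedalboard, Compressor, Reverb
from pedalboard.io import AudioFile

# Read the input audio (for example, a recorded voiceover).
with AudioFile("voice.wav") as f:
    audio = f.read(f.frames)
    samplerate = f.samplerate

# Build a board from a chain of effects and run the audio through it.
board = Pedalboard([Compressor(threshold_db=-20, ratio=4), Reverb(room_size=0.3)])
effected = board(audio, samplerate)

# Write the processed audio back out.
with AudioFile("voice_with_magic.wav", "w", samplerate, effected.shape[0]) as f:
    f.write(effected)
```

And to give a flavour of what a live Pyo script might look like, here is a tiny, hypothetical sketch (not the ensemble's actual code) that boots the audio server and runs a sine tone through an echo:

```python
from pyo import *

s = Server().boot()                                 # boot the audio server
s.start()
tone = Sine(freq=440, mul=0.2)                      # a simple oscillator
echo = Delay(tone, delay=0.25, feedback=0.5).out()  # echo effect, routed to the speakers
s.gui(locals())                                     # keep the script alive with Pyo's small GUI
```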
They have custom loopers and effects implemented as an integral part of their show, so Python is almost like another band member when they're playing. It's just fantastic.

Another library is librosa, which is an audio processing, visualization, and feature extraction library. It's very feature-rich and it has really sensible default parameters for many of the operations, so you can actually learn about digital signal processing while you're getting familiar with the API of this library. And the API is functional, so think more in terms of NumPy rather than scikit-learn. Over the years, librosa has become kind of the research and industry standard for music analysis. So if you want to create, say, a spectrogram, which is a standard way of visualizing audio, this is how we can do it using librosa (a sketch of the snippet is included a bit further down). First there are just the imports. Then we load a sample of a trumpet, which comes bundled with the library. Then we call the STFT function, which stands for short-time Fourier transform, which is the operation that we need, whatever that means. Then we just rescale it to decibels, and there's even a set of special functions that allow you to visualize your features. And this is what we get: a visualization of the mixture of frequencies that can be found in our signal.

So far, we've only been talking about one representation of music, which is audio. So, something that we can directly listen back to, something that's listenable; perhaps it's a WAV file, or it's stored in an MP3 file. But this is not the only representation of music that's interesting for us. There's also what's called symbolic music, and the original, OG symbolic format is just sheet music, so music that's written on a sheet of paper. You can't listen back to paper, but you can give it to a musician who can perform it for you, and that's how you actually listen back to what's written on the paper. In the digital realm, we also use symbolic formats like MIDI, ABC, or MXL. In this case, we would run the file through a synthesizer or import it into a digital audio workstation, and that's how we would listen back to music in symbolic formats. So the key distinction is that audio is immediately listenable and symbolic music is not.

And Python also has a rich ecosystem of libraries for working with music in symbolic formats. The gold standard seems to be pretty_midi by Colin Raffel, which allows you to read, edit, and process MIDI. A wonderful library is MusPy, which also has some automated functions for analysis and metrics. And one of the newer ones of the bunch is called MidiTok, which gives you a set of MIDI tokenizers. MIDI tokenizers are needed to make MIDI work with some of the newer machine learning models, for instance Transformers.

Okay, I've uttered the words machine learning. So let's see what we can actually learn from music, and what some of the most common MIR tasks are. Music tagging is kind of the mother of all music information retrieval tasks. We start off with some music, we run it through a model, which will usually be a neural network, and we expect to get some tag as the output. So in this case, this is an instance of genre recognition or genre classification, and the tag that we get is rock. But in most cases, we are interested in a bunch of tags that are non-exclusive.
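For reference, the librosa spectrogram recipe described above boils down to just a few lines. This is a rough sketch of it; the plotting details may differ slightly from the slide:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the trumpet sample bundled with librosa.
y, sr = librosa.load(librosa.example("trumpet"))

# Short-time Fourier transform, then rescale the magnitudes to decibels.
D = librosa.stft(y)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Visualize the result with librosa's plotting helpers.
fig, ax = plt.subplots()
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
plt.show()
```

And for a small taste of the symbolic side, here is a hypothetical pretty_midi sketch that writes a one-bar MIDI clip from scratch:

```python
import pretty_midi

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano

# Four quarter-note middle Cs at 120 BPM (each quarter note lasts 0.5 s).
for beat in range(4):
    start = beat * 0.5
    piano.notes.append(pretty_midi.Note(velocity=100, pitch=60, start=start, end=start + 0.5))

pm.instruments.append(piano)
pm.write("four_notes.mid")
```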
Coming back to tagging: we may ask about tags that tell us about the genre, the instrumentation, perhaps the decade, perhaps the mood of the music. And this is a very important task, because automated analysis of the content of music is one of the things that really powers recommender systems and all kinds of analytics when it comes to music. And the good news is that many of the state-of-the-art music tagging models have very good, very well-written open implementations in PyTorch, so you can just start exploring right away and work through the whole timeline of models.

Another interesting MIR task is what's known as source separation. Again, we start off with some music, we run it through a model, which in most cases will also be a neural network, and we expect it to break down the audio into individual instruments. A couple of years ago this was like magic, almost every mixing or mastering engineer's dream, and it is very much a reality right now. The most common setting is separation into four sources: vocals, drums, bass, and other. This is mostly because of the way the first source separation data sets were built, and other configurations are also possible. A very popular and well-known model for source separation is Demucs. This was released by Facebook, or Meta, in 2020. It is what's known as a U-Net type of neural network architecture, and it looks absolutely terrifying, which is why we are not going to explain it; we are going to listen to it instead. So I've prepared this short little beat especially for this conference. Okay, and I've already pip-installed Demucs, so let's try to source-separate it. You can also specify how many stems you are separating into; in this case we just want to separate the drums from everything else. Okay, it looks like something worked. Looks like we've got a folder. Let's listen to how this actually sounds. So this is no drums, and this is drums. So that's Demucs. Just fantastic. Awesome little tool to have. Thank you. Some other source separation models that are also available and that you can play with are Spleeter by Deezer, which also works really well, and there's also a library called nussl by Ethan Manilow at Northwestern University. This library also contains some more classical models, like matrix factorization, so it's not only deep learning.

Another very interesting MIR task is learning music representations. We start off with music or some audio, but audio is like hundreds of thousands of numbers; it's a very heavy data type for computers. What we would like to have instead is some other, compressed representation that's still meaningful in some way. It might not be relevant or meaningful to a human, but it is to a machine. And I think you already know where this is going: let's just ask a neural network to produce this. Oh, and these representations are known as embeddings. One of the best-known embedding models is OpenL3. This is also something you can pip install, and it has a very nice command line interface. It's actually two neural networks that were trained on an audio-visual correspondence task. So basically, we are training a neural network to determine whether what we are seeing is the same as what we are hearing: we take clips or frames from a video and one second of audio, and we try to determine whether they come from the same clip.
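By the way, the Demucs run from the demo can also be called from Python instead of the shell. This sketch follows the calling convention shown in the project's README; the file name is a placeholder:

```python
import demucs.separate

# Separate just the drums from everything else ("two stems").
# The results land in a "separated/" folder next to where you run this.
demucs.separate.main(["--two-stems", "drums", "my_beat.wav"])
```

And extracting OpenL3 embeddings is only a couple of lines as well. A minimal sketch, assuming a WAV file on disk:

```python
import openl3
import soundfile as sf

audio, sr = sf.read("my_beat.wav")
emb, ts = openl3.get_audio_embedding(audio, sr)
# emb holds one embedding vector per analysis frame, ts the corresponding timestamps.
```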
And in the process of doing that correspondence training, the neural network learns to produce these compact representations, which can then be used in downstream tasks. Another, newer model is called CLAP, and here we are trying to match the audio with a text description. So now we are producing embeddings based on a text description; we are basically learning whether what we are hearing is the same as what we are reading about. And this will come in handy in just a few minutes. You can try this model out either from the official Microsoft repo or, since quite recently, thanks to Hugging Face, you can just import a CLAP model from Transformers, which is just amazing.

And this is just scratching the surface. I mean, there's a plethora of music information retrieval tasks, like transcription, beat and pitch tracking, optical music recognition for sheet music, artist similarity analysis, et cetera, et cetera. But one thing that has recently been getting really, really a lot of attention is generative AI. So let's also take a look at how we can generate music using AI. Again, we've got two settings: audio or symbolic music. In terms of audio, we will be talking about neural audio synthesis, which is a fancy term for generating audio. And for symbolic music generation, what we will be doing is only composing music using AI, and leaving the rendering and the creation of the actual sound to someone else.

So let's take a look at symbolic music generation first, because it's the easier setting. There's this paper from 2016 that I really like for a whole bunch of reasons. It's called Music Transcription Modelling and Composition Using Deep Learning. The authors use a recurrent neural network, which by today's standards is a pretty simple model. They use music in what's known as the ABC format, which is basically text with some additional metadata. It can be rendered down into MIDI, and that can be further rendered down into audio, so you can listen back to it. So this is basically the task of predicting the next character in a sequence of text. And I really like this paper also because the authors provide a very nice musical analysis of the results. Mind you, this is not the newest model at all, but they find that what they are able to generate follows the typical style of traditional Irish music, which is what they were training on. It has correct repetition, variation, harmony, and rhythm. The results are checked against the data set for originality, to determine whether the network isn't just memorizing everything. And what they find is that it makes no glaring mistakes, which is pretty nice.

And this idea of using language modeling to enable AI-assisted composition has been played with and developed quite a bit, also by the people at Google. There is a team called Google Magenta, which is kind of like their creative AI division, and they have proposed a bunch of models. Let's just listen to one of those, and this will be the Music Transformer from 2018. I mean, that's pretty nice, right? It has a clear feeling of forward motion. It has these slight tempo shifts that are pretty characteristic of classical music. It's coherent over several seconds. So yes, Transformers are also used for composing music. And these ideas are still being developed, and right now we are in the process of looking for ways to actually incorporate this into real musicians' workflows and enhance human creativity with it.
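Circling back to the CLAP model mentioned a few minutes ago: loading it from Transformers really is just a few lines. Here is a rough sketch, assuming one of the public LAION checkpoints and a clip loaded as mono 48 kHz audio; the exact processor arguments may vary between library versions:

```python
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Load a clip as a mono array at 48 kHz, the sample rate CLAP expects.
audio, sr = librosa.load("my_beat.wav", sr=48_000, mono=True)

inputs = processor(text=["a slow, sad piano ballad", "fast technical death metal"],
                   audios=audio, sampling_rate=sr, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the audio to each text prompt.
probs = outputs.logits_per_audio.softmax(dim=-1)
```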
Okay, so let's take a look at the second case, which is neural audio synthesis: generating audio. This is a hard problem, because audio is a very heavy data format; for a computer, audio is just a super long one-dimensional vector of numbers. So basically, when we are generating audio, we are modeling very complex, abstract, and very human relationships between samples that are sometimes millions of samples apart. The initial idea of how to do it was to kind of muscle through this problem in an autoregressive fashion and just generate audio sample by sample, and this was pre-attention. So we are just looking back at the previous samples in a hierarchical order and trying to generate sample by sample.

And I'm going to do a pretty big segue over here. Let's segue to 2020. The idea of how to do neural audio synthesis has changed quite a lot, because nowadays we mostly use encoder-decoder models. We try to come up with a compressed space that encapsulates the character of our audio, and then decode and produce some music out of it. One of the models that works in this fashion is Jukebox by OpenAI. It's a model that takes as its input the genre, the style, and some lyrics. Let's listen to an example of how rock in the style of Elvis Presley could sound when hallucinated by OpenAI's model. You can hear some artifacts, but still, that's scary good at times. And people have been playing around with this, sometimes in really crazy ways. One of my favorite examples is the programmer-slash-musician duo Dadabots. These are guys who create infinite live streams of AI-generated music, and my favorite one is called Relentless Doppelganger. It's an infinite live stream of technical death metal that has been playing on YouTube for almost four years as of now. It's playing 24/7. Great stuff, highly recommended.

But again, we are also looking for ways to incorporate this into actual musicians' workflows, and a great example would be RAVE from 2021. This is a variational autoencoder for fast and high-quality neural audio synthesis. Basically, what it is doing is learning to recreate audio, and by recreating audio, it's creating this kind of internal representation which compresses the characteristics of our audio. Inside of that internal representation, it's also learning trajectories to take advantage of the temporal structure of music. So basically what that means is: if you train a RAVE model on darbuka, it's going to produce more darbuka for you, or change your audio into darbuka. If you train it on Antonín Dvořák, it's going to change your audio into whatever it thinks Antonín Dvořák sounds like. And if you train it on your voice, it's going to change your audio into, or produce more of, what it thinks your voice sounds like. So let's take a listen to RAVE. This is, again, the beat that I created for today's talk. This is how RAVE reimagines that beat as a darbuka. And this is how RAVE trained on voice reimagines that beat as voice. So this is an amazing tool for all of these glitchy patches that would be really difficult to produce using traditional synthesis. I absolutely love it. It's fantastic.

And the last family of neural audio synthesis models that I would like to talk about are text-to-music models. This is kind of like a ChatGPT for music: you input a text description and music comes out. And these are very involved, very modern, and very complex models.
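As a quick aside on RAVE: once you have a pre-trained model exported to TorchScript (the format the official repo produces for realtime use), playing with it offline from Python is roughly this simple. This is only a sketch under that assumption; the file name is hypothetical and the exact interface of exported models can differ between versions, so treat the method names as illustrative:

```python
import torch

torch.set_grad_enabled(False)
model = torch.jit.load("darbouka_rave.ts")   # hypothetical exported RAVE model

x = torch.randn(1, 1, 2 ** 16)               # stand-in for one channel of audio
z = model.encode(x)                          # compress into the learned latent space
y = model.decode(z)                          # decode back into (darbuka-flavoured) audio
```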
One of these text-to-music models was published in January of this year. It's called MusicLM, and it's also from Google. It's basically a neural network made out of three other neural networks. There's one part known as the neural codec, which is responsible for actually generating the audio. There's another part which is responsible for creating semantic tokens, which basically means making sense of a sequence of audio. And there's another part which is responsible for making sure that what we are generating, and the audio that we are processing, matches the text description. Remember the CLAP model that we talked about earlier, the embedding model? This is where it is used, although it's not the CLAP model itself; it's a different, proprietary one. Also, literally a couple of weeks ago, Meta AI proposed their own text-to-music model. It's called MusicGen, and the paper was titled Simple and Controllable Music Generation. Whenever you see Simple as the first word of the title of a paper, you can expect that maybe it's not really that simple, but the model does sound pretty good. And that's just a pretty high-level description. So we're getting close, and we're really doing interesting things in this space.

Okay, so wow, we've covered a lot of ground and yet barely scratched the surface. But let's summarize. I hope I've been able to convince you that MIR, Music Information Retrieval, enables new experiences both for listeners and for creators. I think we can agree that autonomous music AI is still kind of immature, but it already enhances human creativity in novel and interesting ways. And I think I was able to convince you that Python has a really, really rich ecosystem for working with music. You can find me everywhere on the internet under the handle Mamodrzejewski, and it's been a real pleasure talking to you about Music Information Retrieval. Thank you.

We have a few minutes for questions, so please come here to ask.

First of all, wow. Second of all, thank you for promoting the beautiful pink building of our university. Very nostalgic for me. And third of all, I'm also trying to make music, but probably as an amateur. And recently I was looking into sampling, into longer samples, but also in the context of the last example that you showed, of AI generating quite a beautiful piece of music: I have very bad musical hearing, probably nonexistent. So can you recommend something that would make it easy for me, like using AI, to transcribe it back to MIDI, back to notes?

Absolutely, yes. There's a model that was released, I believe, last year. It's called Basic Pitch. It was developed by Spotify, and it's a really, really good transcription model. So you can hum into your microphone, or you can put in a piece of audio, and it transcribes it into MIDI, and it works really well. And if you just search for AI transcription, there's a bunch of other approaches for that as well. So that's where I would start: Basic Pitch.

Thank you.

Hello, great talk. I don't know if you're familiar with the emerging genre algorave. I was just really interested in how quick these models are. You talked about the live stream, but is there a latency? Are these models quick enough to use live?

Yeah, so some of these models are and some aren't. The RAVE model, the one that was changing stuff into darbuka and into voice, that one is 20 to 80 times faster than real time on CPU. So that one is extremely fast. You can deploy it inside of a VST plugin and you can play with it live, absolutely.
But with some of the more involved models, you have to wait a bit for the results. So if I were looking for something to play with live, I would start with RAVE, definitely.

Cool, thank you.

Great talk, thank you. So many synthesizers today have really complex engines, so they are capable of reproducing a wide variety of sounds, and an increasing challenge is actually coming up with the presets for them. So could machine learning help us reproduce sounds that we give it, or just produce a wide variety? Like, how can we generate presets for an existing sound engine?

Absolutely. I mean, a big part of a producer's work is looking for snare drum samples that don't sound like garbage, right?

Exactly.

Or tweaking the presets of your synthesizer. And that's also a very active research area. I don't remember the names of specific models, but there are models that you can feed some samples into, and they basically generate more of the stuff that you fed into them. And there are also ideas for differentiable synthesizers: if you make all of the electronic circuits, the emulations, the mathematical operations differentiable, you basically get a neural network that's a synthesizer. You can train it on other sounds and basically synthesize sounds directly using a neural network. And there are also ideas for learning the parameters of synthesizer presets in order to use that in actual musical workflows as well. So this is absolutely also an active research area.

Awesome, thank you.

Thank you. Thank you very much. We don't have any more time for questions, so please reach out to Mateusz in the hallways or on Discord if you have any other questions. I think he'll be happy to answer them.

Absolutely. Thank you very much. Thank you.