Actually, I changed the title of the talk to "Having fun with music and Keras", because I think that's what it's all about.

So what's my personal motivation for going into music and doing stuff with it? Well, I play guitar, and when I was learning to play guitar twenty years ago, I would listen to a CD trying to figure out what notes were being played: push the rewind button, listen again, keep repeating. It took so much effort, and I kept thinking: why is there no computer program that can do this for me? Apparently it's very, very difficult — but right now, with deep learning, we're actually getting close to tackling these kinds of problems. So that's my personal motivation for looking into it.

So what are we going to do today? First we're going to build a simple synthesizer, using just NumPy and some Jupyter notebook enhancements. Then I'll explain a little bit about music theory and come up with a very simple way of generating something that, well, sounds musical. Hey, it's not the biggest new summer hit of 2018, let's be fair about it — it's just some music. And then we use this generated data for deep learning. That approach has two benefits. First of all, we're in control of our own complexity, so we can make the sounds more distinct or more similar. Second, we have an unlimited data set: we can just keep on generating, and we know what the ground truth is, so all the effort of labeling the data is taken care of. That's why we take this approach. And also, it's way more fun this way, because we get to build some synthesizers.

So what are we going to do on the deep learning part? On the synthesized music, I will first single out a single instrument while multiple synthesizers are playing together — so, just filtering one out. The second thing I'll cover is detecting which notes are being played: first for a single instrument, then when instruments are playing together, and then when multiple notes are played by the same instrument — so I'll extend the model to multiple notes. And finally, I couldn't resist doing one real-world example: I've taken a backing track, played some guitar over it, and tried to build a detector for when I'm playing or not.

Okay, so let's first make some sound. Is this readable? I should have made it a bit bigger. First, let's have a look at what music actually is. I've loaded some data here, and if you look at it, the raw data just consists of integers. What do these integers represent? They represent how far the speaker cone is going to move, left or right. The integers have a specific resolution — right now they're 16-bit integers — and I'm going to scale them to between minus one and one: minus one being all the way to one end, and one all the way to the other end of the speaker's travel. If you look at the shape, it's just one big one-dimensional array, because it's mono; if it were stereo, you'd have multiple channels here. We also get another quantity back, called the sample rate, and that basically tells you how many samples there are per second. With normal CD-quality music we have 44.1 kilohertz, which means 44,100 samples per second.
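As a minimal sketch of what that loading step might look like in a notebook — assuming a mono 16-bit PCM WAV file; the file name and the use of SciPy here are my own illustration, not necessarily what the talk's notebook does:

```python
import numpy as np
from scipy.io import wavfile

# Hypothetical file name; any mono 16-bit PCM WAV will do.
sample_rate, raw = wavfile.read("some_song.wav")

# Scale the 16-bit integers to floats between -1 and 1:
# -1 is the speaker cone all the way to one side, +1 all the way to the other.
data = raw / np.iinfo(np.int16).max

print(data.shape)   # a one-dimensional array, because the file is mono
print(sample_rate)  # samples per second, e.g. 44100 for CD quality
```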
So if we had 44,100 of these numbers, we'd have one second of music. We can plot them, and it looks like this — this is a whole piece of music, with time in seconds on the x-axis. There's an intro over here, a bunch of music over here, and finally an intermediate part and the outro. Now let's zoom in a bit. There's this very nice IPython display widget where you can put in any array and just listen to it — it also automatically scales it to minus one to one, so you don't necessarily have to do that yourself. Let's hope this all works. [plays audio] So this is what you see here below. If we zoom in a bit — I just want to show you how difficult music is — we find something like this. That's all encoded in this little part of the music. And if we zoom in slightly more: this is what you'd have to reconstruct from that mess over there. It's a genuinely difficult signal.

So now we're going to make a simple synthesizer — we're going to make some of our own music. Let's start with a sine generator; the sine is the simplest periodic signal we can think of. We set up a time axis sampled at the sample rate, and compute the sine at every time step. When we do that, we get this kind of big blob, but if we zoom in we actually see the sine wave. There's also a nice method called the FFT, the fast Fourier transform, and what the FFT does is take this signal and determine which frequencies are present in it. In this case I've made a sine of 440 hertz, and if we make a spectral representation of it, we indeed see one big line at 440 hertz. We can also listen to it — let's do it. [plays audio] Not the most pleasant sound, but that's a sine.

It's not very musical at this point, so let's put an envelope on it — an exponential decay of the sine wave. How does that sound? [plays audio] Already slightly better.

Then let's add multiple sines together. The nice thing about tones is: if I have a sine of some frequency, I can add another sine at twice the frequency, or three times the frequency, and it will still sound like the original tone — the same pitch will still be perceived, but it sounds different. It has a different timbre, a different feel; it's brighter, say. But it doesn't change the perception of the pitch. So let's try that out. Here I'm just adding a couple of sines together — 440 hertz, two times, three times, four times that — with some random amplitudes that I've put in. This is the thing we generate; in the spectrum we actually see the separate lines coming back. And how does it sound? [plays audio] It's still the same pitch, right, even though we've added a lot of extra frequencies to it.
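A minimal NumPy sketch of such an additive tone with an exponential decay envelope — the number of harmonics, the random amplitudes and the decay rate below are illustrative values, not the ones from the notebook:

```python
import numpy as np

sample_rate = 44100
t = np.arange(2 * sample_rate) / sample_rate  # two seconds of time steps

def tone(freq, n_harmonics=4, decay=3.0):
    """Additive tone: a base sine plus sines at integer multiples of the
    frequency, with random amplitudes and an exponential decay envelope."""
    amps = np.random.uniform(0.2, 1.0, n_harmonics)
    wave = sum(a * np.sin(2 * np.pi * freq * (k + 1) * t)
               for k, a in enumerate(amps))
    return wave * np.exp(-decay * t)

note = tone(440.0)  # A4

# The FFT shows which frequencies are present: one line per harmonic.
spectrum = np.abs(np.fft.rfft(note))
freqs = np.fft.rfftfreq(len(note), d=1 / sample_rate)

# In a notebook: from IPython.display import Audio; Audio(note, rate=sample_rate)
```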
What we also can do is give the sines random phases. So here I've added the sines together with all phases zero, and here I did the same with random phases. The shape of the waveform is completely different, right? Do you think they'll still sound the same? [plays audio] Sorry — it's a bit slow, this thing. [plays audio] Do you hear a difference? It's exactly the same: your ear simply cannot hear the difference between this one and this one. So already, when you're doing deep learning and looking at RMSEs and things like that, you have to take into account that the RMSE might not be a very good quantity for comparing signals with each other.

Well, let's put the decay on the additive sound as well. [plays audio] Now we have something like a very simple electric piano, or maybe a harp — it may not be very beautiful, but I can play something with it.

Now, just as a little side track, let's do some drums. For the kick, I take a sine whose frequency gets much lower as time goes on, and I add an exponential decay on top. Let's listen to that. [plays audio] There we have a drum kick. Then for the snare, I'm going to build it from a drum kick plus a bit of noise. First, let's listen to the noise on its own. [plays audio] That's noise. And if I add the kick and a very short burst of the noise together, I get something like a snare drum. [plays audio] So you can do a lot of cool things just by combining simple elements.

There's another technique called subtractive synthesis. There you actually start with a very rich waveform — a square wave, for example, has a lot of frequencies in it — and then you cut frequencies out with a filter in order to make it sound nice again. This one sounds like a Mario computer game: that's without the filtering, and with the filtering it sounds like this. [plays audio] So there are a lot of different ways of generating sound. There's also FM synthesis, which is a different approach again, and you can even try to physically model a complete piano with all its resonances. It goes on and on, but this is just a very simple way of building it.

Okay, let's start making some music. Who here knows a bit about music theory? Okay, luckily quite a lot — I'm going to explain it anyway. Everyone knows this picture of a piano, right? The nice thing about the white keys is that they're all in the same scale, but I'll come back to that in a moment. If you look at the notes, every note in modern Western tuning is very easily calculated: you take the twelfth root of two to the power n, where n is the number of half steps away from some root note. A half step is going from one key to the very next one — from this one to that one. So if I want to know the frequency of this note, I just count how many half steps it is away from A4 — one, two, three, four — put that into the formula, and I can compute the frequencies of all the notes on the piano. There are different ways of tuning, but this is the simplest one.

Well, I've created here a small instrument from the things we did in the previous notebook, and I can now just generate a note. [plays audio] That's a note. But now let's have a look at scales. A scale is basically a group of notes that fit well together — do re mi fa sol la ti do, everyone knows it, right? And for every note that's in a scale, you can also make a chord out of it, so you can do that mathematically too.
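In code, that tuning formula — and a scale with a chord built on top of it — might look roughly like this; the scale and chord helpers are my own illustrative construction, not the talk's instrument code:

```python
A4 = 440.0  # reference pitch

def note_freq(n):
    """Frequency of the note n half steps away from A4, in twelve-tone
    equal temperament: f = 440 * 2**(n / 12)."""
    return A4 * 2 ** (n / 12)

# Major scale as half-step offsets from the root (do re mi fa sol la ti do).
MAJOR = [0, 2, 4, 5, 7, 9, 11]

def scale_freqs(root=0):
    return [note_freq(root + step) for step in MAJOR]

def triad(degree, root=0):
    """Chord on a scale degree: the 1st, 3rd and 5th note of the scale,
    counted from that degree, wrapping around the octave."""
    steps = [MAJOR[(degree + k) % 7] + 12 * ((degree + k) // 7)
             for k in (0, 2, 4)]
    return [note_freq(root + s) for s in steps]

print(triad(0))  # the chord on the first degree: a major triad
```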
So now we have these rules that govern which notes sound musical together, and we can make a random walk within such a scale, mix it all together, and get something that fits reasonably well. First, though, I'll put a chord progression underneath it. I took this particular chord progression because it's just about the most common chord progression in pop music, so I decided to go with it — you probably all know a song that fits over it, I suppose. [plays audio]

Now I'm going to generate some notes on top of this, and that's just going to be a random walk. Of course, I don't want my notes to sound too high, and I don't want them to sound too low, so I make it a bounded random walk: whenever I draw a new note that's above the limit, I bounce back and go the other way again. There are also some probabilities steering the walk: I never stay on the same note, there's quite a big chance of going to a note that's close to the current one, a smaller chance of making a big step, and so on. I just tuned these by hand until I thought, okay, this sounds sort of acceptable. If I take the random walk and index the scale with it, I get a sequence of notes; then I generate those notes, mix everything together, and we get something like this. [plays audio] And this can go on for an infinite amount of time. It even sounds a bit better now.
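A rough sketch of such a bounded random walk over scale degrees — the bounds and the step probabilities below are made-up values, not the hand-tuned ones from the talk:

```python
import numpy as np

def bounded_random_walk(n_steps, low=0, high=13, seed=None):
    """Walk over scale degrees: never stay on the same note (no zero step),
    prefer small steps over big jumps, and bounce back at the bounds."""
    rng = np.random.default_rng(seed)
    steps = np.array([-3, -2, -1, 1, 2, 3])               # no 0: never repeat
    probs = np.array([0.05, 0.15, 0.30, 0.30, 0.15, 0.05])
    walk = [int(rng.integers(low, high + 1))]
    for _ in range(n_steps - 1):
        nxt = walk[-1] + rng.choice(steps, p=probs)
        if nxt < low or nxt > high:                       # bounce off the limit
            nxt = 2 * walk[-1] - nxt
        walk.append(int(nxt))
    return np.array(walk)

degrees = bounded_random_walk(32, seed=42)
# Indexing the scale with these degrees gives the notes to render and mix.
```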
So now that we have this infinite data set, let's move on to the deep learning applications. The first thing I'll do is filter out the lead instrument. In the previous sound fragment there were the chords and the melody; now I'm going to have the algorithm take the full mix and regenerate only the melody — so from chords plus melody to the melody alone. The model is: a wave with three instruments in, a wave with just one instrument out.

One way of doing this is with something like an autoencoder. Who knows what an autoencoder is? Okay, I'll go over it quickly. An autoencoder is a neural network that goes from some input — in this case, say, a picture — through a set of layers, typically down to a layer that has fewer dimensions than the original input, and then scales back up to the original input size. You train it by giving it this picture as input and the very same picture as the target it has to reproduce. So it has to learn a compressed representation that still makes it easy to generate the input back again. That's the autoencoder: an architecture with the same number of input nodes as output nodes, trained with the input as the target. We can do something similar, but instead of making the output the same as the input, I take as input the complete wave with the combined instruments, and as output I give it only the melody. So this is the input, and this is the output — and here is how we build it.

We start with a Keras model. I should have mentioned this before: music is a continuous signal, it just goes on and on, so we have to cut it up in some way. What I did is cut the music up into small fragments of about a thousand samples — 1024, actually — and on each fragment the task is to filter out the lead instrument. So 1024 samples go in, then I have a dense layer, then a PReLU layer — I tried out different things and this worked best — then another dense layer at half the size, and then we scale up again to the original output shape of 1024 samples. And then it's trained on these input and target fragments.

How does it sound before training? Like this. [plays audio] Maybe I should have used a bit shorter fragment there — I see people sitting like this — but yes, the network still has a lot to learn at that point. I should say that when generating this output I'm still putting in the original wave, chopped up into fragments of 1024 samples; I actually process it fragment by fragment, shifting by half a fragment each time and adding the results together with a crossfade.

Then I fit the model — for just two epochs — and this is how it sounds afterwards. [plays audio] Is it overfitted? If you only had this one data set, it would be very likely to be overfitted, right? So what I did here is generate the data set again — a totally new data set — and I also put in a different type of scale. I trained it on a minor scale, and this is a major — no, Phrygian dominant, I believe — a different scale anyway. Let's listen to it. [plays audio] There's still a bit of noise left in there, so it still needs some work; if you train it for longer than two epochs you get better results. And of course I can also make the epochs really long — infinite amounts of data.
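To make that architecture concrete, a minimal sketch of the network as described — the layer sizes follow the talk, while the optimizer and loss are my assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

FRAGMENT = 1024  # samples per fragment of music

# Autoencoder-like network: a fragment of the three-instrument mix goes in,
# the same fragment with only the melody instrument should come out.
model = keras.Sequential([
    keras.Input(shape=(FRAGMENT,)),
    layers.Dense(FRAGMENT),
    layers.PReLU(),
    layers.Dense(FRAGMENT // 2),  # squeeze down to half the size
    layers.Dense(FRAGMENT),       # scale back up to the fragment size
])
model.compile(optimizer="adam", loss="mse")

# X: fragments of the full mix, y: the same fragments with melody only.
# model.fit(X, y, epochs=2, batch_size=32)
```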
Let's go on to another part, and that is note detection. Now we're going to detect which notes are being played. Here the model is: wave in, sequence of detected notes out. The method will be: generate some random music; chop the wave data again into small fragments of a tenth of a second; then use the FFT — the way of getting at the frequencies that we saw earlier — and for each of these fragments predict which note is being played. I'm going to phrase this as a multiclass classification problem — I'll show it in more detail later — where each class corresponds to one note being played, and I'll train a gated recurrent unit as the classifier.

For generating and rendering the data, what I get is a data frame built from the random walk: it has an offset and an end, and between those two timestamps I've listed which notes are being played — essentially one-hot encoded, although it doesn't necessarily have to be one-hot. Finally, I match my fragments to one of these offset–end intervals, and then I know which note is being played in each fragment.

If you look at the FFT analysis of the waveform, it looks like this, and you can actually already see the individual notes in it. For the single instrument you can see it too — I don't know how visible it is, but the same pattern is repeated higher up, at three times the frequency, and that has to do with the extra harmonics I showed at the beginning, when we added the additional sines to the original note.

Now, setting up the GRU. Who has ever used a gated recurrent unit before? Who knows a bit about recurrent neural nets? Okay, I'll give a very short introduction, and I'll keep it at plain recurrent neural nets. What is a recurrent neural net? You start with some input x, that goes into a hidden state h, and that produces an output o. If you have a sequence of x's, then every new x updates the hidden state again: this one updates it here, the next one updates it there, and so on. That makes it a very good way of accumulating evidence for a specific feature over time, and you can also run it on arbitrarily sized sequences and things like that. That's all I want to say about it.

Since I know I've generated data with only 14 different notes, I'm using a 14-dimensional hidden state — the same number as the notes that can be played. I've put the number of time steps per batch to one, and I've made the RNN stateful, which means it keeps remembering what it has seen before across batches. And finally, I apply the Fourier transform before the data goes in. Here's the model building. For a gated recurrent unit you basically set it up with a number of steps per batch: you chop up your original signal into batches, within a batch you have a number of time steps, and then per time step you have a number of channels. For raw wave data that's typically just one channel, but since we're feeding it an FFT input, it's actually a whole spectrum of frequencies per time step — so we have slightly more complicated inputs here. Then we have the gated recurrent unit, and at the end I put a dense layer, just to compensate a bit and make it easier to fit. So the input here is: 32 is the batch size; one step, so one FFT spectrum per time step; and 2049 FFT dimensions. Then our gated recurrent unit, and finally the network predicts which one of the 14 notes is being played, using the categorical cross-entropy loss for fitting.
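Put into code, that setup might look something like this sketch — the dimensions follow the talk (2049 FFT bins correspond to 4096-sample fragments, which is my inference); the other options are guesses:

```python
from tensorflow import keras
from tensorflow.keras import layers

BATCH = 32     # fragments per batch
N_FFT = 2049   # FFT bins per ~0.1 s fragment
N_NOTES = 14   # one class per note the random walk can produce

# Stateful GRU: one FFT spectrum per time step; the hidden state keeps
# accumulating evidence across successive fragments.
model = keras.Sequential([
    layers.GRU(N_NOTES, stateful=True,
               batch_input_shape=(BATCH, 1, N_FFT)),
    layers.Dense(N_NOTES, activation="softmax"),  # which note is playing?
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```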
This is what fitting looks like — the typical Keras output. And if we look at the prediction: on the right is what we put in as a function of time; the note being played shows up as a yellow blob, like a punch card, basically, and this is what the network predicts. And this is after only a little bit of fitting.

Okay, this was actually the easy one. By the way, if we test it on fresh data, it still works. And I should say that if you train this a lot longer, it starts to look a lot better — I'll actually do that for the next one. Now we're going to make it a bit harder: instead of looking at a single instrument, we're going to look at the combined mix, the whole piece of music that I played earlier. So we add the chords and bass instruments — this is how it sounds — and we're also going to add harmony, so the lead instrument is going to play two notes at the same time. [plays audio] That's the music. And this is how the FFT spectrum looks now. This is also what I'm going to offer to my network to fit on, and you can see it over here: it's messy, right? It's really messy — it's hard even for me to see what's being played. Although if I look a little bit higher up, I can still find the dots there, so the information is in a sense still there. But it would be really hard to come up with hand-written business rules that capture this problem well.

For this case I'm still using the same network topology — I think the only difference, if I look here, is that the frequency dimension is twice as big. Then let's fit it, get some coffee, and it looks like this. So this is after a lot of iterations, and you can see it actually finds the notes back quite well. Every now and then it's close to the note but not completely confident — you see that here and there — and that happens where it's really difficult: if you remember this messy part of the spectrum, it corresponds to this part here. So it has actually done quite well there. Well, okay — if a bit of coffee already looks this good, what happens if we train it for one night? Then it does it almost perfectly, I would say. So that's that.

Now for the final thing: let's do it on real music. The case is: can we detect when an instrument is playing? So we load a bit of data — I played some guitar over a backing track, so there's a full mix; and because I played over it, I can also mix down only the guitar, so I have the two separately. Now, first of all, we have a big issue of detecting when the guitar is playing on its own — this is only the guitar part. I cannot just say: if the signal is below some threshold, I'm not playing — the waveform goes through zero all the time, even while the guitar is sounding, so a simple threshold on the raw signal doesn't work. Instead, I'm calculating the rolling RMS, and whenever the rolling RMS drops below some level, I gate the signal: I stop saying the guitar is playing.
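A minimal sketch of such a rolling-RMS gate — the window length and threshold here are illustrative values, not the ones tuned in the talk:

```python
import numpy as np
import pandas as pd

def gate(signal, sample_rate, window_s=0.1, threshold=0.02):
    """Return a 0/1 mask: 1 where the rolling RMS of the signal is above
    the threshold (instrument playing), 0 where it drops below (silence)."""
    window = int(window_s * sample_rate)
    rms = (pd.Series(signal) ** 2).rolling(window, min_periods=1,
                                           center=True).mean() ** 0.5
    return (rms > threshold).astype(float).to_numpy()

# mask = gate(guitar_only, 44100)
# gated = guitar_only * mask  # silence everything below the threshold
```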
So in this graph you see the guitar part in blue, the original mixed waveform in gray, and the rolling RMS in orange. The green line is just one or zero: whether I'm playing or not. With that I can mask the signal with the zeros and ones — multiply the signal by zero when I'm not playing and by one when I am — and then we can actually hear the effect of the gate. [plays audio] It detects playing versus not playing; sometimes there's a little bit of jitter, but that just happens.

Now we're going to build a model. We're again going to chop the signal up into fragments of 0.1 seconds, and that's again going to be the time resolution: within that resolution I want to know whether the instrument is playing or not. For each fragment we're going to detect whether the guitar is playing. So we have a short fragment of 0.1 seconds, then we have the model, and we predict playing or not playing — a boolean.

This is how it looks. I'm taking an input with the fragment length, 0.1 seconds of samples. One very nice thing about TensorFlow and Keras is that you can still apply certain functions inside your Keras model: instead of first doing the Fourier transform by hand in preprocessing, I can apply a Fourier somewhere in the middle of my network, and TensorFlow will even calculate the gradients of that operation. It's worth having a look at what kinds of transformations are available in TensorFlow, because you can do some nice things inside your model instead of in the preprocessing phase. So here I'm doing the Fourier inside the model with a Lambda layer, with the output shape following from the fragment length, and finally I put a dense layer on top with a sigmoid activation, so the output is between zero and one, predicting whether I'm playing or not. In the model you can see it: this is the time domain, the wave function; this is the frequency domain, the stripes; and finally, yes or no.

The nice thing is that we can actually listen to the result and watch it at the same time. Does anyone know MoviePy? That's definitely a package you want to check out: there's a very nice function where you just provide a frame for any given time value, and then you can render it to a movie very easily. I used it for these movies. What we're going to see is the waveform, with a sliding window going across it; as the waveform progresses, blue lights up whenever the detector says the guitar is playing. We also have the green line from the gating — whether I'm actually playing or not — as well as the prediction in orange. Okay, I'll open it in a different player, because otherwise you'll hear me fucking up. [plays video] That's it.
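A minimal sketch of that gate-detection model with the Fourier transform inside the network as a Lambda layer — the fragment length assumes 44.1 kHz audio, and the optimizer and loss are my assumptions:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

FRAGMENT = 4096  # ~0.1 s at 44.1 kHz; rfft then yields 2049 frequency bins

inputs = keras.Input(shape=(FRAGMENT,))
# The FFT runs inside the model, as a TensorFlow op wrapped in a Lambda,
# so no separate preprocessing step is needed and gradients flow through it.
spectrum = layers.Lambda(lambda x: tf.abs(tf.signal.rfft(x)),
                         output_shape=(FRAGMENT // 2 + 1,))(inputs)
playing = layers.Dense(1, activation="sigmoid")(spectrum)  # playing: yes/no

model = keras.Model(inputs, playing)
model.compile(optimizer="adam", loss="binary_crossentropy")
```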
Okay, I think we've got a few minutes for questions. Okay, you first.

"Hey, thanks for your talk. I'm really curious how fast the network can figure this stuff out — would it be applicable for music visualization? With electronic music it's fairly easy to do beat detection and so on, but classical or acoustic music is kind of hard. Would this be useful for that case?"

Well, with music, people have a very good internal clock — a very good internal time resolution when it comes to music perception. So you'd need a really, really short latency; for instance, already for the rendering you'd need to do it within something on the order of sixteen milliseconds. A lot of people probably saw a little bit of desync between the audio and the video just now — humans are very sensitive to small latencies. So I think it would be possible in principle, but it's not going to be easy. For what we saw right now, I don't know, actually. But I do know that latencies of twenty milliseconds can already become noticeable, especially when you're making the music: if you're playing an instrument with twenty milliseconds of latency, you'll already start noticing it. That's when you're playing it, though; if you're just listening to it and watching it... And of course, with electronic music you can probably pre-process it — pre-render it, I suppose. But for real live music, where there are people playing live on stage with no backing track or whatever — then yeah, it's going to be hard, I suppose.

Okay, any other questions? Key changes — well, I'll repeat that for the room: how to deal with more complex music, like jazz, where the key changes. In principle — so right now I've really tuned my model to the complexity of the data I have: my hidden state has only fourteen neurons, which is exactly the number of notes the random walk can draw from. In principle you can enlarge that to the complete range of all the different possible notes on a piano, for instance. But jazz music also has a lot of extra things: notes get bent — they slide up slowly, so they're in between pitches — or there's vibrato and things like that. There are a lot of additional things that color the music that you'd still need to be able to write down — or not — and that can even hurt you: a vibrato could, for instance, be transcribed as higher-lower-higher-lower, and so on. So it's going to be difficult. But what I tried here was deliberately to make it simple first and then build it up, so I'm actually not intending to stop at this point — and maybe slowly get some jazz going.

I think we have time for one more question. — Yes, they are published: if you look up my name on GitHub — Marcel Raas — you'll find it, and the whole synthesizer thing is also there. It could be cleaned up a bit, that's true.

Okay, are there any other questions? If not, I'd like to thank Marcel — give him a round of applause.