So this is Anna Wszeborowska — I practiced it and I didn't quite get the "W" at the beginning — and she's going to be talking about music transcription. You've given her a big hand already, but do it again and enjoy the talk.

I've just found out that I have five minutes less than I thought I had, so I'm going to rush, I guess. Hope that's fine. My name is Anna Wszeborowska, and I work as a software developer in the music industry, in a company called Ableton. At Ableton we produce three main products. The main product is Live. It's a digital audio workstation, which allows musicians to record, edit and produce music. Apart from that, it was actually designed as an instrument that you can take on stage and perform with live, hence the name. Next is Link. Link is a technology that allows people to play together on electronic devices by syncing them in time over a wireless network. And last but not least, this shiny beauty is what I actually work on at Ableton. It's an instrument that allows you to create your musical ideas without looking at a computer, although it's connected to Live.

But let's get back to the topic. What does it mean to transcribe music? Transcribing music means transforming an audio recording into music notation: you write down what you hear as notes that can be interpreted by other people. There are people who have the superpower of doing this by ear. I'm not one of them; my ears haven't been properly trained for it. That's why I prefer to figure out how to teach my machine to do it for me. So let's have a look at what we need to do.
First of all, we have to figure out how to read the audio stream: what the data is going to be, and how to read and store it so that we can process it later. Then we have to figure out that a note actually occurred, and which note it was (was it a C? was it an E? in which octave?), and then transcribe it, that is, write it down in some standardized way so it can be interpreted by other software, other electronic instruments, or even other people.

Okay, so let's go in steps. First question: how to read and store the data. First of all, we need to know what the data is. Our audio input is basically a continuous wave, which we unfortunately can't process as-is with our computer; first we need to digitize it. Digitization happens in two steps: sampling and quantization. What does that mean? Sampling changes the signal into a sequence of samples, so we end up with a finite set of samples. How many samples is determined by the sampling rate we choose: the sampling rate defines how many samples are taken per second.

There is one important thing to remember while sampling, one of the most basic and important facts in all of digital signal processing: the so-called Shannon-Nyquist theorem, or simply the sampling theorem. It says that to be able to later restore the continuous signal we had on our input, we have to make sure that our digitized signal doesn't contain frequency components above half of the sampling rate. What does that mean?
What happens if there are higher frequencies in there? Such a frequency still wants to represent itself in our sampled data, so it assumes a different frequency than its own, which is called aliasing: it takes an alias, so to speak. Aliasing is a double curse, because first of all we lose the high frequencies, but we also corrupt our low frequencies, since we no longer know whether a frequency we see actually belongs to the original signal or is there because of aliasing. So make sure you pick the right sampling rate. The usual one is 44.1 kHz, which lets us encode everything audible to a human being, because we hear roughly in the range from 20 Hz to 20 kHz.

Okay, so now our independent variable (time) contains a finite number of points, the samples, but our dependent variable (amplitude) isn't finite yet, because each value can be any float. Let's say we want to encode our information on a set number of bits, say 8, so we have only 256 values available for encoding the data; that means we have to quantize it. The simplest way is to find the nearest quantization level, the nearest possible value to assign to a sample, which just corrects the amplitude a bit. So both sampling and quantization restrict how much information ends up in our digitized signal.

Okay, so we know what we have when we read our audio stream: it's going to be a rather large array of data. We'll have to figure out what data types to use to store it. And then what do we want to do? We want to detect notes. First of all, we want to say: okay, a note occurred.
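The aliasing and quantization effects described above can be sketched in a few lines of NumPy. This is an illustrative snippet, not code from the talk; the helper names are made up here.

```python
import numpy as np

SR = 44100  # the usual audio sampling rate

def sampled_tone(freq, sr=SR, seconds=0.1):
    """Sample a pure sine of the given frequency at the given rate."""
    t = np.arange(int(sr * seconds)) / sr
    return np.sin(2 * np.pi * freq * t)

def dominant_frequency(signal, sr=SR):
    """Return the frequency (Hz) of the strongest FFT bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.argmax(spectrum) * sr / len(signal)

# Below the Nyquist limit (22.05 kHz) a tone survives sampling intact:
low = dominant_frequency(sampled_tone(5000))     # ~5000 Hz

# Above it, a 30 kHz tone masquerades as 44100 - 30000 = 14100 Hz:
alias = dominant_frequency(sampled_tone(30000))  # ~14100 Hz

# Quantization: squeezing each float sample onto one of 256 8-bit levels.
eight_bit = np.round(sampled_tone(5000) * 127).astype(np.int8)
```

The second call shows exactly the "double curse": the 30 kHz component is gone, and a spurious 14.1 kHz component has appeared in its place.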
Here we have a plot of a recording; in these few seconds of the waveform we can see that I played exactly two notes, right? It's quite easy to see on the plotted waveform; we have to figure out how to calculate it. Okay, so then we know that I played two notes. Now, in the spectrum plot we can see that these were different notes, because there are significant peaks at different frequencies. And the last step is to think about a standard for encoding the notes, so that other software understands them later.

So, our questions again: we need to read and store the data, figure out how to detect notes, and figure out how to represent notes. Before we go any further into my suggested implementation, let's see what we're aiming for. So: demo time. Many things can go wrong right now, because it's real-time audio processing. I'm sure it works, but just a small caveat: ghost notes may appear due to all the noise around. Let's see how it goes. I'm going to play this beautiful thing that's been with me ever since I was in elementary school, probably, and I'm going to use it one more and probably last time in my life. I have this microphone here that I'm going to play into, and I'm not going to play fancy melodies, just single notes. Okay, let's see if it works: I'm going to ask my algorithm to detect a note, detect its pitch, and play it with a different sound. Let's see how that works. You get the idea. Still better than expected.

So what happened here is that we read chunks of data at a time and we process them, trying to figure out whether a note occurred and which note it was. Then we create a note in a standardized format and send it to the synthesizer, which produced a different sound, the sound of a piano. Okay, so how did we do that?
I read the data using the PyAudio library, which is basically a set of Python bindings around PortAudio, a cross-platform library that enables you to play and record audio in real time. It supports blocking and non-blocking mode; non-blocking mode is based on callbacks invoked on a separate thread, and that's what we're using here. So you basically need to instantiate a PyAudio object, open a stream (in our case for reading data), tell it what data format you want, how many channels you're going to read, and what the callback is. Most importantly, you need to start the stream and make sure the main thread doesn't terminate; we keep it alive by putting a sleep in there or something similar.

Let's see what the callback looks like. It receives the data, a frame count, time info, and status flags; our data is a byte string at this point. Most importantly, the callback needs to return the frames and a flag that tells the stream whether it should continue feeding data (we pass the continue flag) or terminate (in that case we return the complete flag to say: stop doing this).

Okay, what's next? Next, we have to figure out how to store our data, because as we saw, we're getting byte strings, which are maybe not the best type for data manipulation and calculations. That's why we convert them to a NumPy array. Why a NumPy array? Because it's much more efficient than a Python list due to the internal implementation, even though both implementations are in C. Because Python lists allow you to put objects of different types in the same list, they need to store type information per element, so they can't really use vectorized implementations and can't really optimize operations.
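A minimal sketch of the data handling just described. The executable part only fakes one chunk of callback input; the PyAudio-specific setup is left as comments because it needs the pyaudio package and a microphone, and `handle_chunk` is a made-up name, not the talk's actual function.

```python
import numpy as np

def handle_chunk(in_data):
    """Turn the callback's raw byte string into a NumPy array for analysis."""
    return np.frombuffer(in_data, dtype=np.int16)

# Fake one chunk of microphone input: 1024 16-bit samples of silence with a
# single spike, standing in for the byte string a real callback receives.
fake_chunk = np.zeros(1024, dtype=np.int16)
fake_chunk[100] = 3000
in_data = fake_chunk.tobytes()

samples = handle_chunk(in_data)

# The real stream setup would look roughly like this:
#
#   import pyaudio
#   pa = pyaudio.PyAudio()
#   stream = pa.open(format=pyaudio.paInt16, channels=1, rate=44100,
#                    input=True, stream_callback=callback)
#   stream.start_stream()
#
# where callback(in_data, frame_count, time_info, status) must return a
# (data, pyaudio.paContinue) tuple to keep the stream feeding data, or
# pyaudio.paComplete to stop it.
```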
Even if I'm not using them in the most efficient way, NumPy arrays still speed things up massively. They also give us, for free, some rather complicated but very common operations on big matrices, like convolution, transposing a matrix, or even the Fourier transform.

Okay, so now we've read our data and we want to see whether, in the current chunk, something changed and a note occurred. Here I've plotted the short-time Fourier transform of the recording from the previous example. What does that mean? It means the recording was divided into chunks, and for each chunk we calculated the power spectrum, to see how the energy of the spectrum changes over time. This way we can tell whether there was a significant change in our spectrum, suggesting a note onset. We don't do exactly that in our implementation, because we analyze one chunk of data at a time, but we do essentially the same thing: we calculate the chunk's power spectrum and compare it with the previous spectra.

Since we want to measure how quickly the power of our spectrum changes over time, we use the so-called spectral flux, which is basically the difference between the current power spectrum and the previous one; it's plotted with the green line here, over the short-time Fourier transform. We could find the peaks in it already, but there are some minor peaks that we might end up finding, and we don't want that, because they're basically noise. So we apply a thresholding function: we choose a number of chunks to average, and multiply the average by a given constant, so that we're only interested in picking peaks above the resulting threshold, that is, non-zero values that are bigger than the previous value. Here we can see that the thresholding function could be better, because it still leaves behind
some peaks we don't want to pick. In my implementation I chose a threshold that is far higher, because I wanted to make sure it's not too sensitive in an environment like this one, so it probably performs better. These two parameters, the number of chunks we average and the multiplier, are the knobs you can tune to make your application perform better.

Okay, now we've picked the significant peaks, and we want to find what pitch those notes had. We do it by calculating the so-called cepstrum of the signal. The cepstrum is the inverse Fourier transform of the logarithm of the calculated spectrum; in other words, it's kind of a spectrum of a spectrum, and that's a good way to think about it. You can treat it as information about the rate of change over time, so it's kind of a measure of time, but you shouldn't think of it as a signal in the time domain; it's merely correlated with time. We all know that frequency equals one over the duration of a single cycle of the wave. Knowing that, the independent domain of the cepstrum, called quefrency in the same play on words that turned "spectrum" into "cepstrum", represents time cycles: high frequencies have shorter time cycles and are represented at the beginning of the quefrency domain, and lower frequencies at the end.

So, before we start finding our fundamental frequency in this cepstrum (this one is already narrowed), we want to narrow the cepstrum to the frequencies we are interested in. I narrowed it to frequencies corresponding to the handful of notes I'd actually be willing to play tonight, so it's a pretty narrow range.
These frequencies span from 500 Hz to 1200 Hz. We can also think of that as narrowing to time cycles from about 0.8 milliseconds (corresponding to 1200 Hz) up to 2 milliseconds (corresponding to 500 Hz). This is how the cepstrum works, more or less. Okay, so knowing this, we pick the maximum value, which in our example lies between indices 25 and 30. Now we have a value in the quefrency domain, and we have to transform it to a frequency, because that's what we're interested in. To calculate it, we simply divide the sampling rate by the cepstral peak index, which follows from what I've just tried to explain.

So in our example, this is what I mentioned: we narrow the cepstrum and find a peak in the narrowed cepstrum. But when we then try to figure out what the fundamental frequency was, we have to remember that the peak index should refer to the original cepstrum. That's what we're doing here, and the value ends up being 689 Hz. It's also nice to mention that pitch detection applies a slight correction to our onset detection, because we can simply ignore onsets that fall outside the frequency range we're interested in.

Okay, we've found the notes; now we want to encode them in a form that can later be understood by our synthesizer. So what we do is choose something ready-made and massively used everywhere: the MIDI protocol. MIDI stands for Musical Instrument Digital Interface, and it basically encodes the pitch and velocity of our notes. The MIDI messages we're interested in are note-on and note-off: note-on means start playing a note, note-off means stop playing it. A message consists of three bytes, as we can see here. In the first one we say what kind of message it is and what channel we're using (we have 16 channels available). The second data byte has our pitch, encoded on
seven bits, so 128 values. And the last one is velocity, which is the strength of the note being played: it determines whether we perceive the note as louder or softer.

Okay, so this is how we transform a frequency into a MIDI note number. As we saw, we only have seven bits to encode our pitch, meaning the maximum value is 127, but our frequency was 689 Hz, so how do we encode that? Well, like this. This is part of a chart that tells you which note in which octave has which MIDI number and which frequency, and as we can see here, our detected note, with a frequency closest to 698 Hz, is the note F5 and has MIDI number 77.

Okay, now that we know what our note is, we can encode the MIDI message and send it to a different instrument. I chose a library called pyFluidSynth, which is a set of Python bindings around FluidSynth, software that lets you play SoundFonts, which encode instruments, in real time.

So, what are the conclusions? Python is amazing for rapid prototyping. I wouldn't really use Python for production code, but to check out different ideas and solutions, or even to try out different detection algorithms, it just works like a charm. The whole thing we've seen now isn't a lot of code; you can have a look, it's on GitHub (it's been on GitHub for the last two hours, as of today). The SoundFont I used isn't pushed there because it was too big, so you'd have to download your own, but they're available online.
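The frequency-to-MIDI mapping and the three-byte note-on message described above can be sketched like this. These are illustrative helpers, not the talk's code; the standard mapping with A4 = 440 Hz = MIDI note 69 is assumed.

```python
import math

def freq_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def midi_note_name(number):
    """Readable note name, e.g. 77 -> 'F5'."""
    names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    return names[number % 12] + str(number // 12 - 1)

def note_on(channel, pitch, velocity):
    """Three-byte note-on message: status byte (0x9n), pitch, velocity."""
    return bytes([0x90 | (channel & 0x0F), pitch & 0x7F, velocity & 0x7F])

# The 689 Hz peak detected earlier is nearest to F5 (698 Hz), MIDI number 77:
number = freq_to_midi(689)         # 77
name = midi_note_name(number)      # 'F5'
message = note_on(0, number, 100)  # three bytes: 144, 77, 100
```

Rounding to the nearest note number is also what absorbs the small gap between the detected 689 Hz and the nominal 698 Hz of F5.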
No problem. Okay, so why is Python so good for rapid prototyping? First of all, thanks to the amazing numerical libraries I briefly discussed before, like NumPy and SciPy, things get much easier. IO operations in Python are also pretty simple, with not much messing around, which is really useful. And the APIs of the wrappers I used are really good; they make the code look very clean and very readable. So, that's it, that's all I had prepared for today, and we do have four minutes for questions. This hand went up immediately.

Hello, two questions actually. First: do you need a very good microphone, or does it work with whatever microphone?

No, not a very good one. I just brought this one because the laptop microphone picks up noise from all around, and this one is designed to be more directional.

And second: what about instrument identification? I mean, knowing whether it's a piano or a guitar or a violin.

That's a different thing. Distinguishing between different instruments means they have different spectral features, and then we would need to analyze different things. We probably wouldn't care about the energy distribution over time; we'd rather care about the energy distribution overall, because different instruments have different characteristics there. So yes, it would be a completely different thing to analyze, although with some similarities.

Thanks for the talk, it was super good. Quick question: this is for mono, so there's one melody and only one note playing at a time. What are the main challenges, or what advice would you have, if you want to do this for chords?
Yes, so the problem with chords is that it's hard to distinguish which notes were played, because the frequencies overlap. If I had a very performant algorithm for transcribing chords to present right now, I would probably have a PhD by now. I'm certainly going to try out some things, but the solutions available today are no more accurate than about 70 percent, so it's a rather complicated problem. Still, it's something I'm going to work on and try different things for. A very interesting concept is the CQT, the constant-Q transform; you can have a look, it's what people tend to try for polyphonic music recognition as well. But yes, that's why I haven't gone there yet.

More questions? There you go; we had the same question.

You know, there is a story that says that Wolfgang Amadeus Mozart, when he was a child, heard a full piece of music played and was able to transcribe the whole piece from memory. Is your system able to hear a full piece of music and print the whole piece in music notation, like Mozart?

Well, the problem is that this version, the version pushed to GitHub, works only for real-time input, one note at a time. I initially implemented it for analyzing whole pre-recorded files, and that version can actually recognize notes and pitches over time and reconstruct them, creating MIDI notes while keeping the timing and pitch and so on. But unfortunately, as discussed in the answer to the previous question, the hard part would be polyphonic music transcription, which I don't support yet, so that would sadly not work. Maybe in the future; I hope that's the direction.

We have time for one more question.

So, your system does recognition of monophonic music. Do you know about any implementations that also let the user teach the algorithm, something like: this is the noise, don't interpret it, or: this is the timbre of my instrument, don't interpret anything else? Is that approachable at all?
Because even in Ableton I see different algorithms: one for harmonic detection, another for melody, another for rhythm. Do you know of any method for such supervised detection? Is it practical at all, or would it be too difficult to implement?

I'm not sure I completely understand the question. Would it only transcribe one instrument, or...?

Yes, kind of. Like I could tell the algorithm: try to transcribe this, and don't transcribe that. So that it can rule out your ghost notes, like here: this is noise, yet it gets interpreted. Maybe you could feed this noise to the algorithm and it would learn that it is just noise.

Yes, so as I mentioned before, you would probably first have to have some kind of instrument recognition: find some spectral features that can determine the instrument you want to single out, or something like that. Not trivial to implement, I guess, and I don't think it's on my radar, but it's certainly an interesting thing. You would probably first have to try out different spectral features, see what works best for distinguishing between different instruments, something that captures the timbre, and then apply that as your filter for the pitch detection. That would be my idea of how to approach it, but unfortunately I don't know of any ready implementations like this. There are people who are trying to separate instruments, and it's the same problem: the frequencies overlap, and the characteristics are sometimes too similar. So no, I don't know of anything that would perform well.

And that's all we have time for, so I'm sure you'll all join me in thanking Anna Wszeborowska.