So, now we have Cameron McLeod, speaking about implementing a sound identifier in Python. I'm really excited about it. It's the Shazam thing, and of course you're here because you've heard of it. Okay, so this is the last talk for today, except for the lightning talks, of course. Have fun. Hello. Hello. Okay, fantastic. Can you all hear me? Yeah. Thank you for coming. I'm Cameron. So, just a brief overview: this is really just the theory behind how Shazam used to work. I don't think it still works like that, because they published a paper on it, so they've presumably got something better now. A quick disclaimer: I'm not an expert on literally anything, so everything contained within this presentation should be taken with a pinch of salt. Fantastic.

Music information retrieval. So imagine you're in the car, or if you don't have a car, you're listening to the radio, and of course you're not controlling the music that's being played. A song comes on. It might be Britney Spears or something; you don't know it's Britney Spears, but you think, you know, this is a pretty good song, I could get into this, and you want to find the name of it. Nowadays you've got things like Shazam, which has been around since 2001, to quickly identify what the song is and the artist, et cetera. This is the basic problem from which music information retrieval comes: music isn't easily searchable as a thing. If you've got two different recordings or two different encodings of the same song or the same piece of audio, they're going to vary quite wildly in bit representation, and the files that contain music are also quite large. Shazam solves this using something called fingerprinting, which we'll go through in a bit. But you've also got other applications within the field, such as score search: if you're a musician, you've got that little piece of paper that you flick through quickly and incredibly, because you're playing and flicking at the same time, which I think is quite impressive, and there's a search for that. You can also search by humming nowadays; I can't remember the names of the applications that do it, but they exist. Today we're going to talk mostly about Shazam, but the techniques I'll talk about apply to the other fields as well. It's a little bit magic, I'm going to be honest, but hopefully we'll understand it by the end.

Okay, so why did I choose Python for this? Surely it's a pretty bad choice for real-time processing, where you want things to go fairly quickly so the user doesn't give up and walk away from their phone. But the thing is, it's actually not that bad for data processing. We've got libraries like NumPy, SciPy and Matplotlib for visualizing data when you're developing, and Python's actually quite good for this. A lot of NumPy is written in C, so it's fairly fast. Not only that, but the best language, in my opinion, is the language that you know best, and since we're all at a Python conference, that's probably going to be Python for us.

So here's the awesome demo that totally works. I spent quite a long time trying to get this to work, and, well, it totally works, but I'm not going to show you because it's far too cool for all you guys. Yeah, it doesn't work. I didn't build it. The challenge to the listener is for you to go out and build it after I've told you how it works. So, quick show of hands.
If I was to say the words signals and Fourier transforms to people, who would understand what I meant? Oh, lovely. Fantastic. So, for those few people that didn't: with a Fourier transform, you take a signal in the time domain, so time against amplitude, and you extract the frequencies from it. You don't need to know the maths behind it, because it's awful. If you've ever seen the equations, there are imaginary numbers and cosines and, ew, disgusting. A signal is just information, by the way. You'll also hear me say the words FFT, or the phrase FFT; that's just the fast Fourier transform. It's the algorithm generally used to calculate these things, and it's a lot faster than computing the transform directly from the mathematical definition.

So, the basic structure of the application is a normaliser, a fingerprinter, and some storage afterwards. The thing to recognise with the normaliser is that it's not a normaliser like you might have on iTunes. Usually when people in audio talk about normalisers, they mean taking multiple pieces of audio and making them all roughly the same volume. This is not that. This is taking audio of different sample rates (how fast it's been digitised), different bit depths (sort of the resolution of it) and different formats, and turning it all into one single format. This is great, because it means we only have to write one fingerprinter, and it saves you a lot of development time. Not that I finished that. You might be able to use PyFFmpeg for this, which is a library that basically wraps FFmpeg, and FFmpeg is a converter for all sorts of multimedia; it does video as well. But I looked at the page, and the last time it was updated was about 1934, so I decided against that. There's also libavcodec, which is the underlying C library. You could use that directly with ctypes, but I've used ctypes before, and I decided against that too.

So, the fingerprinter. What do we want from a fingerprint? Well, comparing like for like with raw audio doesn't work: different bit depths, different representations, et cetera. We also want to compress it. You get smaller storage if you're taking fingerprints that come to about seven bytes each, and you've got, what, a few hundred of these per song, which is a lot smaller than three megabytes. And if you're storing a million songs in your database, then you're going to want to compress it as much as you can. This also gives us a faster search: the less you have to search through, the faster you can go through it. Then there's robustness in the presence of noise. If you're, for example, at a gig, before the gig starts they play music over the speakers, and because it's a gig of someone you like (you're there, after all), you probably like the music they're playing, and you might want to identify it, but of course there's a drunk guy shouting next to you. You want your phone to still recognise what song it is despite that noise. You also want it to be able to match only a short recording to the full original, because you don't want to stand there for the entirety of the song just holding your phone up like a proper numpty. It's not that fun.

Okay, so here's the basic diagram of how the fingerprinter works. It's fairly large, but we'll go through it line by line, starting with the first one. These are the diagrams you saw before.
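For anyone who hasn't used one before, here's a tiny NumPy sketch of the idea: a made-up signal containing two known frequencies plus noise, and the FFT pulling those frequencies back out. The 50 Hz and 80 Hz components, the sample rate and the noise level are just illustrative numbers, not values from the talk.

```python
import numpy as np

# One second of a made-up signal: 50 Hz and 80 Hz sine waves plus noise,
# sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)
signal += 0.5 * np.random.randn(len(t))

# The FFT moves us from the time domain to the frequency domain.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The two strongest bins should sit at roughly 50 Hz and 80 Hz,
# even with the noise added on top.
print(freqs[np.argsort(spectrum)[-2:]])
```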
First off, we take the audio, which looks a lot like the left graph, only wigglier and longer, and we split it up into smaller wiggly bits so that we can get the frequencies of each individual chunk. We then do the Fourier transform on each of them. The advantage of this is that the next signal you're about to see is the same signal as before, but with noise added, and in the Fourier transform you can still see the 50 hertz and the 80 hertz quite clearly. So this helps protect us from the noise.

The other thing to do while we're here: humans don't experience frequency linearly. If you played someone a 100 hertz sound and then a 200 hertz sound, and then played them a 10,000 hertz sound and a 10,100 hertz sound, the second pair would seem a lot closer together, because we hear logarithmically. So there's something called the mel scale, which is basically a logarithmic scale for frequency. The added benefit is that it cuts down the data: the highest mel we can hear is about 4,000, whereas the highest frequency we can hear is about 22 kilohertz.

So you then take this and you hash it. Hang on, have I skipped something? Yes, I've skipped something. Sorry, here we are. You take this and you make it into a spectrogram. What we've done is we've taken this Fourier transform, made the high-amplitude bits dark and the low-amplitude bits light, turned it on its side, and done this a bunch of times across the entire track. That gives you something called a spectrogram, which you can see on the left. It's basically a representation of frequency over time for the song or piece of audio you've got. We want to take the highest points of this, because as we saw before, the highest points survive the noise. With that, you basically run a nearest-neighbour search on it, so you just check for local minima, local maxima, sorry. I think the way I did this was I just sorted them and picked the top 30 or whatever.

Yes, so we'll go back to the slides from before. You've got your anchor points. You've got your maximum points now, spread across the whole track, and you split them into larger regions. In each region you want to find something called an anchor point, and this anchor point is just used for pairing with other points. In the original Shazam paper, which you can find online (I can show you a link afterwards if you want), he says, okay, you need to get some anchor points, and you ask how, and of course, because it's a paper, it doesn't respond, which is unfortunate. So I took the maximum: I decided the largest point in that region would be our anchor point. With this, you literally just pair it up with every single point in the following region. The final region you just skip. The reason we're doing this, instead of just storing all the frequencies, all the maximum points, is that it adds more entropy; it makes your song more distinguishable. If you had Metallica's "Wherever I May Roam" and Britney Spears' "Oops!... I Did It Again", they might share some frequencies, but we'd all probably agree they're fairly different songs, even if you've never heard one of them. You'll have heard at least one of them. So by pairing them up, you've not only got the one frequency and the other frequency, but you've also got the time difference between them.
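As a rough illustration of the peak-picking and the mel conversion, here's a small sketch. The per-window "sort and take the top few" approach mirrors the crude method described above, and the mel formula is the standard one that comes up again in the Q&A later; the function names and the peaks_per_window value are my own choices, not code from the talk.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Standard mel-scale formula; roughly 22 kHz maps to about 4,000 mels
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def constellation(spectrogram, peaks_per_window=5):
    """Keep only the loudest few frequency bins in each time window.

    `spectrogram` is an (n_windows, n_bins) array of magnitudes, i.e. one
    FFT magnitude spectrum per short window of the track. Returns a list of
    (window, bin, magnitude) triples: the "constellation map" of the audio.
    """
    points = []
    for t, spectrum in enumerate(spectrogram):
        # "Sort them and pick the top N or whatever", done per window here
        for b in np.argsort(spectrum)[-peaks_per_window:]:
            points.append((t, int(b), float(spectrum[b])))
    return points
```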
It's adding more information, more distinguishability. So here we are. Here is your hash. This is a hash. We don't bother with SHA-256 or whatever; we literally just take the two points, F1 and F2, and concatenate them. So a bit on the left for F1 (frequency 1), then frequency 2, then the difference in time. You also store this along with the time of the first point, so you can identify where in the track you are, and the ID of the song you're fingerprinting, so you can identify which song you actually have. That's an important bit. Don't forget that.

There we are. So, storage. You've now got this F1, F2, DT, T1, and these pack together quite compactly. You want it to be a fixed size, because variable-sized storage is awkward to compare, and it's easier this way. The frequencies are in mels; as I said, 4,000 mels is about the highest you can hear, so we can fit that into twelve bits, 4,096 values. Time is in milliseconds here. That's arbitrary; you could do it in system ticks if you felt particularly masochistic. And that means you can store the whole thing in about 10 bytes. The time difference gets 1,024 values, which gives us about a second between an anchor point and each paired point, which is fair enough. It gives you about 512 milliseconds per region, and you don't really need anything more than that. The windows that we split the audio into for the Fourier transform are about 16 milliseconds-ish, depending on how you configure it. And the reason we store this in milliseconds, as opposed to a number of windows, is that if you later change the number of windows per region for the anchor points, you can keep compatibility. This also imposes a limitation on T1, the time of the first point, as that can be no more than, here we go, 4,194,304 milliseconds. I didn't read that, I memorised it. That's about 70 minutes, which, if you're doing a music identification service, is going to cover all but the most obscure prog tracks. For something else, say an identification service for talk audio, which could be something you build, you might want that field to be bigger. You then put these into a database, and you can search through them.

What if you just want something that works? There is an application out there called Dejavu. That's it. And it is this: it's a Python implementation of Shazam. I swear I didn't know it existed before I started this, but it does literally just work. You put it on your computer, you press play, and it goes, and it's fantastic. If you're looking to do it, that's the GitHub address, and that's the guy's name. He's a genius. Go do something with it.

On to the takeaway. We may look at some code afterwards, because I've gone through this a bit quick. This project was inspired by a talk at FOSDEM. I admit I didn't finish the project, but I've learned a hell of a lot doing it. I'd never done any of this maths-y DSP stuff before. It's quite cool; you should look into it. There are a lot of talks here, and I've seen them ranging from Go to machine learning to the micro:bit. I hear we might be able to get micro:bits; that sounds cool. So pick something up and run with it, and I want to see you guys up here next year. For example, I saw one paper where a guy took multiple videos of a concert from different viewpoints and synchronised them all using fingerprints of the audio. That could be something you guys have a go at implementing.
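Here's a hedged sketch of both the pairing and the fixed-size packing described above. The region size in windows, the exact bit widths (12 bits per mel frequency, 10 bits for the time difference, 22 bits for the anchor time) and the function names are my reading of the numbers in the talk, not a definitive format.

```python
import struct

def pair_up(points, region_size=32):
    """Pair each region's anchor with every point in the following region.

    `points` are (window, mel_bin, magnitude) triples, e.g. from the
    constellation sketch above; `region_size` is a placeholder region
    length in windows. Returns (f1, f2, dt, t1) tuples, times in windows.
    """
    regions = {}
    for w, f, mag in points:
        regions.setdefault(w // region_size, []).append((w, f, mag))
    keys = sorted(regions)
    pairs = []
    for a, b in zip(keys, keys[1:]):          # the last region has nothing after it, so it's skipped
        anchor_w, anchor_f, _ = max(regions[a], key=lambda p: p[2])  # biggest peak = anchor
        for w, f, _ in regions[b]:
            pairs.append((anchor_f, f, w - anchor_w, anchor_w))
    return pairs

def pack_hash(f1, f2, dt_ms, t1_ms):
    """Pack one (f1, f2, dt, t1) fingerprint into a fixed 8-byte value.

    Assumed widths: 12 + 12 + 10 + 22 = 56 bits, so it fits in a uint64.
    Convert window counts to milliseconds before calling this.
    """
    value = ((f1 & 0xFFF) << 44) | ((f2 & 0xFFF) << 32) \
            | ((dt_ms & 0x3FF) << 22) | (t1_ms & 0x3FFFFF)
    return struct.pack(">Q", value)
```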
I think there's only been one implementation so far, and the world always needs more. Questions? Or code? Code? Let's look at some code. Hello, code. Okay, cool. So this is a notebook; it's available on GitHub. I'll talk you through it quickly. It's not very well organised. You read in the audio. This is what a song looks like, by the way. It's quite cool if you've never seen that before. You then do... this was me figuring out how FFTs worked. They're quite difficult. That's what you get to start off with, and you don't want that. So there you go, there's your mel frequency. That's the equation from Wikipedia, don't blame me. We then take that and just take the integer part, so that we get smaller storage. That function is far more complicated than it needs to be. And so this is normal: these are your frequencies before taking the integer, and that's... hang on, sorry. That is your raw FFT, and as you can see, the frequencies that we want are all near the bottom, because, well, this particular piece of audio is quite basic, but that's where we generally tend to hear most of it. So if you take the mel version, the bottom stretches out quite a lot more, and you've got a lot more useful information for a lot less storage. Defining some stuff. Defining some stuff. Not very interesting. Oh, spectrograms. Matplotlib, I've been told, does spectrograms. This doesn't look particularly spectrogram-y to me. Constellation maps: I didn't actually plot that, but there's your constellation map, and then the hashing. This is all on GitHub for you to look at if you particularly want it.

So, yeah, actual questions. Hello? Yes. Okay, so you have to speak here, because it's recorded. Really silly question: what's a mel, and was it named after whoever came up with it? I'm pretty sure it was. And a mel is defined as 2,595 times the log base 10 of one plus your frequency over something. It's a very weird unit for frequency, and it's basically a logarithmic scale for frequency as opposed to a linear one. Hang on, where's the equation? Did I skip it? Yeah, I did. There you go: log base 10 of one plus your frequency over 700. Don't ask me where they got that from. It was an experiment in the 50s: they got a bunch of musicians and said, okay, can you hear this? Is that different from the one before? Is that different from the one before? And they just plotted that and made a graph from it.

More questions? Actually, I have a card. Oh, there's a question. So the fingerprint is of the entire song, and it's just that collection of data points that you get? Yeah, so the fingerprint is the collection of... not that thing, not that thing, that thing. You take this thing and you do this. So for that, each of these links would be one hash, and for the entire song you do that, and those together would be the fingerprint of your song. So yeah, each of those would be a hash, and those hashes together would be the fingerprint of your song. And so how do you compare the two fingerprints then? You literally search for exact matches. The storage bit is the bit I didn't really do very well; that's why it doesn't work. But you compare the two sets of hashes, see which match exactly, and you can basically build up a count of how many match. There was another talk on this, I can link you to it, it's from FOSDEM. He's much cleverer than me, and he knows these things.

I have a question, so I can ask it myself. So, to what kind of distortions is it robust?
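On the question of comparing fingerprints, here's a very naive sketch of the "count how many hashes match exactly" idea. The data shapes and names are assumptions, and a fuller matcher would also check that the matching hashes line up consistently in time rather than just counting them.

```python
from collections import defaultdict

def best_match(query_hashes, database):
    """Score each song by how many of the query's hashes match exactly.

    Assumes `database` maps packed hash -> list of (song_id, t1) entries,
    and `query_hashes` is a list of (packed_hash, t1) pairs taken from the
    short recording. Returns the song_id with the most exact matches.
    """
    scores = defaultdict(int)
    for h, _t_query in query_hashes:
        for song_id, _t_db in database.get(h, []):
            scores[song_id] += 1
    return max(scores, key=scores.get) if scores else None
```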
So if, for example, you're a DJ and you're pitching a track, will it still work, or is it off then? So, time-based distortions: it should work, within an extent. Obviously if you stretch it 5,000 times it's not going to work, but if you stretch it a little bit, it should still work, because you've still got the fingerprints. It depends, when you search, on how close you decide the times have to be. In terms of other noise, Shazam, when they released the paper, said it worked perfectly over the GSM network. A bit of history for everyone: Shazam wasn't always an app. It was a service that you rang up, and it would be like, hold your phone up to the music, and you held your phone up to the music, and then it would text you afterwards, which was really quite cool, and that's how they started. So it's robust to the phone network, and it's robust to TV noise and things like that.

Okay. More questions? Yes, do you know if there is some similarity in the hashing system to MusicBrainz? MusicBrainz. Now, I have something for this. It's not that slide; that's in fact all the processes. What is MusicBrainz? I'm going to try and remember what MusicBrainz is. Oh, it stopped. Oh no, I haven't written down what MusicBrainz is. Sorry, I've no idea. So, MusicBrainz is basically a database of hashes of tracks, of the whole track, and basically it helps you: you have some files on your computer and you want to fix the tags, to fix the title or the artist. It has a database of all of these tags and also a hash of the file, so if you have an MP3 converted from Ogg or from whatever, it will still be able to find out which track it is. Okay. So if it's based off the Shazam paper, then yes, it's similar. However, there are a few other big papers that have been released in this area. "Computer Vision for Music Identification" was a very big one, and that uses a slightly different technique, in that it uses computer-vision similarity techniques to calculate the hash. That is a subject for a whole other five talks. So, yes and no, maybe.

Okay, more questions? I'm still fit enough to come round to everyone. Also in the back? Nobody. Okay, if there are no more questions, then we can close the session, and also the first day of EuroPython. I hope you had fun and learned a lot of stuff. See you tomorrow. Thank you.