Hello, everyone. Welcome to another episode — a special one, because today we're going to walk through how to build, from inception to actual implementation, a music generator... or rather, a music recommendation system. Yeah, you got that right. First we'll start with a proof of concept, just a little overview of what we're going to be doing, and then we'll get into each of the individual components separately. It's going to be a long video, but it's going to be fun, so keep following along if you can. Okay, so first of all, we have a piece of audio, and we want to pass it into an embedding generator. What an embedding generator does is take your song and vectorize it — convert it into a byte vector of sorts. It then takes that vector and passes it into a recommendation engine, because machine learning understands math. The recommendation engine will then recommend the values closest to that original piece of audio. So effectively, we're building a recommendation system, or recommendation engine. Now, to build all of these components, we rely on three major resources. The first is the AudioSet dataset, which provides us the YouTube video segments and associated resources. Then we have the MAX Audio Embedding Generator from IBM, a tool that converts raw WAV files into vectors. And then we have Annoy, an approximate nearest neighbors implementation by one of the engineers from Spotify; it takes your vector and spits out the neighboring vectors, and those vectors can of course be mapped back to music. That's really the entire gist of the video: we're going to walk through each of these separately — AudioSet, the MAX Audio Embedding Generator, and Annoy — with code examples showing how they're used. So keep following along. Okay. First up is AudioSet. This dataset, like I mentioned, is a set of YouTube video IDs. It's a massive file containing JSON-like information, but it's not exactly JSON, so we'll have to do some conversion and processing later. Let me give you an example of the format. This entire JSON-looking structure describes one sample video in the corpus, and there are roughly 13,000 samples like it. So what does it actually mean? First, we have a video ID, which is essentially a YouTube URL — youtube.com/watch?v= whatever that ID is. Then a start time and an end time: it's saying that from this video, we want the chunk from 6 seconds in to 16 seconds in, and that 10-second chunk is the frame of music we need. Next we have labels, which are like tags given to the video: is this music? Within music, is it a guitar? Or is it a man talking, a woman talking, a car horn? All of these labels are stored in a list. Roughly, one sample carries the fields sketched below.
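Just to make the shape concrete, here's a rough hand-written sketch of what one sample carries. The values are invented purely for illustration; only the field names mirror what I just described.

```python
# Illustrative only -- the field names mirror the description above; the values are made up.
sample = {
    "video_id": "abc123XYZ",       # hypothetical ID -> youtube.com/watch?v=<video_id>
    "start_time_seconds": 6.0,     # take the clip from 6 seconds in ...
    "end_time_seconds": 16.0,      # ... to 16 seconds in (a 10-second chunk)
    "labels": [0, 137],            # integer tag IDs whose meanings live in a separate CSV
    # plus a feature list: ten lists of 128 numbers, one list per second of audio
    # (that part is explained next)
}
```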
And the meaning of these labels — well, it says it can be found here, but it's actually in a CSV that I'll share with you later, and it's exactly the kind of tags I just described. Then we have feature lists, which say that every second of the audio is encoded into a 128-number feature vector. So say we have a 10-second piece of audio: we chunk it into 10 one-second pieces, and each second is encoded into an array of 128 numbers, basically a byte array. This first list is the first second of audio, this one is the second second, and so on — 10 lists, one per second. And that's our entire dataset: this structure repeated maybe 13,000 times. It's a really big file; downloaded, it's about 2.4 GB. You can get the whole thing by clicking this link right here, and then you just need to extract it. All of those files — I think I have them here, yeah — live in this audioset_v1_embeddings folder and are stored as TFRecords. Each of these records has exactly this kind of information, for all of those examples. Now, about these features: when you look at raw music waveform data, CDs for example are sampled at 44.1 kHz, meaning 44,100 numbers are used to represent one second of audio. That's a lot to work with in machine learning, which is why the output here — those 128 numbers per second — is far smaller (roughly 340 times fewer numbers) while still representing the same information. This is the embedding we were talking about back in the proof of concept. And these embeddings are actually pre-generated by the MAX Audio Embedding Generator, so we don't need to do the embedding ourselves — they're already in the dataset. But let me explain how that embedding generation works. Let me come to this screen. So I'm running an application — I didn't build it myself — that I cloned from IBM's MAX Audio Embedding Generator repository. I've uploaded a WAV file; it's just a car horn. If you click to execute, it returns a response right here, and the response is the embedding. Like I said, an embedding is generated for every second of audio: you see this list of lists, where the first list — up until here — is 128 numbers, representing the first second of the car horn. The next one represents the second second of the car horn, this is the third, and so on. Now I'm going to explain exactly how you can run this yourself.
You can see I'm running this on localhost on port 5000. To do exactly this, go to IBM's MAX Audio Embedding Generator repository, clone it, and follow the steps for running locally. Basically: clone the repository, go into its main directory, and make sure you have Docker installed, which lets you build containerized applications — applications that run on their own with all their dependencies in one bundle, a good way to build software these days. Then you just run the model: docker run on port 5000, point your browser at localhost:5000, and you'll get exactly this page. You can play around with it, upload WAV files, and see the corresponding results. Hope that's super clear. Okay, so now we've got some intuition. Let's go back. We know what the AudioSet dataset is for, and we know how to create embeddings for audio using this tool. So now we have a vector — but we have around 13,000 of these examples, and we don't want to sit and do this manually for every single one. Like I mentioned, AudioSet already comes with the embeddings pre-computed using the MAX Audio Embedding Generator; this part is more for your reference, so you understand what the data in AudioSet actually represents. Sorry if the explanation jumps around a bit — there are just a lot of files to go through. The idea is for you to understand what's going on; as long as you do, I'm satisfied. So you can run the model with the fancy UI, but they also give you this little curl request, which we can use to pass in an audio file — a WAV file — via a terminal command and get the response back. And they have a demo notebook for that. I think it's in samples — yes, demo.ipynb. I'm not going to open it there, because I have it open over here: that same notebook, copied and pasted (hence all the licensing information), with a couple of changes I made just to make sure it runs properly. What it does is read a WAV file and get the corresponding embeddings, like I just showed, but via a code interface. Let's go through exactly what this notebook is doing. First, we import a bunch of libraries: json to deal with JSON data, matplotlib to plot (there's a chart coming up), and cosine similarity for measuring the similarity between two vectors. So cool.
glob is used for fetching media assets — here we use it to fetch everything in a directory, specifically all the WAV files in demo assets, which is basically all the stuff in this repository right here. If we list out what's being fetched, you see a WAV file of a jazz guitar, a WAV file of birds chirping, a WAV file of gunshots, and so on. The goal is to convert these WAV files into embeddings using the MAX Audio Embedding Generator. By the way, just to make it clear: an embedding is an array that represents the file it comes from — in this case the WAV file. It's essentially a compressed representation that keeps the meaning of the data, and because it's compressed, we can work with it in machine learning, or with the nearest-neighbor approach we're going to use later. Now, coming to this section: for every single WAV file, we execute this curl request. I pass the audio WAV file to the URL we just put up, it runs the prediction model — which runs the audio through the embedding generator — and it writes the resulting embedding to a JSON file. That's the response format we just saw. Next, I extract the embedding, which is a list of lists where each inner list is 128 numbers, and I cut the audio clip at 10 seconds: clips come in various lengths, but I only keep clips that are at least 10 seconds long and truncate them at 10 seconds, so every clip ends up the same size. The list of lists will therefore contain exactly 10 lists, each of 128 numbers. I print out every call — there are about 19 or 20 of them — encoding each of the WAV files listed above into an embedding. Then, since each sample is currently a 10 × 128 list of lists, I flatten it, so each WAV file is now represented by a single vector of 1,280 dimensions. With this much more compressed representation we can do things like compute the cosine similarity between vectors: the higher the cosine similarity, the more similar in meaning two vectors are. I'm testing out a few values here — it's a value between zero and one. A rough sketch of this whole step — calling the local service, truncating, flattening, and comparing — is below.
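Here's a minimal sketch of that flow, assuming the model is running locally on port 5000 as described above. The endpoint path, the form-field name, and the response key are my reading of the demo's curl call, so treat them as assumptions and check them against your own clone.

```python
import numpy as np
import requests

MODEL_URL = "http://localhost:5000/model/predict"  # assumed endpoint of the local container
NUM_SECONDS = 10

def embed_wav(path):
    """Send one WAV file to the local embedding service and return a flat 1280-dim vector,
    or None if the clip is shorter than 10 seconds. (Sketch only.)"""
    with open(path, "rb") as f:
        resp = requests.post(MODEL_URL, files={"audio": f})  # field name assumed from the demo
    resp.raise_for_status()
    embedding = resp.json()["embedding"]  # per-second lists of 128 numbers (assumed key)
    if len(embedding) < NUM_SECONDS:
        return None
    return np.array(embedding[:NUM_SECONDS]).reshape(-1)     # truncate to 10 s, flatten to 1280

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage sketch (hypothetical file names):
# jazz = embed_wav("demo_assets/jazz_guitar.wav")
# birds = embed_wav("demo_assets/birds1.wav")
# print(cosine_similarity(jazz, birds))
```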
In that printout you see the first WAV file, the second WAV file, and their corresponding cosine similarity, and it looks pretty good — the values are high for things that are similar, like two types of birds or two types of guitars. Next, let's print all of these into a cosine similarity matrix, this little bad boy right here: cosine similarity between normalized embeddings, where the x and y axes are the corresponding items and higher is better. What stands out: 0.89 is pretty high for the jazz guitar — and what is it 0.89 against? A normal guitar, which is exactly what we'd expect: the jazz guitar is most similar to the normal guitar. And birds two is most similar to birds one — look at that. And for gunshots, 0.86 is the highest, against fireworks, which also makes sense. So overall, the embeddings generated by the MAX Audio Embedding Generator are pretty good to use. Sure, some of these values are a little higher than I'd like, but on a relative scale it works for us; it's good enough. So, going back to our proof of concept: we're now able to produce a set of these vectors and pass them to the recommendation engine. Of course, I just ran it on some random WAV files — we actually need to run this process over our AudioSet data, which is what I do next. That brings us to the objective of the second notebook (it's called audio set processing; this is the second of three notebooks, so you know what's coming): read the AudioSet data that's packed into those TFRecords and convert it into a single JSON file that will act as our dataset. The prerequisite is to download the AudioSet archive, uncompress it, and make sure the notebook sits in the same directory as the uncompressed embeddings — at the same folder level as audioset_v1_embeddings, this folder right here. Cool. Now, I wrote this notebook so I can explain things step by step. We'll be using some TensorFlow, so you'll want to install that. The directory we read from is the eval folder, and I collect every file that ends in .tfrecord into a list, so this dataset variable is essentially a list of TFRecord file paths. I also enable eager execution: TensorFlow (1.x) normally uses lazy execution, where it builds an entire computation graph and only runs it when it actually needs the result; eager execution is what you'd typically expect — as soon as it sees a function call, it executes it. I'm not sure I strictly need it here, but I was relying on it while testing things out, so I enable it. The setup looks roughly like the sketch below.
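A minimal sketch of that setup, assuming a TF 1.x environment and that the extracted TFRecords live under audioset_v1_embeddings/eval/ relative to the notebook (the exact relative path is my assumption):

```python
import glob
import tensorflow as tf

# TF 1.x style, matching the notebook: turn on eager execution so records
# can be read and inspected immediately instead of inside a lazy graph.
tf.enable_eager_execution()

# Collect every .tfrecord file in the eval split of the embeddings download.
eval_dir = "audioset_v1_embeddings/eval"              # assumed relative path
dataset_files = sorted(glob.glob(eval_dir + "/*.tfrecord"))

# Wrap the file list in a TFRecordDataset so we can iterate over raw records.
raw_dataset = tf.data.TFRecordDataset(dataset_files)
```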
Since dataset is a list of TFRecord files, I wrap it in a TFRecordDataset object, which I call raw_dataset. Now, this entire dataset covers many types of audio. In fact, let me show you the class labels, because I want to see all the labels a sample can correspond to. They live in this CSV file: these are all the possible labels that can be attached to an AudioSet sample — that is, all the values that can appear in the labels field of the example we looked at. That's exactly what we're seeing here: 88 corresponds to clip-clop, whatever that is, 87 is the sound of a horse, and so on — there are hundreds of these. So I read that CSV and get those labels, and then I only keep the ones we actually care about — music, since this is supposed to be a music recommendation system. I do a case-insensitive string search for every label that contains the word "music" and store those as our set of music labels. Next, we iterate over every single record. This is the main meat of the code, where we convert all those TFRecords from that awkward form into a JSON dataset that's more manageable. For every TFRecord, TensorFlow has an object that helps deserialize this kind of record — a SequenceExample; I'll put a link about this data structure and how to process TFRecords in the resources down below, so you can read up on exactly how and why I'm using it here. Anyway, that's just to pull the data out of the TFRecord. We know there's audio metadata associated with each record — the context chunk I keep referencing, which has the video ID, start time, end time, and labels — and that's exactly what I extract: from the context features I take the labels, following the structure down until I get the actual label values as a list, along with the start time, end time, and video ID, and I store all of it. When storing, though, I check whether this particular sample corresponds to a music label. It's essentially an intersection: a sample will have several tags, and if at least one of them is a music-type tag, I want to include it in our dataset; if not, forget it, skip to the next record. That logic is right here. A rough sketch of the whole record-to-sample conversion, including the embedding extraction I'll explain next, is below.
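Here's a minimal sketch of that conversion, assuming each record is an AudioSet-style SequenceExample with the context keys and the audio_embedding feature list described in this video; music_label_ids is assumed to be the set of integer label indices whose CSV name contains "music".

```python
import tensorflow as tf

NUM_SECONDS = 10

def record_to_sample(raw_record, music_label_ids):
    """Turn one raw TFRecord into a JSON-able dict, or return None if the clip
    is not music or is shorter than 10 seconds. (Sketch only.)"""
    ex = tf.train.SequenceExample.FromString(raw_record.numpy())
    ctx = ex.context.feature

    labels = list(ctx["labels"].int64_list.value)
    if not set(labels) & music_label_ids:            # keep only music-tagged clips
        return None

    # One Feature per second of audio; each holds 128 quantized numbers as bytes.
    frames = ex.feature_lists.feature_list["audio_embedding"].feature
    seconds = [list(f.bytes_list.value[0]) for f in frames[:NUM_SECONDS]]
    if len(seconds) < NUM_SECONDS:                   # skip clips shorter than 10 seconds
        return None

    return {
        "labels": labels,
        "video_id": ctx["video_id"].bytes_list.value[0].decode(),
        "start_time": float(ctx["start_time_seconds"].float_list.value[0]),
        "end_time": float(ctx["end_time_seconds"].float_list.value[0]),
        # flatten 10 x 128 into a single 1280-number list
        "audio_embedding": [n for second in seconds for n in second],
    }

# Usage sketch: iterate the eager TFRecordDataset built earlier.
# music_set = [s for r in raw_dataset
#              if (s := record_to_sample(r, music_label_ids)) is not None]
```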
Now, let's say it is a music sample — great. Next we extract the actual audio embedding, which is the second part of the record: you go down into feature_lists, then the feature list under the key audio_embedding, and for every second you get the byte list — that's exactly what I do here: feature_lists, then feature_list["audio_embedding"].feature, and for each feature the bytes_list. That becomes our embedding: a list of lists, each inner list 128 numbers long. Of course, these can vary in length depending on how long the audio is, but for the embeddings in our dataset we truncate to 10 seconds (num_seconds is 10), so we just take the first 10 lists in that list of lists. And finally, we check the length: if we're dealing with shorter audio — a six-second or three-second clip — it wouldn't be uniform with the other clips that are cut at 10 seconds, so we simply skip it. Then all we do is wrap everything up in a little JSON object: the labels, the video ID, the start time, the end time, and the audio embedding, which is now a flattened list of 1,280 numbers corresponding to 10 seconds of audio. Very cool. We append that to our list, and I print a counter just to see the size of the dataset: around 2,000 records end up processed, which is great — we have over 2,000 pieces of music to play with. Finally, I dump all of this into a JSON file, and this music-set JSON is now our formatted dataset that we'll use in the third notebook for generating recommendations — that's coming up next. Luckily, you don't actually have to run any of this code, because I'm going to provide the music-set JSON in the repository linked in the description below. It's a big file — about 13 megabytes of raw data. I'm also printing the first 10 numbers of the 1,280 for a few samples, just to check that it's working; honestly, you don't really need that part. So cool, that worked, and notebook number two — audio set processing — is complete. Now, before we get into the third notebook on Annoy, let me explain exactly how Annoy computes nearest neighbors, using a slide deck from the creator of Annoy, Erik Bernhardsson. He built the Annoy index while he was at Spotify, and it helps you find approximate nearest neighbors. Why do we want approximate nearest neighbors? Because it's much faster: if you have, say, 10,000 pieces of music in your dataset, you can't just do a brute-force linear scan over all of them for every query — it would take forever.
Annoy is just hundreds of times faster. So we're going to use an approximate nearest neighbor algorithm to get pretty good recommendations. These slides are an excerpt from a 50-minute video he has on YouTube explaining the whole concept, but I'm going to distill it down to what we need for this video. So let's get into building the Annoy index. Say each of these X's is a point in a vector space — this is the embedding space generated by that MAX audio embedding generator. Here it's drawn as a two-dimensional space, but imagine it's really our 1,280-dimensional space; two dimensions is just for the sake of understanding. Here's how it works: you pick any two random points in your embedding space, draw a line between them, and split that line with a perpendicular line. Now you have two separate halves. Then you choose two points within the top region and two points within the bottom region, and split each of those in two again, randomly. You keep doing that — pick any two random points per region and perform a split — until eventually you end up with these small, itty-bitty regions. This partition of the embedding space can also be written down as a tree — not quite a decision tree, more like a binary tree. The root node holds all the points; after the first split you have two regions; and you keep splitting each region in two, going down and down until you reach the circular leaf nodes, each of which corresponds to one of those little regions and the handful of points inside it. Now, searching for a point: say you give Annoy a vector and want its nearest neighbors, and the actual point in the embedding space lies here, at this white X. You'd think, fine, just traverse the tree down to the leaf whose region contains it and recommend the points in that leaf. But there's a problem: there may be points — like these X's over here — that are very close to our query point but would never be recommended, because they sit in a totally different region, and all of that happened arbitrarily because of a random split. So there are situations where the query point lands very close to a splitting boundary, and when that distance is small — below some threshold — we want to traverse both sides of the split. That's what they do here; a tiny sketch of that split rule is below.
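Just to make the split rule concrete, here's a toy sketch of the geometry in plain NumPy: pick two points, split along the perpendicular boundary between them, and explore both sides whenever the query sits closer to that boundary than some threshold. This is an illustration of the idea under those assumptions, not Annoy's actual implementation.

```python
import numpy as np

def make_split(a, b):
    """Splitting boundary equidistant from points a and b:
    its normal is (b - a) and it passes through their midpoint."""
    return b - a, (a + b) / 2.0

def signed_margin(x, normal, midpoint):
    """Signed, distance-like margin of x from the boundary.
    Positive -> b's side, negative -> a's side."""
    return float(np.dot(x - midpoint, normal))

def sides_to_explore(query, normal, midpoint, threshold=0.5):
    """Explore one side normally, but both sides when the query is
    near the boundary (the 'traverse both sides' case described above)."""
    m = signed_margin(query, normal, midpoint)
    if abs(m) < threshold:
        return ["a_side", "b_side"]
    return ["b_side"] if m > 0 else ["a_side"]

# Toy usage in 2-D (the real space is 1,280-dimensional):
a, b = np.array([0.0, 0.0]), np.array([4.0, 0.0])
normal, midpoint = make_split(a, b)
print(sides_to_explore(np.array([2.1, 3.0]), normal, midpoint))  # near the boundary -> both sides
```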
So here's the main idea: say you make a split, and you now have two regions, and our query point actually lies here. The thing is, when we made that split — between two points chosen by random chance, because all of this happens randomly — the query point can end up extremely close to the splitting boundary, lower than some threshold. And because it's below that threshold, we traverse both sides of the split, just to make sure we're picking up all the points from both regions. So we don't only have this one region; we also have one, two, three, four, five other regions that could potentially contain our nearest neighbors. If you highlight all five of those regions in the actual space, you get something like this: those five regions plus the main one. So basically, if we pass in an embedding — some piece of music that lands at this red X — Annoy will recommend all the points across this combined region. And that's pretty cool. Now, implementation-wise, this whole tree, and this whole search for the nearest neighbors, can be implemented with priority queues, which are in turn implemented with heaps. With a priority queue you have a structure where a weight is assigned to every node and nodes are processed in order of that weight — this part is a little opaque to me, honestly, so I really do suggest watching his video for the specifics of the priority-queue implementation. But essentially it's just the mechanism for extracting the nearest neighbors, like we see in this diagram. Now, the remaining problem is that all of this comes from just one tree: for this embedding space we built one tree and got one set of nearest neighbors. But that's just one sample of random splits. It's always better to have multiple trees — we can build several of them so that, by random chance, we don't miss certain points. The more trees we build, and the more we take all of them into consideration when determining the nearest neighbors, the more reliable the result becomes. So for example, with one set of random splits on this dataset we get this set of nearest-neighbor points — the red X is the piece of music we put in, and all its nearest neighbors fall within this purple region. Then, on the same set of points, we run Annoy again with new random splits, and this time we get these splits over here. And a third time, we get these over here. Because of that, if we want to recommend, say, 10 songs per tree, there will be 10 recommended from this tree, 10 from this one, and 10 from that one — and they may overlap.
But then we use all of these trees combined to decide which points to actually surface as the nearest neighbors. I think that becomes clearer over here. The steps Annoy takes are basically: search all the trees until we have K items each. Say we want a recommender where we input a piece of music and get 10 recommendations out — then take K to be 10. Every tree we construct will output 10 items, so if we build 50 trees (and in the code you'll see that I do use 50 trees), that's 500 items overall. Then you take the union and remove duplicates — and there are definitely going to be a lot of duplicates, because even a single Annoy tree already gets pretty good candidates, so I wouldn't be surprised if all the trees, regardless of their random splits, surface similar recommendations. Say those 500 distill down to about 100 unique points. And since it's now only about 100 points, we can afford to compute the exact distance from the query vector — the music we want recommendations for — to every single one of them; that's not too bad. Once we've done that, we sort them in ranked order of nearest distance. All of that is represented visually here: you have three trees (in our case it'll be 50), you find the candidates — each leaf contains maybe 10 neighbors — you take the union of all of them (maybe 100 points after removing duplicates, like I mentioned), you compute each one's distance from the query point, and then you rank them to find the nearest 100, or the nearest 10, or however many you want. And that's it — that's the entire approximate nearest neighbor algorithm in Annoy, and you'll find that it's blazing fast. So with that intuition in mind — and I hope that was a respectable explanation; you can always watch his video for more context, but it's 50 minutes long, so I'm saving you some time — let's build the recommender with Annoy. Very easy. The objective is to get the top 10 recommendations for a piece of input music. To run this notebook you need the music-set JSON file from the previous notebook, but I'll provide it for you. So first, install annoy. I also need ast.literal_eval, because I'm reading from the JSON file and I want to make sure everything comes back as properly structured data — you'll see that right here where I read the file. And then I read the class labels from that CSV I showed earlier, the one with all the categories and classes, and turn it into a dictionary — so one index would be speech, another would be male speech, something like that — roughly like the little sketch below.
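A minimal sketch of building that lookup, assuming the CSV is AudioSet's class_labels_indices.csv with index and display_name columns (adjust the filename and column names to whatever your copy uses):

```python
import csv

def load_class_labels(path="class_labels_indices.csv"):
    """Map integer label index -> human-readable name, e.g. {0: 'Speech', ...}."""
    with open(path, newline="") as f:
        return {int(row["index"]): row["display_name"] for row in csv.DictReader(f)}

labels_by_index = load_class_labels()

# The music-only subset used when filtering the dataset earlier:
music_label_ids = {i for i, name in labels_by_index.items() if "music" in name.lower()}
```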
That makes the labels easy to access. Now, once I read the JSON file, I take the length of the dataset — like I said, it's about 2,000 examples, so we have roughly 2,000 pieces of music to play with. And this is the meat of the code right here; it's really all you need for Annoy. We create an Annoy index, passing in the number of dimensions each of our samples has — 1,280, the length of the vectors that will be indexed. Then, for every single piece of music, we add it to the index. We're basically saying: hey, these are the samples we want you to find nearest neighbors from; this is our entire world. Now, I could only pass in 1,000 of the 2,071 examples, because otherwise it crashes the program. A way around that barrier would probably be to reduce the number of dimensions with some dimensionality-reduction technique, but even with 1,000 examples this works perfectly fine. So for every one of these, I grab the corresponding data — the actual 1,280-dimensional vector — and add it to the index under an integer ID: 1, 2, 3, 4, and so on. Next, we build the index. Build does the heavy lifting of creating those random splits — 50 times, creating the 50 trees we talked about. The more trees, the more faithful the nearest neighbors get: less approximate, more actual. So for more accuracy, you can build the index with a higher number of trees. Then you can save the index to a .ann file. Cool, that's done. Now, to get the nearest neighbors, here I'm looking things up by item: say I want the nearest neighbors of item number 193 in the index we just built, so I take that index and ask for its 10 nearest neighbors. I know 193 seems like a weird number, which is why I've also printed out the whole music dataset behind the index — all 1,000 examples — so you can see, for each index, the labels attached to it. That lets you compare: 193 is opera music. Okay, opera music — that's just for intuition. Now I have the nearest-neighbor indices stored here, and for every single one of them I print out its details. The first nearest neighbor is, of course, 193 itself — obvious, because that's exactly the piece of music we fed in, so it's the closest possible match. Put together, the Annoy part of the notebook looks roughly like the sketch below.
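A minimal sketch of that indexing and querying, assuming the flattened 1,280-number embeddings from the music-set JSON produced by notebook two; the angular (cosine-style) metric is my choice here, and the notebook may simply use Annoy's default.

```python
from annoy import AnnoyIndex

DIMENSIONS = 1280   # 10 seconds x 128 numbers per second
NUM_TREES = 50

index = AnnoyIndex(DIMENSIONS, "angular")   # metric is an assumption; cosine-like suits these vectors

# music_set: list of dicts with an "audio_embedding" key, as produced by notebook two.
for i, sample in enumerate(music_set[:1000]):     # only the first 1,000, as in the notebook
    index.add_item(i, sample["audio_embedding"])

index.build(NUM_TREES)        # the heavy lifting: build 50 randomly-split trees
index.save("music.ann")       # hypothetical filename

# Query by an item already in the index (item 193 was the opera clip mentioned above) ...
neighbor_ids = index.get_nns_by_item(193, 10)

# ... or by a raw vector, e.g. an embedding you just generated from a new WAV file:
# neighbor_ids = index.get_nns_by_vector(new_embedding, 10)

for n in neighbor_ids:
    print(n, music_set[n]["labels"], music_set[n]["video_id"],
          music_set[n]["start_time"], music_set[n]["end_time"])
```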
Now let me explain each part of these printed entries — each of the 10 nearest neighbors has four things in it. This is the index within Annoy internally. This is the set of tags given to that piece of music. This is the YouTube video ID, which you can use to look the video up — I'll show you in a second. And these are the starting and ending timestamps, in seconds. So let's look at this first one: xtgmrlo-something, whatever that ID is. If you go to that YouTube video — I've pasted it right here — the clip is from 30 seconds to 40 seconds. So let's play a bit of that. I'm not playing more because of copyright, but you heard it: pretty good opera music. So that's the input, and it's also listed as the first "closest neighbor". The real closest neighbor, though, is this song — this video, the one with the ID starting with K, at 80 to 90 seconds. I think that's this video, yeah, right here, and 80 seconds in is basically one minute twenty. Let's play that. Pretty good, right? They really are quite close to each other. And you don't even need to recycle whatever is already in my index to find nearest neighbors. In actual usage, you can take your own raw WAV file and encode it — not with the audio set processing notebook, but with the demo notebook — just by calling that curl command on your WAV file, kind of like this, and you'll get the embedding back. Then you can use that embedding in this Annoy notebook and feed it in right here: here I'm getting nearest neighbors by item, but you can also get nearest neighbors by vector. You'd pass in the vector you just created and get back as many nearest neighbors as you want from the embedding set — obviously I can only get up to a thousand, because there are only a thousand in the index. But yeah, that's pretty much it. And if you understood this, well, that's all of it: three notebooks, three resources. I'm going to put all of this in the description below. Everything will be up on GitHub, with the link in the description, along with the class-label CSV and, if I can get it uploaded, the dataset — I'm sure a file of 12-odd megabytes is not too bad. I hope this was helpful, I hope you all learned something, and I hope the explanations weren't too all over the place. I know the priority-queue implementation is a little hazy, but other than that, you can read the Annoy GitHub for more information. Please do subscribe, write some comments, and start a discussion in the comments down below. Oh God, long videos — what am I going to do? Okay, well, I'm going to see you later. Do what you've got to do. Bye.