Okay, the last talk of the day, always a challenge. So we're going to try and keep it fun and light. We're going to talk about music discovery and language models. Neither of us works at Spotify or SoundCloud or any of these companies; we just wanted to build something that's fun. I'm Nischal. I'm currently the VP of Data and ML at Scoutbee, based out of Berlin, building large language models, knowledge graphs and a bit of MLOps. And Raghotham.

Hi everyone, I'm Raghotham. I'm from Bangalore. I work for PayPal, and my regular day's work looks like playing around with LLMs, document AI and content moderation models, to name a few. Happy to be here.

Yeah, fun stuff. So what's the problem, you ask? About a year ago, Raghotham, one of our friends and I started talking about YouTube and Spotify recommendations. We sort of started having wars: Spotify's recommendations are better, no, YouTube's are better. Then we realized that while they're good, there are still some challenges. Especially if you're based in a city with a lot of music and performers, finding an artist you like is very hard. Even if you go on Resident Advisor, Eventbrite, wherever, it's impossible to find them unless they're famous. And if you filter by genres, just to give you an idea of what 2022 looked like in terms of genres, these are all the genres you see. Imagine you like house and you just search for house: there are about 98 variants of house, and it's almost impossible to understand which one you actually like. So that was one.

Second, we're Pink Floyd fans, and we're always looking for artists who perform and play riffs like David Gilmour on Comfortably Numb. We couldn't find a place where we could take just this slice of a track, put it somewhere, and find other artists whose tracks have the same kind of samples in them. We couldn't find anything for that either.

And the third thing: as music producers. We've been talking about this a lot, even at this conference. With large language models becoming a bigger and bigger reality, everyone's asking, are they going to replace people? Will there be a loss of jobs? But given that both of us have been working with language models for quite a long time, our mindset is: can we build things that actually help people do their work, rather than worrying about what this will replace? That's not going to happen. It's more about thinking: as a music producer, I want to build this DJ set, I have all these tracks I want to play. Now, how do I combine them? How do I build bridges between them? How do I select a sequence of songs so that I can have an awesome set? That saves a lot of grunt work for the producer.

Now, what does the current landscape look like? Just to give you a quick statistic, about 70% of the music streaming market share comes from these platforms, with Spotify being the biggest of them, then YouTube Music and Apple Music, and then SoundCloud. The challenge most of these platforms have is that they're built to operate at scale. You're talking about billions and billions of streams, and you always have the challenge of explore versus exploit. This is a normal challenge in building any sort of recommendation system: how do you let your users explore new content?
And at the same time, the recommendations that surface come almost entirely from collaborative filtering, which is essentially based on who else is listening to the same tracks. What happens with that is that popular artists become more and more popular, and for a lot of new and upcoming artists it becomes very hard to get found. Unless Spotify brings it to your playlist, the chance that you've heard of a new music producer is almost zero. We also see that on Spotify, YouTube Music and a bunch of these other streaming services, memorization plays a huge role. As humans we like this too, because you keep listening to songs that you like; when you're biking or in your car, you want to listen to songs you know. So a lot of the time, songs that you've already heard end up making up your playlist. Even if you start from a song X that you didn't know, the chance that you'll hear songs you've already heard gets higher and higher as the tracks proceed.

Now, how did we approach this? Conditions apply: it's not like we've completely solved it, but we have an approach that we wanted to share, and Raghotham is going to run you through that.

So we saw the problems. Across different kinds of users, be it music producers or consumers of music, if you want to find a similar sample of music, it's hard. So we went down a deep learning route. Our goal was to split a given target song into 10-second samples; you might have a minute-long song and you split it into 10-second samples. Now, given one of these swatches of audio, can I find another one that is very similar in terms of the BPM, the kind of instruments used, the mood of the song? That's the goal. If you look at it, it's very similar to what Shazam does today, except Shazam is trying to find the exact same song, whereas for us the end goal is slightly different: we want to find near-similar songs so that you can explore similar music.

So we created our own custom dataset where we curated about 600 songs. All of these songs have no lyrics, to make the task easier. We split them into 10-second samples and sampled them at 16,000 Hz and 48,000 Hz, the reason being that the current models expect data at either a 16,000 or a 48,000 Hz sampling rate. Quick trivia about sampling rate: what does this mean? Your audio is a signal, and to convert it into a digital format you have to sample it at certain points in time. The more samples you take per second, the higher the quality of your digital signal, and the better the chances of getting a good representation and finding a good match. A parallel you can draw is with images, where you have megapixels and resolution; it's something very similar.

Now, we started digging into the currently existing models that we could leverage. Today, transformers are everywhere, so basically all we needed was transformers. We did a quick scan, did our research, and found three candidate models.
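Before getting into those three models, here is a minimal sketch of the splitting and resampling step just described, using librosa and soundfile. The paths, file layout and the split_song helper are our own assumptions for illustration, not the project's exact pipeline.

```python
# Minimal sketch (not the exact pipeline from the talk): split lyrics-free songs
# into consecutive 10-second chunks at a fixed sampling rate.
from pathlib import Path

import librosa
import soundfile as sf

SAMPLE_RATE = 16_000          # or 48_000, depending on the downstream model
CHUNK_SECONDS = 10
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_song(song_path: Path, out_dir: Path) -> None:
    """Resample a song and write its consecutive 10-second samples to disk."""
    audio, _ = librosa.load(song_path, sr=SAMPLE_RATE, mono=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(audio) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        chunk = audio[i : i + CHUNK_SAMPLES]
        sf.write(out_dir / f"{song_path.stem}_{i // CHUNK_SAMPLES:04d}.wav",
                 chunk, SAMPLE_RATE)

# Hypothetical usage over a local folder of tracks:
for path in Path("songs").glob("*.mp3"):
    split_song(path, Path("samples") / path.stem)
```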
The obvious one is wav2vec 2.0. The moment you say ASR or audio understanding, people look at wav2vec 2.0, which is the state-of-the-art model, mostly SOTA for speech recognition, but not specifically for audio in general; we'll dive into that later. The second is the Audio Spectrogram Transformer (AST), and the third is CLAP. These are the three candidate models we found that are transformer-based and were easy to use with our data.

So what we essentially did, as a starting point, was generate 10-second samples across the 600 songs we have, generate embeddings for those samples, and index them in a vector store; we basically used FAISS for this. Then we took a candidate song. Let's assume this is the candidate song, let's listen to it. We generated its embedding and looked at the closest, most similar sample for this candidate across the different embeddings. So let's look at what wav2vec 2.0 returned. Was that close enough? This is with the AST model, and the last one is CLAP. While they're all similar, if you observe closely you'll see that the second two samples come from the candidate song itself, so they're very close. However, the moment you choose the wav2vec 2.0 embedding, it gives you a somewhat noisy answer.

So we did some analysis as to why this is happening, why we are not finding a closer match with wav2vec 2.0. That's when you dive into the data: you look at the spectrogram, you look at what the audio sample looks like for those 10 seconds. This is how it looks for the candidate 10-second sample, this is the spectrum analysis of the sample retrieved by wav2vec 2.0, this is the AST model, and this is CLAP. Now let's compare them visually and see if they're really the same or not. If you look at the input sample and CLAP, they're very similar, and AST is also similar, but wav2vec 2.0 is not so similar. So the probability of finding a good similar sample with wav2vec 2.0 is low.

We then started looking at why this is happening, and when we looked at how the wav2vec 2.0 model was built, we understood that the end use of wav2vec 2.0 is more speech-to-text and speech analysis rather than audio analysis, which leans more towards music. wav2vec 2.0 is trained on LibriSpeech, which is 960 hours of speech data and has nothing to do with music or general audio, so it's not suitable for our task. However, the AST model was trained on AudioSet, a dataset of about two million 10-second samples, and if you look at the classification labels on the right side, that dataset covers a very wide spectrum of sounds: human sounds, natural sounds like rivers and oceans, animal sounds, even music produced by different instruments. So it's able to capture all of them. Across the entire dataset there are 527 labels; think of it as something like ImageNet for your computer vision model, that's the parallel you can draw here. That's the reason the AST model is so powerful.
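As a quick implementation aside, here is roughly what the embed, index and retrieve step described above could look like with the Hugging Face AST model and a flat FAISS index. The checkpoint name, the mean/pooler pooling choice and the all_clips / candidate_clip variables are assumptions for illustration, not necessarily what the project used.

```python
# Sketch under stated assumptions: embed 10-second, 16 kHz clips with AST and
# index them in FAISS for nearest-neighbour retrieval.
import faiss
import numpy as np
import torch
from transformers import ASTFeatureExtractor, ASTModel

CKPT = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed checkpoint
extractor = ASTFeatureExtractor.from_pretrained(CKPT)
model = ASTModel.from_pretrained(CKPT).eval()

def embed(clip_16khz: np.ndarray) -> np.ndarray:
    """Return one L2-normalized embedding for a 10-second, 16 kHz clip."""
    inputs = extractor(clip_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        vec = model(**inputs).pooler_output[0].numpy()
    return vec / np.linalg.norm(vec)

# all_clips / candidate_clip are hypothetical arrays of 10-second audio,
# e.g. produced by the splitting sketch earlier.
embeddings = np.stack([embed(clip) for clip in all_clips]).astype("float32")
index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine here
index.add(embeddings)
scores, ids = index.search(embed(candidate_clip)[None].astype("float32"), 5)
```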
So what's actually happening with the Audio Spectrogram Transformer is that we take the 10-second sample and convert it into a spectrogram, which is basically an image. The moment you have an image, you can start leveraging an image transformer. The pre-training is basically an image transformer setup: you give it a spectrogram image, the image is divided into patches, the patches go in as tokens, and then you have a classification task. That makes it much neater for us to fine-tune and to generate embeddings for our use case.

Now let's look at CLAP. CLAP is very similar to CLIP in the image and computer vision world. CLIP is contrastive language-image pre-training; this is contrastive language-audio pre-training. The idea is that during pre-training you have a contrastive loss function, and the input is a swatch of audio together with a text description of it. Both go in as input, there is a fusion layer, and the model learns to align audio and text. At inference time you can either give it a description, say "give me a dog barking sound" or "a piano sound", or you can give it an audio clip and get a matching description back. That's what CLAP is about. It was trained on the Freesound dataset, which is around 50,000 clips, and it also has a good mix of human sounds, animals, natural sounds and musical instruments. That's why we see that both AST and CLAP, whose training data is much closer to audio and music analysis, retrieve better similarity for us.

So, in summary: we curated a dataset of 600 songs, split them into 10-second samples at different sampling rates, generated embeddings with three different models, indexed them, retrieved similar samples, and did a spectrum analysis to make sure that whatever we were getting back as similar really was similar. With this, we had found a way to measure similarity. One point to note, just to clarify: we are in no way saying that the current recommendation algorithms are inferior. But if, as a music producer, you want to find similar songs and then put more creative layers on top, you are a different target audience and you might need a different platform or recommendation system. Now that we have figured out how to find similarity, discovery is the next part of it, which Nischal will walk us through.
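Before moving on to the discovery part, here is a rough sketch of the CLAP query pattern just described: matching a text prompt against an audio clip in the shared embedding space. The laion/clap-htsat-unfused checkpoint and the clip_48khz variable are assumptions for illustration; the project may have used a different CLAP variant.

```python
# Sketch (assumed checkpoint and input variable): compare a text description
# and a 48 kHz audio clip in CLAP's shared embedding space.
import torch
from transformers import ClapModel, ClapProcessor

CKPT = "laion/clap-htsat-unfused"  # assumed checkpoint
processor = ClapProcessor.from_pretrained(CKPT)
model = ClapModel.from_pretrained(CKPT).eval()

# Text-side embedding, e.g. "a dog barking" or "a piano melody".
text_inputs = processor(text=["a piano melody"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

# Audio-side embedding; clip_48khz is a hypothetical NumPy array holding
# a 10-second sample resampled to 48 kHz.
audio_inputs = processor(audios=clip_48khz, sampling_rate=48_000, return_tensors="pt")
with torch.no_grad():
    audio_emb = model.get_audio_features(**audio_inputs)

# Cosine similarity between the description and the clip.
sim = torch.nn.functional.cosine_similarity(text_emb, audio_emb).item()
print(f"text-audio similarity: {sim:.3f}")
```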
Okay, awesome. So one thing is finding similar samples; of course we went through our whole understanding of transformer networks, what we can do, embeddings, everything. But on the other side, discovery is a whole other challenge. How do we build a user experience where people can discover music? One of the things I've had the privilege of working on over the last few years is knowledge graphs, and graphs as a means of supporting discovery itself. There have been quite a few talks covering graphs at this conference as well. The idea is that a lot of the time we think about data in a two-dimensional form, rows and columns, or in the newer forms, vector databases and key-value stores. But to support exploration, and to keep the explainability behind that exploration, graphs can be quite powerful. If you want to start with graphs and apply them to any of the domains or problems you're working on, ontologies are a way to express real-world entities in a graph world: you try to understand what your entities are and what the relationships between them are.

So we started with something very simple: we have a song, which is an audio entity, and we have an artist. A sample is also a type of audio, a song is made up of samples, and samples have similarity relationships with other audio samples. We wanted to power discovery from this, and this is what discovery starts to look like. You start with a song, and you don't have to stare at the image and try to zoom in with your phone; we'll show you a demo where you can actually navigate through the graph. You can see that 10-second samples are similar to other 10-second samples, and they all come from different artists. This was very useful for us, because if we had built something where we couldn't explain why we start from a particular track and where we end up, it would have been a real challenge to build an experience for discovery.

Now, we didn't want this to be all talk and no show; we actually wanted to show you a demo of what this looks like. Okay, I'm not sure how many of you know Ben Böhmer. If you don't, I definitely think you should check him out, he's a great music producer. Okay, so we have somebody in the audience who likes him. This is a track that we selected, and we have a bunch of samples from everywhere. So, playing this track — I'm from Berlin, of course you expect house and techno, it's not going to be anything else. This is the input track, and you see the sample and the spectrum next to it. Now, what we generated by selecting that as the input track is a similarity and traversal algorithm that picks up all the other songs that are not coming from the same sample, using two different transformer models. We didn't use wav2vec 2.0, because we realized the quality of its similarity detection was quite poor given that it was not meant for this job. You can listen to another song, coming from AST, which it thinks is the most similar, and we'll play that as well; it's coming from another Ben Böhmer song. And we have one for CLAP too, sorry.

By the way, all of this is built with Streamlit and open source libraries, and the code is available on GitHub as well, so if any of you want to play around with it and build something, everything's there. Of course we can't publish the dataset, because this is music we picked up, and we don't want to get into a legal tussle by publishing music datasets. We're two people who like to do the tech stuff; we don't want anything to do with lawyers and a fight there. But anyway, that's one part of it. We found a ton of other similar samples, and you could go on and on about this. So we said, okay, this is great: we started by identifying a small portion of a track, and from there we used this capability to find other tracks that sound similar, potentially not from the same artist, to go through the whole discovery process.

Now, none of us are going to be building our own playlists. Maybe we did it when we were using Winamp in the early 2000s, when sharing playlists was the way to get street cred, but we don't do that anymore; we just let Spotify do it for us. So we said, okay, if we want to get anywhere close to enabling that discovery, we need to build a playlist.
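To give a concrete picture of that graph, here is a minimal sketch of how songs, artists, samples and pre-computed similarity edges could be loaded into Neo4j from Python. The labels, relationship types, property names and connection details are our own assumptions for illustration, not necessarily the project's exact ontology.

```python
# Sketch (assumed schema): Artist, Song and Sample nodes plus SIMILAR_TO edges
# carrying the model name and similarity score.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_sample(tx, artist, song, sample_id):
    tx.run(
        """
        MERGE (a:Artist {name: $artist})
        MERGE (s:Song {title: $song})-[:PERFORMED_BY]->(a)
        MERGE (x:Sample {id: $sample_id})-[:PART_OF]->(s)
        """,
        artist=artist, song=song, sample_id=sample_id,
    )

def add_similarity(tx, src, dst, model_name, score):
    tx.run(
        """
        MATCH (x:Sample {id: $src}), (y:Sample {id: $dst})
        MERGE (x)-[r:SIMILAR_TO {model: $model_name}]->(y)
        SET r.score = $score
        """,
        src=src, dst=dst, model_name=model_name, score=score,
    )

with driver.session() as session:
    session.execute_write(add_sample, "Artist A", "Track 1", "track1_0001")
    session.execute_write(add_similarity, "track1_0001", "track2_0042", "AST", 0.91)
```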
So what we tried to do is start with our first sample, find the next most similar sample, use that as the new seed track, and jump through the entire graph that way. Before we show you the playlist itself, a little bit of graph traversal. Selecting the song you can see here, I can find all the other similar songs. I've of course reduced the number of similar songs and samples shown so we don't get overwhelmed, but if any of you are interested in getting a little overwhelmed, this is what it starts to look like: every sample has a lot of similarities with samples from different tracks, and because we're using different transformer models, the similarity relationship tells you which model it comes from and what the similarity score is. So instead of computing everything on the fly, we pre-computed all of the data, and instead of keeping it only in a vector database, where you can't really see the explainability part, we pushed it into the graph.

With that, we generated a playlist. The playlist starts with our seed song, which was this, and slowly moves into the next song and the next song and the next song. Once we did this, you can hear this weird switch. That switch every 10 seconds is exactly what we wanted to change, but unfortunately we didn't get the time to build our own network. So the future of what we want to do is build our own network where, once discovery is enabled, we train a generative model that, given two different samples and all the training data available from DJ sets, how they are built, the mood of the track and so on, can generate samples in between the tracks to seamlessly transition from track A to track B to track C. You can select the number of songs you want, go through the entire traversal for the songs you have, and of course we also want to get into different genres.

Yeah, so the future is wild. With everything you've heard around large language models and transformers, with Llama 2 coming out, I don't think it's something for any of us to be scared of. I think we now have immense power to do things we couldn't previously build. In the next year or two we'll see an amazing plethora of tools and products that we'll have access to. Of course it's very powerful and we have to be careful with what we do with it, but yeah, we think it's a great thing.

With that, of course, open source for the win. If we had tried to build this ten years ago, or even five, we probably would have had to write tons and tons of code. So a big, big thank you to all of the open source work that's going on, all the way from Hugging Face with the models, to NumFOCUS, which funds tons of libraries, to Meta with some of their open source projects. Yeah, it's weird to talk about Meta and open source, but they're doing a great job. And then there's Python and Neo4j and all of these communities whose work we've used in order to build this. And yeah, reach for the stars: you can build some cool things, share them in the community and speak at conferences like this. So thank you. We're open to questions.
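As a closing technical footnote before the questions, here is a rough sketch of the seed-and-hop playlist traversal described above, reusing the assumed schema from the earlier graph sketch. The query, the build_playlist helper and the property names are illustrative, not the project's actual code.

```python
# Sketch (assumed schema): hop from a seed sample to its best unvisited
# similar sample, preferring samples from songs not yet in the playlist.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

NEXT_HOP = """
MATCH (x:Sample {id: $seed})-[r:SIMILAR_TO]->(y:Sample)-[:PART_OF]->(s:Song)
WHERE r.model = $model
  AND NOT y.id IN $visited          // skip samples already in the playlist
  AND NOT s.title IN $used_songs    // prefer jumping to a different track
RETURN y.id AS sample_id, s.title AS song
ORDER BY r.score DESC
LIMIT 1
"""

def build_playlist(seed_id: str, length: int = 5, model: str = "AST") -> list[str]:
    playlist, used_songs = [seed_id], []
    with driver.session() as session:
        current = seed_id
        for _ in range(length - 1):
            rec = session.run(NEXT_HOP, seed=current, model=model,
                              visited=playlist, used_songs=used_songs).single()
            if rec is None:
                break
            current = rec["sample_id"]
            playlist.append(current)
            used_songs.append(rec["song"])
    return playlist

print(build_playlist("track1_0001", length=6))
```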
Maybe we should play the playlist. Oh yeah, sure, if anybody wants to hear the playlist. Yeah.

Thanks for the talk, very, very interesting tool and concept. I think one of the things that makes a product like Spotify so successful is that they have a huge collection of genres, music, artists and so on. So when you try to build a product like that, you might have bias, because you don't have such a big or diverse collection of genres and artists. Have you given any thought to how you're going to solve that?

Okay, so first, we're not going to bring this product out to the public for people to use; it's a side project we built. But what we do want to do is try and see, at least for our own personal use, with all the tracks that we listen to, because we also listen to quite a few genres, how this tool goes forward. And this was also a way for us, especially with the work we've planned around the generative part of generating samples between tracks in the generated playlist, to publish a paper and see if that can get picked up by somebody from SoundCloud or Spotify, so that they can help music producers build these DJ sets. Makes sense, thank you.

Thanks, great talk. And also a shout-out for featuring what is probably the greatest guitar solo four times. But my question goes in the same direction as the first one. Do you see, in the future, any legally feasible way to maybe crowdsource the database, so that I have my music locally, it goes into the model, and then it also gives me recommendations from stuff that other people have put up?

Yeah, one of the things we've been thinking about on that front is that if we were to take this into the public domain, we could have people provide their Spotify API keys and essentially still build it on top of Spotify or SoundCloud to enable this exact kind of discovery. The challenge we're trying to solve is that with anything involving recommendations or discovery, we as users have very little control. If we can crowdsource this, and there's also a huge royalty-free music database we've looked at, we want to take back a little bit of control over music discovery, because music is so personal, and you want to find new artists based on what you like, not just based on what somebody thinks you like because of something else. Thanks, yeah.

Hi, so my question is about alignment. If you have a long song, say Comfortably Numb, and you do those 10-second splits such that you're splitting right in the middle of the solo, that will obviously affect how similarity gets discovered in the other samples. How did you solve that? Do you want to take this one?

Yeah, so the point here is that in a long song, the goal might be that you like a specific 10-second piece against which you want to look at similarity. You generate the embedding for that and get the best match. But since you have 10-second samples and embeddings for your entire database, you do an M-by-N match: you search all the 10-second samples of all the songs and get the best match. It might be at the start, the middle or the end, it doesn't matter.
The question becomes: which specific 10-second sample matches the current one? So that's how we align. Did that answer the question?

Yes, and as a follow-up, in the combining of the three different songs, you actually start at the beat. So how did you align the beats, was that the same thing? How do you keep the beat going?

Currently we have just joined them together, which is what Nischal was talking about; it's an abrupt join. There are different ways to solve this, by normalizing or by doing a crossfade, which is easier. But what we want to do is have a generative model that actually generates the gap and the transition between the two pieces. That would be beautiful and amazing.

A quick question. You only showed techno. How well does it work with lyrics and things like that?

We haven't tried it with lyrics yet. This was just a fun project and we thought it would be easier to start with instrumental music and not lyrics. Maybe it's still possible; then you might find singers who get pulled up as very similar, in combination with the instruments used, but we haven't tried it yet. One of the things we did talk about is that when you do a fast Fourier transform on the incoming sample you've chosen, even if it has lyrics and other tones, you can pick and choose the signals you want to look at. The reason we didn't choose lyrics is that the way people speak and sing has a lot of impact on similarity: you can find artists who sing in a very similar fashion while the underlying beats, tones and mood of the music are very different. So the whole idea of playing around with lyrics, understanding all the constituents that actually make up a signal, that's work we still have to do; we haven't looked in that direction yet. I don't know if that answered your question; we can also follow up afterwards. But it's interesting. We also have a repository, so any more contributions are very much welcome as well.

Okay, awesome. Thank you. Thanks everyone. Thank you.