Hello, everyone, thank you for coming. My name is Tim Schmeyer, and I'm going to talk about some of the research we've been doing at iHeartRadio with vector space models, and introduce a new API that allows you to access and manipulate the vector space in a very easy and compact manner.

The outline for the talk: what is iHeartRadio, and what kind of data do we have? Then we'll introduce vector space models in general, just giving a brief overview. Then we'll introduce the API that we've developed internally to access and manipulate this vector space. Then we'll move on to the types of vector spaces that we use this API to manipulate: we'll introduce collective matrix factorization, which is a slightly different flavor than the normal collaborative filtering that most of you are probably familiar with, and then we'll talk about how we've developed an acoustic vector space that we also use. Finally, we'll draw a few generalizations and conclusions from our work with the new API and with vector spaces in general.

So, iHeartRadio, who are we? We're one of the top music streaming apps, with tens of millions of active users. In addition to our streaming app, we also own about 1,000 radio stations throughout the United States, which sets us apart from services like Spotify and Pandora. Within the app you can create a custom station from your favorite artist or from any track, and you can also listen to our terrestrial radio stations, to podcasts, or to what we call "Perfect For" stations. A Perfect For station is a curated station, something like "perfect for a summer afternoon barbecue" or "perfect for the 4th of July," so a curated list of tracks organized into a playlist.

Now, vector space models in general. What is a vector space model? It's a mapping of a number of entities into some high dimensional space, where each entity is represented by a vector, or a point, in that space, and the distance between two entities is interpreted as how similar they are. If, for example, two artists collaborate together and share a large number of users who listen to both of them, they get pushed together in this vector space. And if we have two artists like Kanye and Taylor Swift, who probably don't share much of the same audience, they get pushed apart. At iHeartRadio, the joke we had when we came up with this slide was that Kanye is in a vector space all his own.

This is what the vector space might look like in two dimensions. The red points here are users and the black points are artists, and this is the typical collaborative filtering style of matrix factorization that most streaming apps use. You can see me down here; somehow I got put by the Beach Boys and Avril Lavigne. And the guy who actually made this slide, Robbie, put himself by David Guetta, so I'm not sure how that happened.

So why do we use these vector space models, and why are they considered state of the art? They are compressed, expressive representations of some pretty complicated interactions. We have millions of users interacting with millions of tracks, but also thousands of artists, thousands of podcasts, thousands of live stations.
And we can represent all of that in a very compact space, with each vector being about 100 numbers, which is pretty powerful. Another reason people use vector space models is that the operations are simple: basically you're looking at Euclidean distance or cosine distance, and those operations scale linearly with the dimensionality of the vector space, which is obviously good for scalability. In this presentation, when we talk about scoring, we'll be using cosine distance. And finally, we've developed an elegant API that allows us to access and manipulate this space. When you have an API that makes it easy to manipulate all of these different entities in the space, prototyping and development time drop dramatically.

So I'm going to talk about how this API works, then about our collective matrix factorization and mapping all of the different entity types that we have into a single space, and then we'll show some examples of how the API enables us to rapidly prototype and create new products in a very short period of time.

First, a little bit of terminology. What is an entity? An entity is any type of user or product. In regular matrix factorization, we talk about users and items; in collective matrix factorization, we have many different types of items, so for example we have podcasts, users, tracks, artists, all these different types of entities that we want to put into the same space. You represent an entity as a simple tuple: the ID and the type of the entity you're interested in. Obviously, we can also represent an entity just with its vector.

The API exposes three functions that we've found cover every use case we would want when manipulating this vector space. scores takes a candidate list and scores it against an entity of interest. similar is a simple nearest neighbors function. And getVector lets us pull out the vector representation for an entity so that we can manipulate it and then feed it back into either scores or similar.

I'll go over each of these briefly. First, scores: if I have a candidate list of, let's say, three artists, and I want to score them against a user named Ravi, I would call the scores endpoint, feed in the entity of interest and the candidate list, and get back three scores, one for each tuple that I fed in, scored against Ravi.

similar is a simple nearest neighbor function. You pass in an entity of interest along with the entity type you're interested in. In this case, if we're trying to find three artists that I might like, we would pass in my entity tuple, the entity type "artist", and three for three nearest neighbors, and we get the corresponding nearest neighbors out. We use Annoy as the back end for the approximate nearest neighbor lookup; you can find that on GitHub, it's been open source for a couple of years now.

And then getVector just gives us the vector out so that we can manipulate it and feed it back into one of the previous two endpoints.
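To make the three calls concrete, here is a minimal Python sketch of what they might look like. The names scores, similar, and getVector come from the talk; the signatures, the in-memory dictionary of vectors, the per-type Annoy indexes, and the random placeholder vectors are assumptions for illustration only, not iHeartRadio's actual implementation.

```python
# Toy sketch of the three API calls described in the talk.
import numpy as np
from annoy import AnnoyIndex

DIM = 100                      # each entity vector is ~100 numbers
rng = np.random.default_rng(0)

vectors = {}                   # (id, entity_type) -> vector, from the factorization job
annoy_indexes = {}             # entity_type -> (AnnoyIndex(DIM, "angular"), [ids]), built offline

def get_vector(entity):
    """Return the raw vector for an (id, type) tuple so it can be manipulated.
    In production this comes from the factorization job; unknown entities get a
    random placeholder here so the later sketches run end to end."""
    if entity not in vectors:
        vectors[entity] = rng.normal(size=DIM)
    return vectors[entity]

def scores(entity, candidates):
    """Cosine-score a candidate list of (id, type) tuples against one entity."""
    v = get_vector(entity)
    out = []
    for cand in candidates:
        c = get_vector(cand)
        out.append(float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))))
    return out

def similar(entity, entity_type, n):
    """Approximate nearest neighbors of a given type, via a per-type Annoy index."""
    index, ids = annoy_indexes[entity_type]
    neighbors = index.get_nns_by_vector(np.asarray(get_vector(entity), dtype=float).tolist(), n)
    return [(ids[i], entity_type) for i in neighbors]

# e.g. scores(("ravi", "user"), [("drake", "artist"), ("adele", "artist"), ("sia", "artist")])
```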
So, what is collective matrix factorization? This is how we build our vector space. The starting point, simple matrix factorization, is that you have users and items and you factorize them into some low dimensional space. This came out of the Netflix challenge and has since made its way over to implicit matrix factorization, described in a nice paper by Hu and collaborators. It's the standard approach both for explicit feedback, like the number of stars you give a movie, and for implicit feedback, which would be something like the number of times you listened to a track without explicitly rating it.

What it allows us to do is take a huge matrix, for example our artist-user matrix, which has trillions of potential interactions and is very sparse, and factorize it down into much lower rank matrices; we factorize it into matrices of dimension 100.

We use implicit matrix factorization because we often don't get explicit feedback from our users. Unlike a movie, where you might watch it once and give it a rating of one, three, or five stars, music is consumed a little differently: oftentimes people go back and listen to the same song they like over and over again, many times, but they don't actually rate it one through five. So implicit matrix factorization just takes in the number of times you've consumed an item and factorizes that. It uses this objective function, where c is the confidence, or the weight, you assign to an observation, controlled by a hyperparameter; x transpose y is the dot product of your user vector and your item vector; and then you have a regularization term. This objective isn't convex in both sets of factors at once, so the trick is to fix one of the matrices and solve for the other, then fix that one and solve for the first. Just to illustrate: you fix the items matrix and solve for the users matrix, then you fix the users matrix and solve for the items matrix. That's called alternating least squares, and you can converge to a solution in this manner. So this is how we can build our vector space.
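To make the alternating least squares step concrete, here is a toy numpy sketch of one half-iteration of implicit-feedback ALS in the style of Hu et al.: fix the item matrix and solve each user vector in closed form. The matrix sizes, hyperparameters, and random play counts are placeholders; the production system uses Spark MLlib rather than anything like this.

```python
# Toy implicit-feedback ALS half-step (fix items, solve users).
import numpy as np

n_users, n_items, k = 1000, 500, 100     # k is the latent dimension (~100 in the talk)
alpha, lam = 40.0, 0.1                   # confidence scaling and regularization (assumed)

rng = np.random.default_rng(0)
R = rng.poisson(0.05, size=(n_users, n_items)).astype(float)   # play counts r_ui
P = (R > 0).astype(float)                # binary preference p_ui
C = 1.0 + alpha * R                      # confidence c_ui = 1 + alpha * r_ui

X = rng.normal(scale=0.01, size=(n_users, k))   # user vectors x_u
Y = rng.normal(scale=0.01, size=(n_items, k))   # item vectors y_i

def solve_users(X, Y):
    """Fix the item matrix Y and solve each user vector in closed form:
    x_u = (Y^T C_u Y + lam*I)^-1  Y^T C_u p_u"""
    YtY = Y.T @ Y
    for u in range(n_users):
        Cu = C[u]                                            # this user's confidences
        A = YtY + Y.T @ ((Cu - 1.0)[:, None] * Y) + lam * np.eye(k)
        b = Y.T @ (Cu * P[u])
        X[u] = np.linalg.solve(A, b)

solve_users(X, Y)
# ...the symmetric step then fixes X and solves for Y, and the two alternate until the
# objective  sum_ui c_ui (p_ui - x_u . y_i)^2 + lam (sum ||x_u||^2 + sum ||y_i||^2)  converges.
```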
The theme of this talk, or one of the big themes, is: why not everything? We're not only interested in users and artists, we're also interested in tracks and podcasts and live radio stations, and we want to be able to recommend all of these different types of entities to a user. This is cool because it allows you to share information across users. Say you have a group of users that listens to podcasts and artists, and another group of users that listens to live radio stations and artists. Well, you can cross-recommend those podcasts and live stations to the group that hasn't actually listened to them, because you can transfer information across users, and that's really powerful.

It also helps us address the cold start problem: when we onboard someone, we have very little knowledge about who they are, or almost none at all, and we can't really make personalized recommendations until we get some feedback. We've found that we can actually factorize demographics into this space, which helps us address the cold start problem and provide personalized recommendations almost from the moment a user starts our app. I'll show some examples of that in a second.

So how does this work? You're probably familiar with the user-item matrix; that's just traditional matrix factorization. With collective matrix factorization, you basically concatenate the user-entity matrices into one long matrix and then factorize it as usual. As we can see here, we have user-artist, user-track, user-live-station, and user-genre interaction matrices. We concatenate those into a very wide matrix, factorize it as usual, and now we have all of these different entity types mapped into the exact same space. So now we can do this cross-entity scoring and transfer information across entity types, and we can start to put in things like our radio stations or genres. We've actually factorized about 10 or 12 different entity types into this vector space, and you can put almost anything you want in there.

Now I'll give a couple of examples of the power of collective matrix factorization and of how our API lets us create very personalized recommendations from very few pieces of user feedback. Let's say we're interested in recommending some new music for Ravi. We pull two artists from his recent history, find artists similar to the ones Ravi recently listened to, score each of those artists against Ravi's entity, and take the top 10 choices to recommend to him.

Here's how we address our cold start problem. Typically, when a new user downloads our app, they're asked to register, and as part of the registration we ask for their age and gender, which we can map into our vector space. Here's an example of male versus female. As you can see in the top example, if a new female user joins our service, she gets assigned the average vector for a female listener on our service, and if a male joins, the recommendations are much different for him because he occupies a different part of the space.

Now we can take that a step further. If it's a new female user and she's a younger woman, she gets much different and more recent track recommendations than an older female new user would get, which I think makes intuitive sense. Now let's say the first thing this younger female user does is start a Florida Georgia Line station; Florida Georgia Line is a country band, in case many of you don't know. You'll start to see that our young female user gets more recent country recommendations, and at this point we've gotten one piece of explicit feedback: she started a single station.

Next, let's say this young female country listener starts an Adele station. Now our top recommendations also pull in more recent female pop artists, in addition to the country we've already noticed. (I'll take questions at the end.) One more manipulation: let's say she thumbs down a Carrie Underwood song once it's been presented to her. Now the candidate list changes again, and female country artists are no longer in the top 10 picks. We still have country, we still have female pop, but we no longer have female country artists being recommended. And so this is pretty powerful.
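Here is a hedged sketch of that walk-through on top of the toy helpers from the earlier API sketch: compose a vector for the new user from demographic averages and her explicit feedback, then rank candidates against it by cosine similarity. The specific weights (higher for explicit signals, negative for a thumbs-down) are my illustration of what the talk describes, and all entity IDs are hypothetical.

```python
# Toy cold-start personalization flow using get_vector/DIM from the earlier sketch.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compose_user_vector(feedback):
    """feedback: list of (entity, weight). Positive weights pull the composed
    vector toward an entity; negative weights (a thumbs-down) push it away."""
    v = np.zeros(DIM)
    for entity, weight in feedback:
        v += weight * np.asarray(get_vector(entity))
    return v / np.linalg.norm(v)

# New young female user who starts Florida Georgia Line and Adele stations,
# then thumbs down a Carrie Underwood track (all IDs are hypothetical).
feedback = [
    (("female", "demographic"), 1.0),
    (("18-24", "demographic"), 1.0),
    (("florida-georgia-line", "artist"), 3.0),   # explicit signals get a
    (("adele", "artist"), 3.0),                  # higher weight
    (("carrie-underwood-track", "track"), -3.0),
]
user_vec = compose_user_vector(feedback)

# Rank a (hypothetical) candidate pool and keep the top 10.
candidates = [(f"artist-{i}", "artist") for i in range(200)]
top10 = sorted(candidates, key=lambda c: cosine(user_vec, get_vector(c)), reverse=True)[:10]
```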
We now have three pieces of explicit information, and we've already been able to create a deep level of personalization for a new user with almost no information. And all of this can, in theory, be done in real time.

Now I'm going to change gears briefly and talk about another vector space that we've been working on: our acoustic modeling vector space. I'm not going to go over all of the background. It's based on a deep learning model published by a graduate student at Ghent University, who then did an internship at Spotify and wrote up some really great results, so I'll refer you to the paper rather than going over it in detail. But I will briefly introduce the concepts you need to know to follow this talk.

What this graduate student did is take the audio representation, in the form of a spectrogram, and use a deep neural network to map it into the vector space of users and tracks. Basically, he's trying to take a track and map it into the vector space we were just talking about. This is useful because new tracks face the cold start problem: when a new track comes out, it has no listens, so it can't be recommended. This research was done to address that problem, not for users, but for items.

A couple of things you'll need to know to follow the rest of this talk. Convolutional neural networks learn hierarchically, and they're typically used for image recognition problems. When I say they learn hierarchically, I mean the lower layers of these convolutional neural nets look for edges, contrast, simple things like that, and the later layers look for abstract things like cats or airplanes. The same thing happens with music: at the lowest level we're looking for specific frequencies, is this frequency resonating at this time step, and once we get up to the later levels, the fully connected layers, we're looking for things like genre. So the representation gets increasingly abstract as the network gets deeper, and eventually we end up in this latent user-item interaction space, which is very complicated and not even very interpretable.

You can also view the output of each of those layers as its own unique vector space: we're basically mapping from a signal vector space into some number of intermediate vector spaces, then into a genre vector space, and then into a user-item interaction vector space.

What I haven't done yet is tell you the purpose of this research. What are we trying to do? We want a model that will discriminate between songs that either sound similar or do not sound similar. Some of you might be asking, why don't we just use the genre layer, the genre vector space output? That's not really useful because we already have that data. We want something a little lower level that captures the perception of similarity: the songs should have a similar structure, but we don't want to just organize by genre, because that's not very useful. The way we do this is to basically chop off the convolutional neural network before it organizes itself by genre.
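A minimal Keras sketch of that "chop it off before the genre layer" idea: train a spectrogram-to-latent-factors network, then re-wire it to stop at an intermediate layer and use that layer's activations as the acoustic vector. The architecture, layer names, layer sizes, and spectrogram shape are stand-ins, not the model from the paper referenced above.

```python
# Toy stand-in for the trained spectrogram -> latent-factor network.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

spec_in = layers.Input(shape=(599, 128, 1))                  # ~30s mel-spectrogram (assumed shape)
x = layers.Conv2D(32, (4, 4), activation="relu")(spec_in)
x = layers.MaxPooling2D((2, 4))(x)
x = layers.Conv2D(64, (4, 4), activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
acoustic = layers.Dense(1024, activation="relu", name="acoustic")(x)  # pre-genre layer (1024-dim)
latent = layers.Dense(100, name="latent_factors")(acoustic)           # trained against CF vectors
full_model = tf.keras.Model(spec_in, latent)

# After training full_model to predict the collaborative-filtering vectors, re-wire it to
# stop at the intermediate layer: that output is the "acoustic vector space".
feature_extractor = tf.keras.Model(full_model.input,
                                   full_model.get_layer("acoustic").output)

spectrogram = np.random.rand(1, 599, 128, 1).astype("float32")   # placeholder audio
acoustic_vector = feature_extractor.predict(spectrogram)[0]       # shape (1024,)
```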
We've said we get these increasingly complex vector spaces as we go deeper and deeper through this neural network. Down at the bottom we have some simple signal filters; up here we have a layer where things that actually sound pretty similar sit close together but have not yet been forced into groups by genre. We call that our acoustic vector space, and we basically use the network as a feature extractor.

Here's an example of using the API with the acoustic vector space we created. If we want to know the five most similar tracks to "Boy's Gone Crazy" by the artist Was (Not Was), "Never Gonna Give You Up" shows up, and really, that was just a way for me to sneak that in.

Now you might be asking, well, this model is kind of jury-rigged, right? We didn't train it to discriminate between songs that sound alike and songs that don't; we just chopped the top layers off a model we trained for something completely different. So how does it perform? We think it performs fairly well: the songs it picks as nearest neighbors are relevant and do sound similar. Quantitatively, it also has an AUC of about 0.78. Some of you are probably thinking, how do you have an AUC? You don't really have outcomes, or at least you didn't train a model with outcomes; this was, like I said, a hacky thing. Well, you just make up some outcomes. After we trained the model, we asked, how good is this, and how can we know? So we turned our workplace into a mechanical turk and had the people we work with give us some feedback. We take the top 200 nearest neighbors for a given query track, shuffle them, randomly choose five, and then solicit feedback from people in our office. So now we do have outcomes: we know what sounds similar and what doesn't. If you do this experiment, you might end up with distributions that look like this, where the score put out by the acoustic model clearly tracks with higher rankings, that is, with tracks people perceive as more similar.

I've been calling this a hacky solution all along. Now we want to make it an unhacky solution and retune this model to actually reflect what we want it to do. This has some clear parallels to transfer learning, or to learning a distance metric. What we're doing is asking: can we find a vector of weights, beta, that we can use to drop dimensions that are not relevant to the human perception of music, and to stretch or compress the dimensions that are relevant, so that we get better discriminatory power? This is still ongoing work, but preliminary results suggest it works pretty well. Our initial vector space has dimension 1024, and we find we can compress the dimensionality down to about 250 while also improving the discriminatory power of the model. So we're remapping this space to get it to behave a little closer to human perception, rather than just signal processing trying to map into this latent user-item space.
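One hedged way to implement the "learn a weight vector beta" idea: treat each human-labeled pair of tracks as a training example whose features are the elementwise products of the two unit-normalized acoustic vectors, then fit an L1-penalized logistic regression so that the coefficients act as per-dimension weights and irrelevant dimensions are driven to zero. The choice of logistic regression and the placeholder data are assumptions, not necessarily the team's actual method.

```python
# Toy per-dimension reweighting of the acoustic space from human "sounds similar" labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Elementwise product of unit-normalized acoustic vectors, so learned
    coefficients act like per-dimension weights in a weighted cosine score."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return a * b

# X: one row per labeled pair of tracks; y: 1 if raters said "sounds similar".
# Random placeholders stand in for the 1024-dim acoustic vectors and the labels.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=1024), rng.normal(size=1024)) for _ in range(500)]
y = rng.integers(0, 2, size=500)
X = np.stack([pair_features(a, b) for a, b in pairs])

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
beta = clf.coef_[0]
kept = np.flatnonzero(np.abs(beta) > 1e-6)     # dimensions surviving the L1 penalty
print(f"kept {kept.size} of {beta.size} dimensions")

def reweighted_score(a, b):
    """Similarity in the compressed, perceptually re-tuned space."""
    return float(pair_features(a, b) @ beta)
```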
Again, the theme of this talk is that you should put as much stuff as you can into a single vector space, because that allows you to do some really powerful things. So now that we have this acoustic vector space, we want to put other stuff into it: we can put in genres, male and female vocals, instrumentation, different moods, and we can also put artists in this space.

Here's another example of using our API, scoring a mood, angry or mellow, against two tracks that have an acoustic representation. I assume most of you know, or can tell from the picture, that Pantera is a pretty angry heavy metal group, and Ben Harper is kind of a folk singer-songwriter. If we score them against our mood vectors, we see that Pantera scores much higher against angry, and Ben Harper scores much higher against the mellow mood we've embedded into the vector space.

This is a particularly cool example. I grew up in the mid-90s, and I was thinking one day, I wonder what Nirvana would sound like if Kurt Cobain wasn't in the band and instead they had a female singer. We can actually do that in this vector space: we take the vector for Nirvana, subtract the male vocal vector, add the female vocal vector, and score the result against some tracks. This, I don't know, made intuitive sense to me, but you can be the judge.

And then the final example of using our API in these vector spaces. Let's say we've embedded some genres along with tracks into this vector space, and we know a user likes jazz and they like funk. One obvious question is, would they like something in between the two? What does half jazz, half funk sound like? Can we let the user explore the space in between those two genres that they like? We can walk the manifold between those two items. What I mean by that is we take a few steps between those two points in this high dimensional space, and at each point we grab five or six or ten nearest neighbors and present them to the user. If we do that here, we can start with the funk vector and work our way towards the jazz vector. I wasn't familiar with all of these songs, but we start with funk, go through blues with horns, and end up with jazz at the bottom. So that's another cool thing we can do once we push all these different entity types into this acoustic space.
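A compact sketch of those last two manipulations, the "Nirvana minus male vocal plus female vocal" arithmetic and walking the manifold between the funk and jazz vectors, again built on the toy helpers from the first sketch. The toy Annoy index, the entity IDs, and the use of straight-line interpolation with five steps are all assumptions.

```python
# Vector arithmetic and manifold walking in the toy acoustic space.
import numpy as np
from annoy import AnnoyIndex

# Build a toy track index so the nearest-neighbor lookups below run; in production
# this index would be built from the real acoustic vectors.
track_ids = [f"track-{i}" for i in range(300)]
track_index = AnnoyIndex(DIM, "angular")
for i, tid in enumerate(track_ids):
    track_index.add_item(i, np.asarray(get_vector((tid, "track")), dtype=float).tolist())
track_index.build(10)
annoy_indexes["track"] = (track_index, track_ids)

def nearest_tracks(v, n=5):
    """Nearest tracks to an arbitrary query vector."""
    index, ids = annoy_indexes["track"]
    return [(ids[i], "track") for i in index.get_nns_by_vector(np.asarray(v, dtype=float).tolist(), n)]

# "What would Nirvana sound like with a female singer?"
nirvana = np.asarray(get_vector(("nirvana", "artist")))
male = np.asarray(get_vector(("male-vocal", "tag")))
female = np.asarray(get_vector(("female-vocal", "tag")))
print(nearest_tracks(nirvana - male + female))

# "What does half jazz, half funk sound like?" Walk the line between the two genre
# vectors and grab a few neighbors at each step.
funk = np.asarray(get_vector(("funk", "genre")))
jazz = np.asarray(get_vector(("jazz", "genre")))
for t in np.linspace(0.0, 1.0, 5):
    print(t, nearest_tracks((1.0 - t) * funk + t * jazz, n=3))
```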
To conclude: you want to put as much stuff into this vector space as you can. Anything you can think of can be useful in some way for a product, for recommendations, or for something else, so if you can put it all into a single space, you should try to do that. We use Spark MLlib's matrix factorization, and their user group has a lot of discussion about how exactly to implement collective matrix factorization in Spark MLlib, so if you're interested in trying this, you can refer to the user group; there's a lot of good material there. The other point is that it's just as important to have a flexible API that lets you go in, pull out, and score different entity types very rapidly, because that's what allows you to prototype and experiment quickly. And with that, I'll conclude. I saw some questions, so I'm happy to take a few. Go ahead.

Thanks so much for the talk, it was really interesting. It seems to me one natural conclusion from this is that we don't necessarily have to focus on music; this could be used across many different data types. I know you're obviously a radio and music company, but are there any other use cases you've tried, or places of inquiry you think would be interesting to try next?

That's a good question. The data science team at our company is actually fairly new; I was the number two person there, and I've been there for about a year, so this is about as much as we've been able to do in that time. And I'm not sure there are many data types left that we haven't embedded in one space or another, so this may be it for us. You're right, though, this is definitely item agnostic, so you could use it at Netflix or any other company. I'm sure Amazon is doing exactly this sort of thing too. Yep.

Yeah, so I think the question here is how we handle updating our vectors when there's new user interaction with an item. We run a batch process every night, and that creates a new vector mapping. We're experimenting with updating it in real time, although we can't say yet whether that's better or worse. Sorry, can you speak up? So the thing we're experimenting with now is updating all the vectors in real time: if a user thumbs up a track, we move their vector slightly towards that track, and if they thumb down a track, we move it slightly away. I'm not sure we could do real time A/B testing, but we could do batch A/B testing, where we have one set of users on the streaming-update model and one on the batch model, and then we test whatever metric we want; we usually test total listening hours.

Questions, other questions? How do we handle that? We don't; we let the vector space handle it, and it works pretty well. That's one of the benefits of matrix factorization: you capture a lot of those complex interactions more or less automatically, so we don't see that being a huge problem. One thing I can say is that one way to deal with it is to increase the dimensionality of the factored latent space. If you're not capturing the user-item interaction to the degree you'd like, you can try expanding from, let's say, 40 dimensions to 100 dimensions. Yep.

Hi, thanks for the talk. Could you describe again briefly how some of the feature embedding works? Like for male vocals, are you just taking the songs you know have male vocals and averaging over the vectors for all of those, or is something else going on, like including that as a bit before you do your matrix factorization?

Yeah, that's a good question. Once we have our acoustic space, that's exactly what I'm doing: I take a linear combination of tracks with male vocals and average them, and then I try to time- and genre-match those with the female vocal tracks. So, for example, if I use Nirvana, I would try to time- and genre-match that with Hole, and I do that across a ton of different genres and a ton of different songs to create the male and female representations. The hope is that I wash away all of the other acoustic information and I'm left with just male and just female.
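A hedged sketch of the two ideas from this Q&A: building a male or female vocal tag vector by averaging the acoustic vectors of era- and genre-matched tracks, and the streaming update that nudges a user's vector toward a thumbed-up track or away from a thumbed-down one. The track lists, the step size, and the exact update rule are placeholders, not the production logic.

```python
# Toy tag-vector construction and streaming vector updates, using the earlier helpers.
import numpy as np

def tag_vector(track_entities):
    """Unit-normalized mean of the acoustic vectors of a list of (id, 'track') tuples."""
    vecs = np.stack([np.asarray(get_vector(t)) for t in track_entities])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Era- and genre-matched pairs, e.g. Nirvana tracks matched with Hole tracks,
# repeated across many genres so everything but the vocal washes out.
male_vocal = tag_vector([("nirvana-track-1", "track"), ("nirvana-track-2", "track")])
female_vocal = tag_vector([("hole-track-1", "track"), ("hole-track-2", "track")])

def nudge(user_entity, track_entity, step=0.05, toward=True):
    """Streaming-update idea from the Q&A: move the user's vector slightly toward a
    thumbed-up track, or away from a thumbed-down one. The step size is assumed."""
    u = np.asarray(get_vector(user_entity), dtype=float)
    t = np.asarray(get_vector(track_entity), dtype=float)
    vectors[user_entity] = u + step * ((t - u) if toward else (u - t))

nudge(("ravi", "user"), ("some-track", "track"), toward=False)   # a thumbs-down
```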
Yep. In your recommendations, how do you decide how much weight to give to the different factors? Oh, okay, another good question. Let's go back to that. Are you talking about this slide here? That's a good question. We typically weight explicit feedback higher than implicit feedback. So for Florida Georgia Line and Adele, and for subtracting this Carrie Underwood vector, we would use a weight of something like three or five, whereas the female and young vectors aren't really personalized signals at that point, so we give them a much lower weight than these more explicit signals from the user. And do you have a way of measuring how good that weighting is? A/B testing is always the answer, so yes, it's empirical.

I think I saw a hand over here? So, remember you gave this example where I listened to some country song by a female vocalist, you moved me in the vector space, then I gave a thumbs down and you moved me away. The fundamental problem with that is that I probably still like country songs by female vocalists, I just didn't like this particular one. How can you address that with a vector space model? Yeah, that is an inherent drawback, but you would hope that the user keeps giving you feedback in the future that keeps adjusting where their position is in the space. You would also hope that the cluster of tracks this particular user doesn't like sits in a different part of the space than the female country songs she does like, so you slowly move in the right direction with more and more feedback.

Everything you've described in terms of the recommender seems to be optimizing for similarity, but there are other factors that surely play a role in the recommendations as well, like the overall popularity of a track. How do you weigh those factors when you decide on the recommendation, or on what you show the user? Yeah, totally. It depends on the product, but one thing we find is that the length of a vector correlates with item popularity, so if you normalize all vectors to length one, you reduce that bias that's built into the vector space model. Sometimes that's not what we want, though; sometimes we want to bubble the most popular tracks to the top at all times. So it really depends on the product how we handle normalizing that data.
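A tiny illustration of the popularity point in that answer: raw dot products favor longer (more popular) vectors, while normalizing to unit length removes that bias, and which behavior you want depends on the product. This helper is illustrative, not the production scoring code.

```python
# Popularity-aware vs. popularity-free scoring of an item against a user vector.
import numpy as np

def score(user_vec, item_vec, remove_popularity_bias=True):
    u = np.asarray(user_vec, dtype=float)
    v = np.asarray(item_vec, dtype=float)
    if remove_popularity_bias:
        u = u / np.linalg.norm(u)      # unit length: pure direction, popularity washed out
        v = v / np.linalg.norm(v)
    return float(np.dot(u, v))         # otherwise the item's vector length still counts
```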
One more. Yeah, so how many different techniques did you try? There are a whole bunch of ways to do this vector embedding; maybe you could have done deep learning for the first part. How did you decide on the method you used, and are there others you're considering or that you had problems with? Sure, how did we decide? I'll address that in two parts. Like I said, our team is very young, so our first attempt at creating personalized recommendations was obviously matrix factorization, because that's pretty much the standard, state-of-the-art method at this point. Then I came on board and did the acoustic work, because I wanted to use data that we already have but weren't using. Like I said before, matrix factorization is content agnostic, it doesn't matter what the items are, and I wanted to take into account some part of what the user actually consumes. As far as which models we used, there was already some precedent for this sort of work, so I wanted to build on it but take it in a new direction. I'm also experimenting with other deep learning models that do the same sort of thing, and what we're hoping to do is compare the vector spaces, rank them to see which ones are better, or start to mix and match them in different products. Some of the other things I'm thinking about are autoencoders, which I think Pandora may or may not have experimented with, and generative adversarial networks, which are new and hold some promise for building a cool vector space for this sort of thing too. So it's not only this model; I'm experimenting with all of them.

Great, any others? I ended kind of quickly, so I'll take a couple more, I guess. Yep. Yeah, so the question is whether we look at metrics other than total listening hours. The other big metric we look at is user retention, and I'd say those are the two most important for the decision makers way above us. As far as whether changes have produced big improvements or not, a lot of the time it's very small incremental increases; we don't expect to increase a metric by 20%, you know.

Great, if there are no more questions, thanks for listening.