A few days ago, I was watching a video on the nature of YouTube's recommendation system. I'm talking about YouTube's AI that recommends videos to you. I'll play a clip of this video right now. A couple of months ago, I made a Twitter thread about some weird activity I saw online, and after I posted that thread, tons of engineers from many different tech companies reached out to me privately to tell me their stories. My interest in all this started one day when I was scrolling on YouTube and the algorithm served up a pretty weird video for me to watch. You know how the algorithm works, right? It looks at your past activity and tries to figure out what you could watch in the future that would keep you on the platform the longest. It optimizes watch time. The video is an excerpt from the channel Smarter Every Day, and in it, he goes on to explain how people are tailoring their videos to appeal to the algorithm. So even if a video is of low quality, it is still somehow picked up by the algorithm and recommended to a bunch of viewers. He explains this as an attack on the YouTube algorithm, and I highly recommend you watch his video for the full scoop. But for this video, I thought I would look at recommender systems from a more technical perspective, that of a deep learning engineer. I'm not sure how much of this you know as viewers, but in the creator community, there is a lot of talk about how YouTube tends to recommend videos with higher watch time. And watch time is essentially just the length or duration for which a video is watched. The idea is that the longer a video is watched, the higher YouTube will push it up in the rankings, and so more people will be able to see it. That's great, but is the algorithm really that simple? Does it only value watch time when recommending us videos? In this video, we're going to take a look at just that. Techie or non-techie?
You're gonna understand this video through and through, even if you don't have any knowledge of recommender systems. So if you want to learn more, stay tuned. So we're going to talk about YouTube's recommendation system, and more specifically, deep learning and how neural networks fit in. Let's take a concrete yet simple example and see exactly how this works under the hood. Say you're Susan, and you own this platform called YouTube, where users can watch videos. Let's also assume we only have five videos on the platform, videos one through five. Every online platform has users, and let's say that we have five of them too, users A through E. Now we have this little matrix structure, where each cell represents whether a user likes a particular video or not. Note that a matrix is just a 2D grid of numbers. In this context, each cell can take three values: one if the user liked the video, negative one if the user disliked the video, and zero if the user hasn't watched the video and so hasn't rated it. Now these five users have watched and rated whatever videos they've seen. We show the results here in this matrix. We're keeping things simple by stating that if a person has watched a video, they must either like or dislike it. They can't leave the video unrated. A property of this matrix is sparsity. Not every person has watched every video on YouTube, even in this mini matrix example. So there are bound to be some empty values, and a lot of them, which I've labeled as zero here. Now that we have our data, let's make some predictions. Let us predict whether user D will like or dislike video three. Now how do we do this? Traditionally, it's done through a technique called collaborative filtering. The intuition behind collaborative filtering is collaboration: we use the ratings of existing users to make predictions for newer users. The same can be done for items. We can use the ratings on existing items to make predictions for newer items.
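To make the setup concrete, here's a quick sketch of that user-video matrix with NumPy. The ratings themselves are made up for illustration (the video's exact numbers aren't in the transcript); what's kept from the example is the 5×5 shape, the 1/-1/0 encoding, and the fact that only users C and E have rated video three:

```python
import numpy as np

# Toy user-video rating matrix: rows are users A-E, columns are videos 1-5.
# 1 = liked, -1 = disliked, 0 = not watched / unrated. Values are illustrative.
R = np.array([
    [ 1, -1,  0,  1,  0],  # user A
    [ 0,  1,  0, -1, -1],  # user B
    [-1,  1, -1, -1,  0],  # user C
    [ 1,  1,  0, -1, -1],  # user D (hasn't rated video 3)
    [ 1,  1, -1,  0,  1],  # user E
])

# Sparsity: the fraction of unrated (zero) cells.
sparsity = float(np.mean(R == 0))
print(f"sparsity = {sparsity:.2f}")   # sparsity = 0.28
```

Even this tiny example is more than a quarter empty; on the real platform, with millions of users and videos, the matrix is overwhelmingly zeros.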
When I say items, in this case items are the same as videos on the platform, so I'll use the terms interchangeably. Depending on whether we've used items or users, we have user-user collaborative filtering or item-item collaborative filtering. I'll talk about both with a simple example. Now, user-user collaborative filtering is a two-step process. First, determine how similar other users are to this user, D. And then we use these users to predict whether D will like or dislike video three. The intuition here is that similar users will have similar interests. So if there are users who have the same tastes as user D and like video three, then chances are that user D will also like video three. We're just making this prediction with numbers. That's the only difference. For the first step, every user can be represented by a row of values. This row of values is called a vector. And we use the cosine similarity formula to compute the similarity between two vectors. This is just one way similarities are computed mathematically; there are other ways to compute similarity too. The similarity score is just cos(theta), so it takes values in the range between negative one and positive one. Now, why does cosine similarity work? Well, it's a direct measure of how close two vectors are to each other. If user A and user B are similar to each other, then the angle between them will be small. So if theta is zero, the users have identical tastes. And as the angle increases, the two users' tastes spread further apart. Completely opposite vectors mean that user A and user B have opposite tastes. So if user A likes video three, chances are that user B won't like it. But while computing similarity between users, we cannot just take the direct cosine similarity between the videos they like or dislike. Some users may be more critical than others, so they tend to give more negative reviews.
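The cosine similarity described above can be written in a few lines. This is a generic sketch, not code from the video; the sample vector is arbitrary:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|), which ranges from -1 to +1.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, -1.0, 0.0, 1.0, 0.0])   # one user's rating vector
print(cosine_similarity(a, a))    # identical tastes -> 1.0
print(cosine_similarity(a, -a))   # opposite tastes  -> -1.0
```

Identical vectors give cos(0) = 1, and exactly opposite vectors give cos(180°) = -1, matching the intuition about the angle between two users' tastes.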
Now, to account for this difference in grading, we have to normalize, or center, the ratings. This is done by computing the mean for every single user and then subtracting it from their corresponding ratings. This brings the average rating for every single user to zero. From this example, you can see that user B and user E tend to be more lenient and give more positive ratings. So when user E says that he dislikes video three, this dislike is weighted more heavily than his liking for the other videos. In a similar way, users C and D tend to be kind of strict in their ratings. So when user C says they really like video two, they really like video two, and this like is weighted more heavily than their dislikes for the other videos. I hope that makes sense. Now, our goal is to determine whether user D will like video three. This way we can decide whether we should recommend the video to them or not. We've already normalized the user vectors. The next step is computing the similarities between these vectors and user D. And this is done with the cosine similarity that I mentioned before. Remember, computing similarities is just to determine which users are more in line with user D, so that we can make recommendations based on those users. Looks like users A and D tend to have opposite tastes, and this is because user A really disliked video two while user D seemed to really enjoy it. As for user B, we can see that he agrees with D when it comes to videos four and five. Users C and D agree in their dislike of video four. And then we have user E, whose similarity with user D can't really be attributed to a single video review, at least in my opinion. But I just calculated it like I did for the others, and we actually see some agreement between users D and E. Next, we make a prediction. Consider the users who rated video three. In this case, it's users C and E. Now we take the weighted average as the predicted rating. We get a value below zero.
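The mean-centering step described above might look like this in NumPy. One assumption in this sketch: the mean is computed over rated entries only, and unrated cells (zeros) are left untouched, which is the usual convention:

```python
import numpy as np

def center_ratings(R):
    # Subtract each user's mean rating (over rated entries only) from
    # their rated entries; unrated cells stay at zero.
    Rc = R.astype(float).copy()
    for i in range(R.shape[0]):
        rated = R[i] != 0
        if rated.any():
            Rc[i, rated] -= R[i, rated].mean()
    return Rc

R = np.array([[1, -1, 0, 1,  0],    # a stricter user
              [0,  1, 1, 1, -1]])   # a more lenient user
Rc = center_ratings(R)
# After centering, each user's average over their rated videos is ~0,
# so a lenient user's rare dislike now carries extra negative weight.
```

After centering, the second (lenient) user's single dislike becomes -1.5 while each like is only +0.5 — exactly the "a dislike from a lenient user counts more" effect described above.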
Remember, the average rating for every user is now zero. So chances are that user D may not like video three, and we don't need to recommend the video to them. What about video one, though? Well, we do the same thing. We make the prediction based on all the other users, which is four of them here, and we take the weighted average rating as the prediction. Once again, we get a value which is less than zero, so we don't recommend this video to user D either. Remember, this type of collaborative filtering is called user-user collaborative filtering. It's called that because we are comparing different users' interests in video three to predict a new user's interest in video three. It's really easy to implement, right? But there's a problem with this. Say that user D was a business magnate and he spread the word about this online platform called YouTube that's just amazing. Overnight, the number of users increases from five to 500, and the next night to 5,000. And then by the end of the month, I'm dealing with a million new users. This is amazing, but the big question now is, how does my algorithm deal with this sudden burst? Well, let's see. When the algorithm wants to see if a user likes a particular video, we have two steps to follow. The first is to determine how similar other users are to this particular user. The second step is to use these users to predict whether the current user will like or dislike the video. The time to do the first step is heavily dependent on the number of users. Because of this, an explosion in users will lead to a significant delay in the algorithm. So for user-based collaborative filtering, the pro is that, well, it's simple to implement, whereas its major con is that it doesn't scale well. However, there is another type of collaborative filtering technique called item-based collaborative filtering.
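Putting the two steps together, here's a rough end-to-end sketch of user-user collaborative filtering. The ratings matrix is invented for illustration, and the weighted average divides by the sum of absolute similarities, which is one common convention among several:

```python
import numpy as np

def center(R):
    # Subtract each user's mean over rated entries; zeros stay zero.
    Rc = R.astype(float).copy()
    for i in range(R.shape[0]):
        rated = R[i] != 0
        Rc[i, rated] -= R[i, rated].mean()
    return Rc

def cos_sim(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / n) if n else 0.0

def predict_user_user(R, user, item):
    # Step 1: centered cosine similarity between this user and every other.
    # Step 2: similarity-weighted average of the centered ratings given to
    # the item, using only users who actually rated it.
    Rc = center(R)
    num = den = 0.0
    for other in range(R.shape[0]):
        if other != user and R[other, item] != 0:
            s = cos_sim(Rc[user], Rc[other])
            num += s * Rc[other, item]
            den += abs(s)
    return num / den if den else 0.0

# Made-up ratings for users A-E (rows) and videos 1-5 (columns).
R = np.array([[ 1, -1,  0,  1,  0],
              [ 0,  1,  0, -1, -1],
              [-1,  1, -1, -1,  0],
              [ 1,  1,  0, -1, -1],
              [ 1,  1, -1,  0,  1]])

score = predict_user_user(R, user=3, item=2)   # user D, video 3
print("recommend" if score > 0 else "don't recommend")   # don't recommend
```

With these made-up numbers, the only raters of video three (C and E) both land below zero after centering, so the weighted average comes out negative and video three is not recommended to user D — the same conclusion as in the walkthrough above.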
It's quite easy to understand the mechanics because it's very similar to user-based collaborative filtering. For the sake of comparison, we'll try to answer the same question as before. Let's predict whether user D will like or dislike video three. The first step is the same as with user-based collaborative filtering: compute the centered cosine similarities. But this time, it's between items and not between the users themselves. So in this case, it's between video three and the other videos. Next, we compute the weighted average rating. This is done by considering all items rated by user D. And like before, we get a value that's less than zero. So we don't recommend video three to user D. Notice that we actually get similar results to the previous approach of user-based collaborative filtering, and this should usually be the case. Now, why does item-based collaborative filtering usually perform better than user-based collaborative filtering? Well, here are a few reasons. Items are much easier to categorize than users. A user may be into science and technology, entertainment videos, educational videos, or even comedy videos, while a single video doesn't belong to all of these genres. So comparing items makes more sense than comparing something as complex as people. Another reason why item-based collaborative filtering may be better is simply that the number of videos, or items, on the platform doesn't increase as fast as the number of new users. So computing item-based similarities is less computation-heavy. But even item-item collaborative filtering still has a major problem: sparsity. The matrix of users versus items has a ton of zeros. Why is this, though? It's because users watch only a fraction of the videos, and videos are only watched by a fraction of the users. Now let's take a look at this problem more visually. Our platform called YouTube started with five users, A through E, and five videos, one through five.
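The item-item variant only changes which vectors get compared: columns of the (row-centered) matrix instead of rows. A sketch on the same made-up ratings, using the "adjusted cosine" convention of centering per user before comparing item columns:

```python
import numpy as np

def predict_item_item(R, user, item):
    # Center per user (row-wise) over rated entries, as before.
    Rc = R.astype(float).copy()
    for i in range(R.shape[0]):
        rated = R[i] != 0
        Rc[i, rated] -= R[i, rated].mean()

    def cos_sim(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(np.dot(u, v) / n) if n else 0.0

    # Weighted average over the items this user rated, weighted by the
    # similarity between each item's column and the target item's column.
    num = den = 0.0
    for other in range(R.shape[1]):
        if other != item and R[user, other] != 0:
            s = cos_sim(Rc[:, item], Rc[:, other])
            num += s * Rc[user, other]
            den += abs(s)
    return num / den if den else 0.0

# Same made-up ratings as before: users A-E (rows), videos 1-5 (columns).
R = np.array([[ 1, -1,  0,  1,  0],
              [ 0,  1,  0, -1, -1],
              [-1,  1, -1, -1,  0],
              [ 1,  1,  0, -1, -1],
              [ 1,  1, -1,  0,  1]])

score = predict_item_item(R, user=3, item=2)   # user D, video 3
print("recommend" if score > 0 else "don't recommend")   # don't recommend
```

On this toy matrix the item-based prediction is also negative, mirroring the transcript's observation that the two approaches usually agree.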
In user-user collaborative filtering, each user is represented by a five-dimensional vector, and each dimension of the vector is a measure of how much they liked or disliked each video. Geometrically, we can plot these users as points; we'll call this space one. Now, with item-item collaborative filtering, each item, or video in this case, is represented by a five-dimensional vector. Each dimension of this vector is a measure of how much each user liked or disliked the particular video. Here, each point is a video; we'll call this space two. We can use either one of these spaces to try recommending videos to users using one of the collaborative filtering techniques that we discussed. The problem here is that most of these points have no information on most of the axes. So it should be possible to project them onto a much smaller space without much loss of information, right? In simpler terms, we can reduce the five-dimensional user vectors in space one, or the five-dimensional item or video vectors in space two, to, say, two-dimensional vectors in another space, like space three. By reducing dimensions, we reduce the computation required. This is kind of like PCA's dimensionality reduction, if you're familiar with machine learning. You don't have to be, but I'm just saying. This reduction in dimensions is done by a technique called matrix factorization. I won't get into the nitty-gritty, but I will explain at a high level so that you understand what it is and why we use it. The user-video matrix has tons of zeros; we established this. This is especially true with a really high number of users or videos or both. By projecting it onto a smaller space, we increase the computational efficiency. Also, both users and videos are projected onto the same space, so we can compare them directly.
The result is that videos closer to certain users are likely to be recommended to them, and we no longer need to depend on other videos to rate a video, or on other users to rate a user. This technique is the essence of matrix factorization. Matrix factorization uses something called SVD, singular value decomposition, which breaks a matrix down into a product of three smaller matrices without loss of information. The first matrix is a set of user vectors in the new space. The third matrix is a set of video vectors in the new space. And the second matrix represents the strength of each dimension in this new space; it's a diagonal matrix, so only the diagonal elements are non-zero. Now, some of you nerds out there just might be curious: how does it find these three matrices? It's by optimizing a reconstruction loss between the original user-item matrix and the product of these three matrices. So the matrix factorization technique's major pro is that it overcomes the scalability issue that collaborative filtering has. We no longer care about the number of users and items on the platform just to rate a particular user's interest in a particular item. So yeah, it's much more computationally efficient. But now we have another problem: interpretability. Using this technique, we will be able to recommend videos to individuals, but unlike collaborative filtering techniques, the dimensions we are projecting to aren't well-defined. Remember, they are mathematically determined while solving the optimization problem, and they could be literally anything, from genre to video length to something we may not even be able to comprehend as humans. In other words, matrix factorization can recommend videos, but it cannot tell us why it recommends them.
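Here's what the three-matrix decomposition looks like with NumPy's SVD, on the same made-up ratings matrix. U holds the user vectors, the singular values s form the diagonal "strength" matrix, and Vt holds the video vectors; keeping only the top two dimensions projects users and videos into the same 2-D latent space:

```python
import numpy as np

# Same illustrative ratings matrix as before (users A-E by videos 1-5).
R = np.array([[ 1, -1,  0,  1,  0],
              [ 0,  1,  0, -1, -1],
              [-1,  1, -1, -1,  0],
              [ 1,  1,  0, -1, -1],
              [ 1,  1, -1,  0,  1]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
# With all singular values kept, the product reconstructs R exactly.
assert np.allclose(U @ np.diag(s) @ Vt, R)

# Truncate to the top k=2 dimensions: users and videos now live in the
# same 2-D space and can be compared directly.
k = 2
users_2d  = U[:, :k] * s[:k]         # rows: users A-E as 2-D points
videos_2d = Vt[:k, :].T              # rows: videos 1-5 as 2-D points
R_approx  = users_2d @ videos_2d.T   # low-rank approximation of R
```

The truncation is where the compression (and the information loss) happens: `R_approx` is the best rank-2 approximation of `R`, and the dot product between a user's 2-D point and a video's 2-D point serves as the predicted affinity. Note the 2-D axes have no human-readable meaning, which is exactly the interpretability problem described above.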
This is vital information to drive new business decisions, and without it, we wouldn't know what improvements could be made to increase, for example, revenue on our platform. So yeah, that's actually a major con. Okay, until now, we've developed some recommender systems, so let's briefly recap each. First off, collaborative filtering. In user-user collaborative filtering, the technique recommends videos to a user based on similar users, whereas in item-item collaborative filtering, we recommend videos based on similar videos that a user liked. The advantage of collaborative filtering is that it's pretty simple to implement, whereas the con is, well, you know YouTube. We get thousands of new users and videos every day, so neither user-based nor item-based collaborative filtering scales well. To address this, enter matrix factorization. It projects users and items into the same space, so they can be directly compared. The pro of this is that predictions are independent of the number of users and videos on the platform, so it scales way better than collaborative filtering. The con, however, is that users and items are projected into a latent space, an unknown space. So essentially, users are recommended videos, but we have no idea why they are recommended these videos. On top of all this, we made a major assumption in the beginning: if a user watches a video, they have to like or dislike it. But in the real world, on YouTube, this is not the case. Users may comment, rate, subscribe, or just watch the video and do nothing. In fact, most users do nothing, so we need a mechanism that factors in implicit feedback too. I'll talk more about implicit feedback shortly, but let's get back to the main question: how does YouTube recommend videos? What you're looking at right now is the main architecture that Google published back in 2016 describing their model architecture for how YouTube actually recommends videos. And it's a two-stage process.
The first is candidate generation, and the second is ranking. Candidate generation takes the millions of videos on YouTube and filters out the potential videos that a user may like. And the ranking part takes these hundreds of videos that were chosen in the first phase and sorts them in order of relevance to the user. These videos are then shown to the user. Let's now dive deeper into each of these processes. Until now, we've only considered a physical click of the like or dislike button as a factor to recommend videos to users. But like we mentioned before, most users don't provide such explicit feedback. So it makes sense to collect information about the users themselves. Based on this paper that was released by Google, here are some implicit features YouTube looks out for while recommending videos. One is watch history, which is the videos that you watched in the recent past. The second is search history, which is the list of queries that you typed into the YouTube search bar. The third is the user's geographic location. And this makes sense, because if you're in Mexico, you get recommended more videos from Spanish-speaking creators, and if you're watching from, say, India, you're recommended more tech videos. Next is device type: mobile, tablet, or whether you're viewing from a laptop or desktop. There's gender, there's age, and there's video freshness, which is the age of the video. I feel like this is significant because videos get the most views within the first two days of their upload. Moving forward, I'll refer to these seven features as the user context, because it's just a shorter name than saying the seven features we described before. Now we have this information in hand, but we need to somehow feed it into a neural network. If you don't know what a neural network is, don't worry about it. No need to know the details.
Just know that it is a magical black box that takes one type of variable as input and converts it to another type of variable at the output end. For the video recommender, this input will be the user features we just described, the user context. And the output would be the set of recommended videos. But since a computer needs to deal with these, we encode everything into numbers and vectors. Let's consider the encoding of the input to a vector. For watch history, we encode each video into a fixed vector, and then we take the average of all of these vectors to get a final watch-history vector over all the videos. Next is search history, where we do pretty much the same thing: we encode each query into a vector and then take the average of these query vectors to get the final search-history vector. The geographic location embedding is just a vector representing the country of the user. And all the other factors are encoded as scalars. We now take each of these vectors and scalars and concatenate them into one large vector. This final vector is the input to our network. So the input to our network is sorted out. That's great. Now for the output. It's the set of recommended videos. But how do we represent this? In traditional deep learning, we model each neuron as a probability. The jth neuron would be the probability of watching video j completely. This is known as the full softmax method, or just softmax. But this would mean that the number of neurons in the output layer would have to equal the number of videos on YouTube, and that number can be in the millions. It's way too large to compute. So the big question: how do we feasibly compute this and recommend videos? The most common solution, and the solution that YouTube uses here too, is something called candidate sampling. Instead of considering every single video on the platform for recommendation, we only consider a subset of, say, 100 or 200 videos.
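Here's a toy sketch of that input-encoding step: average the watched-video embeddings, average the query embeddings, look up a location embedding, and concatenate everything with the scalar features. Every embedding table here is random and every ID is invented; in the real system the embeddings are learned jointly with the network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables (random placeholders, not learned values).
video_emb = rng.normal(size=(1000, 8))   # 1000 videos  -> 8-d embeddings
query_emb = rng.normal(size=(500, 8))    # 500 queries  -> 8-d embeddings
geo_emb   = rng.normal(size=(200, 4))    # 200 regions  -> 4-d embeddings

watch_history  = [3, 41, 7]              # video IDs recently watched (made up)
search_history = [12, 99]                # query IDs (made up)
geo_id = 17
age, gender, device, freshness = 0.3, 1.0, 2.0, 0.9   # scalar features

watch_vec  = video_emb[watch_history].mean(axis=0)    # average video embedding
search_vec = query_emb[search_history].mean(axis=0)   # average query embedding
geo_vec    = geo_emb[geo_id]

# Concatenate vectors and scalars into one fixed-length input vector.
x = np.concatenate([watch_vec, search_vec, geo_vec,
                    [age, gender, device, freshness]])
print(x.shape)   # (24,)
```

The averaging is what makes the input fixed-length: whether a user watched 3 videos or 300, the watch-history part is always one 8-dimensional vector.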
We select these 100 or 200 videos by, to use the fancy term, sampling videos from a distribution. This concept gets very mathematical, but I'll explain it in a big-picture way, enough to understand what's going on. And then I'll delve into specifics for the nerds out there. First off, where do we get our training data from? We monitor every user's activity over time, and every time a user clicks on a video, we have one training sample. At this point, we document the user context. So we have a number of pairs of user context and videos watched. For every user context, we construct the input vector Xi. Remember, user context is a fancy phrase that describes the seven features of the user that we discussed before. And based on this user context, we determine a set of videos to recommend to the person, which we call S here. Mathematically, the idea is to find the optimal set of videos that maximizes this equation. S here has a fixed size, like 100 or 200 videos. And S is the set of candidate videos that move on to the next stage. That is, the 100 or 200 videos that the algorithm thinks the user will like. The end result is that, given a user's watch history, search history, location, and other information, and using that to maximize this equation, we get an appropriate set of videos to recommend to the user. Now, that is good enough to know what's going on. But if you're a math nerd, keep listening. Q is a distribution where Q of Y given Xi is the probability of watching a set of videos Y given some user context Xi. In our training data, users either watch a video or they don't. So Q is a Bernoulli distribution that takes on two values during training: one if the video was watched, and zero if it was not. It's customary in machine learning to assume a parametric distribution while training, and we'll call these parameters theta. Using our training data with our network, we approximate this optimization.
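A toy picture of what candidate generation does at serving time: score every video against the user's vector, softmax, and keep the top K as the candidate set S. The vectors here are random placeholders; the real system learns them, and in practice it uses fast approximate nearest-neighbor search over dot products rather than an exhaustive softmax over millions of videos:

```python
import numpy as np

rng = np.random.default_rng(0)
n_videos, d, K = 10_000, 16, 200

video_vecs = rng.normal(size=(n_videos, d))  # hypothetical learned video vectors
u = rng.normal(size=d)                       # user vector from the network

# Softmax over all videos: P(video j | user context) ~ exp(v_j . u).
logits = video_vecs @ u
probs = np.exp(logits - logits.max())        # subtract max for stability
probs /= probs.sum()

# The candidate set S: the K highest-probability videos.
S = np.argsort(probs)[-K:][::-1]
print(len(S))   # 200
```

Only these K candidates move on to the ranking stage, which is what makes the full pipeline tractable.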
And during test time, we use a sample from P to give a set of candidate videos when given context vector X. Okay, end of the math nerd sprint. Hope this candidate sampling is clear to an extent. I'll leave links in the description for more details, and also a link to the paper. Now, once we have candidate videos, we need to go into the next step of YouTube's recommendation algorithm, the ranking phase. The big picture is that this involves sorting the candidate videos in order of relevance. Each video is assigned a relevance score. The higher the score, the higher the relevance for the user context, and the higher the video is pushed up the list. Now, to compute this score, YouTube uses hundreds of features. Some of these features are similar to the user context we discussed before, like watch history and search history. Some of the video-specific features include how many videos the user has watched from this channel, or when the last time was that the user watched a video on this topic. Also, the score of a video is updated for every impression. So if a user doesn't click on the recommended video, then the score of this video for the user context is decreased, and it drops down the list. With such frequent updates to the score and hundreds of features to consider, I think you can see why we need candidate sampling to weed out the less relevant videos before we rank them. It would be insane to rank 100,000 or millions of videos every single time a user clicked on a video. Now, another big question: how exactly are these scores computed? You may have guessed this, but the scores are directly proportional to watch time. In fact, in YouTube's ranking neural network, these values are equal. So the higher the expected watch time of the video for a given user, the higher its score. Off the bat, I find this weird, because it's obvious that longer videos have higher raw expected watch time.
And so the ranking algorithm will likely recommend longer videos, which is in alignment with what a number of creators are complaining about these days. If you want more technical details on how the ranking algorithm predicts expected watch time, put your nerd hats on for a bit, okay? Just put them on. YouTube uses a technique called weighted logistic regression to come up with these scores. It takes hundreds of features of the video and the user as input, and it spits out the relevance score, which is also the expected watch time of the video. Now, why does this work? Let's build some intuition. We want to assign a high score to a video if we believe the user will view it for a long time. In other words, if the odds of them watching a video for a long time are high, we want to assign it a higher score. The odds are proportional to the probability of watching a video; they're actually the ratio of the probability of watching a video to the probability of not watching it. But here, we weight each of the videos differently based on watch time. So let's say that N is the number of candidate videos from candidate sampling, which is, like, 100 or 200 or some large number. Then K is the number of videos which were actually watched by the user. So N minus K will be the number of videos that were considered as candidates but not watched by the user. Now, videos that are not watched don't have watch time, so each is assigned a watch time of one second. The denominator then just becomes the number of videos that were recommended to the user but not clicked on. Weighted logistic regression maximizes the odds of seeing a video that is watched and minimizes the odds of seeing a video that is not watched. This is similar to the philosophy of normal logistic regression. The difference here is, well, the weights. We weight the videos watched much more heavily than the videos not watched.
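The arithmetic behind "odds ≈ expected watch time" can be checked numerically. With positives weighted by their watch time and each unwatched impression contributing unit weight, the learned odds work out to the total watch time divided by the number of unwatched impressions; the watch times below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                            # candidate impressions shown to the user
k = 20                              # of those, videos actually watched (k << N)
T = rng.uniform(30, 600, size=k)    # their watch times in seconds (made up)

# Positives weighted by watch time, negatives given unit weight, so the
# odds learned by weighted logistic regression come out to:
odds = T.sum() / (N - k)

# Expected watch time per impression:
E_T = T.sum() / N

# odds = E_T / (1 - k/N), so when the click fraction k/N is small,
# the odds are approximately the expected watch time itself.
relative_gap = abs(odds - E_T) / E_T    # equals k/(N-k), about 0.02 here
```

This is why exponentiating the final layer's output (next paragraph) yields something that can be read directly as an expected-watch-time score.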
Simplifying this mathematically, we can assume the number of videos watched is actually very small compared to the number of candidate videos. And in the end, we get odds nearly equal to the expected watch time of a video for a given user. Now, that's the basic math. Put on your technical hats. We construct a neural network in a similar way to the candidate sampling network. During training, the last fully connected layer learns the logarithm of the odds, the log odds, or logit. So at test time, we take the exponent of this value to get the odds, just the raw odds, of the user watching the video. And these odds are the relevance score of the video for the given user. And just as we showed, they are also the expected watch time. And that's the gist of YouTube's algorithm and the deep learning approach to building a recommender system. So cool, let's summarize everything that I just mentioned in this video. We started out with the definition of a recommender system, using YouTube as context. We went into details about the types of recommender systems. One is collaborative filtering, where we use the ratings of different users to predict if a particular user will like a video. But the disadvantage of this, in a general situation, is that new users tend to join the platform at a very high rate, so it doesn't scale well with users. A solution is to use videos instead of users to make predictions about whether a user will like a particular item. However, here on YouTube, users are continuously signing up to the platform and videos are also continuously uploaded to the platform. So collaborative filtering in general doesn't scale well, regardless of whether it's with respect to users or items, that is, videos. Another problem is sparsity: too much storage space and too much computational complexity for the user-video matrix. A solution to this is to use the second major technique that we described, matrix factorization.
This technique decomposes the user-video matrix into three matrices. Geometrically, it projects the user vectors and the video vectors into the same space so they can be directly compared. We would recommend items whose video vectors are closer to the user's vector. The problem here is that the users and the items are projected into an unknown space. We know the video to recommend to the user, but we don't know why we are recommending this video to the user. And then finally, we introduced a deep learning technique to construct a recommendation system and explained YouTube's algorithm using this approach. The recommendation system is actually split into two parts. The first is candidate generation, where we select hundreds of potential videos from the millions of videos on the platform based on user interest. And we also took a look at the list of features that the algorithm considers. The second is the ranking algorithm, where we assign a relevance score to each video and sort videos accordingly. This is done using weighted logistic regression, the weighted version of logistic regression where the weights of training videos are assigned based on watch time. There are hundreds of features that the algorithm uses, but YouTube has not made most of them public. So are recommendations on YouTube solely based on watch time? Well, that's actually not quite the case, because of the two processes that we discussed. In the first, candidate generation, we actually get these candidates based on other factors, such as what the user previously watched or what queries the user may have typed into the search bar. But it's in the second phase, ranking, that we predominantly use watch time to generate the rankings of our videos. And this is probably why you see longer content being recommended more than shorter content, even though both are still based on watch history and search history. And that's it. Hope you guys liked the video.
Hope you learned about recommender systems in general and also how YouTube recommends videos using deep learning techniques. If you liked the video, give it a like and subscribe for more awesome content. It took a long time to make, so a subscribe would really be appreciated. Keep up to date with my content, and I will see you guys in the next one. Bye-bye.