All right, ladies and gentlemen, next up we have with us Vui Vera, a software engineer at radanalytics.io, and he's going to be talking about building streaming recommendation engines on Spark. I request you all to have a seat and welcome Vui.

Hi, everyone. Thank you very much for coming to hear me. Hello? Yeah, I'll use this one. OK, thank you. So my name is Vui, as you just heard, and I'd like to talk a little bit today about building distributed streaming recommendation engines on Spark. I'd also like to talk about batch recommendation engines, since that's the common approach, and about how easy it is in one sense to build these kinds of distributed recommendation engines, while on the other hand building them in a streaming and distributed way can be tricky; there are some roadblocks you might stumble upon. So I'll introduce the concept of collaborative filtering, which is the most common way of building recommendation engines, and I'll talk about two variants: batch alternating least squares and streaming alternating least squares. I'll also give a brief introduction to Apache Spark, which I assume some of you are familiar with, and I'll talk about an implementation on top of Apache Spark called distributed streaming ALS. Finally, I'll talk a bit about how to deploy these kinds of setups on a modern cloud environment such as OpenShift.

So what is collaborative filtering? Well, first let's talk about recommender systems. Recommender systems are a popular method of matching historical data about users, products, and ratings, that is, the connections between those users and products. Usually you have a unique relation between a user, a product, and a rating. Say you go to a movie website where you want to watch or buy a new movie; you might see some movies there and rate one. If you give it five stars, you've created a unique relation between yourself (the user), the movie (the product), and the rating you just gave. In this jargon, "collaborative" just means using all of the data you have globally, from all the users, and "filtering" basically means predicting: you're making predictions from the data you already have.

In a way, we use collaborative filtering in our everyday lives, and it's quite common sense if you think about it. The main idea behind it: assume you have two groups of people. Group A is a group of people with whom you share a musical taste, so everything they like, you usually like. Group B is people with whom you don't share any musical taste, so everything they like in music, you hate. Now, if group A recommends you one album and group B recommends you another, which one are you probably going to buy? You're probably going to buy group A's recommendation, right? That's basically collaborative filtering. And as a bonus question: if group B says an album is really bad, does that mean you're going to like it? No, because you don't really have an informative relation there; it's a different kind of relation from the one you have with group A.
So you can't really say you're going to like that album.

One of the most popular methods for collaborative filtering is alternating least squares, or ALS. In ALS, we assume we have all of the data organized as a sequential ordering of users and products, so we can build a matrix; it's a natural way of displaying this data. We have a matrix representing all the ratings, and it's a sparse matrix: some ratings are obviously going to be missing, since not all users rate all products. Each entry represents a unique relation between a user and a product. What we're doing with ALS, in a nutshell, is trying to factorize this big ratings matrix into two latent factor matrices, which we'll just call U and P here. When these two factors are multiplied back together, they give an approximation of the ratings matrix, and that approximation fills in all the ratings that were missing; it's a prediction of the missing ratings.

One classical way of doing this is a batch method. In a batch method, the factorization is done by defining a loss function with an error term, the difference between the actual ratings you have and the predicted ratings, plus some regularization terms, and this loss function has to be minimized. Fortunately, with ALS this minimization problem has a closed-form solution: you can set the derivatives of the loss function with respect to U and P to zero, and you get a nice set of linear equations that you can solve by iteration. That's quite handy, right? The way we do it is we fix one of the factor matrices, solve for the other one, and then iterate the process back and forth, fixing one and then the other. Eventually this process converges, and you get a very good approximation of the ratings matrix.

In the end, what you have is something like this: you still have the data you actually observed, and in red you have the approximation. What this approximation means mathematically is that these values are the ones which minimize that ALS loss, so you can bet it's going to be a good approximation, provided you have enough data and enough iterations.

To visualize this, let's imagine a very quirky shop: a shop that sells only 300 products and has only 300 customers. And to be even more quirky, each customer gives a rating in 8 bits, so any rating from 1 to 256. Since we're human, and we spot patterns much better in colors than in numbers, let's assign a palette to these numbers. I think you know where this is going: you can render the ratings matrix as the image you see here. So if we use ALS to solve this, how do we go about it? First we fill the latent factors with random numbers, and obviously, if you fill them with random numbers, your initial approximation is going to be a random matrix; that makes sense. Then you start iterating, and as the iterations proceed you can see it actually does quite a good job: after a few iterations it's approximating the ratings matrix we had originally.
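For reference, the batch loss just described can be written out explicitly; this is its standard form (the exact regularization terms on the slides may differ slightly):

```latex
\min_{U,P}\;\sum_{(x,y)\in K}\bigl(r_{xy}-\mathbf{u}_x^{\top}\mathbf{p}_y\bigr)^2
\;+\;\lambda\Bigl(\sum_x\lVert\mathbf{u}_x\rVert^2+\sum_y\lVert\mathbf{p}_y\rVert^2\Bigr)
```

Here K is the set of observed (user, product) pairs, u_x and p_y are the latent factor vectors, and λ is the regularization weight. Fixing P turns this into an independent least-squares problem for each u_x (and vice versa), which is what gives the closed-form alternating updates.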
So OK, it works. But it's expected to work in this case; this is probably the simplest case of ALS you can imagine. It's a very small data set, you know all the ratings, and it's not a distributed system, so there are no traps to fall into. Of course it's going to work.

Now, you might be thinking: well, this is all nice, but we can do this in a streaming way, right? While we're approximating that matrix, whenever we get new observations, new ratings, new users, or new products, we can just recalculate the whole thing, right? Well, yes, you can, but that's not really streaming, for two technical reasons. One is that you have to keep the whole of the data: it isn't a streaming implementation because, although you get new data, you still have to store the entirety of the historical data; you can't re-approximate the matrix from just that one new rating. And secondly, doing this in real time might be problematic: imagine you're a shop or a company with millions of users and millions of products; it's going to be tricky to recalculate the whole thing in real time.

So how do we go about it? We want a method that lets us do this factorization with just one rating, or a few ratings, at a time. And fortunately there is such a method: stochastic gradient descent applied to the factorization. The specific variant we're going to use is biased stochastic gradient descent to factorize the ratings matrix. So what's the difference between stochastic gradient descent and the batch method? Basically, we introduce a new concept: the bias. Here we have biases associated with user x and product y. There's mu, which is the global average of all the ratings we have, and b_x, which is basically how much the ratings of a given user (or product) deviate from that global average. The new approximated rating is then just the batch prediction plus these new bias terms. If we substitute this into the loss function, we can devise a new loss function for the streaming case, which now contains the new predictions plus some regularization terms that we're not going to go into.

Calculating these updates over the full data set is quite expensive, so we're going to calculate them for a single observation. As you can see from the updates of the biases and the updates of the factors, we can do this iteratively, with just one observation at a time, and that's exactly what we want. So provided we have a single rating, the rating of user x for product y, we can update the biases as well as the latent factors, and that allows us to do the factorization in real time. An interesting property is that this method also converges, just as the batch method does.

So the practical difference, in terms of streaming data, should be obvious by now, I hope: in both methods the objective is to estimate the latent factors, but in the batch method, whenever you get a new observation, you have to recalculate the factors over the entirety of the data, whereas in the streaming version, whenever you get a new observation, you just update the gradients that relate to a specific row and a specific column, the ones belonging to that specific user and product.
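Written out, the prediction and the single-observation updates just described take the usual biased-SGD form (this is the standard formulation from the literature; the exact constants on the slides may differ):

```latex
\hat{r}_{xy} = \mu + b_x + b_y + \mathbf{u}_x^{\top}\mathbf{p}_y,
\qquad e_{xy} = r_{xy} - \hat{r}_{xy}

b_x \leftarrow b_x + \gamma\,(e_{xy} - \lambda\, b_x), \qquad
b_y \leftarrow b_y + \gamma\,(e_{xy} - \lambda\, b_y)

\mathbf{u}_x \leftarrow \mathbf{u}_x + \gamma\,(e_{xy}\,\mathbf{p}_y - \lambda\,\mathbf{u}_x), \qquad
\mathbf{p}_y \leftarrow \mathbf{p}_y + \gamma\,(e_{xy}\,\mathbf{u}_x - \lambda\,\mathbf{p}_y)
```

Here γ is the learning rate and λ the regularization weight; each incoming rating only touches b_x, b_y, u_x, and p_y, which is what makes the single-observation update cheap.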
And it is important to note that, from a certain point of view, these methods aim at exactly the same thing: they both calculate the latent factors and, from those, make predictions. It's just that the way they use the data is significantly different.

So let's look at an illustration of the streaming case with the same manufactured ratings data. As you can see, convergence does seem to be happening here, just more slowly. That is to be expected, because now we're not using the entirety of the data, just one observation at a time. But in the end we see convergence to a similar result, and we get a good approximation of the ratings matrix. Again, this is a simple example: a small data set on a local machine, with no distribution happening. But that's not what we want, right? We want to try this with big data sets in production; we want to implement something that works at scale, with big amounts of data. And to do that, we're going to use Spark.

I'm assuming some of you are familiar with Spark. Who here has worked with Apache Spark? Yeah, not that many people. OK, so I'm going to do the mandatory ten-second introduction to Apache Spark, and I hope it describes faithfully what Spark does. Spark is a framework that allows you to distribute calculations at scale, and it provides several core data structures, such as resilient distributed datasets (RDDs), DataFrames, and Datasets. An RDD, a resilient distributed dataset, is an immutable, distributed, typed collection of objects. What does that mean? It means that when you create one of these RDDs, it's actually partitioned across your cluster, and since RDDs are immutable, you can map your computations onto each of the partitions: the calculations are done in parallel across the cluster, and then you just aggregate the results back. This allows for a very natural way of distributing computations, provided you can translate your algorithm into these kinds of distributed, immutable operations. For the streaming ALS application, we're going to use RDDs as the core data structure.

Now, Apache Spark already provides an implementation of ALS, alternating least squares, in its MLlib library. It's a batch implementation, but a very performant one; it works very well, and if it works for you, by all means use it. And it has a very simple API. Basically, to train a model you just need a few quantities. You need what is called a Rating, which is just a wrapper around the quantities we mentioned: the user ID, the product ID, and the rating; your ratings matrix is then just an RDD of Ratings. You also need a rank, which corresponds to the number of elements in each of the columns and rows of the latent factors we mentioned, and you need a number of iterations, which is basically a hard stop on when the iterative process should end. This is quite useful, because it means the problem is computationally bounded: you know it's not going to run forever. You can say, well, after 100 iterations this will be a good enough approximation, and stop there. It also allows you to pass some regularization terms, as we've seen, such as lambda. The way to train a model is quite straightforward: there's an ALS object, and you just pass it the ratings, the rank, the iterations, and the lambda. What you get back is a class called MatrixFactorizationModel, which is just a wrapper around the two latent factor matrices we've seen.
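As a concrete illustration of that API, a minimal batch training run with MLlib might look like the sketch below. The input path and parameter values are placeholders for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object BatchALSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-als"))

    // Each line of the (hypothetical) input file is "userId,productId,rating".
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val rank = 10       // length of each latent factor vector
    val iterations = 20 // hard stop for the alternating solves
    val lambda = 0.01   // regularization strength

    // Returns a MatrixFactorizationModel wrapping the two latent factor matrices.
    val model = ALS.train(ratings, rank, iterations, lambda)

    // Predict the rating user 42 would give product 7.
    println(model.predict(42, 7))

    sc.stop()
  }
}
```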
And this works, obviously, in a batch setting. But to actually run a streaming implementation, you're going to need a streaming data source, and the streaming data source we decided to use is Spark's discretized streams, or DStreams. They basically work as mini-batches of RDDs over a certain time window: you get batches of resilient distributed datasets at a certain frequency and over a certain time window, and then you can apply your processing to each of these mini-batches. An important thing to notice, and an advantage of these discretized streams, is that if we use them in a streaming ALS, we no longer need to keep the entirety of the data in memory, or even access it. If you imagine the case I mentioned, with millions of products and millions of users: now, when you get a mini-batch with just a few ratings, you don't need to read a database with several hundred million ratings and redo the whole process; you can work with just the data in that mini-batch.

Since the MLlib API is quite straightforward and intuitive, we wanted to keep the same kind of API for the streaming ALS, so we're going to use the same type of calls. The way we do this is: initially, when we don't have any model or data, we create a model from the initial RDD, the first mini-batch. From then on, we update that model with each mini-batch that arrives afterwards, so the model is continuously updated as new mini-batches of data come in.

So what do we need to train the model? I'm now going to walk you through the algorithmic steps of going from that initial mini-batch to a trained model. It's a series of steps, and they go into some detail, but hopefully they'll give you an idea of how tricky it can be to implement this kind of algorithm in a distributed way. The flip side, of course, is that you end up with a distributed recommendation engine, which is quite performant. So what do we need to get the model? If you remember the formulas from the initial slides, we need certain quantities; once we have them, we can say we have a trained model and we can perform predictions. And these operations are identical from one mini-batch to the next: the same set of operations you do on the first mini-batch, you simply repeat with the new data, and that's how you continuously update the model.
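Put together, the outer loop can be sketched roughly as follows. `StreamingALSModel`, `createModel`, and `updateModel` are hypothetical stand-ins for the pieces described in the next steps, not Spark APIs, and a socket source keeps the sketch short (the real setup used Kafka, as we'll see later):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.recommendation.Rating

// Hypothetical model type and training steps, standing in for the
// streaming ALS implementation described in this talk.
case class StreamingALSModel(/* latent factor RDDs, biases, global mean ... */)
def createModel(batch: RDD[Rating]): StreamingALSModel = ???
def updateModel(model: StreamingALSModel, batch: RDD[Rating]): StreamingALSModel = ???

val sc = SparkContext.getOrCreate()
val ssc = new StreamingContext(sc, Seconds(5)) // 5-second mini-batch windows

// Parse "userId,productId,rating" lines arriving on the stream.
val lines = ssc.socketTextStream("localhost", 9999)
val ratings = lines.map { line =>
  val Array(u, p, r) = line.split(',')
  Rating(u.toInt, p.toInt, r.toDouble)
}

var model: Option[StreamingALSModel] = None
ratings.foreachRDD { batch =>
  if (!batch.isEmpty()) {
    // The first mini-batch bootstraps the model; later ones update it.
    model = Some(model.fold(createModel(batch))(m => updateModel(m, batch)))
  }
}

ssc.start()
ssc.awaitTermination()
```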
Let's start with the user and product weighting factors. To calculate them we need, as in the batch ALS, an RDD of ratings, corresponding to the ratings each user gave each product. The first thing we do is split this RDD into two RDDs: one keyed by user and one keyed by product. This is what will allow us to compute the weighting factors. Keep in mind this is the very first step: we have no model, no weighting factors, nothing. So for each of these two keyed RDDs, we generate a random feature vector: we create a feature vector of rank R, the rank we decided on, filled with random uniform values, and with each of those feature vectors we also associate a random bias. That's quite easy to do.

The next step is needed because, having split the RDD into one keyed by user and one keyed by product, we may have duplicated users or products in these RDDs. Imagine you rate two movies: obviously you'll appear in two entries, twice on the user side and twice on the product side. So what we do is join these with the ratings, which returns a data set consisting of product IDs, user IDs, ratings, and user factors: we get mappings between the user, the product, the bias, and the feature vectors. Finally, it's just a couple more steps: we swap the RDD keys, so the data set that was keyed by user is now keyed by product (or the reverse), and we take this intermediate data set and join it with the other feature vector RDD. Now we have a complete RDD in which each item includes the biases and the feature vectors for each combination of product ID, user ID, and rating.

Now we can calculate the global bias. If you remember, the global bias is simply the average of all the ratings we have, and that's easy to do in Spark: we just compute the average over all the ratings. Finally, we need to calculate the user-specific and product-specific biases, and that's quite simple as well. As we've seen, we can update a bias using just a gradient term: the new bias is the old bias plus the gradient term. And to calculate the gradient, we just need the quantities we have here: the error, which is the difference between the rating and the prediction, plus gamma and lambda, which are model parameters. So we have everything we need.

Let's start with the gradient. For each item of this RDD we compute the prediction, and then we can calculate the error: the error is simply the rating minus the prediction. Now that we have these quantities, if you remember the expression for the gradients, the calculation is quite straightforward: we take the RDDs we have for the latent factors, for the users and for the products, key them by user and by product, and do an aggregated sum for each of those split RDDs. That gives us the gradients for the user weighting factors and the gradients for the product weighting factors. So now we have all the quantities we need to say we have a model: we just sum the gradients so that we have one gradient per user and per product, and that's it. You might say, well, that seems like a lot of steps, but that's the price you have to pay for doing this computation in a distributed way.
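To make these steps concrete, here is a sketch of one SGD pass over a mini-batch. The representation (id mapped to a feature vector plus a bias) and all names are ours for illustration, not part of any Spark API, and the real implementation would need the shuffle optimizations discussed later:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

object StreamingALSStep {

  type Factors = RDD[(Int, (Array[Double], Double))] // id -> (vector, bias)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def addVec(v: Array[Double], w: Array[Double]): Array[Double] =
    v.zip(w).map { case (x, y) => x + y }

  def sgdStep(batch: RDD[Rating], users: Factors, products: Factors,
              mu: Double, gamma: Double, lambda: Double): (Factors, Factors) = {

    // Key the mini-batch by user and join in the user factors, then re-key by
    // product and join in the product factors, so each row carries the rating
    // plus both feature vectors and both biases.
    val joined = batch.map(r => (r.user, r)).join(users)
      .map { case (_, (r, uf)) => (r.product, (r, uf)) }
      .join(products)

    // Per-observation gradients for both sides, from the biased-SGD updates.
    val grads = joined.map { case (_, ((r, (uVec, uB)), (pVec, pB))) =>
      val err = r.rating - (mu + uB + pB + dot(uVec, pVec))
      val uGrad = pVec.zip(uVec).map { case (p, u) => gamma * (err * p - lambda * u) }
      val pGrad = uVec.zip(pVec).map { case (u, p) => gamma * (err * u - lambda * p) }
      ((r.user,    (uGrad, gamma * (err - lambda * uB))),
       (r.product, (pGrad, gamma * (err - lambda * pB))))
    }.cache()

    // Sum the gradients per user and per product (a user rating two movies
    // contributes two partial gradients), then apply them to the factors.
    def applyGrads(factors: Factors, g: RDD[(Int, (Array[Double], Double))]): Factors = {
      val summed = g.reduceByKey { case ((v1, b1), (v2, b2)) => (addVec(v1, v2), b1 + b2) }
      factors.leftOuterJoin(summed).mapValues {
        case ((vec, b), Some((gv, gb))) => (addVec(vec, gv), b + gb)
        case ((vec, b), None)           => (vec, b)
      }
    }

    (applyGrads(users, grads.map(_._1)), applyGrads(products, grads.map(_._2)))
  }
}
```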
So you might think: well, now that we have the latent factors, we can easily make predictions. But what if you get new observations? What if you get new data from a user we've never seen before, or a product we've never seen before? How do we deal with that? Remember that we trained the model from nothing on the first initial mini-batch, and from then on we carry the model over to the next window and train it with the ratings in the next mini-batch. So let's look at the case where we get a mixture of data: ratings involving users and products we've seen before, and ones we haven't. Imagine you rated a movie last week and now rate another one; but meanwhile some other user logs into the system and rates something for the first time, or rates a movie that didn't exist in the catalogue before. How do we deal with that? In the slide, the cells in red are ratings for users or products we haven't seen before, and the others are for users or products we have seen. What we do now is, instead of assigning random feature vectors to everything we get, we do a full outer join between the factor RDDs produced from the new data and the factor RDDs we already have. That lets us keep the feature vectors we already have for known users and products, and for the ones we've never seen before we do exactly the same as in the initial step: create new random feature vectors and random biases. In this way we can handle any situation that arises.
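A minimal sketch of that full outer join, reusing the illustrative factor representation from the previous snippet (names are ours, not Spark's):

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Keep existing factors where we have them; create random ones for ids we
// have never seen before.
def mergeFactors(existing: RDD[(Int, (Array[Double], Double))],
                 incomingIds: RDD[Int],
                 rank: Int): RDD[(Int, (Array[Double], Double))] = {

  def randomFactor(): (Array[Double], Double) =
    (Array.fill(rank)(Random.nextDouble()), Random.nextDouble())

  // fullOuterJoin keeps ids present on either side:
  //  - (Some(old), _)  -> a known user/product, keep its factors
  //  - (None, _)       -> first time we see this id, initialize randomly
  existing.fullOuterJoin(incomingIds.distinct().map(id => (id, ()))).mapValues {
    case (Some(factor), _) => factor
    case (None, _)         => randomFactor()
  }
}
```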
So how does this behave in the real world? We decided to test it with some real data, and for that we used the MovieLens data set, which is very widely used in the recommendation engine research field. It comes in two variants. There's a small variant, which is quite good for prototyping: ratings that users actually gave to movies, in a small file of about 100,000 ratings, good for quick prototyping of algorithms. And there's a full variant, with 26 million ratings, which is good if you want a more in-depth analysis of an algorithm's performance. The data has several fields of interest, but we're only going to use a few of them: the user ID, the movie ID, and the rating.

[Audience question] No, no, I was just about to get to how we set up the whole thing to train the streaming case. That's a good question, and I'll answer it in a second. So how are we going to train this? First, we train a batch model so we have a baseline to compare the streaming version against, and we just use Spark MLlib's out-of-the-box batch ALS. We split the data 80/20, train on one of the splits, and keep the other part of the data set to the side, making sure we use exactly the same two splits for both methods so we have a fair comparison. And we calculate an error measure so we have some metric of how well the model is performing; in this case we use the root mean squared error, which is easy to calculate in Spark given the predictions and the original data. We then compare the root mean squared error of the streaming version against this baseline.
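A minimal version of that baseline-plus-RMSE computation, assuming the `ratings` RDD from the earlier batch example and placeholder parameter values:

```scala
import org.apache.spark.mllib.recommendation.ALS

// 80/20 split, batch baseline, and RMSE on the held-out ratings.
val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = ALS.train(train, 10, 20, 0.01)

// Predict every held-out (user, product) pair and join against the truth.
val predictions = model.predict(test.map(r => (r.user, r.product)))
  .map(p => ((p.user, p.product), p.rating))
val actuals = test.map(r => ((r.user, r.product), r.rating))

val rmse = math.sqrt(
  actuals.join(predictions)
    .map { case (_, (a, p)) => val e = a - p; e * e }
    .mean())

println(s"batch RMSE = $rmse")
```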
So how did we set up the actual streaming test case? We ran this on OpenShift, and the idea was to use a message broker to simulate the data stream. We used Kafka, deployed with Strimzi, a project that lets you run Kafka on OpenShift, and we used Oshinko, a tool from radanalytics.io that lets you easily deploy Spark clusters on OpenShift. And what we did, to answer your question, is read the entirety of the data and then replay it through Kafka to simulate the stream. We used windows of five seconds with a thousand observations each, but that's purely for convenience, for practical purposes: you could use whatever you want, but realistically, if you replayed one observation per minute or so, you'd probably wait months or years for this to finish. An important note: the best parameters for the batch model are not necessarily the best parameters for the streaming model; they are, obviously, quite different algorithms. But again, for convenience and practicality, we used the same parameters in both models. In a moment I'll get into how to estimate hyperparameters for a streaming ALS.

So this is the result we got. The horizontal dashed line is the root mean squared error for the batch version, and the blue line is the root mean squared error for the streaming version. You can see it's quite good, in the sense that it's what we were expecting: in the beginning the streaming version doesn't have much data, so it's all over the place, but as time goes by it does seem to converge to a value in the same region as the batch version. In the end, both the batch and the streaming versions process the same amount of data, so it's a reasonable result.

Now, you might think this is all very good and streaming ALS is a silver bullet that magically solves every problem, but that's not the case, obviously. There are some things which are very important to consider when using streaming ALS. One problem, and this is not particular to streaming ALS, it applies to all ALS and many machine learning models, is the cold start problem. The cold start problem refers to an initial period in your model training where you don't have enough data to make any kind of insightful inference or prediction. Think back to those slides I showed you at the beginning: the latent factors start completely filled with random data, so the approximation looks completely random, and a few observations aren't going to change that; it's still going to look pretty much random. So be careful, because with a streaming version you might feel tempted to start serving predictions straight away. That might not be the best idea. Something you can do to mitigate it, if you're a big company and you have loads of historical data, is to first train the streaming model offline with a big chunk of data and only then start serving in a streaming way. Bootstrap the model with a big chunk of data; don't start from zero and immediately serve predictions off, say, five ratings. Those predictions might well be rubbish.

Another thing to consider is hyperparameter estimation. Hyperparameter estimation in batch ALS is quite straightforward: you do a grid search over several sets of parameters and decide, well, parameter set D is the best for my data. And at some point in the future you can retrain the whole thing; say, after two months of having the model running, it's not behaving very well and you want to retrain it with a rank of double the size. You can do that perfectly well, because you have all of the data. In the streaming case, that's not true, because once you've consumed the data, you discard it; in theory, at least. What I mean is: if you're in a position, in streaming ALS, where you need to go back to the entirety of the data and retrain the model, then it's not really a streaming version, it's a batch-streaming hybrid. So you have a set of parameters, you consume the data, and if you then want to try a new set of parameters, you can only train with the new data that arrives, because you discarded the previous data; you can't do what you do in batch ALS. A possible solution is to perform a parallel grid search, as in the sketch below: you start with a bunch of models, train each with its own set of parameters, and as time goes by you see which of those models is the least performant and prune it from the search, keeping, say, the best three, and so on. This has an obvious drawback: it can be computationally expensive to train lots of models simultaneously. And another thing: there's no theoretical result guaranteeing that a model you discarded at the very beginning might not actually have been the best model later, once you had more data; it might have been the best set of parameters overall, just not for the specific small chunk of data you happened to train on first.
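A rough sketch of that pruning grid search, reusing the hypothetical `StreamingALSModel` and `updateModel` stand-ins from earlier; `evaluateRMSE` is likewise a placeholder for scoring a model against a held-out set:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

def evaluateRMSE(m: StreamingALSModel, holdout: RDD[Rating]): Double = ???

// One candidate per hyperparameter combination under consideration.
case class Candidate(gamma: Double, lambda: Double, rank: Int,
                     model: StreamingALSModel, lastRMSE: Double)

var candidates: Seq[Candidate] = Seq(/* one Candidate per parameter set */)

def onMiniBatch(batch: RDD[Rating], holdout: RDD[Rating]): Unit = {
  // Train every surviving candidate on the same mini-batch ...
  candidates = candidates.map { c =>
    val updated = updateModel(c.model, batch)
    c.copy(model = updated, lastRMSE = evaluateRMSE(updated, holdout))
  }
  // ... and prune the weakest one while more than one remains. Note there is
  // no guarantee a discarded candidate would not have won with more data.
  if (candidates.size > 1)
    candidates = candidates.sortBy(_.lastRMSE).init
}
```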
Finally, there's performance to consider. In these kinds of models you're going to do, as you've seen, lots of joins, and you're going to have lots of data shuffling between partitions, so you have to be really careful about optimizing these algorithms. Apache Spark does something very clever in its batch version, called block ALS: basically, it pre-calculates the outgoing and incoming connections between the blocks of the latent factor RDDs, so it can minimize the amount of data shuffling that happens. It's a quite clever algorithm. But a naive implementation of streaming ALS gives you nothing like that, so on top of this algorithm you have to think of some clever strategies of your own to minimize data shuffling. And something that, for the more seasoned Apache Spark developer, might make some alarm bells ring: you may be tempted to use ad hoc random-access fetching of RDD elements to calculate some quantities. Say you find yourself calculating the predicted rating for a specific user and product many times: to do that, you'd have to access a specific "row" or "column" of an RDD, and that's really a kind of anti-pattern in Spark. If your code ends up looking like that, with lots of lookups and so on, the best thing is possibly to rethink your strategy for doing the predictions.

So that was basically the walkthrough of a generic streaming ALS algorithm on Spark. If you want to read more about streaming algorithms, Apache Spark, or OpenShift, I invite you to take a look at my blog; there's a specific post on streaming ALS if you want to see it. And if you want to play around with distributed algorithms on Apache Spark, on OpenShift, in the cloud, I strongly recommend you go to the radanalytics.io website. There are several use cases there for intelligent applications and machine learning at scale, which are very well documented; you can actually learn how these things work just by reading the source code and the documentation. And they have some ready-made use cases you can deploy, like a microservice-oriented recommendation engine built with Apache Spark, which is very, very interesting. So I strongly recommend it. That's it from me, and I thank you very much for your time. Any questions? Yeah, yeah, sorry, I thought that was a given. Oh, it's fine, yeah, don't worry, don't worry.