Good morning. Thank you so much for turning up at 9.30 on the Sunday morning after the party. I'm Sophie and this is my colleague Rui. We both work at Red Hat and we're part of an emerging technologies group, the radanalytics.io team. Our team is mixed: we have some data scientists, which I definitely fit into and Rui fits into, and we have some super techie software experts, which Rui definitely fits into as well, along with the rest of the team. We work to bridge the gap between machine learning and application developers, to make it easy for machine learning experts to put their models into production with little friction.

Something Rui and I have worked on over the last year and a half since we started at Red Hat is recommendation engines. Today we'll talk a bit about what they are and how to get going, and hopefully these skills will be transferable to other models too. Recommendation engines are everywhere. We'll talk about some movie data, which needs to be open source data, but we'll also look at music data and making recommendations where the data isn't quite so concrete: people haven't rated a film, for example, they've simply listened to a song or not listened to a song. Rui's also going to tell us how we put things into production and how we deal with streaming data in recommendation engines.

We'll start by talking about the main recommendation algorithm class we're going to use, which is collaborative filtering. We'll look at parameters, tuning and metrics, and consider some general good practices when we're developing machine learning algorithms and putting them into production. Then we'll look at implicit recommendations, the case where someone has just listened to a song or they haven't and you don't know whether they like it or not. Post-processing is really important; that's something that comes after the model and before we return the recommendations to the user, and we'll see why it matters. We'll give you a chance to make some film predictions for yourself, see how terrible they are without post-processing, and then we'll do some simple post-processing to make the results better, keeping in mind how to do that in a production environment. And then finally, as I mentioned, Rui's going to talk about streaming data, where the data is coming in as it would in production, from users updating their preferences and so on.

We're going to work in Python today. Who has experience in Python? Okay, so all of the code is pretty much written and we'll talk through it as we go, but please do stop and shout at any point. Hopefully this can be conversational rather than a two-hour lecture, because that sounds awful to me and Rui, and I guess to you lot as well. So yes, please ask questions, raise any issues with the code or the environment, and so on.

We're also going to use Apache Spark. Has anyone used Spark? Okay, so Spark is fast, general and easy to use. It's a parallel data analysis tool and it handles the parallelism itself with minimal input from you. If you want to dive deep and find out how it's doing the parallelism you can, but we like it because it's just very straightforward. Another advantage is that it combines things like SQL, stream processing and machine learning, whereas previously you would have had a different engine for each. So it's pretty general, and we like that.
We're going to be working in Jupyter notebooks today. Has anyone used Jupyter notebooks? Okay, cool. If you haven't, no worries, we'll talk about how to use them and how they work. In general they're a lovely environment for exploratory data analysis and getting started with machine learning: they give you a nice mix of code, text and markdown, and you can get images in there as well. And we've got Apache Spark working in there.

Finally, before we dive into the notebook itself, we're going to need some data. The data set we're going to work with today is the MovieLens data set. It's open source, and a lot of online machine learning tutorials use it. I find that beneficial because sometimes when you run something, or when you're getting started, you want a sanity check to show that what you're doing is sensible, and it's nice to be able to follow what someone else has done first and see whether your results are at least similar and not completely wacky. This data set was curated by the GroupLens Research Project at the University of Minnesota and it's nicely organised. There's a small data set of 100,000 ratings users have made about films; that's what we're going to be using today for speed, but in practice, if you really want to do some serious tuning, you might want to go back later and have a go with the full data set, which has tens of millions of ratings.

So let's see if everybody can get their notebook up, and while that happens we'll talk about notebooks and we'll load in that data. You can either use the QR code or you can... So does the QR code take you to the binder? No? Just keep it up. Okay, so do you want to talk about setting up the binder?

Alright, so thank you everyone. As Sophie explained, we're going to do this with a Jupyter notebook, because a Jupyter notebook is a very good way of doing interactive code prototyping and experimenting. You can write your code in several languages, since Jupyter supports several kernels, not just Python: you can use Scala, you can use many other languages. One of the advantages is that you get immediate feedback on what you're doing, so instead of writing a long script, running it, and trying to find an error, you can execute portions of it, and you can have plots embedded, so if you want to visualise the data it's very useful.

For those of you who want to have a go: if you go to that repo you should have... let's just go here... you should have a button here which is "launch binder". Binder is a service that provides a containerised execution environment for Jupyter notebooks. Basically, when you push this button it turns the repo into a container and runs a Jupyter notebook online for you, so you don't have to set anything up, you don't have to set up Jupyter at all. If you push the button, hopefully you should see something like this. Okay, so the container is already built; it caches the container, so it says it found the image and it's launching the server. If you wait a little bit it should take you to a live version of that Jupyter notebook. Has anyone used mybinder before?
No? So, as a future reference, if you make a Jupyter notebook and you want to share it with a colleague, instead of them having all the hassle of setting up Python, setting up the packages, setting up everything, you just send them a link to mybinder and they can execute the Jupyter notebook themselves. Okay, this seems to be a bit slow for a Sunday.

Alright, so this is the home page of binder. All it takes in is your GitHub repo, and you can test it out on branches as well. As long as your GitHub repo has a requirements.txt file listing all of the Python packages you want installed, it will build you this notebook environment that will run with all of its dependencies. Has anyone got the notebook to launch yet? Is everyone... it's launched? Yeah, good, so it's just mine. Okay, cool, there is hope. Oh, of course, sorry, yeah, of course. Is it this one? Yeah, sorry, I'll just give it a minute. Is that an error-message chuckle or a... Really? "The repository is taking longer than usual to load." Right, okay. So this is the problem with doing things online: so many things can fail. We can actually show the notebook locally. Alright, so mine is working; perhaps retry, you just had bad luck, I'm sorry.

Right, so while other people wait, I can explain a little bit about the notebook. Basically you have two main types of cells. You have the markdown cells, where you just write explanations for your code, comments, etc., and all of this you can edit, just double click and you can edit. Then you have the code cells, and for these, if you press Shift and Enter, you execute the cell. Jupyter has a model where it keeps things in memory, so if you import some Python modules in one cell, you can keep going and run other bits of the code without having to do the importing again; it keeps working with that portion of your code. So that's the main interface we're going to be using: going to the code cells, doing Shift and Enter, running them. And obviously the advantage is that if you want to change something in a cell, you can try it for yourself: if we run something here with one value and you want to try different values, just write another value and see what happens. That's the advantage of it.

Okay, so this first cell here is importing the Spark libraries that we need, getting Spark set up and setting up a Spark session. This is pretty much a standard "how to get started with Spark in Python" session, copied over. As we said, if we want to execute the cell, which we'll need to in order to set up the Spark session, we click Shift and Enter. When we do that, you'll see a star in the square brackets on the left, which means the cell is executing, and once it turns into a number, it's done. So we've now run that cell and we're in business, and we're able to load in the data set. So do you want to talk about that?
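For readers following along outside the live notebook, here is a minimal sketch of what such a setup cell typically looks like; the app name and the local master setting are illustrative assumptions, not necessarily what the workshop notebook uses.

```python
# Illustrative Spark setup cell (names and options are assumptions):
# create a SparkSession running locally.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("recommendation-workshop")
         .master("local[*]")          # run Spark on this machine, using all cores
         .getOrCreate())

sc = spark.sparkContext              # the Spark context, used below for textFile
```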
So, as Sophie said, the data set comes from MovieLens and has several files, as comma-separated value files. The main thing we're interested in is the ratings file, which just contains a unique user ID as an integer, a unique movie ID, and the actual rating. Basically it's saying user X gave movie Y the rating 3.5, or whatever. And there's a secondary file we're interested in, the movies data file, which just contains the link between the movie ID and the actual title of the movie, because sometimes we want to see what the actual movie is.

The first thing we're going to do is use this function. We created, as Sophie said, a Spark session, and the Spark session is your connection to the Spark cluster. Spark is a MapReduce-style framework that does distributed computing in memory. Basically, you load the data into Spark; obviously we're running Spark here locally, but ideally in the real world you run Spark on a cluster, so you have a Spark node on each computer. What Spark does is it loads the data, splits it into what they call resilient distributed datasets, which you can imagine just as a big array, distributes them over the cluster, and then you can transparently map functions onto those arrays. Say you have a massive array with all the numbers from 1 up to 10 to the power 1000. Spark would distribute those numbers, randomly perhaps, across the nodes, and if you wanted to square those numbers, Spark would pass the function that squares something to all those nodes, each node would execute that function locally in memory, and then you'd have all the numbers squared. That's how you distribute computations with Spark, in a very simplified view, obviously.

So the first question is, how do you get data into Spark? You have this function called textFile, on the Spark context, which is your point of entry for Spark, and all it does is read the file into Spark. Here we have a simple loading function: it takes the path of a CSV file, loads it into Spark, and then, you see, you can already start applying functions to those RDDs, which are distributed over the cluster. The first thing we do is filter out the line that corresponds to the header, because we have a CSV file with a header that we want to strip. But remember, these RDDs are distributed, so you can't say "take out the first element", because Spark would say: where is the first element? This is all shuffled and distributed. So instead you take anything that is equal to the header and filter it out, and then you split all the strings in the CSV file wherever there's a comma; it's pretty much standard manual CSV processing. Then you map a function that just takes the first item, the second item, etc., and then you create a data frame. Data frames in Spark are pretty much like data frames in other languages, like R or pandas, but they're distributed, and the advantage is that you can run SQL-type queries on them in a distributed way. So if you want to do something like a distinct count, you can do that on a massive dataset which is distributed, kind of like a distributed database as well.

So, first step, we load the movies CSV: we strip out the line that's equal to the header and create a row with the first item and the second item. And the same for the ratings: strip out the line that contains the header and then take out those items. And then we have our datasets.
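A rough sketch of that kind of loading function is below, assuming the column names user/item/rating and illustrative file paths; the workshop's actual helpers may differ, and note that this naive split on commas would break on movie titles that themselves contain commas.

```python
# Sketch of a manual CSV loader like the one described above
# (function names, column names and paths are illustrative assumptions).
from pyspark.sql import Row

def load_ratings(path):
    lines = sc.textFile(path)                          # read the file into an RDD of strings
    header = lines.first()                             # e.g. "userId,movieId,rating,timestamp"
    rows = (lines.filter(lambda line: line != header)  # strip the header wherever it ended up
                 .map(lambda line: line.split(","))    # plain manual CSV splitting
                 .map(lambda f: Row(user=int(f[0]),
                                    item=int(f[1]),
                                    rating=float(f[2]))))
    return spark.createDataFrame(rows)                 # a distributed DataFrame

def load_movies(path):
    lines = sc.textFile(path)
    header = lines.first()
    rows = (lines.filter(lambda line: line != header)
                 .map(lambda line: line.split(","))
                 .map(lambda f: Row(item=int(f[0]), title=f[1])))
    return spark.createDataFrame(rows)

ratings = load_ratings("data/ratings.csv")
movies = load_movies("data/movies.csv")   # titles containing commas would need real CSV parsing
```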
So execute this. Is everyone that's trying to run it on binder okay? Yeah? Good. Okay, so it's done. Obviously, as the slide says, this is just for illustration purposes: we're running Spark locally, which might seem a bit redundant, but it works. What it's doing is running Spark on the same machine that we're doing the tests on, and it's just one node, but it works to show what Spark can do.

So now the first thing: we're going to see how many ratings we have and how many users we have, and we use the count function for that. Here you can see the SQL-type constructs: from the ratings dataset, select the column user, take just the distinct users, and then count them. So here we have the first result: we have this many movies, this many ratings and this many users. And this is an example of take, which is very common for data frames: it takes 5 elements from the data frame and shows them on the screen, it just prints them. Okay, so you can see the format.

Yeah, that's a very good question. The question was whether, when you create the data frame, Spark waits for an operation or whether it does it immediately, splitting the data and creating the frame in memory. Ideally it should be lazy; you can specify the laziness level, but I think by default it's lazy, so it only happens when you do an operation. That's good in a way for performance reasons, but if you have problems in your code it can be very annoying, because something gets executed at a point you're not expecting.

Okay, what would happen if we changed the number after take? So, you mean as an example of just editing the notebook? You can't all see the shell, but if you change the number in take to 10, for example, the output still counts the number of movies and the number of users, which stay the same, and now we see 10 films.

So the next question is, how does this dataset look? Apparently it's a very good dataset, very common in research for recommendation engines, a very standard dataset, but let's look at it. The first thing we're going to do is look at the distribution of ratings: what do people give most often as a rating? You can see it's not uniform. The most common rating is four stars, which is a bit surprising; people give lots of four stars, then three stars, then three and a half. Okay, it's sensible, it corresponds to something you might expect to see. And this is another advantage of Jupyter notebooks: if you want to do interactive, prototyping data science, you get quick visual feedback, you can plot something very easily. We're using a library for Python called seaborn, which is a wrapper around matplotlib, the default plotting library for data science in Python, and it has very simplified commands; one of them is this plot, which just shows a histogram of the data.

Right, so let's explore the data a little more. The second question is, do users all give the same number of ratings? If you have 1,000 ratings, does that mean 100 users gave 10 ratings each, or do you have users that give lots of ratings and some that give few? Intuitively we know they should give different amounts of ratings, right? So here we use some more SQL-type operations: we aggregate by user, count the number of ratings each one gave, and look at the distribution.
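Roughly, the exploration just described might look like the cell below; column names follow the loader sketch above, and the plots assume the columns fit comfortably in local memory once pulled back with toPandas, which is fine for the small 100,000-rating set. (seaborn's distplot was the usual call at the time; newer seaborn versions use histplot.)

```python
import seaborn as sns

# Simple counts, using the SQL-style DataFrame API
print(ratings.count())                                # number of ratings
print(ratings.select("user").distinct().count())      # number of distinct users
ratings.take(5)                                       # peek at five rows

# Distribution of the ratings themselves
sns.distplot(ratings.select("rating").toPandas()["rating"], kde=False)

# Ratings per user: aggregate by user and count
per_user = ratings.groupBy("user").count()
sns.distplot(per_user.select("count").toPandas()["count"], kde=False)
```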
And okay, we see this is sensible as well: many users, the majority, give very few ratings, they just rate a few movies they like a lot, but then you have the super users, the people that really give ratings, and they give like 2,500 ratings; these are people that are basically constantly rating things. But just looking at the histogram it's hard to see how many of them there are, so let's actually count them. How many super users do we have, how many people give more than 1,500 ratings? I suspect it shouldn't be a lot, right? It's actually four. Okay, that's good, so just four people really love giving ratings to movies. And how many give fewer than 200 ratings? That's a lot, right? It should be the majority. Yeah, almost 500 people.

Okay, so finally, to finish the data exploration, what about movies? Do movies all have the same number of ratings? We shouldn't expect so: a very popular movie should have lots of ratings and less well-known movies should have few, so we should see something similar to the ratings per user. And that's it: the majority of movies have just a few ratings, they're probably not very well known, and some movies have lots of ratings.

Okay, so this is another count. Fun fact: if you were actually going to watch 1,500 movies and they were each 90 minutes long, you'd spend 93 days watching films. I think that's something we can all aspire to. Alright, so yeah, super users.

Okay, so now, quickly, the thing I think most people want to hear about: what are recommendation engines and what do you do with them? There are lots of types of recommendation engines; you have knowledge-based recommendation engines, you have several approaches to the problem, but the one we're going to talk about today is called collaborative filtering, and it's a pretty established and standard way of doing recommendation engines. So how does collaborative filtering work? You basically use it every day in a common-sense way and probably don't even think about it. Collaborative filtering works by finding affinities between groups of users and saying: if these users like similar things, then they'll probably agree on something that one of them doesn't know yet. I'll give you an example. Say you have two groups of friends. Group one has very similar music taste to you, so they like all the things that you do, and group two is a group of friends where everything that they like, you hate; every piece of music they listen to, you hate. So if group one recommends you an album, "you should really listen to this", and group two recommends you an album, "you should really listen to this", which one are you going to choose? Right, okay, so that's common sense. And this is collaborative filtering, in a way: we try to find affinities between users, and if those users like something that you don't know, then it's very likely that you're going to like it as well. Just as a bonus question: if group two doesn't like something, does that mean you're not going to like it? No, you don't know, because you don't have a positive correlation there, you don't have a relation between the information, so you can't really say for sure.
So yeah, it only works when people agree on something. So how does this work? It works by relating three quantities: all you need is a user, a product and a rating. For the algorithm to work, users have to have a unique integer ID, products have to have a unique integer ID as well, and ratings can be anything as long as they're numerical. You don't have to have a particular scale; it doesn't have to be 0 to 5 stars or whatever, it can be any number.

So how does this work? Basically, we build a matrix called the ratings matrix. On one side, the columns, you put all the users, and on the rows you put all the products, and so you build a matrix of the ratings you have: user two gave product one a 4.5, and so on. Obviously most of these entries will be empty, because you don't know them; if you knew them, you wouldn't need a recommendation engine. So the way the algorithm works, in the most typical implementation of collaborative filtering, which is alternating least squares, is by factorising that ratings matrix into two latent factor matrices, U and P. These are all your ratings, and these are the factors, which means that this matrix multiplied by this one gives you this one. And if you find those factors, when you multiply them back together you get all the entries filled in with an approximation, because they're filled with the values that make the product as close as possible to the original matrix. That basically means that the numbers you didn't have become predictions. So you can say: user three didn't see product one, but the value in position three-one that makes the reconstructed matrix closest to the original is 3.8. So this is a classic least squares problem, where you iterate until you solve the matrix problem with a loss function.

Okay, so, I think I've got a different version of the presentation running to yours... I can show you, sir, if you just want to refresh. So, we're trying to find matrices U and P such that when you multiply them back together, the ratings that a user has already given are as near as possible to what was originally there. When you multiply the first row with the first column, you want to get what the first user actually rated that film, for the ratings they did give, and as a result we get those missing values filled in.

Something that's really nice about the user factors and the product factors: you can associate the first row of that user matrix with user one, and it tells you something about user one. Now, the factors in there aren't a traditional feature vector in the machine learning sense; I couldn't tell you what the first one means, or the second, or the third. But what you could do is look at the difference between two users by comparing their vectors, so you get some sense of distance between the users. They're two numeric vectors: you could take the mean squared error between them, compare them, see how different they are, and if that number is smaller than for a different pair of users, then your first two users are relatively more similar than your second pair. Similarly, you can make comparisons about the films in exactly the same way: the different rows of that matrix P, or the columns in this picture because we've turned it on its side, each correspond to a product, in this case films, and we can compare those in the same way.
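To make the factorisation idea concrete, here is a small self-contained numpy toy (not part of the workshop notebook, all numbers are made up): two small factor matrices multiplied back together fill in every user-film cell, and two user vectors can be compared with a mean squared difference.

```python
import numpy as np

# A toy rank-2 factorisation: U has one row per user, P one row per film.
U = np.array([[1.2, 0.3],      # user factor vectors (3 users)
              [0.8, 1.1],
              [0.2, 0.9]])
P = np.array([[1.0, 0.5],      # product factor vectors (4 films)
              [0.4, 1.3],
              [1.1, 0.2],
              [0.7, 0.9]])

R_hat = U @ P.T                # 3x4 matrix: every (user, film) entry is now filled in,
print(R_hat)                   # including pairs that were never rated

def msq(a, b):
    """Mean squared difference between two equal-length numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

# A smaller value means the two users look more similar in factor space
print(msq(U[0], U[1]), msq(U[0], U[2]))
```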
Okay, so in order to train a model, what we do is take our ratings matrix R and take 60% of it as training data. We're going to use that to learn those matrices U and P; they're what we're aiming to learn, and once we have them we can make ratings. But in order to determine how good our U and P are, we want to test on a validation set. If we just tested how good the model was on the training set, it's likely you'd get a wonderful result there; that's what we call overfitting. It fits the training set perfectly, and when you actually go to make a prediction for a different user that wasn't in your training set, your results are terrible; your model is what we call overfit. By keeping that validation set and evaluating the model on it, we can determine in a more robust way how good the model is. And we measure this through mean squared error: for the ratings in our validation set, we ask what rating user i gave product j and what our matrix predicts user i would give product j, and we compute the mean squared error, the squared difference between the prediction and the truth, averaged.

If we switch over to the notebook, we can see a quick example of computing the mean squared error. In there it explains the mean squared error itself, and we've got two vectors here, so you could imagine, like I was saying, that these represent two users. If we execute this cell with Shift and Enter, you can see the star shows it's running and then it turns into a number, so we know it's run. Then we've got a function here to compute the mean squared error. What it's doing is zipping up two vectors: it takes in two values and zips them, effectively combining them, I think of it as a table, so they're now in pairs. Then, for each pair in this zipped-up collection, we take the first element, which we can think of as the truth, and the second, which we can think of as the prediction, we square the difference, and then we scale by the length. You can see here the mean squared error for those vectors is 0.13. That's just an illustrative example, and we'll use it later when we train a model and evaluate it.

So the ALS model has a few different parameters. We've talked about the factor matrices U and P, and one thing that the user gets to tune is how many rows or columns there are in U and P, how big they are. We'll be tuning for that later, and that's called the rank. Rui, do you want to talk about the cold start strategy parameter?

So, Sophie explained what the mean squared error is trying to measure, and I'll talk a bit about the latent factors; she's already talked about the rank as well. As I said, our objective is to find those two factor matrices; that's how alternating least squares works. It works by solving a least squares problem: it keeps one of the matrices fixed and solves a linear system of equations for the other, and then it alternates, fixing that one and solving for the first, going back and forth until it converges on a solution. The problem is, when is that solution good enough, when should it stop? Ideally you could just let it run forever, but you don't want that, so you have some parameters on Spark's ALS; Spark has an inbuilt implementation of ALS to work with.
But when you're doing this approximation you might get non-numerical values in the solution, for instance if one of the matrices is singular, so you have that option called cold start strategy: "drop" just drops the values that come out as NaN, so they don't get taken into account. Sophie talked about the rank, which means that when we're solving these equations we need to say how wide the factor matrices are, how many columns or rows they have; that's the rank. We don't know what the rank should be: it could be 3, it could be 100. Obviously it stands to reason that if you ask for a rank of 100 you're going to have two big matrices to multiply, so it's going to be slower, it's going to take longer to solve. Do you want to... It's also going to overfit much worse if you have too high a rank, isn't it? Exactly, and Sophie's going to talk in a minute about the parameter that deals with that.

Yeah, so max iterations is just the maximum number of times that it iterates between optimising those matrices U and P: fix U, optimise for P; fix P, optimise for U. And I've mentioned overfitting a few times. If this was my data and I wanted to fit a model to it, the model which would minimise mean squared error would be a zigzag line that went through all of these points, because at those points the model fits perfectly. But I think you would all agree that what we would prefer would be more like a straight line. We could argue later about whether that's a good fit, but you can see that it kind of captures what's happening in the data, and that is what we need in order to make good predictions: we don't want to overfit. So the regularisation parameter just prevents overfitting. You'll see one in most, hopefully all, machine learning algorithms, otherwise you're going to have a problem when you put the model into production or try to make predictions on data you haven't yet seen. So let's flip back to the notebook and actually start running these models, tuning parameters and getting going. Do you want to talk about them?
Yeah, so: if my model overfits, should I increase the regularisation or decrease the rank? We're going to go into what the best way of estimating the parameters is; there are some standard ways of determining the best parameters and we'll go through that in a second, it's a good question. Sorry, the question was: if your model is overfitting and you want to change something like the lambda or the rank, how do you change it, do you shrink it, do you make it bigger? The answer is that there's a standard way of testing what the proper value is; it just depends on how much time and computation you have. In general I would say that if you know your model is overfit, you want to tune the regularisation parameter rather than the rank, but that's a really important point; I hadn't really considered that the rank could lead to overfitting, so that's a very good point. We'll see later.

Okay, so the first thing we mentioned is that we should split the data, because we obviously don't want to calculate how good our model is on the same data we trained it on; the answer might be that it's very good when it actually isn't. So we split the data, and as you remember, Spark spreads all the data across numerous nodes on the cluster, so how do you split data that's spread over a thousand different computers? Fortunately Spark has a function that deals with that, called randomSplit. All you do is pass it proportions: say you want to split into three, you give it three proportions that hopefully add up to one, and each one represents a percentage. So here we're going to split the data set into roughly 60% on one side and the rest across the other two. It's a handy thing: you don't need to think about how Spark does it, it just does it. We've split into three here with the view that we're going to retain one of those chunks: in the slide I showed that we just split into two, one for testing and one for training, but actually we're going to keep back a third set that we'll use to iteratively check how well our model is performing when we put it into production. So we want to keep that third set back for the production case.

So, about the training, yeah, I think we can keep going with the slides, sorry, with the notebook, you can go. Okay, so as we said, Spark provides an off-the-shelf implementation of ALS. That's quite good, because ALS is actually trivial to implement locally on a single machine, but, as with anything else, as I'm sure you've experienced, as soon as you try to distribute it, it's a big headache; distributed computing is not simple. Spark has a very good implementation, it's actually very clever: it does things like trying to find affinity between user vectors to keep them close in the cluster, so it minimises the shuffling of data between nodes. So it's quite good, and it's very simple to do recommendations with Spark. They have a class called ALS, and you just need to pass it these parameters; all of them have defaults if you want. Here we're going to instantiate an ALS class with maximum number of iterations 5, regularisation parameter, the lambda, 0.01, rank 3 (remember, that's the number of columns or rows in U and P), and cold start strategy drop. And if you notice, we don't pass any data, we just instantiate the class.
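A sketch of the split and the estimator, under the same assumed column names as before; the exact arguments in the workshop notebook may differ slightly.

```python
from pyspark.ml.recommendation import ALS

# Split into three chunks; the exact proportions here are illustrative
training, validation, holdout = ratings.randomSplit([0.6, 0.2, 0.2])

# Instantiate the estimator with the arbitrary first-guess parameters from the talk
als = ALS(maxIter=5, regParam=0.01, rank=3,
          userCol="user", itemCol="item", ratingCol="rating",
          coldStartStrategy="drop")
```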
Okay, it has a method called fit, to which you actually pass the data, and it will train the ALS model. That's all you need: you create a simple ALS, call fit with the training dataset, and that's it. It should take a few seconds, 10 or 20 at most. Okay, so that's it: in about 8 seconds we have an ALS model ready to do predictions, trained on a distributed cluster, and you didn't have to do much; that's quite good. But this is not something you'd do in production: where do the 5, the 0.01 and the 3 come from? They're completely arbitrary, completely made up; you have no assurance that these are the best values. I'll get to that in a second.

So Spark has two more classes I wanted to talk about. One is called the RegressionEvaluator, since ALS is actually a regression in some ways. You instantiate it by passing it the error measure, in our case mean squared error, the name of the column of the data frame that contains the true value, and the prediction column, the name of the column that contains the prediction. This could be anything: true price of a house, predicted price of a house, and which metric you want to use. So we instantiate this, and it calculates the MSE. And the model has a second method, called transform, and this method does the predictions: you pass it a data frame with user IDs and movie IDs, and it predicts, with your model, the ratings for those users and movies. So if we run it, we make predictions with our model on the test data set and we use the evaluator to calculate the mean squared error. It should take a few seconds as well.

Okay, so we have this mean squared error. As the name implies, it means that on average the squared error between the true rating and the predicted rating is this value, which is a bit big; it could be much better, obviously. A usual sanity test people do when they have a model is: how much better is this model than just using, say, the average rating? If every prediction were just the average rating, how much better would the mean squared error from our model be than that model, which is obviously a very bad one? We can check that. We use, as you can see, the SQL-type constructs of data frames again, so what we're doing here is just like SQL, calculating the mean rating: you group and take the mean of the column rating, and then we collect the value. On Spark all your data is distributed, but if you want to pull a value back to your own node you do a collect, so we're actually fetching the value from the nodes, and we see: well, that's the mean rating people gave. Now we create a data frame with that mean value everywhere, which is kind of faking a model that always predicts the same value, and we calculate the mean squared error of this bad model. Okay, so it was 1.08 or so, which is worse than our model. It means our model is better, not that much better, but it's better, right? It's slightly better, that's an improvement. But the question now is obviously: how do we know which parameters we should use for the model? Do you want to do this or do you want me to? We can carry on through the notebook. Do you want to go?
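Pulled together, the fit, the evaluator and the mean-rating baseline described above might look roughly like this, continuing the earlier sketches; the baseline construction is one way of doing it, not necessarily the notebook's.

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import functions as F

model = als.fit(training)                         # train; a few seconds on the small data set

evaluator = RegressionEvaluator(metricName="mse",
                                labelCol="rating",
                                predictionCol="prediction")

predictions = model.transform(validation)         # adds a "prediction" column
print(evaluator.evaluate(predictions))            # MSE of the arbitrarily-parameterised model

# Sanity-check baseline: predict the global mean rating for everything
mean_rating = training.agg(F.mean("rating")).collect()[0][0]
baseline = validation.withColumn("prediction", F.lit(mean_rating))
print(evaluator.evaluate(baseline))               # should come out worse (larger) than the model
```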
Yeah, okay, so I'm just quickly going to talk about parameter estimation. The most typical way, since we have a way of calculating how good our model is, the mean squared error or any error measure, is a kind of brute force technique. You take all the parameters you want to estimate and you make a grid of those parameters with a range of possible values: say I want to test all the ranks from 1 to 10, and all the lambdas up to 2.0 with a certain step. With those values you create a grid, and all the combinations of values get tested by Spark, and in the end it says: of all these combinations, the one that gives you the lowest mean squared error is this one. That way you can say: out of the options I gave to Spark, these are the best parameters I can use. There are a few classes to help you with that.

But keep in mind this has a few problems. One is that it's computationally expensive: you obviously want to test the biggest set of parameters you can, and that takes time. Here the model is quite simple, it takes about 8 seconds to train, but you can imagine that the parameter grid grows very quickly: add another parameter with a hundred values and you have a hundred times more models to train. So it takes a long time, and because it's a brute force method you might be missing something if you don't test all the parameters. Let's try to do it anyway. This is a class in Spark that helps you build the parameter grid, called ParamGridBuilder, and you basically just add a grid for each of the parameters you want to test, with the values you want to try. In this case, because I don't want you to be here for five days, we just gave it two options for the rank and two for the max iterations. Again, it's better than not having any kind of parameter estimation, but it's still a bit arbitrary: why would I choose 6 and 8, or 10 and 12, and what about the values in between?

How does it know what the step is? You actually pass the exact values you want to try, so in this case we're just passing a list with two numbers, so it's only going to test 6 and 8; it's not a range. In Python you have, sorry, I don't remember if you said you had experience with Python or not, okay, in Python you have a function called range which gives you a range of numbers and lets you specify the step, so if you had the time and wanted to say "test all the numbers from 0 to 10", you could write the same thing and put range(10) here, or range(1000) if you wanted a thousand. Also, the rank is always an integer, so it doesn't make sense to use a step of 0.5? True, but if you were tuning the lambda, for instance, then you could.

So that's the idea: in this case we just pass the actual numbers. Max iterations, which is the number of times we iterate between the optimisation steps, is a parameter where the bigger the value, the better your results are going to be; they never get worse if you run the optimisation for longer. However, the longer you run it, the longer it takes. In general, what we found when we were working on this in practice and trying to tune for maximum iterations was that there wasn't really any point, because we knew that if you picked a bigger number it would get better; it was all just a matter of time and how long we had.
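A rough sketch of the grid search described above: the grid values come from the talk, while the choice of TrainValidationSplit is an assumption; the notebook may equally use cross-validation or a hand-rolled loop over the grid.

```python
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

grid = (ParamGridBuilder()
        .addGrid(als.rank, [6, 8])        # the actual values to try, not a range
        .addGrid(als.maxIter, [10, 12])
        .build())

# One way to run the grid search (an assumption, not necessarily the notebook's choice)
tvs = TrainValidationSplit(estimator=als,
                           estimatorParamMaps=grid,
                           evaluator=evaluator,   # the MSE evaluator from before
                           trainRatio=0.8)
tuned = tvs.fit(training)

best_model = tuned.bestModel
print(best_model.rank)                    # the best rank found in the grid
```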
We actually found that in general we could use about 10 iterations, even on the large data set; after that there's only a negligible improvement in mean squared error. You could use a cleverer rule to select the value for maximum iterations: for example, compare the mean squared error for a model with 10 iterations and one with 11, and if it has only improved by a negligible amount, which you have to define yourself, then stop and use that value. But in this case we're just looking at mean squared error.

So it took about 1 minute 6 seconds to run the grid of these four parameter values on binder; it would be quicker if you were running it locally. What we can do then is ask the model to print out the best rank, and it tells us that the best rank, in this case out of 6 and 8, was 6, and unsurprisingly the best number of iterations out of 10 and 12 was 12, because the longer you run it, the better.

Okay, so we've trained a model, we've talked about parameter estimation, and what we really need to do now is make some predictions. To make predictions we're again using this transform function, which just takes in our test set, and the code below is just amending the format so that it looks nice, so we get ourselves a nice table. What we're seeing here is the user ID, the title of the film and the predicted value, and we've also appended the true rating that the user gave. So you can see how simple it is to make and get predictions. Do you want to go?

Yeah, so as Sophie mentioned, to do the predictions we just use this transform method, so it's quite simple: we pass it a data frame containing user IDs and movie IDs and we get these predictions. If you look at them, for a model which didn't train for that long and didn't exhaust the parameter choices, you do get some good predictions, but you do get some really bad ones as well. Some of them are quite okay, they're close to the true value; if you just wanted to know whether someone liked or hated a movie, this would give a good idea. But some are quite bad, and if you want to quantify how bad they are, that's another advantage of Jupyter notebooks, you can quickly get a visualisation. So let's calculate the errors, the difference between the prediction and the true value, and plot the distribution, and you get something like this. Sorry, I don't think I executed this.

So, when you first opened your Jupyter notebook, if we had previously executed the cells during a different session when we were practising, you can see that the results are already in there, but nothing is actually loaded, so in order to get access to the items in the cells we have to re-run them. Yeah, these are just artifacts from previous runs. If you want to do it the proper way, you can just go to Kernel, don't do it now, but you can go to Kernel and choose "Restart and Clear Output", and it restarts the kernel and clears all the results.

Okay, so this is the error distribution, and you can see, well, it's not that bad: the majority of errors are pushed towards 0, and you just have a few big ones; it's very rare to have big errors. So even though we didn't have that much data and the training was a bit quick and dirty, it's not that bad. It's good enough that if you want to show your friends, or ask them which kinds of movies they like, you could probably use it. And let's calculate the mean squared error of this model; we had 1.0-something for the other ones.
We had 1.08 when we just returned the average rating across all the scores to every user, and 1.02 when we used our initial basic model, the one trained on the two arbitrary values that we picked, I think that was 5 iterations and a rank of 3, and we can see that now we're making a real mean squared error improvement. You can see that it's not excellent, it's not the best thing in the world, but it's a clear improvement on the previous one, and it's exactly the same model as before, exactly the same data set as before: you just change the parameters a little and you can see the difference it makes. It's much better, and we only tried two options for each parameter, 6 and 8, 10 and 12, so you can imagine that if you searched over a big range, probably somewhere in there you'd find the really good parameters that would give you a very nice mean squared error. So this is the crucial point: always estimate your parameters. Don't just use a value because someone told you that this rank is good or that lambda is good; that doesn't mean anything, you really have to test it on your data, on your model.

So, what good is a model like this? You want to do predictions, right? If you're considering using this in production, say you're working on an online bookstore, the classical example, and you want to recommend books to people, you want to do predictions. And Spark gives you facilities to do all of this out of the box as well: it has some inbuilt functions that give you the recommendations. The first one gives you the top-K recommendations: what that means is, if you choose K equal to 10, say, then it gives you the top 10 movies for each user, and that's quite useful, and it does this in a distributed way, so you'll get a result quite quickly even on a big data set. So this is giving, for user 471, the top 10 movies for him; that's quite good, it's what you'd expect to serve in a system like that, right? You want people to go to their profile page and see: these are the best 10 books or movies for you. And it gives you the reverse as well, the top 10 users for each movie: you can recommend for all items, so you execute this and it gives you, for each movie, the 10 people most likely to like that product.

Okay, and you can give it ad hoc predictions as well. The only thing you need to do is construct a data frame, and constructing a data frame in Spark, as you saw, is very easy: you instantiate a list of rows, and Spark has a createDataFrame helper function that gives you a data frame back. Here we're putting in pairs in the user-item format, so we're saying we want to know the rating for user 233 and item 901, and so on. You just create the data frame with those, and as before we use the transform method to create the predictions from the model, and there you go: you have the predictions for those three users and three movies, the ratings they'd be expected to give. Okay, so it's quite straightforward. So, do you want to?
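Continuing the sketch, the built-in top-K helpers and an ad hoc prediction might look like this; user 233 with item 901 is from the talk, the other pairs are made up for illustration.

```python
# Top 10 films for every user, and top 10 users for every film
top_films_per_user = best_model.recommendForAllUsers(10)
top_users_per_film = best_model.recommendForAllItems(10)

# Ad hoc predictions: build a tiny DataFrame of (user, item) pairs and transform it
pairs = spark.createDataFrame([(233, 901),     # pair mentioned in the talk
                               (233, 1084),    # made-up pairs for illustration
                               (471, 2403)],
                              ["user", "item"])
best_model.transform(pairs).show()
```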
Yeah, sure. So everything we've talked about so far has been pretty general and pretty abstract: we're telling you that we've trained a model, it's making decent predictions and the mean squared error is getting better, but you've got no real reason to believe that; you shouldn't trust us at this point. Whenever you're doing a machine learning method, it's often good to take a step back and ask: is this giving sensible predictions? Is this doing what I expect for some data that I know about? It's very easy to just optimise for mean squared error and say, oh well, the mean squared error is fine so our model is fine, let's ship it, put it into production, and actually your results may not be so good.

So what we're doing in this section of the notebook is making personal recommendations for ourselves. First, in this first cell, we're making ourselves a new user ID: we ask what the maximum user ID in our data set is and we add one to that, and that's us now, so we are user 611. Rui's written this lovely little function here which is able to find movies that have a particular word in the title, so I can make predictions on... I can't think of any films right now, someone give me a film. Rambo? Oh yes, my favourite. Okay, so we can type Rambo into there and we can see... typo, never trust your data... okay, so we can see these item IDs, it's giving us the item IDs for the films. What I can do now is say that, as user number 611, I'm going to rate the one I adore so much, which is the original, the first one, don't look at me, 2403, okay, and I love it so much I'm going to give it 5 stars. So you can go through and do this for yourself, you can see how simple it is to find the films. Thank you. So you can go through and rate films; this is actually Rui's set of ratings, surprisingly enough, and it's a good mix of, I don't know what type of films these are, action films, more action films, Rambo.

So what we've done here is create a Spark data frame: we've put our user ID in, which is 611, and then we've got our item as an integer and our rating as an integer, and you can see this table here. What we're going to do is join this onto all of the other ratings, because we want to train the model again and include the ratings that I've just made; I want the model to learn about me and be able to make predictions for me. So I've appended those, and, yeah, of course, that's right.

So this is pretty straightforward: we take the trained model, which is the best model, and the best rank from it, but don't get scared by this "java object parent" stuff. This is a quirk of Python Spark, because PySpark is actually communicating with Spark, which is written in Scala and Java, using Py4J, and the problem is that some of these things in the API are not exposed on the Python side, so you have to access the underlying Java object directly. Don't get scared, that's not the usual Spark API, and you don't normally need to do that kind of incantation.

Okay, so we've retrained the model with the updated data that includes my fabulous film ratings, and what we're doing now is making predictions for me, and the predictions I want are only for unseen movies, movies which I have not seen; I don't want a prediction for Rambo because I already know I love it. Thank you. So this gives us a set of unseen movies.
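Sketching the steps just described, under the same assumed column names: item 2403 is the Rambo ID from the talk, the second rating row is made up, and the retraining here simply refits an ALS configured with the best rank rather than going through the underlying Java object.

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

# A new user id: one more than the current maximum (611 in the talk)
new_user = ratings.agg(F.max("user")).collect()[0][0] + 1

# My own ratings as a small Spark DataFrame (the second row is illustrative)
my_ratings = spark.createDataFrame([(new_user, 2403, 5.0),
                                    (new_user, 1036, 4.0)],
                                   ["user", "item", "rating"])

# Append them to the full ratings set and retrain with the best rank found earlier
everything = ratings.select("user", "item", "rating").union(my_ratings)
retrained = ALS(rank=best_model.rank, maxIter=10, regParam=0.01,
                userCol="user", itemCol="item", ratingCol="rating",
                coldStartStrategy="drop").fit(everything)

# Unseen films: the whole catalogue minus anything I have rated, paired with my user id
unseen = (movies.select("item")
                .join(my_ratings.select("item"), on="item", how="left_anti")
                .withColumn("user", F.lit(new_user)))
my_predictions = retrained.transform(unseen)
```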
So these are the item IDs, and it tells me there are 9,721 movies in that database that I have not seen. In order to make predictions, I need to pair those unseen item IDs with my user ID, so that's what we're doing here, joining those up, and then we use this transform function that we've seen before to make predictions from our new model. The output is this, when I ask it to give me 10, and that's not so useful because I don't know what those item IDs are. So in this little bit of code here we're joining with the movies data that we loaded in, which contains the names of the movies, which is what we want. So these are some predictions, the top 10, unordered, and arguably it looks okay. Rui, do you like Dirty Dancing? No, I'd give it a 1.9. Right, perfect, okay. But that is just anecdotal evidence, so let's go ahead and take a look at the best rated films for Rui and me, with Rambo in there. There are a few things to call out here; one of them, apparently, is the French film that inspired Twelve Monkeys.

Can I just make a note, Sophie? If you're using Spark, and I keep banging on about it being a distributed computing framework, just bear in mind that some of these operations, like joins and sorts, can be very expensive depending on your data. If you have a large data set it can be very expensive: imagine that all the data is distributed, possibly in no order that you yourself can understand, so if you're doing something like a sort, if you're asking for all the data back sorted, you're going to have lots of shuffling of data between nodes. Obviously sometimes you have to do this, but use it judiciously and keep in mind that it can be very expensive. Sorry. No, that's a really good point.

So when we look at the predictions here, does anyone notice anything funny? The predictions, exactly. Sorry, a quick question: a prediction of six, is that a prediction that you like the movie even more than a five, or not? Right, so, I agree. If you imagine that earlier we saw a graph that looked something like this, and these were our original data points, and then we fit a line to them: when we're seeing something like six, what is perhaps happening is that I've come along and my data point sits out here, outside the domain of things we've seen before, and as a result, when we read off the line that we've fitted, we hit something larger than anything we've seen before. So that's where it's coming from: our model is extrapolating, making predictions outside the range of things it has seen. Given that this is the data we have, and this is arguably the best model we can fit to it, you could decide to discard that prediction, or you could say, okay, that's the best prediction we can make, but let's be cautious in this region. It did have me stumped for a long while when we first got predictions that were up at six. An important point is that a predicted rating of six, since you're squaring the error, is going to count as a bad prediction anyway: if we had actually rated that film in the test set, even rated it very highly, it would give us an error of around one, which is bad. It's a typical extrapolation error.
It's like measuring one person of a metre and a half going through a door, then a person of two metres, and extrapolating that the next person is going to be two and a half metres. Right, so that's the problem.

Has anybody heard of the majority of these films, the films in the list? No? Anybody heard of any of them? I mean, Rui thinks he's heard of one of them, maybe, but not really. At the start we looked at some of the films in the data set and we saw something like Toy Story; everyone's heard of Toy Story, and if you haven't, I'm sorry, you should, it's great. So to me this is a bit unusual: it's suggesting films that are really quite obscure. There's a good reason for this, and although the model is working well and we haven't made an error, we need to think about why this is happening and do something about it, in what's called a post-processing step.

So, post-processing. Imagine that when we put something like this into production, it's got all of these components; here we've set it up as a microservice-type architecture, with each component being standalone but communicating with the other microservices, and a post-processor is something that comes after the model: we make our model, we're happy with it, and then we put post-processing in. One thing that you might post-process for in the films case: you can see that the example we've done has been general, all we've had are user IDs and item IDs, we haven't anywhere used the fact that these are films. This films data set actually comes with tags, so each film has a tag that says this is an action film, this is a romance, and so on. And it may be that when I rate films, the majority of films I'm rating are rom-coms, or some terrible drama, or some awful, awful movie with the tag "hashtag awful movie". So if I was going to put this into production, when I made predictions for myself I would probably look at those tags and be more likely to suggest rom-coms near the top of my list than horror films, because I haven't watched any horror films and I don't plan on starting now. That's kind of intuitive, and it's something you'd do in the post-processing stage. You don't see the arrows? Oh, you can't see the arrows; there are arrows between those services but they don't show, they'll be in the slides.

So you can do many types of post-processing. Just as an example, and to see what happens, let's remove all the movies that have fewer than 100 ratings, because movies that hardly anyone has rated can end up with inflated predicted ratings. Let's just see what happens; it might be complete nonsense. So we filter on the rating count and we'll see what the predictions are now. You can do this type of exploration, and Jupyter notebooks are excellent for it, because you can just rinse and repeat; it's like a REPL with nice visualisation. So you can look at the actual movie names and see if it makes more sense now, and if it does, since we're a bit pressed for time, I'll leave it as an exercise for you to think about why it makes more sense. Do you want to go through the output? Yeah, I'll talk about it.

Good, so when we remove films that have been rated by fewer than 100 people and then return the top predictions for Rui, with Rambo rated, these are the results we get. Have people heard of these films?
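The count-based filter just described might look roughly like this, continuing the earlier sketches and keeping in mind Rui's caveat that joins and sorts can be expensive on large, distributed data; the 100-rating threshold is from the talk.

```python
from pyspark.sql import functions as F

# Keep only films that at least 100 people have rated, then rank my predictions
rating_counts = ratings.groupBy("item").count()
popular = rating_counts.filter(F.col("count") >= 100).select("item")

top_for_me = (my_predictions
              .join(popular, on="item")       # drop the obscure films
              .join(movies, on="item")        # attach the titles
              .orderBy(F.desc("prediction")))
top_for_me.show(10)
```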
Do these predictions look more in keeping with the ratings that Rui gave? I would argue yes. Again, this is just anecdotal, and you should definitely go through, change those films for yourself and see what comes out. But it's always good to keep in mind that things can be done at the post-processing stage while keeping your model the same — you just need to inject a bit of extra information, and you can do that quite simply.

OK, so the film data that we've talked about — can I have the slides? — the film data that we've talked about is explicit: people have made explicit ratings about films, and we went through and gave our own ratings, and so on. In practice, though, we get lots of things recommended to us daily based purely on our interactions. When you see ads on a web page, they're based on your interactions — you haven't explicitly gone to Google and said, hey Google, I love bikes, so show me adverts for bikes please. It's just using your user profile, what we call implicit data, to build a rating. We're able to use ALS, alternating least squares, that same algorithm, to make recommendations in the case where the data is implicit. I just want to motivate that a bit. There's a section in the notebook where you can go through and play with implicit data — I think we'll skip over it today, but please do have a look later and get in touch.

So suppose I listen to a song, and I listen to it once. What does that mean? Does it mean I like the song? Does it mean I dislike it? I might not even have been listening — I might have had my headphones plugged into my computer while Spotify was just playing and I was off making a cup of tea. But I listened to it, in some sense. Now take another song that I haven't listened to. I may or may not like it: I might know that I dislike that band and have actively not listened to it, or I may simply not have listened because, well, there are quite a lot of songs in the world and I don't think I've listened to all of them. I might not even know it exists — it could potentially be my next favourite song, we just don't know yet. What if I've listened to it 100 times? Yep — it's probably by Taylor Swift, and I probably like it. So there is some information in the number of times I've played a song, but it's not an "I like this 10 out of 10" or "I don't".

So how are we going to capture this in a model? In the implicit case, what we're trying to infer is a user preference: does user u like item i, yes or no. A zero denotes no and a one denotes yes, so we're going for concrete — you like it or you don't, we recommend it to you or we don't. The information we actually have we call a recording: R with subscripts u and i — which conveniently spells Rui's name — where u is the user and i is the item. In the songs case the recording would be how many times you've listened to the song, so for song A my recording would be one and for song B my recording would be zero. You could choose something else for the recording: for TV programmes it might be the length of time you were watching that channel; you could collate multiple sources of information, so the length of time you watched the channel with the number of days you watched that programme as a multiplier, and so on. So it's a user-defined
recording — you're going to have to put your own instinct into deciding what to set your recording as. We then simply set the preference to one if the user has interacted with that item, that is, if the recording is greater than zero, and to zero if the user has not interacted with it. Now, alarm bells should be ringing, because what that means is I'm saying I positively like an item if I've listened to it once, and I have no preference if I haven't listened to it — but we just motivated that there are different levels of liking, different sorts of confidence depending on how strong the recording is. So to capture this we introduce a notion of confidence: the confidence for user u and item i is one plus alpha — where alpha is some scaling parameter — times our recording. As the recording gets bigger, the confidence gets bigger. If the recording is zero, what value does the confidence take? One, perfect. So what we're saying is: if the user hasn't interacted, we think they probably don't have a preference, and we have some confidence that they don't have a preference. Again, that's something we can debate and argue over, but for now we'll stick with it. The algorithm is then doing a least-squares minimization: we've got the user vectors and item vectors like we had before, and P is our preference matrix, so this looks similar to the minimization we were doing before but with the confidence factor out front. I don't want to bore you with the maths — if anyone wants to hang around and discuss it further, get in touch.

If we flip over to the notebook, there's an example in there that we won't walk through fully, but it uses some library data. There's an open-source data set that gets loaded in if you use the notebook through binder, and it says whether or not people interacted with a book — did they take it out from the library or did they not — and that's then used to make predictions. The simple thing to point out is that once we've loaded the data we can have a look at it, and again it's just of this form: user ID, item ID — here the item IDs are the ISBNs of the books. Again it's completely general: we're using books data, but it doesn't matter, you could put any implicit data into the algorithm. The only thing I really want to point out is that the model looks very similar when we build it — can you find me the model-building bit? I'm looking for the bit where it says implicit... ah, OK, thanks. This cell is exactly the same as what we saw up top: we're doing some parameter tuning, and you can run through it — it's interesting to see how the parameters might vary on a different data set, or in the implicit case — but everything is the same, with the exception of this new option, the implicit preferences parameter, which is set equal to true. That just illustrates how simple it is to amend things and use implicit ALS on Spark.
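For reference, switching the Spark ALS estimator to the implicit case is essentially a one-line change — a sketch under the assumption of a `recordings` DataFrame with columns (userId, itemId, recording); the parameter is called `implicitPrefs` in `pyspark.ml.recommendation.ALS`, `alpha` is the confidence scaling discussed above, and the specific values here are made up:

```python
from pyspark.ml.recommendation import ALS

als = ALS(rank=10, maxIter=10, regParam=0.1,
          implicitPrefs=True,   # treat ratingCol as a recording r_ui, not an explicit rating
          alpha=40.0,           # confidence scaling: c_ui = 1 + alpha * r_ui
          userCol="userId", itemCol="itemId", ratingCol="recording",
          coldStartStrategy="drop")

implicit_model = als.fit(recordings)
```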
So everything we've talked about so far has just been in a notebook, and along the way we've noted the issues you might have when you put this into production. We've actually built a more robust system that runs outside of the notebook — Rui, do you want to tell everybody about radanalytics? So radanalytics.io is a community project that tries to showcase building intelligent applications using technologies such as OpenShift — which I'm sure most of you are familiar with, or if you're not, you should at least be familiar with Kubernetes — and Spark as well, and in some cases, as in this case, recommendation engines. It has lots of tutorials and use cases, and one of them is the Jiminy project. The Jiminy project — sorry, if I can find it here... alright, OK — is a kind of production-ready system built on microservices on OpenShift using Apache Spark. Basically it splits out a full application: consuming the data, data stores, a front end, a model store, a caching service — everything following a microservice architecture. There are some interesting things you might look at to get ideas: you can look at the code, for instance, for some practical ways of solving problems, like splitting the model prediction into two microservices. You have something called a modeler, which continuously checks the data store for changes; if it finds a change it triggers a model rebuild, and that model is serialized to a database and can be retrieved by the predictor service, so you have continuous model training. I encourage you to check it out if you want to see how something like this can be used in a real application instead of a notebook — it's a very good example. There are lots of other projects too if you're interested in machine learning and distributed machine learning on OpenShift, so I encourage you to check out the rest of the site, and please give comments — PRs are always welcome, issues as well, PRs more welcome than issues. So yes, please check it out.

OK, so should I talk quickly about streaming data? We've been going a little while and people could ask questions — 20 minutes? OK, that's good, sure. Right, so now I just want to talk a little bit about streaming ALS. Spark provides batch ALS off the shelf — that's what you get with Spark — but the cool thing about Spark is that it's also a framework on which you can build your own algorithms, so you're not restricted to what Spark gives you, and one thing you can build on Spark is a way of training these models in a streaming fashion, which is very useful. So how does it work? If you recall, in the batch case we had the ratings matrix and we wanted to calculate the latent factors, and if we got a new rating — just a single one — and wanted to retrain the model, we'd have to recalculate the whole thing, both matrices all over again, as an iterative process, computing the factorization maybe a hundred times. And this is for just one rating: one person said this book is really bad, and you have to retrain the whole thing. You can think, well, this is kind of a waste — so it gets batched, probably at the end of the day, during the night or something. But what if there was a way of updating those two factor matrices without redoing the whole thing? That would be really good, and it turns out there is: we can do it using stochastic gradient descent. How do we do that? I'm just going to give you an overview. In the batch case we had the predicted rating, and our loss function — what we were trying to minimize — was the squared error given that predicted rating. In streaming, stochastic-gradient-descent ALS, what you do is add a bias term to this predicted rating, and the bias term is calculated from a global bias, which is basically an average of all the ratings,
a user bias, which is an average of the user features, and a product bias, which is an average of the product features. And then you have this update process: you update a row or a column of the latent factors given a learning rate, the error of the new rating, and the corresponding user and product vectors — just those vectors. So what does this mean? It means that when you get a new rating, instead of recalculating the whole thing, you just update two slices. Say you've got user X and product Y — me and Rambo — then you just update the row for me and the column for Rambo. And this is very good because it actually converges: in the end you get a similar result to the batch method, which is really good. It's not fully online in the sense that you're not updating just a single value, but if you imagine a massive number of users and a massive number of movies, you now only need to update one column and one row — that's pretty handy. The specific name of the method is the biased stochastic gradient descent method of calculating the factorization. It's important to remember that these two methods are trying to do the same thing — their aim is just to calculate the latent factors — they're simply different approaches to doing it.

In the notebook you also have a little demonstration of how this method works. Since this is not implemented in Spark, it's implemented locally — it's just Python running in the notebook — but it gives you an idea of the method. There is actually a link to a repo with an implementation of this method for Spark if you want to check it out, but since that involves installing a JAR written in Scala for Spark to access, it would be a bit tricky to do on binder, so we decided just to show the local Python version. So basically we now read the data into a matrix — we're not using DataFrames any more, just reading it into a matrix — and the next thing we do is the splitting between training and testing, an 80%/20% split, so you can see the process is quite similar to what we did before in Spark. Sorry, I'll just show you the result cells. Since we don't have any nice helper functions from Spark, we have to do the split manually: we split 80%/20%, divide the indices, and then use the indices to build two sparse matrices, one training and one testing, and fill them with the values — so this is all just creating the training and testing sets. You can see most of the values are missing: it's quite a sparse matrix, we don't have many ratings in this sub-matrix. The second thing we do is initialize the user factors and the product factors with random values — that's all this cell is doing. We choose a rank of 20 — we could have chosen anything else, and really we should have tried a parameter search — and this is what the parameters look like, a random feature matrix. Then we choose the learning rate — the gamma from before — the user lambda, the item lambda, and the maximum number of iterations, and this is just the main loop: you can see it's updating the rows with the gradients — that's all it's doing.
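To make that update concrete, here is a toy numpy sketch of what one step of that loop does for a single rating — the biased prediction plus the gradient step on just one user row and one item row. It mirrors the notebook's local implementation in spirit only; the function name, default values and exact bias handling are assumptions made for illustration:

```python
import numpy as np

def sgd_step(u, i, r, user_f, item_f, b_u, b_i, mu, gamma=0.01, lam=0.1):
    """One biased-SGD update for a rating r given by user u to item i."""
    # biased prediction: global mean + user bias + item bias + dot product of factors
    pred = mu + b_u[u] + b_i[i] + user_f[u].dot(item_f[i])
    err = r - pred

    # only the slices belonging to this user and this item are touched
    b_u[u] += gamma * (err - lam * b_u[u])
    b_i[i] += gamma * (err - lam * b_i[i])
    user_f[u], item_f[i] = (user_f[u] + gamma * (err * item_f[i] - lam * user_f[u]),
                            item_f[i] + gamma * (err * user_f[u] - lam * item_f[i]))
    return err
```

Streaming then just means calling something like this once per incoming rating, instead of sweeping the whole ratings matrix.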
Note that this is still doing it for the whole data set, though. A nice exercise would be: if you decided to add a new user — yourself, say — or a new movie, what would you do? You would just add a new column to the item factors or the user factors, that's it, and repeat this iteration for that column. So in a nutshell, this is just a descriptive way of showing it — you can see it's training on the same data and getting a similar MSE, considering we didn't even try a simple parameter search and these are fairly arbitrary parameters — and this is just the evolution of the MSE with the iterations.

So what's the big advantage of streaming? Well, if you have a big system in production with a massive data set, you might want to serve predictions very quickly, so you don't want to retrain the model, say, daily — you might want to update it on the fly — and this is a good way of doing an online prediction method. A couple of things to keep in mind, just to finish. First, as with any model in machine learning, be wary of the cold-start problem: if you're doing something online, you might be tempted to start giving predictions straight away, but you might only have about 10 users, so all the predictions are going to be really bad — they might be rubbish. So either model offline first with a decent-sized data set, or always validate your predictions, and be very cautious with what you're predicting in the beginning. The second thing is that parameter search for streaming models is not as simple as with batch models, at least in a true streaming fashion, because in a true streaming system you're discarding the data: you get the rating, you train the model, and then you don't look at that rating any more — because if you're saving the ratings and picking them back up from a massive data set, what's the point of using streaming? You might as well have used batch. If you want to retrain the model in the batch case, that's very simple: you just take all the data and retrain with the whole thing. With streaming, if you want to retrain the whole model from scratch, you need to take all the data into account again, which is a bit of an anti-pattern if you're doing streaming programming. So I think that's it for me on streaming data — are we OK for questions, or do you want to close? Yeah, thank you very much.

So if any of you have any questions — and sorry we had to rush this last bit a little; we were having a nice talk, so in the end we didn't finish the whole notebook, but please do try it, and if you have any questions you can also open issues on the repo. When we put the slides online we'll make sure there's a link to all of the resources in the last slide — not just the notebook but also the version that runs nicely on OpenShift from radanalytics, and Rui has some really cool implementations of the streaming stuff running outside of a notebook, so we'll set that up. But please do shout. That's right, we'll make sure the slides go into that repo, and DevConf puts the slides up as well, so when the slides are put online they'll definitely be in the repo.

So, how do I know my model is extrapolating with a more complex model than a line? Are you talking about ALS specifically, or about machine learning and data science in general?
I'm not sure this can be answered in general, but in this case it's quite easy: in ALS it's quite easy because the original data is quite bounded in its values. If you have a website that allows people to make ratings, you know you're expecting ratings between something and 5 — 0 and 5, or 1 and 5 — so if you start getting predictions of 7, you obviously know something is wrong. There are a couple of things you could do — it depends on what's actually happening. You could try to renormalize the predictions you're getting, or you could just ignore them, if you think it's a one-off and not a problem with your model, just something about that user. Because if you think about it, with ALS on a massive data set — the MovieLens data set has a version with millions of users and millions of movies — you're always going to get some outliers, some anomalies: people have weird rating patterns and their predictions are going to be all over the place. So it's your choice whether to ignore them; if it's a systematic problem, then probably you shouldn't. But yes, in this case you have quite well-defined boundaries for the predictions, so you know when you're extrapolating.

That's true, I agree, but you can also think of the example where, yes, all of our scores are between 0 and 5, but it's feasible that there's actually nothing in the region of 3 to 4, and if a prediction sits in that region you could argue that we're extrapolating there as well — it's still within the domain, but it doesn't look like any of the data we've seen before. I think extrapolation is part of a bigger problem, and that's to do with your data changing over time: the input data you're now seeing not being the same as the data you trained on. Solutions are tricky, but what people are doing at the moment is monitoring the data that's coming in and asking, have we seen anything like this before? If the answer is no, then you are going to be extrapolating. You can try continuous validation of the model as well. Yeah — it depends on the severity of the extrapolation; if there's something really wrong with your model, it's going to be picked up by continuous validation.

Is it possible to compute some kind of confidence interval, in the sense that it should become wider where there's less data? Right, so the idea is that if we're giving a prediction from a region where we have a lot of data, we'd say we're more confident in it than if we were in a region where we have less data. Interesting question. Formally, I don't know how you would compute a confidence interval here, but intuitively there are quite simple ways to say, OK, the nearest point we've seen is this close and there are a few in this area, so yes, we're confident — versus, we don't have much data in this region, so we're going to take the prediction our model is giving with a pinch of salt.
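Circling back to the renormalising idea a moment ago, the cheapest guard is simply to clamp, or at least flag, predictions that fall outside the known rating range before serving them — a sketch against the hypothetical `preds` DataFrame from earlier (the small MovieLens data set uses a 0.5–5.0 scale):

```python
from pyspark.sql import functions as F

MIN_R, MAX_R = 0.5, 5.0  # rating scale of the small MovieLens data set

preds_checked = (preds
    .withColumn("extrapolated",
                (F.col("prediction") < MIN_R) | (F.col("prediction") > MAX_R))
    .withColumn("prediction_clipped",
                F.when(F.col("prediction") > MAX_R, MAX_R)
                 .when(F.col("prediction") < MIN_R, MIN_R)
                 .otherwise(F.col("prediction"))))
```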
Does anyone else have any ideas about this? I guess I would just train two models — one for the prediction and one for the confidence; the confidence could be as simple as a function of the data. Right, so the suggestion would be to train two models, one for the prediction and one for the confidence. Are you talking about — sorry, go ahead — I was just repeating the question for the mic — are you talking about a confidence interval in the statistical sense, or just a confidence? Yeah, I mean, since we were drawing lines — well, I guess you can do a confidence interval for regression predictions; of course what we're doing is different, but the line is blurred. Yes. I think one thing you could do is use the distance in the space of the computed feature vectors for the movies, and compute how close a movie is to the nearest movie that actually got enough ratings. That's very clever — that's lovely — because our feature vectors do tell us something about the users and the products, and I mentioned earlier that we can compare them: a distance between two user vectors to compare users, and a distance between two movie vectors to compare the movies. We could use that as a notion of distance for how confident we are that we've seen something like this before, so we could monitor those factors. Those factors are tuned by the model, so there's still a chance that we're extrapolating, but it would be interesting to investigate. Would that be a different problem — grouping users? So the question relates to whether this comparison of users, how close we are to other users, is related to classification and grouping users and so on — and yes, it certainly is. You could use those feature vectors, if you were happy with your trained model, to determine users who are similar, so that would be a classification problem. You can argue that collaborative filtering is doing a very complex classification.
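One way to prototype the distance-in-feature-space suggestion: pull the learned item factors out of the model and measure how far a candidate movie sits from its nearest well-rated neighbour, treating a large distance as low confidence. `model.itemFactors` is a real ALSModel attribute (a DataFrame of id and features); the `popular` ids come from the hypothetical post-processing sketch earlier, and this is an exploratory heuristic, not a formal confidence interval:

```python
import numpy as np

# learned latent vectors for every movie, collected to the driver
# (fine for the small MovieLens data set)
item_f = {row["id"]: np.array(row["features"])
          for row in model.itemFactors.collect()}
popular_ids = [row["movieId"] for row in popular.collect() if row["movieId"] in item_f]

def nearest_popular_distance(movie_id):
    """Distance from this movie to the closest well-rated movie in latent space."""
    v = item_f[movie_id]
    return min(np.linalg.norm(v - item_f[p]) for p in popular_ids if p != movie_id)

# larger distance => less like anything we have solid data on => lower confidence
```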
Alright, there's another question: for all of this to work, we presumably need to understand what the trained feature vectors mean — like, if I have two users who are the same but differ on one movie by one star, what does that do to the difference between their latent vectors? Is it much the same, or completely different? So the question is effectively about the interpretability of the feature vectors, and whether we can infer what the user feature vectors — what the entries in them — mean, is that correct? I'm rather clueless in this case; I don't think that we can. In that case, if two users had rated identical sets of films and on one film there was a one-star difference, I suppose you would be able to look at the feature vectors for those two users and try to infer something, but I would argue that it's much more complicated than that. If you wanted that sort of information, I would take a step back and go to a more classic clustering algorithm, or a combination of a clustering algorithm and this. I imagine the solution sounds simple, but the real question is whether you can understand what the latent factors mean — that's the question, right. Maybe we could say that if the difference between one user and another is sufficiently small, we could just recommend the same things for those two users and say they're so close that, for now, they're essentially the same user. So the suggestion is that if two user feature vectors are very similar, we could just use the predictions for one user as the predictions for the other. That would perhaps save on computational power, depending on how many users there are, what we have to compare and where we're storing our data, and it could be done in the post-processing stage as well — so that's an interesting thing to look at.

Yes — and that's information you have that you don't use in this case. Yeah, we haven't used that information, the tags, at all in this — oh my gosh, this notebook is long — OK, so yes: the ratings file, tags, movies and links, which are links to the IMDB page of the film itself, so you can get extra information about the actors who are in the film. It tells you the genres; you could do some cool statistical analysis of the movie poster; you could look at the average rating given by IMDB, and so on. So there are lots of ways to expand this if you were going to take the recommendation engine and put it into production.

What we've shown is the standard alternating least squares method, and lots of research has been happening in the literature since. There's actually an interesting paper — I don't have the link here — where Monte Carlo sampling is used on the actual feature matrices, which is actually close to what you were saying: you sample from a region of probability of users being close together, or movies being close together, and apparently it gives a massive boost in performance, because it's taking approximate values for the users and products. So everything we've talked about here is standard ALS — a bit of a crash-course introduction to ALS — and obviously, as with many things, a year on it's no longer state of the art. If you go to the literature you'll find many new techniques that expand and build on ALS; they're worth reading if you're really into the field, very interesting reads.

So I think with that we'd better call it quits. Thank you so much for turning up and engaging — we really appreciate it, especially after the party. And if you hang around in this room, our colleagues Will Benton and Mike McEwen are going to give
a workshop on streaming applications that uses radanalytics and some of the tools the team has created — we'd recommend it, but we won't be offended if you leave. Cheers, thank you.