Well, in this part of the course, we're going to be moving into a new section, which is machine learning. There's a lot of material to cover, but it's really just the tip of the iceberg for this field. Let me start by relating it to something we're familiar with, and then connect it to something new we're going to be doing, which is building models. All of you have been writing functions for a long time, right? I write some code, my function takes inputs, maybe in the form of parameters, and it has some outputs: maybe it prints something, or maybe it returns a value. And I can imagine a function doing something like making a prediction. Maybe my input is some details about a house that's for sale, and I'm predicting what it might sell for. When I have a function like that, it's an example of a model. And I could imagine feeding in a bunch of values at the same time and making a bunch of predictions. The idea of machine learning is that instead of having a human write these functions, we have a computer automatically generate them. And the way it does that is by learning from examples. We'll feed in a bunch of training data, with a bunch of different houses that have sold for different amounts and have different numbers of bedrooms and baths. The algorithm will try to infer things like how much a bedroom is worth, how much a bath is worth, and how valuable it is to have a newer house. Based on that, it can generate a function, and we can use that function to make predictions on other data. You can imagine why that might be useful: maybe you're doing property assessments, or maybe you're a realtor trying to figure out how to price a house properly.
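To make that concrete, here's a sketch of such a hand-written prediction function. The weights are made up for illustration, not learned from anything; machine learning's job is to infer numbers like these from data.

```python
# A hand-written "model": a function that predicts a sale price from
# details about a house. The specific weights here are invented.
def predict_price(beds, baths, year_built):
    # each feature is multiplied by a weight, then everything is summed
    return 40000 * beds + 20000 * baths + 500 * (year_built - 1900)

print(predict_price(3, 2, 1990))  # one prediction for one house
```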
So the example I've given here is a regression model, and regression is, more broadly, a type of supervised machine learning, which is one of the three main categories of machine learning. I'm going to start broad now and talk about these three areas, and then we'll talk about regression in more detail and I'll actually explain what it is. The three main areas of machine learning are, first, reinforcement learning, which is basically a situation where you have to make a series of decisions and you're trying to optimize some sort of reward. You can imagine some sort of robot moving around in the world and picking up coins or something like that. We're not going to do that kind of work in this class. Instead, we're going to focus on two areas: supervised machine learning and unsupervised machine learning. In both of these cases, we have all our data up front and we're trying to gain information from it. Some people will say there's a fourth category of machine learning called semi-supervised; we won't be talking about that here. Within supervised machine learning, there are two different things we're going to learn this semester. One is regression, where we're trying to predict a quantity, and the other is classification, where we're trying to predict a category. In any case where we're trying to predict something, that's known as a supervised problem. The way it works is that the data we have has labels on it: usually there's some special column telling us a quantity, like the price of a house, or some sort of category. From that, we can try to predict the label in cases where it's unknown. In unsupervised learning, there's no special label column we're trying to predict; we're just looking for general patterns in the data. And so we might do a couple of things.
One, we might try to cluster our data, placing rows into different groups. Or we might try to decompose our rows: I might notice that I have these rows, each with five numbers in them, but maybe every row is roughly a combination of two component rows. So there's some simplicity in there, even though there might be a lot of columns in our data. I'm going to go through these four types of problems we're going to learn this semester and try to make them more concrete. So here I have a table; this is just a regular DataFrame. This is my index here, and here are my column names. Right now I have a Y column, which is my label: that's generally what I'm trying to predict. And then I have these other columns, which I've just named X0 through X4, though usually they would have real names, like the number of beds in a house, as we saw before. With this label we're trying to predict, what we're going to do is look for a relationship between it and these other columns, which we call features. In general, we have some rows where we have examples of both, and then there might be other data where we only have the features but not the Y label, and we want to predict what should go there. You can imagine why: maybe these are all different houses, some of them have already sold, so we know what they sold for, and these others haven't gone on the market yet, so we're trying to predict what they would sell for if they do. So the problem with regression, to state it again, is that we want to predict the quantity, the Y column in this case, based on the features. And by quantity, I mean a number. So how are we going to do that? Well, we might break it down into three parts.
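As a sketch of the kind of table just described, here's a tiny made-up DataFrame with a Y label column plus features X0 through X4; the numbers are invented, and the last row's label is missing, which is the kind of row we'd want a model to fill in.

```python
import pandas as pd

# A toy labeled table: features X0..X4 plus a Y label (house price).
df = pd.DataFrame({
    "X0": [3, 2, 4, 3],
    "X1": [1, 1, 2, 2],
    "X2": [1990, 1975, 2005, 2010],
    "X3": [0, 1, 0, 1],
    "X4": [1200, 900, 2100, 1600],
    "Y":  [210000, 150000, 340000, None],  # last house hasn't sold yet
})
known   = df[df["Y"].notna()]   # rows with labels: usable for training
unknown = df[df["Y"].isna()]    # rows where we'd want to predict Y
print(known.shape, unknown.shape)
```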
First, we might select a subset of the data for which we know the answer, and leave some other data aside for which we also know the answer. We'll run an algorithm that is able to infer the relationship between the features and the labels. Once I've done that, I might run my model on those other rows for which I also know the answer. Now, of course, I already know the real answer, and unless my model is perfect, it's probably going to give me somewhat different answers. So why would I do this? Why make a prediction when I already know the answer? The reason is that I can use this to evaluate, or test, my model. For example, if my model says a row should have been 70 and it's actually 72, that's an error. Same thing here: 60 versus 59, that's an error. I can quantify all of these errors and give my model some sort of score. That's the testing phase. After I've trained my model and evaluated it on some known cases, I might actually put it in production. Production means I'm using it for real things, predicting actual unknowns in the world: for example, if a new house comes on the market, what might it sell for? Another thing we might do, beyond making predictions, is look at the model itself and learn things about the world. Going back to the house-selling example, I think it's interesting to know: for each additional bedroom or bathroom in my house, how much does that increase its value? And I can use that to make decisions. Maybe I want to do a remodel: will I get more benefit by adding another bathroom or another bedroom?
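The train/test/production workflow described above can be sketched with scikit-learn on made-up numbers (the features and prices here are invented):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = [[2, 1], [3, 2], [4, 2], [3, 1], [5, 3], [2, 2]]   # beds, baths
y = [150, 220, 280, 200, 350, 180]                      # price ($1000s)

# hold some labeled rows aside for the testing phase
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)           # training phase
print(model.score(X_test, y_test))    # testing phase: one quality score
print(model.predict([[4, 3]]))        # "production": a genuinely new house
```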
So we can learn things about the world and also make decisions that way. Okay, so all of this was regression, the first kind of supervised learning we're going to learn this semester. The key thing that makes it a regression is that we're trying to predict some quantity in our Y label. Now, it's totally possible that our features are a mix of both quantities and categories. Something like green, red, blue is a category; something like shape is a category; a lot of things that are strings are categories. That's fine. The distinguishing characteristic of a regression is that the label column is quantitative. If I'm working on a problem where my Y is categorical, then it's no longer a regression; it's a classification problem. But everything else I've been talking about, where I do training and testing and then put the model in production, is all the same; we're just dealing with categories instead of quantities. Okay, so moving on, we saw the two kinds of supervised learning, regression and classification. What about unsupervised learning? The main point is that there is no label column; I just have a bunch of features. And I can still try to learn some patterns from this, even though I'm not trying to predict anything. One of the things I might want to learn is whether there are any natural groupings of these rows. There are algorithms out there that will, say, put all these rows into three groups, and I might assign them numbers like 0, 1, and 2; to draw what that looks like, these rows would all lie together in a group. Now, I really want to stress that there is no data out there telling me what the proper grouping is, or even how many groups there are. So when I'm doing this, it's not like there's exactly one right answer.
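A sketch of that grouping idea, using scikit-learn's KMeans (the algorithm we'll learn later) on invented points; the algorithm assigns a group number to every row, and the choice of two groups here is mine:

```python
import numpy as np
from sklearn.cluster import KMeans

# six made-up rows; the first, second, and fifth sit near each other
rows = np.array([[1, 2], [1, 1], [10, 11], [11, 10], [0, 2], [12, 12]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(rows)
print(km.labels_)   # a group number (0 or 1) for every row
```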
But that doesn't mean that all groupings are equal. I can measure, within a group, how similar the rows are to each other, if I have some metric for that. My goal is then a grouping that maximizes the similarity within each group. There might be different groupings that are equally good, but as long as I have high similarity within groups, I've still learned something meaningful. And you can imagine lots of reasons to do this. Maybe each row represents a different user of my web application, and if I can say, hey, there are these ten different kinds of users, I could run a different marketing campaign for each group. Okay, so clustering again is unsupervised: there's no label column we're trying to predict. The last kind of machine learning problem we'll talk about this semester, and probably the most complicated, is called decomposition. Decomposition is also unsupervised: again, there's no column we're trying to predict. The idea with decomposition is that I'm going to look through all these rows and see if there's a pattern. Are there a couple of archetype rows that can be mixed together to create the others? Maybe what I see is that, with some small error, most of these rows are just combinations of these three rows over here, which I would call my component rows. Notice the columns are the same between my original data and my component rows. Then, to get this row here, what I'd do is multiply the first component row by negative 11, add 21 times the second component row, and add negative 8 times the third. So I'm taking a weighted combination of these three rows to produce this row.
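That weighted-combination idea can be crunched directly with numpy; all the numbers here are made up (two components instead of three, to keep it small), but the mechanics are the same: one weight row per data row, multiplied through the component rows.

```python
import numpy as np

# A sketch of decomposition: every data row is (approximately) a
# weighted mix of a few component rows.
components = np.array([[1.0, 0.0, 2.0],
                       [0.0, 1.0, 1.0]])   # 2 component rows, 3 columns
weights = np.array([[ 3.0, 1.0],            # one weight row per data row
                    [-2.0, 4.0]])

# n weight rows in -> n reconstructed rows out, with m columns each
reconstructed = weights @ components
print(reconstructed)
```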
And if you actually crunch these numbers, you'd see that I get something similar to the original row, but with some error; it's not a perfect match, which is fine. The fewer components I have, the simpler my model, but the more error there might be. What we'll generally do, when mixing components to create a row, is put those weights in another table down here: a table of all my weights, or maybe my principal component scores. So I'll put negative 11 here, 21 here, and negative 8 here. For the next row down, I'll do the same thing: negative 43 here, 12 here, and negative 6 here. Since I'm putting these mixtures for every row down here, what that means is that if there are n rows in my original data, there are also going to be n rows in the weights table. And if there are m columns in the original data, there will be m columns in the components table. So basically, I can take this big table and reduce it to some components plus some weights. That's useful for lots of things. One is saving space on my storage system, since these tables can be smaller. It's also nice for other phases of machine learning, like classification or regression: it helps in a number of cases to have, say, three feature columns instead of the original five. Okay, so that's a whirlwind tour of these four problems. Regression and classification are both supervised, because the data is labeled. With clustering and decomposition, there's no column we're trying to predict, so the data is unlabeled; that's unsupervised learning. And for each of these four things, there are actually a ton of different algorithms out there.
This semester, we really only have time to learn about one algorithm for each of them. If I go to this website down here, the website for scikit-learn, which is the module we're going to be learning, there are probably close to a hundred different algorithms, or different classes, available; I've put a small subset here. I can see all these different things they have for clustering, and we're going to learn just one of them, which is k-means clustering. For decomposition, out of all the options, we'll learn just one, which is PCA. It turns out that a lot of algorithms come in classification/regression pairs: for example, there's a decision tree classifier and a decision tree regressor, a k-neighbors classifier and a k-neighbors regressor. That's why I didn't split these out; I just put both under these two categories. So we're going to learn two things here: logistic regression and linear regression. This is a little bit confusing. The linear regression part is obvious: it's a regression. Logistic regression is the one people always get confused about, because even though it says regression in the name, it is not a regression; it's actually a classification algorithm. So those are the four things we're going to learn this semester. And the very nice thing is that, once we've learned these, the interface to the other algorithms is relatively simple. For example, once you know how to use a linear regression, you could very easily just replace LinearRegression with Ridge, and you'd still be able to do all your machine learning work correctly.
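That interchangeable-interface point can be sketched directly; the data here is made up, but both models really do expose the same fit/score/predict interface in scikit-learn:

```python
from sklearn.linear_model import LinearRegression, Ridge

X = [[1], [2], [3], [4]]   # one made-up feature
y = [2, 4, 6, 8]           # a made-up label

for Model in (LinearRegression, Ridge):
    m = Model().fit(X, y)                 # identical interface for both
    print(Model.__name__, m.predict([[5]]))
```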
Now, before you do that, you should probably learn how ridge regression works and think about which model is best for you. But at least in terms of the code, it's very simple to switch between different models within any of these four categories. That was pretty high level, so I want to talk a little bit about the foundations you'll need for this machine learning work, both in terms of code and math. We're going to learn a few different modules. The main one is scikit-learn; I was just showing some of its documentation. We're also going to learn numpy, which lets us deal with matrices. It turns out that numpy is really the foundation for pandas: all pandas data is actually stored in numpy, and now will be a good time for us to actually see that. And then we're going to learn a thing called PyTorch. PyTorch can do a couple of things for us. One is that it can do calculus for us, which is pretty cool. Another is that it can let us run our code on GPUs, graphics processing units. Everything we've run so far this semester has been on CPUs, your central processing unit. It turns out that GPUs, originally built for graphics, also happen to be really good at machine learning, so if you're dealing with a lot of data or complex models, a GPU will often be better at it. We're also going to have to learn a little math. I'm not assuming you have any math background beyond what you might learn in high school, but let me give you an example of how math will come into play for a regression problem. We have this example again with all the houses and their characteristics, and a function that predicts the price. How would we do that with matrices? Well, I might take all the numbers in the DataFrame and put them in this matrix here.
And then, for my function, I might have an algebraic expression using matrices: my x here is this matrix, c is a vector, and b is just a number. When I run this, I get another vector out, which actually has all the prices. To understand what's going on here, we have to learn a little linear algebra. This is not regular multiplication; it's something called the dot product, and it looks like this: I take this x matrix, dot product it with this vector, and then add a number, and that's how I get my results over here on the right-hand side. If I take one row times this vector, I get one house value as one prediction, and it goes through without even needing a loop. The beauty of linear algebra and multiplying matrices with the dot product is that I can do it in one step and actually get all of these numbers at once. The code for it is pretty simple: if I say df.values, then x is a numpy array, and if I want to, I can just ask for the dot product of these two things, add b, and it just works. We'll talk about this in quite a bit more detail before the end of the semester. One thing I want to note is that if you're reading other documentation, a lot of resources will use A instead of x for the matrix, which I find confusing; it's not intuitive if you're working with the scikit-learn modules, since those generally use x for the data. And even stranger, what we call c in scikit-learn land, those resources will often call little x instead. So as we're learning the linear algebra material, I want to say up front, and I'll say it again: be aware that the variable names are a little wacky. So what is the scope of linear algebra, and what kinds of things are we trying to solve?
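Before getting to that, the matrix prediction just described can be sketched in numpy on made-up numbers; one dot product computes a prediction for every row at once, no loop needed.

```python
import numpy as np

X = np.array([[3, 2],        # beds, baths for three made-up houses
              [2, 1],
              [4, 3]])
c = np.array([50, 30])       # invented weights: $50k per bed, $30k per bath
b = 20                       # invented base price ($20k)

y = X @ c + b                # same as X.dot(c) + b: one prediction per row
print(y)
```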
Well, one thing that we're not going to solve is something like y equals x squared. That is not linear; anything quadratic or cubic or the like is not linear. All we can do is multiply variables by numbers and then add things up. So this is an example of a linear equation: I have some different variables, and I'm multiplying them by different numbers. One thing to notice is that, the way we're going to do linear algebra in this course, we actually have very big matrices, a lot of variables, and a lot of equations; you can see here I actually have 50 variables. I think the key takeaway is: more variables, more data, but simpler equations. What about calculus? Here I have that situation again with the houses, where I have some training data, both my features and my label. It goes into an algorithm, and that algorithm will basically spit out a formula I can use to predict housing prices. Now it turns out that, when I was doing this training, the original prices and the predicted prices are a little bit different: 140 versus this one, 190, 240, 254; they're all a little bit off. So, for the given equation I end up with, I can have some sort of total loss function that compares the correct answers with my model's answers: I compare these two and get one number out, and that tells me what my error is, how bad it is. And of course, how bad it is really depends on the numbers that are part of this equation down here. So the whole idea of training with this algorithm is that I want to find the c that makes my error, my loss, as small as possible. We're trying to minimize something.
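That loss comparison can be sketched directly: compare the true labels against the model's answers and boil all the errors down to one number. Sum of squared errors is used here as one common choice; the predicted numbers are invented.

```python
import numpy as np

y_true = np.array([140, 190, 240, 254])   # the correct answers
y_pred = np.array([145, 185, 250, 250])   # made-up model answers

# one number summarizing how bad the model is; training searches for
# the coefficients that make this as small as possible
loss = np.sum((y_true - y_pred) ** 2)
print(loss)
```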
I don't expect you to have taken calculus, but I know a lot of you have, and in calculus we're often trying to minimize or maximize things. That's why it comes into play a little bit here. The good news is that we don't have to do the calculus ourselves; there are modules that can do it for us, such as this PyTorch thing we're going to be learning. PyTorch is also going to help us run our code on GPUs, where we'll be able to do things like take two matrices, shove them over to the GPU, and multiply them together, and it almost feels like it's just magically going faster than it would on a CPU. And it doesn't take a lot of code to move things around. So PyTorch is going to be very powerful, both in terms of calculus and using GPUs. To conclude this video, I just want to talk about the difference between developers and users, and which we are. Looking at this picture here, I'm feeding all this training data into a machine learning algorithm, and that gives us a function we can use to make predictions. There are classes, and people in general, who either develop new algorithms or write and optimize code for existing algorithms. We'll do just a tiny bit of that, but it's not our focus. We aren't trying to do machine learning research or come up with novel ideas. We aren't developers; instead, we're going to be users of the machine learning algorithms that come in scikit-learn. So some of the questions we're going to be interested in for the rest of this class are: which algorithm should we use in sklearn? How should we pick it, and how should we configure it? A lot of these have different parameters.
In terms of the data, how can we clean it up so it works well with the machine learning algorithm we chose? And finally, when we actually use this thing, we're going to get all these predictions, and we can compare them back to the original values; how do we want to score that? There isn't necessarily one right way to evaluate how good or bad a model is, so we want to get some experience with that as well. So that's a bit of a preview of what's coming up in the course, and hopefully this is a fun change of pace compared to what we've been doing. Well, in this video I'm going to be fitting a regression model from scikit-learn to some COVID data for Wisconsin. So here I am on the Department of Health Services data portal. I can search for COVID here, and the data set I'm using is this one right here, the historical data by county. There are about 70 counties in Wisconsin, and what this data shows me, for each date in each county, is all of these different stats. Let me open the data browser here. For example: how many positive cases there are in total, how many new cases there are, the average over the last seven days, and how many deaths there are. Ultimately, what we're going to try to predict, based on this data, is how many deaths there will be two weeks in the future, looking at the stats for today. I had already downloaded this, so I'm not going to do it again. There was a fair bit of data cleanup I had to do, and since that's not the main point of this lecture, I have a notebook that does all the cleanup; I'm just going to quickly walk you through some of the things I did without spending too much time on it. One, I pulled out just a few interesting columns: for example, how many positive cases there were on average over the last seven days, and how many new deaths there were.
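The kind of cleanup this notebook does can be sketched like this on a tiny invented table; the column names, file contents, and the -999 code placement here are my stand-ins, not the dataset's exact ones (the actual steps are walked through next).

```python
import pandas as pd

# Hypothetical mini-version of the raw download (column names invented).
raw = pd.DataFrame({
    "DATE": ["2020-05-01 14:00", "2020-05-02 14:00", "2020-05-03 14:00"],
    "POS_7DAY_AVG": [12.0, None, 8.0],
    "DEATHS_NEW": [0, 1, -999],
})

df = raw[["DATE", "POS_7DAY_AVG", "DEATHS_NEW"]].copy()  # keep a few columns
df = df.dropna()                                  # drop rows with missing data
df["DATE"] = pd.to_datetime(df["DATE"]).dt.date   # keep the date, drop the hour
df = df.replace(-999, 0)                          # -999 codes "fewer than 5"
print(df)
```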
There was a bunch of missing data, so I just dropped any rows with missing values. Then I converted the date to an actual pandas datetime, and in the process I dropped the hour; I just want the date without the hour it was posted. And finally, the documentation for this data set says that negative 999 really means there were fewer than 5 in whatever that field is, so it could be anywhere from 0 to 4. For simplicity, I just replaced all of those with 0; we don't really know what the true value is, and you could imagine doing something smarter, like using 2.5, which seems fairer in some sense. Anyway, I end up with a data set that looks like this. There are a lot of rows where there are no new cases and no new deaths, maybe for some of the smaller counties; for larger counties, this is definitely not 0. Going down a little more, the other thing I want to do is add a column that isn't just how many new deaths there are, but how many new deaths there are two weeks in the future. I had to do some trickery with time deltas to basically join the data against itself two weeks in the future; you can look at that if you're interested. But in the end, I get a data set where I know how many new deaths there were on a particular day, and this last field is how many there were two weeks after that. I save all of this to the Wisconsin COVID data set, and that's what I'm going to be working with here. So I'm going to head over and create a new notebook to analyze it. Let me head here. I'm going to import pandas for starters, so I'll say import pandas as pd, and maybe also import matplotlib.pyplot as plt. Then I'll just configure my matplotlib stuff: %matplotlib inline, and then plt.rcParams, to make the font size a little bigger. Great. And now I can actually get my DataFrame, so I'm going to say df equals pd.read_csv and
what do I want? Well, the file I produced from that other notebook, which is the Wisconsin COVID csv. Let me just peek at that; that's all good. And maybe, just to make sure these columns aren't all actually 0, I'll look at the DataFrame's mean. I can see that, in a given county on a given day, there's on average about a quarter of a new death. So what I'm going to be trying to do is predict this field: I might try this column as a feature, or this one, and this is going to be my label column, the one I'm actually trying to predict. It's a quantity, which is why I'm doing a regression instead of a classification. Often, before I jump into a regression, I'll do a scatter plot just to see if I can identify any patterns in the data visually. So I might say df.plot.scatter, and then x equals something and y equals something. In both cases, the thing I'm trying to predict, my y, is how many deaths there will be two weeks after the given date. For my x, in the first case, I'll try the seven-day average, like that, and I can see a picture there. The other thing I want to do is ask: if I look at how many deaths there were today, what will that tell me about how many deaths there will be two weeks from now? And I see a slightly different pattern there. Sometimes for these I like to set alpha equals 0.2 or something like that, to give the points a little transparency when a lot of them are on top of each other, so I'll do that as well. The other thing I'd like to do, as we're going forward, is fit a regression model on both of these variables, so you get a sense, when we score them, of how that score corresponds to the strength of the relationship in each. So I'll start with this one. How can we train a regression model that fits this to this? The first thing I have
to do is import it, so I'm going to say from sklearn.linear_model import LinearRegression; this is the main regression model we're going to learn this semester. So I'll do that, and then, if I want to, I can create a new LinearRegression object, just like that. There are a few methods we're going to want to learn with this; some important ones are fit, score, and predict, and we're going to run those on our data. So let me come down here. What I do when I'm fitting, or training, the model is give it two pieces of information: my x values and my y values. The y values can be a series if I want, so I could just pull out that one column. The x values have to be a DataFrame. So let me just take a look at my DataFrame from earlier. What I'm going to do first is build a model corresponding to this plot: how do the deaths today predict the deaths in the future? So I have to pull out this one column, the new deaths, into a DataFrame. The way I'll do that is like this: if I just index with the column name, I get a series; the way I get a DataFrame is to pass in a list of columns. It looks kind of weird that I'm putting a list inside of brackets, but that's what's happening. In this case, I'm really just interested in one thing, which is how many new deaths there were on that particular day, and that gives me a nice DataFrame, which is going to work well for us. Then, for my y values, I can just directly pull out the column as a series, and that will be the new deaths two weeks from now. So these are my x values, which must be a DataFrame, and these are my y values, which can be a series. In general, the y is the one thing I'm trying to predict, but I might make that prediction based on multiple columns. So I'm going to head down
here, and I'm just going to copy these things: I'll copy this right here, and then this right here, just like that, and train it. Fit means to train based on the data, so I give it a bunch of examples of my features and my labels. I can do that, and it's relatively uneventful. Now I'd like to be able to visualize what it looks like: what predictions does it make on new data? It turns out that, since I'm doing the linear regression on just one column here, it's actually going to be fairly easy to visualize, because I can ask it, for a given x value, what y value it predicts. I say lr.predict, and I have to pass in some sort of value: say, if there are 10 new deaths today, how many do I predict two weeks from now? I have to pass this in as a list of lists, because that was the shape of the x data up here. And I can see there's also this array thing, which is a numpy array; we're not going to worry about that too much for now. The way I'll generally do these predictions is to create a DataFrame that will help me show a fit line; that's one way we can represent the relationship here. If I drew a line on this plot, it would tell me, for my x variable, what I predict for my y. So I'm going to create a new fit DataFrame, and it may have a column for the new deaths (remember, that's the number of deaths today), and for that I'm just going to put in a bunch of different values, maybe from 0 to 500. Let me just take a peek at that. Okay, so that's my new-deaths column, and then I can pass that whole thing into my predict call to basically figure out what those y values are. So I say lr.predict, and if I want, I can pass in my whole fit DataFrame like that. Again, I get a bunch of values in this numpy array, but the great thing is that I can say I want
to just shove those things into a new column, and I might call it something like predicted, to emphasize that this is not real data, it's just a prediction. So I put this up here, and now when I run this I can see, for a given number of deaths today, how many I predict there will be two weeks from now. So if there are 500 deaths today, I'm predicting around 160 deaths two weeks out for a particular county. What I can do now is say df.plot.line with x equals this thing and y equals this other thing, just like this. Different models will give you different shapes, but the linear regression I'm using here gives me a straight line, so I run that and there's my straight line, and maybe I'll just make it red. What I often like to do after I have my regression line is compare it to my actual data. Remember I had these same columns before in my original data frame, and for the original data frame I just want to draw a scatter of all those points; at that point I'm not drawing a prediction, I'm drawing real data. Let me try to put that on the same axes, so I'll say ax=ax, with the first plot returning ax, and maybe out here I'll set alpha to 0.2 again. And why did I put ax there? There we go. So I get these nice plots, and I can see the line extends far too long; maybe I should have gone to about 100, so let me make my predictions over a range of about 100 instead. I do that and I can see how that line fits. Okay, so how well do I do? I don't want to just intuit that. Really what I'm looking at is, for each of these points, I'd hope most of them are near the line, and I can see quite a bit of them are not
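The fit-and-predict workflow just described can be sketched like this. The data here is made up, standing in for the COVID data frame in the video; the column names death_new and predicted follow the video, while death_future and everything else are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up stand-in data: deaths today vs. deaths two weeks later
rng = np.random.default_rng(0)
df = pd.DataFrame({"death_new": rng.integers(0, 500, size=200).astype(float)})
df["death_future"] = 0.3 * df["death_new"] + rng.normal(0, 10, size=200)

lr = LinearRegression()
lr.fit(df[["death_new"]], df["death_future"])  # fit = train on features + labels

# a single prediction: X must be 2-D (a "list of lists"), matching the training shape
one = lr.predict([[10.0]])  # comes back as a NumPy array with one value

# a "fit" data frame: a range of x values plus the model's predictions,
# which is what gets drawn as the straight regression line
fit_df = pd.DataFrame({"death_new": np.arange(0, 500, dtype=float)})
fit_df["predicted"] = lr.predict(fit_df[["death_new"]])
```

Calling fit_df.plot.line(x="death_new", y="predicted") on the same ax as a scatter of the real points reproduces the picture described in the video.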
and so the way we'll evaluate this is to ask: what is the variance in the thing we're trying to predict? Variance is a measure of how much the values typically differ from the average. So I can get my variance, and the variance in this column is 1.35. The idea is this: we'll ask how much each of these points differs from the average along the y-axis, and then compare that to how much they differ from that red line that I drew. If the line is good, then the variance relative to the red line should drop a lot compared to the variance relative to the average, which I could draw as a horizontal line. The way I can do that is to come back here, and just like I fit the data, where did I fit the data, I fit it right here, instead of fitting I can score it. What that tells me is how much of the variance is explained by the model, and that will typically be a score between 0 and 1, with 0 being the worst and 1 being the best. In some weird cases it can actually go negative, and maybe we'll eventually talk about that, but generally it's going to be between 0 and 1. And I see this score is not great. Okay, let's try it for the other variable that we had as well. I had this one here, and just looking at the plots I might expect this one does a little better than, what was it, 9%? So I'm going to copy these things down here, create a new linear regression object, and then fit it. Where was I fitting before? Here's how I was fitting before. Instead of fitting to the deaths today, let me fit to, oh, I wanted to get the number of positive cases. Let me go up and look at my data frame again: I wanted to fit to the average number of cases
over the last seven days. So now I'm going to train this new model based on this other variable: how can I use it to predict how many deaths there will be in two weeks? Maybe right away I'll score it as well, so I'll just copy this and score it, and I can see it's doing quite a bit better: instead of explaining 9% of the variance, I'm explaining 20% of the variance. If I were explaining 100% of the variance, that would be very remarkable, because every point would be exactly on that line. Let me plot it just like I did before. I'll copy this here, and remember that for this fit data frame I had to generate it using a range of values, so I'll paste this here. What am I doing again? I'm generating values from 0 to 100 for the positive seven-day average cases, and then getting a prediction based on that, and I can plot a line like that. I do that and I see something a little bit weird; I guess I didn't change everything. The x-axis still says how many deaths there were on a day, as opposed to what I'm actually training on, which was the seven-day average, so let me fix that and run it again. That's a little bit better, but I can see the line doesn't extend very far. That makes sense: previously, the number of deaths on a given day was relatively small compared to the number of positive COVID cases. So maybe I'll redo this: instead of going from 0 to 100, I'll go from 0 to about 900. I run that, and I can see this is a better fit; now I'm explaining 20% of the variance instead of just 9. But I did something bad here, and I want to talk about that in the next video. Let me just leave you with a thought. Let's say I'm teaching something in class and I work out an example, so maybe everybody sees that example and they
try to learn from it. If I put exactly that same example on an exam, what does it mean when somebody does well on that exam? I think there are two possibilities. One is that maybe the person genuinely learned something from that example, and even though they're seeing the same example again, they understand what's going on. The other possibility is that maybe somebody just memorized the answer to the example, and when they see it on the exam, they just repeat it, and that would be not so great. The same thing happens here: when I'm fitting, I'm really giving my linear regression model some examples, and then when I'm scoring it, I'm using those same examples. So if this score were good, and here it's not great, but if it were good, I wouldn't really know: did the model genuinely learn something, or did it effectively memorize the answers? Overfitting is what we call it when the model basically memorizes. So next time I'm going to talk about how we can deal with that problem and get a better sense of whether the model is doing a good job. So, last time I talked about this issue of evaluating our model on the same data we fit it to: the model can effectively memorize the answers and look like it's doing well even if it didn't really learn anything. The way we'll deal with this problem is with something in sklearn called a train-test split. This is a general strategy, but this specific function will make it easy for us. What it does is take our original data frame and give us back two data frames: we'll train our model on one data frame and then evaluate it on the other one, on data the model has not seen before, so we get a fair test of whether it's doing well. I can see there are a number of things I can pass in. One is what ratio of my data I'd like to go into my test set. I can also do things like stratify the data: let's say I was
dealing with something categorical; I might want to make sure that I have a similar mix of categories on both sides. I'm not trying to do that here, but I will do the split. I'm going to say from sklearn.model_selection import train_test_split, and then I call train_test_split and pass in the data frame that I have originally. It basically returns a list with two data frames in it, and the way I can capture those is to say train_df, test_df equals, in order. Let me take a look at these: here's my train data frame, and here is my test data frame, and you can see that it kind of shuffled things around if I look at the index, so I'm dealing with different rows in each. This first one is the data frame my model is going to learn from, and the other one is the one I'll use to make sure it actually learned something. Right now I can see there are about 6,300 rows down here and about 19,000 up here, and that's because the default split puts 0.25 of the data in the test set. I could pass something like this if I wanted to change that balance. Let me see what I did wrong here; I think I actually have to say test_size, so I'll say test_size equals that, and now this one is a little bit bigger, more data to test on, and a little bit less data to train on, so there are some trade-offs there. Anyway, now I'm going to head back to the example I had earlier where I was doing my training, and the idea is that instead of fitting and scoring on the same data, I'm going to fit to my training data and then score on my test data. I'll just do it again from scratch as a review. I'll say lr = LinearRegression(), and then lr.fit with my train data frame, and then I have a list of
columns here, and then here I have my y column. So that's the general strategy, and then what I want to do is score it. I'll give it a y column and an x column again; sorry, I have to say train_df here. For scoring I'll give it that same information, but now with the test data frame: I'll say test_df, and then the same thing down here, test_df. For this I pass in a list of columns like before. What was I using last time? The positive 7-day average, so I'll pass that in, and the same thing here. Then over here I pass in the thing I'm trying to predict, my y, which was this thing, and the same here. I run that, and I see that my score is 0.16. What would happen if I ran it on my original training data frame instead? Here it does better on the data I trained it on, and that suggests there is some overfitting here: the model does better on the data it learned from than on new data it hasn't seen before. That of course is a concern, and there are different things I can try to do to avoid that, which we'll eventually talk about this semester. Today I'm going to talk more about how we can determine whether or not a model is doing a good job: how can we interpret different scores and make sure we aren't getting a good score just by chance? I'm going to pick up where I left off last time, with the model that's trying to predict COVID deaths two weeks out based on the current number of cases. Let me quickly review what I did last time. We were doing a train-test split on our data frame, and our data frame looked like this right here: it was per county, with every day in the year. Then I created this linear regression model, which I imported from sklearn.linear_model
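The split-fit-score pattern being reviewed here can be sketched on synthetic data standing in for the COVID table (the column names cases_7day and deaths_future are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in: 7-day average cases now vs. deaths two weeks out
rng = np.random.default_rng(1)
df = pd.DataFrame({"cases_7day": rng.uniform(0, 900, size=400)})
df["deaths_future"] = 0.02 * df["cases_7day"] + rng.normal(0, 3, size=400)

# shuffle-and-split: by default 25% of the rows go to the test data frame
train_df, test_df = train_test_split(df, test_size=0.25, random_state=1)

lr = LinearRegression()
lr.fit(train_df[["cases_7day"]], train_df["deaths_future"])  # learn from train only

test_score = lr.score(test_df[["cases_7day"]], test_df["deaths_future"])
train_score = lr.score(train_df[["cases_7day"]], train_df["deaths_future"])
# train_score being noticeably higher than test_score is the overfitting signal
```

The double brackets select a one-column data frame for X (features can be plural), while single brackets give the single y series, matching the convention in the lecture.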
and then I did two things with it: I fit it to my training data, and then I scored it on my testing data, and those two pieces, my training data and testing data, came from here. When I was doing this, what I basically did is put in my y values, which is what I'm ultimately trying to predict, and then here I put in my x values, or my features, which are things I know right now. So for example, right now I know the 7-day rolling average of positive test cases, and two weeks out I'm trying to predict how many deaths there will be. The y can be a single series, which is why we just put a string in the brackets after the data frame. For the features we actually have to pass in a full data frame, because in general we might have multiple features, and when I pass a list into the brackets after a data frame, I get a smaller data frame back; that's why I have the double brackets there. So anyway, I have this 0.2, and we know that this score will be somewhere between 0 and 1, but it's a little bit hard to say how good this score is. Maybe you always get something like 0.2 by chance; do we know? That's one of the things I want to talk about today. The other thing you might notice is that if I rerun this a few times, now I'm at 27%, 32%, 24%, 28, 26, so you can see that based on how the train-test split falls, I get very different numbers, and that doesn't give us a lot of assurance. So how can we get more stable numbers? Let me just give you a hint of what the problem is. If I look at train_df, and I look at this column here, which is the thing we're trying to predict, I have all those numbers there; let me look at the variance of that column. Variance is just a measure of how different the values are from the average. Then I'll do the same thing for the test set, and when I do, I see that they actually have quite different variances, and if I run this again, the rows just randomly shake out
differently, and now the test data actually has a higher variance than the training data. We're going to eventually look at how this scoring function works, but it turns out that it's very much based on this variance, which is why we have such a noisy measure. Let me head over to the slides and preview the things we're going to talk about today. We're going to learn four new functions related to model evaluation. First, we'll learn these two functions here, which will let us score our models. Second, if our model is mediocre like mine, and 0.2 is not great, how can we know it's not just chance? For that we'll use something called the permutation test score. Finally, to get a less noisy measure, we'll do something called cross-validation scoring. Let me start with these two metrics: the R² score and the mean absolute error, and we'll talk about how those work. So if I go back to my notebook right here, where I'm doing this scoring, I can also say, up here, from sklearn.metrics import, and there are a couple of things I want: the r2_score and then the mean_absolute_error. The way all these metric functions work is like this: I call the metric function, and, let me just check this quickly, I have to make sure I get the order right between my true and my predicted values, so I pass in the true values and then my predicted values. So for example, up here, what are the true values and what are my predicted values? How do I get my predicted values? I can just say: model, please predict for me what these y values should be, based on these x values I'm going to give you. So I do that, passing in x values, and it returns y values to me in that array thing we'll eventually talk about, but I could take this and
I could put this right here, and then I could use lots of different metric functions. I could, for example, use the r2_score right here, and guess what: it turns out that the score associated with the model is just defaulting to this R² score. There are lots of different metrics I could have used instead, but this one is the default. So let's talk a little bit about how this is computed. The idea is that this is the thing I'm trying to predict, my y column, so I'll say y is here, and I can just peek at that. What I want to do is look at the squared differences of this column relative to its mean. So what does that mean? I can take this and subtract off the mean, then square all of that, and then sum it up. This is really a measure of how much total variation there is in the system: it's like the variance, except I'm summing instead of averaging. So this is my original total, a sum of squares relative to the mean. Then what I want to think about is: what if, instead of measuring the distance from each value to the mean, I measure the distance to my actual prediction? It'll be very similar: I can ask what is left over if I subtract off my predictions instead. How do I get my predictions? That's this right here, so I'll grab this piece and use my predictions here, and let me take a look at that. The way I really think about it is that the first number was how much variation I had in the system originally, and this is how much is left after I use predictions rather than just subtracting off the mean. So what I could do is ask how much is remaining: I could say left over divided by the total. And why is that? Let me just put that here
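That sum-of-squares arithmetic can be checked against sklearn directly. The numbers below are invented just to exercise the formula:

```python
import numpy as np
from sklearn.metrics import r2_score

# invented true values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# total: squared distances from the mean, summed
total = ((y_true - y_true.mean()) ** 2).sum()   # variation before predicting

# left over: squared distances from the predictions, summed
left_over = ((y_true - y_pred) ** 2).sum()      # variation the model leaves behind

# fraction of variation the model "takes away"
manual_r2 = 1 - left_over / total
```

Here manual_r2 comes out equal to r2_score(y_true, y_pred), which is also what the model's own .score method reports by default.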
I can say that left over divided by the total is about 0.79, so I left 79% of the variance on the table, which means I took away about 20%, and you can see that's exactly what this number is up here. So this is the math people typically use to evaluate how good a regression is. We can do it more simply with r2_score, and even more simply, that's the default when I just call the linear regression's score method. Let me give you an example of another metric people might use. Maybe I want to know the average error. In that case I'd go back to this piece, since this is all of the errors, and if I want the average, I should probably take the absolute value of each error and then the mean of that. So this would be the average absolute error, and it turns out that rather than computing it myself, I can just grab mean_absolute_error from here. There are lots of different metrics in there, and I think if I hit shift-tab, or maybe it's just regular tab, you can see all the different metrics that exist; most of them I have never used. So I paste this here, and I see, well, that's kind of strange, it should be the same number. Why is it not the same number? Because I want the error of my predictions, not the error relative to the mean; sorry, this was the thing I wanted to grab. So I want to say: here are all my errors, take the absolute value of them, since some errors are positive and some are negative, and then take the average. And why is that invalid syntax? That usually means I have a mismatch in my parentheses. I see, this one matched up there, and I don't know why I grabbed that square. So now I have all my errors, I take the absolute value, then the average, and thankfully I get the same thing that I have down here. And so again
this is just a shortcut for that more complicated math, but it's another metric. In terms of how these metrics work, this one counts all errors equally: an error that's twice as big is just twice as bad. Whereas this one up here, where I'm doing the R² score, is squaring my errors, so it will tend to look worse if I have a few errors that are really big, as opposed to many small errors. Okay, so those were a couple of metrics, which was one of the things we wanted to answer on the slide. We talked about R² and we talked about mean absolute error, and those are different ways we can measure our errors. The next thing I want to talk about is: if our score is mediocre, like 0.2 out of 1, how do I know it's not just chance? The answer is that we'll take our original data here, train a model on the x's and the y's, and score it; say the score comes out around 0.8. Then what I'll do is shuffle the data around so that I know there's no relationship between the y's and the x's: I'll take this y column and just randomly shuffle it, and the word for that is permute, so I get this permuted version of the column. You can see that, for example, 5 used to be the first number and now 5 is down here. I shuffle that thing, and then I train a model on it, looking for a relationship between the y's and the x's, and there should be no relationship, obviously, because I just shuffled everything around, but I can still train a model and get a score. So if, when I'm basically training a model on garbage data, I get a score as good as or better than I did originally, that probably means I didn't have any sort of meaningful model. That's the rough idea. In the actual implementation, what this function here is going to do
for us, is shuffle the data around like this, get a score, shuffle it again, get another score, and collect something like a hundred or a thousand different scores on these shuffled data sets. Based on those, we can look at our real score over here and ask: does it look like it could fit in with the garbage scores? And from that we can say whether we trust this model or not. So I'm going to head back to the notebook, and maybe I'll make some notes in here so it's clear what we're doing: this part was about metrics, and this part is going to be about permutation testing. I actually already imported it, which is great. All these things are related to model evaluation, and they live under this thing called model_selection, because what we'll often do is have a few different models and use different tools to decide which model we think is best to recommend to people. So I have this permutation_test_score, and I'm going to paste it right here. I can see that I need three things: my model, which is just lr, my x values, and finally my y values. The way I'm going to do this here, since I have this test_df, is that for both my x and my y I'm just going to grab these: this was my test data right here, and based on the 7-day average I want to predict this right here, so I'll grab this. It turns out this returns a tuple of length 3, so I'll run that, and it'll take a moment. The three things in that tuple are going to be the score of my model on the original data; then a bunch of scores on the permuted data, and maybe I'll just call those the garbage scores, since I'm permuting the data and not expecting any pattern; and then there's going to be something called a p-value, and what
the p-value is telling me is: what is the probability that a score this good would be generated by the process that's generating all these garbage scores? So if it's really small, then my score is actually much better than the garbage scores, and I have a significant result. Since the function returns these three things in a tuple, I can just write the three names here and Python will automatically unpack them for me. I'll take a look: I get a score for my model, and then I have my garbage scores, and I see there's a whole bunch of them. If I want, I can put those in a series and do a histogram of them, and I can see they're all around zero or less, while my score was around 0.09, which is actually pretty far over here. So it seems like we're pretty far away from these garbage scores, and therefore this p-value is going to be pretty small: whatever process is producing all these garbage scores is not likely to produce a score this good. So I'll take this as a meaningful result. Let me head back here for another idea: how do we deal with the noise? We saw that when I ran the split a bunch of times I kept getting different scores, and for that we're going to use something called a cross-validation score. The way it works is that instead of just splitting my data into train and test, I'll split it into four pieces, and each of those four pieces will take a turn being the test data. So maybe first I'll train my model on these rows and test it on this piece, and say my model gets a score of 0.2. Then I'll take a different chunk of the data; each of these chunks is called a fold, by the way. So I'll train on the first, second, and fourth pieces, get a model, and evaluate it on that test fold, and let's say this time I get a little bit luckier and it's 0.3. I do it
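The permutation test just described is one call to permutation_test_score. The sketch below uses invented stand-in data (the real call in the video uses the COVID test data frame; cases_7day and deaths_future are made-up column names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import permutation_test_score

# synthetic data with a genuine linear relationship
rng = np.random.default_rng(2)
df = pd.DataFrame({"cases_7day": rng.uniform(0, 900, size=200)})
df["deaths_future"] = 0.02 * df["cases_7day"] + rng.normal(0, 3, size=200)

lr = LinearRegression()
# returns a 3-tuple: real score, scores on shuffled (permuted) labels, p-value
score, garbage_scores, pvalue = permutation_test_score(
    lr, df[["cases_7day"]], df["deaths_future"],
    n_permutations=100, random_state=2)
# a small pvalue means the real score is unlikely to come from the garbage process
```

Plotting pd.Series(garbage_scores).plot.hist() and marking score on the same axis reproduces the histogram comparison from the lecture.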
again and get 0.1, and again, 0.2, and then I can take the average of these, and that's a more stable measurement of how well my model does. It's not as vulnerable to what happens to land in the test or training set, because every row gets a turn in the test data and in the training data. So I'm going to head over here and do this. This is called cross-validation, so let me call cross_val_score. What do I have to pass in? My estimator, then my x values, then my y values, so let me grab those; it's actually identical to this right here. I'm going to grab my model, my x values, and my y values, and this time, should I do it on all my data, the whole data frame? Actually, the best practice is to use just the training data, so I'll do that. Then I get all these scores back, and the reason is that there were 5 folds by default; in this picture there are 4, but by default there are 5, and that's why I got 5 scores. If I change the number of folds, I can get, say, all these 10 scores back. These look like the numbers we were seeing earlier, like 0.27, 0.17, another 0.27, 0.304. If I head back here and run the split a few times, those are the kinds of numbers I was getting as I randomly split my training and test data. So why is this useful? Well, with my scores here I can say a couple of things. I can say scores.mean(), and I see that on average my R² score is about 0.21, but I can also get some sense of the spread, and that tells me how sensitive I am to which data happens to end up in the test or training set; that probably depends on how many outliers I have and how much the outliers drive the scoring. Okay, so this is the way we'll generally do it. One last thing: when I was showing this picture here, I kind of said, well hey
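The cross-validation loop above is one call to cross_val_score; a minimal sketch, again on invented stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training data frame
rng = np.random.default_rng(3)
df = pd.DataFrame({"cases_7day": rng.uniform(0, 900, size=250)})
df["deaths_future"] = 0.02 * df["cases_7day"] + rng.normal(0, 3, size=250)

lr = LinearRegression()
X, y = df[["cases_7day"]], df["deaths_future"]

scores = cross_val_score(lr, X, y)           # 5 folds by default -> 5 scores
scores10 = cross_val_score(lr, X, y, cv=10)  # cv controls the number of folds

avg = scores.mean()    # more stable than any single train/test split
spread = scores.std()  # sensitivity to which rows land in each fold
```

Each fold takes one turn as the held-out test set while the rest train the model, which is why the number of returned scores equals cv.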
I have all my data, and then I just split it into training and test. Why didn't I use all my data here? The reason is that, even though that would have been fine in this example, what you'll often be doing is trying a few different models, and you'll want to do cross-validation on each of the models and then see which one has the best score on average, and say that one's the winner, that's the one we'll use in the future. And there's a risk when you're doing that. Let's say I evaluate 20 models and pick the best one: the best one probably looked a little bit better in the scoring than it really is, because if I evaluate 20 models, some will do better just by luck and some will do worse. So even though that's the right process to pick the best model, I shouldn't go brag about its cross-validation score, because I wasn't just evaluating one model, I was evaluating many. What I'd do is look at the cross-validation score for each of my models, pick the best one, and then finally go back to my real test data, which has been hanging out all along, and that's what I'd report as the accuracy of my favorite chosen model. Well, in this video we're going to be learning about scikit-learn pipelines. We've been learning how to use linear regression as a model, but often we'll have to transform the data in some way before we can actually analyze it, and so what we'll end up with for our models are these pipelines, where we do a series of transformations and then, at the end, use an estimator, as they call it in scikit-learn, to actually make predictions. For this example I'll be using a slightly more complicated data set: Chicago, which is right on Lake Michigan, has all these sensors on different beaches measuring things about the waves, and we'll be looking at this data set of all those measurements. So I can
see, well, here's Ohio Street Beach, here's 63rd Street Beach, and I know all these things from the sensors, like how warm the water is, the turbidity, and other things like that. The y that we're going to try to predict is how big the waves are on a beach, and we'll be using things like the wave period to predict it, or maybe looking directly at which beach we're on. There's some garbage data in here, which I've cleaned up. So this is a picture of all the data, where I have the wave period on the x-axis and the wave height on the y-axis. Then I also break it down beach by beach: here I'm pulling out all the beaches as a sorted, unique list from the beach name column, and then down here I'm drawing each beach separately. I'm creating some subplots, looping over them, and basically plotting each beach: doing some filtering in pandas and plotting each beach in a different ax region. Looking at this, there are a couple of observations right away; often, before I do modeling, I like to do a lot of scatter plots and try to get an intuition for what's going on. One is that which beach we're on is an important variable: some of these beaches have different patterns in their waves. The other thing I see is that the relationship between wave period and wave height is not linear. It's not as if the bigger the wave period, the bigger the wave height, or vice versa. What I see is that there's kind of a hump in the middle: to get the biggest waves, you need a wave period that's somewhere in the middle. So I'm going to make four different models here that analyze this data using different variables in different ways. First will be something very similar to what we've done before: I'll try to predict the wave height based on the wave period using a simple linear regression, and we'll see what performance
we have there. Next, we're going to learn how to fit a polynomial, which means the line I draw doesn't have to be straight anymore. Then, not finally but next, I'm going to ask: if the only thing I know is which beach I'm on, how well can I predict? And then finally I'll look at the combination of both beach and wave period, with wave period treated as a polynomial, and see how much I can predict with that. In terms of my imports, some of these are things we've done before: the first thing we learned was how to do a linear regression, where we fit the linear regression to a training data set and then evaluate it on a test data set at the very end, and along the way we might also do cross-validation within just the training data set. So those things are old. These things down here are new. In order to deal with this data in a polynomial way, I have to transform it using this PolynomialFeatures thing. If I want to deal with categorical data, like which beach I'm on, well, that's not a number, that's a category, so I have to use this thing called OneHotEncoder. And if I want to use both of those, I have to use this other thing, make_column_transformer, to combine them. In total, I'm going to have this pipeline, and at the end of the pipeline I'm still just doing a simple linear regression, but I'll be making all these transformations to the data before it gets there, and that's how I'll build these more complicated models. So let's start down here. I'm numbering these four models one, two, three, four, following the numbering up here, and model one is going to be just a simple linear regression, like that. Then what I can do is say m1.fit, and I would fit that to my x values and then my y
The X is uppercase because it's a full data frame, and the y is lowercase because it's a single series. Before I can do that, I have to separate my train data frame from my test data frame, so I'll say train_test_split(df), and maybe look at the length of both; about 25% goes to testing, which is fine with me. Down here I pass my train df with a list of features, and over here my train df again with just my y column. The y column, the thing I'm trying to predict, is the wave height. For my list of features, I've put a list here, which is why you get these double brackets, and I'll start with just wave period. I probably have to capitalize it to match my data frame; checking back, yes, all these column names are capitalized. So now the model has learned the pattern, and I can say m1.predict on my testing data: I pass test df with only the x columns, and it predicts the y values for me. I get all these predictions, and I could compare them to the actual wave heights; at a glance, they're not very close. The easier way to do that comparison and get an actual score is the score function: it automatically calls predict and compares the result to test df's wave height, telling me what fraction of the variance I'm explaining. Right now it's terrible; I'm not really explaining anything. If I always predicted the average wave height, that would be better than what I'm doing. Of course, some of this is luck, depending on how my train test split happened to fall.
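A minimal sketch of the model-1 workflow, using the same synthetic hump-shaped data in place of the real beach readings (the column names are stand-ins, not the actual data set's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the beach data
rng = np.random.default_rng(0)
df = pd.DataFrame({"Wave Period": rng.uniform(1, 10, 500)})
df["Wave Height"] = (np.exp(-(df["Wave Period"] - 5) ** 2 / 8)
                     + rng.normal(0, 0.1, 500))

train_df, test_df = train_test_split(df, random_state=0)  # ~25% goes to test
xcols = ["Wave Period"]            # a list, hence the double brackets below

m1 = LinearRegression()
m1.fit(train_df[xcols], train_df["Wave Height"])   # X: DataFrame, y: Series
preds = m1.predict(test_df[xcols])                 # predictions for test rows
score = m1.score(test_df[xcols], test_df["Wave Height"])  # variance explained
```

Because the true relationship is a hump, the straight-line fit should score near zero here, just as in the lecture.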
The more reliable way to get a measure is cross validation. I can call cross_val_score, and I have to give it three things: my model, which is m1, my x values, and my y values. I'll do this up here, and when I do cross validation I usually do it all within the training data; I hold the test data back until the very end of the project for a final analysis. I get a variety of scores, and we can see it really is a matter of luck: sometimes I do worse than zero, most of the time not. I can also choose how many pieces to break my data into; remember, each piece, or fold, takes its turn as the test set, so I can get more numbers here. Maybe I'll say scores equals that, then scores.mean(), and this is probably a better indication of how well my model is doing: I'm explaining one tenth of one percent of the pattern. That's not surprising, since all I'm using is the wave period and I'm fitting a straight line to data that clearly isn't straight. If I wanted more columns (let me copy this for a moment; I'll delete it shortly), that would be easy. For example, to include the water temperature, I just add it as another column; it's very simple to have multiple x columns. I add it here, and down here as well, and I get a slightly different score, still not great. I'll delete this, but the point is that it's very easy, because when I put that list here I'm just building a list of x columns to use for my predictions. Okay, that's how we can add things; now let's think about how to do a polynomial.
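The cross-validation step sketched above, again with synthetic stand-in data; the point is that cross_val_score takes the model, X, and y, and returns one score per fold:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in for the beach data
rng = np.random.default_rng(0)
df = pd.DataFrame({"Wave Period": rng.uniform(1, 10, 500)})
df["Wave Height"] = (np.exp(-(df["Wave Period"] - 5) ** 2 / 8)
                     + rng.normal(0, 0.1, 500))
train_df, _ = train_test_split(df, random_state=0)
xcols = ["Wave Period"]

m1 = LinearRegression()
# cross-validate within the training data only; cv=5 means 5 folds,
# each taking its turn as the held-out piece
scores = cross_val_score(m1, train_df[xcols], train_df["Wave Height"], cv=5)
mean_score = scores.mean()   # a more reliable single number than one split
```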
Let me delete all of this and create a demo, a copy of my train df with just the wave period column; let me take a look at it. If I want a quadratic fit, I want to add columns down here that contain the squared data. So I could say demo['period squared'] equals demo's wave period squared, or add a cubed one like this. Why is pandas unhappy? Because I'm assigning to what it thinks is a slice of another frame, so I'll add a .copy() so it doesn't complain. Now I can see that for 2 the square is 4 and the cube is 8, and for 3 the square is 9 and the cube is 27. What I can do now is run a linear regression across these columns: even though it's technically a linear regression, it acts like a quadratic or cubic regression, because I'm treating the powers as regular columns and putting weights on how important each one is. So how can I do this without adding the columns manually? I can use this PolynomialFeatures thing. Down here I say poly equals PolynomialFeatures, then poly.fit_transform on my data right here, which is what I had before. Running that, I get all these different columns. Let me capture those in data; it's one of those numpy arrays, which we'll eventually learn about, but until then I'll put it in a pandas data frame so I can better see what's happening. So I say pandas.DataFrame with my data, and then I want to figure out the column names, and it turns out poly will tell me that as well.
I can say poly.get_feature_names_out, just like that, and I have to pass the result as columns equals that. Now I can see I have x and x squared, and to get the original name in there I have to pass it a list of my column names; now I see period and period squared. If I wanted, I could go up here and ask for fourth degree, so I'd get period, period squared, period cubed, and period to the fourth. You can see it also gives me period to the zero, which is just a column of ones; that's called the bias column, and I'll often disable it by saying I don't want that thing. Now I have something very similar to what I did manually above, and I could do that with my data: I could apply this transformation to my training data and then my test data, and do all my modeling on the result. To keep things simpler, though, what we want is to transform automatically and then immediately apply the linear regression, and it turns out that pipelines, which I also imported up here, make that really easy. I can create a pipeline down here; this will be my second model. We pass in a list, and the way the list works is that I'll have one or more transformers, and then at the end an estimator. All the transformers modify my data in some way, maybe adding more columns, and at the very end I actually run my real model, which is just a linear regression. And what are my transformers? Just PolynomialFeatures, like this, and I think degree 2 will be fine for now.
So I have these two things, and the other detail about the pipeline is that we have to name each stage. The way it wants us to do that is to put each thing in parentheses to create a tuple, with the name as the first part of the tuple. I'll call the transformer 'poly' because it's a polynomial transformer, and the other one I'll just call 'lr' for short; it doesn't really matter. Let me take a look: m2 shows me all these details, but m2 is now just a model, a lot like a simple linear regression, and I can use it in all the same ways. If I head back to where I created m1 and did all that stuff with it, I can do all the same things down here. So I run down here, say m2.fit (maybe deleting this for simplicity), everything is the same, and... the score is exactly the same as before. I expected to do slightly better; why didn't I? Probably because I was evaluating my old model; that's the curse of copy-paste. Fixing that: before, I was explaining 0.1%, well, closer to 0.2%, of the variance, and now I'm explaining more like 4.6%. So model 2 is a huge improvement on model 1. I'm still doing a linear regression, but I'm giving it a little more information to work with through these extra columns, like one that holds wave period squared. Going way back to the scatter plot, the intuition is right there: I'm fitting a line, but the line doesn't have to be straight anymore; it can be a quadratic curve. So we have two models now. The first was terrible: it told me almost nothing, and depending on your train test split it could even lead you farther from the truth than always guessing the average.
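Model 2 as a pipeline, again on the synthetic hump data; since the underlying relationship really is curved, the quadratic pipeline should beat the straight line:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in for the beach data
rng = np.random.default_rng(0)
df = pd.DataFrame({"Wave Period": rng.uniform(1, 10, 500)})
df["Wave Height"] = (np.exp(-(df["Wave Period"] - 5) ** 2 / 8)
                     + rng.normal(0, 0.1, 500))
train_df, test_df = train_test_split(df, random_state=0)
xcols = ["Wave Period"]

# each stage is a (name, object) tuple: transformers first, estimator last
m2 = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lr", LinearRegression()),
])
m2.fit(train_df[xcols], train_df["Wave Height"])
score2 = m2.score(test_df[xcols], test_df["Wave Height"])
```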
The second is doing somewhat well, explaining almost 5% of the variance. Now let's try the beach; that will be my third model. I'll come down here, go back to this earlier code as a first attempt, and add some comments: this one was poly on period, model 1 up here was linear on period, and what I'm trying now is linear on beach. I'll call this model 3, and delete this again just to keep it clean. The main change is that instead of wave period I want the beach name; scrolling all the way up, I see the column is Beach Name. So I head down here: I want to predict the wave height based on beach name, just like that, and this is also beach name down here. And it complains: it says it could not convert a string to a float, that 'Ohio Street Beach' cannot be converted to a float. That makes a lot of sense: this column, if I take a look at it, is categorical data, and it turns out that linear regression (we'll eventually learn why) wants everything to be numeric. So that's a problem; how do I deal with it? One idea students sometimes come up with is to encode the categories as numbers: say one means Ohio Beach, two means Calumet Beach, three means Montrose Beach. The problem is that if I put numbers like that, the linear regression model assumes they're meaningful, so if the model learns something about Ohio Beach and something about Montrose Beach, it mistakenly thinks it knows something about Calumet Beach too: namely that it's somehow the average of the other two.
Of course that's not true; I put those numbers there arbitrarily, and there's no reason to believe that beach's characteristics are the average of the other two. So that encoding won't work. The idea we'll use instead is called one-hot encoding, and it looks like this: OneHotEncoder. I can say OneHotEncoder().fit_transform on this data right here, and I get this weird thing; it says it's a sparse matrix, which we'll eventually learn more about, but I can convert it to a numpy array, and that I can put into a pandas DataFrame, just like this (let me simplify this a bit). Then, just like before, I want to figure out what the columns are, and just like with the polynomial transformer I can ask for the feature names: onehot.get_feature_names_out, telling it what I was originally operating on, which was the beach. Why is it unhappy? It says the shape of the values passed doesn't match; I think my problem is that I need to say columns equals that. Okay, so how is this working? My first row was Ohio Beach, and you can see I have a column for each different beach: within the row, the Ohio Beach column is set to one, and all the others are zero. A few rows down I have Montrose Beach in position four, so there the one goes under Montrose Beach and zeros go under the others. That's why we call it one-hot: the place where it's 'hot' is where we have a one, and everywhere else it's zero.
So even though I started with categorical data, I end up with something that's all numbers. Let me clean this up; I'll actually delete all of this, because it was kind of a dead end. The closer inspiration is the pipeline I used before with PolynomialFeatures: PolynomialFeatures is a transformer, and so is OneHotEncoder. So I'm going to tweak this. We're on model 3 now, and this one is one-hot, so I blow all this away, put OneHotEncoder there, and keep my linear regression down here at the end. Then down here I say beach name; actually, I don't even need this part, since this will automatically do the fitting for me and tell me how well it's doing. (I wonder why it's complaining up here; it'll probably tell me shortly.) I put beach name here, try it, and sure enough it's invalid syntax because I didn't have a matching bracket. Fixing that, this does even better than before: just knowing the beach name beats knowing the wave period. So my models are getting better and better. First a linear model on the wave period, which was terrible, not even explaining one percent; then a polynomial fit on the wave period, explaining about four and a half percent; then the beach alone, at five and a half percent. The natural next thing is to run the regression on both the beach and a polynomial of the wave period, and maybe do even better, since the two together might give me more information.
And here's where I have a challenge: I want one-hot encoding on the beach name and polynomial features on my other column, so I need some way of combining these. It turns out that the last piece I imported up here, make_column_transformer, lets me stitch together multiple transformers, each of which applies to different columns. So I come down here and copy this as my inspiration for model 4. What has to change? I need something here that can handle both columns, because down here I'm going to pass in both beach name and wave period in an effort to predict my wave height. How do I do that? I call that thing I imported, make_column_transformer, and when I call it I pass a series of transformers: transformer one, then transformer two, and more if I wanted. Each of these transformers is a tuple: the transformer itself, and then the list of columns it applies to. Both look like that, and maybe I'll put a new line between them. So I have these two transformers, each with a list of columns. I want one-hot encoding of the beach name, so I copy my OneHotEncoder right here, and the column it applies to is Beach Name; my other one is PolynomialFeatures, which I copy from up above, and it applies only to Wave Period. Okay, so now I have model 4.
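Model 4 in full, again on synthetic stand-in data, this time with a per-beach offset baked in so that both the beach and the wave period carry information:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# synthetic data: a per-beach offset plus a hump in wave period
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Beach Name": rng.choice(["Ohio Street Beach", "Calumet Beach",
                              "Montrose Beach"], 500),
    "Wave Period": rng.uniform(1, 10, 500),
})
offset = df["Beach Name"].map({"Ohio Street Beach": 0.0,
                               "Calumet Beach": 0.3,
                               "Montrose Beach": -0.2})
df["Wave Height"] = (np.exp(-(df["Wave Period"] - 5) ** 2 / 8)
                     + offset + rng.normal(0, 0.1, 500))
train_df, test_df = train_test_split(df, random_state=0)
xcols = ["Beach Name", "Wave Period"]

# each transformer is paired with the list of columns it applies to
m4 = Pipeline([
    ("both", make_column_transformer(
        (OneHotEncoder(), ["Beach Name"]),
        (PolynomialFeatures(degree=2), ["Wave Period"]),
    )),
    ("lr", LinearRegression()),
])
m4.fit(train_df[xcols], train_df["Wave Height"])
score4 = m4.score(test_df[xcols], test_df["Wave Height"])
```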
I try running it, and now I'm up to 9.5% of variance explained. I wish I were explaining 100%, but I can see that by considering both factors I'm explaining quite a bit more than I would have otherwise. One last thing before I wrap up this video: I want to talk about why we have these names here. We can use pipelines like a dictionary; for example, I can index by the name 'both', and that gives me the column transformer I created right here. If I wanted to, I could peek at what this thing is doing as a way to debug my model: I can call fit_transform on the data I'm working with (I'd need some column names here as well), wrap the result in a DataFrame, and see all these columns from the one-hot encoding alongside the columns from the polynomial features. This is the data I'm actually using to predict the height: even though I start with just two columns, after all this transformation I'm giving the linear regression at the end of my pipeline quite a bit more information to work with, and that's why we're able to do much better here. As the very last step: I saw that model 4 was clearly the best, so that's the one I'll recommend. Now, when you fit four different models, it's possible one of them does better just by luck, and that's why I hung on to my test data: I only used training data for the cross validation scoring.
So at the very, very end, I fit the model to the training data, just like that, and run it, and then I can use it to make predictions. Let me remember how to do this, because it's useful: I can make predictions on my test data frame, predicting what the waves should be, or, as a convenience, score those predictions against test df's wave height. I see I'm now explaining 8.4% of the variance, so I got a little lucky earlier, but this is still clearly better than my other models were. Hello! So, when we started learning about machine learning, the first kind of problem we talked about was regression, and after that we learned about some of the linear algebra underlying it. Now we're moving on to a second kind of machine learning problem: classification. Let me review the main categories of machine learning. There are three main categories (some people say four). There's reinforcement learning, which is about these multi-stage decisions; we're not doing that in 320. We're really interested in supervised machine learning, where we try to make some sort of prediction about the future or some other unknown, and then there's unsupervised learning, where there's no particular thing we're trying to predict, but we're looking for patterns or simplicity within the data. We've been learning about regression, where we want to predict a quantity, and now we're going to learn the other most common kind of supervised machine learning: predicting a category, which is called classification. Just to review the difference: here I have a big data frame, and all of these are features, a mix of quantities and categories.
That mix isn't really relevant; looking at my features doesn't tell me what kind of problem this is. To figure out what kind of problem I'm dealing with, I look at the label: what am I trying to predict, and is it quantitative or categorical? In this case it's quantitative, so this is a regression problem. What will we do? I have some data up here where both my features and my labels are known; I fit a model to that, and then use that same model to predict where the label is unknown, or where I pretend it's unknown so I can test the effectiveness of my model on some test data set. The classification problem looks very similar: again I might have a mix of quantities and categories as my features, but the main difference is that now I have a categorical y, or label, column. Otherwise I'm still fitting my features to my label and then doing some sort of prediction. As I mentioned, we have these big categories, and sklearn has so many different algorithms, or implementations, for each of them. The one we've learned so far is LinearRegression; that's what we've been using for regression, and a linear regression model is what we call a regressor. Very confusingly, what we're going to learn now is called logistic regression, and it is not a regression: it's actually a classifier. The name says regression, but it's a classifier, so don't get confused; even though we're learning linear regression and logistic regression, I'm teaching you one regression model and one classification model. So I'm going to exit out of here and head over to my notebook to introduce this. Here's my notebook; I have some stuff imported, and maybe I'll come back to that later. Let me jump down here for now: I have a data frame with data from a very famous machine learning data set, the iris data set.
The idea of the iris data set is that you have all these measurements of iris flowers: for example, the size of the petals, and the length and width of the sepals, which I guess are the green leaves beneath the petals. There are different varieties of irises, so I've put the variety in this far-right column. This whole thing I'm looking at right now is a test data set, and it only has 10 samples in it, which is tiny; normally we'd have a much larger test set, but I'm keeping it small and simple. I'm passing in this random_state so that even though the split is somewhat random, every time I run it I'll get the same result; if I put a different number here, I'd get a different random order each time. This is just so I can reproduce it: I want something basically random, but reproducible. So I've done the train test split, passing 10 for the test size, and what we'll do is fit different models to the training data and then see how they act on this very small test set. Looking at this, there are three features I'm interested in: we'll look at the dimensions of the samples, and then I have this constant column. Remember that with these models you can have coefficients plus a separate intercept, or you can have just coefficients, where the last coefficient multiplies the ones column; that's what I'm doing here, and I think it'll make the later examples a little simpler. Those are my features. I actually have multiple y columns here, because I'm trying to see if I can predict different things: can I predict the petal width, can I predict whether or not a particular flower is the variety setosa (there are three varieties in general), and can I predict the variety itself? You can see that whenever I have a setosa here, the setosa column is true.
In the other cases it's false, because it's not a setosa. So that's what I'm going to do: see if I can predict these three different label columns based on these three feature columns (really two features plus the constant). There are four things we'll do. First, a regression on the petal width, which is really just review. Second, a binary classification on the setosa column, where we try to predict whether it's true or false; binary means two, and that's why there are just two possibilities, true or false. Third, I'll use that same model not just to say true or false, but to ask for some sort of probability of true and probability of false: rather than just 'it's true,' I'd like to see something like 'there's a 95% chance it's true.' And finally, binary means two, and multi-class means more than two, so we'll do a multi-class classification over here; you can see I have three different categories, and things get a little more complicated in that situation. Okay, I'm going to head down here and start with the regression. I say regression equals LinearRegression; this is just review from before. It has all these options, and the fit_intercept one I'm actually going to set to false: fit_intercept=False. What this normally does when it's true is effectively add the ones column for me; I'm setting it to false because I'm adding that column manually, which will make my example cleaner later on. So I have this thing, and I want to fit it to some data: regression.fit, and when I fit, I pass my X and my y. After that, I can call regression.predict on some other X, and it returns a y; so I have something like y, y2, and X2. In my particular example, I have my training and test data.
So what am I trying to predict right now? The petal width. I'll copy that column name, and when I do my fitting I pull it out of my training data, putting it here in quotes; that's my y. My X will again be my training data, with a list of columns, and I already created that list right here: the x columns are sepal length, sepal width, and constant. That's the list I put inside here, so I get exactly those three columns. In case anybody is having trouble visualizing: here I'm pulling just those three columns out of the bigger frame, whereas over here I'm getting a pandas Series containing the petal width column. So I'm predicting this one column based on those three. Okay, so I fit, and now I want to do some predictions down here, which take the same format, except now I use my test data; maybe I'll attach the result to a variable for now. These are my predictions for what should go in the petal width column, and if I wanted to, I could add them into my test data frame: I could say test['prediction'] equals that and then look at the test data frame. The thing it complains about is that I'm trying to add values to a slice of another data set: the train test split gave me slices of rows from my big data frame, and pandas gets confused when I add columns to something that isn't the original. It's an easy fix: I just say test equals test.copy(), and then test is completely detached from my data frame, and I can add columns without complaint. Let me run that, and now I can see this prediction column over here.
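The regression walkthrough above, sketched with sklearn's built-in copy of the iris data standing in for the lecture's frame (so the column names come from `load_iris`, not the notebook):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# sklearn's built-in iris data, standing in for the lecture's data frame
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["const"] = 1                                  # manual ones column
xcols = ["sepal length (cm)", "sepal width (cm)", "const"]

# random but reproducible split, with a tiny 10-row test set
train, test = train_test_split(df, test_size=10, random_state=320)

reg = LinearRegression(fit_intercept=False)      # const column plays the intercept
reg.fit(train[xcols], train["petal width (cm)"])

test = test.copy()                               # detach the slice first
test["prediction"] = reg.predict(test[xcols])
```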
I can go through and see how these predictions are: I predict 1.3 and it's actually 1.2; I predict 1.59 and it's actually 1.4. Sometimes the predictions are good and sometimes, well, not so great. Anyway, that's a regression; let me go on and try the next piece: a binary classification on this column right here. The code is actually very similar to before. I head down here and change a few things. First, I want a LogisticRegression, and remember, despite the name, it is a classifier, so it can deal with a category like this. For my y, I'll use the setosa column: is it the setosa variety or not? I'll also rename the variable so we remember it's a classifier; I'll call it cls, and down here I also need cls. Now I can see my predictions: this column tells me what the flower actually is, and this column over here tells me what the model predicts it is. We're actually doing quite well; it's completely accurate, which is great. All right, let me go back up. We did the regression using LinearRegression, and the binary classification using LogisticRegression, which basically says true or false. What I'd actually like now is the probability of getting a true, and I can do that like this: there's an extra method, very similar to predict, called predict_proba, and it gives me a numpy array of all the probabilities. The way I interpret it: for this row there's a 94% chance of False and a 5% chance of True, and that's why the model ultimately reduced it to False. Here there's a 97.9% chance of False and only a 2% chance of True, so another False. This one had a 93% chance of being True, which is why I see a True there. And if I wanted to, I could pull out just that last column.
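The binary classification and predict_proba steps, again on the built-in iris copy; the `max_iter` bump is my own addition to avoid convergence warnings and isn't from the lecture:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["const"] = 1
df["setosa"] = (df["target"] == 0)     # binary label: is it a setosa?
xcols = ["sepal length (cm)", "sepal width (cm)", "const"]

train, test = train_test_split(df, test_size=10, random_state=320)
cls = LogisticRegression(fit_intercept=False, max_iter=1000)  # a classifier!
cls.fit(train[xcols], train["setosa"])

test = test.copy()
test["prediction"] = cls.predict(test[xcols])   # True / False
probs = cls.predict_proba(test[xcols])          # columns: P(False), P(True)
```

Each row of `probs` sums to 1, and `predict` just picks the more probable class.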
I write some sort of slice with a row part and a column part: I want all the rows, and the second column. Then I could say test['probability'] equals that and look at the frame again, and in each case I can see, based on these dimensions, what probability the model thinks the flower has of being a setosa. Sometimes it's not quite sure; based on this I could identify the cases where the model wasn't very confident, and other cases where it was quite obvious. I don't need a new model for that; I just call predict_proba instead of predict. Okay, let me head up here and do this last piece: how can I do a multi-class classification on variety? Variety is a little trickier because there are three different categories. I guess it will be trickier when we get into the math behind it, but it is not trickier in terms of actually running the code. I copy this, head down here, and call the model multi so I can keep my different models straight. What else do I need to change? I'm doing a different column this time, so I change that, and now... I see my predictions are still False and True, and that came from here: it's returning True and False because my new model is called multi and I'm still using my old model. I fix that, and now I can see the varieties, and the prediction for this one actually made a mistake. Models make mistakes; that's not surprising, and I think that was the only mistake in this data set. Of course, this probability column is left over from earlier, so I'm just going to ignore it.
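The multi-class version: same code shape as the binary case, but the label column holds three variety names instead of True/False (again on sklearn's iris copy, with `max_iter` as my own safety addition):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["const"] = 1
# map target codes 0/1/2 to variety names like "setosa"
df["variety"] = df["target"].map(dict(enumerate(iris.target_names)))
xcols = ["sepal length (cm)", "sepal width (cm)", "const"]

train, test = train_test_split(df, test_size=10, random_state=320)
multi = LogisticRegression(fit_intercept=False, max_iter=1000)
multi.fit(train[xcols], train["variety"])   # same code, three categories
preds = multi.predict(test[xcols])          # strings, not True/False
probs = multi.predict_proba(test[xcols])    # now one column per variety
```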
I keep overwriting this prediction column for each of my four examples, by the way. Okay, so we've now seen a few different things: a regression and two different kinds of classification. Let's try to get into the linear algebra behind each of these examples. I'm going to head down here, and remember that reg was my regression model from earlier. Inside of it I have coef_ and I also have intercept_. The intercept is zero because earlier I passed fit_intercept=False to all of these models. If I had not done that, I would only have the two numbers here, corresponding to the weights on these two columns, and what is currently my third coefficient would instead have gone into that separate intercept_ variable. So basically these are my weights on my real columns, and this last one is the intercept, the weight on my ones column.

So I have that, and I'd like to reshape it to be however many rows necessary and one column, so it's vertical like that. The other thing I want is my X data, which will be my test DataFrame, just those X columns, with .values to get the underlying array. So I'm pulling those first three columns out of here. I have three columns of data, and the vector over here has three entries, so I can take the dot product of one with the other, and that's exactly how linear regression does predictions. So I can come down here and define regression_predict: given some X values, return X dot those coefficients. I was going to name the coefficient vector as a separate variable, but you know what, I'll just put it here directly; I don't need a separate variable for that.

So I run regression_predict on my X data and I get these predictions, 1.32 and 1.59. If I scroll up to earlier, I see those are exactly the predictions that my LinearRegression made: 1.32, 1.59. So reg.predict, all it's doing is this math right here, a dot product.

Okay, let's see what the logistic regression is doing; it's going to be very similar. Remember, before I was calling cls.predict on my classifier, and these were the values I was getting out of it. What math is that doing? It's almost identical. Let me copy this and tweak it slightly: cls_predict, called on my X data. And one catch: I still had the coefficients from my other model, so I should look at cls.coef_ first; otherwise I'd be wondering why all the numbers are the same, which wouldn't make sense. Now I get all these numbers, and remember, our goal is to predict false or true. The way it works is that we get a score for each row, and if the score is greater than zero we predict true; otherwise we predict false. The shape is a little different here, so maybe I'll reshape so that's more obvious, but otherwise that's what it's doing: all these numbers up here match the ones down here. I think this may even be why they call it logistic regression despite it being a classifier: the math at its heart is basically identical to linear regression. We're just doing a dot product. The only difference between the linear regression we did before and the logistic regression we're working with now is that we check whether the resulting score is greater than zero.
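That dot-product view of both models can be sketched end to end. The data below is synthetic, and fit_intercept=False mirrors the lecture's convention (the explicit ones column is simply omitted here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data for both models.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y_reg = X @ np.array([0.5, -1.0, 2.0])   # a quantity, for the regression
y_cls = y_reg > 0                        # a category, for the classifier

reg = LinearRegression(fit_intercept=False).fit(X, y_reg)
cls = LogisticRegression(fit_intercept=False).fit(X, y_cls)

def regression_predict(X):
    # a linear regression prediction is just X dot the (vertical) weights
    return X @ reg.coef_.reshape(-1, 1)

def cls_predict(X):
    # logistic regression: the same dot product, then "is the score > 0?"
    return (X @ cls.coef_.reshape(-1, 1)).ravel() > 0

assert np.allclose(regression_predict(X).ravel(), reg.predict(X))
assert np.array_equal(cls_predict(X), cls.predict(X))
```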
Okay, let's do the next piece. Going back through my earlier examples: after predict, which tells us whether we predict true or false, we wanted the probability that a flower is a setosa. So how can we get a probability out of this? If I head back down here, you see that before I added the greater-than comparison, I had a numeric score. So I'm going to go back to that, and compare it against predict_proba. I see I have all these scores, and of course they are not probabilities, because a probability has to be between zero and one. But it turns out there's a very simple function that can turn a score into a probability, and it's called the sigmoid function. I had it at the beginning of the notebook but haven't talked about it yet, so let me head up briefly. We don't have to go into a lot of the details of the math, and I don't care if you remember the exact formula; I certainly don't. The important thing is that the x value going in can be as large or as small as we want: for very negative inputs the output effectively approaches zero, and for very large inputs it effectively approaches one. The nice thing is that I can take any sort of numeric score and get back a result between zero and one, so I can turn some other kind of score into something that at least looks like a probability. So I'm going to call this sigmoid function down here, and instead of numbers like -2.76 or 2.57, I take the sigmoid of all of them, and they become numbers between zero and one. And it turns out that's exactly how the probabilities are actually computed.
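That relationship between the raw scores, the sigmoid, and predict_proba can be sketched directly (synthetic data; the sigmoid here is the standard 1/(1 + e^-z)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = X[:, 0] - X[:, 1] > 0

cls = LogisticRegression(fit_intercept=False).fit(X, y)

scores = (X @ cls.coef_.reshape(-1, 1)).ravel()  # raw scores, any magnitude
# the sigmoid of the raw score is exactly predict_proba's P(True) column
assert np.allclose(sigmoid(scores), cls.predict_proba(X)[:, 1])
```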
Just as before, where I predicted the category from X, I'll do the same down here; maybe right above, I'll call predict_proba and look at what's happening. These numbers right here, my computed probabilities of it being true, match this column right here, the probability of it being true from the model. The way I've written my code, I'm not computing the probability of it being false, because, well, that's boring: the two add up to one, so why compute both? But you can see these numbers are exactly identical. So getting the probability is very simple: I'm still doing that core dot product, and then just applying the sigmoid to every number in the result.

For all these cases so far, this dot product has been extremely important, but I've been taking the dot product of a matrix with a vertical vector. Let me go up and talk about the fourth model we built, the multi-class model. That was right here: I was trying to predict variety, which can be any one of three values, and it's going to turn out that my coefficients for variety are a lot more complicated. If I look at my binary model, its coefficients are just these three numbers; if I look at the coefficients for my multi-class model, I have not a vector but a whole matrix full of numbers. So let me use those. Matching the way scikit-learn lays them out, we're going to have to transpose them. Then it's very similar again, but instead of multiplying X by just a vertical vector, I multiply it by this whole matrix, and I get this result. Let me write this up quickly as multi_predict, a function of X, and note that at this point I'm not applying a sigmoid anymore.
So how do I interpret this? Basically, every one of these columns corresponds to one of the three varieties. If I had more varieties, I would have more columns up here, but I would still have three rows in this coefficient matrix. I don't know yet which column is which, but you might imagine this one is the setosa variety, and it turns out that when I take my big matrix of data and multiply it by the coefficient matrix, it works column by column: it takes this coefficient column times my data matrix and uses that to produce this output column. So these might be the scores for how much each flower looks like a setosa; if these were the coefficients for a versicolor, then those would be the versicolor scores; and if these were the coefficients for a virginica, those would be the virginica scores. Each row of the result corresponds to a row of my original data, one particular flower, one particular iris, and for it I have a score for how much it's like a setosa, how much it's like a versicolor, and how much it's like a virginica. So what I can do is look across each row and see which score is the highest; in this case it's the middle one, so whatever type of iris corresponds to that middle column is what I want to predict. And I can check that: the model has an attribute, I think it's classes_, and from it I can see, okay, this one is setosa, so that first flower would be a versicolor. One of the ways I can actually compute this is with a function called argmax; there's an argmin too, but what I want is argmax, which gives me the index holding the largest value. I do have to specify an axis, where zero is down and one is across, so I'm going to say axis one. So what is this telling me?
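The whole multi-class recipe can be sketched like this, with made-up three-class data standing in for the iris varieties:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the iris measurements.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = np.array(["setosa", "versicolor", "virginica"])[rng.integers(0, 3, size=60)]

multi = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y)

scores = X @ multi.coef_.T       # one column of scores per class
best = scores.argmax(axis=1)     # per row: index of the highest score
manual = multi.classes_[best]    # index into classes_ to recover the names

assert np.array_equal(manual, multi.predict(X))
```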
I'm going to reshape the result to (-1, 1) so it's a little more obvious. What this means is that in the first row, index one holds the biggest number; in the second row, the biggest number is here. Is that true? Yes, that's true: this number is bigger than those other ones. Let me check the fourth one: it says position zero is biggest, and indeed this is bigger than these, so it's absolutely right. And what's cool is that if I have a NumPy array of indices like this, I can use it to index into another NumPy array: I can take classes_ and put this whole array of numbers into it, and I get back the corresponding categories. So if I take all of this and put it back together, I get the predictions for all of my flowers. I can see for each of them what the model predicts, and this corresponds with my predictions from earlier. So again, the dot product is at the heart of it, but now, since I have several possible classes, I need coefficients that produce a score for each class, and that's why we have to multiply the data by a full matrix instead of just a vector. And that's the first example like that we've seen, a real practical use case, in this course.

In this video I want to talk about how we can visualize the decisions, or predictions, that are made by a classifier. When we do regressions, we often visualize them by drawing a fit line; the equivalent here, at least when we have two features, is to draw those two features on the x and y axes and then separate the area into two regions, one where we predict true and another where we predict false. The function we'll use from matplotlib is called contourf, and it can do that kind of plot. So we want to evaluate cls on a bunch of different values for both sepal length, which I'm going to put on the y axis, and sepal width, which I'll put on the x axis.
The way to get every combination of these two features, maybe most easily, is with a NumPy meshgrid. The reason I'm using a meshgrid is that it creates arrays with all the combinations in exactly the form I'm going to need later, so let me leave that as a comment: creates arrays in the form needed for contourf later. It returns two arrays, one for each of my variables, so I'll have sepal_w and sepal_l, putting sepal width first. Then I have to give it a range for each, so I'll have range one and range two, and what meshgrid really does is give me every combination of those two ranges. Here I'll say np.arange from 0 to 10 with a 0.1 step, and the same thing over here as well. Okay, so I have those two; let me take a look at what they look like. Both of these are two-dimensional matrices of the same shape, showing me every combination, and the values in them are just coordinates: the first one gives me the x coordinate, basically the sepal width, and the other gives me the y coordinate, the sepal length.

If I wanted to, I could then call plt.contourf, which takes three things: my matrix of x coordinates, my matrix of y coordinates, and then something that gives the color, which can be some sort of expression. If I pass sepal_w I get stripes going from left to right; if I pass sepal_l the stripes run the other way; or I could use some mathematical expression, and at each point it shows what happens if I multiply the value in one grid by the value in the other, giving these nice contour maps.

What I'd really like, though, is just two levels, two numbers; let me show you, right now there are lots of different numbers, and I'd like just a one and a zero that correspond to predictions. My x axis is going to be the sepal width and my y axis will be the sepal length, so I have to get all of this data into a format where I can do some predictions. Let me leave this plot here for now as the goal I'm working toward. What I'd like to do is put these grids into a DataFrame, so I'll make a contour DataFrame, cdf, and it has to have all of the X columns from up here, because I want to do predictions on it. I'll do my constant column first, that's easy, that's just one. The feature values I can pull from the grids, so I could put sepal_w up here, but this won't quite work, because these are those big square matrices, and to go into a DataFrame each one has to be a simple one-dimensional column. So I have to flatten them, and once I do, I can look at my cdf and see that I really have every combination of length and width, plus my constant column. This is now a great format for doing predictions, because it has everything my predictor needs: I can just say cls.predict on it, and I get all of these values. If I wanted to, I could add those to the DataFrame as well, in a column I'll call prediction. I almost wrote petal width as the thing I'm predicting, but sorry, no, it's the
category: is it a setosa or not? That's what I'm trying to predict. So I have that, and now I'd like to head down and do my contourf plot. Let me think a little carefully here. If I look at the shape of sepal_w, it's 100 by 100, because my range had 100 numbers in it. So this is a 100-by-100 matrix, this is a 100-by-100 matrix, and the color values are also going to have to be a 100-by-100 matrix, but right now my predictions are just a flat column. Just as I had to flatten those matrices into columns before, now I have to go the opposite direction: take the prediction values and reshape them to match the format of the grids. I can actually just use the grids for this and say my predictions should be whatever shape my sepal_w values are, and everything lines up nicely. Now I can see that the two sides of the plot correspond to a prediction of it being either a setosa or not a setosa.

Let me put a scatter plot on top of this. I have my DataFrame of all my original flowers, and I'm going to plot all of them. Ideally I would plot only the training data on top, so I'd have a better sense of what errors are made, but I only have about 10 rows of testing data, so I'm just going to plot the whole thing. Actually, I'd like to separate this out: I'll have something like setosa.plot.scatter and then other.plot.scatter, and I can get those two frames with some filtering: setosa equals the rows of the DataFrame where variety equals setosa, and my other ones will basically be where it is not setosa. Okay, so what do I put down here? My x value should be sepal width, since that's what I put on the x axis, and my y axis should be sepal length.

So I do that, and the same thing down here for the second scatter, and I see, okay, I have my decision boundaries and then my two separate scatter plots below. I'd like these all on the same plot, and normally the way we've done that is by saying ax equals the first plot and passing that down below. But let me show you quickly what the type would be if I did that here: it's this QuadContourSet thing, and we cannot plot on top of that. So if I want to reuse the same axes, the same plot area, what I can do is call matplotlib's get-current-axes function, plt.gca(), and that gives me an axes object I can pass in elsewhere. So up here I'll say ax equals that, and the same thing here, so they all go to the same area.

Let me do that, and now I have all of these points. Hmm, it's kind of strange that they don't overlap that boundary. Did I get my axes mixed up? I did: here it's sepal width, sepal length, so let me just switch this back; this should have been sepal length and this should have been sepal width. Now that makes a lot more sense: the boundary really does try to separate the ones that are setosas from the ones that are not. To actually make this work visually, the colors should differ in some way, so I'll make the setosas red, and maybe the others can be, I don't know, gray. So now I can start to see what mistakes will be made: there's one setosa that is not going to be recognized as a setosa, because it's on the wrong side of this boundary. I should probably also have some labels here, so I'll say label equals setosa, and down here label equals other, and then I can see what's happening here.
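The whole recipe can be sketched end to end. Everything here is a synthetic stand-in: plain NumPy arrays instead of the lecture's DataFrame-with-a-ones-column, and an ordinary fitted intercept instead of fit_intercept=False:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: column 0 plays sepal width, column 1 sepal length.
rng = np.random.default_rng(4)
X = rng.uniform(1, 9, size=(50, 2))
y = X[:, 0] + X[:, 1] > 10                  # pretend "is it a setosa" label

cls = LogisticRegression().fit(X, y)

# every combination of the two ranges, in the shape contourf needs
sepal_w, sepal_l = np.meshgrid(np.arange(0, 10, 0.1), np.arange(0, 10, 0.1))

# flatten the grids into plain columns, predict, then reshape back
grid = np.column_stack([sepal_w.ravel(), sepal_l.ravel()])
pred = cls.predict(grid).astype(float).reshape(sepal_w.shape)

plt.contourf(sepal_w, sepal_l, pred)        # two regions: setosa vs. not
ax = plt.gca()                              # reuse the same axes for scatters
ax.scatter(X[y, 0], X[y, 1], color="red", label="setosa")
ax.scatter(X[~y, 0], X[~y, 1], color="gray", label="other")
ax.legend()
plt.savefig("boundary.png")
```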
So there are a couple of steps to all of this; maybe I can delete these extra experiments up here so I have a minimal example. I had to create a meshgrid, where for every point one grid had the x value and the other had the y value. I had to reshape those to convert them to DataFrame columns, at which point I really had every combination of the two features, one per row. Once I did that, I could add predictions for every combination. Then, when I converted my predictions back to the meshgrid format, like those two grids have, I could do my contour, and that's how I created this map. On top of that, I plotted my scatter points to see what's happening.

One last thing I want to do here: I want to try some polynomial fits. Just as we can use polynomial features for a regular regression, we can use them for classification as well. So let me import some things: from sklearn.pipeline import Pipeline, and from sklearn.preprocessing import PolynomialFeatures. Okay, so what did I have before? I just had a LogisticRegression with, what was it, let me just search up the page, it was this fit_intercept=False that I had for all of them. So I had that; this was my model before, and now it's going to be part of a pipeline. I'm going to say pipe = Pipeline, and the pipeline takes a list of stages, just like this. The logistic regression will be a stage; actually, sorry, it will be the last stage, and before it I want my PolynomialFeatures. Then the other trick is that each stage of my pipeline has to be a tuple, and the reason is that I have to give each stage a name. I guess I'll just call this one poly and this one lr. So I have that set up, and just like before, I can call fit with train of my x columns and then train of my y column.

I guessed my y column was petal width, so let me copy that in, and I get some sort of error: unknown label type, continuous. Oh, I'm sorry, I'm trying to predict whether or not it's a setosa. It's complaining because petal width is continuous; it's saying, hey, you're trying to do a classification on a quantity, and we don't do classifications on quantities, we do regressions on quantities. So what I want here is the is-setosa column, that's what I care about, and then it trains, and that's all great. Now if I want to, I can come back and repeat all of these plotting steps; maybe I'm just going to move all of this up here, and when I'm plotting those decision boundaries, instead of using my simple classifier from before, I can use my pipeline classifier. And what happens? Well, the boundary line between the regions is just slightly more curved now. It doesn't quite help; that red point is still on the wrong side. But you can see I can get different boundary shapes depending on the complexity of my model and what I have in the pipeline beforehand.

In this video I'll be talking a little bit about how we can score our models on our training data. Scoring them involves learning some new metrics and terminology. It turns out there are different kinds of errors, and if we sometimes care more about one kind of error than another, we might want to use different metrics to deal with that case. For this I have some really simple dummy data: basically, I have a DataFrame with an x column, where x is a number, and a y column, which is a boolean that's true whenever x is positive
and false whenever it's negative. Okay, so not much data there. Just for simplicity, I'm breaking it into the first half and the second half. I wouldn't normally do this, because maybe the data isn't shuffled and I'd get very different data in the two halves, but it keeps the example simple. So there I have it: my training data and my test data down here. Now, if I want to figure out the relationship between y and x, and then measure the model's understanding of that relationship, I'm going to use some sort of scoring function. So I'm going to do this whole logistic regression; however, despite the name this is not really a regression, it's actually a classifier, because I have categorical data that I'm trying to predict. I'm going to train one of those and then score it. The first step is lr = LogisticRegression, and then lr.fit. Let me actually run this once, so I can hit Shift+Tab and see the hint: I have to give it the X data and the y data, both from the training data first. So I say the training data, then the columns I want, which is just x, and then the thing I'm trying to predict, which is just the y values, like that. After I do that, I can run a command with a very similar shape, and that command is score: it evaluates how well the model does on the test data, and I get 0.75. So what is this score function doing?
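The fit-then-score flow just described can be sketched like this; the dummy values are made up, so the resulting score needn't be the lecture's 0.75:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dummy data in the spirit of the lecture: y is True whenever x is positive
# (the notebook's exact values aren't shown, so these are invented).
df = pd.DataFrame({"x": [-3.0, -2.0, -1.0, 2.0, -0.5, 0.5, 1.0, 3.0]})
df["y"] = df["x"] > 0

train, test = df.iloc[:4], df.iloc[4:]     # naive first-half/second-half split

lr = LogisticRegression()
lr.fit(train[["x"]], train["y"])           # X columns first, then the y column

score = lr.score(test[["x"]], test["y"])   # for this estimator: mean accuracy
```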
It turns out it's a shortcut, and I'll show you in the documentation what it's doing. If I head over here, I see that the score function for the logistic regression gives me the mean accuracy; if you look at other estimators, they might be using other metrics for their scoring. Okay, so mean accuracy is the default here, but it turns out there are lots of different metrics we could use. If I go to the metrics page for scikit-learn, I can see that, hey, there's a whole bunch of metrics here related to classification, some for clustering, which we haven't talked about yet, and then a whole bunch related to regression as well. The one I'm using right now, the default, is that accuracy score; you saw how the docs say we're just getting the accuracy, and accuracy is very simple: what percentage of the time did we get it right? So instead of using that score shortcut, I certainly could have used this function myself, and I think that's a good thing to do right now, because once we understand how to use it manually, without the shortcut, we'll understand how to use all these other metric functions as well. I can see it's in the metrics submodule, so when I head back to my notebook, I'm going to say from sklearn.metrics import accuracy_score, like on that page I was just on, and run that.

When I call accuracy_score and hit Shift+Tab, basically what I'm passing is: what are the true values, and what did my model predict? There are different ways I could do that. I could say, well, the true values are a and b, but I actually predicted a and c, and it turns out if I do that, I'm just 50 percent right; the first argument was my actual values and the second my predicted values. So let me actually get these from up above. Before, I was calling score and giving it my x values and my y values, and I'm going to pull pieces from that. The first piece is to figure out what the predicted values are, so I have to call predict here instead of score; when I'm predicting, I'm not giving it any y values, because predict is what tells me the y values. Then I also need my actual values, which were just the y column from that second half of the data. So let me look at the actual values and the predicted values as lists. The actual values were true, false, false, false, but I'm predicting true, true, false, false, so the second one is an error. I'm wrong 25% of the time, so the accuracy is going to be 75%, which is what we saw before, and we see it here too: when I pass in these two things, the actual values and the predicted values, I get that 75%.

Okay, so that works fine, but there are going to be cases where we don't just want to know how often we're right; we want to know what kind of mistakes we're making. For example, let's imagine different things this y column might mean. Say y means something is a good investment, maybe for a stock. I don't need to know about every good investment, but if I have a system that tells me, hey, these are some good investments, and it's always right, even if it doesn't tell me about every good investment, that's a pretty good system. In contrast, maybe y is telling me, I don't know, whether somebody is contagious with COVID-19, and in that case it's much safer to make the mistake of saying they are contagious even though they're not. There are different kinds of errors: false positives and false negatives.
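That manual accuracy check can be written out directly; the two lists below are reconstructed from the spoken walkthrough (one mistake in four, giving 75%):

```python
from sklearn.metrics import accuracy_score

actual    = [True, False, False, False]
predicted = [True, True,  False, False]   # the second prediction is wrong

# accuracy_score(y_true, y_pred): the fraction of positions that match
acc = accuracy_score(actual, predicted)
assert acc == 0.75
```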
There are a lot of metrics based on those, and the simplest place to start is with something called a confusion matrix. A confusion matrix shows the categories that things actually are, and then how they get classified, including how they mistakenly get classified as other things. Just as before, where we had actual and predicted lists, a confusion matrix starts from the same two things. So imagine I had pictures of animals: four dogs, three cats, and two mice, and some machine learning system that looks at those pictures and predicts what they are, sometimes incorrectly. What I can do is create a confusion matrix using scikit-learn; it's also under metrics, just like accuracy_score. And just as with accuracy_score, I pass in the true values and then the predicted values, something like actual and predicted. The raw output is a little bit confusing, because each of these values counts how many examples fall into a specific actual category and a specific predicted category, and it's not really clear how to line it up: is this first row dog, or is it cat? So what people often do is pass labels, like this, just to control the order; for example, I'll say cat, dog, mouse. If I pass in those labels, the layout follows them; if I say dog, cat instead, you can see the numbers switch around a little. And another reason to pass labels, besides controlling the order, is that there may be categories I know exist that don't even show up in the data; there are no horses here, say.

So let me really try to talk through these numbers, and I think it'll be easier if I put them in a DataFrame. I compute the confusion matrix and then wrap it in pd.DataFrame, and when I'm creating the DataFrame from this, both the index labels and the column labels are going to be the same list. So when I'm looking at this confusion matrix, what does it mean? The row means what the animal actually is, and the column is what it got classified as. I can see that there are four dogs, and of those four, the system correctly classified three as dogs, but one of the dogs was mistakenly identified as a cat. It looks like there are three cats in the data; two of them were correctly identified as cats and one was considered a dog. And I can see other things here, like the system is really good at mice: it always correctly calls the mice and doesn't mix them up with anything. This is useful: with this matrix I can see in what ways the classifier is confused, which is why it's called a confusion matrix; it shows me how the model is confused. So hopefully that's helpful.

Now, it's very common to have confusion matrices where the classes, instead of being different animals, are just true and false; that's what we get for a binary classifier. So let me go back to what I had before, when I was computing the accuracy score, and create a confusion matrix there; I'll just copy some of this. In this case the labels are false and true, and passing them doesn't actually change anything here. So I have these two categories, and just like before, I think it's going to be helpful to put the result in a DataFrame, just like that. Here again, the row tells me what it actually is and the column tells me how it got classified.
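Both confusion matrices follow the same pattern; here's a sketch using counts chosen to match the animal example (four dogs, three cats, two mice):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

actual    = ["dog", "dog", "dog", "dog", "cat", "cat", "cat", "mouse", "mouse"]
predicted = ["dog", "dog", "dog", "cat", "cat", "cat", "dog", "mouse", "mouse"]

labels = ["cat", "dog", "mouse"]          # fixes the row/column order
cm = confusion_matrix(actual, predicted, labels=labels)

# rows: what it actually is; columns: what it got classified as
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```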
That would mean there were no mistakes, no confusion. Okay, so it turns out there are special names for each of these four values, and I'm going to go through them quickly. Let me put this in an actual DataFrame, so my confusion matrix is a DataFrame now. If I look at position (1, 1), the bottom right, those are called true positives; in practice, try to remember these, because this terminology is important. (If I had been smart with this example, I would have made sure all four numbers were different so we could identify them more easily.) The other number on the diagonal, at the top left, those are the true negatives; true means the model did the correct thing. People often abbreviate these as TP and TN. Then there are the mistakes, which are the false ones. Where are the false positives? A false positive means the example is actually false in the data, but it gets classified as true; that puts me in row 0, column 1. And the opposite is down at the bottom left: sometimes it actually is true but the model says it's false, and that's a false negative, FN. So those are the four different cases, and a lot of the statistics we're going to look at in the next video are combinations of these; I'll talk more then about why they're meaningful. In the last video we learned a little bit about confusion matrices.
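A minimal binary sketch of those four cells, with hypothetical data; ravel() flattens the 2x2 matrix row by row, which with labels=[False, True] gives TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

# With labels=[False, True] the layout is:
#   row 0 = actually False, row 1 = actually True
#   col 0 = predicted False, col 1 = predicted True
actual    = [False, False, False, True, True, True]
predicted = [False, True,  False, True, True, False]

cm = confusion_matrix(actual, predicted, labels=[False, True])
tn, fp, fn, tp = cm.ravel()   # flatten row by row: TN, FP, FN, TP
print(tn, fp, fn, tp)
```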
Confusion matrices give you the whole picture, but often we want to summarize things in just one or two numbers. One of the most important numbers, which we've already seen, is accuracy: accuracy tells us what percentage of the time our model is correct. But when the model is wrong, accuracy doesn't really tell us what kind of mistakes are being made. So we're going to learn two metrics, recall and precision, which you can really think of as accuracy on a subset of the data; they'll still be numbers between zero and one, but they'll help us pinpoint where the mistakes are actually being made. Okay, so to review: here's a confusion matrix. Along the rows I have what the data actually is; along the columns I have what the model thinks it is. Right now I have zeros in all these places, but if I were to see an actual mouse and the model predicted mouse, I would go to the mouse row and the mouse column and increment that number by one. That's good: any time we increment a number on the diagonal of the confusion matrix, the model made the right decision. Here's an example of a wrong decision: if our model took a look at a picture that is clearly a dog (in need of some grooming) and predicted cat, then we go to the dog row, because it's actually a dog, and the cat column, because that's what was predicted, and increment that. We might do this over our whole data set and end up with a bunch of numbers there. From those we might want to figure out the accuracy: what percentage of the time were we correct? The way I think of that is adding up all the numbers on the diagonal (that's how many we got correct) and dividing by the sum of all the numbers in the matrix; here that's eight over ten, or 80%. Some observations: since this is a subset divided by a larger total, it will always be between zero and one, and the good numbers are always in the numerator.
For accuracy, then, one is the best possible score, and precision and recall have those same properties, but they're computed on different subsets of the matrix; we're no longer taking the whole diagonal divided by the whole matrix. It turns out we can have these metrics for each class, so I'm actually going to have six different metrics here: dog recall, cat recall, mouse recall, and similarly dog precision, cat precision, and mouse precision. I'm just going to look at a few of these. When I ask what the cat recall is, what I want to know is: when we actually have a cat, what percentage of the time is the model right? Since I'm asking about what is actually the case, I divide by the sum of the numbers in a row, because each row represents what the data actually is. So the denominator is the sum of the cat row, and the numerator is just a single number, the (cat, cat) cell: how many times did we actually call a cat a cat? In this case we get two over four. (A way to remember recall versus precision: recall has an r, and row also has an r.) If I look at dog recall, asking when we actually have a dog, what percentage of the time is the model right, I just look at that top dog row and divide the (dog, dog) cell by the sum of that row. In this case we always get it right when we see a dog: four over four, 100% dog recall. The precision questions ask something a little bit different. For dog precision we're asking: when the model predicts that it's a dog, what percentage of the time is it right? Since we're now looking at all the predictions, we're talking about columns, because each prediction lands in a column. So we divide the (dog, dog) cell at the top left by the sum of the dog column, all the different things predicted as dog, and we get four over six.
Then similarly for cat precision, we divide the (cat, cat) cell by the cat column, and we see there's perfect precision there. Hopefully what you can see is that the model is making different kinds of mistakes: for the cat we're great on precision but have a recall problem, and for the dog it's the opposite, perfect recall but poor precision. These two metrics that show an error, cat recall and dog precision, are really two ways of looking at the same problem: sometimes we see a cat and think it's a dog; the opposite never happens. I'm not going to talk about it more here, but I just want to give you some exposure to it: often people try to reduce these numbers down to a single score. For example, there's a popular metric in machine learning called the F1 score, and a lot of these single-number scores are just combinations of other metrics like precision and recall, so precision and recall are building blocks for other metrics. Let me head over to Jupyter and write some code for this. Here I have my confusion matrix converted to a DataFrame, shown down here, similar to the one in the slides, but the numbers are larger now and I also have a horse. The diagonal is good, and I can see this model is actually not doing too badly: there are a lot of large numbers on the diagonal. But I see there's a horse problem: when the model sees a horse, 90% of the time it thinks it's a dog. The other problem is that about half of the cats it sees get misclassified as dogs. So, having already produced this confusion matrix, what I'm going to do is look at accuracy_score, recall_score, precision_score, and finally a new metric, the balanced accuracy score, that I'll introduce. First let's take a look at the accuracy score. I'm going to run accuracy_score, and I need to feed it the actual values and then the predicted values, so I'll pass in actual and predicted.
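As a cross-check on the slide arithmetic above, the row and column sums can be computed by hand; the matrix values here are assumptions reconstructed from the numbers quoted (8/10 accuracy, 2/4 cat recall, 4/6 dog precision):

```python
import pandas as pd

# Assumed matrix matching the slide arithmetic (rows = actual, cols = predicted)
labels = ["dog", "cat", "mouse"]
cm = pd.DataFrame([[4, 0, 0],
                   [2, 2, 0],
                   [0, 0, 2]], index=labels, columns=labels)

accuracy      = cm.values.diagonal().sum() / cm.values.sum()  # 8 / 10
cat_recall    = cm.loc["cat", "cat"] / cm.loc["cat"].sum()    # row sum: 2 / 4
dog_recall    = cm.loc["dog", "dog"] / cm.loc["dog"].sum()    # 4 / 4
dog_precision = cm.loc["dog", "dog"] / cm["dog"].sum()        # column sum: 4 / 6
cat_precision = cm.loc["cat", "cat"] / cm["cat"].sum()        # 2 / 2
```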
Those are the two lists I used to construct my confusion matrix, and I see that the accuracy is 78%, so about 80%. All right, that seems pretty good. The key thing to note here is that when we have all these different classes, it might seem like we're doing well overall, but there can still be classes where we're making a lot of mistakes. For example, when we see a cat we end up being wrong half of the time, and worse, when we see a horse we're wrong 90% of the time. These other metrics are going to help us dig in and actually identify that. Okay, so let's say I want to look at recall for the horse, which I'm expecting to be 10%: when we see a horse, we only recognize it 10% of the time. One way I could do that is to take the horse value from the bottom right of my confusion matrix and divide it by the sum of all the values in the horse row; I do that and get 10%, just as I expected. The shorter way would be to use one of the score functions built into sklearn, and here I grab precision_score. So I call it with the true values and the predicted values, actual and predicted, and I actually get an error complaining about something called multi-class versus binary. These metrics are set up by default for the simple case where the two classes are just false and true, as opposed to four classes like dog, cat, mouse, and horse. So I have to clean that up a little, and the way I do that is to change the average parameter. There are different ways to summarize the information; I'm just going to set average to None. What that does is give me four scores, one for each class. Now, the order might not be the same as up here, so I'm going to pass in the labels as well to make sure I can line these numbers up with the classes.
So what I see here... actually, I want to do recall first, sorry, so let me switch to recall_score. For recall I'm going row by row, and what I see is that recall for the dog is perfect: if the model sees a dog, it's going to recognize it as a dog. It's also perfect for mice: if it sees a mouse, it recognizes it as a mouse. For cats, it's 50-50 whether it correctly identifies one, and for horses there's only a 10% chance that it correctly identifies one. Okay, those are my four recall numbers. Sometimes what I'll want to do is see how I'm doing overall by taking the average of those, and I get 65% in this case. It turns out there's a special name for this average of recall scores, and that special name is the balanced accuracy score. So before, the accuracy score was saying we're doing 80%, but when I compute the balanced accuracy score it's only 65%, much worse, and in some ways this is more meaningful. The only reason we looked so accurate before is that we were seeing very few horses: even though our model is terrible with horses, we could still score well because there aren't many horses in the data. Balanced metrics try to account for that: even though we have more dogs than anything else, we're going to consider these four classes equally important when coming up with our score. This is a great one to use if you have a lot of imbalance in your data set, where plain accuracy can be misleading. Okay, so that was the recall score; now let me similarly do precision, which I guess I was already doing inadvertently earlier. I'll paste that, swap in precision_score, and now I see something different.
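A short sketch of the relationship just described, with hypothetical lists; balanced accuracy is just the unweighted mean of the per-class recalls:

```python
from sklearn.metrics import recall_score, balanced_accuracy_score

labels = ["dog", "cat", "mouse", "horse"]
# Made-up actual/predicted lists; the notebook's real lists are much longer
actual    = ["dog", "dog", "cat", "cat", "mouse", "horse", "horse"]
predicted = ["dog", "dog", "cat", "dog", "mouse", "dog",   "dog"]

# average=None gives one recall per label, in the order of `labels`
per_class = recall_score(actual, predicted, labels=labels, average=None)
print(per_class)          # dog 1.0, cat 0.5, mouse 1.0, horse 0.0

# Averaging those recalls is exactly the balanced accuracy score
print(per_class.mean())
print(balanced_accuracy_score(actual, predicted))   # same number
```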
Now we do perfect on everything except the dog. Why is that? When I'm talking about precision, I'm going column by column, and what I see here is that, except on the diagonal, I have only zeros in the cat, mouse, and horse columns. That means if this model predicts a cat, a mouse, or a horse, it's right; only when it predicts a dog is there a good chance it's making a mistake, and in that case there's only about a two-thirds chance it's actually a dog. So this model likes to predict dogs a lot: if it predicts something else, it's sure, but if it predicts a dog, it's only about two-thirds sure. All right, so we've talked about accuracy, recall, balanced accuracy (which is an average of the recalls), and precision. One last thing I want to talk about is binary classification. For binary classification, instead of cat, dog, mouse, we have just false and true, and I'm computing the confusion matrix for that here. If I want to, I can compute these same metrics like I did before. For example, if I run recall_score down here, I can pass in False and True for my labels (it was unhappy at first only because I hadn't run the cell yet). I run it, and it tells me, row by row, that in the first row one-third is correct, and in the second one 70% is correct; those are my two recall scores, just like before. But it turns out that with binary classification metrics, people will often just talk about recall and precision without specifying which class they mean, and when they do that, they're talking about the positive class. So if I just ask for recall in general, with no labels and no average argument, I'm talking about the positive class, and the same goes for precision. And this is probably the majority of the cases where you'll see precision and recall used: the special case of a binary classifier.
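Here's a small sketch of that default, with made-up labels; with no extra arguments, sklearn reports the score for just the positive (True) class:

```python
from sklearn.metrics import recall_score, precision_score

actual    = [False, False, False, True, True, True, True]
predicted = [False, True,  True,  True, True, True, False]

# average=None gives one score per class, in the order of `labels`
both = recall_score(actual, predicted, labels=[False, True], average=None)
print(both)     # recall for the False class, then for the True class

# With no extra arguments, only the positive (True) class is reported
print(recall_score(actual, predicted))      # equals both[1]
print(precision_score(actual, predicted))
```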
So just know that when we do that, we're talking about the positive class. Well, in this video I'm talking about two topics: one is regularization and the other is standardization. Standardization is something we have to use a lot in this course and understand, and it's relatively simple. Regularization, on the other hand, is a very complicated topic that really needs a lot of time in a more advanced machine learning course; I'm not going to get into any math about it, I'm just going to try to give the simplest intuition. We aren't going to get deep into it, but I want you to know that it's an important and deeper topic. In terms of things we've already done: we've been using logistic regression a lot, and a problem it has that I haven't talked about is that it's very sensitive to scaling. For example, you might have a data set with some numbers in specific units, and you might get one result from the classification, and then if you change those units, say from miles to feet, you might get a different result. That's of course not what we want: we care about the actual information, not what units somebody arbitrarily chose to use. Why is that? Well, it's because logistic regression applies a technique called regularization, which tries to use smaller coefficients and, in general, not put a very large coefficient on just one of our features. You can imagine that if I have lots and lots of feature columns, just by chance one of them might do better on the training data than the others, even if the other feature columns are still somewhat useful. What I wouldn't want to do is choose that one best column just by chance, because then the model won't work well later on a test data set. Regularization basically provides a motivation to use multiple features and not weight any one too heavily, even if that would do better in the short term.
So logistic regression does this; linear regression, which was the first model we learned in this course, does not, but there are models very similar to linear regression that do, such as ridge regression and lasso regression. We're not going to talk about those in 320, but they're important and used all the time, so know that this regularization thing is a big deal. So what would we really like? We don't want our model to be sensitive to units; we would like to standardize the data in some way so that the same numbers go in regardless of what the original units were. For this example I just made up a fake scenario: we're measuring some quantity in the real world three times, and based on that we're trying to predict what category it's in; the category will be either true or false. The underlying rule is that when the true length, which we don't know, is bigger than five, the category is true; when it's less than five, the category is false. So these three noisy measurements, even though they don't tell us exactly what the true length is, give us some information that can help us guess whether it's true or false. So here I have that fake data: the y column is the category, and then I have my three measurements, x1, x2, and x3. Let me just talk a little bit about how I'm generating this. Under numpy.random there are a bunch of functions that will generate random data; I'm using a normal distribution here. (You can sample from different distributions; if you don't know what that means, that's fine for this course.) Basically I'm generating a thousand random numbers with an average of four and putting them in an array, and then I'm saying that whenever that value is greater than five I want a True, and otherwise a False.
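A sketch of generating data like this (the spread of the distributions and the seed are assumptions; the lecture's exact parameters aren't shown):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the sketch is reproducible

n = 1000
true_feet = rng.normal(4, 2, n)          # hidden quantity, mean 4 (spread assumed)
train = pd.DataFrame({
    "y":  true_feet > 5,                 # category: True when the true length > 5
    "x1": true_feet + rng.normal(0, 1, n),   # three noisy measurements
    "x2": true_feet + rng.normal(0, 1, n),
    "x3": true_feet + rng.normal(0, 1, n),
})
print(train.head())
```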
When I'm looking at this DataFrame down here, true feet does not directly go into any of the feature columns (it's not known), but the category does, and the category is what we're trying to predict. So how are we going to predict it if we don't know the true feet? Well, I have these three other columns, which are basically just the true feet plus some random noise. Let me look at the first row: all three measurements were less than five, so it makes a lot of sense to predict that y is false. Let me look at some cases where it's true. In this one, all the measurements were greater than five, so we say it's true. And here's a more interesting example: one measurement was almost seven, and even though the other two measurements were less than five, that was enough of a signal; it is true overall, so hopefully the model will decide the same. Okay, so that's the data we're working with; let's see if we can train a model to predict this. I'm going to create a logistic regression model and fit it to my data, so I need some X and some y. For my y, I'll just pull the y column from my training data, and for my X I want to pass in a list of all the columns that contain features: x1, x2, and x3. I'm going to be using these again, so I'll put them in a variable called x_columns, and then I don't have to keep typing that whole list every time. So I fit it, and that's great. Pretty soon I'm going to look at the coefficients for this model, but before that, as an aside, I just want to see what accuracy it has.
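Putting the fitting steps together, a self-contained sketch might look like this (the data generation is a smaller stand-in for the lecture's, so the exact numbers are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Rebuild a small version of the fake data (assumed shape, seeded)
rng = np.random.default_rng(0)
true_feet = rng.normal(4, 2, 200)
train = pd.DataFrame({"y": true_feet > 5})
for col in ["x1", "x2", "x3"]:
    train[col] = true_feet + rng.normal(0, 1, 200)

x_columns = ["x1", "x2", "x3"]   # saved so we don't retype the list every time
model = LogisticRegression()
model.fit(train[x_columns], train["y"])
print(model.coef_)               # shape (1, 3): one coefficient per feature
```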
If I want to see the accuracy of a model, instead of fit I can call score, and to be realistic I should score it on data the model hasn't seen before, not the data I trained it on; that makes for a better test. I see that it has 89% accuracy. Does 89% seem high, that we would be right that often? Let me show you why it's not necessarily impressive. If I look at this y column, I see it's almost always False, and indeed if I call value_counts I can see it's True less than 20% of the time (I can divide by the length of the test set to see the proportions). What this means is that even a very naive model that always says False would get about 81% accuracy. 89% is better, but in that context it's not so amazing, given that there's so much skew in that column. Okay, so after I look at the accuracy of a model, the next thing I'll often want to look at are my coefficients, and I like to plot those in some way. I can see my model's coefficients right here, and I'd like some sort of bar plot where they're paired up with the x_columns: this is the coefficient for x1, and so on and so forth. The way I'll often make such a bar plot is with pd.Series: I pass the coefficients as the values (those go on the y-axis) and set index equal to the column names (those go on the x-axis), then call .plot.bar(). And it complains that the length of one of these things is 1 when it was supposed to be 3. x_columns is simple enough, but if I look at the coefficient array, I see it's really a two-dimensional thing. I can flatten it into a one-dimensional array with reshape, and if I say -1 it will make it one-dimensional no matter how many numbers there are; it figures that out. So I put that here, and now I get my plot.
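The reshape fix can be sketched like this; the coefficient values are stand-ins, since the real ones come from the fitted model:

```python
import numpy as np
import pandas as pd

x_columns = ["x1", "x2", "x3"]
coef = np.array([[0.9, -0.2, 0.8]])   # stand-in for model.coef_, shape (1, 3)

# pd.Series(coef, index=x_columns) fails: coef is 2-D with outer length 1.
# reshape(-1) flattens it to one dimension, however many numbers it holds.
s = pd.Series(coef.reshape(-1), index=x_columns)
print(s)
# In the notebook this is followed by s.plot.bar() to draw the bar chart.
```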
And this is interesting. I was talking about how sometimes, just by chance, we focus more on one column than another, and that happened here: x1, x2, and x3 were all equally noisy, but it just so happens that, based on the training data, the model thinks x1 is more useful. That was just by chance, and you can imagine a worse scenario where the model picks one column it really likes and ignores all the other ones that have good information in them. The model will try to avoid doing that: regularization means it will try not to put too much weight on just one factor, it will try to spread it out a little. Taken to an extreme, you might imagine a model that just looks at its intercept (you can think of the intercept as being like the average) and always predicts the same thing, with all the coefficients at zero. We don't want that either; there's a trade-off, because we also want to be accurate. Okay, so let me head back here, and I'm going to re-randomly generate this data, but this time I'm going to change the units on one column to miles. There are 5,280 feet in a mile, so I'm just going to make a comment here: feet to miles. So I have the same kind of data, just in different units, and I might hope that my model won't do anything very differently. I run this again and see that not too much has changed here. Now I want to think about what's going to happen when I re-run the fit. In this x2 column the numbers are all much smaller now, because it's in miles, so to make use of this column I might expect a bigger coefficient on x2 if I wanted everything to be just like before. It turns out that when I run it, I actually see the opposite: the model is averse to having such a large coefficient on one column, because of that regularization I talked about.
So it actually decides: I'm just going to ignore x2 entirely, because I'd have to put a bigger weight on it than I'm comfortable with for it to be a factor. So I just lost some information; the model isn't really using that column anymore. Of course that's silly: putting a bigger coefficient on it isn't really weighting it more, it's just cancelling out the fact that it has different units. So there are different ways to solve this. One is that I could just insist on having the same units for everything. Another way is to try to make the columns more uniform in some way, and that's what I'm going to do here. I'm going to head back and take my training data and take a slice of it: all the rows, and the columns x1 through x3 (these are just my features), and I want to standardize them somehow so that they all have roughly the same scale. So I'm going to pull this out into a variable x, and there are going to be two things I do. One is that I take the mean of each column and subtract those numbers off of each column, so that all the columns are centered at zero; after I subtract away the mean, the average of every column is zero. (It turns out that's also helpful for making logistic regression run faster; I'm not going to get into details about why.) Then, more importantly, I want the columns to be on the same scale, and for that I look at the standard deviation of each column.
If a column has larger numbers, its standard deviation will be higher, and the real key part of standardization is that I divide everything by that standard deviation. If I do that, I get a bunch of small numbers that have roughly the same scale: after this, every column will have the same average of zero and the same standard deviation of one. So this is better X data, and I can actually put it back into my training data, assigning this slice to my new X data; I'll make a comment out here saying "standardize the data". After I do that and run all of this again, I see that, great, x2 is back in play: even though the columns have different units, the model isn't getting obsessed with the other columns just based on the units. So this was a good thing to do, and that's what standardization is. Now, it turns out that to do this right, I have to calculate the mean and standard deviation on the training data, and then use that same mean and standard deviation on the test data; I can't re-take the mean on the test data. So the methodology gets a little bit complicated, and generally we won't do this manually; we'll have sklearn do it for us. It turns out there's a preprocessing class called StandardScaler, and we're going to use that instead of doing this by hand.
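A sketch of the manual standardization described above (the data here is made up; the slicing mirrors the lecture's train.loc[:, x_columns] pattern):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "x1": rng.normal(4, 2, 500),
    "x2": rng.normal(4, 2, 500) / 5280,   # same quantity, but in miles
    "x3": rng.normal(4, 2, 500),
})

x_columns = ["x1", "x2", "x3"]
X = train.loc[:, x_columns]

# Center each column (mean 0), then divide by its standard deviation (sd 1)
X = (X - X.mean()) / X.std()
train.loc[:, x_columns] = X             # standardize the data in place

print(train[x_columns].mean().round(6))   # approximately 0 for every column
print(train[x_columns].std().round(6))    # 1 for every column
```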
So I head back here, and you can see I've already imported StandardScaler; I'm going to skip the manual version now. I'll actually leave my old model for now as the bad model, and what I'll do is create a new model, which will be a pipeline. In that pipeline I want a StandardScaler followed by a LogisticRegression, so I have to actually create those and give them names; I pass in tuples, with commas to separate the name from the estimator, calling the first one the standard scaler and the second the logistic regression. This is my new model, and it turns out all the stuff I was doing before, like fitting, works the same way: it can fit just like I did before, because this new thing also has fit, and I can also score it like I did before; when I score it now, I get something very similar. What's going to be interesting is when I actually make that bar plot of coefficients again: it should show that x2 is back in play, because even though the non-standardized version ignores x2, this version shouldn't. But there's a small error here, and the problem is that the pipeline as a whole doesn't have coefficients; the logistic regression inside of it does. So how can I get to that? It turns out that a pipeline works like a dictionary: I can copy the names I gave the steps and use one like a key. One key would get me the StandardScaler from the beginning, and the other gives me the LogisticRegression stage, and from that I can actually see the coefficients involved; I paste that in here instead of what I originally had. So now I can see that when I have this standardization in play, as a transformer before my estimator, it automatically does the right things: when I fit, it calculates the mean and standard deviation from the training data, and when I score on the test data, it uses that same mean and standard deviation from before rather than recomputing them there, which would be a kind of methodological mistake. So we're going to do this generally whenever we have a logistic regression, unless we have some very special scenario, for example if the data has already been standardized.
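A sketch of the pipeline version, with a small synthetic stand-in for the data; note the dictionary-style lookup to reach the coefficients inside the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for the lecture's data (assumed)
rng = np.random.default_rng(0)
true_feet = rng.normal(4, 2, 300)
train = pd.DataFrame({"y": true_feet > 5})
for col in ["x1", "x2", "x3"]:
    train[col] = true_feet + rng.normal(0, 1, 300)
train["x2"] /= 5280   # change x2's units; the scaler should undo the damage

model = Pipeline([
    ("standardscaler", StandardScaler()),        # named steps, passed as tuples
    ("logisticregression", LogisticRegression()),
])
model.fit(train[["x1", "x2", "x3"]], train["y"])

# The pipeline itself has no coef_; index it like a dict to reach the stage
coef = model["logisticregression"].coef_
print(coef.reshape(-1))
```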
We've been doing a lot of supervised learning lately, in particular regression and classification. Now I'm going to give an example of an unsupervised learning problem, which is clustering. Clustering might feel like it has some similarities to classification. In classification, I would sometimes show these scatterplots where there were different kinds of points, and those points were labeled (maybe some red points and some blue points, whatever), and we were trying to find boundaries or rules to separate the different kinds of data points we had. We did that based on predetermined labels that came with the data: we knew which kind of point was which. In clustering, we might similarly have some sort of scatter of data, or the multi-dimensional equivalent of that, but the difference is that there are no pre-existing labels on the data; that's what makes this an unsupervised learning problem. The algorithm itself gets to choose the labels. There are a million different ways you could choose to apply labels to an existing data set, but we still have some constraints, or maybe I should say a goal: our goal is to choose those labels so that we're grouping similar data together, and there are ways to measure that. So clustering is this general problem, and there are lots of different clustering algorithms; by far the most famous is k-means, so that's where I'm going to start. I'm doing some imports here, including the KMeans that comes in sklearn, but to help you understand how the algorithm works, I'm actually just going to write the code from scratch in this video before we start using the built-in one. In sklearn there's a datasets submodule that can make blobs, and these blobs are basically clusters: you tell it how many points you want, how many different centers they cluster around, and then something about the standard deviation. It returns two things: an X, which is actually two columns, an x0 and an x1, and a y, which indicates which cluster or blob each point is part of.
I don't really care about y, so I'm just going to throw it away, and I'll put those two X columns into this DataFrame, just like here. What we're going to work toward is trying to find the clusters of different points in here, each centered around something. So here's a picture of the points that generated: you can see it's pretty random, although they kind of center around three different points. I'm just putting a question mark in the legend for now, because these points are unlabeled; there's no real category. I have x0 along the x-axis and x1 along the y-axis for my coordinates, and ultimately to draw this I'm doing a DataFrame .plot.scatter, like we've done lots of times before. The reason I'm writing this special function, km_scatter (km stands for k-means, and I'll talk a little more about why it has that name), is that I'm going to want to show different symbols for different points, and there's no easy way to specify a column that gives the type of symbol, so we have to loop over that ourselves. The symbol is going to be determined by a column called label, if there is one (not necessarily). This is automatically going to plot x0 along the x-axis, and so on and so forth, and I'm going to be using it as I go forward. You can probably already see there are three clusters here, and we actually know that, because we randomly generated the data, but how can we find good indicators of where those are? Those indicators are going to be called centroids; we're ultimately trying to say, here are the centers of the three clusters we discovered. How can we do that automatically? Trying to find the three best points is a hard problem. An easier problem in general, rather than finding the best answer directly, is to just take a bad answer and make it slightly better; if you know how to do that, and you can repeat it, that often ends up giving us a pretty good answer.
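A sketch of generating the blobs as described (the spread and seed are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_blobs

# 100 points scattered around 3 centers; cluster_std controls the spread
X, y = make_blobs(n_samples=100, centers=3, cluster_std=1.5, random_state=0)

# Throw y away; keep just the two coordinate columns, like the lecture's DataFrame
df = pd.DataFrame(X, columns=["x0", "x1"])
print(df.shape)    # (100, 2)
```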
This strategy — take a bad answer and repeatedly improve it — is very pervasive in machine learning, and it's the strategy we're going to use now for k-means. So we take a bad answer, and the bad answer looks like this: I'm just going to randomly choose some starting points and assign each of them a different symbol. For now, I'm going to assume that I have three clusters here; we'll eventually revisit that assumption. I'll scatter them down here, and you can see that this is where it thinks the three clusters are — and of course that's horrible; that's not where the three clusters are. So how can we automatically identify the centers of those three clusters? The strategy is to alternate between two things. First, we do something called assignment, which takes each of the points and puts it in the cluster whose centroid is nearest to it. These three things are centroids — a centroid is kind of a two-dimensional mean: the average x0 and the average x1. And that's the "means" in the name k-means: k is just a variable, so really we have k means, in this case three centroids, and we're going to find the best locations for them. So, like I was saying, we assign each point to the closest centroid — that's the point-assignment step. The other step is to update where the centroids are, so they move closer to the points assigned to them. We keep alternating back and forth between deciding which points go with which centroid and deciding where the centroids are, and eventually it should converge and discover the three clusters.

To do this, I'm going to build a new class, which I'll call KM, and give it an __init__ method. I'll pass in the DataFrame with all my data. In a lot of implementations, people would also specify something like how many clusters there are; for simplicity, I've already created this DataFrame of clusters right here — if I look at it, I already have the starting data for the three centroids — and I'm going to keep that outside of my class for now, just to keep the code a little cleaner. Given these things, I'll say self.clusters = clusters. But I'm going to be making a lot of changes, and I don't want to change the original data, so I'll make copies of both DataFrames. Let me also check what's in the label column: I'll take self.clusters, look at that label column, convert it to a list, save it as self.labels, and maybe print it. Okay, so I want to create one of these things — KM, passing in my DataFrame with all my points, and my clusters — and, that was a little silly of me, I want to save that in a variable. Cool: those are my three cluster labels. I'm choosing cluster names that happen to be plot symbols, so I can easily plot them; typically people just arbitrarily call the clusters one, two, and three. Remember, there's no label in the original data: the original data looks like this, with 100 rows, and my clusters DataFrame looks similar — it has the two x values and also a label.
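A minimal sketch of that constructor, assuming the attribute names mentioned in the video (df, clusters, labels):

```python
import pandas as pd

class KM:
    """From-scratch k-means helper; attribute names follow the lecture,
    but the details here are a reconstruction, not the exact code."""
    def __init__(self, df, clusters):
        # copy both frames so the algorithm never mutates the caller's data
        self.df = df.copy()
        self.clusters = clusters.copy()
        # the cluster names, pulled out of the label column as a list
        self.labels = self.clusters["label"].to_list()

points = pd.DataFrame({"x0": [1.0, 2.0], "x1": [3.0, 4.0]})
clusters = pd.DataFrame({"label": ["o", "+", "x"],
                         "x0": [0.0, 1.0, 2.0],
                         "x1": [0.0, 1.0, 2.0]})
km = KM(points, clusters)
print(km.labels)  # → ['o', '+', 'x']
```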
One of the first things I want is to be able to plot this as we go, since we'll be making changes. So I'll grab the plotting code I had before. I can't use the df and clusters variables directly anymore, because those are attributes now: I want self.df and self.clusters — and there we go, that's the initial state of the system we want to improve.

Remember the two phases I talked about. We'll have a method where we assign the points — really we're drawing from the clusters (or maybe I should call them centroids) to the points: where the centroids are affects what happens to the points, since we assign each point to a centroid. So I'll have assign_points, and the other method will be update_centers. By alternately calling one and then the other, over and over, we should ultimately end up with a good solution to this problem.

First off, how do I do the centroid assignment? That's maybe the harder one; the update method is a little easier. What I want is, for each of the points, to assign it to one of the clusters — the closest one. So maybe the first thing I'll do — let me work with km.df down here for a moment — is add some columns; this is one of the reasons I copied the DataFrame in __init__. I'll add a column for each cluster that says how close each point is to that cluster's centroid, and once I've added those three columns, I'll add yet another column that says which one is closest — which cluster each point actually wants to be in. I'm going to loop over all the clusters — the labels, I guess — using itertuples, which gives me named tuples, so inside assign_points I know where the center of each cluster is. Now I update my points DataFrame: I look at the x0 column and compute, for each row, the difference between the point and the center along the x0 axis — the point's x0 minus the centroid's x0 — and save that as x0_diff, then do the same along the other dimension. What I ultimately want is the distance between each point and the center of the cluster, and I have the differences along the two dimensions, so the distance is x0_diff squared plus x1_diff squared, all raised to the 0.5 power to take the square root. And let me think here: the centroid coordinates are individual numbers — one value per pass through the loop — but x0_diff and x1_diff are whole columns, so really I'm computing all the distances at once. So I'll say self.df of something equals those distances — I'm adding a new column, and the column name I'll use is the cluster I'm currently on. If I run this — I'm not printing anything right now; let me clean up and just delete that earlier print — and look at what happened to the DataFrame: something horrible. It's adding these weird columns, because what I really wanted for the column name was the cluster's label. I fix that, and now this is great: I can see my x0 and my x1 — that's a point — and then how far that point is from the o cluster, the + cluster, and the x cluster. It's closest to the o cluster, so that's the one I ultimately want it in.

So after I've looped and computed those three distance columns, I want to set self.df["label"] to one of the three. Let me poke around down here first to see how to get that: I want to look at those three columns and figure out which one has the smallest value in each row, and it turns out there's a pandas function that does that very easily, called idxmin. Normally it goes column by column and tells me, oh, the smallest value in the o column is at position 11, the smallest value in the + column is at position 78. That's not quite what I want — I want it to go horizontally. Rather than the index values over here on the left, I want to look across the columns and ask which column gives the smallest value in each row. So instead of axis=0, which is vertical, I say axis=1, which is horizontal, and then I get all of those labels. I put this back into assign_points and run it again. Looking at my DataFrame: I have my original data, which never changes, by the way — the data never changes; then the distance to each of the clusters; and then, based on those, the first row's o distance is the smallest number, so it's an o point — same for the second — and for the third, the smallest of the three values is under the x column, so it's in the x cluster. I've been able to assign all of the points. Let me show you what that looks like: here are the points, and after the assignment, instead of question marks, the plot shows what each point is. You can see that the circle cluster is really big — it's actually capturing most of the points.
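That assignment step can be sketched as a standalone helper; the column names mirror the video, but the helper itself is my reconstruction:

```python
import pandas as pd

def assign_points(df, clusters):
    # one distance column per centroid: square root of the summed
    # squared differences along x0 and x1 (all rows at once)
    for c in clusters.itertuples():
        df[c.label] = ((df["x0"] - c.x0) ** 2 + (df["x1"] - c.x1) ** 2) ** 0.5
    # idxmin(axis=1) walks each row horizontally and reports which
    # distance column held the smallest value -- that's the new label
    df["label"] = df[list(clusters["label"])].idxmin(axis=1)
    return df

points = pd.DataFrame({"x0": [0.0, 5.0], "x1": [0.0, 5.0]})
cents = pd.DataFrame({"label": ["o", "x"], "x0": [0.0, 4.0], "x1": [0.0, 4.0]})
print(assign_points(points, cents)["label"].tolist())  # → ['o', 'x']
```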
This other one kind of has the opposite problem: there's one actual cluster of data being shared between the plus points and the x points over here. But it is some clustering, and now that we've started with a bad answer, we can make it better. The way to make it better is that, now that I've decided which points are in the circle cluster, I can find where that circle cluster really is. You can see that this red circle is not a very good center — it's way to the right of all the points it's representing. So this next step is a little bit easier: now we're going to update the center points.

Before I do that, one other thing. Notice how I keep calling one method and then the other, saying km, km each time. When people don't have to return anything from their methods — and I don't return anything here — what they'll often do is just return self. The advantage is that when I call assign_points, it does its work and then returns the KM object, so because it's returning the object, I can just chain the next call along. That's one reason you often see people returning self from a method, and I'll do the same thing down in update_centers.

But let's actually update those centers, and the easiest way to do that is with a groupby. If I go back and look at that DataFrame, ultimately what I want is to find the new centroids, which are the averages of the x columns for each label. So I can say groupby("label") — that leaves me this weird groupby object (let me stop plotting for a moment) — but then I can compute the means on it, just like so. When I do a groupby, the label moves to the index, over here on the left, and I get the mean of all the other columns. That's too much stuff, though: when I'm computing centroids, I don't care about the averages of the distance columns anymore, so I'll say I just want my x columns. The last thing that's a little weird: you noticed before that label started off as a regular column, but the groupby made it not a column — it made it an index. I didn't really want that, so I'll do a reset_index here. So this one line is a quick way to compute what I'd like the new clusters to be. It has the same shape as before — label, x0, and x1 — but now, instead of my centroids being random and horrible, they actually mean something: each centroid sits at the center of the cluster of data it represents. So update_centers is going to be very simple. I'll split it across two lines: clusters equals that expression, and then self.clusters = clusters. The first step takes the mean for each label, and the rest just pulls out the columns I want and fixes it up into the original shape. Okay — I haven't called it yet; I've only called assign_points, the one we usually start with. But now, after assigning the points, I want to update the centers.
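The update step boils down to that one groupby line; a small self-contained sketch:

```python
import pandas as pd

def update_centers(df):
    """Move each centroid to the mean of the points currently assigned
    to it: group by label, take the means of the x columns, then
    reset_index to turn label back into a regular column."""
    return df.groupby("label")[["x0", "x1"]].mean().reset_index()

pts = pd.DataFrame({"x0": [0.0, 2.0, 10.0],
                    "x1": [0.0, 2.0, 10.0],
                    "label": ["o", "o", "x"]})
print(update_centers(pts))
#   label    x0    x1
# 0     o   1.0   1.0
# 1     x  10.0  10.0
```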
Let me do a quick experiment: I wonder if I can chain both steps together in one cell — that makes my life a little easier — great, I can see both plots. So the data started looking like this: I assign the points to clusters, then update the cluster centroids. Here I update the points and — wait a minute, what happened there? Let me restart; I ran it twice, sorry. All right: the first plot is assigning the points, and the second is updating the centroids. A couple of things happened. This red circle moved to the left, closer to where it's supposed to be, and this plus had been hanging way out with no reason to. If I run the two steps again, it should get even better — and this time, not much happened: the red points had already moved, and not much happened over here on the left. But do you see what happened down here? The plus sign grabbed some more points after it moved in, and since it grabbed those, the x's remaining points have a center of gravity farther to the left. So when I update again, the x gets bumped a little more to the left, and if I keep running this, it should keep getting bumped farther over. If I keep updating — it might take a few tries; it got stuck there; there we go — you can actually see that I've run into a problem. The problem is that I've hit a local minimum: I can clearly see it would be better if this red circle jumped up here to the top and the x grabbed the cluster down at the bottom, but it's not doing that.
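Pulling the assignment and update steps together, the alternating loop might look like this (the fit function, the epochs cap, and the early break are assumptions in the spirit of the lecture, not its exact code):

```python
import pandas as pd

def assign_points(df, clusters):
    # assignment step: distance to every centroid, then nearest wins
    for c in clusters.itertuples():
        df[c.label] = ((df["x0"] - c.x0) ** 2 + (df["x1"] - c.x1) ** 2) ** 0.5
    df["label"] = df[list(clusters["label"])].idxmin(axis=1)
    return df

def update_centers(df):
    # update step: each centroid moves to the mean of its assigned points
    return df.groupby("label")[["x0", "x1"]].mean().reset_index()

def fit(df, clusters, epochs=300):
    # alternate the two steps; epochs is only an upper bound, since we
    # break as soon as the centroids stop moving (i.e. we converged)
    df, clusters = df.copy(), clusters.copy()
    for _ in range(epochs):
        df = assign_points(df, clusters)
        new = update_centers(df)
        if new.equals(clusters):
            break
        clusters = new
    return df, clusters

pts = pd.DataFrame({"x0": [0.0, 1.0, 10.0, 11.0],
                    "x1": [0.0, 1.0, 10.0, 11.0]})
starts = pd.DataFrame({"label": ["a", "b"], "x0": [0.0, 5.0], "x1": [0.0, 5.0]})
assigned, centers = fit(pts, starts)
print(centers)  # centroids settle at (0.5, 0.5) and (10.5, 10.5)
```

Note that nothing here protects against a bad starting position: a single run like this can still get stuck in a local minimum.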
It basically has to get worse before it gets better — that's what we call hitting a local minimum. So how do I solve that? It turns out there's not a bug in my code; I got unlucky, and I wasn't anticipating it, but it will happen, and it's a nice opportunity to talk about it. I got unlucky because of where the starting centroids were: I chose them randomly, and they happened to land somewhere that didn't gravitate toward the three actual clusters. This is a problem for every implementation. If I go to the real one, it has a parameter for the number of times it should try running the algorithm; each time, it starts from different random starting points and updates them, in the hope that the run converges, and then it takes the best of those runs. I wasn't anticipating this happening during the demo — it didn't happen when I practiced — but let me redo it now. I'll leave my data alone and just re-randomly generate my starting points, so these are my three starting points now; and remember, the default in the real one is to start over 10 different times and see what happens each time. Now when I run this — let me clear out any old plots that would be confusing — I'll say: here's my starting point, then km.assign_points (that's what it was called), then the second step, update_centers. And okay: it assigned the points, then updated the centers, and I got a lot closer this time. It's not perfect — there's still some weirdness where this point is assigned over here — but if I keep going, even just a couple more passes, it quickly finds where the three clusters are. Usually it's somewhere in between: it may take a few iterations to actually converge on the right thing.

Okay — in the last video we built our own k-means class, and in this one we're going to learn how the one that comes with sklearn works. Usually you'll want to use that one, because it gets all the tricky details right. For example, you can easily have it generate varying numbers of starting clusters; it often has strategies smarter than pure randomness for choosing the starting positions before it runs; and it generally has logic for noticing when updating the centers isn't getting any better, so while you can set an upper cap on iterations, it won't do more than necessary. So overall, you'll want to use sklearn's KMeans rather than rolling your own. KMeans has three methods we need to know. There's fit, which is not surprising — we've seen fit for both transformers and estimators. What's a little strange is that it has both transform, like you'd expect from a transformer, and predict, like you'd expect from an estimator. So KMeans has similarities with both, even though prediction is kind of a strange use of the word here: we aren't predicting some label that was given to us; we're both coming up with the labels and predicting them at the same time, so it's not really classic prediction. I'm going to show some rough code to demonstrate what these three methods do, in the context of the code we wrote before, and then we'll go into actually using KMeans. In our class we had assign_points and update_centers, and that was the real core of what we needed. A fit method will probably be doing some sort of loop — for i in range of something — that calls both of those: it calls assign_points, and it calls update_centers, a bunch of times, to try to find the right answer. How many times? The number of times it does that is the number of epochs, so I might say epochs here. And like I was saying, in the actual KMeans that comes with sklearn, this is an upper bound: if it sees the iterations aren't improving anything further, there's some break that fires once it's done getting better. I'm just writing rough code to give you an idea of what's happening. When we run fit on our version and plot at the end, hopefully it solves the problem — and indeed it does: it figures out where each of the centroids should go. So that's the fit method. Now, in the process of doing all this, we created a lot of supplemental information alongside our original DataFrame. The original DataFrame just has points; in contrast, the DataFrame our k-means class used has extra information. First there's the distance to each of the clusters, and that can be useful information in and of itself. And looking at those three distances, we can figure out which number is smallest: in this row the x number is smallest, so the row is in the x cluster; in the next one the o number is smallest, so it's in the o cluster. So when we use KMeans for either transformation or prediction, the only difference is whether we're using the distances or the labels. Let me do the transformation first to show you what we'll get: this distances data is effectively what comes out when we do a
transformation in k-means. And I'll eventually talk about how that's useful as a pre-processing step before something like logistic regression. For prediction, all we're really getting is which group each point fits nicely into — and again, this is not really classic prediction, because we're both deciding what the labels are and deciding which points go with each label.

Okay, I'm going to use KMeans on this same data — the KMeans from sklearn this time, instead of our own. So I'll say km = KMeans, and there are a bunch of configuration options here, for example how many clusters we want; let's say we'd like three of them. Then km.fit — we always have to fit, regardless of whether we're doing a transformation or a prediction next — and I want to fit that DataFrame (let me look at it one more time with a head). Once we've fit, we can do either of the two things. We could call transform — and there might be cases where this was training data and we're applying, maybe forcing, our clusters onto some second DataFrame, perhaps test data — but it'll be very common to apply it to the same original data. When I do the transformation here, I have three clusters, and that's why I get three columns of numbers: the three distances for each row of my original data set. And rather than calling fit and then transform, it's very common to do both at once, just like so — that would be a fine thing to do. In the same way, I can also do a fit_predict, and then instead of distances it tells me specifically: oh, you're in group 0, you're in group 2, and so on. Something we might want to do is create a copy of the original DataFrame and add this prediction in. What am I going to call it? Maybe cluster — I could also call it classification, something like that. Now I can see which cluster each point is predicted to be in; maybe I'll look at the tail to see some others. And I could plot it, assigning different colors to the different clusters: .plot.scatter with x equal to x0 and y equal to x1 — we have x's along both dimensions — and since I'd really like to see the colors, I'll pass in color equal to that cluster column. But you notice some of these vanish — the zeros end up being white — so I should pass in a different colormap. Let me head over and look at the different colormaps in matplotlib. Cluster 0 is not more similar to cluster 1 than it is to cluster 2, so I don't want what's called a sequential colormap, anything on a spectrum; 0, 1, and 2 are just different categories to me, so I'm looking at the qualitative colormaps. I'll go with this one — it's a set of distinct colors, and I'm not going to have more than 10 clusters — so I'll ask for the tab10 colormap, and now it actually gives different colors to those different groups of points. If I wanted to, I could also look at the centroids and draw them on top. I get the centroids with km.cluster_centers_ — and what am I getting here? The coordinates of each centroid form a row, and I have three centroids, which is why there are three rows. I can absolutely wrap that up in a DataFrame and plot it: .plot.scatter with x0 on the x-axis and x1 on the y-axis, plotting those three points, and let me make them larger and red — color equals red, and size equals 100 (actually the parameter is s here). And I should really combine this with what I
had before, so I can actually see the centroids on top of the points. To really make that work, I have to say that both scatters should use the same axes — let me split this up, since it's getting too long — and pass the same ax to the centroid scatter. Now I can do all the same stuff I did with our own version before.

Let me address an issue: how did I know we should use three clusters? The answer is that I just eyeballed it. What if there were 20 clusters? That might not be so easy. Or what if, instead of nice two-dimensional data, I had x0, x1, x2, x3, x4, x5? It wouldn't be obvious up front how many clusters there are. So the strategy is to try different numbers of clusters and see how well each does, and the measure of how well it does is called inertia. I can look at the inertia of our fitted model like this. What is it measuring? It's the sum of squared distances from each point to its nearest centroid. So, for example, this point over here is actually pretty far from its centroid, so it contributes a lot to the score, whereas this one right here is really close to a centroid. Hopefully everything sits neatly around a centroid — and of course, the more centroids I have, the lower this inertia number goes. Lower is better: it means everything is near a centroid. So what we'll do is try different numbers of clusters and see how quickly the inertia drops off. Let me go back and grab the code I had before — actually, this is all I really need. I'm going to have a little loop, and I don't even care about making predictions anymore; I just want the inertia score. You can see what I'm going to do: I'll try different amounts, and of course, as I add more clusters the inertia goes down — in the extreme, if I had as many clusters as points, each point would get its own cluster. So: for k in range, and I want 1 to 10 clusters. That k is why it's called k-means — k is the number of centroids, and "means" refers to the fact that a centroid is the average of the x-y values assigned to it. I run this, and I want to collect all the scores — in a dictionary, or better than a dictionary, a Series. So scores equals a Series, and then scores of k equals this inertia. When I try running it, there are still a couple of issues. One, which is actually relatively new in pandas, is that it doesn't like leaving the element type ambiguous, so I'll be explicit up front: this Series is going to hold floats. The other complaint is a KeyError of 1, and the reason is that when I just put brackets after a Series, pandas has to guess whether I mean an index label or an integer position, and it guessed incorrectly — integer position — which of course doesn't work here. If I change it to index by label, I get my scores, and once I have my scores, I can of course plot them. I should also set some labels: the x label is k, the number of clusters, and the y label is the sum of squared distances to the nearest centroid. Looking at the plot, I see that having two centroids is much better than having one, so there are at least two very clear clusters; from two to three there's another big improvement; and after that, it doesn't make sense to have four centroids — that's not going to give much improvement. Let's try running this again; I'll rerun it from the top.
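The sklearn workflow from this section — fit, transform, predict, cluster_centers_, and the inertia loop — can be sketched end to end; the make_blobs data here is a stand-in for the frame used in the video:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
df = pd.DataFrame(X, columns=["x0", "x1"])

# n_init is the number of random restarts; sklearn keeps the best run,
# which is how it dodges the local-minimum problem we hit by hand
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(df)
dists = km.transform(df)       # distance to each centroid, shape (100, 3)
labels = km.predict(df)        # cluster id (0, 1, or 2) per point
centers = km.cluster_centers_  # one coordinate row per centroid, shape (3, 2)

# elbow-plot data: inertia (sum of squared distances to the nearest
# centroid) for k = 1..9, collected in a float Series keyed by label k
scores = pd.Series(dtype=float)
for k in range(1, 10):
    scores[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df).inertia_

print(dists.shape, labels.shape, centers.shape)  # → (100, 3) (100,) (3, 2)
```

From here, `scores.plot.line()` draws the elbow, and `df.plot.scatter(x="x0", y="x1", c=labels, cmap="tab10")` colors the points by cluster.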
Sometimes the generated blobs will tend to overlap each other — just because way back here I created three of them doesn't mean there will be three clear clusters at the end. Let me run this again. Here it's not as clear whether it's two clusters or three, and not surprisingly, there's less benefit going from two to three. Let me try running it a couple more times to get some intuition. Here it's very clear that we want to go to three. I want to see one where the blobs are really overlapping — it's kind of a matter of luck here. What about that one? Not so much benefit going from two to three, because these two blobs overlap each other so much. Okay, so that's what you'll often want to do: one of the use cases for all this is creating a plot like this just so you can say something about your data — how many distinct clusters there are. Next time — where were my notes? — I'm talking about the uses for these transformations: why we might want data about the distance to each of the clusters.

In the last couple of videos, we've been looking at how we can use k-means to identify how many clusters there are in the data, and that can be useful in and of itself. Another common strategy is to use k-means as pre-processing for another stage in our pipeline. More generally, you might apply some unsupervised learning technique, like k-means or principal component analysis, to create better inputs for a supervised technique, for example logistic regression. I've really tried to create data that will make this work well. Here you can see my training data on the left and my test data on the right. In the training data, I've created five clusters, sitting kind of right on top of the rest of the data; out of the five clusters, four have black dots and one has gray dots, and other than that, there are just gray dots randomly distributed throughout the space. The data on the right is what we want to predict on. Clearly there are some patterns here: as a human, I might predict that these points are in a similar area as this cluster of black dots over here, so I'd probably guess they're black, while other dots — say, in this space right here — are probably gray, because in the training data those were gray. But it would certainly be hard to draw a single line that separates the black dots from the gray dots, which is what logistic regression needs, so I have to do some sort of pre-processing.

Okay, I'm going to create my pipeline down here: p is a Pipeline, and a Pipeline is just a list of steps. The last step — the most important one — is going to be LogisticRegression, and the first is going to be standard scaling; eventually I'll add KMeans in between, as a pre-processing step, to help the logistic regression work better. So I'm going to fit my model: p.fit — and what do I want to fit to?
I've already taken my data frame up here and split it into training and test data. I can see that my two input columns are x0 and x1, so I'm going to put those in a variable: x columns is x0 and x1. The y column I'm going to predict is just y. So I'm going to fit on the training data, with those x columns, and compare that to my y column. After I do that, I want to score how well this classifier works, so I'm going to score it on my testing data. I run that, and I see it's not doing very well: it's only getting about 63% correct, because it's hard to separate those black dots from the gray dots. When I just have regular logistic regression, it can only put a straight line there.

So let's introduce k-means as a preprocessing step. I'll create a k-means step here and specify the number of clusters, let's say 3 first. Let me try running that. What happened there? I have an extra parenthesis, just randomly. There we go, that's where it's supposed to be. Three is not enough clusters here: if I try jumping up to 5, you can see I'm doing significantly better, and if I go up to something like 10, better still. I can kind of capture the different areas and how close points are to them. Having more clusters in this preprocessing step is generally not going to be as problematic as having too few clusters.

So what's happening here? I'm running this k-means step on the input variables, and what it outputs is the distance to each of those 10 clusters. So one of my variables is: what is the distance to this cluster here? If that distance is small, then it's probably a black point. I might also have another column that says: what is the distance to this cluster here? If that's small, well, then it's probably a gray point. So this is one way to do the preprocessing. An alternative, which would probably work just as well, is polynomial
features, which could also figure out a more complex boundary between the black and the gray dots.
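Putting the whole comparison together, here is a minimal sketch. The data is synthetic (make_blobs with one cluster relabeled as the "black" class), not the lecture's dataset, so the exact scores will differ; the point is the pipeline structure, and in particular that KMeans's transform step replaces each point with its distances to the cluster centers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the lecture's data: five clusters, one of
# which is the positive ("black") class.
X, y = make_blobs(n_samples=600, centers=5, cluster_std=1.5, random_state=0)
y = (y == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression: limited to a single straight-line boundary.
plain = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])

# k-means as preprocessing: its transform() step turns each point into
# a vector of distances to the 10 cluster centers, which logistic
# regression can then separate with a linear boundary.
with_km = Pipeline([("scale", StandardScaler()),
                    ("km", KMeans(n_clusters=10, n_init=10, random_state=0)),
                    ("clf", LogisticRegression(max_iter=1000))])

# The alternative mentioned at the end: polynomial features let the
# model fit a curved boundary directly.
with_poly = Pipeline([("scale", StandardScaler()),
                      ("poly", PolynomialFeatures(degree=2)),
                      ("clf", LogisticRegression(max_iter=1000))])

for name, model in [("plain", plain), ("kmeans", with_km), ("poly", with_poly)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

On this synthetic data the preprocessed pipelines typically beat the plain one, mirroring the 63%-vs-better comparison in the lecture, though the exact numbers depend on the random data.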