You've now had an opportunity to read through the basics of machine learning. In this notebook we're going to look at one machine learning technique, one algorithm, called the k-nearest neighbours algorithm. It's very useful because it's easy to understand, and that gives us the opportunity to learn some of the basic steps that we always go through when we use machine learning in our research.

Notebook 13: machine learning using the k-nearest neighbours algorithm. As I've mentioned, it's one of the easier algorithms, and we're going to use it to showcase all the steps we have to go through when we design a machine learning solution to a problem.

One thing about machine learning: it uses some of the same subjects and objects that we have in statistics, for instance, but as with many fields it changes the names slightly. Where in statistics we would have the term independent variables, you'll see we refer to those as feature variables, or simply the features. They are used to predict a dependent variable, which we refer to as the target variable or outcome variable, and if the target variable is categorical in nature, we refer to its sample space elements as classes. So, some terminology there for you to get used to.

Let's import numpy, and from pandas we'll import some specific functions, and then scikit-learn. Scikit-learn is a very important Python package, used quite considerably in the world of machine learning. We're going to make use of the KNeighborsClassifier and the KNeighborsRegressor, and also some metrics and scaling utilities, and we'll see all of those used as we go through this notebook.
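The import cell just described might look something like this sketch (the exact submodules chosen here are an assumption; the plotting libraries mentioned next are imported in the notebook too but left out of this fragment):

```python
# Core numerical and data-handling libraries
import numpy as np
import pandas as pd

# scikit-learn: the two k-nearest neighbours estimators,
# plus the metrics and scaling utilities used later on
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
```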
Of course, for plotting we'll use our good old friend Plotly, and we're going to use some other plotting libraries as well, so let's import matplotlib and seaborn too. In case this is run on a MacBook with a Retina display we'll have our magic command there, and also the functions to display tables nicely; you're familiar with all of these now. We're also going to work with a dataset saved on Google Drive, so we'll import the drive function.

In machine learning, as far as supervised machine learning is concerned, there are two basic types of problem: classification problems and regression problems. In a classification problem the target variable, our dependent variable, is categorical in nature. What we're going to do here is generate some data and put it inside a DataFrame.

Now, when it comes to nearest neighbours, the term nearest suggests that we're talking about some distance, and that's exactly what it is: how far the different observations, the subjects, the rows in your dataset, are from each other. So we need some way of measuring distance, and the most common distance we all know is the Euclidean distance: the distance between two points on a flat surface. Remember, we always express distance as a positive value.

So let's generate some data. We seed the pseudo-random number generator, and we use randint to take seven random integers on the interval from 10 to 20; that's going to be our feature column. Our target column is a random choice between two categorical classes, a and b, which is why we refer to them as classes. You can see the DataFrame here: we have a feature column and a target column, and the features are numerical values.
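The data-generation step just described can be sketched as follows (the seed value 42 is my assumption; the notebook may use a different one):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed the pseudo-random number generator for repeatability

df = pd.DataFrame({
    # seven random integers on the interval 10 to 20 (upper bound exclusive)
    "feature": np.random.randint(10, 20, 7),
    # a random choice between the two categorical classes a and b
    "target": np.random.choice(["a", "b"], 7),
})
```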
The target is categorical: this is a binary classification problem, inasmuch as we're trying to predict a categorical variable given these input values.

So let's plot these, using the Figure object from Plotly. There we go: for our two classes, a and b, we can see the feature variable. This is like a one-dimensional problem, inasmuch as we only have a single feature variable, so we can plot everything on a line. When we think about the distance between these points it's very easy for us as human beings: we just subtract one value from the other and take the absolute value. Remember, taking the absolute value turns any negative value into a positive one. For instance, for the values 12 and 19, |12 − 19| is the Euclidean distance between them: 12 minus 19 is −7, and the absolute value (that's what the vertical lines in the mathematical notation mean) equals 7, because as human beings we always see distance as positive, never negative. It's as simple as that.

Now let's introduce an unknown value into our dataset; as far as the feature variable is concerned, its value is 12.5. If we plot that, we see the 12.5, and we want to know: can a machine learning model predict whether this observation belongs to class a or class b? K-nearest neighbours does just what it says on the label: it looks at the distances to these points, and the k in k-nearest neighbours refers to how many of the nearest ones we're interested in. We'll usually choose k to be an odd number, and you'll soon see why. Let's choose k = 3. Visually it's quite easy to see that the nearest will be a class b (the red dots), the second-nearest will be a
class a, and the third-nearest will be a class b; all the other points are further away. There are the absolute values for the feature values 12, 13, and 14 relative to our new value 12.5: we do the subtractions and express the absolute value of each. Because k is an odd number there is always a majority class, so this is all about majority voting. Among the three nearest points we see two red dots and one blue, so b is in the majority of the nearest neighbours. What k-nearest neighbours then simply does is say: well, two-thirds (66.67%) of the cases are from class b and only one-third (33.33%) from class a, as far as my three nearest neighbours are concerned. It assigns those probabilities, and the highest probability wins: two-thirds of the nearest neighbours were b, so our model predicts that this new observation, this new subject in our dataset that it has never seen before, belongs to class b. It really is as simple as that.

Now let's bring in a second variable, which I'll call feature 2, our second feature variable: seven random integers between 100 and 200, which we divide by 10, so the values lie between 10 and 20 with one decimal place. Looking at our DataFrame now, we have the feature and the target, but we've added a second feature. We're not living on a straight line anymore, inasmuch as each observation now has two feature variables, so if we plot them, that's a plot in the plane. Let's have a look: there they all are, the two classes, each observation now with two feature variables. So how do we determine the distance between any two points? We just use the Pythagorean theorem; think of a right-angled triangle.
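The one-dimensional three-neighbour vote described above can be sketched in code. The toy values below are my own, invented so that the three points nearest to 12.5 are 12 (b), 13 (a), and 14 (b), mirroring the worked example:

```python
import pandas as pd

# invented toy data mirroring the worked example
df = pd.DataFrame({
    "feature": [12, 13, 14, 19, 17, 10, 16],
    "target":  ["b", "a", "b", "a", "b", "a", "a"],
})

new_value = 12.5   # the unknown observation
k = 3              # an odd k, so there is always a majority class

# in one dimension the Euclidean distance is just the absolute difference
distances = (df["feature"] - new_value).abs()

# take the k nearest rows and let the majority class win
nearest = df.loc[distances.nsmallest(k).index, "target"]
prediction = nearest.mode()[0]
print(prediction)  # two of the three nearest are class b
```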
You sum the squares of the two side lengths, and once you have that sum you take the square root; that's the length of the hypotenuse. Because we're talking about distance, remember that the square root of 9 is plus or minus 3 (negative 3 squared is 9, and positive 3 squared is also 9), but when it comes to distance we're not interested in the negative, only the positive value. And you can see in equation 2 that because we square the difference between the two sides of that right-angled triangle, the quantity under the square root sign is always zero or positive, so as far as real numbers are concerned we can always take its square root. That works out for us.

So let's put in a new observation: its feature 1 value will be 14, and feature 2 will be 18. We plot it there, our little green friend, our new subject, and we want to know, using the k-nearest neighbours algorithm, which points are the closest; we choose an odd number for k, and the majority class is going to win. Now, we can just eyeball this in two dimensions, and we know it's going to be class b again, because these are the three nearest points. That's all we have to do: work out this little distance here, and that distance there; the algorithm will work out the distances to more points, but we're looking for the three nearest ones.

We can actually create a little Python function, a user-defined function, and I just want to remind you here how to do that. The keyword is def, for definition; we give our new function a name, and I'm going to call it dist_2d; and then we set out the arguments that we'll have to pass to the function. There will be four values: x1 and y1 (you can imagine those as the coordinates of a point, i.e. feature variable 1's value and feature variable 2's value for that observation), and then the same for the second
observation. If we just use our equation for the Pythagorean theorem, it's (x1 − x2) squared plus (y1 − y2) squared under the square root, and the function returns this variable, local to the function, called distance.

So let's have a look at those points. Our new point was (14, 18): for that observation the first feature variable has a value of 14 and feature 2 has a value of 18. One of the nearby points was (14, 17.1), so we pass (14, 18) and (14, 17.1) to our little user-defined function and it gives us the Euclidean distance between those two points, and we do the same for the other two nearest points. Again, all the others are further away, and what we do is take a majority vote, or the highest probability: two-thirds were from class b and one-third from class a, so the majority class wins.

Now let's ramp things up a bit, because it's still very easy in two dimensions, but when we have a data problem with many feature variables we can't plot anything in more than three dimensions; it's just impossible for human beings to visualise something in 50 dimensions, or even more. What we're going to do here is use the make_classification function; remember, that's in the datasets module of the scikit-learn package, and we imported it. It returns two objects for us: a matrix X and a vector y. The matrix X is just like a spreadsheet, rows and columns of our feature variables: each column is one of the features, and every row is an observation. The y is zeros and ones; that's just the class, because it's a classification problem, so there are going to be two classes.
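The user-defined function just described, as a minimal sketch:

```python
import math

def dist_2d(x1, y1, x2, y2):
    """Euclidean distance between the points (x1, y1) and (x2, y2),
    straight from the Pythagorean theorem."""
    distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    return distance

# the classic 3-4-5 right-angled triangle as a sanity check
print(dist_2d(0, 0, 3, 4))  # → 5.0
```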
Setting up these arguments gives us a simulated dataset that the make_classification function builds for us. The number of samples we want is 200, so 200 observations, 200 rows; the number of features is 5; and we set the number of informative features, n_informative, to 3, so only three of the variables are really going to aid a model in determining what the target variable's value should be; the other two are redundant. The number of classes in our target is 2, so it's a binary classification problem, and we want to flip 10% of the labels at random to the opposite class. You can always make the problem you're simulating more complicated by increasing that flip_y value, and it becomes much more difficult for your model to find the truth. And because we all want the same results, I'm setting the random_state argument to 42.

So that creates two objects for us. The type of X is a multi-dimensional NumPy array, a bunch of columns and a bunch of rows, and we can use the shape attribute to see that we have 200 observations and 5 variables each. Remember, this is not a DataFrame; it's just a NumPy array, so we don't have column header names. You can see it as the values in a spreadsheet with the first row, the one containing all the column headers, the variable names, stripped away. They are all 64-bit floating-point values, i.e. decimal values. The type of y is also a NumPy array, but its shape is a single column: you'll just see (200,), so it's a vector of those values. If this were a spreadsheet, we'd have taken that column out as the thing we're trying to predict, and the type there is integer.
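The call described above, written out (n_redundant=2 is the library default, spelled out here for clarity):

```python
from sklearn.datasets import make_classification

# simulated dataset with the settings discussed above
X, y = make_classification(
    n_samples=200,     # 200 observations (rows)
    n_features=5,      # five feature variables (columns)
    n_informative=3,   # only three genuinely help predict the target
    n_redundant=2,     # the remaining two are redundant
    n_classes=2,       # binary classification problem
    flip_y=0.1,        # flip 10% of the labels at random
    random_state=42,   # so we all get the same result
)

print(X.shape)  # (200, 5)
print(y.shape)  # (200,)
```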
It contains zeros and ones. So let's build this into a DataFrame: we pass X, with the column names feature 1, feature 2, feature 3, up to feature 5, and then we add a brand-new column called target, which is y. Then, to make sure pandas understands that this is a categorical variable, I add another column, target class, which is also y, but I assign to it a Series object built from my column of y values with the dtype set to category, so pandas knows the zeros and ones are just an encoding for some nominal categorical variable. Then we can look at the first five rows using indexing: features 1 to 5, my target, which in this instance is just numbers, and the target class, making sure pandas understands this is actually a categorical variable, a class.

Let's have a look at the descriptive statistics of all these features; we can see the mean and standard deviation of what scikit-learn's make_classification produced. We can also look at the correlation between the features; remember, two of them were redundant, and we were wondering which ones. A scatter plot matrix makes pairs of all our numerical variables and shows us whether there is any correlation. If we look here at feature 2 and feature 5, for both target classes, 0 and 1, there seems to be a good correlation between them, and here as well between features 3 and 4, so there's definitely some correlation there. Some others, like features 1 and 3, don't seem to be correlated at all.

So now we're going to build a real model. First of all, we have this KNeighborsClassifier class in scikit-learn, and we have to instantiate it; that's computer-language-speak for creating an instance of it.
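The DataFrame-building step can be sketched like this (the exact column names are my own choice; the notebook's may differ slightly):

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, flip_y=0.1, random_state=42)

# column names are assumed here
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
df["target"] = y                                     # plain integer column
df["target_class"] = pd.Series(y, dtype="category")  # tell pandas it's categorical
print(df.head())
```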
There's a bunch of parameters we can set; we'll leave all of them at their default values except for the number of neighbours. We have to specify what the k in our k-nearest neighbours is, and this time we'll set 5: please check out the five nearest neighbours for us. We assign that to the variable neigh, short for neighbours. Then it's a very simple thing to do: we use the fit method. Now that we have an instance of this classifier, we simply call neigh.fit and pass our feature variables and our target vector, and that's all there is to it. There we go: we now have a model; the model has been built. You'll see the values that were given to all the arguments we left at their default values. The algorithm is set to auto; that's how it sifts through all these distances: it builds a tree for that, and there's a leaf size for the tree. The metric is minkowski, and the metric refers to how distance is measured: the Minkowski distance, as we can call it, with p = 2 is just the Euclidean distance. You also get p = 1, p = 3, and so on, but 2 is simple Euclidean distance. The weights argument we just leave at uniform. When you look this up on the scikit-learn website you can learn all about what these arguments do.

So let's create a brand-new observation. Our model is now trained, and we want to bring in a brand-new subject and see what our model predicts this new observation to be. We'll just use the randn function from numpy. Because it is a single observation, what you'll find when you build these is that you invariably get the dimensions wrong for the new observations you want your model to predict on, but the error
messages are usually very clear, and they'll tell you: remember, you're only passing information about a single new object (subject, observation) here; please reshape it with (1, -1). The error messages are very useful and informative. Then we take our trained model, neigh, and use the predict method: we pass the five feature variable values of this unknown observation to it, and it predicts what it thinks the class will be. It predicts that this brand-new subject of ours, according to the machine learning model that's been built, belongs to class 1 as opposed to class 0. We can also use predict_proba, and that's the probability it has assigned to each class: for class 0 it predicted a 0% probability, and for class 1 a 100% probability, which means the five neighbours closest to this point were all of class 1. As simple as that.

So let's ramp it up from here; we understand a little bit more now, and we must move on to one of the next steps that we always perform in machine learning: splitting our data. Until now we've just plonked all our data (what a lovely word, plonked) into our models; when we build a linear regression, for instance, all our observations just go into that kind of model. Not so much in machine learning: we actually split our data. We randomly take out some of the observations and keep them separate, so that the model never sees them. When we get to more sophisticated things like random forests or deep neural networks this becomes very, very important: you want to keep some of your data out of the model's ability to learn from it, because machine learning learns from your data, but you want to keep some data totally separate, data the model has never seen.
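The fit-then-predict sequence just walked through, as a sketch (the random seed for the new observation is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, flip_y=0.1, random_state=42)

neigh = KNeighborsClassifier(n_neighbors=5)  # k = 5, everything else default
neigh.fit(X, y)                              # train on all the data, as above

# one brand-new observation: five feature values, reshaped to one row
rng = np.random.default_rng(0)               # assumed seed for illustration
new_obs = rng.standard_normal(5).reshape(1, -1)

pred = neigh.predict(new_obs)                # the predicted class
proba = neigh.predict_proba(new_obs)         # per-class probability (votes / k)
```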
We'll use that data after the model has trained, and we'll test how well the model does against unseen data. Because this is a supervised learning situation, we know what the outcomes are for the observations we randomly take out, so we can measure how accurate, how correct, our model was on this unseen data. We call this data splitting: we create two datasets, one called the training set and the other called the test set. Any machine learning model only learns from the training set; we then use the test set to check the performance of our model, because we have to somehow gauge, and express to others, how well our model is doing. Invariably we want to deploy these machine learning models in the wild, and we want them to be accurate: the self-driving car has to be pretty accurate when it drives, and models that predict a diagnosis from an X-ray have to be very accurate, and we have to express to possible users of our model how well we think it performs. There are certain metrics for this that we're going to learn about.

It helps here to introduce two terms that you'll see quite often: variance and bias. There is always a balancing act in machine learning models; we want a balance between variance and bias. We used to refer to this as the variance–bias trade-off, and although that term is not so common anymore, we still need this balance between variance and bias. So what is variance? A model with high variance does really well on the training set, and you can have the problem where the model actually memorises the training set: it will do very well there, but as soon as you show it real-world data, data outside its training environment, it does very poorly. We also use the term: it overfits the training data, and that
can be quite bad. Sometimes we need to pare back and make our models simpler, so that they make more mistakes on the training set, because that usually means (or at least can mean) that they generalise better to unseen data. The opposite is bias: when we have high bias, the model does rather poorly even on the training data. Depending on what the actual values are, there are times we would call a model a high-variance model that overfits, and times we would say it has high bias.

The next term you'll see is hyperparameters. Hyperparameters, as far as machine learning is concerned, are the values that we as human beings have to choose in the design. For k-nearest neighbours it's very simple: we have to decide on the k; we can use 3, 5, 7, or even more, and that is referred to as a hyperparameter. All the other arguments built into these classes in scikit-learn are also hyperparameters that we have to set. Whatever comes up in the design of a deep neural network, a random forest, or a k-nearest neighbours machine learning model, all the things we set at design time are called hyperparameters. What the machine learning model tries to learn, in modelling more sophisticated than k-nearest neighbours, are the parameters: it has to learn parameters, and with those parameters it makes predictions, while the things we set are the hyperparameters.

So let's do the train–test split, and I suppose the split ratio is another kind of hyperparameter, because we have to decide how much data we want to randomly pluck out of our complete dataset, and that really depends on how much data we have; here we have only 200 rows. It's another balancing act: you want to take away enough so that the set you're going to test on gives a fair indication of how good this model really is; the problem is that the more you take out, the less data there is for the machine learning model to train on,
and machine learning models are very hungry when it comes to data: the more data you can feed a model during the training phase, the more you improve the chances of it being an accurate model. So it's something you really have to sit and think about. You also don't want unbalanced classes, and we'll have a look at that.

So let's use the train_test_split function in scikit-learn. What you pass to it is your set of feature variables and your target vector, and you set a test size; in this instance I've said I want 20% of the data split off at random. The train_test_split function returns four different objects, and we have to assign names to them. Very commonly you'll see X_train and X_test, which are the training set's feature variables and the test set's feature variables, and then y_train and y_test which, as these descriptive names suggest, are the training set's and the test set's target variables. We set the test size to 0.2, as I said, and random_state so that if you run the code you get the same pseudo-random split of your data.

Now, as I said, we have to talk about unbalanced classes. You might have a problem where most of your observations belong to one class; that would be unbalanced data. You just have to make sure that in your training set and in your test set, as far as the target variable is concerned, there isn't too big an imbalance. So let's use the numpy unique function: we pass y_train and set return_counts to True, and the unique function says, well, I found two classes, 0 and 1, and there were 83 and 77 of them respectively. Those are quite close to each other, so that's not a big problem. What you have to do is make sure the test set contains a fair representation of all the classes.
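The split and the class-balance check, as a sketch (the random_state of 42 is assumed here, so the counts printed will differ from the 83/77 mentioned above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, flip_y=0.1, random_state=42)

# 20% of the rows are split off at random as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# check the training target for class imbalance
classes, counts = np.unique(y_train, return_counts=True)
print(classes, counts)
```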
That is also why it's important to take enough data for your test set, so that you don't have class imbalance. We see 18 and 22 there, so that's a fairly equal distribution; not a problem for us in this toy model.

Now we set our hyperparameter, and that is k; we want k = 5. So let's instantiate another k-nearest neighbours classifier using the KNeighborsClassifier class: we instantiate it and set the n_neighbors argument to 5, and then it's as simple as fitting it with the X_train and y_train values, so it now only uses that 80% of the data. This is something we'll use again and again in machine learning: train–test split; design the architecture of your model, what goes into it, by setting the hyperparameters; and then we just fit the data to it, and the learning takes place.

Now we've got to talk about how well this model does. Fortunately, we have the 20% of the data that this model has never seen, and we can now feed it to our model so it can make predictions, with the predict method. So our trained model: neigh.predict, and we pass X_test, the feature variable values of our test dataset, and we assign the result to a variable called y_pred, for y prediction. Because we know the y_test values, the actual values (this is supervised machine learning), we can measure the predictions it makes against the actual values. From the metrics module in scikit-learn we have this accuracy_score function, and what we pass is y_test, the actual values, and the predicted values from passing in the test set, and we get an accuracy back. In this instance we can see that on the test set our very simple k-nearest neighbours model, a very simple architecture, was 90% correct, 90% accurate, on this unseen data.
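The whole train, predict, score loop can be sketched as below; note that with an assumed random_state of 42 the exact accuracy will differ from the 90% quoted above, which came from the notebook's own split:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)           # learn from the training set only

y_pred = neigh.predict(X_test)        # predictions on the unseen 20%
acc = accuracy_score(y_test, y_pred)  # actual vs predicted
print(acc)
```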
With classification problems it's very useful to visualise the results, and we do that with a confusion matrix; in the metrics module of scikit-learn there's a plot_confusion_matrix function. A confusion matrix can also be expressed as a real matrix, but it's easier just to visualise it. The term confusion is a bit unfortunate, I think, but it stems from the fact that we get a very good idea of what our model got right and what it got wrong.

So let's plot this and you'll see. There's the confusion matrix plot. On the left-hand side you'll see the true labels; label, or class, would be terms for the sample space elements of our target variable, our dependent variable, which had 0 and 1 in it. On the horizontal axis is the predicted label, or predicted class. If we start at the 17 at the top: the actual value, which we knew from our dataset, was 0, and the model predicted 0 in 17 cases, so those were done quite correctly. Down the diagonal we see the 19, where the true label was 1 and the predicted label was also 1, so those were done correctly as well. But there was one instance where the true label was 0 and the model predicted 1, so that was obviously wrong, and there were three cases where the actual label was 1 but the model predicted 0. So on the off-diagonal are all the mistakes.

Let's just play a bit with one important hyperparameter here, and that's the value of k. What I've set up now is k = 3: I'm re-instantiating the KNeighborsClassifier, assigning it to the same variable, so this overrides the previous one; we fit the training data to it, and now we have a brand-new fitted model. Let's do a confusion matrix plot with k = 3, and now we see the mistakes have actually gone down: there are only two mistakes down here, plus still the one there, so that's an actual improvement.
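The matrix itself can be computed directly; as a caveat, plot_confusion_matrix was removed in newer scikit-learn releases, where ConfusionMatrixDisplay is the plotting route instead (a sketch on the same assumed synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

neigh = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = neigh.predict(X_test)

# rows are true labels, columns are predicted labels;
# correct predictions sit on the main diagonal
cm = confusion_matrix(y_test, y_pred)
print(cm)

# In recent scikit-learn versions you would plot it with:
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```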
If we create a new y_pred by calling neigh.predict, passing X_test to it, and then use the accuracy_score function from the metrics module, we see that we've now gone from 90% accuracy to 92.5% accuracy. So you can see how important it becomes to play around with these hyperparameters, and for some architectures, like deep neural networks, there are so many hyperparameters that it becomes a bit of an art to create the best models.

So now let's import some data that's much closer to real-world data. You can find this dataset in many places on the internet. It's a bunch of feature variables pertaining to the microscopy of specimens from breast lumps, and there is a target variable which indicates whether the observation was a cancerous or non-cancerous lesion, so quite a serious dataset. These are encoded as b for benign, meaning a non-cancerous growth, and m for malignant, which means a cancerous growth. This dataset was downloaded and corrected so that we can use it here, and we have to import it from Google Drive. So what I'm going to do is use the drive function to mount my Google Drive and then change directory to the data directory where the CSV file is. This requires me to do that whole procedure with which you are now familiar: signing in to our account and getting that little security code. After I've done that, we rejoin; there we go, we've changed directory with %cd, so we're in the right directory and we can now just use the breast_cancer.csv file. As I said, this file is available on the internet; researchers made it available.

So let's have a look at the first five rows using indexing, and we see there's an id column and a diagnosis (these first rows would all be cancerous lesions), and then, from these specimens, a variety of feature variables: the radius mean, the texture mean, the perimeter mean, the area mean, the
smoothness mean. These samples were investigated under a microscope, and data on what was seen on the microscope slides was captured in numerical form, together with, at the end, whether the pathologist thought this was a malignant or benign lesion.

So we see the dataset there. Let's just call the info method on our dataset to have a look. We see the id; that's not going to help us, because it's a unique id for every subject. The diagnosis is an object, so that's a categorical variable, with m and b as the two sample space elements, and then all of our feature variables here are decimal values, 64-bit floats. You can see the dataset was cleaned up so that there's no missing data; that's quite important as well. There's also this 'Unnamed: 32' column in there, and we've got to get rid of these. The id column is not going to help us in the machine learning model because it's just unique to every subject, and the Unnamed: 32 column just came in through the way the file was saved, so we have to get rid of both. We use the DataFrame drop method: we pass, as a list, the string versions of these two column headers, with axis=1, meaning they refer to columns, and inplace=True, so the change is made permanent. Now those two columns are gone and we don't have to worry about them.

Remember, the next step is to look for class imbalance, so let's just look at a bar chart: on the x-axis we'll have the diagnosis column, and we're just going to count how many b's and m's, our two classes, there are. You can see we have a bit of class imbalance here: a lot more non-cancerous lesions, the b (benign) lesions, than malignant lesions; you can see the counts on the chart. Let's just express this as a fraction, using value_counts.
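The clean-up and the class-balance check can be sketched with a tiny hypothetical stand-in for the real CSV (the column names match the dataset, but every value below is invented):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for pd.read_csv("breast_cancer.csv")
df = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "diagnosis": ["M", "B", "B", "M", "B"],
    "radius_mean": [17.9, 12.4, 11.4, 20.3, 12.8],
    "Unnamed: 32": [np.nan] * 5,   # artefact of how the file was saved
})

# axis=1 refers to columns; inplace=True makes the change permanent
df.drop(["id", "Unnamed: 32"], axis=1, inplace=True)

# class balance expressed as fractions
print(df["diagnosis"].value_counts(normalize=True))
```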
Using the .value_counts method on the df.diagnosis pandas Series, we see 62.7 percent of lesions were benign and 37.3 percent malignant, so this is a somewhat imbalanced data set. It's also interesting that if we had no model at all and simply took the majority class as our prediction, saying that no matter what the values of a new observation are our model will call it benign, we would be correct 62.7 percent of the time. Very importantly, that's our null or baseline prediction: a totally naive model that always predicts the majority class will be correct a fraction of the time equal to the fraction of that majority class. So we had better build a model that does better than just predicting the majority class. Next, let's look at a summary of all the variables; that's always an important thing to do, but what I really wanted to show you is that this is a very high-dimensional problem. We have many feature variables, so there's obviously no way for us to plot them all. Another thing we can always do is look for correlation between the features, and we can compute a correlation matrix over every pair of them. It's probably easier to create a plot of that, as it's more visual; because there are so many pairs of feature variables it takes a little while, but there we go. Along the main diagonal every feature variable is, of course, 100 percent correlated with itself; the light colors are correlations near zero, dark blue is strong positive correlation, and dark red is strong negative correlation. You can see some of these features are correlated with each other. That is obviously a problem when it comes to linear models, and in any case it's part of the important information we need along the way as we start building more sophisticated models.
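The baseline calculation above can be checked with a short sketch; the class counts of 357 benign and 212 malignant are the actual counts behind the 62.7/37.3 split, but this Series is a reconstruction, not the notebook's own code:

```python
import pandas as pd

# Rebuild the diagnosis column from its known class counts
diagnosis = pd.Series(['B'] * 357 + ['M'] * 212, name='diagnosis')

fractions = diagnosis.value_counts(normalize=True)
null_accuracy = fractions.max()   # a model that always predicts the majority class
print(fractions.round(3).to_dict())
```

A naive majority-class model is right exactly as often as the majority class occurs, here about 62.7 percent of the time.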
What we want again is just our feature variables on one side and our target variable on the other, so we create two variables, lowercase y and uppercase X; that's just standard usage for these names. To y we assign the df.diagnosis column, and for X we drop the diagnosis column, because we only want the feature variables in there. So we have our X and our y, exactly what we need, and as before we do a train-test split. We have enough data, so we split off 20 percent as the test set: we pass our feature matrix and our target vector, set the test size to 20 percent, and I've set random_state to 12 here. We have to assign the result, in this order, to four variables, because the train_test_split function returns four objects. Running that cell, we can look at the shape attribute of each: the training set has 455 examples across 30 variables and the test set has 114, while y_train and y_test have 455 and 114 values respectively, equating to those counts but with just a single variable. The next step after the train-test split is to, using the generic term, normalize the data. Normalization is also the name of one specific technique, so the word carries both the generic sense of what we're doing here and the name of one of the types; the type we use here is called standard scaling. It is very important in machine learning to have all your feature variables in a similar range, so that you don't have some features with values in the thousands while others are fractions like 0.4. In many machine learning models those big numbers would dominate when the model starts to learn from the data, and we don't want that, so we scale it all down. Standard scaling brings each of our columns to a mean of zero and a standard deviation of one.
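The split step a moment ago can be sketched like this; placeholder random features stand in for the real 569-by-30 matrix, but the shapes come out the same:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(569, 30))            # placeholder features, same shape as the real data
y = rng.choice(['B', 'M'], size=569)      # placeholder labels

# 20% held out for testing; random_state fixes the draw, as in the lecture
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The four returned objects must be caught in exactly this order.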
What it does is calculate a z statistic for every value, that is, how many standard deviations the value is away from the mean. It takes every column one at a time, calculates that feature's mean and standard deviation across all of its values, and then for every value it subtracts the mean and divides by the standard deviation. You end up with values on both the negative and the positive side, with a mean of zero and a standard deviation of one. In scikit-learn there is a class called StandardScaler. Once again we instantiate it and assign it to the variable scaler, setting the arguments copy=True, with_mean=True, and with_std=True, so it knows what to do. With the scaler instantiated, we can actually use it: we overwrite our X_train values by calling fit_transform on this instantiated scaler. The fit part does all of the calculation, computing each column's mean and standard deviation, and the transform part subtracts the mean from every value and divides by the standard deviation. You do get fit and transform separately, but fit_transform does it all in one go; it actually changes the values, overriding what we had, and we pass X_train to it. So we've now changed the actual values; they are no longer what they were when the original data was captured. Now we have to do the same thing to the test set as well, with one very, very important difference: we can't just do the standard scaling on the test set itself; we have to use the parameters of the training set to do the scaling.
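A minimal sketch of that scaling step, with two made-up feature columns on very different scales; note that fit_transform is called only on the training data, while the test data is only transformed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: one small-valued column, one large-valued column
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 250.0]])

scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X_train = scaler.fit_transform(X_train)   # learn mean and std from the training set only
X_test = scaler.transform(X_test)         # reuse the training-set parameters

print(X_train.mean(axis=0).round(6), X_train.std(axis=0).round(6))
```

After scaling, every training column has mean zero and standard deviation one, and the test value 2.0 lands exactly on the training mean of its column, so it scales to zero.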
In other words, we take the mean and standard deviation that each variable actually had in the original training data and use those to do the transformation. You scale by the mean and standard deviation of your training set; you don't compute them individually for the test set. That is very important, because otherwise you introduce an artifact and unfairness into the metrics and the analysis of how well your model does. So here we use only the .transform method: we're just transforming, not fitting, so we're not taking the test set and calculating each variable's mean and standard deviation on it. No, we use the ones from the training set: we call scaler.transform(X_test) and overwrite X_test. Now we can instantiate our classifier, KNeighborsClassifier, this time with three nearest neighbors, and assign it to the variable knn. All that remains is to fit our features and our target to it, so we call knn.fit and pass those values, and now we have a fitted model. We can look at how accurate our model is by considering the test set. I'm going to use the plot_confusion_matrix function from the metrics module in scikit-learn, passing my trained model knn along with the test features and the test target, so we can see how well our model did. Look at that, not bad at all: it got 66 of the benign lesions correct and 42 of the actual malignant ones correct. When a lesion was really benign it did not misclassify any as malignant, but when lesions were malignant, six of them were classified as benign. We can also use the accuracy_score function from the metrics module, passing y_test to it; I didn't save the predictions to a variable y_pred beforehand, so I just call knn.predict on X_test to get the predicted values.
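The fit-and-evaluate step can be sketched end to end with tiny made-up data in place of the real training set; the two well-separated clusters here exist only so the example is deterministic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Tiny synthetic two-class problem (hypothetical values, not the real dataset)
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array(['B', 'B', 'B', 'M', 'M', 'M'])
X_test = np.array([[1.5], [10.5]])
y_test = np.array(['B', 'M'])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                 # fit the model on the training data
y_pred = knn.predict(X_test)              # predict the held-out observations

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

In newer scikit-learn versions plot_confusion_matrix has been replaced by ConfusionMatrixDisplay, but the underlying confusion_matrix counts are the same.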
Measuring those predictions against the actual test values, we see the accuracy is 94.7 percent; all this does is use those four confusion-matrix counts to calculate the accuracy for us. Let's also have a look at what the model did on the training set. You can actually pass the training data back to it, and it won't be 100 percent accurate there, nor do you want it to be, otherwise you'd have a high-variance model. There is a bit of variance here: 97.8 percent correct on the training set it trained from, but only 94.7 percent on the test set it had never seen before. I'll also remind you that the model has a .score method, which in this instance gives exactly the same number as accuracy_score. Now, if you have imbalanced data, there's a whole branch of machine learning that deals with that; fortunately, what we can do here is simply use the balanced_accuracy_score function, which penalizes our model's accuracy a little by considering the fact that our classes were imbalanced, giving a better representation of how the model would do out there in the wild. That comes to 93.75 percent, so slightly different. Remember we also had a null score from just predicting the majority class; in this test split that works out to 57.9 percent (66 of the 114 test observations are benign), so our model is certainly much more accurate than the null model. That's a first win for our model. Accuracy, how many it got right divided by how many there were, is certainly a good metric, but there are others, and depending on the circumstances they can be more important. You can see some of them here: sensitivity, also called recall depending on the field you work in.
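The balanced accuracy figure can be checked by hand from the four confusion-matrix counts quoted in the lecture; it is simply the average of the per-class recalls:

```python
# Counts from the lecture's confusion matrix: TN=66, FP=0, FN=6, TP=42
TP, TN, FP, FN = 42, 66, 0, 6

sensitivity = TP / (TP + FN)              # recall on the positive (malignant) class
specificity = TN / (TN + FP)              # recall on the negative (benign) class
balanced_accuracy = (sensitivity + specificity) / 2

print(round(balanced_accuracy, 4))
```

Plain accuracy is (TP + TN) / 114 = 94.7 percent, but averaging the two recalls weighs the rarer malignant class equally, which is why the balanced figure dips to 93.75 percent.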
There's also specificity, there's the positive predictive value, sometimes known as precision, and there's the negative predictive value. We need to understand at least those four, because they're very commonly used. To use them in a classification problem, more specifically a binary classification problem, we as researchers have to take an interest in one of the two target classes: one of the target labels becomes the positive class and the other becomes the negative class. We have to do this, otherwise we can't use these metrics. It has nothing to do with the human understanding of positive and negative; the positive class is simply the one we are interested in studying, and it can be either of them. In this instance, let's assign malignant as the positive class: this is a test being done, a biopsy gets sliced up and placed under a microscope, and the interest is whether the lesion is cancerous or not, so naturally we choose the cancerous lesion as positive and benign as negative. In many machine learning models, the model predicts the positive class and the negative class each with a certain probability, and we set an arbitrary line in the sand, usually starting with the default of 50 percent: whichever class has a probability higher than 50 percent becomes the prediction our model makes. We can change that 50 percent, though, depending on the situation, and we usually frame that in terms of how expensive it is to make an error. Expense doesn't only refer to finances: if it's important to pick up, in this instance, malignant lesions, we might set the bar a little lower, so that the model over-predicts the malignant class.
That's because the cost of calling something benign, non-cancerous, when it is actually cancerous is perhaps higher than the cost of predicting the other way around, so we might bring the threshold down; by default, though, it's 50 percent. Let's look at the first actual value in our test set, which was a malignant M, and see what the model predicted. We call knn.predict and pass that first set of feature variables, the first observation. Don't worry if you get an error here about reshape; whether it must be reshape(-1, 1) or reshape(1, -1), the error message will tell you, and you can just correct it. It's a common little mistake to make. You can see the prediction was also malignant: the model predicted malignant and the actual value was malignant. But remember, k nearest neighbors takes a majority vote, so some of the nearest neighbors may actually have been benign; because we chose an odd number of neighbors, one class has to win out, and with the 50 percent rule the class holding most of the neighbors becomes our prediction. That gives us four terms, true positive, true negative, false positive, and false negative, and they come from assigning, based on our research question, one of our classes as the positive class and the other as the negative class. You can see a little table of these, with true positive abbreviated TP. In terms of our confusion matrix, the four numbers are the 66, the 42, the zero, and the six. With malignant as our positive class, those 42 are true positives, and with B as our negative class, the 66 correctly identified benign lesions are true negatives. As for the false positives and false negatives:
a false positive means the model predicts positive when the case was actually negative, and a false negative means it predicts negative when the case was actually positive. Those values are really quite easy to interpret. So let's go down and set the values: true positives 42, true negatives 66, false positives 0, and false negatives 6. That allows us to calculate, with the four equations here, the sensitivity (or recall), the specificity, the positive predictive value (or precision), and the negative predictive value: sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), the positive predictive value is TP / (TP + FP), and the negative predictive value is TN / (TN + FN), all quite easy to compute. Sensitivity asks how many of the actual positives were picked up: the denominator holds the true positives and the false negatives, because a false negative was, in real life, also a positive. Specificity asks how many of the negatives were correctly identified as negative: the denominator holds the true negatives and the false positives, because the false positives were, in real life, actually negative. The positive and negative predictive values are usually used after the test is done: once a result comes back positive, what fraction of such results are truly positive, and likewise for negative results. I've put a little sentence in there that you can read; the interpretation of all of these depends very much on the type of research you want to do. I also want to show you the F score, the F1 score I should say, and how it is calculated, and we actually have a metrics.classification_report function: if we pass the actual values and the predicted values for our test set and run the cell, it gives us a little report. I wrapped all of that as an argument to the print function.
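The lecture's report can be reproduced by rebuilding label vectors that match its confusion matrix; this is a reconstruction from the quoted counts, not the original notebook code:

```python
from sklearn.metrics import classification_report

# Rebuild labels matching the confusion matrix:
# 66 benign predicted benign (TN), 6 malignant predicted benign (FN),
# 42 malignant predicted malignant (TP), 0 false positives
y_true = ['B'] * 66 + ['M'] * 6 + ['M'] * 42
y_pred = ['B'] * 66 + ['B'] * 6 + ['M'] * 42

print(classification_report(y_true, y_pred, digits=3))
```

For the malignant class this gives precision 1.000 (42 / 42), recall 0.875 (42 / 48), and an F1 score of 0.933, the harmonic mean 2 x (1.000 x 0.875) / (1.000 + 0.875).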
In the report we can see the accuracy, and the precision, recall, and F1 score for each class. I also want to show you what the model actually produces, which is a probability. We pass the first 10 observations of our test set to it. We used k = 3, so in this first observation all three nearest neighbors were malignant and the model gives a 100 percent probability of malignant; coming down we see 66.67 percent and 33.33 percent, so obviously two of the three neighbors were in the zero class and only one in the other. That's what we mean by the 0.5 cutoff: the majority class always wins, but we can obviously change that. Let's play with our hyperparameters a bit and use k = 7: we instantiate the KNeighborsClassifier with the number of neighbors set to seven, fit the model with our training data, and look at how accurate it is on the test data: 95.6 percent. That's actually gone up, so in this instance k = 7 was better. Let's look at the probabilities again just to make sure: with seven neighbors we now divide by seven, so we get those kinds of fractions. In the first observation all seven were of the one class, in another all seven of the zero class, and you can see the distinctions. We can actually put those probabilities in a DataFrame and draw a little bar chart, just to see the different probabilities that an observation belongs to one of the two classes. That brings us to the idea of the receiver operating characteristic, which really helps us decide where to draw the decision line, depending on how expensive a mistake might be in either direction. The receiver operating characteristic uses the false positive rate, the true positive rate, and the thresholds, and there is a metrics.roc_curve function.
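The probability mechanics can be seen in a tiny sketch; with k = 3 the predicted probabilities are always multiples of one third, because they are just neighbor vote fractions (toy values, not the lecture's data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One feature, two classes, hypothetical values
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Neighbors of 1.2 are x = 1, 2, 0 with classes 0, 1, 0: a 2-to-1 vote
proba = knn.predict_proba([[1.2]])
print(proba.round(4))
```

Two of the three neighbors are class 0, so the model reports roughly 66.67 percent for class 0 and 33.33 percent for class 1, and the 0.5 cutoff turns that into a class 0 prediction.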
You can see the parameters that we pass to it, and that's what we have here. I'm going to show you the plot just to explain what's going on. We've calculated the values, and now we can do our plot. On the x-axis, which is a bit difficult, we have one minus the specificity, so a zero there is actually a specificity of 100 percent (and you know by now what specificity is), and on the y-axis we have the sensitivity. The dotted line is just the blind 50-50 guess, and you really want your curve to be above it. Near the left, where one minus specificity is close to zero, meaning almost 100 percent specificity, we don't have to lose much specificity at all to jump to 90 percent sensitivity. What we actually want is a curve that bulges way out toward the top left: by losing some specificity, depending on the threshold we choose, we gain sensitivity, and our model gets way up there very quickly without giving up much. The whole region under this curve is called the area under the receiver operating characteristic curve. The term comes from the world wars, I think the Second World War, with radar: the likelihood of picking up a signal, of telling whether that's a warplane coming in. That is of course not my field, and I'm not much of a history buff when it comes to people killing each other, which is a horrible thing, but you can certainly read up on the history of the term receiver operating characteristic: the characteristic of the signals that were received in wartime. We can actually just calculate that area, and a model with a higher score, a higher area under the ROC curve (the AUC), is a better model. That raises a question: how lucky were we in determining what the accuracy of our model would be out in the wild? We extracted 20 percent of the observations at random for our test set.
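The roc_curve call can be sketched with four hypothetical observations and predicted probabilities; the function returns the false positive rates, true positive rates, and the thresholds at which they change:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels (1 = the positive class) and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
print(roc_auc_score(y_true, y_score))   # area under the ROC curve
```

Plotting tpr against fpr traces the ROC curve; the AUC here is 0.75, and a perfect classifier would score 1.0 while the 50-50 diagonal scores 0.5.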
But what if we did this again? The random nature of the split means a different 20 percent of the values would be extracted, so our model would learn from different data and we'd test on different data. We might have been very lucky this time around, or we might be tempted to do this a couple of times and just report the best one, but that's not how it works; we have to be scientific about this. One way is to do this over and over again systematically, and that's where the idea of cross-validation comes in. With cross-validation we don't split off a separate training and test set; we keep the whole set together but set another parameter that divides our data set at random into a number of blocks. Unfortunately that number is also called k, as in k-fold cross-validation, and that k has nothing to do with the k in k nearest neighbors, so don't confuse the two. If we set it to five, it randomly breaks our data up into five equal sections and then trains the model five times over; each time, one of those five blocks, each holding 20 percent of the values drawn at random, takes its turn as the test set. That means we get five accuracy scores, and we can report the lowest and the highest and also average over them, which should give us a better indication of how well our model is going to do out there in the wild. For this we have the cross_val_score function: we use the k = 7 nearest neighbors model we trained, pass the whole of X and the whole of y, and set the cv parameter.
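The cross-validation step above can be sketched like this; a synthetic data set stands in for the full breast cancer X and y used in the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the real dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=12)

knn = KNeighborsClassifier(n_neighbors=7)
# cv=5 gives five folds and therefore five accuracy scores
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print(len(scores), scores.mean().round(3), scores.min().round(3), scores.max().round(3))
```

Reporting the mean along with the minimum and maximum of the five scores shows both the expected accuracy and how sensitive it is to the random draw.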
The cv parameter says how many of these blocks we want to build, five here, and the scoring we want returned is the accuracy. Running that fits the model five times, and we now have the five scores. Expressing the mean score, you'll see it's slightly down from the high value we got when we built the k = 7 model, and if we look at the minimum and the maximum there's quite a range: depending on how lucky we are with what gets held out, the accuracy drops to 87.7 percent on one fold and goes up to 94.7 percent on another. So it really is very sensitive to the random draw of your test set and the remaining training set. When you want to report results in a paper, make use of k-fold cross-validation so that you have a better understanding of how good your model actually is. Then there's the idea of a grid search for the best hyperparameter values. Fortunately, scikit-learn gives us the ability to hand our model a range of possible values for some of its hyperparameters; it runs through combinations of them and gives us back the best ones. Of course you'd have to repeat this many times to be certain of the best ones, since every run can give something different, but even a few runs give a good idea. We're going to look at three of the hyperparameters we can set. One is the leaf size, which I'll put between 1 and 50, saying look through all of those for the best leaf size. For the number of neighbors we'll search 1 to 20, and for the p values we'll have 1 and 2: that's the Minkowski distance parameter, where p = 2 is just the Pythagorean, Euclidean distance and p = 1 is the absolute-value (Manhattan) distance. So let's set all of that up; these are the ranges of hyperparameter values.
For the three hyperparameters I want to search over, we build this as a dictionary: the keys are the actual argument names, and the ranges of values for each are the values of the key-value pairs. Once that's set, we instantiate our KNeighborsClassifier once again and use the grid search cross-validation function, GridSearchCV, passing the model we've instantiated, all the hyperparameters to search over, and five-fold cross-validation. This is going to take a little longer; it wasn't too bad on our Google machine here. Then we can fit X and y and get back the best values. Of course we're just instantiating it here, and now it's actually running; every time I run this on Google Colab it takes about a minute and a half, because it has to work through so many sets of hyperparameter values over and over again and do cross-validation for each, so it can take a long time depending on the power of your system. That's the minute and a half, so I'll pause the recording here and we'll carry on when it's done. And there we go. Now we can show what these values are: we call best_estimator_ and then the get_params method. This will be different every time; it says the best leaf size was 1, the best number of neighbors was 9, and the best p value for our distance metric was 1. That 1, 9, 1 is what I got back this time; the previous time I ran this it actually returned 9, 9, 1, so every run can return something different. Let's just use nine nearest neighbors, a leaf size of nine, and a p value of one, irrespective of what we had just seen.
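A quick sketch of the grid search, with smaller hypothetical ranges than the lecture's so it runs in seconds, and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=8, random_state=12)

# Keys are the actual KNeighborsClassifier argument names
param_grid = {
    'leaf_size': [1, 10, 30],
    'n_neighbors': [3, 5, 7, 9],
    'p': [1, 2],                 # 1 = Manhattan distance, 2 = Euclidean
}

# Every combination is evaluated with five-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
best = search.best_estimator_.get_params()
print(best['leaf_size'], best['n_neighbors'], best['p'])
```

With 3 x 4 x 2 = 24 combinations and five folds each, this fits 120 models, which is why the full-sized search in the lecture takes a minute and a half on Colab.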
It just shows you how sensitive these values are. So now that we have them, let's do a five-fold cross-validation with these settings and print out the mean: we get 93.85 percent. Going through all these steps is really important for expressing, and honestly understanding, how good your algorithm really is, and when you read papers, check in the methods section that the researchers went through these steps to represent, to a fair extent, how good their model really is. In this last section I'm just going to show you what the nearest neighbors regression model looks like. We'll use a small amount of toy data just to show you the little bit of difference; it actually works very much the same, so there's not much to show other than for you to understand that our target variable is now a continuous numerical variable, no longer a set of classes. So let's generate some data: I set up x values, so I have only a single feature variable, and for y I simply calculate twice x. If we look at our input variable and our output variable, our feature and our target, we see the values there. Let's create a model from this with k = 3 nearest neighbors, and I'll show you how that works: instead of KNeighborsClassifier we use KNeighborsRegressor, the only difference. We instantiate that architecture with the number of neighbors set to three, and remember you'll have to reshape x with reshape(-1, 1) in this case, so just be aware of how you have to reshape that. Now we can test our model by putting in a new value: let's put in 5.5 and see what it predicts, and it predicts a value of 10 on the continuous numerical scale of our target variable. Let's have a look at how it did that.
There are our feature variables and there's our brand-new value, and you can see the three neighbors that were closest to it: the feature values 4, 5, and 6. All the regressor does is average over those three nearest neighbors. Our actual target values for them were 8, 10, and 12, and the average of those three is 10, so the model predicts 10. It's really as simple as that. I really hope you enjoyed this lecture; it might be your very first introduction to some machine learning. I chose k nearest neighbors because it's quite easy to understand, and while it might seem slightly trivial, it actually has very good uses and can be a very powerful algorithm; above all it's understandable, and I really wanted to introduce you to all the steps that we go through and expose you to some of the lingo of machine learning. It is of course a very exciting field and a very modern approach to understanding your data, to bringing the knowledge or story out of your data in a powerful way. And beyond that we can use it for prediction, in a very different way from classical statistics: we are very, very interested in how accurate our model is when given unseen data, a very important concept here, at least as far as these supervised learning models are concerned, and something that's eminently usable in real life.
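The regression example can be sketched end to end; the x values here are toy numbers chosen so that the three nearest neighbors of 5.5 are 4, 5, and 6, matching the walkthrough above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # single toy feature
y = 2 * x                                       # target is simply twice the feature

knr = KNeighborsRegressor(n_neighbors=3)
knr.fit(x.reshape(-1, 1), y)    # scikit-learn expects a 2-D feature matrix

# Nearest neighbors of 5.5 are x = 4, 5, 6 with targets 8, 10, 12
pred = knr.predict([[5.5]])
print(pred)                     # → [10.]
```

The prediction is just the mean of the neighbors' targets, (8 + 10 + 12) / 3 = 10, which is exactly the value the lecture arrives at.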