Hi, and welcome to the Smart India Hackathon 2019 "Designing Detailed Solutions" training program. I am Rajiv Kali, a technical trainer at Persistent University, Persistent Systems Limited. In this second module on machine learning (part 2), I will take you through the remaining classification techniques such as the support vector machine and k-nearest neighbours, and then we will move on to the unsupervised learning algorithms provided by the scikit-learn library. Later I will take you through some of the best practices followed by machine learning professionals, and then we will end the course. So let us begin.

So far we have seen how machine learning is divided into categories such as supervised learning and unsupervised learning. Under supervised learning we saw two further classes: one is regression, the other is classification. In regression we are looking at continuous values; in classification we expect a discrete number of classes in the target column. For linear regression we saw how to use scikit-learn's linear regression algorithm, along with an example notebook. For supervised classification we studied how logistic regression is formed by feeding the linear regression output through a sigmoid function. After that we looked at another classifier, the Naive Bayes classifier, which is based on Bayes' theorem. Then we took a look at the decision tree classifier, which works like an if-else flowchart and has a tendency to overfit, and finally a variant based on decision trees called random forest.

Now we will look at another famous and very popular classifier, the support vector machine. Let us look at this data set you are very familiar with: on the x axis we have tumor size, on the y axis we have age, and there are some green circles and some red crosses representing harmless and harmful as the two classes. Typically, if we use logistic regression on this, all it needs to do is draw a classification boundary, a straight line that divides the data into two parts. For example, if your classifier draws a boundary like this, is it not a good classifier? Yes it is, simply because it serves the purpose: on one side of the line there are only red crosses, on the other side only green circles. So this particular line, this particular model, appears to be pretty good. How about this other line? Even this one looks good, because it serves the same purpose of classifying green versus red. However, the support vector machine takes something different into account. It takes a more forward-looking approach: it tries to figure out where the next incoming test samples might land, rejects both of these lines, and attempts to draw a line like this one, which is called a large margin classifier because it maintains as much margin as possible on both sides of the data points. Just as logistic regression is built on linear regression, the support vector machine can be seen as a variant of logistic regression with a small mathematical tweak, so it has similar tuning parameters, and the regularization techniques we used for linear regression also apply to logistic regression and in turn to the support vector machine.
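To make the parallel with logistic regression concrete, here is a minimal sketch, assuming a tiny hypothetical tumor-size/age data set (the numbers are made up purely for illustration). Both models expose the same fit/predict interface, and the SVM's C parameter plays the regularization role discussed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical training data: columns are tumor size and age (invented for illustration)
X = np.array([[1.0, 30], [1.5, 35], [2.0, 40], [5.0, 55], [5.5, 60], [6.0, 65]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = harmless, 1 = harmful

# Logistic regression draws one separating line
log_reg = LogisticRegression().fit(X, y)

# A linear-kernel SVM draws the large-margin line; smaller C means stronger regularization
svm_clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(log_reg.predict([[3.0, 45]]))  # predicted class for a new, unseen patient
print(svm_clf.predict([[3.0, 45]]))
```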
Now you might be thinking that this particular graph shows data points that already have a large margin between them; but what about data that is already mixed? For example, let us take this case. If I ask you whether the decision boundary on the left-hand side is better or the one on the right-hand side, the popular answer, just looking at the boundaries, would be that the left one is better: there is no misclassification, all the red points are below the line, all the green points are above it, and not a single point goes into the wrong class. The graph on the right-hand side shows a vertical classification boundary, and if you look at it you will see that one of the red points is misclassified into the green zone. So naturally the popular answer would be that the left-hand classifier is better than the right-hand one. In practice, however, the right-hand classifier is the one the support vector machine chooses. Although one point is misclassified, remember these are all training data points, and misclassifying a training point is not of much concern; what matters is how correctly we will classify a test sample that has not yet been seen. That is why this is called a large margin classifier. There is a tuning parameter called C: if you keep a large value of C the model is essentially unregularized, and as you make C smaller and smaller you are effectively applying more and more regularization. With appropriate regularization, the SVM will naturally select the right-hand boundary, which will work better in the field. We will take a quick look at the Jupyter notebook, but before that let me ask you a question.

Suppose you want to classify this kind of data: as you can see, there are some red points and green points mixed in such a way that no matter how you draw a line, like this for example, or a horizontal line like this, or any other straight line, the two classes are inseparable. Simply put, this data is not linearly separable. How do we handle this kind of thing with SVM? SVM has another variant called kernel SVM. Have you heard of any transforms in your college days? If you know the Fourier transform, or the Laplace transform for that matter, the whole idea of using a transform is to take a problem that is difficult to solve in one domain into a different domain where it is easily solvable, solve it there, and bring the result back to the original domain. For example, if you have a signal with noise on it, the teacher would tell us: use the transform to go from the time domain to the frequency domain, design a filter, filter out the noise, and bring the clean signal back to the original domain, is it not? In the same way we learnt in college how to use the logarithmic transformation: when the teacher asked us to multiply two big numbers, we would take logs, and once the logs are taken we get two small numbers that are easy to add.
We then add those two numbers and take the antilog to get the product of the two big numbers. These kinds of transforms are handy to take data into a different domain, make the problem simpler, solve it, and bring it back, and exactly the same philosophy is used by kernel SVM to solve this kind of problem. So let us look at it: this is the type of curve we are looking for. How do we draw this curve? As you know, logistic regression is based on linear regression, and its limitation is that it can draw only a straight line. Because the support vector machine is also a variant of logistic regression, the same limitation applies: there too you can draw only a straight line. However, there is a variant called the kernel support vector machine which is capable of drawing curves, or rather, of taking the problem into a different dimension, solving it there, and bringing it back. For example, look at this diagram. On the top-left there is a scatter plot of some blue and red data points in two dimensions, X1 and X2. If I ask you to draw a classifier boundary while being restricted to a straight line, that does not look possible at all, since the data is non-linear. However, I can define my own kernel, or transform, call it phi, which converts my data from the two dimensions X1, X2 into something like Z1, Z2, Z3, which is three-dimensional. For example, I can define Z1 = X1, Z2 = X2, and Z3 = X1 squared plus X2 squared. If I define my kernel in this fashion and apply it to the original data set, all the red points go up, all the blue points remain down, and once the data is lifted into three dimensions, as you can notice here, it is very easy to cut it into two classes with a simple plane. Once that is done, I can bring the converted data back to the original 2D space, which you see on the bottom-right, where a nice non-linear circular boundary separates the blue points from the red points.

We can see this happening practically using Jupyter notebooks, so let me quickly jump there and demonstrate the use of the support vector machine. I am going to the "classification two" folder, where there is a notebook called "support vector machine"; let me open it. As usual, I first clear all the output that was generated when this notebook was previously run. We are learning the support vector machine, so I import datasets, I import metrics, and I do "from sklearn.svm import SVC", that is, the support vector classifier. Once imported, I load the data set; this is the famous iris data set that you are now familiar with. Then I define model = svm.SVC(kernel='linear', C=1), and I call model.fit on dataset.data, which is the X part, and dataset.target, which is nothing but the y part. (A rough sketch of these cells is shown below.)
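Reconstructed from the narration, the notebook cells would look roughly like this; the exact variable names are my assumption, and the model is fit and evaluated on the full iris data just as described (no train/test split in this particular demo).

```python
from sklearn import datasets, svm
from sklearn.metrics import classification_report, confusion_matrix

# Load the iris data set: dataset.data is the X part, dataset.target is the y part
dataset = datasets.load_iris()

# Linear-kernel support vector classifier with C = 1
model = svm.SVC(kernel="linear", C=1)
model.fit(dataset.data, dataset.target)

# Predict on the same data and compare against the actual labels
expected = dataset.target
predicted = model.predict(dataset.data)

print(classification_report(expected, predicted, target_names=dataset.target_names))
print(confusion_matrix(expected, predicted))
```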
So with the model.fit statement I am giving both input and output, and once I execute it the model gets trained. I can then use the trained model via the model.predict method: in this particular cell I say expected = dataset.target, which is the actual value, and predicted = model.predict on the data. I execute this and then get a classification report and a confusion matrix, which I showed you earlier how to interpret. In this case, if you look at the confusion matrix, the support vector machine classified all 50 setosa as setosa, with nothing going to a wrong class; for versicolor it classified 49 correctly and 1 into a wrong class; and for virginica all 50 were classified correctly, with nothing going into setosa or versicolor. So it has managed the classification well.

Let us look at the tuning parameters, and I can also show you the classification boundary it draws. In this particular case I am using interactive widgets to control the kernel. If I set kernel='linear', then the only regularization parameter is C and gamma has no effect, and as I play with C all you can see is the classification boundary changing. One thing to notice, however, is that the boundaries are all straight lines; that is the limitation. So how do we use a kernel SVM? All I have to do is select RBF, the radial basis function, one of scikit-learn's built-in kernels, and I can play with the values of C and gamma and see how the decision boundary changes. With more regularization you can see what effect it has. Suppose I choose not to regularize at all: if I keep a very high value of gamma and choose RBF, you will see what is essentially an overfitted picture, where the decision boundary hugs all the blue points very closely. Let it stabilize a bit, because I made too fast a change on the slider bar and it takes a while to pick up the new values. Now see: I have selected a gamma of 31 and a C of 6, effectively making both gamma and C very large, which means I am not regularizing, and if I do not regularize these are the kinds of decision boundaries it draws. As you can see, it has even drawn an individual decision boundary around one point that lies far away from the normal grouping. This is exactly what we call an overfitting situation: the model is completely memorizing where the training data points are and drawing the decision boundary to enclose all of them. If a new data point comes in somewhere here, for example where my mouse pointer is, it will get misclassified even though its real class belongs to this blue group. Now, if I change the setting and apply more regularization, you will see what it does; give me a minute, it will stabilize. As you can see on the screen, the boundary now encompasses a larger and larger area around those data points, which is a slightly better-looking picture than before. (A rough sketch of trying different gamma and C values in code follows.)
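The notebook uses interactive sliders, which are not reproduced here; as a substitute, here is a minimal sketch, assuming only the first two iris features, that fits RBF-kernel SVMs with a few gamma values and prints the training accuracy. A very large gamma (with a large C) tends to memorize the training points, which is the overfitting behaviour shown on the slides.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target  # first two features only, so a 2-D boundary could be drawn

# With kernel='rbf', both C and gamma control how wiggly the decision boundary becomes
for gamma in [0.1, 1, 31]:
    model = SVC(kernel="rbf", C=6, gamma=gamma)
    model.fit(X, y)
    # Training accuracy creeping towards 1.0 with a huge gamma is a symptom of overfitting,
    # not of a genuinely better model.
    print(f"gamma={gamma:>5}: training accuracy = {model.score(X, y):.3f}")
```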
So this is how the new decision boundary looks with some more regularization; if you increase the regularization further it will be even better. This is how you should use regularization techniques to control overfitting problems on your data sets, and before you release a model into the field you must check that it is not suffering from overfitting issues. Now, for the kind of data we saw on the PPT, I have one more notebook to show you, which I will quickly run. I will not take you through all the cells, but I want to show you how mixed-up, non-linear data can be separated out. I will run all the cells in one go; we will not go through the code, but I want to show you this picture. It tells me I have selected the RBF kernel and particular values of C and gamma, and by playing with C and gamma I can control the boundary. This looks like a much better bifurcation between the red points and the blue points. So this is how we can separate non-linear data, and this is the demonstration of what we discussed in the PPT about the support vector machine. It is a very popular algorithm, often regarded as something of an expert's algorithm. In fact, the support vector machine has some more variants as well: there is a regressor, and there is also a one-class classifier. As we go along we will probably encounter some more variants of the support vector machine.

Going back to the PPT, I will take you through one more algorithm, called the k-nearest neighbour classifier. Let me ask you a question: suppose you come home from college or work and find there is a power outage at home. Once you notice it, what is your first reaction? The typical reaction is to ask your neighbours: you look out of the window at the neighbouring apartment or building and check whether they also have a power outage. If you find that they do, you rather relax, thinking that everybody seems to have an outage and the power will eventually come back. However, if you find that your neighbours have power and you do not, you sort of panic, is it not so? The same human psychology is used by the nearest neighbour classifier: all it does is ask its neighbour, "what is your class?"; the neighbour replies with whatever class it belongs to, and the data point assigns that same class to itself. For example, take a look at these data points. This is supervised learning, so we have right answers: on the x axis we have tumor size, on the y axis we have age, so we have two input features, x1 (tumor size) and x2 (age), and in the last column we are told whether each combination of inputs is harmless or harmful; this is the plot of those data points. Now we feed these data points to our k-nearest neighbour classifier, and what the classifier does is simply keep a record of all the data points. When the next test data point comes in, some new unseen data, it finds its distance from all the other points, finds the nearest one, and assigns itself whatever class that nearest point belongs to. For example, let us say this is the new test point.
Now, this test point appears at the position I have shown here, and the nearest point to it, in terms of distance, appears to be this red cross. So it asks that point, "what is your class?", the point says "I am harmless", and so the prediction for this blue circle will be harmless. But what happens if the test point had come in somewhere here instead? As you can naturally see, this region belongs to the harmful area; however, just because there happens to be one nearby data point of class harmless, the class of the test point would also turn out to be harmless. The solution is, instead of asking one neighbour, to ask more than one, typically an odd number of neighbours, so that you can take a majority. If you ask three neighbours, for example, it will ask this one, this one and this one, and two of them say harmful, effectively giving the class for the new unseen data as harmful. What are the different tuning parameters here? Nothing much, except the kind of distance measurement you use. In the scikit-learn library there are various ways to measure distances between two data points, such as the Euclidean distance, the Manhattan distance and the Minkowski distance, with the different formulas shown to you. For example, for two points (x1, y1) and (x2, y2) in two-dimensional data, the Euclidean distance is the square root of (x2 minus x1) squared plus (y2 minus y1) squared. Even if the data has more dimensions, say 3, 4, 5 or 6, the same formula can be applied to calculate the distance between two vectors. So this, in a nutshell, is the k-nearest neighbour classifier; it works pretty well, and in fact the results it gives are very close to those of any other classifier.

Let us quickly take a look at the Jupyter notebook; I have one notebook to show you so that you understand how to call the k-nearest neighbour classifier in scikit-learn. First I clear all the output. This is how I import and define it: "from sklearn.neighbors import KNeighborsClassifier". I import the data sets, I import metrics, and I import KNeighborsClassifier; after that I load the iris data set. Then I say model = KNeighborsClassifier, and in the brackets I give the tuning parameters: n_neighbors=7 and metric='euclidean'. This 7 can be 3, 5, 7, 9 or anything of your choice; here I am asking for 7 neighbours. Then I do a fit, which is the actual machine learning step, although in the case of the k-nearest neighbour classifier the fit has hardly any work to do except record where the data points are; its real work starts when it receives a new data point as a test point and needs to find out which are its nearest neighbours. Then I say predicted = model.predict, expected = the target, and as usual I can look at its score and its confusion matrix. In this case 50 setosa are classified correctly; 47 versicolor are classified correctly and 3 went into virginica; 49 virginica are classified correctly and 1 went into a wrong class; and the model score comes back as 0.97, or about 97 percent. So that is the k-nearest neighbour classifier. (A short sketch of this notebook follows.)
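Reconstructed from the narration, the cells would look roughly like this; the variable names are assumptions, and as in the demo the model is fit and scored on the full iris data.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

dataset = datasets.load_iris()

# 7 neighbours, distances measured with the Euclidean metric;
# n_neighbors could just as well be 3, 5, 9, ...
model = KNeighborsClassifier(n_neighbors=7, metric="euclidean")
model.fit(dataset.data, dataset.target)  # fit mostly just records the training points

expected = dataset.target
predicted = model.predict(dataset.data)

print(confusion_matrix(expected, predicted))
print("score:", model.score(dataset.data, dataset.target))
```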
So what have we learnt so far? Under supervised learning we learnt regression and classification. For linear regression we looked at some Jupyter notebooks, and under supervised classification we looked at various algorithms: logistic regression, the Naive Bayes classifier, decision trees, random forest (which is a variant of the decision tree classifier), then the support vector machine, which has two variants, a linear SVM that can draw only straight lines but with a bigger margin and a kernel SVM that is capable of bifurcating non-linear data, and finally the k-nearest neighbour classifier. Now the time has come to look at unsupervised learning. What are the characteristics of unsupervised learning? In supervised learning we said that right answers are given and we have to produce more right ones. In unsupervised learning, the noticeable thing is that we do not have a labelled column or a target column at all; all we have is the input feature data. Unsupervised learning has three subsections: one is called clustering, another is anomaly detection, and the third is dimensionality reduction. We will see these one by one.

Let us see what clustering is. Clustering simply means grouping. Let us use the same data we are very familiar with, tumor size and age; this is the kind of data given to us. What is the difference between what we are seeing on the screen now and what we saw earlier? In the earlier plots we used to see red crosses and green circles, is it not? There were basically two classes, one harmless and the other harmful, so the data provider had given us not only tumor size and age but also an additional column, the target column, or labelled data as we call it, containing class labels such as harmful and harmless. In the iris data set the class labels were setosa, versicolor and virginica, and in the example of whether Sachin will play tennis or not, the class column said whether he plays or does not play. But now look at the difference here: this is unsupervised learning, where right answers are not given to you, so all the data points are shown as crosses in the same red colour; there is no distinction between data points as far as the given data is concerned. What the machine learning algorithm is going to do for us is split this data into a number of groups. How does it do it? By looking at the similarity between points based on their input features. The next question is: how many groups is it expected to make? That decision has to come from the user, the person whose requirement the clustering is. For example, in this particular case our interest is to tell harmful from harmless, so we ask the clustering algorithm to separate the data into two classes. Once this data is fed into the algorithm it will split it into two groups, and all the algorithm returns to us is which data points belong to group 1 and which belong to group 2.
The algorithm takes no part in labelling the data; in this case, whether the group on the left-hand side is the harmful one or the group on the right-hand side is harmful has to be decided by the user who asked for the groups. So it is the user's requirement how many groups to make. Let me give a live example. Suppose I am the owner of a startup company with 200 employees, and I wish to give, say, a Diwali gift or a Christmas gift or a New Year gift to all my employees, and I have a certain budget for it. One of the approaches could be to decide on one particular gift; for example, I can decide to give wristwatches to all 200 employees. However, someone might think of a better approach: not everybody would love to have a wristwatch, as they might already have one or two or more; and if I decide to buy a set of books and gift it to all employees, some of them may not like it because they are not interested in reading. So what the owner of this company can decide is to make, say, four groups and have a choice of four gifts: one choice could be a set of books, another could be movie tickets, the third could be sports gear, and the fourth could be something like garments. Depending on the liking of individuals he can make four groups, ask the admin team to buy four different kinds of gifts, and give each group its respective gift; that will work better. But who decided to make four groups? That is the owner's choice. Is it possible to make 200 groups? Yes, why not, but it is not a practical solution; I cannot have 200 groups for 200 people and run after everybody asking what they want. The better approach is to decide practically on 3, 4 or 5 groups, decide which gift goes to which group, and then, once the algorithm returns the groups with the employee IDs, look at the similarity within each group and buy the kind of gift that all those people would appreciate. So: how many groups to split into, the user has to say; how the machine splits, is based on the similarity between the input features.

Now, which are the famous clustering algorithms available in scikit-learn? Let us see one of them, k-means clustering. This comes under unsupervised learning, and I will briefly explain the theory behind it and how it works. Let us say these are the data points given to us and we are supposed to make groups out of them, and let us say we want to make two groups. The process runs something like this: the algorithm is handed this data, and we also tell it that we want two groups. The first step the algorithm takes is to throw two random seeds onto this data; as you can see on the screen, there is a red cross and a blue cross, the two random seeds thrown onto the data points. Why two? Because we asked the algorithm to form two groups. How are they thrown? Randomly. That is the first step. In the second step, the algorithm goes to each and every green point and asks it a question.
It asks each green point: are you closer to the red seed or to the blue seed? If you are closer to red, turn red; if you are closer to blue, turn blue. This question is asked of each and every data point, and once it is answered the points change their colour: all the points that were closer to the red cross turn red, and all the points that were closer to the blue cross turn blue. After this step, the next step is that the randomly thrown seeds themselves move: the red seed moves to the mean of all the red points and the blue seed moves to the mean of all the blue points. So there is a step of moving the seeds to the new means; this one moves in this direction, that one moves in that direction, and eventually they sit at their new positions. Once that happens, all the data points change back to their original colour, green. This is the end of iteration one. Then the second iteration begins and follows exactly the same steps one by one: with the new positions of the seeds, each data point is again asked whether it is now closer to red or to blue and changes colour accordingly; then the seeds move once more to the new means, and the data points go back to their original colour. The same cycle repeats again and again. Until when? Until you stop the number of iterations. So you have two controls in hand: you can specify, when you call the algorithm, to run for say 100 cycles and then stop; or, alternatively, the iterations stop when the means stop moving. Eventually the means keep moving to newer places, a stage comes when they stop moving, and then we say that the algorithm has converged, there are no further movements, and the two groups are properly formed. That is how the algorithm works: it goes on and on and finally separates the data into two groups.

So now the groups have been made, and the question for you is: how do we determine how good the clusters are? Think about it. In the case of a supervised classification algorithm it was very easy to find out how the trained model was behaving. How did we do that? We compared the real class of a data point with the predicted class, because this was supervised learning and in classification we had a labelled column which told us whether a particular flower is setosa, versicolor or virginica; real answers were given to us. During model training we split the data into training and testing, kept the test data aside so that we could evaluate the model, trained the model using the training data, evaluated it using the test data, and compared the expected (real) values with the predicted values; that is how we got the confusion matrix, you remember.
There, 50 setosa were correctly classified as setosa and nothing went to a different class, while the other classes had some misclassifications, and that is how we found out how the model was performing. But in this case we have an unsupervised clustering algorithm where we do not have right answers. So what is the way to find out how good the clusters are? One of the criteria is the within-cluster sum of squared errors. Now that you have formed two groups, for example these two, all you have to do is find the distance of every point in a group from that group's mean, square it, and sum it up, and apply the same thing in the second group. If that sum of squared distances is low, the clustering is better. Obviously, if you make more and more clusters, the within-cluster sum of squared errors will keep going down. For example, suppose you have 150 data points and, as an extreme case, you decide to make 150 clusters out of them; then every point is in its own cluster and the within-cluster sum of squared errors is 0. However, that is not the objective; the objective of clustering is in the user's mind, and nobody else knows why he wants to make groups, like the example I gave you of the startup owner who wanted to give 200 people a gift and decided to make four groups. That was his choice, so the within-cluster sum of squared errors alone is not the point; the purpose of the user has to be served. Still, in our Jupyter notebooks and in the scikit-learn library there are mechanisms that help us judge whether a clustering is good or bad; apart from the within-cluster sum of squared errors there are certain plots, such as the elbow method and the silhouette plot, which will help us do that. We will take a quick look at the Jupyter notebook, but before that let me finish with one more clustering algorithm, called agglomerative clustering. This is one more approach, also called the bottom-up approach, and it works in this fashion. This diagram is called a dendrogram. There are data points A, B, C, D and so on lying in the space, and along the bottom left I have drawn the labels A, B, C, D and so on up to L. What you have to do is basically find the two closest points; as you can see from the spread of the data points, A and B appear to be closer to each other than to any other point, so I join A and B in this fashion. The next closest pair is J and K, so I draw a line there and join J and K together. This is how it starts working, and eventually it joins up all the points in this fashion, and all that is in our hands now is to make a cut somewhere. On the left-hand side you see a vertical arrow, which is the distance threshold, and on the right-hand side an arrow going down, which indicates the number of clusters. All that is required now is to take a cut somewhere. Suppose I cut somewhere here, along this red dotted line; what you then have to see is how many vertical lines come down from that red line. For example, this is one line, this is the second, third, fourth and fifth, so there are five clusters in total, and the points under this first line, A, B, C, D and up to E, form one cluster.
Now, point F is in a completely different cluster, and it is the only point there; under this other vertical line there is point G, which is also a separate cluster; and look at this line now: all the points from H, I, J, K, M and N come under one cluster, and point L comes under another. As you can see, points L, F and G are anyway farther from each other and from the rest, so they belong to separate clusters. With this red-line cut I am getting 1, 2, 3, 4 and 5 clusters, each defined by the respective points hanging down from that line. As I change the cut, if I move it up I will naturally get fewer clusters, and if I bring the red line down, making the distance threshold smaller and smaller, I will get a large number of clusters. So I have to play with the distance threshold to get as many clusters as I want; this is the bottom-up approach. All right, so let me switch to the Jupyter notebooks and show you how clustering works. I am in the unsupervised learning folder and I will take you to the basics of k-means clustering; let me open this notebook. As usual I clear all the previously generated output, and then I can run the cells one by one and explain what is happening. The first cell is %matplotlib inline, so that the plots appear within the Jupyter notebook itself, followed by the imports. In the next cell I create 150 data samples; this is synthetic data, so I am just defining the data set, and then I plot it to show you how the data looks. As you can see, there are 150 scatter points on this graph, and with the human eye you can see three distinct groups. In a real-life scenario you will not get such a clean distinction between data points, but let us see how the clustering algorithm works here and how it makes three clusters out of it. I import KMeans, which is the name of the clustering algorithm, and set the number of clusters to three, which is the natural-looking choice here, with init='random'. I run this and plot it, and you can see the result: it has formed three clusters, a blue one, a green one and a yellow one, along with their centroids. The within-cluster sum of squared errors is called distortion here; if I want it, I can print the distortion, which is km.inertia_, and in this case it comes out to 72.48. Now suppose I choose to make four clusters out of the same data points: it makes four clusters, blue, yellow, red and green, with their respective centroids, and naturally when I find the within-cluster sum of squared errors I expect the distortion to go down, because with four clusters instead of three the within-cluster sum of squared errors will be lower; it is now 62. And if I say, let us make five clusters and see what happens, it makes these five clusters. (A rough sketch of these cells follows.)
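Reconstructed from the narration, the notebook cells would look roughly like this; the synthetic data is generated here with make_blobs, which is an assumption about how the 150 samples were created, and the loop at the end simply prints the distortion (inertia) as the number of clusters grows.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 150 synthetic samples forming three visible groups (assumed to mimic the notebook's data)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)
plt.scatter(X[:, 0], X[:, 1], marker="o")
plt.show()

# k-means with three clusters and random initialisation of the seeds
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker="x", s=200)
plt.show()
print("distortion (inertia) with 3 clusters:", km.inertia_)

# Distortion keeps dropping as we ask for more clusters
for k in (4, 5):
    km_k = KMeans(n_clusters=k, init="random", n_init=10, random_state=0).fit(X)
    print(f"distortion with {k} clusters:", km_k.inertia_)
```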
So: green, blue, red, black and yellow, and the within-cluster sum of squared errors is now 55. If I keep increasing the number of clusters, all the way to 150, this distortion will naturally become 0, but that is not the objective; the objective is to make as many clusters as the user wants. So let me stick to three, because that looks like the natural grouping here. Now I want to show you the elbow method to find an optimal number of clusters. Let me execute this: on the x axis we have the number of clusters, on the y axis the distortion, or within-cluster sum of squared errors, and you can see a noticeable drop in the distortion up to about three or four clusters. So you could probably say we should stop making more clusters after that point; around four clusters appears to be optimal. If you keep increasing the number of clusters, the distortion will of course keep falling, but there will not be a significant fall. The main thing to note is that domain knowledge is essential: the user asking for the clusters needs to decide how many clusters to make; but if you do not have domain knowledge and still want to pick some number of clusters, you can take the help of such a plot. There is one more method, the silhouette plot; let me show you that graph as well. In this graph you can see lobes: if I make three clusters, the three lobes have very uniform width and length, and if you get a silhouette plot looking like this, your clustering is probably good. However, if I make four clusters on this particular data set, the lobes no longer look uniform: two of them are very thin and short, two are big, and even the lengths differ, so that is not very good clustering; and five clusters, for example, is even worse. So the silhouette plot can also help you determine whether the quality of the clustering is good or bad. All right, that is about clustering; let us move on. I will go back to the PPT and cover the remaining topics.

Now let us look at another technique, called principal component analysis. While we were going through the Python for data science material, we briefly touched on this technique during data preprocessing, or feature engineering, and said we would cover it in detail under the machine learning topics; the time has now come to look at what principal component analysis is. It comes under unsupervised learning. Principal component analysis is used in many fields, not only in machine learning: it is used for data compression in communication systems, for feature extraction in computer vision and image processing, and in machine learning it is used for dimensionality reduction. Now I will give an example. Suppose we have two-dimensional data points like this, with features X1 and X2, and we have 1 lakh such data points, but I do not have enough memory to store them. Typically, to store one data point I have to store its two coordinates, x1 and y1, which means 2 memory locations per point.
So to save 1 lakh data points I will need 2 lakh memory locations, is it not? Now suppose I do not have 2 lakh memory locations and somehow have to manage within 1 lakh; what do I do? The technique goes like this. From this scatter plot you can see that the data is aligned along a particular direction, a particular axis; this is the axis around which most of the data spread lies. So what they do is draw a line along it, treat that line as the new axis, and project each of the points that lie above or below the line onto the line. Effectively, the new data on this new axis looks like this: the 2D data has been converted into 1D data. For the initial 2D data I had two coordinates, x1 and y1, for each point; in the new representation I have just one coordinate per point, z1, z2, z3, and I can manage with half the number of memory locations. But then how do I reconstruct the data? They rotate the new axis back to its original slope and original y intercept and then simply drop the points off the line, and this is how I get the data back. Now you would say: is this the same data I had in the beginning? What I had in the beginning was this kind of data, but what I now have is a new representation which is not exactly the same, although very close to it. So basically this is a lossy technique: some information is lost. What information? The heights of the data points above or below the new axis, that is the information lost. And if you are okay with losing some information, the dimensions can be considerably reduced using the technique called principal component analysis, and there is a scikit-learn routine to do it. Now, if you look at the original data you have x, y coordinates, with values like 10.1, 20.3; 15.6, 24.5; 18.8, 26.2, which are nothing but x and y coordinates. After conversion I get two components, PC1 and PC2; these are the two new axes, and the information is mostly spread along PC1. PC2 is an axis orthogonal to PC1 going through its mean. So my new PC1 values will be something like 5.5, 15.3, minus 25.8, and on PC2 I have very small numbers, 0.3, 0.16, minus 0.24, which are nothing but the heights of the data points from PC1. In my earlier representation I could not chop off either x or y: if I dropped either one to save memory I would lose a lot of information, so that will not do. But with the transformed data I can chop off PC2 while retaining PC1, because PC2 anyway contains very little information. This is how we convert 2D data into 1D data: after the transformation all I am going to save is PC1, not PC2. If I had to save both PC1 and PC2, that would effectively mean storing as much as x and y; but because I have less memory, all I have to do is transform x and y into PC1 and PC2 and chop off PC2, and that is how I save half the memory. Now let us take the case of 3D data: if the data has three dimensions, what is the way out? The way we drew a line on the 2D data, here we can draw a plane on the 3D data; so let me draw a plane here, and the next step is to drop all the points onto that plane.
So all the points above the plane come down onto the plane, and those below come up to it, and eventually I get 2D data; effectively I can bring three dimensions down to two. Earlier we learnt how to bring data down from 2D to 1D, and here we are learning how to go from 3D to 2D; and if I can go from 3D to 2D I can definitely go from 3D to 1D, so mathematically, if I am successful in bringing 3D data down to one dimension, I can bring n-dimensional data down to even a single dimension, and that is an amazing technique which we can quickly demonstrate using one notebook. So let us see how principal component analysis works. For this I have to go for a minute to the Python for data science folder, where there is a notebook called Learning PCA; I am opening that notebook and clearing all its output first. In this notebook we show PCA in action: how you can plot data after a PCA transformation when the original data has four dimensions. PCA helps us retain most of the information even when we reduce the dimensions to two. Later we show the effect of standardization when applying PCA. What we conclude is that whenever we apply PCA it is important not to forget to scale the input data using one of the data scaling methods, such as the min-max scaler or the standard scaler; so whenever you apply PCA, do not forget to do data scaling, that is the conclusion of this notebook. I will just run it and show it to you; it runs on the iris data set. As you know, the iris data set has four dimensions. First I show you a plot with seaborn, simply because I wanted to show you that I can plot only 2D data; in this case I am plotting sepal length against sepal width, which means I am chopping off petal length and petal width and not using that data in the plot at all. This is how the plot looks; it appears completely mixed up, is it not? Now what I will do is apply PCA. I have the original data with all four dimensions defined here as X, and I use PCA with n_components=2, which means bring the number of dimensions down to two. My reduced X is pca.fit_transform(X): I feed X to it and it generates the reduced X. I show you the new values for PC1 and PC2, and then I plot PC1 against PC2 with a legend, and this is how the plot looks. Now, do you think this plot is substantially better than the earlier one? In the earlier plot I simply chopped off two columns, which is why it looked like that; here I am not chopping columns, I am transforming my original 4D data into two dimensions, converting 4D data to 2D without losing much information, and that is why I could successfully plot PC1 and PC2; the new axes are PC1 and PC2. Next I will apply this PCA on standardized and non-standardized data and quickly show you the difference. We have already learnt feature scaling and standardization; I apply it, then apply PCA, and what you see here is the plot from applying PCA before standardization. (A rough sketch of these cells follows.)
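Reconstructed from the narration, the notebook cells would look roughly like this; the seaborn plotting details are simplified to plain matplotlib, and the variable names are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target  # four input dimensions

# PCA without standardization: reduce four dimensions down to two principal components
reduced_X = PCA(n_components=2).fit_transform(X)
plt.scatter(reduced_X[:, 0], reduced_X[:, 1], c=y)
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.title("PCA without scaling")
plt.show()

# Same thing, but standardize the features first and then apply PCA
X_std = StandardScaler().fit_transform(X)
reduced_std = PCA(n_components=2).fit_transform(X_std)
plt.scatter(reduced_std[:, 0], reduced_std[:, 1], c=y)
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.title("PCA after standard scaling")
plt.show()
```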
Here again you see some mixing, but if I do standardization first and then apply PCA I get a beautiful bifurcation: you can clearly see the green, the blue and the other colour nicely separated. The only additional step I took was to apply the standard scaler before applying PCA. Let us go back and cover some other topics. Unsupervised learning also contains an area called anomaly detection. And what is an anomaly, by the way? An anomaly is something that is not normal. So let us look at the "catch me if you can" notebook: 284K good transactions and 492 fraud transactions; with such a data set, even if you predict all transactions as good transactions it can give you 99 percent accuracy. I will open that notebook and show it to you; it is basically about credit card fraud detection. Let me go to the unsupervised learning folder and find "catch me if you can"; this is the notebook I want to show you. I will not run it right now, because running it takes a little while, but there is output already generated in it. This is a credit card fraud detection system: the bank, or the credit card company concerned, has given its data, and the data set contains many columns, 31 in all; the last two columns are an amount column and a class column. Amount is basically the transaction amount, and the class column is either 0 or 1, where 0 means a good transaction and 1 means a fraud transaction. And if you notice, with this kind of data you get 99 percent accuracy straight away, and the reason is that if you go down and analyse the data you will find that the good transaction class contains about 284,000 transactions, while the other class, class number one, the fraudulent transactions, contains just 492. Typically that will be the proportion: most transactions that happen with credit cards or banks are good. But among some 3 lakh transactions there are about 500 frauds, and the bank will pay us a lot of money if we can catch some of those 500; there is no point in declaring that all transactions are good. Even if you declare all transactions good and valid on such a data set you may get 99 percent accuracy, because the fraudulent transactions are so few, but the whole idea is to catch as many of those frauds as possible, and that can only come out if we convert this particular problem into an unsupervised anomaly detection problem. So let me explain what anomaly detection is and how to convert such a problem into one: the fraud is to be treated as the anomaly. During model training, what we do is simply drop the fraudulent transactions from the input data, so all the input data we consider for our algorithm is good transactions, and we drop the class column itself. This gives data points that are close to each other, forming a group. Once the model is ready and a new data point is sent to it, it is checked whether it falls within the boundary of this group; if it does not, it can be classified as a fraud. (A small sketch of this idea in code follows.)
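The notebook itself is not reproduced here; as an illustration only, here is a minimal sketch using scikit-learn's one-class SVM (the variant mentioned a little later), assuming X_good holds the feature columns of the good transactions and X_new holds new, unseen transactions. A prediction of +1 means the point falls inside the learned group, and -1 means it is flagged as an anomaly.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Stand-in data: in the real notebook these would be the credit card feature columns
rng = np.random.RandomState(0)
X_good = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))   # good transactions only
X_new = np.vstack([rng.normal(0, 1, size=(3, 5)),          # looks like the good group
                   rng.normal(8, 1, size=(2, 5))])         # far away, should look anomalous

scaler = StandardScaler().fit(X_good)

# Train only on the good transactions; nu roughly controls how strict the boundary is
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
detector.fit(scaler.transform(X_good))

# +1 = inside the group (treated as good), -1 = outside the group (flagged as fraud/anomaly)
print(detector.predict(scaler.transform(X_new)))
```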
For example, look at this particular picture: these are all the data points of good transactions, and these points are the only thing fed to the algorithm; it notes them down and forms a model. Now, if a new incoming transaction on the credit card falls within this group, it is classified as a good one, and if it falls outside the group it can be classified as a fraud. For example, this red point is an unseen transaction coming in: if it lands in the position shown here, the model says it is a good transaction; if it appears somewhere like this, it will say it is a fraud transaction. And how does it do that? Using something called Gaussian kernels. A Gaussian kernel is a hat-type structure: the top view looks like concentric circles, the value is 1 at the top of the hat, and as you go away from the centre it falls towards 0. What you do is place this hat over the recorded data, and whatever falls under the hat you classify as normal, while whatever falls outside the hat you classify as an anomaly. The size of the hat can be adjusted depending on the application: if it is a financial transaction and you want to be very strict, you can keep the diameter of the hat very small. In this way you can convert your supervised classification problem into an unsupervised anomaly detection problem, especially in cases where the two classes are skewed, that is, unbalanced, as with the credit card fraud data where we have some 2,84,000-odd good transactions and 492 bad ones, a completely imbalanced data set; in such cases we can use the anomaly detection technique. All right. The algorithm used for this is the one-class support vector machine, another SVM variant, used for anomaly detection. Now, in normal situations you may not always get the points sitting in one circular area; you might get data points spread out in this fashion, and then you need to draw a curvy boundary around them, which can be managed with individual kernels placed over individual smaller areas; whatever data points do not fall under any of the kernels, you can classify as fraud, and the rest as normal. Okay, that is about unsupervised learning.

Let us now quickly move on to machine learning best practices, starting with streamlining workflows with pipelines. So far we have seen that we select a scaling method, we select a dimensionality reduction technique, we select an algorithm, and we pass the data from one to the next. What scikit-learn gives us is the facility to put everything into a pipeline: as you can see here, within a pipeline we have a scaler, either the min-max scaler or the standard scaler, then the data is passed on to dimensionality reduction, something like PCA, then we select an algorithm, and the whole data set is passed to the pipe; it goes in at one end and comes out at the other. (A rough sketch follows.)
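Here is a minimal sketch of such a pipeline, assuming iris as the data and logistic regression as the final algorithm (any of the classifiers from this module could be dropped in instead); the last two lines also score it with 10-fold cross-validation, one of the practices discussed next.

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Scaling -> dimensionality reduction -> learning algorithm, all inside one pipe
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Instead of model.fit / model.predict we say pipe.fit / pipe.predict
pipe.fit(iris.data, iris.target)
print(pipe.predict(iris.data[:5]))

# 10-fold cross-validation: report mean accuracy plus/minus its spread
scores = cross_val_score(pipe, iris.data, iris.target, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```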
So, instead of saying model.fit we can say pipeline.fit, and instead of model.predict we can say pipeline.predict, and when machine learning professionals share their models they basically share their pipelines, which contain the various pieces: scaling, dimensionality reduction and the learning algorithm. Another good practice is the use of cross-validation techniques. As we discussed earlier, we split the data into training and testing, 70 percent for training and 30 percent for testing. But what typically happens is that people keep tuning their algorithm based on the accuracy they get on the test data, and in this process, if they are not getting good test accuracy, they keep playing with the model parameters, eventually consuming the entire test data set and overfitting on the test data. The better way is to split the training data set further into two: one part is called the training set and the other the validation set. On the validation set you fine-tune your algorithm as much as you want, and once the model is ready and fine-tuned, you finally use the test data only for the purpose of finding the final accuracy. This method is called the holdout method. But there is one more problem with it: there is scope to manipulate where your training set and your test set come from, or rather which samples go into the test set and which into the training set. It can be deliberate or entirely unintentional, but there can be hard samples in the training set and easy samples in the test set, or vice versa, effectively giving a wrong picture of the model's performance. The better way, designed to handle this scenario, is the k-fold cross-validation method. Here the training data is split into, say, 10 parts. In the first iteration the first 9 parts are used for training and the last fold is used for testing; in the second iteration a different fold is held out for testing and the remaining parts are used for training; and so on, so that in every iteration the part marked blue is used for testing and the rest for training. As you can notice, every data point goes into a test set at least once and into a training set at least once, effectively nullifying the effect of how the data happens to be distributed, and finally, using cross-validation techniques, you can declare the performance of your model as a mean accuracy plus or minus some variance. This is one of the best practices.

Now, let us look at learning with ensembles, which is nothing but classifiers collaborating with each other. We have seen many classifiers, such as logistic regression, naive Bayes, decision trees, support vector machines, k-nearest neighbours and so on. Is there a way to take advantage of each one of them, bring them together, and make a collective effort to generate the final prediction? Yes, it is quite possible, and people have done a lot of experiments with this. For example, look at this diagram; this is called a majority vote classifier: the training data is fed to a number of classification models. (A quick sketch follows.)
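A minimal sketch of the majority-vote idea using scikit-learn's VotingClassifier, assuming iris data and three of the classifiers covered in this module; each model votes and the majority class wins.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Three different classifiers playing the roles of C1, C2, C3 ...
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",          # each model casts one vote; the majority class wins
    # weights=[2, 1, 1],    # optionally weight the votes, e.g. by training accuracy
)

scores = cross_val_score(voter, iris.data, iris.target, cv=10)
print(f"majority-vote accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```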
Now, let us look at learning with ensembles, which is nothing but classifiers collaborating with each other. We have seen many classifiers such as logistic regression, Naive Bayes, decision trees, support vector machines, k-nearest neighbours and so on. Is there a way to take advantage of each of them, bring them together and make a collective effort to generate the final prediction? Yes, it is quite possible, and people have done a lot of experiments with this. For example, look at this diagram; this is called a majority vote classifier. The training data set is fed to many classification models C1, C2, C3, ..., Cn, which are nothing but your logistic regression, Naive Bayes, decision trees, support vector machine and so on, all of them fine-tuned and cross-validated. On new data they generate predictions p1, p2, p3, ..., pn, and then a vote takes place; whatever gets the majority of votes is the final prediction.

So, what is the drawback of this? It is a sort of teamwork, and teamwork sometimes benefits and sometimes may not benefit at all. In some cases it might happen that one good classifier, such as the support vector machine, was the only one predicting correctly; but because the rest of the weaker classifiers predicted incorrectly and you take the majority, you may end up misclassifying that data point. So there is no guarantee that this teamwork will benefit you 100 percent of the time. How can we improve this? Can we give each classifier a weight? If we give weight 1, weight 2, weight 3 to the different classifiers, how do we determine the weights? One way is to look at their training accuracies and assign weights accordingly: instead of giving a vote of one to every classifier, we find out which classifier works better on the training data and give it more weight.

However, since we are learning machine learning, is it not better to outsource the whole job of weighting the classifiers to yet another learner? That would work even better, and this is why the stacking approach came into play. The training data is fed to classifier 1, classifier 2, up to classifier n, and the outputs generated by these classifiers are then fed to a further classifier, which in turn gives the final output. Which classifier predicts correctly for a given input is itself a machine learning problem, and this meta-classifier learns exactly that; you can cascade this to however many levels you want. This is called stacked generalization.

While stacked generalization is one technique, there is another technique called bagging. Here the training data is split into different smaller parts, called samples, and each sample is fed to its own classifier: classifier 1, classifier 2 and so on. I am sure you have seen this type of model before; let me remind you of random forest. In the case of random forest we subdivided the big data set into smaller parts and fed them to decision tree classifiers; the only difference here is that we can use any other classifier as well, but the philosophy is the same: you split the data, feed each part to a classifier, allow it to overfit on that particular part, and then apply the voting strategies we saw earlier. This is called the bagging strategy. Bagging is typically used when you want to reduce variance while retaining low bias; it is not recommended for models that have high bias. Parallel execution is naturally possible because you are splitting the data and feeding it to different classifiers. A small sketch of this bagging idea is shown below.
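Here is a minimal bagging sketch in scikit-learn; by default BaggingClassifier bags decision trees, which mirrors the random forest idea mentioned above, and the data set and number of estimators are illustrative assumptions only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 10 estimators (decision trees by default) is trained on its own
# bootstrap sample of the data; their predictions are then combined by
# voting / averaging to give the final prediction.
bag = BaggingClassifier(n_estimators=10, random_state=1)

scores = cross_val_score(bag, X, y, cv=10)
print(f"bagged accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```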
So, we have seen majority voting, stacked generalization and bagging; now let us see what boosting is. Boosting is typically applied to weak learners. For example, if a particular student in a class is not doing well, the teacher encourages him to try again, and what the teacher typically says is: try again, but do not repeat the same mistake. The same philosophy is applied in boosting: a weak learner is told to try again, but with more focus on the samples that were misclassified last time. In each iteration it misclassifies different points, but collectively, after say three iterations, if all three learners are combined and the boundary is redrawn, it classifies correctly. So it is more of an iterative process, and there are algorithms like AdaBoost and XGBoost for this methodology; it converts weak learners into strong learners. Unlike bagging, boosting is a sequential method.

So, that is about all. We have covered a lot of things during this module. We covered supervised learning, and under supervised learning we covered regression and classification. We looked at various classification algorithms: logistic regression, Naive Bayes, decision trees, random forest, linear and kernel support vector machines, and k-nearest neighbours. Later on we moved to unsupervised learning, where we discussed clustering, dimensionality reduction and anomaly detection. After that we covered some of the best practices, including cross-validation techniques, and we finished with the ensemble techniques. Thank you for going through this machine learning part 2; I hope you enjoyed this module. Across the two modules on machine learning we looked at various algorithms provided by the scikit-learn library for linear regression, supervised classification and unsupervised learning, as well as best practices. I hope that the two modules on the scikit-learn library, the demonstrations, as well as Python for data science will get you going for your Smart India Hackathon challenge. All the best for Smart India Hackathon 2019. Thank you.