Namaste, and welcome to the next session of our course on Practical Machine Learning. In this session, we will go through some machine learning concepts visually, so that it is easier for you to understand or build intuition for them. We will use a neural network playground as a tool to visualize machine learning algorithms. So, we have a tool called Neural Network Playground with a very interesting tagline: tinker with a neural network right here in your browser; do not worry, you cannot break anything. So, let us explore bravely without worrying about breaking things. Before we begin, let me tell you a few things about this particular tool, since most of you are new to it. In the previous session, we talked about a componentized view of a machine learning model, where we said that we need training data, we set up the model, we train the model, and we evaluate the model's performance. In training the model, we define the loss function and we also decide what kind of algorithm will be used. So, if we are using, let us say, gradient descent or any of its variants, we have to set things like the learning rate, and we sometimes also have to use regularization in order to control the model complexity. We studied all these things in the previous sessions. Let us now try to use them in practice and see how they affect the training of our machine learning models. Here on your left, you see a pane where we can choose different kinds of data. You can think of these as simulated data sets. The one that is highlighted here is the data set which has a clear linear separation between the two classes. We have a positive class, denoted by the blue color, and a negative class, denoted by the orange color. Please note that in the playground, the orange color is used for negative values, whereas the blue color is used for positive values. So, this is about the data sets.
So, we have 4 different data sets: one which is linearly separable, while the remaining 3 are not linearly separable but have varying degrees of complexity. And we will see how to make use of techniques like feature crosses to fit a model here. We will be using this neural network playground later with neural networks, where you will see how some of the things that we construct by hand in traditional machine learning algorithms are taken care of automatically by the neural network. So, we can choose the data set in this pane. We can specify the ratio of training to test data. Remember, we talked about the training-test split, and here you can specify in what percentages you want to split the training and test data. We have a slider that defines how much data is used for training and how much is used for testing. Here, let us say we use 70 percent of the data for training and 30 percent for testing. Then, we can also add noise to the data set. Currently, on the screen, you can see that this data does not have any noise; it has 0 noise. As you add more and more noise, the classes get polluted with points from the other class. So, at 45 or 50 percent noise, you will see that some of the negative points are present among the positive class and vice versa. So, with noise we can simulate real-life data sets, which generally contain some noisy labels. And finally, since this neural network playground uses mini-batch gradient descent, we also get to set the batch size; let us set it to 16. Then, you can simply press the generate button, which will generate the data for us. Note that each of these data points has two features, x1 and x2.
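If you want to reproduce these data controls outside the browser, the sliders can be mimicked in a few lines of NumPy. This is only a sketch, not the playground's actual code; the function names, value ranges, and seeds below are our own assumptions.

```python
import numpy as np

def make_linear_dataset(n=200, noise=0.0, seed=0):
    """Simulate a 2-feature, linearly separable dataset; `noise` is the
    fraction of labels flipped, mimicking the playground's noise slider."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-6, 6, size=(n, 2))        # features x1, x2
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # blue (1) vs orange (0)
    flip = rng.random(n) < noise               # pollute labels with noise
    y[flip] = 1 - y[flip]
    return X, y

def split(X, y, train_frac=0.7, seed=0):
    """Training/test split, like the playground's ratio slider."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

X, y = make_linear_dataset(noise=0.45)         # heavily polluted labels
X_tr, y_tr, X_te, y_te = split(X, y)           # 70/30 split: 140 / 60 points
```

With `noise=0.45`, roughly 45 percent of the labels end up on the wrong side, which is exactly the "polluted classes" effect described above.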
And this is the part where we build our model. Currently, we are using a logistic regression model, which is set up by having one output layer with a sigmoid activation. This will become clearer once we get into neural networks; we have briefly seen neural networks, so you roughly know what an activation and an output layer are. Here, we use one output layer with a sigmoid activation. We can also specify a bunch of other parameters, like the learning rate and the regularization. Right now we are not using any regularization, but you can use either L1 or L2. You can also fix the regularization rate, which was denoted by the parameter lambda as we saw in the previous class, and we can define the problem type; here we are choosing classification. So, you can think of this particular part as defining the hyperparameters of our model. And this is where you visualize what kind of prediction the model is giving. In this part, you will see learning curves appearing when we start training, and you will be able to see the test loss and the training loss. And this is the part where we control the training: if you press the play button, the model starts training; you can stop the training; and you can revert to the initial situation. You can use this button to see how the model trains stepwise, so you can see what happens in the first step, the second step, and so on. So, let us try to solve this problem with the setting where we use 70 percent of the data for training and 30 percent for testing. We have a data set without any noise, we are using a batch size of 16, and we are using a learning rate of 0.3. What we can explore is how the learning rate affects the loss or the convergence. We saw earlier that the learning rate affects convergence.
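Under the hood, "one output layer with a sigmoid activation trained by mini-batch gradient descent" amounts to just a few lines. A minimal sketch, assuming the log loss and a random weight initialization; the playground's exact internals may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.3, batch_size=16, epochs=50, seed=0):
    """Logistic regression = one sigmoid output unit, trained by
    mini-batch gradient descent on the log loss."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=X.shape[1])   # random init, as in the playground
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            i = order[start:start + batch_size]
            p = sigmoid(X[i] @ w + b)            # predicted probabilities
            w -= lr * X[i].T @ (p - y[i]) / len(i)
            b -= lr * np.mean(p - y[i])
    return w, b

# Noise-free linearly separable data: class = sign of x1 + x2.
rng = np.random.default_rng(1)
X = rng.uniform(-6, 6, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w, b = train_logreg(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

On this clean, separable data the learned weights on x1 and x2 both come out positive, matching the boundary x1 + x2 = 0.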
If you have too small a learning rate, it takes longer to reach the minimum. And if you use a very high learning rate, there is a possibility that you will never converge, because you might keep oscillating around the minimum. So, we have to find a sweet spot between the two extremes: we do not want too low a learning rate or too high a learning rate, but a sufficiently large learning rate so that we can safely train our model in an efficient manner. By safely, I mean the training does not oscillate or anything like that. We will also try different regularizations along with different regularization rates. So, let us see what happens. We have reset everything, and now you can see that the model is initialized with some weights. There is a weight of 0.46 on x1 and a negative weight on x2; this is w2 and this is w1. And you can see that the width of the line indicates the strength of the weight. As things appear right now, x1 seems to be a stronger predictor than x2; x2 seems to be a weak predictor because its line is quite faint. And the color of the line tells you the sign of the weight: this color is slightly orange, so this weight will be negative, and the other weight will be positive. Let us start stepping through the model. You can see that the error is reducing gradually on both the training loss and the test loss. Now it has slowed down. You can see that the training loss is moving towards 0 as we go through more epochs; after 27 epochs, the training loss is 0.001 and the test loss is 0.002. Now, since we achieved this state within so few epochs, let us reduce the learning rate, reset, and see what happens. So, now you can see that the weights are randomly initialized again.
Now there is a high weight on x2 and a low weight on x1. This is the initial stage; let us start stepping through. You can see that it is learning very slowly. If we play this further, you can see that it takes far longer; it has already taken more than 500 epochs and is not even near any of the numbers that we got from our first experiment. It is training very slowly. Let us go to the other extreme and use a learning rate of 10, which is very high; you can see that in one epoch we managed to get to 0 loss. If you reduce it to 1, you can see that in pretty much one epoch we are able to achieve very low training and test loss. And with 0.1, within the first 7 epochs we get very close to 0 loss. So, now let us try to increase the noise level and see how the algorithm responds. Now we get a loss which is much higher than what we were getting previously, and that is expected, because some of the points are misclassified; you can also check the weights on each of the features. If we increase the learning rate, we are able to train faster, but we are still not able to get a 0 error rate. So, what we can do is increase the complexity of the model and try to learn some more complex representations so that our losses are reduced. One way of increasing the complexity is by raising the polynomial degree of the original input features. So, here what we will do is raise x1 and x2 to the power of 2 and also add an interaction term, and see how that affects the performance. You can simply click here to add the square of x1, click here to get the square of x2, and click here to get the interaction term. Now let us try to train again. You see, when I used a learning rate of 10, there are oscillations. Let us play it back and watch how the oscillation plays out. You can see that it is oscillating.
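The learning-rate effect we just saw can also be checked numerically. A small experiment, assuming full-batch gradient descent and a fixed number of epochs (our own simplification of the playground's optimizer), comparing a too-small learning rate against a reasonable one from the same random start:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_after(X, y, lr, epochs=10, seed=0):
    """Run gradient descent for a few epochs and report the final log loss,
    so different learning rates can be compared from the same start."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(X)
        b -= lr * np.mean(p - y)
    p = np.clip(sigmoid(X @ w + b), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
X = rng.uniform(-6, 6, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

too_small = loss_after(X, y, lr=0.001)   # barely moves from the initial loss
sweet_spot = loss_after(X, y, lr=0.3)    # converges much faster
```

After the same 10 epochs, the 0.3 run ends with a clearly lower loss than the 0.001 run, which is exactly the slow-learning behaviour we observed in the playground.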
That is, of course, a very high learning rate; we will bring it down to 0.3 and see what happens. Now you can observe that in the initial two experiments we had a boundary which was linear, but here we have a boundary which is non-linear, slightly curved, and we were able to get this kind of complex boundary because we added these interaction features. You might be curious to know what these colors here are. You can see that if I just use x1 as a predictor, this is how I can separate the points; if I use x2, this is the separation I get; x1 squared only operates on that particular highlighted part; for x2 squared I get this separation, and for x1 x2 I get this kind of separation. So, you essentially get a separator for each individual feature, and the overall separator is a linear combination of all these individual separators; you can see their different weights here. So, let us try to train even more slowly and see what happens. We seem to have plateaued; we are not able to get the loss below that. So, let us add a couple more terms, which are the sines of the original features. You can see that now we have a boundary which is even more interesting and more complex than the previous ones. At this point, you can pause the video and play with the learning rate and regularization to see whether you can get a better model, one which will not overfit on the test data. So, let us add regularization and see its effect. We will start with L2 regularization; before doing that, let us once run the whole thing and note down the weights. You can see that x1 and x2 have strong positive weights, and there is some weight on sin(x1), while all the other features do not have high weights; you can make that out from how faint their lines are.
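The extra inputs we have been clicking on (the squares, the interaction term, and the sines) are just hand-built transformations of x1 and x2. A sketch of that feature expansion; the function name and column order here are our own choices:

```python
import numpy as np

def expand(X):
    """Hand-built feature crosses, like the playground's extra inputs:
    x1, x2, x1^2, x2^2, x1*x2, sin(x1), sin(x2). A linear model on these
    columns can draw the curved boundaries seen on screen."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2,
                            np.sin(x1), np.sin(x2)])

row = expand(np.array([[2.0, 3.0]]))[0]   # one point expanded to 7 features
```

The model itself stays linear; only the inputs become non-linear functions of the original two features, which is why the combined separator is still a weighted sum of the individual separators.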
So, let us say we use L2 regularization with a regularization rate of 0.1 and see what happens now. Probably the learning rate is a bit high; you can see a more complex boundary getting learned, and it is very interesting to see that it now uses small weights everywhere except for these two features. Now, if we use L1 regularization here, let us see what happens. L1 regularization has a tendency to put 0 weight on the features that are not important, and you can see that all these features got 0 weight; the only features that matter here are x1 and x2. So, L1 regularization can also be used for feature selection, for finding the features that are probably more important for the classification task; and indeed, L1 regularization is widely used for feature selection. One way in which you can build machine learning models is to take your features and, if you have enough computational power, raise them to some degree of polynomial; then, through L1 regularization with a sufficient regularization rate, you get feature selection as part of the training process. The training happens, and the most important features get picked up. Later we will see that we do not have to construct the feature crosses by hand; a neural network takes care of that automatically. So, let us go to another data set. Here, unlike our previous data set, this data set is non-linearly separable. Even if we have clean data without any noise, we cannot just use the original features x1 and x2, because they simply do not have the capacity to learn the complex boundary, which in this case is a circle separating the two classes. So, what we will do is train it right away once and see where we reach. You can see that the training error is 0.49. Let us add the interaction features and train again.
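The way L1 drives unimportant weights exactly to zero can be reproduced with proximal gradient descent (a gradient step followed by soft-thresholding). This is a sketch on a circle-like data set of our own making; the learning rate, regularization rate, and epoch count are assumptions, not the playground's values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_l1(X, y, lr=0.1, lam=0.05, epochs=500, seed=0):
    """Logistic regression with L1 regularization: a gradient step on the
    log loss followed by soft-thresholding, which zeroes small weights."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(X)
        b -= lr * np.mean(p - y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step
    return w, b

# Circle data: positive class strictly inside a circle of radius 2.5.
rng = np.random.default_rng(1)
raw = rng.uniform(-5, 5, size=(400, 2))
y = (raw[:, 0] ** 2 + raw[:, 1] ** 2 < 2.5 ** 2).astype(int)

# Feature crosses: x1, x2, x1^2, x2^2, x1*x2 (standardized for stable steps).
F = np.column_stack([raw[:, 0], raw[:, 1], raw[:, 0] ** 2,
                     raw[:, 1] ** 2, raw[:, 0] * raw[:, 1]])
F = (F - F.mean(axis=0)) / F.std(axis=0)

w, b = train_l1(F, y)
acc = np.mean((sigmoid(F @ w + b) > 0.5) == y)
# L1 keeps the squared features (circular boundary) and shrinks the rest.
```

As in the playground, the squared features end up with the large weights, while x1, x2, and x1*x2 are pushed to (or very near) zero, so the regularizer has performed the feature selection for us.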
You can see that we are getting very low training and test error; we have almost a perfect classifier, a circle separating the two classes. We can stop it, and let us try to use L1 regularization here with a 0.1 regularization rate and see what happens if we retrain. You can see that we have again achieved fairly low training and test error, and now the only features that are important are the squared features. That is quite natural: since the decision boundary is circular, the squared features will obviously have the larger say. So, here we took the features, raised them to a second-degree polynomial, and simply applied L1 regularization, which picked out these two features for us. Let us try to apply L2 regularization and see what happens. L2 regularization also got us fairly similar training and test error, but you can see that L2 regularization does not assign 0 weights to the features; instead, it assigns weights which are very small. This is one of the differences that you can observe between L1 and L2. As an exercise, you should pause the video and try out a bunch of different combinations. I would suggest not changing the activation type here, because we are solving this as a classification problem and sigmoid is the right activation type. But I would strongly encourage you to change the learning rate, the regularization, and the regularization rate. Try to add more noise to the data and see whether you can fit the model, and what the model looks like after getting fairly low training and test error. So, let us try the final data set, which is the XOR data. This is an even more interesting data set; you can see that the classes are in an XOR configuration, so we cannot hope to solve it with a simple linear classifier. We add the interaction features, or the second-order polynomial features, and we can do the training and see what happens.
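The XOR set-up can also be checked numerically: x1 and x2 alone carry no signal, but the single interaction feature x1*x2 separates the classes. A sketch with our own simulated XOR-style data and plain gradient descent, not the playground's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR-style data: positive class where x1 and x2 have the same sign.
rng = np.random.default_rng(0)
raw = rng.uniform(-5, 5, size=(400, 2))
y = (raw[:, 0] * raw[:, 1] > 0).astype(int)

# Features: x1, x2, and the interaction term x1*x2.
F = np.column_stack([raw[:, 0], raw[:, 1], raw[:, 0] * raw[:, 1]])

w = np.zeros(3)
b = 0.0
for _ in range(200):                 # full-batch gradient descent, lr = 0.1
    p = sigmoid(F @ w + b)
    w -= 0.1 * F.T @ (p - y) / len(F)
    b -= 0.1 * np.mean(p - y)

acc = np.mean((sigmoid(F @ w + b) > 0.5) == y)
# The interaction weight w[2] dominates; w[0] and w[1] stay near zero.
```

Just as on screen, the interaction weight ends up large and positive while the weights on the raw features hover near zero, because by symmetry neither x1 nor x2 alone predicts the class.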
So, it quickly went down to a reasonably low error, and we can see that we have a complex decision boundary where the most important feature is the interaction feature; that is what helps us predict this particular pattern. All other features have very small weights, around 0. Even if we do not apply any regularization, this remains the most dominant feature. If we use L1 regularization, you can see that all the other features are driven to a weight of 0; only the most important feature, the interaction feature, gets a strong positive weight, and we are able to separate the two classes. So, this was a nice visual way of intuitively learning how machine learning algorithms perform under different data sets and different amounts of noise added to the data set. In this session, we looked at a linearly separable data set, non-linearly separable data sets, and the XOR data set, and applied classification techniques on them to classify points into the correct classes. We also studied how we can use interaction features and L1 and L2 regularization to control the model complexity. Hope you enjoyed learning this session with us. This brings us to the end of the machine learning refresher using the neural network playground. In the upcoming session, we are going to do a similar refresher for deep neural networks. We will start with a basic primer on deep neural networks, follow that up with some mathematical foundations of deep learning through coding, and also visualize some of the concepts of neural networks and their application to different data sets in the neural network playground. Till then, goodbye. Thank you.