Namaste, welcome to the next session of Practical Machine Learning with TensorFlow 2.0. In the last session, we studied how traditional programs differ from machine learning systems, or in other words, how machine learning algorithms are different from traditional programs. We also looked at data as an important prerequisite for machine learning and learned various characteristics of data such as features, labels and different types of features. Depending on the type of the label, we get different machine learning algorithms, and we also studied those types in the previous session. Now that we have data and we know what the machine learning process is, let us get into the details of that process in this session. So let us note down the machine learning process for reference, starting with data pre-processing. Here we take the data that is given to us; the data comes from different measuring instruments or through different business processes, and since there are a lot of stakeholders or a lot of pipelines generating the data, some of these pipelines might be erroneous, which leads to data quality issues. So in data pre-processing, we try to remove inconsistencies in the data and look for outliers, which might occur due to an erroneous process, and remove those outliers. Maintaining data quality and feeding high quality data to the training process is key to building successful machine learning models. So let us try to understand some of the common steps in data pre-processing. Often we have different features in the data and these features are on different scales. For example, in the case of housing price prediction, we had the area of the house in square feet, which will be in the hundreds, but the number of bedrooms in the house will be less than 10.
So you can see that when we try to train machine learning algorithms with features that are on different scales, it causes some sub-optimality in the training process. So the first thing that we do is normalize the features and bring them to the same scale. Normalization helps us get better convergence during training. So how exactly do we normalize the data? We will look at one technique for normalization, called Z-score normalization. For a particular feature, let us say feature x_j, we compute the mean, let us call it mu_j, and the standard deviation, sigma_j. Then, to calculate the normalized value for a new value, we subtract the mean from the value and divide by the standard deviation. This formula gives us the distance of the point from the mean in terms of standard deviations. This is called Z-score normalization, and it brings the feature x_j approximately into the range -3 to +3. This is one possible way of normalizing the data. Another way is to find the minimum and maximum values of the feature, let us call them min_j and max_j: max_j is the maximum value of the feature and min_j is the minimum value. We subtract the minimum value from the current value and divide by the range, which is max_j minus min_j. This brings all the features into the range 0 to 1. These two techniques are used quite often for normalization. Apart from normalization, sometimes we apply a log transformation to the features to bring them into a new domain, or we can also apply a square root transformation. All these transformations, along with normalization, constitute what is called data preprocessing in the domain of machine learning.
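The two normalization schemes described above can be sketched in a few lines of NumPy; the feature values here are made-up numbers purely for illustration:

```python
import numpy as np

# Made-up feature column (e.g. house areas in square feet), for illustration only
x = np.array([850.0, 1200.0, 950.0, 2100.0, 1500.0])

# Z-score normalization: (x - mean) / std, roughly in the range -3 to +3
z_scored = (x - x.mean()) / x.std()

# Min-max normalization: (x - min) / (max - min), always in the range 0 to 1
min_max = (x - x.min()) / (x.max() - x.min())

print(z_scored.mean())               # ~0 after z-scoring
print(min_max.min(), min_max.max())  # 0.0 and 1.0
```

Note that after z-scoring, the feature has mean 0 and standard deviation 1, which is exactly what brings differently scaled features onto a common footing.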
After preprocessing the data, we also encourage you to visualize and explore it, and try to understand the relationship between the features and the labels. That understanding of the data is crucial in building machine learning models, because once you have this kind of understanding, you can think intuitively about what kind of features will have an impact on the label. So after data preprocessing, we have the data ready for our next step. Once the data is ready, what is the next task in machine learning? The next task is to build a model. What is a model? A model is nothing but a mapping from the features to the labels. So the second step is the model building step. Who specifies these models? You can see that when we were talking initially about the difference between machine learning and traditional programming, we said that a machine learning model essentially maps input to output. So it is some kind of mapping, or a function in other words, and you can see that there are infinitely many function classes that are possible. But we take some leap of faith and assume some kind of function class. When we are selecting this function class, we take into account domain knowledge as well as knowledge gathered while practicing machine learning model building. We generally begin with simpler models. The simplest model is a linear regression model, which has the following form: y = b + w1 x1 + w2 x2 + ... + wm xm. You can see that in this model, b, w1, w2, all the way up to wm are the parameters, y is the label, and x1, x2, up to xm are the features. So in a setting where we have m features, we come up with a very simple mapping between the features and the label, and we hypothesize that there is a linear relation. Geometrically, this represents the equation of a hyperplane.
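As a minimal sketch, the linear model y = b + w1 x1 + ... + wm xm is just a dot product of the weights with the features, plus a bias. The weights and features below are arbitrary placeholder numbers, not learned values:

```python
import numpy as np

def linear_model(x, w, b):
    # y = b + w1*x1 + w2*x2 + ... + wm*xm, i.e. a dot product plus a bias
    return b + np.dot(w, x)

# Hypothetical example: two features (area in 1000s of sq ft, number of bedrooms)
w = np.array([50.0, 10.0])  # arbitrary weights; these would normally be learned
b = 20.0                    # arbitrary bias
x = np.array([1.2, 3.0])    # one data point
print(linear_model(x, w, b))  # 20 + 50*1.2 + 10*3 = 110.0
```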
Let us start to understand this with respect to a single feature, just to build intuition about linear regression. Assume that we have just a single feature; then we have the very simple model y = b + w1 x1. This is a familiar equation to each one of you: it is very similar to the equation that we studied in high school, y = mx + c. Do you remember this equation? It is the equation of a line, and in this case there are two parameters, b and w1, that we want to learn. b is called the bias, but it is really the y-intercept, and w1 is the slope of the line. So let us try to represent this line geometrically. Let us say these are some of the points, with x1 as the feature and y as the corresponding label. This is one such line; it crosses the y-axis at some negative number and has some slope. So this is an example of the simplest model that we can use to map the input feature x1 to the variable y. We can also come up with more complex models. Let us say we have data which is distributed like this, with feature x1 on one axis and y on the other. Obviously, a straight line is probably not a good choice here; the data is better represented by a slightly higher-order model. Here we can use some kind of polynomial model, where we say y = b + w1 x1 + w2 x1^2 + w3 x1^3. We raise the power of the input to the second order and the third order, and this is a polynomial regression model that we are using to map the input x1 to the output y. This approach works very well when the output y is a real number, because by solving this equation we will get a real number.
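Polynomial regression is still linear in the parameters; we simply feed the model powers of x1 as extra features. A small sketch of the cubic model above, with arbitrary coefficients chosen only to show the shape of the computation:

```python
def poly_model(x1, b, w1, w2, w3):
    # y = b + w1*x1 + w2*x1^2 + w3*x1^3
    return b + w1 * x1 + w2 * x1**2 + w3 * x1**3

# Arbitrary, untrained coefficients just for illustration
print(poly_model(2.0, b=1.0, w1=0.5, w2=-0.25, w3=0.1))
```

Note that the parameters b, w1, w2, w3 still enter linearly; only the input has been raised to higher powers.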
What kind of models can we use for a classification task, where y is a discrete quantity, as in the case of handwritten digit recognition? The answer is logistic regression. Instead of predicting a real number, we are interested in predicting some kind of discrete quantity. Let us represent the real number that we got out of linear regression by z: z = b + w1 x1 + w2 x2 + ... + wm xm. Given m features, we multiply each parameter with the corresponding feature value and then add all these terms together; this is called a linear combination. From the linear combination of features and parameter values we get z, and our job is to take this real number and convert it into a discrete quantity. Let us say you want to predict whether a particular house will be sold or not. Here the label we are interested in is whether the house will be sold, which will be represented by the number 1, and if the house is not sold, we call it label 0. We have all the features of the house, and based on those we want to predict whether the house will be sold or not. So what we can do is create a linear combination of the features and the parameters to get this intermediate representation z, which is a real number, and we want to convert this real number into a quantity between 0 and 1. We have a special function called the logistic function that comes to our rescue here. Picture z going from minus infinity to plus infinity on the horizontal axis, and the y that we are trying to predict, which goes between 0 and 1, on the vertical axis. This function takes a real number and squashes it between 0 and 1. It crosses the y-axis exactly at 0.5, and this S-shaped function is called the sigmoid function.
The sigmoid function is represented by sigma(z) = 1 / (1 + e^(-z)). At z = 0, the sigmoid function has the value exactly 0.5. As we move away from 0 in the positive direction, the sigmoid value tends to 1, and as we move to the left, in the negative direction of z, the sigmoid value tends to 0. So if we apply the sigmoid function, we get a number between 0 and 1, which we can conveniently interpret as a probability. Logistic regression is nothing but a linear combination followed by a sigmoid: we say that the probability of y_i = 1 given x_i is 1 / (1 + e^(-(b + w1 x1 + ... + wm xm))). So logistic regression predicts the probability that a data item will take label 1, given its features. Now you may be wondering what kind of decision boundary we get, because we are essentially trying to divide the space into a positive class and a negative class. So let us see what kind of decision boundary logistic regression gives us. Let us say there are two different kinds of classes, crosses and circles; a straight line could be one of the potential decision boundaries between the two classes. Logistic regression is a linear classifier, because we get a linear classification boundary in the most basic case. You must be wondering what happens if the two classes are not linearly separable. So let us take a situation where we have the crosses inside and the circles all around, in a space with features x1 and x2. Here what we can do is use polynomial features to get a decision boundary like a circle.
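Putting the pieces together, logistic regression is a linear combination followed by the sigmoid. In this sketch the weights are arbitrary placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1); sigmoid(0) == 0.5
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # P(y = 1 | x) = sigmoid(b + w1*x1 + ... + wm*xm)
    return sigmoid(b + np.dot(w, x))

# Hypothetical house features with arbitrary (untrained) parameters
x = np.array([1.2, 3.0])
w = np.array([0.8, -0.5])
b = 0.1
p = predict_proba(x, w, b)
print(p)  # a probability strictly between 0 and 1
```

A common convention is to predict label 1 when this probability exceeds 0.5, which happens exactly when the linear combination z is positive.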
So instead of just using features x1 and x2, we also introduce features like x1^2, x2^2, and x1*x2, which is an interaction feature made up of the two individual features. From two features we get these five features, and we can see if we can find a decision boundary that separates the two classes. So we saw that in order to separate these non-linearly separable classes, where the crosses and circles are separated by a circular boundary, we took the original features x1 and x2 and constructed more features by calculating feature crosses. The feature crosses gave us three more features: x1^2, x2^2 and x1*x2. Using all five features, we are able to construct such a decision boundary. However, if we have a very large number of input features, constructing these kinds of feature crosses by hand can be very, very expensive. So let us look at a technique called a neural network, specifically a feed-forward neural network, that will free us from the task of generating these feature crosses by hand; the neural network does the feature crossing automatically. Let us see how a neural network does it. We are going to study neural networks in far more detail in the next session, but let us try to understand them from a basic perspective. We can think of a neural network as a mechanism to construct complex functions from simpler ones: we are essentially going to construct a complex function by composing simpler functions. Let us try to understand that in more detail. In the context of this particular example, we have two features, x1 and x2, which are the inputs to the neural network. Then we can build a neural network, or really a toy neural network, to begin with.
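The feature-cross construction described above, going from (x1, x2) to five features, can be sketched as:

```python
def cross_features(x1, x2):
    # Expand two raw features into five: the originals, their squares,
    # and the interaction term x1*x2, which enables circular boundaries
    return [x1, x2, x1**2, x2**2, x1 * x2]

print(cross_features(2.0, 3.0))  # [2.0, 3.0, 4.0, 9.0, 6.0]
```

With many raw features, the number of such crosses grows quadratically (and worse for higher orders), which is exactly why doing this by hand becomes expensive.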
We can have a first hidden layer with four units, maybe another layer with three hidden units, and then an output layer with a single unit. This output layer will output one of the two classes: it outputs the probability of class one, the positive class. What happens is that we send x1 to all units in the next layer. This is called a dense architecture, where each unit gets its input from all the units in the previous layer. In the same manner, we send each input to every node in the next layer, and each unit of the second layer sends its output to every unit in the layer after it. This is the direction of the arrows; for the sake of simplicity, I am just drawing one big arrow here. Now let us get the terminology right: this first layer is called the input layer of the neural network, these two layers are called hidden layers, and this is the output layer. We also have a bias unit feeding each of the nodes here. So you can see that from just two features, x1 and x2, we have constructed a complex representation containing far more features. How many features are there in all in this neural network? We have two features to begin with, we take these two features and make four more features out of them, and then there are three more features in the next layer. The effect of this is that we get a model with far more capacity, or in other words, a model with a very large number of parameters. How many parameters are in this model? The number of parameters is equal to the number of connections: there is a one-to-one relationship between parameters and connections, and each connection represents one parameter. Each unit here has got two parts to it. So let us concentrate on this particular unit: it has two inputs, x1 and x2, and it also has an additional input called the bias unit.
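For the toy network sketched above (2 inputs, hidden layers of 4 and 3 units, and 1 output unit), we can count the connections and parameters directly. This is a small sketch assuming one bias parameter per non-input unit:

```python
def count_parameters(layer_sizes):
    # Each connection between consecutive layers is one weight parameter,
    # and each non-input unit also carries one bias parameter
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

# 2 input features -> 4 hidden units -> 3 hidden units -> 1 output unit
print(count_parameters([2, 4, 3, 1]))  # 23 weights + 8 biases = 31
```

So even this toy network already has 31 learnable parameters, which illustrates how capacity grows with the number of connections.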
What this particular unit does is, it first computes a linear combination, z = b + w1 x1 + w2 x2, and this is passed through some kind of non-linear activation. It is a common practice to use ReLU as the activation function here. Let us see how the ReLU function looks: say the x-axis is the value of the linear combination z and the y-axis is the non-linear activation, ReLU(z). For any value of z less than 0, ReLU(z) gives us 0, and for any positive value, that positive value is returned. So ReLU(z) = z if z > 0, and 0 otherwise. In the output layer, since we are interested in two classes, a binary output, we use a sigmoid activation instead of ReLU. So there are two steps: a linear combination followed by a non-linear activation. We will see in more detail why we use non-linear activations in the next session. Just to complete the point, it is important to use non-linear activations here: non-linear activations help us build models corresponding to non-linear surfaces. For example, if you want to build a model corresponding to the circular boundary, non-linear activations make that possible. So a neural network is one such model that, as I said earlier, helps us break down a complex function into simpler functions, and we combine the outputs of the simpler functions to get far more powerful models in terms of capacity, or number of parameters. In the next session, we will study how to train these models. Hope you enjoyed this session. See you in the next session, namaste.
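The two-step computation of a single unit, a linear combination followed by a non-linear activation, can be sketched as follows, with arbitrary, untrained weights:

```python
import numpy as np

def relu(z):
    # ReLU(z) = z if z > 0, else 0
    return np.maximum(0.0, z)

def unit_forward(x, w, b, activation):
    # Step 1: linear combination z = b + w . x
    # Step 2: non-linear activation applied to z
    return activation(b + np.dot(w, x))

x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])  # arbitrary, untrained weights
b = 0.25
print(unit_forward(x, w, b, relu))  # ReLU(0.25 + 0.5 - 2.0) = ReLU(-1.25) = 0.0
```

A hidden unit would use ReLU as the activation here, while the output unit of the binary classifier would pass the same kind of linear combination through the sigmoid instead.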