Namaste. Welcome to the next module of our course. In this module, we will study the basic concepts of deep learning. AI is the bigger outer circle. Within AI, we have machine learning, and within machine learning, we have deep learning techniques. So, it is very important to understand this perspective: DL and ML are not two different things. DL, or deep learning, is a subset of machine learning, which in turn is a subset of AI techniques. So, within the realm of machine learning, we use deep learning to learn representations from the data automatically. We will see many such examples in our course going forward, where we use deep learning to learn representations of the data automatically. Normally, we rely on human experts to give us features, but with deep learning, we can learn the representation of the data without any human intervention. Deep learning techniques are not new; they have been known for quite some time. However, there has been tremendous progress in deep learning in the last 7 to 8 years, mainly propelled by the availability of three things. The first is data: large internet companies have collected massive datasets, which are propelling the progress in deep learning. The second is specialized hardware to train machine learning algorithms: hardware like GPUs and TPUs is used for rapid training of neural network models. And the third is algorithmic improvements, which help us train really deep models. What has deep learning achieved so far? Deep learning has already achieved human-level performance in computer vision tasks; it can recognize objects in images as well as a human being can. So, let us understand the concepts behind deep learning. As you study them, you will realize that these are simple concepts built on the basics of machine learning that we studied in the previous module. We studied a technique called logistic regression. In logistic regression, we have several features as input. Let us say there are m features as input. We draw a node corresponding to each of the features, so there are m nodes, and in addition we have a special unit called the bias unit, which is fixed at 1. Each of these inputs is fed to a particular unit which computes a linear combination of the inputs and the weights; there is a weight corresponding to each connection. So, this unit computes the linear combination z = b + w1 x1 + w2 x2 + ... + wm xm. After computing the linear combination, we have another unit: a sigmoid activation unit. We take this z and pass it through the logistic function, which returns a number between 0 and 1, and we interpret this output y as the probability of the given example belonging to the positive class. Let us try to understand this more clearly. What we pass here are the features of one single example. Let us say this is the i-th example, which we denote by x superscript (i).
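To make this concrete, here is a minimal sketch in Python with NumPy (my own illustration, not code from the course; the feature and weight values are made up) of a single logistic regression unit, that is, a linear combination of the features and weights followed by a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    # Linear combination z = b + w1*x1 + ... + wm*xm,
    # followed by the sigmoid activation
    z = b + np.dot(w, x)
    return sigmoid(z)

# Features of a single example (m = 3), with made-up weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

y = logistic_unit(x, w, b)
print(y)  # a number in (0, 1): probability of the positive class
```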
So, this is the i-th example. We pass the features of the i-th example through this, and if we know all the weights, what happens is we get a linear combination followed by a sigmoid activation, which gives us a number between 0 and 1 that is interpreted as the probability of the positive class. We are representing logistic regression here in the style of a neural network. It is important to note, since a lot of people get confused about these nodes, that there are exactly m + 1 nodes in the input, where m is the number of features. So, this is the input layer. The input layer is connected to a computational unit which has two components: the first component performs a linear combination, and this is followed by a nonlinear activation. In the case of logistic regression, we use the sigmoid as the nonlinear activation. So, this is the input layer of a neural network, and this is the output layer. In the output layer, in the case of logistic regression, there is a single unit which does a linear combination followed by a sigmoid activation. In the same way, we can also represent linear regression as a neural network. Let us look at that. I would encourage you to pause the video here for a couple of minutes and think about how you can represent linear regression in the neural network style. So, let us say again we have input data, where each example is represented with m features, and we have a bias unit. Then there is a unit that computes a linear combination. We connect all these inputs to this particular unit, and it computes a linear combination of the features and their coefficients or weights; there are weights on the connections. We pass this z through a linear activation function, so the value of z is essentially passed through unchanged, and we get y, which is a real number and is the output of the regression. So, this is a representation of linear regression in the neural network style: a neuron or unit that performs a linear combination followed by a linear activation function is what we use in the case of linear regression. So, this is our input layer and this is our output layer, and in the output layer we have used a linear activation. Now let us understand how to construct a deep neural network model with the simple unit that we studied in the case of logistic regression. What we will do is take this input layer and stack multiple units that perform a linear combination followed by an activation; we connect each input unit to these units in the second layer, and then, let us say, we have one more unit to which the outputs of these intermediate units are connected. So, these are all inputs x1, x2, all the way up to xm. I am not explicitly showing the bias; on each of the units there is an additional parameter, which is the bias parameter.
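As a rough sketch of this stacking (again my own NumPy illustration; the layer sizes and random weights are made up), here is a forward pass in which each hidden unit computes a linear combination of all the inputs plus a bias and applies an activation, and the hidden outputs feed one final output unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    # Hidden layer: each of the 4 units computes a linear
    # combination of all m inputs, followed by an activation
    h = sigmoid(W1 @ x + b1)        # shape (4,)
    # Output unit: linear combination of the 4 hidden outputs
    return sigmoid(w2 @ h + b2)     # a single probability

m = 3                                # number of input features
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, m))         # hidden-layer weights
b1 = np.zeros(4)                     # hidden-layer biases
w2 = rng.normal(size=4)              # output-unit weights
b2 = 0.0                             # output-unit bias

x = np.array([0.5, -1.2, 3.0])
print(forward(x, W1, b1, w2, b2))
```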
So, what happens is that in each unit, a linear combination of the input and the weight vector is computed, followed by a nonlinear activation, and there are a few nonlinear activations that we can use. One we have already seen: the sigmoid activation. In the case of the sigmoid activation, a real number is squashed into the range 0 to 1. Apart from sigmoid, there is another activation called ReLU, which made deep neural network training practical. What ReLU does is, for anything less than 0, that is, for any negative number, it returns 0, and positive numbers are returned as they are. So, ReLU(z) = z if z > 0, and 0 otherwise. The third activation function that can be used is the tanh activation function. The tanh activation function returns a number between -1 and +1; it squashes a real number into the range -1 to +1. Apart from these three activation functions, several variations of ReLU, such as the leaky ReLU, have been tried out in different deep learning algorithms. These nonlinear activations help us learn complex models through a neural network. If we use a linear activation here instead of these nonlinear activations, what happens is that we simply get a linear combination, and we are not able to get the complex model that we might need in certain cases. So, the first layer is called the input layer, this particular layer is called the hidden layer, and the final layer is called the output layer. If you are solving a binary classification problem, we generally use a sigmoid activation function in the output. If you are solving a regression problem, then we use a linear activation function in the output. If you are trying to solve a multiclass classification problem, there will be multiple units in the output layer, and we will use either sigmoid or softmax as the activation function. So, each unit has two parts: one is the linear combination, and the second is the nonlinear activation. Each unit operates that way; the activation function changes from problem to problem, and it is really at the discretion of the designer to select the activation function. So, these are the basic building blocks of a neural network: there is an input layer, there can be one or multiple hidden layers, and there is an output layer. In this particular case, we have a neural network with one hidden layer; the input layer has m inputs, there are 4 units in the hidden layer, and there is one unit in the output layer. We can also have multiple hidden layers, and if we include multiple hidden layers, naturally the number of parameters will increase. How many parameters are there in this network? The number of parameters is equal to the number of edges or connections that we see in the network, plus the biases. So, it is very easy in the case of a neural network to get complex models: all we have to do is increase the number of layers, or increase the number of units in a single layer. It is generally advisable to add more layers rather than to increase the number of units in a single layer.
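Here is a small NumPy sketch (my own illustration, not from the course) of the three activation functions, together with the parameter count for the network just described; with m inputs, 4 hidden units, and 1 output unit, every connection carries a weight and every unit has a bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for z < 0, z otherwise

def tanh(z):
    return np.tanh(z)                  # squashes z into (-1, 1)

# Parameter count for an m-input, 4-hidden-unit, 1-output network:
# one weight per connection, plus one bias per unit.
m = 5
n_params = (m * 4 + 4) + (4 * 1 + 1)
print(n_params)                        # 29 for m = 5
```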
Notice that a neural network learns complex functions by breaking them down into simple functions. Let us see how it does that. Let us take a toy neural network with 2 inputs, one hidden layer with 2 units, and one output layer, and let us say we are trying to solve a classification problem. Let us call the outputs of the two hidden units z1 and z2. Then y = sigmoid(w1[2] z1 + w2[2] z2 + b[2]), where the superscript [2] indicates that these weights and bias belong to layer 2. This is how we calculate y: we take the inputs z1 and z2, multiply them by the corresponding weights, add the bias term, and apply the sigmoid. Now, z1 itself is calculated as follows. Let us say we use ReLU as the activation function in the hidden layer and sigmoid in the output layer. Then z1 = ReLU(b1[1] + w11[1] x1 + w12[1] x2), and z2 = ReLU(b2[1] + w21[1] x1 + w22[1] x2). So, you can see that this complex function is broken down into simpler functions. This is the power of a neural network: it essentially learns a very complex function by breaking it down into simpler functions and then combining the outputs from those simpler functions to gradually learn more and more complex functions. Now that you know about the basic building block of a neural network, it is important to understand how we decide the number of hidden layers, how many units we have in each hidden layer, and what kind of activation functions we should be using. These are part of the configuration of the neural network architecture: the number of units and the number of hidden layers are specified as configurable parameters of the architecture. In this course we will be studying some of the popular architectures for solving problems in image recognition as well as text generation, along with the feedforward neural network that we have seen so far in the course. So, the first architecture that we will study is the feedforward neural network. What are the machine learning components for a feedforward neural network? There are four components in any machine learning model; the first is the model. For a feedforward neural network, we have an architecture where each unit from the previous layer is connected to every unit in the next layer, and we use multiple hidden units and multiple hidden layers. So, this is a classic feedforward neural network. It has two hidden layers; in each hidden layer there are currently three units, there is one output layer, and in the input layer there are two features that we are passing, x1 and x2. What is the loss function? If we are solving a regression problem, we use mean squared error as the loss, and if we are using a feedforward neural network to solve a classification problem, then we use the cross-entropy loss. And what are the optimization algorithms that we use here? We can use gradient descent, but more popular algorithms for neural networks are RMSProp and Adam, and these algorithms extend some of the ideas that we learnt in stochastic gradient descent. They use the concept of momentum and help make sure that the neural network does not get stuck in a local minimum.
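As a quick aside, here is a minimal NumPy sketch (my own illustration) of the two losses just mentioned: mean squared error for regression, and the binary form of the cross-entropy loss for classification:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Regression loss: average squared difference
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Classification loss; eps guards against log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities
print(mean_squared_error(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```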
We will now look at how to visualize neural networks and their training, using the deep playground that is familiar to us. We will take the linearly separable data, and now we can define a neural network here. We have a control here to add a hidden layer. So, let us say we add a hidden layer, and here we have a control to add more neurons or units in the hidden layer. Let us add 4 units, use the sigmoid activation function because we have a classification problem here, and use no regularization to begin with. Let us use a learning rate of 0.03, and we can see that within a few iterations we get an almost perfect classifier, and you can see the weights on each of these connections. If we apply regularization, let us say L2 regularization, you can see that some of these weights become very small. Let us apply L1 regularization to drive the weights of parameters that are not useful to 0. You can see that within 89 iterations we reach quite a small loss. We can also add more layers to the neural network by simply pressing this plus button. So, we have more neurons now, and we can train again. For a linearly separable case this is really a complex model; we probably do not need such a complex model. Let us try to use a neural network to separate 2 classes that are not linearly separable. Let us add a hidden layer with 2 units and see whether we are able to separate the 2 classes. You can see that we are not able to separate them perfectly. So, let us add one more hidden layer with 2 units and retrain the model. It looks like the training loss is improving, but the validation loss is not, so we might be hitting an overfitting situation. Let us add regularization with a small regularization rate and retrain the model. The model is still not learning to separate the 2 classes. Let us try to get a separation between the 2 classes by adding 2 hidden layers, each with 3 units. Let us train it and see whether we are able to get a reasonable decision boundary. Yes, you can see that now we are able to separate the 2 classes with a complex decision boundary. The important thing to note here is that we were able to achieve the separation between the 2 classes without adding any hand-tuned features. Instead, we added hidden layers and units in each hidden layer, and those hidden layers helped us learn the non-linear decision boundary, as you can see over here. So, you can see that a neural network can generate features that help us separate classes that are not linearly separable. Earlier, when we were using a simple linear model, we had to hand-code the higher-order or interaction features between the original input features. That is why deep learning is also used for representation learning, where we take the raw features and learn interesting combinations of them to capture complex patterns. We can try one more example on this particular dataset. I would strongly encourage you to pause the video here and build your own neural network model to separate the classes in the XOR pattern. In order to solve this problem, let us add one more hidden layer with 2 units and try to train the model. We have a batch size of 10, so we are going to use a mini-batch approach to optimisation. Let us train the model with the ReLU activation function and a learning rate of 0.01. You can see that it has learned a classifier which separates the 2 classes to some extent, but there are some points which are still not separated. So, let us try to increase the complexity of the model further and see whether we can separate them. That is quite interesting. You can see that the classes are separated precisely here.
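By the way, if you want to reproduce this kind of experiment outside the playground, here is a rough sketch using scikit-learn's MLPClassifier. This is my own tool choice, not something used in the course, and the dataset, layer sizes, and settings are made up to mirror the demo:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR-style data: positive when x1 and x2 have the same sign
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Two hidden layers with ReLU activations, mini-batches of 10,
# and the Adam optimizer with learning rate 0.01
clf = MLPClassifier(hidden_layer_sizes=(4, 4),
                    activation='relu',
                    solver='adam',
                    batch_size=10,
                    learning_rate_init=0.01,
                    max_iter=500,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the XOR pattern
```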
And if you look at the individual neurons, in the first layer the neurons are only learning linear separators. In the second layer the neurons are learning separators for regions, and in the third layer they are learning combinations of the separators learnt in the second layer, and that is how they learn to separate classes whose points are arranged in the XOR fashion. So, this again reinforces the fact that we did not hand-code any of the features, and just with the help of a neural network we were able to classify points that have a very complex decision boundary. I hope this gives you some sense of how these hidden layers, or the neural network architecture, help us transform the simple features into feature crosses or complex features and learn a complex decision boundary or complex model. In other words, what is happening is that the neural network is learning a complex function by composing the simpler functions that it learns from layer to layer. So, this is the beauty of neural networks.