Our second lesson. In this lesson we will build our first neural network, a small one. Let us first recap what we did in the first lesson for logistic regression. We had our input X, the data set, and in the data set, you remember, we had features. For the classification of a picture, for example, we used the pixels as features, but for other objects you can use something else as a feature. That is your interpretation; it comes from the domain of application. Features can be different; they are related to the application. We have our weights w and the bias b, and we had the function z = wᵀx + b; these are of course all vectors. Then we used the sigmoid function as the activation, to make a decision between zero and one. The output was a zero or a one, and based on that we get the prediction. Then we computed the loss function between our prediction and the label, that is, the difference, based on the logistic loss of course, and summing over all examples we got the cost function. Based on the cost function we calculate the gradients dW and db, the partial derivatives of the cost function: ∂J/∂W and ∂J/∂b. And last but not least, after all these calculations, we get the new weights: W becomes W − α·dW, and b becomes b − α·db. With these new weights and this new bias, we repeat the same procedure. If you do this many times, on a really good training set, you get a well-trained logistic regression. Now we will do the same with a network. That is just what we discussed. The small difference is that before we had one and only one node. If we are working with a neural network, then we have at least three layers.
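Before moving to the network, here is a compact sketch of the logistic-regression loop just recapped: forward pass, gradients dW and db, and the update. The data, sizes, learning rate, and iteration count are hypothetical toy values, not anything from the lecture.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3, 4)        # hypothetical data: 3 features, m = 4 examples (columns)
Y = np.array([[0., 1., 1., 0.]]) # hypothetical labels
w = np.zeros((3, 1))             # logistic regression may start from zeros
b = 0.0
alpha = 0.1                      # learning rate (hypothetical)
m = X.shape[1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    A = sigmoid(w.T @ X + b)     # forward pass: z = w^T x + b, a = sigma(z)
    dZ = A - Y                   # derivative of the logistic loss w.r.t. z
    dw = (X @ dZ.T) / m          # dJ/dw, averaged over the m examples
    db = dZ.sum() / m            # dJ/db
    w -= alpha * dw              # gradient-descent update: w = w - alpha * dw
    b -= alpha * db              # b = b - alpha * db
```

With enough iterations on a good training set, the predictions move toward the labels.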
We have the input layer, we have the hidden layer, and we have the output layer. The connection scheme which we used for logistic regression is now applied to every one of these nodes. You see, every input is connected to every node, and every connection has its own weight. With three inputs, each node has three weight coefficients plus the bias, which means four parameters. With four nodes we already get 3 × 4 = 12 weights for the first layer, and then plus four, because the biases are counted separately; the bias you should always think of as something special, kept apart. This means we already have to train 16 parameters in this very simple example. Again, it is not a problem to train 16 parameters, but this is only to give you an understanding of how the number of parameters grows. So for the training we now have to calculate not just one z: before, we had only one z to calculate, applying the sigmoid function to this input. Now we have at least three z's, three times to apply this. Of course, we can do this using not a vector anymore but a matrix; how, we will discuss later. And we get an output. Here we are again writing a sigmoid function, but we will discuss whether it may be necessary to replace the sigmoid function with another activation function; why, I will discuss later. Then we feed this output forward: we have a second set of weights plus a bias, we get a further output, we apply the sigmoid function to it, and we get our prediction. Then we check the prediction against the label. If it fits, we are happy; if not, we have to calculate all the gradients, so we go backwards.
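As an aside, the parameter count from this example (3 inputs, 4 hidden units: 12 weights plus 4 biases = 16) generalizes to any stack of fully connected layers. A small hypothetical helper:

```python
def count_parameters(layer_sizes):
    """Weights plus biases for fully connected layers; input layer listed first."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([3, 4]))     # 3*4 weights + 4 biases = 16
print(count_parameters([3, 4, 1]))  # adds 4*1 + 1 = 5 more, so 21
```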
So we calculate our dA, we calculate our dZ, and last but not least we calculate our dW and db. Then we apply the same update formulas and change the parameters, and run this again. This runs in a loop.

First, about the representation of a neural network. As I said already, a neural network has this input layer, at least one hidden layer, and an output layer. If we have binary classification, then we will have here only one neuron. If we have, let's say, a classification which is used for example for autonomous driving, then of course the number of classes is much bigger. We should say: this is a traffic sign, this is another car, this is a bicyclist, this is a person walking, this is a tree (don't drive into the tree), this is the car in front of you, the car beside you, the car behind you. All these objects should be detected, and of course also their locations. This means in that case we no longer have only a binary classification; we have a multi-class classification. The algorithm changes a little, but not really; the idea behind it is always the same, only the implementation is much more difficult. But if we have binary classification, the activation function at the output is always a sigmoid function: of course, we want one or zero. But at these hidden neurons or units we will not use the sigmoid function; we will use another function. And if we have multi-class classification, then at the output we will not use the sigmoid function; we will use a softmax function. This you can learn later; it is just for your understanding. Of course, it is not necessary to use only one neuron at the output, but it is simpler to start with this simplification. So, as you remember, the calculation was: the input multiplied with the weights, plus the bias, and this was the argument of the activation function.
And the activation function decides whether this neuron fires or not; this is the reason it is called an activation function. If it is zero, the neuron is not firing; if it is one, it is firing. I will not discuss now how similar this is to the working of our brain, but at least some similarity to biological neurons is there, and this is also why it is called an activation function. So in the case of a neural network the picture is now bigger. We apply this to every neuron in the hidden layer. This means we calculate this for each set of weights and get one output per neuron. And you see that what we now need is not only an index for which neuron we are working on; we should also say which layer we are working on. This is layer zero, this is layer one, and this will be layer two; we always start the numbering with zero. And you understand: if this becomes a deeper network, we just get additional layers. It is quite easy to add layers. The same then works for the other neurons: the second, the third, and so on. So if I put this together, you can see what is written here. These are vectors. But why multiply these as vectors? I can put these vectors as rows, or as columns, in some matrix, and then transpose that matrix. So we have two matrices: a matrix X where I have my data (we said we have m examples), and a matrix W where I have my weights. This is the matrix W for layer one, and here I have the first weight vector, the second weight vector of this layer, and so on. And to multiply them, I just transpose, and then I can multiply. And then, in one step, I calculate my z's for the first layer.
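This one-step calculation for all m examples can be sketched like this; the layer sizes here are hypothetical, and the bias is added by broadcasting, as discussed next.

```python
import numpy as np

np.random.seed(1)
n_x, n_h, m = 3, 4, 5            # hypothetical: features, hidden units, examples
X  = np.random.randn(n_x, m)     # one example per column
W1 = np.random.randn(n_h, n_x)   # row i holds the weights of hidden unit i
b1 = np.zeros((n_h, 1))          # bias of layer 1

Z1 = W1 @ X + b1                 # all m z-vectors in one matrix product
print(Z1.shape)                  # (4, 5): one column of z's per example
```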
So I am adding a vector b, also from the first layer. This whole operation can be written in one line. Of course, from a mathematical point of view, you cannot really do this directly: the output here is a matrix, and here I have a vector. Matrix times matrix gives me a matrix, but adding a vector to it is not really possible. From a mathematical point of view, I would have to enlarge the dimension, multiplying the bias with a row vector of ones. But in Python this is quite easy: this is broadcasting of the operation. Python understands that it should add this number to every column of the output. So this is quite easy to write. It is the same as in Matlab; in Matlab you also have this broadcasting. So when we have this given input, we calculate Z, then we apply our activation function to it, and this is now the activation function for the whole first layer. Then we have this output, and the output is the input of the second layer. Of course, you should make sure that the dimensions of your multiplications always have the right shape. From a practical implementation point of view, this is one of the tricky things you should pay attention to; otherwise your algorithms either will not work or you will have problems with them. And last but not least, we get the output. In this case this will be a number, and this also will be a number, because we have only one prediction. But if you have multi-class classification, then of course you will get a vector of outputs. This is one of the important points for neural networks. The first neural networks were introduced somewhere in the 60s, and people started to work with them. Maybe some of you remember the name Perceptron; this was one of the first neural networks which were used.
As activation function they used a unit step function. Every engineer knows what a unit step function is, and I hope the people from computer science know it too. You see, this is this function; it is sometimes also called the Heaviside function. With this function they tried to do some applications, and then came Marvin Minsky, and in one paper he showed that the applications of this Perceptron are very, very limited, for theoretical reasons. Not because the people were stupid, but because of the theory behind it; and so the neural networks were forgotten for 20 years. Only a group around Geoffrey Hinton kept working on this and tried to improve it. They put in a lot of effort, they also introduced new activation functions, and then they reached a really good success. As I told you, it was people from that group who later achieved the breakthrough. And based on that, it worked.

So what is the problem? The problem is the following. If we have this kind of function as activation function, then it has a big problem starting somewhere here and somewhere here, because we have a saturation of the function. This means even if the arguments are very, very different, the output is more or less the same, so you cannot make a distinction anymore. And, as I will show later, you are then effectively replacing your nonlinear function with an identity function; and if you replace it with an identity function, your algorithm is not learning. And if it is not learning, why spend time on it? It is a waste of time. So what we do now is look for other activation functions. As I said, if we have a binary network, then at the output of course we will use the sigmoid function; it is good for a distinction between 1 and 0. But in the hidden layers we need functions which are more sensitive.
One of the possibilities is of course the tangens hyperbolicus (tanh) function. First of all, it is not so steep, and second, you have an output between minus 1 and plus 1. And what Hinton's group really introduced was the use of the rectified linear unit (ReLU) function, which is 0 in this area and a linear function in this area. Overall it is a nonlinear function, but very easy to implement, because it is just the maximum of 0 and the input. If the input is negative, then 0 is greater than the negative number, so the output is 0. If the input is a positive number, the maximum of 0 and a positive number is the positive number; and for input 0, the maximum of 0 and 0 is 0. And by the way, what is the derivative of this function? It is just a step function: the slope here is 1 and the slope here is 0. And we will need the derivative of the activation function in our calculations. So for the ReLU function we get as derivative just the unit step function. This is really easy. The problem of this function is the negative part, because the negative part cancels out everything. In this area you are learning; here you are learning nothing. So you should ensure during your calculations that the arguments of the activation functions do not become negative too often; otherwise you have too many units which are not learning. A way out could be the so-called leaky ReLU function, which you see here: in the negative part the slope is smaller, but nevertheless you have a small slope, so you also learn something in the negative part. How do you determine the slope on the left side of the leaky ReLU? By experience.
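A sketch of ReLU and leaky ReLU together with their derivatives; the negative slope 0.01 is only an assumed example value, chosen by experience as just said.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # max(0, z)

def relu_prime(z):
    return (z > 0).astype(float)         # slope 1 for positive z, 0 otherwise

def leaky_relu(z, slope=0.01):           # slope of the negative part: a design choice
    return np.where(z > 0, z, slope * z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                           # negative inputs are clipped to 0
print(leaky_relu(z))                     # negative inputs are only damped, so they still learn
```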
I will show you: for example, it could be 0.001; or minus, sorry, plus 0.001. "Is it a hyperparameter, do you define it?" No, it is not really a hyperparameter; you just implement it as part of the activation function you use. "Why don't we just use a linear function here? This one?" Yes. Because then you are learning nothing: you are replacing the input with the output. Here, at this place, we have some activation, and if the input equals the output, as it would if we used this linear function, you are not learning enough. "But in the case of positive numbers, the input is..." Yes, but the negatives are cancelled out; this is a linear function, and this is a nonlinear function.

So what we do is the following: at the hidden layer, where we had the sigmoid function, we replace it with another activation function. Of course there are a lot of activation functions, as I show you in this table. It is not necessary to read it now; I only want to show you that there are not only two or three activation functions. There are many, and you can look at such tables; you will find them in Wikipedia. The source for me was also Wikipedia, because nowhere else do you get such a nice overview: you also have the properties of these activation functions here, for example whether the function is monotonic, whether the derivative is monotonic, which kind of continuity you have, what the range is, and so on. We have the binary step function, used for the Perceptron. We have the logistic function, which we normally use for classification. Tangens hyperbolicus and also the arc tangent; both are used.
But in the case of the arc tangent, you have to be sure that your arguments stay roughly between minus pi over two and plus pi over two. I see you are good students, not like my students; they would ask why. They have already forgotten what the arc tangent is. And here you see other functions too. It is also possible that this part is not linear but has some special shape. But the good thing about that function, and also about the piecewise linear function, is that the derivative is very easy to calculate and very easy to implement. And if we think not only about the theory but also about how to make an algorithm that works quite fast, then of course we should take this into consideration.

So, going back. "But that one looks linear, even for..." "Which one?" "For negative numbers. That one." Yes, but I am thinking about the range from minus infinity to plus infinity. The parts are linear, yes; this is the idea. It is a piecewise function: the parts are linear, so the derivatives are quite easy, but the function overall is nonlinear. And I do not have the problem which I sometimes have with the ReLU functions. "But then the derivative is like a step; it switches." Yes, of course the derivative is something like that; this is the derivative of a leaky ReLU. That is true, but the factor is not very big. "What is the reason we decided to have something linear instead of 0?" Because there it was not learning anything. "Why did we switch from that one to the other one?" Because it was not learning anything: if you have too many negative inputs, then it is not really learning. In that case you should make sure that you do not have too many negative inputs.
And in the case of this function, the problem is the saturation. This means: if I take here as argument 1 million, the output is the same as for a much smaller argument; whereas in this case, for an input of 1 and an input of 1 million the outputs are very different. But this function avoids the problem of the identity function, because with the identity function, as I said, at least for classification, you are not learning. And you also have to keep in mind that in logistic regression we use a logarithm, and the argument of a logarithm should be positive. This we discussed already. So we are replacing the sigmoid function now with different kinds of activation functions. The activation functions on the different layers do not have to be the same, but the activation function for all units in one layer should be the same. So when you go to the first layer, you decide: normally the ReLU function is used, but if you decide that the tangens hyperbolicus fits better, or you like it more, then you can use it; but please use it for all units in the first layer. Then you go to the second layer, and in the second layer, for a deeper network, you can use another activation function. Again, think about which kind of activation function makes sense, and please keep in mind that changing the activation function means changing the implementation. But if we replace it, and this was your question, if we replace this with the linear function, then we have here z1, the activation is z1, and this goes in on this side. What do we learn? We learn nothing, because in that case, if we use this activation function, all we are doing is multiplying matrices. Have a look: this is the input, and as output we get the multiplication of these two matrices.
And we have two different parts, but it looks the same way. "Can I have one more question? You mentioned that it is not important to have binary in that layer, but it is important that we have a linear function..." In the hidden layer, the idea is to replace the sigmoid with one of these functions; we should not replace it with a linear function. At the output we keep the sigmoid function for a binary classification, and for a multi-class classification we use a softmax function, which gives something like a probability. If you have, let's say, 10 classes, you calculate something like the probability for every class at the output, and then you say: okay, I take this class, with this probability. How is this implemented? It is not really a probability, but very close.

So this is about the activation functions. The next discussion is about derivatives. As you see, we will need the slope of our activation functions; we will need it for the calculation of the new weights. How to calculate this is a quite easy mathematical exercise. If our sigmoid is σ(z) = 1 / (1 + e^(−z)), then σ′(z) = e^(−z) / (1 + e^(−z))². Everybody agrees? Or should I show it? Then we can write e^(−z) = (1 + e^(−z)) − 1, so σ′(z) = ((1 + e^(−z)) − 1) / (1 + e^(−z))². And we split this: we get 1 / (1 + e^(−z)) minus 1 / (1 + e^(−z))², and this is σ(z) − σ(z)², that is, σ(z)(1 − σ(z)). And this always holds. So we always know the derivative and we can plug it in. If I call the activation a, then the derivative is a(1 − a). So it is quite easy to use in an algorithm. This we use for the derivative of the sigmoid function.
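The identity σ′ = σ(1 − σ) is easy to check numerically against a central difference quotient; a small sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference quotient
a = sigmoid(z)
closed_form = a * (1 - a)                              # sigma'(z) = a(1 - a)
print(np.max(np.abs(numeric - closed_form)))           # tiny: the two forms agree
```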
And in the same way, if your new activation function is g(z) = tanh(z), then g′ is the derivative, and you can calculate that g′(z) = 1 − tanh²(z). Look in any formula collection, but you can also try it yourself: tanh(z) = sinh(z) / cosh(z) = (e^z − e^(−z)) / (e^z + e^(−z)), because sinh(z) = (e^z − e^(−z)) / 2 and cosh(z) = (e^z + e^(−z)) / 2, and the 2s cancel out. If you work it through a little, you get to the same result; I will not use the time here, you will find it in every book. So, if I call the activation a, then the derivative is 1 − a². Very easy to calculate.

So you see: for the activation functions we are not only looking for nonlinear functions, and not only for nonlinear functions which fulfill the requirement of giving a sensitive output; we are also looking for activation functions whose derivatives are easy to calculate, because we will need them in the calculation of the derivatives of our cost function. And the derivatives of the ReLU and the leaky ReLU we discussed already; sorry, there is something missing here. This is the unit step function in this case; and this is almost the unit step, but this side is a little bit higher. Normally we take here a value which is positive but very, very small, which means the slope we have here should be a very, very non-steep slope.

Now we come to the gradient. As parameters we now have the number of units in each layer: this is the dimension of my weight matrix for the first layer, this is the dimension of the bias vector, this is the dimension of the weight matrix of the second layer,
and this is the dimension of the bias of the second layer. And if you have a look: these are the same numbers, and these are the same numbers also. So you should check the shapes of your matrices; you can do it in this way. For this matrix we will use the transpose, and for this matrix we also use the transpose. Then we calculate our cost function based on all these matrices, and then we do gradient descent. The gradient descent is calculated as follows: I check what the derivative of my cost function is with respect to the components of the first layer and with respect to the first bias, and this I will use to improve these parameters. In the same way, I check this also for the components of the second layer. So we are always calculating the derivatives now.

The implementation would then be the following. From a formal point of view, I have my first output; then I apply the activation function of the first layer to it; this output is now the input here; I get an output here; this output is used as the input to the second layer, where we use this weight matrix and this bias; and last but not least we have an output here. Please keep in mind that in this case we use the sigmoid function at the output. If you now implement the backward propagation, we first check the difference at the output. These are our predicted values, these are the labels, and the difference between the predicted values and the real labels gives us the differences dZ in the second layer. We use these differences in the second layer and multiply them with this value, and we get the differences dW of the weights in the second layer.
Because this was the input, as you see, this is the input to this layer, and if you take the derivatives, then just A is left. And last but not least, the bias gradient of the second layer is just the sum over my dZ values, divided by m; the axis is just the programming question of in which dimension I take the sum, going over the columns or over the rows. And then, not in the same way but in a similar way, I calculate my dZ on the first layer from these dZ values: this is the multiplication of my weights from the second layer with those differences, multiplied with the derivative of my activation function. And this is multiplied element-wise; this is not a matrix multiplication, this is just an element-by-element multiplication. Based on that, if I have this value, then I can calculate the differences for the first-layer parameters and also the differences for the first-layer bias. And if I have all of these, then I can apply the updates and change the parameters: my W in the first layer, my bias in the first layer, my matrix W in the second layer, and also the bias in the second layer.

And this computation graph we now run 1,000 times, 10,000 times; it depends on you how long you want to train. The question of how often you repeat this is again a practical question: check the learning curve, which I showed you. If you still see differences, continue; if you have a saturation of the learning curve, just stop, it makes no sense to continue. This is in the case of supervised learning. Unsupervised learning? Yes, that is totally different. In the case of unsupervised learning we can apply different kinds of methods, but let me speak about clustering today.
In the case of clustering, it means we have some data and we want to find out which data are similar. But the question of how many clusters to set, this is your decision; it is not given by the data. Of course you can have a first look at the scatter plot and say: okay, it looks like there are three clouds, or four, or five. Then you can decide how you will calculate which data go to which cloud. For example, you can try to work with the centers of gravity, the mass centers: for every cloud you try to calculate the mass center, then you calculate the distances, and when the distances to two mass centers differ, you choose the minimum. Then you separate the data this way. Based on that, you can try to assign labels and then do supervised learning. But it is unsure whether you really have the right number of clouds; this is just a practical question, based on the problem, on practice, and of course on your experience. Because you can say: okay, it looks like three clouds, but I know there is a measurement error, and this measurement error is responsible for a certain cloud.

So, for example, you can switch from unsupervised to supervised: first you do unsupervised learning, and based on that you can try to do supervised learning when it makes sense. If you like doing this every day in the afternoon, after dinner, when you have nothing else to do, then of course you can do it for fun. But normally the implementation of this takes a lot of time, and the cleaning of the data, which I did not speak about, takes most of the time. I saw that you will have a summer school in data science, and I think the lecturers will spend a lot of time on data cleaning, and you will see how much effort you will need for that. Then you can try to apply these different kinds in order.
Another application is reducing the dimension of your objects or of your data. For example, if we look at the colors of a picture, then of course we do not only have RGB from 0 to 255 with 3 channels; we can have other values too. In that case it makes sense to reduce, and then see whether you can nevertheless work with the reduced picture. This is quite often done with principal component analysis, where you work with the singular values of the matrix which you get from the pixels. Again, whoever knows a little about numerics, and I am coming from mathematics, knows how much computational effort you need for a principal component analysis: it is essentially the calculation of singular values of matrices, and these are among the most computationally expensive operations you can do in mathematics at all. So it is always a question whether it makes sense or not. But nevertheless: if you have the reduction, then you do your operations, and then you go up again to the original dimension. But again, whether you need it or not depends on the real application.

Okay, but in the case of supervised learning with deep networks we will need to calculate the derivatives and improve the parameters with a learning rate. So we come to backpropagation, which is the most challenging part. You see, even the computer is so afraid of backpropagation that it does not want to show it; I am sorry for that. So, computing the gradients in the case of logistic regression, maybe you remember this. For the backward propagation we go this way. And last but not least, we want to know the derivative of this value with respect to the w's and with respect to b. This is what we want to learn. Of course we could calculate this in one step, but then you do not have a computation scheme which you can apply in a program like a subroutine. For this reason we use another approach.
First of all we say: okay, we calculate the derivative of our loss function with respect to a. This is quite easy: of course this is a log function, and you get −y/a, and minus times minus is plus, (1 − y)/(1 − a). So this we know. Then we calculate the derivative of the loss with respect to z, and for that we use a theorem from mathematics, the chain rule of taking derivatives: dL/dz = dL/da · da/dz. dL/da we already calculated, and da/dz is just the derivative of our activation function, which we saw and discussed; of course, a is nothing else than our activation function. And we discussed that the derivative of the sigmoid function has a special form, the derivative of the tangens hyperbolicus has a special form, and so on. So this is also predetermined by the activation function. Then it is quite easy to multiply these, but keep in mind we should multiply element-wise; we should not multiply these as vectors, and we should not multiply them as matrices.

Based on that, we can now calculate dz: of course this is this value times the derivative. So we have dZ, and on the output side we have our dZ, and then we can multiply and calculate dW and db. Of course db is just the same as dZ, and dW is dZ times x. Having in mind this computational scheme, which we got from logistic regression, we can try to apply it again to the neural network, because this is the computational chain for our neural network. So we start again here and calculate dA on the second layer; based on that we can also calculate dZ, which is the difference; and based on that we calculate for the second layer our dW and db and make our improvements later.
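Putting the forward pass and these backward formulas together, one iteration for the two-layer network might be sketched as follows; tanh is assumed in the hidden layer, sigmoid at the output, and the sizes and learning rate are hypothetical.

```python
import numpy as np

np.random.seed(2)
n_x, n_h, m = 3, 4, 8                        # hypothetical sizes
X = np.random.randn(n_x, m)                  # inputs, one example per column
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01
b2 = np.zeros((1, 1))
alpha = 0.5                                  # learning rate (hypothetical)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)                             # hidden activation
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                             # prediction

# Backward pass.
dZ2 = A2 - Y                                 # error at the output
dW2 = (dZ2 @ A1.T) / m                       # A1 was the input to layer 2
db2 = dZ2.sum(axis=1, keepdims=True) / m     # axis chooses the summing dimension
dZ1 = (W2.T @ dZ2) * (1 - A1**2)             # element-wise product with tanh'
dW1 = (dZ1 @ X.T) / m
db1 = dZ1.sum(axis=1, keepdims=True) / m

# Update: W_l -= alpha * dW_l, b_l -= alpha * db_l, for every layer l.
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```

In training, this whole block runs in a loop, thousands of times.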
Then we go back one layer: we calculate dA for this function (and here you can replace the sigmoid with another activation function), calculate dZ based on that, and multiply element-wise with this product. Now we have more or less everything: we know how to calculate our dA's, and we can compute dW and db in the second layer and dW and db in the first layer. These are the values we need for learning, because now I can do the vectorized calculation and apply it in the learning algorithm; the implementation, how to write this in Python, comes later. The update is: W of layer l becomes W of layer l minus alpha times dW of layer l, and b of layer l (the bias of layer l) becomes b of layer l minus alpha times db of layer l. [Question: how do we get from here to actually correcting our weights?] We measured how different we were from the label. The backpropagation gives us insight into how the output error is propagated back to the different weights and biases, and by this amount we should improve these weights. It is like the error propagation you have in measurement: you measure something, and then you see how the measurement error propagates through your calculations. So the new weights are the old weights minus the learning rate times the gradient, and the new bias is the old bias minus the learning rate times its gradient, always per layer l. The only layer we will not improve is the input layer, because the input you cannot improve; the input is as it is. Now we come to the question of initialization. In the case of logistic regression I told you
that we can initialize the weights with zeros. The question is: can we also initialize with zeros here? Let's see what happens. If my weight matrix is all zeros and the bias is zero, then (you remember) I multiply with my data and add zero, so I get zero here; and no matter which activation function you use, you keep getting zeros and you learn nothing. This is the reason why it makes sense to use another kind of initialization, and the easiest one is random numbers. So for these weights, and also for these weights, we initialize with random numbers. Sometimes, if you want the random numbers to be very small, you multiply them by some factor, for example 0.01: a normally distributed random number is of order one, which is still quite big, so you make the weights much smaller and hopefully the network learns better. The initial bias can be zero; it is not necessary, you could also choose a random number, but quite often zero is used. This is not the only approach to initialization. If you look in the literature, one method is called He initialization (He is a Chinese name, so there is already a Chinese influence here), and another one carries a name that sounds Spanish to me; maybe I will remember it later. In any case, you can look in the literature and find different initialization methods, because a good initialization shortens your computation time, or rather the computation effort, and if you know that time is money, then you are also spending money. But there is no algorithm that tells you: for this application use this initialization, for that application use that one. It is a question of experience, and a good start in any case is this kind of initialization. [Question: but isn't it possible to create some kind of algorithm for this?]
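The random initialization just described (small random weights scaled by 0.01, zero biases) might be sketched like this for a network with one hidden layer; the dictionary keys and the function name are my own convention, not the lecture's exact code:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=1):
    """Small random weights (factor 0.01), zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,  # hidden-layer weights
        "b1": np.zeros((n_h, 1)),                      # biases can start at zero
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,  # output-layer weights
        "b2": np.zeros((n_y, 1)),
    }
```

The random weights break the symmetry between the hidden units, which zero initialization cannot do; the 0.01 factor keeps the initial activations away from the saturated parts of sigmoid or tanh.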
No. There are so many different problems where you apply machine learning; on one data set you learn faster with this initialization, on another data set with that one. Okay, you have your topic for the master's thesis, or maybe for the PhD. I don't know of such an algorithm, but whether it is really possible should be discussed, because in that case you would also have to be able to classify the applications. For example, right now I am working on an application, an algorithm that tries to guess which features and sales pipelines our business customers can use, based on their company profile. It is the same kind of thing: you configure something, you somehow collect data, and maybe you can check the business cases and say, if I apply this to this business case, then it makes sense to work with that kind of algorithm, like a recommender system. If you work on recommender systems and you have a breakthrough, then you can get the same money as that team once got from Netflix: one million dollars. Netflix announced a competition (it is over now) to build a recommender system that improves their benefit by 10%, and one million dollars is not so little money, so a lot of groups were working on it. The funny thing is that the winning system was never implemented. Why?
It was good, the results were good, but the effort for the implementation was too high, and they decided to work with the runner-up. (I think I have to reconnect; one moment.) [Question: is there a table that tells you which method to use for which data?] You should check the literature, that is the only thing I can tell you, but at least I have not seen such a table in the literature, saying these are the use cases, these are the activation functions you should use, and these are the initializations. One thing I can tell you, though: if you work with the sigmoid function throughout the whole network, you will not get really good results, so in the hidden layers it is recommended to work with other activation functions. So, what we want to do: we have a data set like this, and we want a classifier that separates the red and the blue dots; you see the distribution is not a really nice one. We will do this with two different algorithms, logistic regression and then a simple neural network, and we will see whether we get an improvement. The shape of our matrix X is 2 by 400, which means we have 400 dots, and of course we also have 400 labels. First of all we use simple logistic regression; here I am a little bit lazy and use the library scikit-learn, which already provides logistic regression with cross-validation. Maybe I should explain, for people who don't know, what cross-validation is. If you have your data, then the first thing I told you was to split the data into a training set and a test set. But the problem in that case is the following: once you have trained and you want to test, you can use the test set only once. Of course you could use the same data again, reshuffling it randomly, or use other methods to pick the data out; but the point is always that if
you want to make an improvement, you always have to reorganize your data. A better way is to split your data into a training set, a cross-validation set, and a test set, for example 60%, 20%, and 20%; it is up to you to shift these proportions a little. What you do is this: you train your algorithm and check it on the cross-validation set; you are not satisfied, so you change the hyperparameters of your algorithm and check it again on the cross-validation set, as often as you want. Only at the very end, when you are satisfied, do you evaluate once on the test set. The advantage is that you avoid the bias problem: every time we use our test set a second time, after we have already improved our algorithm, the evaluation is biased, because we already used the results from the test set to improve the algorithm. I know that in practice this is done, but at least from a statistical point of view it is not really clean; thank you for your remark, in this case we are using cross-validation. [Question: what are the hyperparameters for simple logistic regression, and why do we do cross-validation in this case?] In this case I think the split is 60/20/20, but it happens automatically; you just plug it into the library. It is like working with TensorFlow: if you call a TensorFlow command, you don't see the parameters behind it, and it is the same with this library. You should look directly into the library, but I think it is the splitting I showed you, 60/20/20; you will also find 80/10/10, which is also possible. The important thing is to show that with logistic regression we do not get a really good result for this classification: you see the accuracy is about 47 percent, and 47 percent, even for machine learning, is very, very bad. So now we try a neural network. Here you have the input, we have only one hidden layer with four units, and the output is only blue or red,
and as the activation function we use the tangent hyperbolic here in the hidden layer, and at the output we use the sigmoid function. The prediction is based on this output: if the output is greater than 0.5 we set it to one (and we have to decide whether one means red or blue), and otherwise to zero. The cost function is the same one you already know. The next thing we implement is the layer sizes, and the layer sizes are taken from our matrices: from X, where the number of rows is the number of features, and from Y, the number of labels. [Question: how do we actually decide the number of layers?] Here we are speaking about a shallow network with only one hidden layer; deeper networks we will discuss on Wednesday, and of course there are some special things to keep in mind there. [Question: and how many nodes?] In this case we choose only four. There is no rule for how many nodes; I would say use a simple modeling rule, KISS, keep it simple, stupid. Here I use four nodes, but as you will see later, we will change the number of nodes and see whether we get better results. The point is the following: it is not true that increasing the number of nodes always increases your accuracy. There will be some saturation, and then maybe the accuracy even goes down. So you check it; adding nodes is not a problem, and you can try four nodes, five nodes, ten nodes, fifteen nodes, like a cross-validation. So again, in this test case the size of the input layer is five, the size of the hidden layer is four, and the size of the output layer is two, and this is what was expected. Now to the model parameters: we want to
initialize the parameters for our weights and also for our biases in the first and in the second layer, and you see I multiply by a small factor. You can use an even smaller factor, or you can skip the factor, but then the initial numbers are quite large, because random numbers between zero and one are of order one, and we want values close to zero. Then we also take the shapes, and we build something like a dictionary: in the dictionary we have different keys, and these keys have values. First we store the initialized values, and later we will renew the values after every calculation. Just so that you see it: these are random numbers, at most the second decimal place is different from zero, of course we also have negative ones, and the initial bias is zero. Now we do the forward propagation, and as you can see, the forward propagation just uses the math we discussed on the slides; nothing special. The only important thing, maybe, is that we use the dot function, the dot operation, from the NumPy library, but this is just because we want to work with matrices. And this is just a test case: we run the forward propagation on a test case and look at the output, and for the output we just read off the means; if the means are more or less in the expected range, we are confident that the algorithm is working properly. The next step is computing the cost. I should implement the cost function, but again, the cost function just follows the way we defined it, only applying NumPy operations; and that is only because we have matrices, and if you applied ordinary scalar operations you would get an error as output. So now we come to the challenging part; the challenging part is the
backward propagation. In every implementation, the backward propagation is the part where you spend most of the time. What are we doing? You remember, for the backward propagation we need our parameters and the cache, and then we compute our increments dZ2, dW2, db2, dZ1, dW1, and db1. To calculate them we just follow the formulas we derived: you remember, the output error was this difference, and we multiply, as matrices, this part with the output from the layer, and so on and so on. It is just following the formulas, and here we put the results into the dictionary grads. So we now have a dictionary: the inputs of the function are the parameters and the cache, we always take the new values, and the new values are updated in the cache. Let's make a test; more or less fine. Now we can try to train. No, we are not finished yet: now comes the question of the step size. This is just a demonstration. If you take a small step, then with gradient descent you reach the minimum; in the other case the step is too big, and it just jumps around, so you will never reach the minimum. And even with more time it would not converge, because (this can be shown mathematically) the search direction, the negative gradient, is always perpendicular to the contour lines; if the step is too big you overshoot, the gradient points back, and you just keep jumping back and forth in these directions. This is just the function for updating the parameters; I will not show it in detail because I want to come to an end. This is my example from Coursera; I can give you this example later in the files I gave you. And now we have the model: we take the parameters, we initialize them, and then we run this in a loop. We have the forward propagation, we have the cost, we have the backward propagation, we have the parameter update, and it
is just up to us to define how often we go through this loop to do the training. So now we come to the predictions. As you see, it looks a little bit better; of course we still have errors, but in this region we mostly have red points and in that region blue points. This classification is already better. [Question: I guess it will get better if we repeat it many times? The iteration count was 10,000; can't we improve any more?] Of course you can improve, but it is a question of whether you have the time. I can change this to 15,000 iterations just to see how much. You see, this was the achievement: it is already saturated. Is this the precision I want? Not really. But you see, this is a problem that cannot really be improved, at least if I use this kind of decision boundary. If the boundaries were nonlinear, if I used nonlinear boundaries, then of course it could be improved. [Question: and what would be the problem if nonlinear functions were used?]
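The loop described above (initialize, then repeat forward propagation, cost, backward propagation, and parameter update) can be put together end to end. This is a self-contained sketch under my own naming and hyperparameter choices, not the Coursera code itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_h=4, alpha=1.0, num_iterations=5000, seed=1):
    """Two-layer network, tanh hidden layer, sigmoid output.
    X is (n_features, m), Y is (1, m) with labels 0/1."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * 0.01   # small random weights
    b1 = np.zeros((n_h, 1))                       # zero biases
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    for _ in range(num_iterations):
        # forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # cross-entropy cost (only for monitoring; eps avoids log(0))
        cost = -np.mean(Y * np.log(A2 + 1e-12) + (1 - Y) * np.log(1 - A2 + 1e-12))
        # backward propagation, following the formulas from the slides
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)        # tanh'(z) = 1 - tanh(z)^2
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # gradient-descent update with learning rate alpha
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    A2 = sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2)
    return (A2 > 0.5).astype(float)               # threshold at 0.5
```

On data the network can separate, the accuracy saturates after some number of iterations, just as observed in the lecture; running the loop longer then only costs time.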
No, no, this has nothing to do with the activation function; I am speaking about the discrimination functions, the decision boundaries. You see, the discrimination here is done by a linear function. Of course it is possible to use nonlinear discrimination functions, and in that case maybe we could do better, but this has nothing to do with the activation function. The accuracy is about 90%. [Question: is that good enough?] It depends. If the red points were cancer and the blue points non-cancer, I would say no; it depends on the application, and the sensitivity in that case would not be very high. In my case, for 80% of my clients, yes. It depends on the application, but in any case we have a big improvement over the accuracy we obtained in the beginning with logistic regression. And if you remember, I defined machine learning as solving a concrete task, defining a performance measure, and running this with some concrete algorithm. So let me run this for you; it takes a little bit more time. [Question: can you still see how each weight evolves during training?] Only if we stop the algorithm and look at what is in the dictionary. What I am doing is the following: I write the new weights into my dictionary, update the dictionary, and forget the old values; then I update again and forget the old values. This means I would have to stop at step 100, say, look at the values, and then take the next step. And somebody asked me what adding units in the layer does: this is for one hidden unit, two hidden units, three hidden units, four hidden units, five hidden units, and you see that already beyond two hidden units the accuracy stops improving, while the calculation time and the calculation effort of course increase. So increasing the number of units in a layer does not always bring better results. Of course, I would get better results if I added a second layer, but then I would have to drop the assumption that the discrimination curves are linear. Here I have linear discrimination curves; if I used nonlinear
discrimination curves, it would be better; but with linear discrimination curves in one layer I am not sure that there will be an improvement. [Question: for the hidden layers, should the numbers of nodes be equal or can they be different?] You will see this especially when we come to convolutional layers; there the architectures deliberately play with the dimensions of the hidden layers. So no, they need not be equal. It depends on your data; in fact, not only on the data, it really depends on the architecture, on what you want to do. If you give me a little more time on Wednesday, then for people who are really interested I can show some different architectures, and you will see that the layers vary a lot. For example, take the AlexNet architecture and compare it with the GoogLeNet architecture, usually called the Inception architecture: AlexNet has five layers, the Inception architecture has around 21 and more; it is much deeper. And the dimension of each layer is chosen based on the function you expect that layer to perform, because in the layers you use different kinds of filters; this is the idea of a convolutional network, and the reason it is called that is the convolution operation. So there really is an idea behind which dimensions are used. In our case we are just testing and saying, okay, fine, four is fine; there is no special calculation in advance that says four units is the best choice. You see, this is for 50 hidden units, sorry, this is for 50, and this is for 20, and here the decision functions are already a little bit nonlinear. Nevertheless it still does not capture this point here; if the function went this way, it would be much more precise, but maybe I didn't test enough. And also here it would help if it would at least separate out these two points and classify that point as a red one.
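The unit-count experiment just discussed can be reproduced with a small sweep over hidden-layer sizes. This is again a hedged, self-contained sketch (the training loop is compressed inline; all names, the toy data, and the hyperparameters are my own choices), illustrating how accuracy can saturate as the hidden layer grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_accuracy(X, Y, n_h, iters=2000, alpha=1.0, seed=1):
    """Train the two-layer tanh/sigmoid network and return training accuracy."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, X.shape[0])) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(iters):
        A1 = np.tanh(W1 @ X + b1)                 # forward pass
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y                              # backward pass
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        W2 -= alpha * (dZ2 @ A1.T) / m            # gradient-descent updates
        b2 -= alpha * dZ2.sum(axis=1, keepdims=True) / m
        W1 -= alpha * (dZ1 @ X.T) / m
        b1 -= alpha * dZ1.sum(axis=1, keepdims=True) / m
    pred = sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2) > 0.5
    return (pred == Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0] * X[1] > 0).astype(float).reshape(1, -1)  # XOR-like toy pattern
for n_h in (1, 2, 4, 8):
    print(n_h, train_accuracy(X, Y, n_h))
```

Beyond some size the extra units stop paying off while the computation cost keeps growing, which is the saturation effect described above.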
There are other data sets, and you can play with them. I will give you this file; the data sets are in the zip file, and then you can check the algorithm with these kinds of data, or you can load your own data. You can also build your own setup on top of this, because the program is written in such a way that the dimension of the input data doesn't play any role. You can use any dimension; only the number of examples in the label set and in the input (the x values) has to be the same. [Question: can we use neural networks for regression problems, not only for classification, where the dependent y is numeric?] You mean for prediction. In that case the output activation function is simply linear; yes, it becomes linear regression, but of course it is possible to use neural networks for prediction, and it is done. I showed classification here only because for linear regression the gradient descent is much easier, the derivatives are much easier, and the backward propagation is much easier, so all the things that are the salt of this problem would stay out; and I wanted to show the backward propagation. But coming back to your question: it is possible, and it is used. So this means I am done for today, and for those who are interested: Wednesday at 10 o'clock. [Question: will you cover the usage of complex numbers?] I cannot say yes or no, I don't know; what would you like to use complex numbers for? Okay, I never saw it; that is the only answer I can give you. But I think it is quite easy to go to arXiv and enter these two search terms, deep neural networks and complex numbers, and you will see whether there are papers or not. So, my plan for Wednesday: I wanted to show you
what happens if we add additional layers. I will do it mostly for fully connected neural networks, and just a little bit on what changes when we go to convolutional networks; but in any case I will show you examples, running examples. The materials are on the server; I gave you the access data, so you can download them if you have an Anaconda environment running with all the necessary packages installed. One thing I found out: I made these files with Python 3.5 and 3.6, and already not everything is working; until now I couldn't find out why. It is always a version question, so I cannot assure you that everything will run on Wednesday. These files should work, but in the other files I try to import TensorFlow and I always get errors, and now I have to sit down and find out why; it is a version problem. You are already working with TensorFlow 2.0, and I am using TensorFlow 1.2 or 1.3 here; these are older files, and I have nobody who updates them for me. The maintenance of files is always a big problem, but I think you know that. Thank you.