Okay, so more or less everybody is here — or a lot of people are missing, I don't know; people are still coming. Can you hear me? Now? Better. Okay, let's start. Yesterday, despite all the problems we had with the computers, I hope Matteo made clear that it's possible to use neural networks, and in particular deep neural networks, as universal function approximators. He showed an example in which he combined just two nodes to produce a bump: given a certain input x, he combined two sigmoidal functions with some weights, and then showed that if you combine those again with another sigmoidal function, or even just linearly, and with some biases, you can get essentially any shape — something like this, or like this, or even something close to a Gaussian. Using a combination of many of these nested building blocks, you can imagine building any function you want, even of more than one variable: by tweaking the parameters of a deep neural network you can approximate any kind of function. What he used this for was to solve the Bellman optimality equation for the Q function, the state-action value function. That was a somewhat complicated example, so today I want to show you how to use these structures for a more self-contained one: handwritten digit recognition, that is, how to classify pictures like this correctly. In the MNIST database each picture is an array of 28×28 grayscale pixels, where 0 corresponds to white and 1 corresponds to black, and given this array you want to tell me which digit it is. So you want to construct a map F that, given an array of 28×28 grayscale values, returns an element of the set {0, ..., 9}. This is a complicated map, because the input space is huge: the space of digit images is contained in [0,1]^(28×28), an enormous set of possible configurations, and you want to learn how to map it into the small set {0, ..., 9}. That's the goal: given pictures of this kind, we want to learn how to classify them correctly into digits. If I give you 10,000 pictures, I want you to get as close as possible to 10,000 hits and no misses. What I'm going to show is that neural networks of the kind Matteo presented yesterday — deep feed-forward neural networks, which I'll discuss today — are able to approximate this function very well, with an approximation f̂ that depends on the parameters of the network, its weights and biases. (Answering a question from the audience:) No, the pixel values are in [0,1]: it's a grayscale, a continuous variable that goes from zero, which is white, to one. On the blackboard it's the opposite, but the convention is zero when there's no color and one when there's color.
You could of course use the opposite convention; it's just a grayscale. And what are the parameters going to be? Let's call them θ: a set of parameters of the function that tries to approximate the actual function. This is a function I cannot express explicitly; it's very difficult, very complicated, to say how to map such an array into a set of numbers this small. What I'm going to show is that there is a smart way of parameterizing functions like this, which is through deep feed-forward neural networks. That's the task of today. Now let me ask you something. Imagine you have no idea what the symbols 0 to 9 are, no knowledge of arithmetic — you come from some extrasolar planet, you don't know what these numbers are. What do you think is the best way for me to teach you this classification problem? Any idea? (A student suggests using the correlations in a picture to associate it with an output.) That's kind of what will happen, yes. Examples — that's the answer we are going to use today. I give you the picture and I also tell you the solution. This is an example of supervised learning, learning with a teacher: I am the teacher, I show you the picture and I also present you the solution. Humans do this very well. If I want to teach you the concept of a dog, or an elephant, or a giraffe, after I show you three pictures you are already able to generalize to other pictures of giraffes; maybe there is an animal very similar to a giraffe and you get confused, but still, this kind of training works with humans, so let's try to make it work with artificial neural networks. Another approach could be that I don't tell you the solution: I only tell you the possible answers, ask for yours, and punish you if it is wrong or reward you if it is correct. That would be a kind of reinforcement learning approach to this problem. If my punishments are very harsh you learn fast; if they are mild, or stochastic — for example you tell me "this is a two" and I slap you anyway — then you don't learn very fast. It's still effective, but it takes longer. Anyway, what we will do today is supervised learning. So we have a set of pairs: a picture and the answer. For example, I give you this one and tell you it's a two; I give you another one, something like this, and tell you it's a five; and so on. Every time, I tell you what the answer is. Then the task is: if I show you a picture you have never seen before, are you able to recognize it or not? That is how the performance will be measured.
So these are the training data, which I show you together with the answers. At the end of the training I present you another set, just pictures this time, I ask you which digits they correspond to, and I count how many hits or misses you got. That measures the performance of your algorithm: how well you are able to generalize from the data you have been fed to data you have never seen — how well you have learned the concept of "two", "five", and so on. This is a very simple task, but networks of this kind also work for much more complicated things, like animals, or even dynamical data (not exactly this particular kind of network, but similar ones), so they are able to generalize quite well. Now, the point is that you don't really need to keep all of the data in memory, because the approach we are going to use is a stochastic gradient descent on a cost function that I will define shortly. If you have a stream of data coming in, you process it as it arrives: by processing I mean you train the neural network, you tweak its parameters as the stream of information comes in. You do need to store the data somewhere, but that's not really limiting — you can store terabytes without any problem and process them more or less online. This example runs on my laptop and reaches human-level performance; for more complicated problems, I don't know, one has to try. This lecture could really be better suited for a cooking class: it's a collection of tricks, motivated by more fundamental issues that I hope to point out, but the actual implementation requires a lot of trial and error, and of course you have to balance the trade-offs between memory, computational load and performance. So there is also test data: eventually you use some training data, and then some test data you have never seen before, to see what your performance is. That's the task. Now, how do we formalize this learning? Learning somehow involves some kind of optimization: you want to optimize some measure of performance. How would you formalize this performance — what is the function you want to maximize or minimize? Any guess? (A student proposes assigning probabilities.) Okay, so you give me, for each picture, some probability that it is a zero, a one, a two, and so on, and you choose the one with the highest probability; for example, if the probability assigned to three is the highest, you say this picture is a three.
So that is the machine's answer: the machine tells you, okay, this is a three. Now how would you measure the performance? Let me phrase it in terms of a cost function. One proposal, from Alessandro: the number of misses, the number of wrong classifications. That is a perfectly good measure of performance, but it turns out not to be efficient for training the network. Any other ideas? This is the most obvious one, but for a reason I will show you in a second we would like a cost which is a continuous function of the output of the machine and of the desired output. With this hint, can you guess another cost function? Okay — let's first go into the structure of the network, and then come back to this question; with the structure in mind we can try to come up with a cost function that makes sense. We had a tutorial on neural networks yesterday, so I wasn't going to re-explain what a neuron, a weight or a bias is — but you're not all familiar with them? Okay, let me recall the basics anyway. The network we use today is a deep feed-forward neural network; that's the short notation here. First of all, a neural network is a network of neurons, fair enough. But what is a neuron? A neuron is a unit, which I represent with a circle, that receives some inputs and gives you an output. In a network, these inputs are the outputs of other neurons: if I have another neuron here, it gives input to this one, and so on for the others. I call the output of a neuron its activity, a, which in this particular example you can think of as the probability of the neuron being active: a number between zero and one, a continuous function of its inputs. The function that transforms the inputs into the output I call the activation function and denote by σ. The input, in our particular example, is a linear combination of the outputs of the neurons upstream in this flow. So the activity of neuron i, receiving input from neurons j, is a_i = σ(Σ_j w_ij a_j + b_i): σ applied to a linear combination of the activities a_j of all the upstream neurons, with weights w_ij (to neuron i from neuron j), plus something we call a bias, b_i. Simple enough. Let me say something about this σ.
The σ we are going to use today is the sigmoid function: as a function of its input z, it is σ(z) = 1/(1 + e^(−z)). Everybody can see here? It looks like this: when z is much smaller than zero, going to minus infinity, σ is zero; then it increases, and it goes to one as z goes to plus infinity. If you look at the neuron's activity as a function of weights and biases, the magnitude of the weights says how narrow the region is in which this activation function is sensitive, in which it varies appreciably; the bias b — or rather the ratio between b and the weights — sets where the neuron gets activated. So −b acts as a kind of threshold: below this threshold the neuron is inactive, σ is essentially zero and so is a, and beyond it the neuron is active with probability close to one. This is the building block of the neural network. Now, it is a network, so there is a network structure underlying it, and that is the more interesting part. Can I erase here? Okay: deep feed-forward neural network. Feed-forward means that the structure is layered, like a lasagna, and the contribution to each layer comes only from the neurons in the layer before — the arrows go like this, and so on. Between two consecutive layers we have full connectivity: all the nodes of one layer are connected to all the nodes of the next. What is forbidden is connections between nodes in the same layer, and connections between nodes that are more than one layer apart — so no connections like this, these are forbidden — and you also don't have backward or bidirectional arrows, in which the output of a neuron eventually becomes its own input through other neurons. That is why it is called a feed-forward network: the output of the neurons is only fed forward, to the next layer, and never backward. This is a very nice property, because it allows us to use an algorithm called backpropagation, which works for networks with exactly this structure. Networks with loops are also very interesting and are studied in the literature, but there are no training algorithms for supervised learning that are as efficient as the one I'm showing you now. So, to each arrow is associated a weight: if these are neurons i1, i2, i3 and this is neuron j, the weights are w_{i1 j}, w_{i2 j}, and so on. And to each node is also associated a bias: b_{i1}, b_{i2}, b_{i3}.
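To make the building block concrete, here is a minimal numpy sketch of the activities of one layer of sigmoid neurons; the array names and numbers are mine, chosen purely for illustration:

    import numpy as np

    def sigmoid(z):
        # logistic activation 1 / (1 + e^(-z)): maps any input into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    # toy layer: 3 upstream neurons feeding 2 downstream neurons
    a_prev = np.array([0.2, 0.9, 0.5])        # activities of the previous layer
    W = np.array([[0.1, -0.4, 0.7],           # W[i, j] = weight to neuron i from neuron j
                  [1.2,  0.3, -0.5]])
    b = np.array([0.0, -0.8])                 # one bias per downstream neuron

    z = W @ a_prev + b                        # weighted input z_i = sum_j w_ij a_j + b_i
    a = sigmoid(z)                            # activities of the downstream layer
    print(a)                                  # two numbers in (0, 1)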
The bias, as we said, is associated with that threshold: the level of input beyond which the neuron is active. So this is the kind of neural network we are going to use. Deep, because in general it has more and more layers; the one I'm going to show, which already has good performance, has a first layer which is the input — it takes the picture on this side — one intermediate layer, and one output layer. That's the structure we use today, and later we can develop it with more layers. (Question about the layers.) Yes: the input is the first layer, the output is the last layer, here the third, and there is one hidden layer — all the layers in between are called hidden layers, and here we have one. There is a theorem that says that by using nonlinear activation functions, for example sigmoids, with only one hidden layer you can approximate any function arbitrarily well, provided that this layer is big enough, so that you have enough parameters to tweak and adjust. Neural networks in general are a broader class: you can also have units that feed themselves, or loops, where the output of one unit eventually comes back as its own input. We don't want those here, for purely practical reasons: the algorithm I'm going to use relies on the fact that the network is layered and strictly feed-forward — it never feeds back and never skips layers. Now let's discuss the input and the output. The input layer, in image recognition, is all the pixels of the picture: I will feed in a vector x which lives in [0,1]^784, since 28 × 28 = 784. What about the output? How should the output look? (A student says: a number from 0 to 9.) Okay, so you want the output of the last layer — call it y, or a^L, where I label the layers l = 1 for the input up to l = L for the output, with L = 3 in this case — to be a number, an integer, in particular in {0, ..., 9}. That's one possibility. Other ideas? (Another student suggests a binary representation.) So the output would be a vector of bits giving the binary representation of the digit: for example 8 would be one times 2^3 and zeros elsewhere, i.e. the bits 0 0 0 1 in the order 2^0, 2^1, 2^2, 2^3. How many bits do you need? At least 4, because with 3 bits you only get up to 7, so you don't reach 9. So, for example, a more parsimonious choice is an output in {0,1}^4.
With that encoding you have 4 units in the last layer, coding the binary representation of the digit. Other ideas? (A student proposes one unit per digit, each giving the probability of that digit.) Right — so you want the output to be something like a probability distribution over the set {0, ..., 9}. That's a good idea, and it's the closest to what we will actually do; strictly speaking what we use is neither a normalized probability distribution nor the binary code, but it has this form. The "single number in {0, ..., 9}" proposal is not good for practical purposes, for the reason your colleague's suggestion hints at; the other two are both fine in principle. Let me just state the output layer we use, and then explain how we use it. Because of the sigmoid activation, each output unit gives an activity between zero and one; the sum over the units does not have to be one, so it is not a probability distribution over the digits, and it doesn't need to be normalized for our purposes. The output a^L is a vector with 10 components, a^L = (a^L_0, ..., a^L_9), each of them in (0, 1). (Question about the drawing.) Yes, these are 10 output units — the picture is only schematic; in practice the hidden layer will have around 100 units, which I'm not going to draw. So we have 10 output units, each producing a number between zero and one, and the answer of the classification problem is going to be: whichever of these is maximal — which is related to what you were saying. But we use the full vector to construct the cost function. The cost function I propose — not my proposal, of course; it's a standard one — is, for a given input x, C_x = (1/2) ‖y(x) − a^L‖²: one half of the squared Euclidean distance between the desired output vector y(x), which I will define in a moment, and the output of the network. This is one proposal; it's not the only one. (Question: what if two outputs are both maximal?) These are continuous variables, so it is very unlikely that two of them are exactly equal. But you might indeed decode the output differently to get an answer: for example, you could set to one all the outputs above a certain threshold, say 0.9, and then you could get a confused answer like 0 0 0 1 0 0 1 0 0 0, where you don't know whether it's one digit or the other. Taking the maximum is just one way of decoding, but it always gives you exactly one number as the answer.
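As a tiny illustration of this output convention, here is a sketch with toy numbers of mine (not from the lecture code) of the quadratic cost for one example and of the argmax decoding:

    import numpy as np

    # toy network output for one picture: 10 activities in (0, 1)
    a_L = np.array([0.05, 0.02, 0.10, 0.85, 0.03, 0.07, 0.01, 0.20, 0.04, 0.06])

    # desired output y(x) for the digit 3: all zeros except a one at position 3
    y = np.zeros(10)
    y[3] = 1.0

    # quadratic cost for this single input: C_x = 0.5 * ||y - a_L||^2
    C_x = 0.5 * np.sum((y - a_L) ** 2)

    # decoding: the network's answer is the index of the largest activity
    answer = np.argmax(a_L)
    print(answer, C_x)    # -> 3, and a small cost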
The decoding ambiguity I just mentioned is exactly what you run into with an output like the binary representation: how do you choose which bits are on and which are off? That's the drawback of that kind of output layer, so let's stick to the 10-unit one. So that is the cost for one given input. The cost function that the network actually tries to minimize is C = (1/N) Σ_x C_x, the average of C_x over all the N training examples — the arithmetic mean of the per-input costs. Does that sound good? (Question about the data.) The MNIST dataset we use for training is indeed stored like this: each example is an array of 784 values in [0,1], together with a vector of length 10 in which exactly one entry — the one corresponding to the correct digit — is one and all the others are zero. So, for example, (0, 0, 1, 0, ..., 0) is the desired output for a two, counting the positions 0, 1, 2, and so on. The desired output has to match the structure of the last layer, because you want to compare the actual output of the network with it without ambiguity, so the labels in the data are given in the same format. If we were using the binary-representation output layer with four units labelled 2^0, 2^1, 2^2, 2^3, then the label for a two would be 0 1 0 0, and the label for a five (one plus four) would be 1 0 1 0, with two bits on — so again you face the problem of decoding the last layer to get an answer. I don't know of a clean way around it; you might find that it works very well, or that it doesn't — it's a heuristic, you just try. By the way, I'll give you the code, so you can try this variant yourself and check its performance; I haven't. Now, this cost function is of course only a proxy for how well you are doing. It guarantees that if you get zero cost for every training input you are doing extremely well, but a nonzero cost does not mean your answers are wrong; it's just a proxy such that, by minimizing it, you are pushed towards the correct answers. The nice thing about it — which is not a property of the "number of misses" — is that it is a continuous function of the parameters: the output a^L is our f̂, which depends on the weights and biases, and it depends on them in a continuous way, whereas the number of misclassifications is a discrete quantity. (Question: couldn't you take an average, a percentage of misses, instead?) Well, it's still fundamentally a discrete thing.
Even if you say, okay, it's a percentage, and you take many, many data points so that the fraction of misses gets close to a real number, you need a huge sample to approximate it, and you still don't know how to optimize that function in a continuous way. The reason we want a continuous cost is that the learning method we are going to use is stochastic gradient descent. We start with an initial guess of the parameters W and b, the weights and biases of the network — a guess that has to be devised wisely — we obtain an estimate of the gradient of the cost function with respect to the parameters, and then we update the parameters by descending along the gradient. It's stochastic because, in the end, the quantity whose gradient the network evaluates is a stochastic estimate obtained from samples of the data. Plain gradient descent also works for a perfectly well-defined deterministic function: you just compute the gradient and make small steps towards the minimum; here it becomes a noisy search for the minimum, but it works anyway. Can I erase here? The cost C, in the end, is what we want to be optimal — not for one input, but for many. So, gradient descent: you are at a certain point (W, b) in parameter space, you compute the gradient of the cost — all the derivatives ∂C/∂W with respect to the weights and ∂C/∂b with respect to the biases — and you update, one parameter at a time, W' = W − η ∂C/∂W, with the derivative evaluated at the current estimate. With a time index, if you like: W_{t+1} = W_t − η ∂C/∂W, evaluated at (W_t, b_t). The factor η is called the learning rate, and it also has to be chosen wisely, because you want to make small steps: if the function looks like a narrow valley and you take the gradient and make a big jump, you bounce back and forth and never converge. So the learning rate has to be small enough; again, finding it requires a lot of cooking, trial and error, but you have to set one. Strictly speaking, what I wrote is just gradient descent — there is nothing stochastic in it yet. The stochastic part comes from the fact that this gradient is measured as a stochastic sample from our data, which is why I write it with angle brackets: it is a sample average over my training data.
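A minimal sketch of this update rule, with a made-up quadratic toy cost whose gradient we know in closed form, just to show the iteration (all values here are mine, for illustration):

    import numpy as np

    # toy cost C(w) = ||w - w_star||^2 / 2, whose gradient is (w - w_star)
    w_star = np.array([1.0, -2.0])       # the (hypothetical) minimum
    grad_C = lambda w: w - w_star

    w = np.array([5.0, 5.0])             # initial guess of the parameters
    eta = 0.1                            # learning rate, chosen small enough

    for t in range(100):
        w = w - eta * grad_C(w)          # gradient descent step: w <- w - eta * dC/dw

    print(w)                             # close to w_star after enough steps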
So this is the stochastic gradient descent step. Now, the natural question — yes, η we fix, and of course b has to be updated in the same way — is: how do we calculate the gradient? That is actually the core of these training algorithms, and the answer is an algorithm that gained a lot of attention because it works very well: backpropagation. It works for exactly this kind of network, layered and feed-forward. I'm going to derive its equations; it's an analytic derivation, and it only requires knowing how to take derivatives of functions, so it's not an incredible task. Can I erase this? The cost function should be clear by now. First of all, we define an auxiliary quantity for the algorithm which measures, in some sense, the distance from optimality — optimality meaning how far C is from its minimum, given our data. I call it the error and denote it δ_i^l, where l is the index of the layer and i labels the neuron inside that layer, so it is defined for every unit in the network; if layer 2 has 10 neurons, then for l = 2 the index i runs from 0 to 9, just to be clear. It is defined as the derivative of the cost with respect to the input of neuron i in layer l: δ_i^l = ∂C/∂z_i^l, where z_i^l denotes that input. I don't claim a transcendental meaning for this quantity, but read it backwards: if the derivative of the cost with respect to the input of every neuron is zero, your network is already at optimality — you are at the bottom of your cost, at least at a local minimum — and all the δ's are zero. That is why it measures, in some sense, the distance from optimality. The input z_i^l is calculated as we saw before, by summing over the neurons j of the previous layer — that is the feed-forward part of the network: z_i^l = Σ_j w_ij^l a_j^{l−1} + b_i^l, and the activity itself is a_j^{l−1} = σ(z_j^{l−1}), so there is a recursive expression for the inputs at successive layers. With these two expressions we can now derive the backpropagation algorithm. We start from one simple expression: the error at the output layer, δ_i^L = ∂C/∂z_i^L. Now, the cost depends on the input z_i^L only through the output.
Remember that C is defined for any given input x: in our case, for example, C_x = (1/2) Σ_j (y_j(x) − a_j^L)², where y_j(x) is one for the correct digit (a Kronecker delta on the right answer) and zero otherwise. All these quantities depend on x; I just don't write it everywhere because the notation becomes too heavy, but yes, it depends on the data I have. Now, C_x depends on z_i^L through the output, via that formula, so we can apply the chain rule: I take the derivative with respect to the output, which has z as its input, times the derivative of the output with respect to the input: δ_i^L = (∂C_x/∂a_i^L) σ'(z_i^L). The first factor we know, because we defined the cost function ourselves: we take its derivative and we just code it. The second factor we also know, because the activation is just σ(z). This is the first formula that is going to be used. Can I erase the network? Okay. So the algorithm does this: it takes the input, feeds it forward through the network until you get to the output, so you have all the z's and all the a's, and you calculate this δ for all the neurons in the output layer. Then you want to calculate the δ's for all the previous layers, and the name backpropagation tells you how it's going to be done: they are fed backward to the previous layers. Now I derive the equation for this step. The second equation — the main one — tells you how to calculate δ_i^l for any of the previous layers, that is, for any l between 2 and L − 1; the first layer is just the input, the picture, so it is given and there is no error there. Again we start from the definition. The input z_i^l determines the inputs of the next layer, so I can again unpack the derivative into a chain through the z's of layer l + 1: δ_i^l = ∂C_x/∂z_i^l = Σ_j (∂C_x/∂z_j^{l+1}) (∂z_j^{l+1}/∂z_i^l). Is it clear what I've done here? Now I use the fact that z_j^{l+1} depends on z_i^l through the feed-forward expression — it's just a shift of indices with respect to what I wrote before — and through the σ.
So what happens is that this object becomes the following. The first factor is, by definition, δ_j^{l+1} — let me just rename the index — and for the second factor I first take the derivative with respect to a_i^l, which gives me the weight w_{ji}^{l+1}, and then the derivative of a_i^l with respect to z_i^l, which is σ'(z_i^l). So δ_i^l = Σ_j δ_j^{l+1} w_{ji}^{l+1} σ'(z_i^l). This is very nice, because I have derived a formula that tells you how to calculate the errors in a given layer, linearly, given the errors in the next layer. It is a linear operation, so you have all the tools of linear algebra available on your computer and you can do it very efficiently: the errors of one layer depend on the next layer only through this linear combination, the weights here are just a matrix, you store it, multiply it against the vector of δ's of the next layer, and you get the δ's of the previous layer. This is the backpropagation formula. Now you might say: fine, I can calculate these δ's, but what do I do with them? Let's see how the derivatives we actually want — the derivatives of the cost with respect to the parameters — follow from them. Take ∂C_x/∂w_{ij}^l, one of the partial derivatives you need for the stochastic gradient descent. Again it's the chain rule — the same very simple calculus-101 tool used throughout this derivation. In principle I sum over all the z_k^l that could depend on this weight, but the only z that depends on w_{ij}^l is z_i^l; none of the others do. So ∂C_x/∂w_{ij}^l = (∂C_x/∂z_i^l)(∂z_i^l/∂w_{ij}^l). The first factor is δ_i^l by definition, and from the feed-forward formula over there, z_i^l is linear in the weights, with coefficient a_j^{l−1} = σ(z_j^{l−1}), so the second factor is just a_j^{l−1}. Therefore ∂C_x/∂w_{ij}^l = a_j^{l−1} δ_i^l. So now we know why we introduced these δ's: the gradients we want are directly proportional to them. Very simple. And then the last formula, which I write here, is the gradient with respect to b.
The gradient with respect to b is even simpler. Same trick as before: ∂C_x/∂b_i^l = (∂C_x/∂z_i^l)(∂z_i^l/∂b_i^l), because z_i^l is the only input that depends on b_i^l. The first factor is again δ_i^l by definition, and the second is trivial, because z depends on b only as an additive constant, so its derivative is just one. Therefore ∂C_x/∂b_i^l = δ_i^l. (There was a back-and-forth with the audience here about whether some of the layer indices should be l, l − 1 or l + 1; I had a shifted label in my notes, and I apologize for the mess of indices, but the formulas as written above are the consistent ones, and the code is correct because it was written by someone else. It's just a shifting of the labels — you have to be careful with that, and it's usually a mess, but there is nothing substantial.) One more remark on the indices: the layers here run from l = 2 to L, because the first layer doesn't have a weighted input — its activity a^1 is just the data, my vector x of grayscale pixels. So now you have a way of calculating all the derivatives, and with this you are ready to take your gradient descent step in the cost function. Let me collect the four backpropagation relations here, and then discuss how the step is actually taken in practice.
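For reference, these are the four relations derived above, written with consistent indices (this summary is my own consolidation of the blackboard derivation), in LaTeX form:

    \delta_i^L = \frac{\partial C_x}{\partial a_i^L}\,\sigma'(z_i^L)
    \delta_i^l = \sum_j w_{ji}^{l+1}\,\delta_j^{l+1}\,\sigma'(z_i^l) \qquad (l = 2,\dots,L-1)
    \frac{\partial C_x}{\partial w_{ij}^l} = a_j^{l-1}\,\delta_i^l
    \frac{\partial C_x}{\partial b_i^l} = \delta_i^l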
Now, as Matteo discussed yesterday, we don't take a step after every single example: we use the mini-batch update, a trick that is very useful in stochastic gradient descent because it speeds up the algorithm quite a bit — it lets you use ready-made libraries for linear algebra — and it also reduces the noise a bit. So the learning algorithm goes like this (have you copied the backpropagation formulas, so that I can erase them? In any case I'll show them on the laptop in a moment). You have N training examples, each a pair of a 784-component vector of values in [0,1] and a 10-component desired output. The trick of the mini-batch is that you don't compute a gradient step for every single data point: you split the data into subsets — call them (X_1, Y_1), of size m, and so on — the mini-batches. There is no general recipe for this size m, but it is important to set it right. What happens is that you take one mini-batch, calculate the gradient for every example in it, take the sample mean of the gradient over the mini-batch, and then make a step. So you compute (1/m) Σ_{x in the mini-batch} ∂C_x/∂W — let me just write W, but it's understood that the same goes for the W's and the b's — and this is the mini-batch average of the gradient. One comment about the cost function: it is nice that it is written as a sample mean of per-input costs, because then the mini-batch average is an unbiased estimator of the gradient. So it's not a bad estimate: it is noisy, because m is much smaller than the full set of training data, but on average it points in the right direction. Once you have it, you do the update as we said: W is assigned the old W minus a small enough number η times this mini-batch estimate of the gradient. That is the algorithm, period. Once you have exhausted your data — you have a finite number of mini-batches — you can test how well you've done, but the counting of how many hits or misses you got does not provide any information on how to train: it's just a test, the ultimate goal you are training your neural network for, but it does not enter the algorithm. The only thing that enters the algorithm is the cost function, which you have to set beforehand, in a smart way, in order to achieve good performance. Is the whole flow clear? Okay.
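To fix ideas, here is a minimal runnable sketch of one version of this mini-batch loop; the per-example gradient is a dummy stand-in (a toy quadratic cost), not the network's backpropagation, and all the names and sizes are placeholders of mine:

    import random
    import numpy as np

    def per_example_grad(w, x, y):
        # placeholder for backpropagation: dummy per-example cost C_x(w) = ||w - y||^2 / 2,
        # whose gradient is (w - y), just so the loop below actually runs
        return w - y

    # toy "training data": 100 pairs (x, y); only y matters for the dummy cost above
    training_data = [(np.random.rand(3), np.random.rand(3)) for _ in range(100)]

    w = np.zeros(3)          # parameters (standing in for all weights and biases)
    eta = 0.5                # learning rate
    m = 10                   # mini-batch size

    for epoch in range(5):
        random.shuffle(training_data)                       # reshuffle once per sweep
        mini_batches = [training_data[k:k + m]
                        for k in range(0, len(training_data), m)]
        for batch in mini_batches:
            # sample mean of the per-example gradients over the mini-batch
            grad = sum(per_example_grad(w, x, y) for x, y in batch) / len(batch)
            w = w - eta * grad                              # one SGD step per mini-batch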
Now, I basically taught myself this subject by looking at the reference here; it's a very nice, very introductory book, with a quite lengthy and quite worthwhile discussion of how all this works, and it gives a lot of guidance — which I don't have time to discuss — on how to set these parameters: the size of the mini-batch, the learning rate η, how to choose the cost function wisely, how to choose the initial condition for the weights, which is not trivial either. Maybe I'll mention some of it at the end. Now I want to give you a practical demonstration of how the code is put together. The SGD function is this one: it is the big function that does the training for us. It takes the training data, and `epochs` is basically how many times you go through all the data: one pass over the whole data set is one epoch — it's just a name for a sweep over all the data — and after each epoch you can test how well you've done; in principle you go through many epochs, because you have a limited sample, so you have to run over it many times. If you also pass in test data — by default the argument is set to None, so nothing is tested — it is used to check how good your training is, which is useful, so you can set this argument to the list of test data you are using. Then the main loop is this: for every epoch, that is, every time you go through your data set, you shuffle the data. If you went through it in the same order, the same examples would always end up in the same mini-batches; shuffling is just a way of getting better statistics for your estimates. Then — oops, sorry — you divide the data into these mini-batches, and for every mini-batch you apply update_mini_batch, which does exactly what we wrote: it updates the weights using the gradient estimated over the mini-batch, and inside that step it uses backpropagation. It's all nested, but if you look at the code it is quite self-explanatory; I think it's very well written. (Question about what changes between epochs.) Different epochs just repeat the same exact procedure on the reshuffled data; you shuffle because you want to remove some of the correlations in the estimates.
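To give a feeling for what update_mini_batch and its backpropagation step compute, here is a compact sketch of a forward pass plus backward pass for one example, with the quadratic cost; this is my own condensed version in the spirit of the code on screen, not the actual file:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(weights, biases, x, y):
        """Gradients of C_x = 0.5*||y - a^L||^2 for one example; weights and biases
        are lists with one entry per pair of consecutive layers."""
        # forward pass: store all weighted inputs z and activations a
        a, activations, zs = x, [x], []
        for W, b in zip(weights, biases):
            z = W @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # output-layer error: delta^L = (a^L - y) * sigma'(z^L)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_w = [np.outer(delta, activations[-2])]
        nabla_b = [delta]
        # backward pass: delta^l = (W^{l+1})^T delta^{l+1} * sigma'(z^l)
        for l in range(2, len(weights) + 1):
            delta = weights[-l + 1].T @ delta * sigmoid_prime(zs[-l])
            nabla_w.insert(0, np.outer(delta, activations[-l - 1]))
            nabla_b.insert(0, delta)
        return nabla_w, nabla_b

    # tiny 784 -> 30 -> 10 network with random parameters, and one fake example
    sizes = [784, 30, 10]
    weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(m) for m in sizes[1:]]
    x, y = np.random.rand(784), np.eye(10)[3]     # a fake picture labelled "3"
    nabla_w, nabla_b = backprop(weights, biases, x, y)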
Now, the cost function implemented here is the quadratic cost. In principle, any cost which is convex in the output and has its minimum at the desired output is fine, but it might not be good for practical purposes, because these derivatives can turn out to be zero or very small, and then you don't learn much; so you want to devise a better cost function that avoids this problem of gradients becoming too small. Remember that the δ at the last layer is the derivative of the cost with respect to the output times σ' of the input. For the quadratic cost, the first factor is just a_i^L − y_i, the deviation from the desired output, but then you also have the factor σ'(z_i^L) — and here he would correct me on the indices, but fortunately he's not paying attention. Since σ is a sigmoid, if this z ends up far out in either flat tail, σ' is very small. With sigmoidal neurons you run into this problem of vanishing gradients very often: the δ's become very small already at the output, and when you apply the backpropagation equation — the one I erased, it's somewhere here — which iterates backward through the layers, you multiply by a σ' once for every layer, so the errors, and hence the derivatives, in the earlier layers get smaller and smaller, and the learning in those layers gets slower and slower as you add layers. So keep in mind that for this particular activation function there is a nice choice of cost function, called the cross-entropy. It is, loosely speaking, a measure of surprise; it has an information-theoretic meaning. I wrote it here, and in the book I gave you as a reference there is a discussion of why it is better, but the reason is simple: when you take the derivative of the cross-entropy with respect to the output a, you get factors that exactly cancel the σ', and the output error is just proportional to a − y for this choice of cost. It's all well explained in the notes. And if you look at the code down here, this is the definition of the cross-entropy cost: taking the derivative, the log becomes 1/a here and 1/(1 − a) there; combining them you get a(1 − a) in the denominator, and since σ' = σ(1 − σ) = a(1 − a), it cancels. It's a trick — it's not a prescription handed down by anybody — but it turns out to work very well with sigmoidal neurons. So this is one way of improving the learning compared to the quadratic cost, which suffers from these vanishing derivatives already at the output.
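For completeness, a small sketch of the cross-entropy cost and the resulting output-layer error for sigmoid outputs; this is my own condensed version with toy numbers, not the snippet on screen:

    import numpy as np

    def cross_entropy_cost(a, y):
        # C_x = -sum_i [ y_i ln a_i + (1 - y_i) ln(1 - a_i) ]
        # np.nan_to_num guards the 0*log(0) case when an output hits exactly 0 or 1
        return -np.sum(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a)))

    def output_delta(a, y):
        # for sigmoid outputs, dC/da = (a - y) / (a (1 - a)) and sigma'(z) = a (1 - a),
        # so the sigma' factor cancels and the output error is simply a - y
        return a - y

    a = np.array([0.1, 0.8, 0.05])   # toy sigmoid outputs
    y = np.array([0.0, 1.0, 0.0])    # desired outputs
    print(cross_entropy_cost(a, y), output_delta(a, y))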
And then here you have the code. One word about how the MNIST database is loaded into the program: the mnist_loader module, through its load_data_wrapper function, takes the MNIST data and loads it into lists called training_data, validation_data and test_data. The training data is the stream of input-output pairs used for the stochastic gradient descent. Then you want to test your network, to see how many hits or misses, how many correct classifications, you get. But why would you need a second, separate set of validation data? Any idea — maybe someone who knows something about this? The validation data is a kind of test of how well your algorithm is learning, rather than of its final performance. You want one data set for tuning the so-called hyperparameters — parameters at a higher level, describing the way the network learns: the mini-batch size, the learning rate η, even the structure of the network, or the way you initialize the weights and biases (it's not written there, it's in the code below). If you tune them against one test set only, you might end up with hyperparameters that describe that particular test set well but not other data — you are overfitting your test data — so you want to check against another, independent set. This is called a hold-out technique: you keep one data set to monitor how well the network learns as you change the hyperparameters (the mini-batch size, η, and other parameters, like the ones you introduce for regularization to avoid overfitting the training data, which unfortunately I don't have time to go through in full), and you keep yet another set just to check how well you have chosen those hyperparameters. (Question about the name.) I don't have a deep understanding of why it is called validation — you could call them test data one, two and three — but the idea is that you are, in some sense, validating your training procedure. So, as I said before, there is a lot of cooking, a lot of trial and error, but there are some rules of thumb, which come from experience (not mine, admittedly). For example, the mini-batch size is something that in the notes I referenced is essentially never changed; it's one of the last things you tune, once the other parameters are set. The learning rate η, instead, is one of the first ones you want to look at.
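As an illustration of the hold-out idea, here is a self-contained toy sketch: a single sigmoid neuron on synthetic 1-D data, with η selected on a validation split and the final score reported on the test split. Everything here — the data, the helper names, the candidate values — is made up for illustration; it is not the lecture code:

    import numpy as np

    rng = np.random.default_rng(0)

    # toy binary problem: 1-D inputs, label 1 if x > 0 (plus a bit of noise)
    x_all = rng.normal(size=300)
    y_all = (x_all + 0.3 * rng.normal(size=300) > 0).astype(float)

    # hold-out split: training / validation / test
    x_tr, y_tr = x_all[:200], y_all[:200]
    x_va, y_va = x_all[200:250], y_all[200:250]
    x_te, y_te = x_all[250:], y_all[250:]

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def train(eta, steps=200):
        # single sigmoid neuron a = sigma(w x + b), trained by gradient descent
        # on the cross-entropy cost (so the per-example error is just a - y)
        w, b = 0.0, 0.0
        for _ in range(steps):
            a = sigmoid(w * x_tr + b)
            delta = a - y_tr
            w -= eta * np.mean(delta * x_tr)
            b -= eta * np.mean(delta)
        return w, b

    def accuracy(w, b, x, y):
        return np.mean((sigmoid(w * x + b) > 0.5) == y)

    # hyperparameter selection on the validation set, final score on the test set
    etas = [0.01, 0.1, 1.0, 3.0]
    best_eta = max(etas, key=lambda eta: accuracy(*train(eta), x_va, y_va))
    print(best_eta, accuracy(*train(best_eta), x_te, y_te))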
So, how fast you go down: if you plot the performance, the percentage of correct classifications, as a function of the update step in your stochastic gradient descent, for different etas you might find that one curve improves like this, another much more slowly, and another might oscillate. You clearly see which one is not good, so you stick to one of the good ones, you set it, and then you move on to the many other parameters.

For example, one very common problem in these supervised learning tasks is overfitting. It might be that you're very good at representing the training data, but when I show you something different you're completely confused: maybe you have learned the details of those particular pictures and not the general structure; you haven't been able to extract the higher-level information somehow, no? There are several tricks to deal with this. One of them is to regularize the parameters: to the cost C0, which is one of the costs I wrote before (the quadratic one, or the cross-entropy), you add another cost, for example (lambda / 2n) Σ_{i,j,l} (w^l_{ij})^2, where n is the number of training data and the sum runs over all the links. This is called L2 regularization, because the cost you add is the squared L2 norm of the vector of Ws. When you include this term, the update rule becomes w → w (1 − eta lambda / n) − eta ∂C0/∂w, just for the Ws. In general it's a good idea to do this just for the Ws and not for the Bs, because the biases are not really important for overfitting, but the weights are. Again, this is heuristic: the evidence is that you want to keep the Ws quite small in order to generalize better; otherwise you learn too much the details of the training data and you don't generalize. But the effect is basically very simple: at each step you multiply the old w by a factor smaller than 1, so if the gradient you evaluate is always zero, the w decays back to zero, exponentially, geometrically. So this is one way to prevent overfitting. And this lambda is one more hyperparameter that you want to set; after the eta, you might wonder which is the best lambda.

And then there are other techniques for regularization. Suppose that you monitor, and this is the case here, let me run it while I talk, it's the example here if I'm not wrong, both the performance on the training data that you're feeding in and on the test data, as you run through the epochs, as I'm doing here. I'm doing only 30. What you see is, well, now I wait for the graph, I'm not drawing it. So, you have two minutes. Okay, I think it's plotting. What you see here is that I test while I train my network.
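Here is a minimal sketch of this regularized update step, assuming the weights and the backpropagated gradient of C0 are numpy arrays (the function names are mine, just for illustration):

```python
import numpy as np

def l2_regularized_update(w, grad_w, eta, lmbda, n):
    """One gradient-descent step with L2 regularization.

    Implements w -> w * (1 - eta * lmbda / n) - eta * dC0/dw.
    The weight-decay factor (1 - eta * lmbda / n) < 1 pulls the
    weights geometrically toward zero whenever grad_w vanishes.
    """
    return (1.0 - eta * lmbda / n) * w - eta * grad_w

def bias_update(b, grad_b, eta):
    # the biases are updated without the decay factor: overfitting
    # is driven by the weights, so only the weights are regularized
    return b - eta * grad_b
```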
I test both the performance on the training data, which goes up to 100%. Why? Because I'm learning the training data very well, so I'm also learning the details of my training data. But when I test on the test data, it stops increasing; it goes to a plateau. So the idea of regularization is that you want to tweak your parameters in order to reduce this gap as much as possible. For example, here you might say that the test curve is still growing a little bit. I used only 1,000 data points for the training; if you use all 50,000, that number of data already reduces the gap, because it's difficult to learn the details of that many more examples. But you might also say: well, I don't want to extend my training beyond 5 or 10 epochs (a ridiculous number here, because the training data are very few), because if I go on longer, I'm going to learn the details of the training data and I won't be able to generalize. This is called early stopping: I stop at the epoch at which I realize that my performance on the test data doesn't grow anymore while on the training data it still does. I stop before, and I prevent overfitting this way. (There's a sketch of this logic below.)

Other methods, again, I'm going through a list, it's part of the kitchen: there's one called dropout, but let me skip it and instead tell you something about convolutional neural networks, which are basically the best thing you can do for picture recognition. Say I want to classify pictures, digits for example, or dogs and cats and elephants. I have a picture of a one, and another picture of a one which is just translated. What's the difference? And is this still a one if it's just a rotated version? My algorithm that I wrote there is extremely stupid, because it doesn't see the translation, it doesn't see the rotation. I'm able to recognize it because, in my brain, I recognize that there is some symmetry there: some kind of rotational symmetry, some kind of translational symmetry. And people do that even if they're not mathematicians or physicists. So how do you implement these kinds of symmetries in a neural network? Is it possible or not? It is actually possible for some set of symmetries, for example translation, because translation is basically something that leaves convolutions invariant, somehow; it's like a symmetry for convolutions, and this gives the name to convolutional neural networks. These are networks that take this picture and this picture and this picture and, in a sense, blur them: they apply small symmetry transformations, shifts, translations, in order to extract some kind of generality out of these pictures. So there is a pre-processing that the machine learns by itself: it learns how to extract the generalities out of pictures, the features that are invariant under small symmetry transformations.
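Here is a minimal sketch of the early-stopping loop just described, assuming per-epoch training and a held-out accuracy measurement; train_one_epoch and evaluate are hypothetical placeholders for whatever SGD pass and accuracy routine you actually use:

```python
def train_with_early_stopping(net, training_data, validation_data,
                              max_epochs=30, patience=5):
    """Stop once held-out accuracy hasn't improved for `patience` epochs.

    train_one_epoch and evaluate are hypothetical hooks standing in
    for your own SGD pass and accuracy measurement.
    """
    best_accuracy = 0.0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(net, training_data)        # one full SGD pass
        accuracy = evaluate(net, validation_data)  # held-out accuracy
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # training accuracy keeps climbing toward 100%, but once the
        # held-out curve plateaus, more epochs only fit the details
        if epochs_without_improvement >= patience:
            print(f"early stop at epoch {epoch}, "
                  f"best held-out accuracy {best_accuracy:.3f}")
            break
    return net
```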
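And to make the translation remark concrete, here is a tiny numpy demonstration (my own toy example, not the lecture's code, and it assumes scipy is available) that convolving and then translating gives the same result as translating and then convolving, as long as the pattern stays away from the image border:

```python
import numpy as np
from scipy.signal import convolve2d

# a crude vertical stroke, like a "1", placed away from the borders
image = np.zeros((8, 8))
image[2:5, 2] = 1.0

# a vertical-edge-detecting kernel
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

shifted = np.roll(image, shift=2, axis=1)  # translate the stroke right

# convolve then shift, versus shift then convolve
a = np.roll(convolve2d(image, kernel, mode="same"), 2, axis=1)
b = convolve2d(shifted, kernel, mode="same")

print(np.allclose(a, b))  # True: the features just move with the digit
```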
And I put in the drive also the code which does this, if you want to look it up. There's also a discussion of this in the notes by Michael Nielsen. This is basically the last stage of these kinds of algorithms, which perform very well also because they're able to recognize these kinds of symmetries. I'm sorry, I don't know very much about it myself to tell you more, but this is it. Any questions? Any more questions? Okay. Thanks.