So today we will continue; hopefully we can get to the beginning of deep learning, such that we continue with deep learning after we come back from the reading week. Since we lost the Tuesday anyway, I will use the tutorial time today. We have a long lecture; maybe around 6:05, 6:10 we make a short break such that nobody falls asleep. Just a warning that if I make mistakes today, that's because I did not sleep last night. I was lying on the floor of some airports because all flights were delayed. So I have a good justification if I make mistakes on the board today. Just pay more attention. Okay, so we start talking about the XOR problem. And we said that if you have, so here you have your zero, zero, you have one, one. Then you have your zero, one, and you have your one, zero. And we said there is no single line that can separate these two sets. Basically you need something like this: you need a line here and you need a line here. And then you say this is class one and this is still class one, and everything that is inside is class two. So you can take it to a higher level to make it a little bit more interesting. So this will be class two. The XOR problem is an academic example, of course, but you can take it to three digits, five digits, ten digits, 20 digits. Well, just to show the problem: sometimes we cannot separate, so it cannot be separated with one line. So there are difficult problems that, to me, we can still solve with lines. But we need more lines. So is that what you're saying, that you can solve a non-linear problem? If you use many lines, yes, because we want to separate things against each other. So if you use more than one line, we may be able to really do something sophisticated. So if you sit down and play with it just manually, you see that if X1 and X2 are your digits, each one of them can be zero or one. So you have two Boolean variables; they come in, they can be zero or one. And just as a Gedanken experiment on paper, we want to see how something like this would look. And let's say I use two of those artificial neurons or perceptrons. And of course, the inputs will be connected to both of them, as we learned that connections are everything in neural networks. And then we somehow get those to be combined by another neuron, aggregate them, and give something out. And we want: if I have my X1, X2, so if this is zero, zero, Y will be zero; if this is zero, one, Y is one; one, zero is one; one, one is zero. So that's basically what the XOR problem is: recognizing in which cases you differ and in which cases you are similar. Which is a very good distance measure in many, many feature calculation scenarios. As a matter of fact, we are at the moment spending a lot of time doing distance calculation with XOR; it's a fast operation. And you see sort of the non-linearity of it in the way that it works. So okay, of course, we have some weights here, we have weights here, we have a weight here, and I may have some biases. Well, I rather have some biases, and some bias here too. And we said the rationale for the bias was: if I don't have a bias, the line given by the sum of W's times X will always go through the origin, zero, zero. So I can rotate it, but I cannot shift it. So I need the bias to shift it around, such that I can shift this anywhere I want.
Okay, so now if you sit down and say, okay, you know what, I want to put this to minus one, minus one. For whatever reason, I put the bias to 0.1 for this one. Why are you doing this? Well, let's say I just randomly did it. Even if I do it, the main question right now is: one, two, three, four, five, six. So I have six weights to find somehow that can learn the XOR problem. Well, XOR is implemented in any computer; we just compare the bits and we do it, it's not a big deal. We have been doing that for 60 years. So why do I need a neural... wow, exactly because of that. We apply new concepts, new ideas, on problems that we know very well, such that we know what is expected. So if you can do this simple problem, then I have some trust in you to continue and say, okay, what can we do more with that? So if I put this to plus one, this is minus one, minus one, plus one. And then these ones are also plus one. If I do that, I have an artificial neural network that you can now feed binary bits into, and you can do XOR. So I did some experiments on paper. I played with this number, that number, with the sign, and this works if you use a sign function as the activation f. So of course, inside the neurons, I don't want to draw everything because it gets messy: you have your summation and your function, summation and function, summation and function. Okay, so the things that come in get multiplied with the weights, we accumulate them, we put them through the activation function, which is sign: if it's negative the output is zero, if it's positive the output is one. Simple, because I'm interested in binary values. Okay, so how did I come up with plus one, minus one, and all this? Well, I experimented on paper and came up with those numbers. And it's now working. What do you want from me? Well, I want a generic form. What if I do it for three bits? Then I need probably what? Three neurons; you're getting there, yeah. What about four bits? You need four neurons maybe? So: manual adjustment of weights by trial and error. And of course, there are rules of thumb. You don't start by putting a weight of 125; it has to be small values. So keep it simple, and we start with small values and then see: should I do this, should I do that? We can do it just to make a point for a very simplistic example. We cannot do it if you have a network of even, I don't know, 10, 20 neurons. Combinatorially, it gets out of hand. You cannot do that. Okay. So this will enable a nonlinear separation via two lines. What I created manually here, with two neurons, will create these two lines. Not exactly these two lines; this one could be a little bit here, this one could be a little bit like this, but it will give me two lines. And I can separate the cases. And the neurons use the sign function, just due to its simplicity. So if I look at that, don't worry at the moment about how do I find the weights? Which is the main question for us. How do I find the weights? Because intelligence is adjusting the weights. But more importantly: okay, you are telling me if I add a neuron, I can draw a line. If I add two neurons, I can draw two lines. What about if I have half a million neurons? Wow. You can draw half a million lines. Where? Well, it depends on the architecture of the network. So, do you need half a million neurons to draw half a million lines?
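Going back to the hand-tuned XOR network from a moment ago, here is a minimal sketch of such a hand-wired network in Python. The exact weights from the board (the plus/minus ones and the 0.1 bias) are not fully reproduced here; the values below are another hand-picked set that works with the threshold "sign" activation just described.

```python
def step(v):
    # threshold activation: negative -> 0, positive -> 1
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    # two hidden perceptrons, each drawing one of the two separating lines
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires above the lower line (OR-like)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only above the upper line (AND-like)
    # the output perceptron combines the two lines: inside the strip -> class 1
    return step(1.0 * h1 - 1.0 * h2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the 0, 1, 1, 0 pattern
```

Six weights and three biases, all found by trial and error on paper, exactly as described; any other set that puts one line below the (0,1)/(1,0) pair and one line above it works just as well.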
So, if I have one neuron, and I have my X1 going into the neuron with the weight W1, and X2 going into the neuron with W2, and I have a bias input with the value of one and the weight W0, and Y comes out. This is the big picture: summing everything, putting it through the logistic function, just to limit everything such that things do not get out of hand. So, what this creates, basically, is your w0 + w1·x1 + w2·x2 line. And that creates some sort of line; I'm just arbitrarily saying this is this line. And if w0 + w1·x1 + w2·x2 is greater than zero, you belong to this class, and if the same value is less than zero, you belong to that side. So, the line itself is just a decision boundary. We mentioned that. You create a line and say: if you're on this side of the line... it's a very simplistic view of things. In spite of all the success that we're having with it, it's an extremely simplistic view of things. I just draw a separating line. So, one neuron gives me a line like this, and if I'm using a sign function, I say, okay, if you're negative, you are on this side; if you're positive, you're on this side. Keep it simple, because if I have several thousand of those neurons, things get really complicated. So, I don't want to put x squared, square root of x; I don't want to do those types of things. Just keep it as simple as possible, because naturally things get more complicated anyway. Okay, so what happens if I take those individual neurons or perceptrons, and I put them together as I did here on paper, and I made some lucky guesses and found some lucky values, especially for my biases, which I cannot justify at the moment, and voila, I can do XOR. Go through it, just put some numbers through it, see whether it works or not, and if it doesn't work, ask why it doesn't work. So, what about now I look at one hidden layer. Which means I grab, let's say, three neurons, and let's say I have my bias here as one, and I have my X1 and X2. Now I'm keeping the number of inputs the same as I had, just two inputs, but now I'm adding more neurons, so I'm adding more processors, more CPUs, and I say okay, this is my bias, this is my first input, this is my second input, and of course this is a neural network, everything has to be connected to everything. As we learned, the power is most likely, as we guess, not just in the number of neurons; it is rather in the number of connections. So everything gets connected to everything: this is the input node for the bias, the input node for X1, the input node for X2. And when you do this, you have to collect the information on the other side somehow, and for that you meet another type of neuron. It may not do much, but you take the output of each one of them, you put it through an output neuron, and you get your final output. So if you do that, basically you have one hidden layer. And why do we call it a hidden layer?
Well, for one layer it's a little bit difficult to justify perhaps: this is my input layer, so this is accessible from outside; this is my output layer, which contains only one neuron — I can hardly call it a layer, it's a layer with one neuron — and I have really a hidden layer, which means this is a layer that is not accessible from outside, not accessible from this side. When you put your numbers in — why? because again all of these connections have some weights, and you have to go through those gates, if you want to call them that — therefore we call it a hidden layer. So now that's a hidden layer of four neurons. So what does that do for us? If I have something like this, if this is my X1 and X2, so these are my inputs, my variables — it could be raw data, it could be features, representations of raw data — and let's say I have a class like this and I have a class like this; let's say I have only two classes. I'm using a neural network with one hidden layer, four neurons. Each one of them — for the sake of simplicity, let's say this is neuron one, neuron two, neuron three, neuron four — each one of them will give me a line: line number one, line number two, line number three, line number four. So the four lines will basically encircle this guy: this is inside this class, this is this class. So these four lines are separating that section for me. Of course I'm making up that example; the distribution of the data doesn't matter. What you want to get out of this example is that if I use one processing unit, one artificial neuron, I can draw a line, basically, and separate things. If I want to draw four lines, I need four neurons. So if that's the case, that means if I use a deep network with half a million neurons, you get half a million lines. Yeah, that's why they are so accurate. You can really separate stuff from each other: you draw so many lines, it becomes a curve; it's not a line anymore. Okay, so what about now we do two hidden layers? Now we are getting ambitious. Now I have two hidden layers: I have my inputs, again one as the bias, X1 and X2, and then I have one layer here — again let's keep it at four — I have another layer, let's say with three, and I have one output layer that gives everything out. So just for fun, we can go and really draw everything; again everything is connected. Sometimes for simple things like that it is a good practice just to draw everything, because that gives us an appreciation of how heavily connected neural networks are. So everything is connected to everything. And of course I still have my inputs or input layer, I have my output or output layer, I have my first hidden layer, and I have my second hidden layer. Okay, now something interesting happens. Now with two hidden layers — and this is a big statement, and that was the reason that we disappointed so many people back in the 80s and 90s — with two hidden layers, you can basically solve any problem. Yes, any problem, if you can compute it. So if I have my X1 and X2 — of course, with just two features it is very difficult to have a highly complicated problem; maybe those two features are PCA-reduced from 500 features, maybe t-SNE gave us those two features, but it's very unlikely that a problem with just two features becomes so difficult — so this is one class and this is another class. What problem with two features can get that
nonlinearly nasty? It's just an example; again, assume that X1 and X2 are two principal components that we got out of 5000, so the problem is still difficult. So if you do that, basically what you get: you can do one, two, three, four, five — how many do we have? six, seven — so now you can easily say, okay, this is my region, so you can separate it. Okay, if I have two layers, the main question becomes how many neurons per layer, which nobody has an answer for. It's not like there's an equation that we plug in and then we know how many neurons we should use. Yes, because there are things like the one example that we talked about: you need more resolution in the feature space to separate things that are not linearly separable. So this will, again, give you some possibility to separate, if your boundaries are still well posed. But if you have a nasty example like this, you need another layer of neurons. You are just taking it a level higher, to be able to create a convex region with which you can separate classes against each other. How, why exactly? I would say nobody knows; we empirically know that when we go to a second layer, we can basically solve any problem. Theoretically, any random function, yes — but with one layer, not nonlinear functions in hyper-dimensions. So you cannot solve all problems with one hidden layer, it's impossible, but we can solve every given problem with two layers, because the two groups of lines are now separate: every group of lines is adjusted separately, independently, so you can basically position them anywhere. That could be a little bit misleading, but I could draw it like this: this line, and this line, and this line, let's say, are coming from this guy. So it's not that simple, but the oversimplification may help a little bit: every layer becomes responsible for a certain region of the search space or the classification space. So any shape can be bounded by a number of neurons. Any shape — but that's a huge thing to claim. Well, claiming something that is theoretically viable is one thing; but when you go into practice and you run a difficult problem, face recognition — okay, I want to do it with two layers — well, you realize you can do digit recognition with two layers, and you can get to 98% accuracy from noisy data. But when you go to face recognition, that doesn't work anymore. What?! You told me that with two layers I can do everything. I know, I know — but can you get a million neurons here and two million neurons here?
You could, but you cannot train it. So that's a challenge — and the computational challenge has nothing to do with the feasibility of it. Again, theoretically, all you need to classify anything is two layers of neurons; the question is how many neurons here, how many neurons here. And empirically, again, the more complicated the problem, the more neurons I would need here. But even today, 2019, you cannot put one million neurons here. You can't; we cannot do this. Yes, because they are dependent on each other, but we are adjusting these weights separately. The error will come back, and we adjust this, and then we adjust this. If you have just one layer, the position of the lines that the neurons of that layer generate depends on nothing but the weight adjustment. But now we are interconnecting this, so you draw your line based on what information you got from the previous layer. It will force you to do this and this, but you still have some degree of freedom. So this blue line that I drew arbitrarily is constrained to be here because of these connections from the previous layer; it cannot be here. So we are putting some constraints on it. If you put all the neurons in one layer, they have too much freedom: they go all over the place, and then you have to push them back, and apparently it takes a lot of computational power to push them back. We still cannot do it. So you cannot have an autoencoder — which hopefully we get to talk about today — with even half a million neurons in one layer. Oh, okay, I need 5,000 GPUs — and sometimes it's not just a question of how much power you have; it simply becomes intractable. Okay, let me keep this. Important is — we could stay and talk about this a lot, and there is some formalism about it, but for me it's important that we take this away: the topology of the network and the number of neurons in each layer have a direct relationship with the difficulty of the problem that you want to solve.
You don't go with 10 layers, each layer half a million neurons, if the problem is simple. Of course not. So now, let me draw the XOR again — let me draw it three times. That's the reason we use a simple example to make a point: when you go to complicated shapes, the details will overwhelm us. So: I want to classify this with one neuron, I want to classify this with two neurons, I want to classify this with two layers. Well, if I use one neuron, I get this, which says this is a class and this is another class — which of course misclassifies; I'm missing something. So one neuron cannot solve the problem, because the problem is more difficult than what we can solve with just one boundary decision — decision boundary, sorry; you see, the flight fatigue is kicking in. Now if you use two, you may get something like this — and the lines could be in any position, it doesn't matter — so you are saying this is a class; two neurons actually solve the problem. Okay, so XOR in two dimensions: two lines do it, I don't need more than that. But you get enthusiastic and audacious: no, no, no, let's go with two layers, let's put 200 neurons in them and see what we do with XOR. Why do you overkill it? So I go with two layers, let's say each of them with four neurons, and I get this, this, this, this, and I get this, this, this, and this. Each of these layers will separate a different corner, section, fragment, region of the feature space. So two layers solve the problem too, and more accurately. But do you need that? I go with two layers, each layer four neurons, every neuron draws a line; now I have the separation per layer, the freedom to put four lines here, four lines here, to separate stuff — and it looks much better, because it says this is this cloud and this is this cloud, and I have nicely separated them. The only problem is: if this is not XOR — this is again a made-up example, just don't take it too seriously — if there is a problem where suddenly another example happens here, the two lines are okay, but the same issue happens here: your two layers are wrong. What?! I thought using more is better. Well, less is more. Since you used a lot of machinery, you actually overfit: your problem is too small for that architecture, which means you cannot generalize compared to two neurons. So a new example comes — of course for XOR there is no such example, this is the one, one, there is none — but imagine it in a different feature space. It could be that if you separate stuff more accurately, which means you have a big network, new unseen data comes and boom, you are wrong: you drop from 98% accuracy to 52% accuracy. Is that clear? I just want to make that point before we go deeper into how these things happen. Okay, good. How do we learn? For the XOR we did it manually. We know that we have to put in some layers, put some neurons in them — we have some sense for it: take some neurons, not too many, not too few; I have no idea, perhaps I have to experiment with it. Yeah, all that is the nice daily workflow of every machine learning algorithm designer. So how do we learn?
So, the learning algorithm. We usually do that for MLPs, which are multi-layer perceptrons — the simple structure that I drew. You have two layers, so you have multiple layers, more than one, and each one of those neurons, let's call it a perceptron; so you have multi-layer perceptrons, and the entire network is called an MLP. We also call them feed-forward MLPs. We will see that it matters how you move through artificial neural networks; for the time being, and perhaps for the remainder of the course, we are assuming we are working with feed-forward networks: whatever I put in goes through the network in one direction, and there is no cycle to come back. You cannot come back; once you have put it in, it has to go through and come out. Big assumption, because if something comes back, it will just mess up everything I'm calculating. So it has to be feed-forward. This small property of being feed-forward is very, very important for us: there is no artificial neural network, as we want to talk about them, if we cannot make the assumption that things cannot come back. For MLPs, which we also call feed-forward MLPs, we need to solve the credit assignment problem. So that's learning. We have some abstract idea about a neuron, what it means; a synapse is a weight, which is either a small value, meaning there is no activity, or a high value, meaning there is activity. And apparently when we put some of those perceptrons together, we can do something: we can learn some lines; put more together, you have more lines; adjust them. But I'm getting worried, because if I put even 500 of them together, I have many, many weights that I have to adjust. How do I do that? Well, that's called the credit assignment problem. Again, let's go with the example of one X1 and X2, and we have our three neurons, and these three neurons go to an output neuron. Let me again go through the suffering and connect everything to everything, and let's say these are the weights — these are the weights of those connections — and you calculate the output. So you put some inputs in here, and you get an output, whatever; you get an output of 10.2. So X1 is five, X2 is 6.3, and the weights have some values; we randomly initialized them. Is that clear? This is just a bunch of additions and multiplications, right? Assign some random values to those blue dots, put these numbers in: they get multiplied with this, get added, go to a function, come out, get multiplied with this, get added, go to another function. It's just a bunch of additions and multiplications; nothing is magical about that. But the structure in which they are connected together — that brings the magic. So okay, I put some numbers into it, they go through my randomly selected weights, and I get something out. And then you look and say, okay, what should the output be for five and 6.3? And you go into your magic of data — big, labeled data — and you read there that Y should be five. So if X1 is five and X2 is 6.3, Y should be five. You read it from the table: that's the data that the customer gave us, or the data we downloaded from Kaggle, or the data that we measured. It doesn't really matter.
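As a sketch, here is that feed-forward pass for the 2-input, 3-hidden-neuron, 1-output network on the board. The inputs 5 and 6.3 and the desired output 5 come from the lecture's table; everything else — the random draws, and a linear output neuron so the result is not squashed into [0, 1] — is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

x = np.array([5.0, 6.3])              # X1 = 5, X2 = 6.3 from the table
y_star = 5.0                          # desired output from the labeled data

W_h = rng.normal(0.0, 1.0, (2, 3))    # randomly initialized input -> hidden weights
b_h = rng.normal(0.0, 1.0, 3)         # hidden biases
W_o = rng.normal(0.0, 1.0, (3, 1))    # hidden -> output weights
b_o = rng.normal(0.0, 1.0, 1)

def f(v):
    # logistic function squashing the weighted sum
    return 1.0 / (1.0 + np.exp(-v))

h = f(x @ W_h + b_h)                  # multiply with weights, add, squash
y = float(h @ W_o + b_o)              # linear output neuron (illustration only)

print("network says:", y)             # some arbitrary value
print("error:", y_star - y)           # the gap we now have to assign blame for
```

Nothing but additions, multiplications, and a squashing function — and an output that, with random weights, has no reason to be anywhere near five.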
Okay, that's a big difference. That means at the moment I'm stupid as a network, because instead of five I generated 10.2. So there is an error, an error of 5.2. That's a big error, because this is just one instance. Now what happens if I add four and 11, and 2.3 and 6.5? I add all this, I send it through, and for each one of them I get an error. The error accumulates really fast; before you know it you have an error of five million. What should the error be? Zero. Wow, okay. So the error should go towards zero; at the moment, for one instance, I have an error of 5.2. Now, the question is this: this is my error. I have to come here, look at these guys, and ask: whose fault is it? Guys, you are the immediate synapses whose values I am adding and pushing through the function to get 10.2, so you are messing things up. Two of them are shy; they don't say anything. One of them is a really fresh guy: excuse me, what did we do? Look, these guys at the beginning, at the door of the university, they gave us crappy numbers. Oh, okay, now I have to go back here. Wow, now I have to deal with many noisy people and ask them: whose fault is it? And they start fighting among each other and say: this is your fault, this is your fault. Well, this is called the credit assignment problem — or blame assignment problem. Who gets blamed? Or, if the error was really small, who gets the reward? Who did the job? You cannot simply ask these guys. They have some values: this is 0.1, this is 0.23, this is 0.56, 0.65, whatever. Now, because we have an error, we have to go and change these values. But how should I change them? This is a crossroads. Now it is the early 80s, and Rumelhart and Hinton are sitting in an office, writing with chalk on the board, asking: how can we find an answer for this? Let's play with randomness: we randomly add an epsilon — and it doesn't work. For a small problem we can show it works — random sampling, Monte Carlo, fantastic — yeah, for a small problem. We need a deterministic approach to learn, to adjust these weights, such that I put 5 and 6.3 in again, and suddenly I don't get 5.2, I get 3.1; the error comes down. Oh, okay: I have to figure out a way to bring the error down — but not just for 5 and 6.3, also for 4 and 11, and 2.3 and 6.5, for all of them at once. You cannot adjust it just for the first set of inputs; learning has to look at the data as a whole, at any given time. So how do we do that? This is by far the biggest question. If we don't get this, we will not appreciate any learning technique, because this is very important: how we propagate back. And I would have to ask one of the colleagues who said it first — Rumelhart or Geoff Hinton, I don't know which one of them said it first — because it naturally rolls off your tongue: you propagate the error back into the network. This is error backpropagation. Learning is: learn from your error, and go back and make adjustments such that you do not make a big mistake in the future. Okay. So, intelligence is updating the weights, which is solving the credit assignment problem. For us that means propagating the error back into the network. You give me an error, I push it back in there; I cannot deal with this.
You made this mistake; you take care of it. So push it back into the network. If we have our neurons, and we have our X1 and X2 and X3 — it doesn't matter, I don't want to draw all of them, I just want to draw some of them; you get why. So you have your feed-forward procedure: the data gets pushed through the network. You put one and two and three in, and you accumulate the error on the other side. I cannot adjust one by one — why not? Well, why should I not do that? Why should I not adjust one by one? I have one million rows in that table. Send one row — X1, X2, X3, up to X10 — then measure Y and make some adjustment for that measurement. Why would this not be a wise approach? We want to get an overall picture of the data. So the question is: how can I minimize the error for everybody at the same time, not just for one row? We have to learn the pattern, and the pattern is embedded in all measurements, not in individual measurements. So now, after we propagate that, you get your comparison with the desired output. You have your error: you compare what you calculated with the desired output, and then you go in and you do your backpropagation. So now you start putting the error back into the network. It's like you throw garbage at me and I throw it right back at you: deal with it yourself. And you see, I keep doing that; every time, you throw a smaller piece of garbage at me. Oh, this guy is really persistent. So how do we do that? Well, everybody cooks with water. I have to look at the total error. The total error is, again, the sum over all examples: put them in feed-forward mode through the network, measure the error on the other side, accumulate that error. First time, the error was 5.2; second time it was 3.6; 5.2 plus 3.6, add them up. Just add up all the errors. I want to see how badly I am messing things up overall. And then we look at the weight updates. If you do that, you may see any shape of error function: you measure it, you plot it, you see it. Of course, what we want is this guy: we want the global minimum. Why am I talking about the global minimum? Because this is also a minimum, this is also a minimum, this is also a minimum. I don't want a local minimum; I want the global minimum. The behavior of this curve, again, is hyper-dimensional; we don't see it that easily. So it's not like you're telling me: for God's sake, just look at it, you see it. Well, we don't see it; the algorithm is supposed to see it. So we want to find that one. But how do you do that? When you randomly initialize your network, you may end up anywhere: you may end up here, you may end up here, you may end up here. If you are lucky, you may end up here — by random initialization. That means: I had 500,000 weights; I generated 500,000 numbers; went through the network, the first feed-forward pass; my error is that much. If I do it again, I will be somewhere else. And then, ideally, if I'm here, I want to go in this direction and get here. So I want to descend. I want to descend toward the global minimum of the total error of the network when it processes all data points. And I will descend based on the gradient, because there is nothing else. I look at the gradient of the error — am I going up or am I going down? — while the error should go down. So I want to go in the opposite direction of the gradient. Okay. It should work, theoretically. And it does. So take a step.
Take a step in the direction resulting in a maximum decrease of the network error E. This direction is the opposite of the gradient of E, E being the total error. I don't want to have the maximum error, so I go in the opposite direction, and that opposite direction should enable me to descend toward the minimum. Can it happen that I land here and I descend into a local minimum? Yes, it does happen. How should I know? Not that easy, but we will talk about it a little bit. Getting caught in local minima has been a historical problem of neural networks, but now we have some mechanisms to counteract that. Okay, so let's get serious. Does it mean so far we have not been serious? Yeah, sort of. Now I want to have some equations. I mean, it's great that we have TensorFlow, but it's not bad if we know some of these equations. If you look at the code — it doesn't matter what it is, is it PyTorch, is it TensorFlow — and the equation does not immediately come to your mind, you have a big deficit. You should see the equation in the code: oh, this is the total error, the sum of these differences. Because in the code everything looks different. So, okay, intelligence is updating the weights. Should we take a break and then continue? Yes? Okay, so it's 6:25, we continue at 6:35, okay? Is that good enough? Five, seven minutes? Okay, we continue. Because if I go into this, then we have to go for at least, I don't know, 45 minutes. So, we want to get back to the weight update, and now we are talking specifically about the weights: w sub ji, because now, for every layer, every connection and every neuron, we have an index to manage, so you will have a matrix, and with multiple layers you have a tensor to work with. But let's keep it simple. The new weight is the weight that I had plus some delta. Of course, this is pseudocode programming, because otherwise I have to write here: this is n plus one, this is n, this is n. I just don't want to write that anymore. So I'm doing those calculations and adjustments in the different iterations of every epoch, as we will learn to call them. So the entire thing becomes: whatever weight you have, add or subtract something to it or from it, and hopefully we're going to be fine. What we have to solve is the credit assignment problem, which has to have some intelligence to it, such that it descends toward the minimum error. So this delta w sub ji has a minus, because you told me I have to go in the opposite direction of the gradient of the error. And let me keep my eta, my learning rate — an introduced parameter that has nothing to do with the network. I just introduced it; I want to have more control over it. And everything now depends on the gradient of the error at the local neuron. This is one of those things that is somewhat inaccessible to young researchers: you say "credit assignment problem, solve it," and then you say "solved" after you write two pages of equations, and then people ask: okay, where did you solve it? Right here, we solved it. The credit I give you is in the opposite direction of the gradient of the error at the local neuron. So you can only blame yourself, because this is exactly as much as you are contributing to the error — it is absolutely fair.
And I want to get to the solution rapidly, so I go in the opposite direction of the gradient. And I'm a control freak, I want to have some control, so I'm introducing eta, and eta is a number between zero and one. I just want to have more control; there is no rationale beyond that for eta. Because sometimes I want to make big changes, and sometimes you tell me: yeah, make big changes, and I say no, I don't make big changes. You have to be very careful when you do that, because then you're interfering with the intelligent scheme that you have designed. Well, okay. So, the input of the jth neuron — and we're not talking about input neurons here, but any internal neuron — v sub j is the sum of w sub ji times y sub i, with i going from one up to n. Anything that comes to me is the output of another neuron, hence y sub i, right? So anything that is coming to me is the output of another neuron. Is that clear? So this is the y sub i, and I have some weights w sub ji; hence, this is what I get. My input is what the others created as output. This is the connectedness of the network. So don't get confused about why his output is my input; this is exactly what we are putting in place. So, for one of the neurons — and at the moment I'm excluding the input neurons, I'm just talking about internal neurons — what you get, the input of the jth internal neuron, is the sum of weights times what is coming from the neurons before it. Okay. Good. Now, going back to high school, we use the chain rule. High school was fun, wasn't it? So, the gradient of the error with respect to the weight w sub ji we can break down into the gradient of the error with respect to the input of that neuron, times the gradient of the input with respect to the weight. I'm really looking at what is happening locally, but I have dependencies coming from before, and I have to consider that; so I use the chain rule. This is what I'm interested in: how would the error of the network change if the weight at this location changes? The error changes because of the input of that neuron, and the input of that neuron changes because of the weights. It's intuitive — well, you need to look twice, and then you see the intuition. Okay. Good. So, the local gradient of the jth neuron, which I call delta sub j, is then minus the gradient of the error with respect to v sub j. That's the local gradient, and again, I manually added the negative sign because I want to go in the opposite direction of the gradient: I don't want the error to grow, I want it to decrease. Now, from the gradient of v sub j with respect to w sub ji being equal to y sub i, we get — what do we get? We are going really slowly here, to establish something that we can apply. From that, we get: delta w sub ji is eta times delta sub j times y sub i. Now I have everything: I have the output, which you are giving me; I have the local gradient; and I have a factor for me, the control freak, to exercise a little control. So this is our delta rule. This is the first rule that we start with — the simplest way you can adjust the synapses in a network. Not a super complicated network, but for an average-size network with several thousand weights, you may have luck with this simple delta rule. So we just apply this; we keep applying this, because now, this is changing.
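Collected in one place, the chain-rule derivation just walked through reads, in the same notation as the board:

$$
\Delta w_{ji} \;=\; -\,\eta\,\frac{\partial E}{\partial w_{ji}}
\;=\; -\,\eta\,\frac{\partial E}{\partial v_j}\,\frac{\partial v_j}{\partial w_{ji}},
\qquad
v_j \;=\; \sum_{i=1}^{n} w_{ji}\,y_i
\;\;\Rightarrow\;\;
\frac{\partial v_j}{\partial w_{ji}} \;=\; y_i ,
$$

$$
\delta_j \;:=\; -\,\frac{\partial E}{\partial v_j}
\qquad\Longrightarrow\qquad
\Delta w_{ji} \;=\; \eta\,\delta_j\,y_i .
$$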
Do you see the dynamic? At any neuron, I will get different delta w's; at any neuron, I get a different delta value to adjust the weights. And then I use that delta, put it on the current value, and create the new synapses. So now you see those light bulbs getting brighter and dimmer, letting the signal go through or not. Okay. Now, one question we have not answered yet. I said pay attention to the internal neurons — but what about input neurons and output neurons? So what is this guy, this delta sub j? It can actually be two things; both cases are summarized below. Basically — and now I write slightly different notation, because I don't want to use the gradient notation; go functional — the logistic function at the jth neuron, f of v sub j, and the derivative of that. So build the derivative of that logistic function, because the sum of weights goes through that function; I need the derivative of that function. So we can only use logistic functions that are differentiable. Why? Because I need to know how they change. You cannot give me something that I cannot build the derivative of. Times y sub j star minus y sub j that you calculate. So y sub j star is the desired output, and y sub j is what we calculate. So this delta will be the derivative of the logistic function of the jth neuron, multiplied with that difference. Now, if what you calculate is no different from the desired output, this will be zero: no change. So this is the case if j is an output neuron, okay? What about if I'm not an output neuron? In that case, I still need to build the derivative of the logistic function at that jth neuron, which is now not an output neuron, it's an internal neuron, and multiply it with a sum — because now we are getting a lot of stuff — the sum of delta sub k times w sub kj, where k runs over the next layer. Now it's getting a little bit difficult. And we do that if j is a hidden neuron, an internal neuron. So if it's a hidden neuron, I need to know what the fault of everybody after me was — what the credit assignment for everybody else is — and then I can sum it up and weight it with the change at my location, because I want to be fair. So this is for the hidden neuron. Some of it is straightforward; some of it, I have to draw something in front of me and say, hmm, it's coming here. So I calculate the faults or the credits locally with gradients, but things are getting pushed into me as well, so I have to account for all of them too. Okay. Now, there is something we sort of skipped. Let me redraw that simple neuron: we have the sum, and we have f, and we have one, and we have w0, and we have x1 and x2, and we have w1 and w2, and y comes out. Now, we said that y is the sum of x sub i times w sub i, plus w0, if you don't use the convenient notation. But this is just a summation; it has to go through the logistic function. So the response of the neuron is the sum plus the bias — or the sum in general — going through the logistic function. So I need to know how the logistic function is behaving; I need to measure the change when the accumulated weighted sum goes through the logistic function. That's why we are looking at that. So, f of x is the logistic function.
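Pulling the two cases together, in the same notation:

$$
\delta_j \;=\;
\begin{cases}
f'(v_j)\,\bigl(y_j^{*}-y_j\bigr), & j \text{ an output neuron},\\[6pt]
f'(v_j)\,\displaystyle\sum_{k}\delta_k\,w_{kj}, & j \text{ a hidden neuron, with } k \text{ running over the next layer}.
\end{cases}
$$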
There are many such functions. For instance, f(x) = 1 / (1 + e^(−ax)), the sigmoidal function. Why do they go with functions like this? Because you can build the derivative, and beyond that, it has some nice features. Again, if you take a look at the code of any package, you should be able to see this sort of stuff. Now, if we do the backpropagation and I have 30 layers, I start at the output and propagate to layer 30, 29, 28, 27 — you have to backpropagate into the network — 26, 25, down to layer one. So the first layers are the lucky guys: they hardly get any blame. What is that problem called? If there are simply too many layers and you backpropagate the error — exactly, the vanishing gradient. People who are sitting in the front get all the blame, and by the time I get to the end of the class, they say: oh, the problem is solved. That's what happens when you have too many layers, because I have to bring the change back and solve the credit assignment problem layer by layer. Okay, this is very important for us, to show how we can simplify things to come up with a practical implementation. So, what is the derivative of this function? Why do people use something like this all the time? I have to build the derivative of 1 / (1 + e^(−ax)), which is the derivative with respect to x of (1 + e^(−ax)) raised to the power minus one. I'm ignoring a for the moment, because a will just be a constant factor; I don't write it anymore and I'll add it back at the end. So, chain rule again: the derivative is −(1 + e^(−x))^(−2) times (−e^(−x)), which is e^(−x) / (1 + e^(−x))². And this is 1/(1 + e^(−x)) times e^(−x)/(1 + e^(−x)). Now write the numerator e^(−x) as (1 + e^(−x)) − 1 — add plus one and minus one, just for convenience — so we get 1/(1 + e^(−x)) times (1 − 1/(1 + e^(−x))). You are wondering why you are going through this torture: because this is then f(x) times (1 − f(x)). So the derivative of f(x) is f(x)(1 − f(x)). That's why the torture. And why is that important? It makes things a lot easier if your derivative can be rewritten in a form that is more friendly; now we can reformulate stuff much more easily. We are going through the historical route here. Today, you are doing very different things; you are not using sigmoidal functions most of the time, and building the derivative of a max — as in ReLU — is really easy. But this is the historical route, and you have to have at least a rudimentary understanding of what we did and how we got here. "I just take what we have had for the last two years and I don't care what we had 10 years ago" — that would be a shallow understanding, not a good approach. Okay. And I'm not saying that we are not using sigmoidal functions anymore — oh, we do, heavily — but other transfer functions have caught up. So, bringing back the a that I ignored: f prime of x is a times f(x) times (1 − f(x)). Which means that f prime of v sub j is then a times y sub j times (1 − y sub j). Gorgeous, simple, practical.
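A quick numerical sanity check of the identity just derived; the value of a below is an arbitrary choice.

```python
import numpy as np

a = 2.0                                          # arbitrary steepness factor
f = lambda x: 1.0 / (1.0 + np.exp(-a * x))       # logistic function

x = np.linspace(-3.0, 3.0, 13)
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6     # central finite difference
analytic = a * f(x) * (1.0 - f(x))               # f'(x) = a f(x) (1 - f(x))

print(np.max(np.abs(numeric - analytic)))        # tiny (~1e-10): identity holds
```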
So, I like that formula. As somebody who from time to time likes to code a little bit, I like to have things like that; it keeps things really in place. So, unless you convince me with other transfer functions, I will stick with the sigmoidal one. Yes? Sorry, again — the second line. Why is it not...? Well, because this is the jth neuron. This is my output. Make me understand the question: what is the question? F of x — I see. Oh, okay, okay, sorry. Because again, the derivative of f is this, and f of v sub j is the output y sub j. Okay? Okay. Now, if I put all this together, I want to come up with backpropagation. Now I want to have something real — and having something real is not easy at all. So, you start by saying: okay, iteration number n equals one; initialize all w's randomly. That's the sort of pseudocode. Take my simple example for XOR: we had, whatever, six or seven weights; just give a random number to each of them. So, randomly initialize the weights. And now we go inside the loop; now the learning starts. While your stopping criterion is not satisfied — now you have to tell me how you want to stop, because you could go on forever — go inside another loop. Now I have a for loop: for each example (x, y star) — for each example where I have the input x and y star, which is my desired output — go inside the loop. So, this was one line, this is another line; I'm at the second line: run the network with x and get y. This is your feed-forward. And then: update the weights by backpropagation. Now you are propagating back into the network. And then we have "end for", and we have n equals n plus one, and we have "end while". That's it. So, the problem was never the algorithmic structure of learning, because we knew we have to go into a loop and do something again and again and again to make sure that we can generate the desired output. This is supervised learning, and the supervision comes from the y star: that's the supervisor. It's telling me: you have to be this. And then I don't match it, and it says: try once more, try once more. Each pass through this simple — and most likely incomplete — pseudocode is one epoch. One epoch means you go through it once, process all data, and say: when I started, my error was 535; now my error is 122. Okay, not good enough. Second epoch: drop to 85. Third epoch: drop to 25. Fourth epoch: drop to 7. After 50 epochs, my error is 0.01. Wow, great, stop. But it's very important to know how you want to do this. Is it clear, this thing where we push the data — we read it from that gigantic table and we push every row through the network — we calculate the error, we accumulate the error, and we push it back into the network by adjusting those delta w's? That's it; nothing else. But how? Everybody who has made his or her hands dirty and done some programming knows: when you start, after you say import this, import this, okay, for i equals one to 5,000, do... hmm, do what? Let's go have coffee and come back; I don't know how to do this. Programming is a different world, because you have to have that structure in your mind, and you have to go back to the board and say: ah yeah, I have to do this first.
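Here is that pseudocode turned into a minimal runnable sketch: a 2-3-1 sigmoid network learning XOR with the delta rule, updating after each example. The architecture, seed, eta, and the stopping threshold are one workable choice for illustration, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_star = np.array([[0], [1], [1], [0]], dtype=float)    # desired outputs

def f(v):                                   # logistic function
    return 1.0 / (1.0 + np.exp(-v))

# initialize W(n) randomly, with small values
W1 = rng.normal(0, 0.5, (2, 3)); b1 = np.zeros(3)       # input -> hidden
W2 = rng.normal(0, 0.5, (3, 1)); b2 = np.zeros(1)       # hidden -> output
eta = 0.5                                               # learning rate

for epoch in range(20000):                  # while stopping criterion not satisfied
    total_error = 0.0
    for x, y_star in zip(X, Y_star):        # for each example (x, y*)
        h = f(x @ W1 + b1)                  # run network with x ...
        y = f(h @ W2 + b2)                  # ... and get y (feed forward)
        total_error += float((y_star - y) ** 2)

        # update weights by backpropagation, using f'(v) = y (1 - y)
        d_out = y * (1 - y) * (y_star - y)          # delta of the output neuron
        d_hid = h * (1 - h) * (W2 @ d_out)          # deltas of the hidden neurons

        W2 += eta * np.outer(h, d_out); b2 += eta * d_out   # delta rule
        W1 += eta * np.outer(x, d_hid); b1 += eta * d_hid

    if total_error < 0.01:                  # stopping criterion: total error small
        print(f"converged after {epoch + 1} epochs")
        break

for x in X:
    print(x, float(f(f(x @ W1 + b1) @ W2 + b2)))    # close to 0, 1, 1, 0
```

If it stalls, that is the local-minimum story from earlier in action: a different seed or learning rate usually gets it out.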
So, I need two modules: I need to read the data, and I need a training module. The training module needs to calculate the derivative. You bring in structure to implement it; implementation is a whole different world. So, algorithm designers may not generally be good programmers — they may be, or they may not be. Algorithm designers have a more abstract view of things, and they see really good ways of "one, two, three, four, do it and you should get it done," but they cannot tell you the individual steps. That's for a good developer, who takes the algorithm, implements it, and makes it work. So, how do we do backprop? Now that we know each other, we can use the cute name: we don't say backpropagation anymore, we say backprop. So, backprop in batch mode. In batch mode, we update the weights only after all examples have been pushed forward through the network. Which means: w sub ji at n plus one is the same value w sub ji at n, plus — and now I don't have a single delta here — the sum of delta w sub ji over x, for x in the set X of training samples (a code sketch of this accumulation follows below). So we accumulate all the errors. We are so patient: we don't run to the guy and say you made a mistake; let him make a mistake today, tomorrow, the week after, and just add it up. If I do batch mode, I accumulate the blame every week, every month: I'm watching what you are doing wrong, and at the end of the month, instead of screaming at you for five seconds every day, I come and scream at you for half an hour. That's way more effective, because it will psychologically damage you, and you'll say: I don't want to see this guy anymore. It would be nice if you could do that with humans. Oh my God — just come up with the list: every week, come up with the list, you did this, this, this — tomorrow I'm a perfect human being. Okay, not going to happen with humans, but in computers we can do this. We accumulate the blame or credit — maybe partly blame, partly credit, because for some instances the neuron did well and for some instances it did not. You add them up, some of it cancels out, and you do one sweep of backpropagation at once. That's batch mode. And batch mode does not necessarily mean all examples: I can go with a batch of examples. For images, for example, take 128 images — why is it always a power of two? think about it — put 128 images through the network, 128 different face recognition cases, get the error, come back, make the adjustment. Grab another 128 images, go through the network. Not one by one. That's what batch means. Okay. So, training is epoch by epoch: every epoch is a set of iterations — every epoch could be 5,000 iterations, 10,000 iterations, as many as we have set for it — and every epoch gets rid of a chunk of error. And of course, the error has to generally go down. So, how do I know I should stop if I do this? Stopping criteria. How do I know I should stop? Well, one: we can look at the total mean squared error — at its change. You add up the mean squared errors and see, from epoch to epoch: is it changing? Is it going up, is it going down, does it not change? If it's not changing, there's not much you can do. Stop. You are wasting CPU and GPU time. Just stop; the total error is not making any change.
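The batch-mode accumulation mentioned above, as a sketch reusing the names from the training loop earlier: accumulate the deltas over the whole batch first, then apply them in one sweep.

```python
# accumulate Delta w_ji over all training samples, update only afterwards
dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)

for x, y_star in zip(X, Y_star):            # push the whole batch forward
    h = f(x @ W1 + b1)
    y = f(h @ W2 + b2)
    d_out = y * (1 - y) * (y_star - y)
    d_hid = h * (1 - h) * (W2 @ d_out)
    dW2 += eta * np.outer(h, d_out); db2 += eta * d_out
    dW1 += eta * np.outer(x, d_hid); db1 += eta * d_hid

# one sweep of updates: w_ji(n+1) = w_ji(n) + sum over x of Delta w_ji(x)
W1 += dW1; b1 += db1
W2 += dW2; b2 += db2
```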
So, "the network converged" means the error is almost 0, which would be ideal. Actually, we are content with far more error than that; we never want 0. Zero actually freaks us out: if you give me 100%, then something is wrong. Give me 1% or 2% error on a difficult problem, and I'm more than happy. So: the network has converged if the absolute rate of change in the average squared error per epoch is sufficiently small. Well, that is, I don't know, anywhere between 0.1 and 0.01 — there are no magic numbers, just rules of thumb to give us some idea. So, the change in the average squared error per epoch has to be small. If it is sufficiently small, you have converged, which means you learned it, you got it, you did it. Stop. Whatever the problem was, it is solved; we learned it, we adjusted the weights. Now I can do XOR in 5,000 dimensions; now I can do face recognition, I can recognize digits, whatever the task is. So we look at the average squared error per epoch, and as long as it's going down like this, I continue. Of course I will continue if it's going down. If it's going up — oh, something is wrong; you messed up the coding part. Why is it going up? It cannot go up: we said gradient descent, opposite direction, it has to go down. Did you put the minus there? Go back. "Sorry, I got a text message — no, I didn't put the minus." A lot of the time it is the coding part, and this is the part that makes things happen, so we make some mistakes there: partially just regular mistakes, partially we don't understand the concept, so we put it in the wrong notation. Second: the generalization-based method. I like that one. I have never been a fan of looking at the total error, seeing that it's not changing, and then stopping, because this is very relative. What do you mean, it's not changing? It could really be changing a little bit, and that little bit, for my application, might be important. Or I just magnified it, looked at it, and assumed it's making a big change. So generalization is different. You can test for generalization after each epoch, and if you have adequate generalization, then stop. Do you remember what that means? We said you have your data and you break it down by a certain percentage: most of it for training and some of it for testing, and inside training, you keep some of it for validation. Now, the generalization-based approach to deciding when it is enough says: train, validate, test. What was the test result here? Oh, I get 85%. Okay, one more time: train, validate, test. I get 91%. Oh, it's possible to get better. Train, validate, test: I get 72%. Train, validate, test: I get 91.5%. And the guys are knocking on my door saying they need the GPU too, so I stop. And if I have been keeping all the results, I will go with the 91.5%, because this is the generalization on the test data — not the validation data, and not training on the test data. So I stop and say: this is the best generalization I get. Do you know how much effort that is? That means testing is embedded inside training: you train, you validate, you test; you train, you validate, you test. That's the Turing test for us — that's the way we have internalized the Turing test. So I like that a lot more than the statistical way of looking at whether the error is changing. "No, it's changing a little bit." What is a little bit? Okay.
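A sketch of that generalization-based stopping, as a skeleton only: train_one_epoch, accuracy, weights, and test_data are hypothetical stand-ins for the loops and variables shown earlier, and the patience of 10 epochs is an arbitrary illustrative choice.

```python
import copy

# Skeleton only: train_one_epoch(weights) and accuracy(weights, data) are
# hypothetical helpers standing in for the training loop and evaluation.
max_epochs, patience = 200, 10
best_acc, best_weights = 0.0, None
epochs_without_gain = 0

for epoch in range(max_epochs):
    weights = train_one_epoch(weights)        # train (validation inside)
    acc = accuracy(weights, test_data)        # test: how well do we generalize now?
    if acc > best_acc:                        # e.g. 0.85 -> 0.91 -> 0.915 ...
        best_acc, best_weights = acc, copy.deepcopy(weights)
        epochs_without_gain = 0               # keep the best model seen so far
    else:
        epochs_without_gain += 1
    if epochs_without_gain >= patience:       # generalization stopped improving
        break                                 # stop and hand back best_weights
```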
Going back to the delta rule. At this point, you should be able to sit down, review these notes, maybe do some additional reading, and after half a day say: I got it, this is backpropagation networks. So let me start. I know, I know, it's implemented everywhere, but I'm such a geek, I want to implement it myself. Implement it yourself — trust me, nobody can take that experience away from you if you do it once. It's just a weekend, for God's sake. One time, don't go to the dance club or something; just sit down and implement a backpropagation network. It's just a day or two, and the experience that you get is immensely embedded in your own brain, so you never forget it. And then you implement all the problems in the book, and you get it to work, and then you compare it with state-of-the-art code for neural networks, with TensorFlow, for example, and you say: oh my God, this is elegant. But now you understand both sides of it: you understand the theory, you understand the coding, and you appreciate using what others have put forward with a lot of effort and design. So I'm not joking: if you have time, do it. That would be extremely beneficial, and nobody can take that experience away from you. It will stay with you even if you leave AI, even if you don't do neural networks; that experience goes into the corner of the brain that is responsible for "take the abstract idea, understand it, implement it, verify it." You will apply it to buying a house, finding a partner, should we have kids, yes or no — you will apply it to anything. So, the delta rule. Let's go back to the delta rule. Apparently this delta rule — delta w sub ji at iteration n — we said is my learning rate eta, times the local delta sub j at iteration n, times the output y sub i at n. At the moment, this is all we have. This is it. This is the Hebbian rule for us: fire together, wire together — because my inputs are your outputs. So I look at this and say: is that it? Is that the magic of artificial intelligence? There's nothing more? Just calculate the delta based on some local gradients, put a minus sign there, and voila? Yes — but this is not enough. First of all, you have to realize that if my eta goes towards zero, that means no learning. If that factor is zero or goes towards zero, you're shutting down the learning. And if eta goes towards one, you are making large changes — really big changes. What is bad about making big changes? Nothing, if you have just started: make big changes, for God's sake, because my weights are randomly set; they are all crappy numbers. The random initialization will position you in the worst corner of the Milky Way galaxy, and you have to do a lot of gradient descent to get somewhere reasonable and away from any black hole — which is a local minimum for us. But if you do large changes all the time — if I keep eta at 0.95 and never change it — what happens? What can happen? Sorry? Oscillations. You make a big change. So if this is your error — okay, let me do this. You are here; you make a big change, it puts you here. You make another big change, it puts you here; another one, it puts you here. It never lets you go down to the minimum. I'm oversimplifying it, but if you make big changes all the time, you cannot really zoom in for fine-tuning. This should be intuitive.
So you should make big changes at the beginning, but as you proceed, you should become more cautious, so that you do not destroy what you have already learned. Now go slow, because now you only want to make really small changes. If you keep making large changes, the training becomes unstable and leads to weight oscillations: you wait and wait and wait, and nothing converges; you jump here, here, here, and no progress is made. We call η the learning rate. A high value of η means you learn more aggressively. Fine, but it has to be applied at the right time, not all the time. So how do we deal with this problem? We need something to counteract the learning rate. Let me write the rule again:
Δw_ji(n) = η · δ_j(n) · y_i(n).
This is what we have. And now I want to add something. I want to say: at some point, maybe you should not touch things; to stay consistent, keep part of the previous change. So I add a new factor α and a new term:
Δw_ji(n) = η · δ_j(n) · y_i(n) + α · Δw_ji(n-1).
Sometimes you need to change stuff, but sometimes you should leave the guy alone and not make any big change. The new factor α is, of course, between 0 and 1, because we want to keep things constrained, and we call it momentum. Momentum is a factor that gives some weight to the change as it was in the previous iteration. And of course I can make α a function of n, and η a function of n, and then I get a lot more control over the learning process. You see, the process is rather empirical, isn't it? We take some general pieces of knowledge, put them together, and get something that makes sense, that follows some logic; nobody can find a flaw in it. Look: sometimes I have to make big changes, and the big changes follow the credit assignment problem; I reward and punish each weight according to its contribution to the error. But sometimes I don't want to change things; I just want to keep them as they are. And I balance the two with η and α, the learning rate and the momentum. When I started out, the default was 0.5 for both: η is 0.5, α is 0.5. That is a safe spot to start. But nobody knows your case. Maybe you start η at 0.7 and decrease it to 0.2, and start α at 0.2 and increase it to 0.8, because as training moves along, the significance of the momentum term should increase: once you have learned, you should not touch things. So at the beginning the learning rate is the dominant term, and at the end the momentum is the dominant term. Okay, now we are developing some understanding of the learning process. Of course, this has nothing whatsoever to do with the way the human brain learns; backpropagation is not an imitation of that. Most of us need just one event to learn. Some of us never learn. But for the easy stuff: I put my coffee here, I don't pay attention, and it's gone; well, I learned not to put it there, and to set it down with more caution. Simple things we learn really fast, from one mistake. There is no backpropagation of error in our minds. There is another concept, called reinforcement learning, which is a more human-like type of learning, and we will talk about it after the reading week. Okay, good. So this is the generalized delta rule.
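As a sketch, here is the same update with the momentum term, plus purely illustrative schedules echoing the 0.7 to 0.2 and 0.2 to 0.8 example above; again, all names and numbers are assumptions for illustration.

```python
import numpy as np

def generalized_delta_update(w, dw_prev, delta_j, y_i, eta, alpha):
    """Generalized delta rule:
    dw_ji(n) = eta * delta_j(n) * y_i(n) + alpha * dw_ji(n-1).
    Returns the new weights and the step, which becomes dw_prev next time.
    """
    dw = eta * np.outer(delta_j, y_i) + alpha * dw_prev
    return w + dw, dw

# Illustrative schedules: the learning rate dominates early,
# the momentum dominates late, exactly as argued above.
def eta_at(n, n_max):    # 0.7 down to 0.2
    return 0.7 - 0.5 * (n / n_max)

def alpha_at(n, n_max):  # 0.2 up to 0.8
    return 0.2 + 0.6 * (n / n_max)
```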
So the delta rule plus that momentum term: that is the generalized delta rule, a more sophisticated way of operating multi-layer perceptrons. And those perceptrons, we said, have two layers, three layers; basically, that is it, we cannot do much more than that, and we run into constraints with respect to computation. Okay. What about topology? What about the topology of the network? The topology of the network is the number of layers and the number of neurons per layer. And this is entirely task dependent and mostly done via trial and error. We don't know it in advance. There is a trend that I personally find dangerous: that you just put 200 layers together and then you have it. Well, I'm not so sure about that. I still like to stick with Occam's razor, to be sure that I don't get surprised, especially if I am deploying that AI agent somewhere serious. Imagine it is a new 747 or A380 with 400 people sitting in it, and you hand over to the autopilot; then you see a very different side of overfitting if things go wrong. So for us, this is the model size: the number of layers and the number of neurons in each layer. And you have two failure cases: too small or too large. If you don't have the perfect topology, the perfect number of layers and neurons per layer, then you are either too small or too big. Too small means the problem is gigantic, nonlinear, nasty, and you are approaching it with a tiny network. You will underfit: you cannot fit the data. You cannot even get to the point where training accuracy approaches 95, 96 percent and I start sending text messages to my friends saying, I got 96. You cannot even do that, because if you are underfitting, you will not get above 50, 60 percent. And if that happens, it could also be that the problem is tough for any topology. If you are too large, it means you have a tiny problem and you are throwing the Milky Way galaxy at it. You are overfitting. This is at the moment one of the biggest concerns of everybody who is seriously working in AI: are our deep networks overfitting? Often we don't even know, which is very dangerous. If they are overfitting and we don't know it, they will come and bite us really badly when we are not looking, and usually you are not looking behind you. So how do you check? Often we can't, because with a really deep network we use every piece of data we have to train it; there is no data left to test it, to validate that it really generalizes beyond the training set. You have to deploy it, wait, and pray that it doesn't collapse. So overfitting is a huge concern for us, because the networks are big. Look, some things do not change; it doesn't matter whether it's the 2nd century, the 20th, or the 21st. The old wisdom is there: the right solutions tend to be the simple solutions. So we still want to ask: instead of 200 layers, can I do it with seven? At the moment nobody wants to ask that question, because we are just enjoying the success. Enjoy the success; but don't think that question will let us go. We have to find some answers for it. Okay. So how do we do this? How do we make sure, as much as we can? This is exactly what human learning is about: learning is not memory.
I have mentioned this many times. Somebody asks, what is the capital city of that country, and somebody answers correctly, and we say, oh, the guy is smart. No. Knowledge is not intelligence. Intelligence is creating something, coming up with an answer to a question that nobody has asked before. Memorizing stuff, the capital of this is that, the Renaissance started in such-and-such a year, this poem is by T.S. Eliot: that is not intelligence. You have a good memory; good for you, fantastic. Intelligence is being creative, coming up with something that has not existed before: concepts, abstractions, solutions, ways. So how do we check? How do we test? Again, the Turing test. Alan Turing doesn't let us go; he is with us at every single step. So how do we know the model size is right? Well, you can start with a large net and remove neurons until the performance starts to degrade, and then you stop. So now I'm talking about the topology. You start with a network of, say, 300 layers, then you make it 250, then 200, then 180, and when you go down to 170 layers, boom, the accuracy collapses. Okay, so go one step back: 180 layers is a good topology for this problem; let's stop there. See how difficult that is? People don't do this. Do you know how many experiments you would have to run? I don't think anybody even on this campus has the resources for it. How many trainings, and how much time each training takes? Checking for overfitting is very difficult. So I can start with a large net and make it smaller, smaller, smaller, find the point where the performance suddenly drops massively, and go one step back: this is the one, I don't need more than that. The simplest network I can get may still be 180 layers, sure, but it is the simplest for this problem. Or we can go the other way around: start with a small net and add more neurons until the performance becomes acceptable, and then stop. So either top down or bottom up. Start with a network of three layers, each 500 neurons; the error is enormous. Okay, four layers, five, six, ten, twelve, sixteen, twenty-five. Oh, I got to 90 percent. Stop. Don't be greedy. Part of our problem is greed. You cannot be greedy. Take what the data gives you; don't try to push beyond it. So I got 92 percent; let's go to 300 layers, then we get 99, then we win on Kaggle, then I get a job in California. It seems to be working, so why not? Well, no: don't be greedy. (Both search strategies are sketched in code right after this passage.) Okay, one more topic I want to talk about; how much time do we have? Ten more minutes. So I want to start talking about autoencoders. A really interesting type of network. In an autoencoder, you have some neurons, and the next layer is smaller than the previous layer, and then you basically start building things back up. The input is X, and, as always, everything is connected to everything; I'm just not drawing every connection. And what comes out is also supposed to be X. So I put X in and I want to get X out. X in, X out; there is no separate target y. First question, and excuse me for the stupid question: why should we do this? We already have X. You want to reconstruct something you already have? Yes. Why? Well, because if I start with n neurons, the next layer has n/2 neurons, the one after that half of that, and so on.
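Hold that thought about the shrinking layers; first, the promised sketch of the two topology-search strategies, top-down pruning and bottom-up growing. Everything here is hypothetical scaffolding: train_and_score stands in for a full train-then-validate run (the expensive part the lecture warns about), and is faked with a toy accuracy curve just so the loops execute.

```python
def train_and_score(topology):
    """Hypothetical stand-in for an expensive train-then-validate run.
    Faked with a toy accuracy curve that peaks at a 'right' depth."""
    ideal_depth = 7
    return max(0.0, 0.95 - 0.05 * abs(len(topology) - ideal_depth))

def prune_top_down(topology, min_layers=2, tolerance=0.01):
    """Start big; remove layers until performance starts to degrade,
    then go one step back."""
    best = train_and_score(topology)
    while len(topology) > min_layers:
        candidate = topology[:-1]             # drop one layer
        score = train_and_score(candidate)
        if score < best - tolerance:          # boom, accuracy drops: stop
            break                             # keep the previous topology
        topology, best = candidate, score
    return topology

def grow_bottom_up(topology, target=0.90, max_layers=50, width=500):
    """Start small; add layers until performance is acceptable,
    then stop. Don't be greedy."""
    while train_and_score(topology) < target and len(topology) < max_layers:
        topology = topology + [width]
    return topology

print(prune_top_down([500] * 20))  # shrinks to the simplest net that holds up
print(grow_bottom_up([500] * 3))   # grows until 'acceptable', no further
```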
So: if the difference between X and the reconstructed X goes towards zero, it means that from this layer in the middle, with far fewer neurons, I am actually able to reconstruct X. And if that is the case, what is that middle layer? It is a compressed version of X. Can you use neural networks to compress stuff? Fantastically well. Can they be a competitor to PCA? Yes, they can. PCA is linear; these are nonlinear. So why are we not using them everywhere? PCA can take almost anything, while this is still a neural network: you can put in 2,000, 3,000, 4,000 dimensions, but you cannot put in a million; it would be too hard to train. But if I can do it, I get a neural-network type of compression. This is based on the bottleneck concept: the bottleneck forces the network to reduce the dimensionality of the data. So my X comes in, and each next layer is smaller than the last. I can really exaggerate this: one, two, three, four, five, six, seven layers down to the bottleneck at the eighth, and then seven layers, symmetric, to build it back up, and X comes out. X in, X out. So now I have a deep network. One of the experiments people did in the late 90s and early 2000s, maybe just for the pleasure of experimenting with networks: can I do this? Say each layer halves the size: n, n/2, n/4, n/8, n/16, n/32, n/64, n/128. If you have a million inputs, by the time you reach the middle you have a million divided by 128. That is a lot of compression. So why would we need that? In some cases, after you train the whole thing, you throw away the decoder and keep only the encoder, and you hand someone the data in its compressed form. In other cases it remains an academic experiment: in my experience at least, deep autoencoders cannot always easily compete with well-established methods like PCA, simply because PCA is so practical, so easy, and runs without any effort, whereas designing and training an autoencoder takes real work. So, formally: the inputs x are, say, normalized between 0 and 1, in d dimensions; the encoding y also lies in [0, 1] but has d' dimensions, with d' much smaller than d; and the decoding is some function g(W* · y + b*). So this is encoding, and that is decoding: you push the input through a much more compact representation and then try to reconstruct it from that much smaller vector. Of course the result will not be exactly X; it will be an estimate of X. You can do this with images: put a face in, and the same face should come out; if that happens, grab the middle layer, because it is a very nice compressed version of that face. And why would I need that? For whatever steps come after: visualization, classification; compressed information is a nice thing to have. So notice: here you have W, and there you have W*. Why? Because W* is the transpose of W.
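Here is a minimal sketch of one encode/decode pass with tied weights, assuming logistic units and inputs normalized to [0, 1]; all names and sizes are my own choices. It also computes both reconstruction errors: plain squared error, and the cross-entropy form for bit-like data that gets written out just below.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, d_prime = 64, 8                       # d' much smaller than d: the bottleneck

W = rng.normal(scale=0.1, size=(d_prime, d))  # encoder weights
b = np.zeros(d_prime)                         # encoder bias
b_star = np.zeros(d)                          # decoder bias
W_star = W.T     # tied weights: the decoder reuses W transposed,
                 # so there are far fewer numbers to adjust

x = rng.random(d)                        # an input in [0, 1]^d
y = sigmoid(W @ x + b)                   # encoding: the compressed version
x_hat = sigmoid(W_star @ y + b_star)     # decoding: the estimate of x

sq_err = np.sum((x - x_hat) ** 2)        # Euclidean-style reconstruction error
eps = 1e-12                              # guard against log(0)
xent = -np.sum(x * np.log(x_hat + eps)
               + (1 - x) * np.log(1 - x_hat + eps))
```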
This is one of those tricks that we still use parts of in deep learning, because this was the beginning of it. When you generate the weights for a gigantic network like this, you do not generate fresh random numbers for every layer: you generate the encoder weights, transpose them, and use them for the decoder. The weights are tied; the weights for encoding and decoding are tied. This is a really nice trick, in the same spirit as the kernel trick. The two sides are correlated; they cannot work without each other. And of course, if I tie the weights, I have far fewer numbers to adjust, and the training becomes much easier. Again, one of those tricks that turns out, down the road, to be amazingly helpful in deep learning. So, the error between x and the estimate x̂: take any distance, Euclidean for example. And if we work with bit vectors, we can write it as an entropy:
E = - Σ_{k=1}^{d} [ x_k · log(x̂_k) + (1 - x_k) · log(1 - x̂_k) ].
This is the cross-entropy loss function. Fundamentally it measures the same thing, but it is more convenient when working with binary data. So, before I let you go, I want to leave you with some questions that we can ferment over during the reading week, so that when we come back we can dive straight into deep learning and ask: what is all the fuss about, and how does it work? All I wanted from the autoencoder today is this. It is not yet the start of deep learning for us, although autoencoders can be deep, because something like this you cannot train yet: at this point in the story, backpropagation simply does not work on it. You need some tricks, and those are exactly the tricks that were invented to make any type of deep learning possible. If it is 1995, it doesn't work; if it is 2000, it doesn't work; if it is 2006 or 2007, it may work. So: how do we train deep nets? We still don't have any magical deep network. I just had this odd concept, and people would tell me, why do you even want to do this, we have PCA. But I had fun with it. I experimented with three layers and five layers on some made-up example, and the result was amazing. So I wanted to see: can I really do it for faces? I wanted to put a face in, get the same face back, and use the middle as an encoded face for some purpose. And I realized I cannot; at the moment I can't, because any multi-layer backpropagation network with more than four or five layers cannot be trained with the generalized delta rule and backpropagation. It just does not work. We said that even with two layers you can theoretically solve any problem; in practice, the computational barrier is too big. So what is the idea? The idea is layer-wise pre-training, followed by greedy layer-wise supervised training, with fine-tuning. Some people came up with this: you cannot train the whole thing, you cannot train any network with more than, say, five layers. Fine. So we do layer-wise pre-training. What is that supposed to mean? It means: okay, you cannot train this whole network; that's okay. Can I take these two layers and train them? What do you mean?
You want the output to come from all the way at the end; so what happens if I take two layers out? Just stick with me: is it easy to train two layers? Of course, we can train two layers. So if I could figure out a way to take two layers at a time, take them out, train them, and put them back in, would you be okay with that? I don't know yet how you would do it, but if you can, fine. That is followed by the greedy layer-wise supervised training, and after that, fine-tuning. Some people don't get it at first sight, so once more: if you cannot train the entire network at once, do layer-wise pre-training, then layer-wise supervised training, and then fine-tune; and that fine-tuning is basically the traditional training. Then things work. So in order to train any type of deep network, including the autoencoder we just met, we need these tricks, and we waited almost 20 years for them. Everything else we have today is fundamentally based on those tricks. And then we also got other types of networks, other topologies, which are completely new, have nothing to do with the MLP, work in a different way, and let us do much more in complex feature spaces than the MLP can. We will work with those. So: quiz number three is coming, and assignment number two is coming. You will get a bit more time beyond the reading week, so whether you do it during the reading week or after is up to you; we will be in touch about that. Other than that, enjoy your reading week and stay safe.
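And one last sketch to ferment over during the reading week: the greedy layer-wise idea in runnable form. This is a toy rendering under my own assumptions, not anyone's canonical recipe: each stage trains a two-layer, tied-weight autoencoder on the current representation (that much we can train), freezes it, and hands its encoding to the next stage; the fine-tuning with supervised backpropagation would then start from these weights instead of random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_shallow_autoencoder(h, n_hidden, eta=0.5, epochs=200):
    """Train ONE shallow (encode/decode) tied-weight autoencoder on data h
    by plain gradient descent on the squared reconstruction error."""
    m, n = h.shape
    W = rng.normal(scale=0.1, size=(n_hidden, n))
    b, b2 = np.zeros(n_hidden), np.zeros(n)
    for _ in range(epochs):
        y = sigmoid(h @ W.T + b)              # encode
        h_hat = sigmoid(y @ W + b2)           # decode with the tied weights
        d2 = (h_hat - h) * h_hat * (1 - h_hat)
        d1 = (d2 @ W.T) * y * (1 - y)
        W -= eta * (d1.T @ h + y.T @ d2) / m  # both paths touch W (tied)
        b -= eta * d1.sum(0) / m
        b2 -= eta * d2.sum(0) / m
    return W, b

def pretrain_layerwise(x, hidden_sizes):
    """Greedy layer-wise pre-training: train two layers at a time in
    isolation, freeze them, feed the encoding to the next pair."""
    weights, h = [], x
    for n_hidden in hidden_sizes:
        W, b = train_shallow_autoencoder(h, n_hidden)
        weights.append((W, b))
        h = sigmoid(h @ W.T + b)              # frozen encoding for next stage
    return weights, h

x = rng.random((100, 64))                     # toy data in [0, 1]^64
weights, code = pretrain_layerwise(x, [32, 16, 8])
print(code.shape)                             # (100, 8): the compressed version
```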