Okay, good afternoon. I'm very happy that you all came to listen to me. I have prepared three lectures, and here they are in this e-mail that I sent to the organizers. These lectures are taken from my course on machine learning. I had to make a guess about what your background and interests were, and I made these three lectures. The first one is about the perceptron, about learning, as in machine learning, as an optimization problem, how you can do that, and various methods for that. This is very basic. So what I decided to do is actually teach very basic stuff, very rudimentary stuff, not the latest in the research, but I also wanted to connect to the research. So the first lecture is on these learning rules, and at the end I will say something about the binary perceptron, which is a perceptron with binary synapses, and how you can optimize that. That's very recent work that I have been doing. The second one is on stochastic processes: on ergodicity and Markov processes, how from that you can get to equilibrium distributions, to Gibbs distributions, about detailed balance and what it means, ergodicity theory, phase transitions, and those kinds of things. Also very basic, very rudimentary in a sense. And it connects to my most recent work, which is on quantum machine learning, which I get to talk about on Thursday afternoon. But I just heard from the organizers that that's a no-go for you guys, because you have a class here, so I'm sorry that you have to miss out on that connection. Anyway, quantum machine learning is maybe a little bit outside your direct focus anyway. And then the third lecture is on control theory.
And in particular, I have been developing the so-called path integral control theory over the last ten years, and I will try, in one and a half hours, to give you the gist of that theory, what it is, and some applications of that work. Now, I would like to understand a little bit who is sitting in front of me. Who has a background in physics, a degree in physics? Ooh, that's more than half of the class, okay. Who has a degree in computer science? That's the other half of the universe, okay. And who has neither a degree in physics nor computer science? That should be neuroscience maybe? Can you call out to me what your degree is, if it's not physics or computer science? Okay. And who of you has taken a course on machine learning? That's slightly less than half, right? Okay, that gives me a little bit of an idea. Let's get started. You probably have this information already: here's the webpage of my course, and we're going to be doing this first lecture here, supervised learning, perceptrons, and gradient descent. I also suggested some exercises; I'm sure you got that material too. These exercises are for you to program and to do, and they are quite involved. Like always, if you want to program something from scratch, it's not an easy thing, but there is a little bit of a template here that should get you started with the data, et cetera. Okay, let's go. So we're going to be talking about the perceptron. Who knows what a perceptron is? Less than half. So the perceptron is maybe the first neural network. It's a neural network consisting of one neuron. One neuron, one little ball, and it has a bunch of inputs, and that's the whole neural network. And it can do perception, because the input can be very large, and then this little unit just has to decide something. If it fires, there's a cat in the picture.
If it doesn't fire, there's no cat in the picture. That's what the neural network does. Now, this very simple neural network was cooked up by Frank Rosenblatt in 1962. He was very excited about this, and he said, well, now we have a machine that can learn, and it can be intelligent, and it can reproduce itself, and all these kinds of things. He made very, very wild statements about this very, very simple neural network consisting of one neuron. Now, of course, there was a lot of industrial interest, and the sky was the limit, and people got very excited about that. There were a lot of patents in the 60s, and it was a big hype. And then somebody found out that these perceptrons can actually only do very little. They can only learn a certain class of problems, and we will come to what they can learn. And then we got into the neural network winter, in a sense. Then other interests came up. When neural networks go out of fashion, it's not so much that they're suddenly bad, but that there's suddenly something else that's more interesting. That's what happens, and that's what happened at that time: it was the expert systems. Expert systems were very popular, in particular for medical diagnosis, et cetera. So there was this yin-yang between the two, and actually maybe there's also an interesting story to tell here. After the Second World War, computers were invented, in a sense, because the transistor was invented, and the question from the very start was: shall we make the computer a digital machine or an analog machine? There were pros for both of these camps. The digital ones work with bits, right? And the analog ones work with analog currents. And there was something to be said for both. And the people who were thinking about the early computers were also the same people who were very much interested in AI and in machine learning.
So, names like John von Neumann, who was the grandfather of our current modern computer, and people like Turing: these people were not only interested in the conventional computer, but were also very much interested in the brain. So these questions were very much in the air. And these two streams, the analog stream and the digital stream, existed actually up to, I would say, the mid-90s. And each stream had its counterpart in AI. One of the analog things was the neural networks: the perceptron was an analog thing, control theory was an analog thing. On the digital side, you had the theorem provers, the chess computers, those kinds of things, the digital world: I'm going to put things in bits and make knowledge explicit. People were also talking about symbolic and sub-symbolic. Those are very old-fashioned terms, which I think are now slightly ridiculous, but anyway, these terms were used. So when the expert systems took over in the 70s, there was this going back and forth. And then in the 80s, the neural networks came back up again, and there was a big hype. The same thing happened: multi-layer perceptrons, new patents, a lot of hype. Everybody thought it solved everything. And then the winter started again. And in this winter came the Bayesian machine learning, the Bayesian story. The Bayesian stuff got into a conference called NIPS, which has been very dominant in this whole history. And the probabilistic approach to doing knowledge representation and doing learning became very, very dominant. And then you could suddenly see that the neural network could be written as a probabilistic inference problem: you have some probability distribution between input and output. That's the probabilistic description of a neural network. And the same goes for your expert system.
You could write your expert system as a graphical model, which is a graph where the logical statements, A implies B, et cetera, become probabilistic statements: A implies B with a certain probability. And so suddenly everything became one happy family. And the difference, in my view, between the sub-symbolic analog stream and the symbolic digital stream sort of faded away into this Bayesian machine learning that is now so strong. So that was there from the mid-90s. And then in 2005, 2006, that whole thing got canned again, because then deep learning started to kick in. And, well, the rest is history, and you know all about that. So I'm going to tell you about this first neural network, this perceptron. You take an image of, say, a cat, and you do some low-level feature extraction, and then you get a bunch of numbers. And these numbers, you're going to add them up with weights; you see these weights W here. This output is a single number; it is an inner product. And you take that single number through a non-linearity: you take the sign of it. If it is positive, you say it's class one, and if it is negative, it's class two. And that is the whole of the perceptron. Now, we're going to ignore these features: for the perceptron treatment, we can ignore them, we just say that they're absent. So instead of having these features, we're going to say that we just have the input X as a vector. And we have a weight vector W, we take the inner product, and we take the sign of that. That is our whole perceptron. And the task of the perceptron is this: I have a bunch of patterns, say X labeled by mu, and I have target outputs T, also labeled by mu. And I have a whole bunch of those: a bunch of pictures of cats, a bunch of pictures of non-cats. And I want to build a classifier.
And that means, in this case, that I want this weight vector W to be such that whenever I put pattern mu in there, I get the correct label T mu. This should hold for all mu, and I want to find one set of weights that does that. So this is the learning problem. Now, we can think of this learning problem through this inner product W dot X. I forget about these arrow signs, because they're always there. If W dot X is positive, we have class one, and if W dot X is negative, we have class minus one, right? These are the two classes we have. In other words, the line W dot X equal to zero is the separator; it is separating the two classes. So the mental picture is that we have an input space and we have a line, and this line is separating the cats from the non-cats, right? And the line is a straight line, because it's given by this linear expression. Okay, here you see an example of that. You see a bunch of black points and a bunch of white points, and a line in between that's separating the blacks from the whites. So this is a good learned situation, and the learning problem is, of course, to orient this line in the right way. Now, we can do a little trick and multiply this equation on both sides with T mu. Since T mu is plus or minus one, its square is equal to one, so on the one side we just get one: the sign of the output times T mu equal to one just says that the classification is correct. And on the other side we can take the T mu inside and multiply it onto the input. And then we see what we need: if we transform our input data, so that our data goes from the pair X mu, T mu to the representation X mu times T mu, then we can work in this transformed space. This is a vector with components, and we call it Z i mu.
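As a quick numerical sanity check of this transformation (my own sketch in Python, not from the course material), you can verify that classifying a pattern correctly is exactly the same statement as the inner product with the transformed vector being positive:

```python
import numpy as np

# Sanity check of the z^mu = t^mu * x^mu trick: the pattern is
# classified correctly, sign(w . x) == t, exactly when w . z > 0
# (when w . x is nonzero, which holds with probability one for
# random continuous data).
rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.normal(size=5)            # random weight vector
    x = rng.normal(size=5)            # random input pattern
    t = rng.choice([-1.0, 1.0])       # random label
    assert (np.sign(w @ x) == t) == (w @ (t * x) > 0)
print("correct classification <=> w . z > 0, on all trials")
```

The loop, the dimension 5, and the number of trials are arbitrary choices for illustration; the point is only the equivalence inside the assert.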
These transformed vectors, which are unchanged for the data of class one but have got a minus sign for the data of class minus one, are also just vectors. And if we do that, we see that what we are left with is that the sign of this thing has to be positive. So we need that W transpose times Z mu is larger than zero for all mu. This is what we need; we have to find such a W. We have a bunch of Z's, that's our data, and we want to find a W that has a positive inner product with all the Z's. That's it. Now, as I said, this perceptron draws a line, and some problems are of that type and some are not. A simple example I can give you is the AND problem. I have two inputs, X1, X2, taking the values 0, 0; 0, 1; 1, 0; 1, 1, and the output is the AND, so it is 0, 0, 0, 1. If I make a picture of that, I get here 0, 0, here 0, 1, here 1, 0, and here, at 1, 1, I get the other class. It's clear that I can separate the AND very easily. So this is a linearly separable problem: there's a line that separates the two classes. If instead I take the exclusive OR problem, XOR, the truth table is 0, 1, 1, 0. That means that in this picture I would get here 0, here 1, here 1, and here 0. And now you see you cannot draw one line that separates the crosses from the circles; you need at least two lines. So these are examples of problems that are not linearly separable, and the perceptron can only solve linearly separable problems. Now, the perceptron learning rule is a very simple learning rule that just cycles through all the examples that you have. For one example, you start with the current weight vector W, you present your example, and you look at whether the inner product with Z is positive for that example. If it is, you don't adapt W. And if it's not positive, you adapt it in a certain way, right? And this is the simple rule, given here on this slide.
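Going back to the XOR example for a moment: in the transformed z-representation there is a neat certificate of non-separability. This is my own illustration, not from the slides; the bias column and the plus/minus-one labeling are choices I made so that the separator need not pass through the origin:

```python
import numpy as np

# A small certificate that XOR is not linearly separable, even with a
# bias: write the truth table with labels t in {-1, +1}, append a
# constant bias input, and form z^mu = t^mu x^mu.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([-1.0, 1.0, 1.0, -1.0])      # XOR as +/-1 labels
Z = t[:, None] * X

# The four transformed vectors sum to the zero vector, so the four
# inner products w . z^mu sum to zero for every w -- they can never
# all be strictly positive at once.
print(Z.sum(axis=0))   # prints: [0. 0. 0.]
```

So no weight vector w can satisfy w dot z mu > 0 for all four patterns, which is exactly the condition derived above.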
So the new weight vector is the old one plus a change, and the change has a theta function, which outputs 0 or 1 depending on whether its argument is positive: if the argument is positive, it outputs a 1, and if the argument is negative, it outputs a 0. So if Z dot W is positive, the theta function outputs a 0 and the delta W is 0. And if Z dot W is negative, it outputs a 1 and the learning will happen, right? And the learning step is just eta times Z mu itself, which is this product of input and output. Okay, so that's the learning rule. And this learning rule we can illustrate very nicely in this example. Well, first let me note that this eta is a learning rate: the amount that you adapt, the step size, so to say, in this learning step. But for this algorithm, this eta is actually completely irrelevant, and we might as well take it equal to 1. As we will see later, for gradient descent rules the learning rate is really important: if it is too large, you get no convergence; if it is too small, you get convergence, but a very slow algorithm, so there's a whole story there. But here, in this case, this eta doesn't play any role of significance. The reason is that it just multiplies: if you start with weights 0, then you see that the size of your weights will be proportional to eta. And the size of the weights is not important for the problem: whether the weight vector has norm 1 or norm 100 doesn't affect the sign, so it doesn't affect the performance. The whole size of the weight vector is irrelevant. So we might as well put this eta equal to 1, and then the result is illustrated in this figure here. Suppose that our initial weight vector is this weight vector w here. We have a two-dimensional problem, so we have a two-dimensional weight vector, and we have a bunch of training samples which are given here: x1, x2, x3.
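The scale-invariance argument for eta can be checked in two lines (again my own sketch; the random data, seed, and scale factors are arbitrary):

```python
import numpy as np

# The size of the weight vector does not matter: the sign of w . x is
# unchanged when w is rescaled by any positive factor. This is why
# eta, which only sets the overall scale of w when we start from
# w = 0, is irrelevant for this algorithm.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))         # 50 random patterns, 10 inputs
w = rng.normal(size=10)
for c in (0.01, 1.0, 100.0):          # rescale w; decisions are identical
    assert np.array_equal(np.sign(X @ w), np.sign(X @ (c * w)))
print("same classifications at every scale of w")
```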
In this picture I could not change the labels from x to z, so these x's should be read as our z's. So, I take eta equal to 1; what do I have to do? I have this initial weight vector. By the way, you can also ask more difficult questions than that one. Because I have a tendency to blast on, a good way to have an interaction, to slow me down, is to ask questions. So we start with the first weight vector, which is this one, and we have this first pattern, x1, and we have to look at the inner product. If we take this inner product, you see the angle is slightly larger than 90 degrees, so the inner product is negative, and therefore the learning rule fires. So we have the delta, which is just the z, which is just the x; we add the x and we get the new vector. That's what the learning rule does, so we get this new vector here. Now we go to our second training example, which is this x2 here, and we see that its inner product is negative, so we have to add it; we get here. Then we go to x3. It gets a little boring: we look at the inner product, it's negative, so we have to change the weights. And with this one we're done with our data set, so we go back to the first pattern, which is this one; we don't have to do anything. We look at the second one; we get a positive inner product. We look at the third one; a positive inner product. Learning is terminated. So the funny thing about this learning rule is that it converges in a finite number of iterations. As you may know, with some gradient descent rules you get asymptotic convergence, and at some point your patience runs out and you say, now it's good enough, I'm going to stop; in principle you only get asymptotic convergence. Here you get finite-time convergence, a finite number of iterations. Okay, but does this always work? No, of course this is not always the case, and in order to get there we have to build a bit of intuition.
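The cycling procedure just walked through can be sketched in a few lines of Python. This is my own minimal sketch, not the course's code; the function name, the AND data, the appended bias input, eta = 1, and the epoch cap are all choices I made for illustration:

```python
import numpy as np

# Perceptron learning rule: map each pattern to z^mu = t^mu * x^mu,
# then cycle through the data, adding eta * z^mu to w whenever
# w . z^mu <= 0, until every pattern gives a positive inner product.

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """X: (p, n) input patterns, t: (p,) labels in {-1, +1}."""
    Z = t[:, None] * X                  # the z^mu = t^mu x^mu transformation
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for z in Z:
            if w @ z <= 0:              # misclassified (or on the boundary)
                w = w + eta * z
                updated = True
        if not updated:                 # all inner products positive: done
            return w, True
    return w, False                     # a non-separable problem ends here

# The AND problem, with a constant bias input appended so the
# separating line need not pass through the origin.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t_and = np.array([-1, -1, -1, 1], dtype=float)

w, ok = perceptron_train(X, t_and)
print(ok, np.sign(X @ w))   # prints: True [-1. -1. -1.  1.]
```

On the XOR labels (-1, 1, 1, -1) the same function returns False: the loop never reaches a pass without updates, which is the non-convergence on non-separable data discussed next.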
Here, we need a w which has a positive inner product with all the z's. So if we look at the worst one, the minimum of this inner product, and we make that as large as possible, we're doing well, because then we push up everybody: if the smallest one is positive, then they're all positive, and we have a learned solution. So one criterion would be to define this quantity d as a function of w, and find a w that maximizes d. There is a 1 over the norm of w here just to emphasize that this quality of the perceptron doesn't depend on the size, the norm, of the weights, so we can divide that out, right? Now, here you see two cases where the solution is positive. In this left case, all these vectors w have a positive inner product with all the data points: at one edge of this cone the worst inner product is just barely non-negative, and if you turn the vector all the way to the other edge, it is again just touching. So all of these are good solutions, and there is a best one, here in the middle, which has quite a good positive inner product with all of them. So you have a large positive d in this case for this best solution. In the other case, you see that the number of solutions is actually much smaller, because if I go out of this narrow cone, I will always violate one of the constraints. You see here that the best solution is like this, and it has a very small d, because the worst pattern is still very close to giving inner product zero, right? So you have problems where d is significantly larger than zero; you have marginal problems where d gets close to zero, which you can just barely learn; and then there are of course problems where the best solution has a d that is negative. You can also have problems which are not linearly separable, for instance this exclusive OR problem; then your best d is
negative. Okay. [Question from the audience about linear separability.] Yes, yes, but not in the original space. If you can transform it... so what's your question? Well, in this case it's both, right. What did I say? Yes, X. So, whether you put these phis in there or not: the picture is, you have some input X, you transform it to a phi of X, and then you do a linear... let's put it like this. You have here your input, the different X components, right? You make your features, and in principle this first feature depends on all the X, and all the features depend on all the X. So you get this transformation, which may be nonlinear, and then you take a linear combination of these things: you sum them, and then you take a threshold of that, right? So this is the picture, and I'm just saying, let's suppose that we have done this pre-processing, and call the phis, whatever they are, our X's. But you probably understand that, and I still don't understand your question. Well, maybe I'm not sure what I meant either. What is really important for the perceptron learning rule, in this context, is the fact of being linearly separable. That means: are the data such that there is a solution with this property for all patterns? So it is a feature of the problem; it's not so much about being linear in x or linear in w. [Remark: every problem is linearly separable in some way, after a sufficient basis transformation.] Yes, of course. In fact, in a multi-layer perceptron, what happens is that you do a whole bunch of nonlinear transformations such that, at the end of the day, you get a representation in which things are easily linearly separable. That's one way of looking at a multi-layer perceptron. We're here just
looking at this. No, it's not: it's nonlinear in the weights and nonlinear in the input. Yes, it's true, if you transform it. But often this transformation is very hard; if you do it as a learning problem, this is a hard issue. So, this perceptron learning rule: the question is, does this converge? And one can show that the perceptron learning rule converges provided that the problem is linearly separable. That is to say, take this example that I gave with the three samples: this is a linearly separable problem, because there exists a w which has a positive inner product with all three, and if I run this learning rule, I will find such a w in a finite number of iterations. If the data are such that no solution exists, for instance if the data in the z space would be pointing in all directions, then I can of course not find a w which has a positive inner product with all of them, and then the perceptron learning rule doesn't converge. The proof is quite easy, and quite curious, I would say. The proof goes as follows. Suppose that there is a solution: this w star exists and it has a d which is positive, right? That is, the problem is linearly separable. Now, in each iteration we update only if, in that iteration, the inner product is negative. And we denote by m mu a counter: the number of times that pattern mu has been used to update w. So we start with w equal to 0, say, and then it evolves in a number of steps, and the resulting w that we get is the sum over patterns of the number of times m mu that we presented pattern mu, times that pattern z mu itself, times the learning rate, right? This is the total weight vector that you get after a bunch of updates. So we have, say, 3 times pattern 1 and 2 times pattern 2, because that was what we did. If I remind you of the learning rule, this is what we did here: we add these vectors, and the number
of times we add vector 1 is m1, vector 2 is m2, et cetera. So we get this total vector. Now consider this quantity: we take our current solution w and we take the inner product with the given solution w star, and we divide by the norms. This quantity is in fact the cosine of the angle between these two n-dimensional vectors. Now, this cosine is between plus and minus 1, right? That's a no-brainer; we know that for sure. Now I will show that this ratio grows as the order of the square root of m. That is to say, if m keeps on growing, then this right-hand side will get arbitrarily large, and it will get over this value 1. So the proof is by contradiction. Let us assume that the perceptron learning rule does not converge. Then m will, by definition, grow indefinitely; that's the meaning of non-convergence: I keep on updating. And therefore this ratio will grow, and therefore I will violate this bound, and therefore I get a contradiction, and therefore my assumption that the learning rule doesn't converge is wrong: the learning rule converges. So this is the proof. Okay, so we have to show that this ratio grows, and it's very easy, because we have a numerator which we are going to show is larger than something, and a denominator which we are going to show is smaller than something. For the numerator, we take this inner product and we fill in the definition of w, which was eta times the sum, and now we get this z dot w star, which we can replace by the minimum over all patterns; so we get a 'larger than', because each inner product is larger than the minimum. And then we are left with the sum over mu of m mu, which by definition is m; we have defined m here as the sum over mu of m mu. So we get this inequality here. And now we can put in our definition of d of w star, just by multiplying by the norm of w star; so this is the same. This is the one side. For the other, we look at the change in the norm of w in one
learning step. The change in the norm squared of w is the norm squared of w after the learning step minus the norm squared before the learning step. We expand the square: the w squared minus w squared cancels, we get a double product, 2 eta times w dot z mu, which is this term, and then we get an eta squared times the norm of z mu squared, right? Now here comes the trick: in this learning step, the learning only takes place if the inner product of w and z is negative. So this double-product term is negative, and that means that the whole thing is less than just the last term. Now this last term is a constant, bounded by the squared norm of the input vector: if the z's are binary, just plus or minus one in each component, then the squared norm of a vector of length n is just n, the input dimension. In other words, the change in w squared in one step is less than eta squared times n. So in m learning steps, the total change in w squared is less than eta squared times m times n, because m is the total number of learning steps, right? And if we start with norm zero, then the norm of the vector is less than the square root of that, which is eta times the square root of m times n. That's the last result. So now we just divide: the numerator is larger than something and the denominator is smaller than something, so the ratio is larger than this ratio. And here you find that this part is a constant and this part grows with the square root of m. So here you find the proof that we needed. And it is also instructive to turn this around and say, okay, let's bound m, let's ask which m can still satisfy this bound. Then you find that m has to be less than this, and you see that for problems for which the d of w star is very large, those are the easy problems, the number m is small, and for the problems that are very
hard, where the d is close to zero, this number of iterations can be very large. So you see, this means that you get finite-time convergence of the perceptron. Any questions about this convergence proof? So, another curious fact about the perceptron is its capacity. The capacity is about how many instances I can learn with such a machine, with such a perceptron. The rationale is the following. If I have three points in two dimensions, these are three vectors x in the input space, then how many learning problems can I define on this? Well, this point can be in two classes, this one can also be in two classes, and this one can also be in two classes. So if I have p patterns, I have two to the power p different problems that I can define, and I'm asking myself which fraction of those problems is linearly separable if I pick one at random. So in this case, what is the fraction that is linearly separable? Well, either all three have the same class, in which case we put the separator off to the side, or there is one point of one class and two of the other, and we put the separator in the middle. So they are all linearly separable. Now I add a fourth point. Now I have two to the p, which is now 16 problems, and some are linearly separable and some are not. If I have three of one class and one of the other class, it is linearly separable, but if I make an exclusive OR problem out of it, with this and this in the other class, I cannot separate them. So you can count. What plays a role here is p, the number of samples that I have in the problem, but also n, the input space dimension. If I have four samples in two dimensions, some problems are not linearly separable; but the same four samples in three or four dimensions actually become linearly separable, it becomes easier. So there is an interplay between p and n and this probability of being linearly separable. Now the answer to
this question is given here by this formula, this C p n. This is the number of linearly separable colorings of p points in n dimensions, separable by a plane through the origin; this 'through the origin' is a detail. And I should maybe explain this notation a little bit, this binomial coefficient, p minus 1 over k. You know of course what it means: it is the combinatoric number, the factorial of p minus 1 over k factorial times the factorial of p minus 1 minus k. This is usually defined when k is between zero and p minus 1, but here the sum over k runs up to n minus 1, and you can have the case that n is larger than p. So by convention, the binomial coefficient is defined to be zero if k is larger than p minus 1. That's the convention that is taken there. I'm not going to go through the details of the math, but if p is less than or equal to n, it turns out that this sum is the complete sum over all these binomial terms, which is equal to two to the p minus one, so the whole thing is two to the p. So if p is at most n, you find that all problems are linearly separable. Now, if p is equal to two n, you fill it into this formula and you find that the answer is two to the power p minus one. That means, if you divide two to the power p minus one by two to the p, you find one half: 50% is then linearly separable. And if you go to p larger than two n, you find that this fraction goes more and more to zero. The picture is like this; it is plotted here. If p over n is two, we have this 50-50 point here, and the plot shows curves for different values of n. For small n this is a rather smooth curve, and as n gets larger this curve gets steeper and steeper, and as n goes to infinity it actually becomes a step function at two. So if you have a very large problem, like a thousand inputs in your perceptron, and you have a random
instance, it will have probability of almost one of being linearly separable if your number of samples is less than two n, and probability one of not being linearly separable if your number of samples is larger than two n. So this is the situation, and this is quite curious. This number two is called the capacity of the perceptron: it is the number of samples per input dimension in a random set that can be satisfied. Okay. Now, this is about random instances. If you think about a typical instance that you see in a textbook on machine learning, you have some data here and you can draw a line here; you have the particular case that the data here are all of one class and the data there are all of the other class. That's very different from these random instances. In other words, in random data there is no structure, but in real data of course there is structure: if you have a point of a certain class, then it is likely that a nearby point will have the same class label. So this story is, as I just said, about random instances, but it also says something about generalization. The fact that it is linearly separable on this side of the transition means that in a random instance you can put a line in between, and in fact you can put many lines in between: there is a lot of room for different lines. And on the other side, in a random instance there is no line to put there at all. But if there is structure in the data, for instance if the data are generated from an unknown teacher that is itself a linearly separable perceptron, so there is a solution there, then of course you get a different reasoning. And then this says that on that side of the line you are going to nail down the solution to essentially exactly one solution. So the probability for a random instance may be zero, but there may still be
one solution and here there is actually too much room and there is many solutions and the generalization is bad so this is also a story about generalization that if you have few samples p less than 2n the freedom is massive and so you cannot expect good generalization but if your p is larger than 2n you will be able to pick out that one solution that is actually hidden there in that structure of the data that is the second lesson of this picture anybody confused about that? so the proof is maybe also insightful so this is our perceptron that goes through the origin so this is a mathematical detail that in fact so the perceptrons cannot be anywhere but they always go through the origin in this high dimensional space so suppose suppose that we have a number of so cpn is the number of linearly separable problems that I have on p samples in n dimensions now suppose that I have one of these instances which is a linear separable instance which is these four points there is two of one class and two of another class one of the colorings in this set in this total number of colorings that is linearly separable now if I add one point to it I can either put it somewhere here in which case I can define two new problems because I can put the point on this side of the line or even better to say I can put the line on this side of the point so I can make two colorings so if I add one either I can make two extra colorings or I put it here in which case I can only put one coloring because I have to put a red in this case so you can see that the number of linear separable problems with one point extra is for these problems type A I can make two instances and for the problems of type B I can only make one instance so that's what I get and so since A plus B is CPN this is also equal to CPN plus A I can write this in this way and now A is the set of problems where the point can be actually go through the separator that's the set of A and since this separator also goes through the origin 
that was the construction here you can figure it out here that it's going through the origin and it's going through this point and in fact it defines a problem in one dimension lower because this one dimension is taken out so this is actually CPN minus one so you get this recursive formula here and this has to be solved and now you can either think very hard and take this trial solution and fill it in and see that it works and this is the proof of this induction okay do we take a break or how does this work we just go on for up to me well it's up to you also I guess I was the only one with this idea what is the custom custom is no break, okay so we have no break sorry okay so so so this perceptron has a funny learning rule which converges in a finite number of iteration but most learning rules actually are not of this type and they are typically of the form that one defines a cost function which is then minimized and the cost function expresses a desire that we want the output of the network to be close to some target output right and this we're going to specify and then we're going to make that error as small as possible and that's the optimization problem for instance I can have a network where I have some where y is the output of some network maybe a very deep 20-dimensional 20-layer network I have some input I have a bunch of parameters which are all the parameters in my neural network and it produces some output y and I also know the label of this input because this is a cat and so this has the label of cat and so I want this output of the network to be close to the cat label so I want to minimize this distance so I take the square of that and I get a positive number and I want to minimize that and I want to do that simultaneously for all the data that I have all the samples that I have so the end labels here, the samples and so this is a criterion that depends the data is given so this is something that depends on w and I just want to find the w and this is a 
typical way to do that. In classification you could have other criteria; this quadratic error is typical for regression, where the outputs are continuous. If the outputs are binary, say 0 and 1, a typical choice is this other cost function, which is not a quadratic error. In the exercise for today you are going to look, if you want, at gradient descent applied to a simple perceptron trained with this cost function for two classes 0 and 1, and you execute the learning rule using the different tricks that we are going to discuss now. In both cases there is some error which has to be minimized; that is the upshot. So the picture looks like this: we have an error which is a function of w, and it has a lowest value, because it is a sum of squared terms, or it has some other lower bound. So it is bounded from below. This is very important: if it were not bounded from below, we would go down and down and down and never converge, we would end up in China, and we don't want to go there. Because it is bounded, something has to give: we have to stop at some point. That is the whole trick of gradient descent. So in one dimension we would have a picture like this: we have some error function and we want to minimize it. What we are going to do is try to find a solution of the set of equations ∂E/∂w_i = 0, one equation for each component of the parameters of your model, and you want to set them all to zero. So what happens here? There is a zero here, a zero here and a zero here, so by setting the gradient to zero you will find a solution, either there or there. The flat pieces you can mostly forget about, although in deep neural networks there are a lot of plateaus and you may stay there for a long time; that is another story I am not going to go into. But if things are slightly okay, then you will end up in one of these solutions, which is a local minimum. It is not the global minimum, but the gradient is 0 and it is an attractor, so you may end up there and never get out. So what this gradient procedure finds is one of these solutions; not necessarily a global minimum. If you start here you end up here; if you start there you end up there. Then you have stability, attraction, in some directions, and repulsion in some other directions. That is okay, but the problem is when the gradient gets close to 0, when it gets very flat, because then you don't move anymore. Okay, so we are going to minimize it. The simplest learning rule is called gradient descent, and it is the following: I start with some w, I compute the gradient, and I take a step. The new w is the old w minus a small number η times the gradient; with indices, w_i ← w_i − η ∂E/∂w_i. This is a vector update: the whole vector gets updated in the negative gradient direction. What you can now show is this: if we define Δw = −η ∂E/∂w, then for the error at the new point we can do a simple Taylor expansion, E(w + Δw) ≈ E(w) + Σ_i (∂E/∂w_i) Δw_i, and filling in Δw_i = −η ∂E/∂w_i we get E(w) − η Σ_i (∂E/∂w_i)² plus higher-order terms, which we may neglect if η is small enough; there is some range of η for which that is allowed. So this is less than or equal to E(w). In other words, my new point has a lower error than my old point: I go down in this function in each step, and the amount I go down is the squared norm of the gradient vector, times η. So I keep going down; if the gradient is big, I go down a lot; if the gradient is small, I go down a little; and if the gradient is zero, which is here, I stop, because things don't change anymore. The fact that E goes down, together with the fact that it is bounded from below, gives the proof of the asymptotic convergence of gradient descent: asymptotically, meaning after infinite time, you get a smaller and smaller gradient, smaller and smaller updates, and at some point you stop. So that is the notion of gradient descent. Now let's look at the gradient descent rule for a very simple example, a quadratic well. Suppose that E(w) is a quadratic form, E = ½ Σ_i λ_i w_i², so it has curvatures in all directions. In a two-dimensional picture the contours would be ellipses: this is w1, w2, and these are lines of constant E, with E = ½ λ1 w1² + ½ λ2 w2². If λ1 is very small you get a very shallow well in this direction, and a steep well in that direction; that is what this contour plot is telling you. Now if you compute the gradient, the change in the weight is Δw_i = −η λ_i w_i, and the new w is the old w plus the change. The old w_i appears in both terms, so we can take it out of the brackets and write w_i ← (1 − η λ_i) w_i. So the new w is proportional to the old w with a factor, and if the absolute value of this factor is smaller than one, then the solution is attracted to zero, which is where the minimum is; the solution is the origin. But the attraction is different in different directions. Suppose that η is very small: then this factor is less than 1 in every direction, but the λ's are different, so in the direction where λ2 is larger than λ1, the steep direction, the shrinking per step is faster than in the shallow horizontal direction. You see that illustrated in this graph: if you run this learning rule starting here, in the steep direction you get fast convergence and in the shallow direction slow convergence, so the typical path does not go straight to the origin, it curves like this. Now if you increase the learning rate η, the factor 1 − η λ may become negative, and since the λ's are different, it can stay positive for one direction and turn negative for the other. For the positive one you get the same monotone movement; for the negative one you get an oscillation, the sign flips every step. That is still okay as long as the absolute value of the factor is less than 1, because then it still shrinks to 0 over time, which is the situation here. Of course things get out of hand if that is no longer true: if there is one direction for which the absolute value exceeds one, you get divergence. You can take η small to guarantee convergence, but of course you want it as large as possible. So you see that the optimization of η has to do with the curvature being different in different directions, and there is no easy single setting, because each direction would like a different η. There are several ways to deal with that. One way is called momentum. The momentum term is like a massive particle that is moving: the gradient changes the velocity, not directly the position, so it keeps going for a while before it actually makes the turn. Adding momentum to the learning rule says: my step is the gradient
and a little bit of the old step: Δw(t) = −η ∇E(t) + α Δw(t−1). If you plug this definition in again, you get −η ∇E(t) + α(−η ∇E(t−1) + α Δw(t−2)), and it keeps going and going: you get this whole telescoping sum, where each term is proportional to η times an increasing power of α times the gradient at a previous step. For k = 0 you get α⁰ times the gradient at time t, that is this term; for k = 1 you get this term, for k = 2 that term, and so on in powers of α. That is the expression for the update. Now let's consider two extremes. Suppose first, like here, that the gradients are always in the same direction, the steps are always in the same direction. To give you a back-of-the-envelope idea of what momentum does, make the very crude approximation that all these gradients are equal, a constant gradient. Then we can take the gradient out of the sum, and letting t go to infinity we get the geometric series Σ_k α^k, which converges to 1/(1 − α). What does that mean? Suppose α is 0.9: then 1/(1 − 0.9) gives a factor of 10. So in a direction where the gradients always have the same sign, you get an acceleration by a factor of 10. Keep that in mind. Now look at the other direction, where you get the oscillation. Say the gradients have the same absolute value each time but alternate in sign; then you in fact get a minus α in the series, to increasing powers, and the sum becomes 1/(1 + α). So if α is again 0.9, you get basically a reduction by a factor of 2. So this momentum term acts differently in the different directions: where the learning rate is effectively too small and the gradients keep the same sign, you get an acceleration, and where the step size is too large and the gradients alternate in sign, you get a damping, and both happen at the same time. So if you are in this scenario and apply momentum, you get an increase of the step in this direction and a decrease in that direction. That is why momentum is a very nice, easy method: it costs essentially nothing, because you have already computed these gradients; you just have to remember the previous step and add it. Something more profound, more fundamental maybe, is using a second-order method. A second-order method uses the following narrative: it says, okay, we have this cost function, and instead of only computing the gradient here, we also compute the Hessian here, and we locally approximate the function around this point by a parabola, a second-order function; that is the picture to keep in mind. How does that work? Let w0 be this point, and approximate locally by a second-order Taylor expansion: E(w) ≈ E(w0) + bᵀ(w − w0) + ½ (w − w0)ᵀ H (w − w0), where b is the gradient and H is the Hessian, both evaluated at the point w0 where we are. And now, within this approximation, we look for the point where the gradient is zero. Take the gradient of this expression: the constant term gives nothing, the linear term gives b, and the quadratic term, being quadratic in w, gives something linear, H(w − w0). We set this gradient equal to 0, and we can solve for w, and we get w = w0 − H⁻¹b. This looks almost like gradient descent, except that η is replaced by H⁻¹: instead of a single number we have this whole curvature matrix, so we no longer have to worry about the step size, it is given by the Hessian. But we have to compute that thing. Computing a gradient, as we will see for the multilayer perceptrons, can be done in order n if you have n parameters, which is quite remarkable; it is done with the backpropagation rule. But the Hessian is an object of order n² entries, so with a million parameters computing a Hessian is really a no-go. Furthermore we then have to invert the Hessian, which is even worse, order n³ for inverting a matrix. So for large problems this is really a no-go. What is quite easy to do, though, is to take just the diagonal as an approximation, a quasi-Newton method. The diagonal of H is only n entries, and the inverse of a diagonal matrix is trivial, you just invert the diagonal elements one by one. So that is very cheap, and it works well in cases like this; but if the ellipse were oriented obliquely, like this, it would of course not work, because then you need the off-diagonal elements of the Hessian to do it well. Okay, so that is what it is; the full second-order method is really a no-go for many machine learning problems. So let's look at something else: there is something called line search, and at face value it looks like an excellent idea. Why don't we do the following: we have this contour plot that we need to optimize over, and suppose that we start here and have computed the gradient, which points in this direction, and so I
could do a line search: I say, okay, let's go this way and do a one-dimensional optimization along this line. Here I am not far enough, here I am too far, so I should stop somewhere here; that is where I do the first minimization. From there I could again compute a gradient and do a line optimization, and again, and so on. The problem with that is already illustrated in this picture; let's see what happens. I take w1 = w0 + λ0 d0, where the direction d0 is the gradient at w0, and now I have a one-dimensional optimization: with w0 and d0 given, I want to choose λ0 such that E(w0 + λ0 d0) is minimized along this line. So in line search the direction is given by the gradient at the current point, and the step size λ0 is found by minimizing along the line: if you walk along the line you go down, down, down, and then up, up, up; somewhere in between it is minimal, and you want to find that point. Taking the derivative with respect to λ0, the chain rule gives d0 inner product with the gradient of E at the new point, and setting that to zero is the condition. In other words, you find a point w1, this new point here, this is w0 and this is w1, such that the gradient there is orthogonal to the current search direction.
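That orthogonality is easy to check numerically. A minimal sketch, on an assumed toy quadratic E(w) = ½ wᵀHw with very different curvatures (the matrix H and starting point are my own illustrative choices, not the lecturer's):

```python
import numpy as np

# Assumed ill-conditioned quadratic E(w) = 0.5 * w^T H w; gradient g = H w.
H = np.diag([1.0, 10.0])

def grad(w):
    return H @ w

def line_search_step(w):
    # Direction is the negative gradient; the step length lam is chosen
    # to exactly minimize E(w + lam * d) along that line.
    g = grad(w)
    d = -g
    lam = (g @ g) / (d @ H @ d)
    return w + lam * d, d

w0 = np.array([10.0, 1.0])
w1, d0 = line_search_step(w0)

# The gradient at the new point is orthogonal to the old search
# direction -- the source of the zig-zag path described above.
print(grad(w1) @ d0)  # ~ 0
```

Repeating the step from w1 gives another direction orthogonal to the previous one, which is exactly the poor zig-zag convergence discussed next.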
So the gradient in the new point, it's staring right at you: the gradient at w1 is orthogonal to d0; d0 is this direction, and the gradient is orthogonal to that. It is like walking in the mountains: if you go down into some valley along a fixed direction and at some point your path starts going up again, then at the lowest point of your path the gradient is orthogonal to the path. It's a profound wisdom of mountaineering. So the lesson is that you make orthogonal steps, and that is bad, because now you get this zig-zag, zig-zag towards the origin, orthogonal step after orthogonal step, and this converges very, very poorly. What you would like to do is not to go orthogonal; ideally you would go straight to the solution, but short of that, you would like to go a little bit in the direction of the gradient and also a little bit in the direction of the old search direction. And that is the idea of conjugate gradient descent, a really good, very powerful idea. So you say: my new search direction d1′ is the gradient at the new point, which is orthogonal to the old direction, plus β times the old direction; I move in a combination of these two. And of course the whole intelligence is in the size of this β: how to choose it, we don't know yet. So now we do line optimization again: we start at w1, optimize along this new direction d1′, and find a point w2 = w1 + λ1 d1′ for some λ1 larger than 0, such that, again, we do this minimization. So we have E(w2) = E(w1 + λ1 d1′), and minimizing over λ1 gives dE(w2)/dλ1 = d1′ · ∇E(w2), which we want to set equal to 0; that is the condition we get here. Okay, now here the trick comes, the insight comes. What we ultimately want is the gradient to be 0 in all directions: that is a whole vector, a whole set of equations to solve. The first line search said the gradient at w1 is 0 along the direction d0; the second says the gradient at w2 is 0 along the direction d1′. What we are now going to demand is that this new gradient be 0 not only along d1′ but also along the old direction d0. Demanding both will fix our β. So we set 0 = d0 · ∇E(w2), and by Taylor expansion the gradient at w2 is the gradient at w1 plus λ1 times the increment contracted with the Hessian evaluated at that point: ∇E(w2) ≈ ∇E(w1) + λ1 H d1′. Now d0 · ∇E(w1) is already 0, because that is what the first line search gave us, so that term drops out; and λ1 is a nonzero constant, which we can take out. What we are left with is d0ᵀ H d1′ = 0, and this is called the conjugacy relation. So if H were proportional to the identity we would want our search directions to be orthogonal, but if it is not, we want directions satisfying this relation, and that gives a non-orthogonal new search direction, as pictured here. Now I still haven't told you how to find this β, and I am actually not going to tell you in detail, because I don't have time, but it is on this sheet: you can do an expansion of this Hessian, and it is spelled out here.
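The conjugacy relation can be verified numerically. In this sketch I use the Fletcher-Reeves choice β = |g1|²/|g0|², one standard gradient-only formula (not necessarily the one on the slides); for a quadratic with exact line search it makes the new direction exactly H-conjugate to the old one. The Hessian and starting point are my own toy choices:

```python
import numpy as np

# Assumed quadratic well E(w) = 0.5 * w^T H w, so the gradient is H w.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(w):
    return H @ w

w0 = np.array([1.0, 5.0])
g0 = grad(w0)
d0 = -g0                              # first direction: steepest descent
lam0 = -(d0 @ g0) / (d0 @ H @ d0)     # exact line search along d0
w1 = w0 + lam0 * d0
g1 = grad(w1)

# Fletcher-Reeves beta, computed from the two gradients alone
beta = (g1 @ g1) / (g0 @ g0)
d1 = -g1 + beta * d0                  # the new direction d1'

print(d0 @ H @ d1)  # ~ 0: the conjugacy relation d0^T H d1' = 0
```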
And you can express this β, to lowest order in the Taylor expansion, in terms of the gradients that you have already computed at w1 and w0, things you already know. So this β is computable, and it tells you which direction to go. It is not an easy algorithm, because what is the algorithm? You start at a certain point, you compute the gradient, you do a line search, a minimization, you find a point; then you need a new direction, a combination of the gradient and the old direction, for which you compute this β, which is fine because you have all these quantities; but then you must again do a line search to get to the next minimum, et cetera. So particularly implementing the line search is hard, but it is worthwhile for certain problems, particularly high-dimensional problems with limited data. For very large data problems, like neural network problems, this is not a good idea; but for small data problems in high dimensions this can be a very effective method, actually much better than the momentum method. You can prove that if the problem is quadratic, like this one, then the conjugate gradient method converges exactly in n steps, where n is the number of dimensions: you go one direction, two directions, you're done, and in n dimensions the same happens. So it is provably convergent in n steps. For nonlinear problems this is no longer exact, but it gives a bit of the flavor of why this is a good idea. In fact, quadratic problems are intimately related to solving linear systems of equations, and one of the best methods to solve linear systems is actually the conjugate gradient method.
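Since minimizing the quadratic ½ xᵀAx − bᵀx is the same as solving Ax = b, a minimal textbook sketch of such a solver (the 3×3 example system is my own assumption) looks like this; on an n-dimensional quadratic it terminates in at most n steps:

```python
import numpy as np

def conjugate_gradient(A, b):
    # Minimal conjugate gradient for symmetric positive definite A:
    # minimizing E(x) = 0.5 x^T A x - b^T x (gradient A x - b) is the
    # same problem as solving A x = b.
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x              # residual = negative gradient
    d = r.copy()               # first search direction
    for _ in range(n):         # at most n steps on a quadratic
        if r @ r < 1e-30:      # already converged
            break
        alpha = (r @ r) / (d @ A @ d)    # exact line search
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next direction is A-conjugate
        r = r_new
    return x

# Assumed 3x3 example: three steps reach the exact minimum.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True
```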
So that is what is under the hood in many packages that you may encounter for solving large linear systems. Okay, that's that. How much time do I have left? 15 minutes? Who says half an hour? The organizer says half an hour. Who says 16 minutes? Okay, let's try to make a compromise. So, what you may all have heard of is stochastic gradient descent. At face value, stochastic gradient descent is an extremely stupid idea. You say: okay, instead of computing the gradient, I am going to take only part of the gradient, an approximation of the gradient. And the approximation is the following. My cost is a sum of terms, one for each data point; we saw that with this quadratic error here, a sum of terms, one for each sample. So in general we have this situation of a sum of terms, and why don't we do the following: we pick one of these n samples at random, compute the gradient for that sample only, one term out of n, and take a gradient step with that. The picture is this: the full gradient, maybe this, is a sum of terms, gradients pointing in all kinds of directions, and the sum of all of these is this resultant, because E = Σ_n E_n and therefore ∇E = Σ_n ∇E_n, a sum of gradients. So instead of moving in the full gradient direction, for one sample you move in this direction, then you take another sample and move in that direction. You get a sort of stochastic motion, where the stochasticity comes from the random choice of the samples. And you can do it with single samples, or, as people say, with mini-batches: you may have a data set consisting of a million samples.
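A minimal sketch of the single-sample version, on assumed toy regression data (the "teacher" weights, sample count, and step size are my own choices): each update uses the gradient of one randomly drawn term of the sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: a hypothetical teacher weight vector generates
# noiseless regression targets for 1000 random inputs.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true

# E(w) = 0.5 * sum_n (w . x_n - y_n)^2 is a sum over samples, so the
# full gradient is a sum of per-sample gradients; stochastic gradient
# descent uses one randomly chosen term per update.
w = np.zeros(2)
eta = 0.05
for _ in range(5000):
    n = rng.integers(len(X))             # draw one sample at random
    grad_n = (X[n] @ w - y[n]) * X[n]    # gradient of that single term
    w = w - eta * grad_n

print(w)  # close to w_true
```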
You take mini-batches of size 100 that you draw at random, or you make a fixed partition of the million samples into batches of 100, and do it that way. Then you also have a learning rate, which, sorry for the clutter of notation, is what I now call α instead of the η before, and it has a subscript t, meaning that in principle it is time dependent. Now, what we want to solve is this system of equations: ∇E = 0, and we know this is a sum over samples of the gradients of E_n, each depending on w. There is a very old theory, very nice to know, called the Robbins-Monro theorem. It is about stochastic approximation, and it gives the answer for why, and under which conditions, this is a good idea. The general Robbins-Monro setting is the following. Instead of w we have x as the thing we want to solve for, and we want to solve M(x) = a; this is multi-dimensional, many equations with many variables, like here. And the setting is that M(x) is an expectation: there is some random variable ξ with a distribution p(ξ), and M(x) is an average of terms over ξ. Our case fits this, because we can write the 1/n times the sum over samples as an expectation with p(ξ) = 1/n: uniform probability, all samples taken with the same probability. So x is a vector, a is a vector, because there are many equations, M is the vector on the left-hand side, and each of the per-sample gradients is of course also a vector. And the Robbins-Monro algorithm is the following: I start with an initial x0. Then I choose ξ at random from this probability distribution, here uniformly, so we choose one of the patterns, and I update x according to the old x plus α_t times the difference of the left- and right-hand sides, evaluated for that one sample. And then the statement is: if M defines a convex problem and x* is a solution, then you can prove that the sequence x_t goes to that unique solution, in the sense that the norm of the difference goes to 0, provided that the learning rate α_t has these two properties: Σ_t α_t = ∞ and Σ_t α_t² < ∞. The first says the step sizes must not decrease too fast: if they decrease too fast, the updates dry up before the process has actually converged, and requiring the sum to diverge ensures that doesn't happen. The second says α_t must go to 0 at some rate, because asymptotically you want the updates to get smaller and smaller so that you settle on a fixed solution. A choice that satisfies both is, for instance, α_t = 1/t: summing 1/t is like integrating 1/t, you get a log, which diverges, while the sum of 1/t² is finite. Note that this is for convex problems, so M has to be convex; for instance, for a quadratic well, M is the gradient of that, a straight line, so it is convex. So if we apply this to our case, the training error is this sum of terms, sorry that I keep changing the notation, and the gradient was this. You get the Robbins-Monro problem if you take ξ to be the sample index μ, taking different values for the different samples, with p(μ) uniform; you take a equal to 0, because the right-hand side here is 0. And we take, sorry, sorry, this is wrong as written: this is the full gradient, it should be one of the gradient terms, of course. So in other words, the Robbins-Monro theorem says that the stochastic gradient descent method converges if you choose random patterns, update in this way, and take your learning rate to 0 with these properties. This is very powerful: in the exercise you will see that when this is applied, in the case of this logistic regression, stochastic gradient descent gives very fast convergence. If you want to know more, there is a nice paper by Sohl-Dickstein that reviews these different stochastic gradient descent methods. There is a lot more to this which we don't have time to review. Note that these are methods for convex problems, so they do a local optimization; but there are all kinds of heuristics on top. For instance, if you have a neural network with plateaus, adding noise may help you get off the plateaus, and plateaus are known to be a very serious cause of slowdown of convergence. So this kind of stochasticity has advantages beyond cheapness, such as getting off plateaus. You can also reason that if you have a multimodal landscape, the noise may help you get from one well to another, although you would need a significant amount of noise for that. So there may be more reasons why stochastic gradient descent is a good idea, and people have introduced other kinds of noise into neural network training, like dropout, et cetera, saying "this is the reason why we have such a good network", and then last year they said it's no longer true; there is a lot going back and forth. I don't have much time anymore. I could do the deep neural networks, but I think I am not going to, because I want to stick to the perceptrons.
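As a tiny illustration of the Robbins-Monro conditions, here is an assumed toy problem of my own: solving E[ξ] − x = 0 from noisy samples, with α_t = 1/t, which satisfies both Σ α_t = ∞ and Σ α_t² < ∞. With this particular rate the iteration is exactly a running sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed Robbins-Monro toy problem: find the root of E[xi] - x = 0,
# i.e. the unknown mean (here 2), from noisy samples.
samples = 2.0 + rng.normal(size=100_000)
x = 0.0
for t, xi in enumerate(samples, start=1):
    alpha = 1.0 / t                 # the classic admissible step size
    x = x + alpha * (xi - x)        # step towards the observed sample

print(x)  # close to 2; with alpha = 1/t this is exactly the running mean
```

A constant α would keep fluctuating around the root forever; a rate decaying faster than the conditions allow would freeze before reaching it.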
Multilayer networks can do many function approximations; they are universal, so they can fit any function if you just put enough hidden units in. One of the important things is this backpropagation scheme. The reason is the following. If you have a function E of a whole bunch of parameters, then evaluating this function will cost you at least linear time in the number of parameters, you would think: you have to touch each parameter at least once. So evaluating E is order n. If you now want the gradient of E, each component of the gradient would also be order n, so you would expect the full gradient to be order n squared, typically. For neural networks this may be a problem, because n may be very large, and you don't have the time. The nice thing about backpropagation is that it's an administrative scheme that allows you to do this computation in order n time. That's really the upshot; the details I'm going to leave for you on the slides to work out. Basically, you have a forward pass in which you compute all the activities, you have a backward pass in which you compute all the errors, and then you multiply the two. Each of these operations is linear in the number of parameters that you have. OK, that's that: universal approximators. OK, so one of the things that is very popular is these convolutional neural networks. They work if you have certain structure in your data. And I think a very simple way to understand convolutional neural networks is through the idea of weight sharing.
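The forward pass, backward pass, and multiply structure can be sketched for a one-hidden-layer network (the network sizes and the squared-error loss are my own toy choices, not from the slides). Note that each pass touches every weight once, so the whole gradient costs order n, not order n squared:

```python
import numpy as np

# Backprop sketch for one hidden layer (toy sizes, my own choices).
rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input
t = 1.0                                # target
W1 = rng.normal(size=(4, 3))           # input -> hidden weights
W2 = rng.normal(size=(1, 4))           # hidden -> output weights

# forward pass: compute all the activities
h = np.tanh(W1 @ x)                    # hidden activities
yhat = (W2 @ h)[0]                     # linear output unit
E = 0.5 * (yhat - t) ** 2              # squared error

# backward pass: compute all the errors
delta2 = yhat - t                                  # output error
delta1 = (W2.T * delta2).ravel() * (1 - h ** 2)    # hidden errors

# gradients: multiply errors by activities (outer products)
gW2 = delta2 * h[None, :]
gW1 = np.outer(delta1, x)
```

A finite-difference check on any single weight confirms these gradients, which is a useful habit whenever you implement backprop from scratch.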
So suppose that my input space is one-dimensional, and I have here a hidden unit that is looking at these three inputs. Now, if these are pictures, then the statistics that I pick up here are going to be the same as the statistics that I pick up there, et cetera. There is translation invariance, right? The cat can be anywhere in the picture. So if I have a second neuron which looks at, say, these three inputs with these three weights, then I can enforce weight sharing. That is to say: I'm not going to train these things independently; I'm going to enforce that this weight is equal to that weight, this weight is equal to that weight, and that weight is equal to that weight. And if I have to cover the whole input space, of course, I need many neurons to do that. This will give me what is called a feature map: the outputs of all these neurons that detect the same feature, because they all have the same weights. So this is one feature map. And then I'm going to have many feature maps, because of course I can learn one feature of these three input variables, but I can also learn another feature of the same three input variables. Then I get a second feature map, with weight sharing among all the neurons that encode for that feature. And that, in essence, is the convolutional idea. Yeah, so I'm not going to go further. But maybe one thing is important to note: these ideas are extremely old. These are papers from the 80s and 90s that already have these ideas in them. Then, of course, there was the world-famous ImageNet competition in 2012. This is a computer vision benchmark with 1,000 classes, 1.2 million training samples, 50,000 validation images, and 150,000 test images. And this was the state of the art at the time.
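The weight-sharing picture above can be written out directly for a one-dimensional input (the input values and the filter are my own toy example): every hidden unit applies the same three shared weights to its own window of three inputs, and sliding that shared filter over the input is exactly a convolution.

```python
import numpy as np

# Weight sharing as convolution, in one dimension (toy values, mine).
x = np.array([0., 1., 0., 0., 1., 1., 0., 0.])   # one-dimensional input
w_shared = np.array([1., -2., 1.])               # the shared weights (one feature)

# each hidden unit applies the SAME weights to its own window of 3 inputs
feature_map = np.array([w_shared @ x[i:i + 3] for i in range(len(x) - 2)])

# sliding the shared filter is a convolution (np.convolve flips the kernel,
# so we reverse it to get the sliding dot product)
same = np.convolve(x, w_shared[::-1], mode='valid')
print(feature_map)   # identical to `same`
```

A second feature map would simply be a second shared weight vector applied to the same windows.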
So the error in getting the top class correct used to be of the order of 45 to 47%. And then suddenly these convolutional neural networks came around, and they cut almost 10 percentage points off that error. That was the big result that started a lot of this interest in deep learning. And I think it is just absolutely amazing. It is of course true that, given enough data in particular, and also given enough compute, these things can do amazing things. Of course, there is always a certain worry about brittleness: that you get a solution that is over-specialized on a particular data set and may not generalize well to other settings. These solutions are very complex and, in a sense, somewhat brittle. There are papers out there — Joelle Pineau in particular has been talking about this — that you may want to look at, with some critique on the robustness of these solutions. This is one of my favorite examples here. This is one of the test set images, where the deep network is also connected to a language-generating recurrent neural network. So this is an image that this machine has never seen before, you present it, and it outputs: a group of people shopping at an outdoor market; there are many vegetables at the fruit stand. I think this is absolutely amazing. If I had seen this example 15 years ago, I would not have believed that this could be done on a new image. So I think it's very spectacular. And this is showing how the network has attention: these white blobs are the attention, where the text is telling something about what is at the blob — the dog, the stop sign, the giraffe, et cetera. Anyway, here you see some papers on deep learning: the old work of Fukushima, Yann LeCun, the ImageNet paper, and a nice review on deep learning from 2015 where you find many good ideas.
So in the last part I want to connect all of this — what I told you — to my own research, so this is more of a research talk. It is motivated by the fact that synapses in neural networks are unreliable. Here you see, for instance, the activity of an input neuron that is stimulating this output cell, and in the output cell we record what is called the postsynaptic potential. This is not the spiking of the cell; it is just the effect on the membrane potential of the impinging input spike. And you see that if you put in this spike train as input, then the output in repeated trials is very, very different — it's very unreliable. The first response, which is the most important one, is in some cases even completely absent. So that means this synapse is a sort of stochastic variable: it is on or off with a certain probability, not for sure. So having a neural network with continuous weights of infinite precision may not be a good model, and maybe we have to look at much simpler ideas. And for hardware there is a whole other story, which is about the energy consumption of computation. Did you know that our planet is spending about 5% of its energy on computation at the moment, and that this number is rising by about 7% per year? They say that the self-driving car — which is supposed to be driving on highways in China in 2020, by the way, in one and a half years, so we can just wait and see — will spend more energy on its IT and its neural networks than on its propulsion, its movement. So there's a big energy problem there. So having cheap hardware, for instance cheap synapses which are just bits, may be a good idea. So we do the perceptron problem again: we have the perceptron, with weights w, and we now ask ourselves: what is the solution?
We now have to find a solution with binary weights for this learning problem. And you can imagine that this is much harder than the continuous problem. The continuous problem was relatively easy: if you put it in terms of the logistic regression cost function, as you'll see in the exercise, it's actually a convex problem with a unique solution in the continuous case. But if the weights are binary — 0/1, say, or plus or minus 1 — then it becomes an NP-hard problem. So it becomes an intractable problem, where the time to find the solution is expected to scale exponentially with the problem size. So what are you going to do? The energy landscape is going to look a little bit like this here, this gray curve. Now, there has been work in the group of Riccardo Zecchina and Carlo Baldassi, who found that there are different types of minima in this energy landscape. There are isolated minima, defined by the fact that you have a solution where, if you change one bit, it is no longer a solution. And there are also non-isolated solutions: this area of low values. And you actually want learning rules that get you to that kind of solution. So this is what we developed, and I'm going to tell you a little bit about that. So here is the so-called stochastic binary perceptron. We have an input x, which for simplicity we take to be a binary vector; these are the inputs. Then we have an output, which is plus or minus 1. And we have synapses, denoted here by s, which are also plus or minus 1. So this is our learning machine. For simplicity we take the threshold equal to 0, just not to clutter the notation. And the probability of the output given the input is the sum over all values of s of the probability of the output given s and the input, times the probability of s. This is just basic probability calculus — marginalizing over s.
So we do this. This probability of the output given s and the input — this is a perceptron-like factor. We model it by a sigmoid function, where the sigmoid is basically like a hyperbolic tangent, with this kind of shape. So we have sigma of beta y times h, where h is the sum over i of s_i x_i. The idea is that if this were our model, the probability that y is 1 given h is the sigmoid of beta h, and that is this curve as a function of h. So the probability of one class increases with h, and beta is a steepness parameter: if beta is large, you get something that looks like a step; if beta is small, you get something smoother. That is the probability of output 1 given h. And the probability that the output y is minus 1 given h is, by this formula, the sigmoid of minus beta h. It's easy to show that this equals 1 minus the sigmoid of beta h, by the property of the sigmoid function that is on the slides. And this is consistent, because it is then 1 minus the probability that y is 1 given h; in other words, the probability that y is minus 1 plus the probability that y is 1 equals 1. So that is for a given h. But this h is made up of the binary synapses, and they have a probability too. Their probabilities are given as independent, one for each synapse. For these synapses s, we assume they are independent, and each has a certain rate given by w: the probability that s_i is 1 is the sigmoid of w_i, and the probability that s_i is minus 1 is the sigmoid of minus w_i. So these probabilities also add up nicely to 1.
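The generative side of this model can be sketched in a few lines (the particular parameter values, and the use of the logistic function for sigma, are my own choices for illustration): sample the binary synapses from their rates, compute the local field h, then sample the binary output.

```python
import numpy as np

# Sampling sketch of the stochastic binary perceptron (toy values, mine).
# Synapses s_i = +/-1 are drawn independently with P(s_i = 1) = sigma(w_i),
# and the output y = +/-1 is drawn with P(y = 1 | s, x) = sigma(beta * h),
# where h = sum_i s_i x_i.  Here sigma is the logistic function.
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
beta = 2.0
w = np.array([1.5, -0.5, 0.2, 2.0])      # synaptic parameters
x = np.array([1., -1., 1., 1.])          # binary input pattern

s = np.where(rng.random(4) < sigma(w), 1.0, -1.0)      # sample the synapses
h = s @ x                                               # local field
y = 1.0 if rng.random() < sigma(beta * h) else -1.0     # sample the output
```

Note that sigma(w) + sigma(-w) = 1, which is exactly the consistency property mentioned above.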
And so the expected value of the synapse can easily be shown to be the hyperbolic tangent of the w value; we call this the mean, m. We can also look at the variance of this synaptic value: it is the expectation of s squared, which is 1, minus the squared expectation, which is m_i squared. So the variance is 1 minus m_i squared. So this synapse has a mean between minus 1 and 1: if the mean is close to 0, it has a large variance, and if it is close to one of the ends, plus or minus 1, the variance is very small, right? And then we have n of these synapses, and they work together in this network. The neural model is very similar to the Boltzmann machine, but the Boltzmann machine is recurrent and this is a feedforward structure. In the Boltzmann machine the connections go both ways — neuron 1 affects neuron 2, and neuron 2 affects neuron 1 — while here the information goes in only one direction. So h is the total input activity coming into this neuron, and the s are independent. So h is a sum of stochastic variables: for a given input, it has a mean value and it has a variance, which is the sum of all the individual variances — and since x_i squared is 1 (we have assumed x_i is plus or minus 1), we get the variance of this h directly. So the point here, let me write it down: we have this neuron whose input-output relation involves the sum of all this stochastic activity of the synapses. This stochastic activity only enters the output of the neuron through the summed activity, and for this summed activity you can use the central limit theorem. So it has a mean and a variance, and that's all you need to know. It becomes Gaussian because all these components are independent. So h becomes Gaussianly distributed.
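This mean-and-variance picture is easy to check numerically (the sizes and the exact sigmoid convention are my choices: I take P(s_i = 1) = (1 + tanh(w_i))/2, which makes the mean exactly tanh(w_i) as stated above). The simulated mean and variance of h should match the theoretical values m·x and the sum of (1 - m_i²).

```python
import numpy as np

# Numerical check of the Gaussian picture for h (toy sizes, my choices).
# Convention: P(s_i = 1) = (1 + tanh(w_i)) / 2, so E[s_i] = tanh(w_i) = m_i
# and Var[s_i] = 1 - m_i^2.
rng = np.random.default_rng(0)
n = 200
w = rng.normal(size=n)
x = rng.choice([-1.0, 1.0], size=n)     # binary input, so x_i^2 = 1
m = np.tanh(w)

# theory: h = sum_i s_i x_i has mean m.x and variance sum_i (1 - m_i^2)
mean_theory = m @ x
var_theory = np.sum(1 - m ** 2)

# simulate many independent draws of all the synapses
p1 = (1 + m) / 2
S = np.where(rng.random((20000, n)) < p1, 1.0, -1.0)
hs = S @ x
print(hs.mean(), hs.var())              # close to mean_theory, var_theory
```

A histogram of `hs` would also look Gaussian, which is the central limit theorem at work.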
That means that this sum over s — the sum over all 2 to the n possible synapse configurations — can now be replaced by an integral over a Gaussian variable. And this we can do. So here you see it: this P of y given x, which was the sum over s of the probability of s, this term, times the input-output relation of a perceptron with given values of the s's. This sum over s can be replaced by an integral over h, where h is a Gaussian variable with the mean and variance given here. So we can forget about the whole s dependence, because we only have to integrate over h, which is a one-dimensional continuous variable. And we can do this integral in the limit where beta goes to infinity, when the sigmoid becomes a sharp step. The outcome is that it becomes, again, a sigmoid, but a different sigmoid. If you compare it with the ordinary perceptron, there we had y times a local field; here we also get y times a local field, where the local field is now the mean local field — that's this one. But most interestingly, it gets a denominator which sets the slope of the sigmoid, involving a sum over the mean activities. So it gets a very particular kind of slope: the slope comes to depend on the learning. And this has very interesting properties, as we will see in a moment. So if we now have a data set of training samples, input-output pairs, we can maximize the log likelihood. That is one of the cost criteria, like the energies, the costs, the E's that we were minimizing before; here we maximize this quantity. And the log likelihood says essentially the following — I always give the example of a Gaussian distribution. Suppose I have some data points and I want to do a maximum likelihood estimate for a Gaussian distribution: what's the best Gaussian?
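For a tiny n the replacement of the 2^n sum by a one-dimensional Gaussian integral can be checked directly (the model conventions and all parameter values here are my own toy choices, following the same setup as above): enumerate every synapse configuration exactly, then compare with the Gaussian integral over h.

```python
import itertools
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny model (toy values, mine); convention P(s_i = 1) = (1 + tanh(w_i)) / 2.
beta = 1.0
w = np.array([0.8, -0.3, 1.2, 0.1, -0.7, 0.4, 0.9, -1.1])
x = np.ones(len(w))
m = np.tanh(w)

# exact: sum over all 2^n synapse configurations
p_exact = 0.0
for s in itertools.product([-1.0, 1.0], repeat=len(w)):
    s = np.array(s)
    ps = np.prod(np.where(s > 0, (1 + m) / 2, (1 - m) / 2))
    p_exact += ps * sigma(beta * (s @ x))

# Gaussian replacement: integrate over h with the matching mean and variance
mu, var = m @ x, np.sum(1 - m ** 2)
h = np.linspace(mu - 8 * np.sqrt(var), mu + 8 * np.sqrt(var), 4001)
gauss = np.exp(-(h - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
p_gauss = np.sum(gauss * sigma(beta * h)) * (h[1] - h[0])

print(p_exact, p_gauss)   # the two probabilities agree closely even at n = 8
```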
I can put a Gaussian here, or here, or there — which is the best Gaussian? I look, for each data point, at the probability density that that particular Gaussian assigns to it; so I get these values, et cetera. And what I maximize is the product of those. So I have a Gaussian model, I have data points mu, and I have parameters theta, say. I take the product of all these values over the data, and this becomes a function of theta. Now I want to find the theta that maximizes that. I take the log, because that doesn't change the maximum, and then I get the log likelihood, a sum of terms. That's the argument you use here. Now, instead of having just a Gaussian as a model, we have a conditional model: for each input, we want to maximize the probability of the correct output. So we get an input-output relation here — that's the only difference — but for the rest the argument is the same. So this is the maximum likelihood idea. And so we plug in this expression here, we get this, and we do the gradient computation. We can do that; it's not so insightful to look at that gradient, but here it is. It has a first term that looks like a Hebbian term, which we saw in the perceptron, multiplying the input and output activity — this we also had in the perceptron. It has this term that says: don't do anything if the pattern is already learned correctly — that term we also had in the perceptron. But it has this new term, which has all kinds of funny behavior, and we're going to look at that here. So here are some results. We take a binary problem with input and output, we learn it, and we see what the learning does.
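The Gaussian example can be made concrete in a few lines (the data values, the unit variance, and the grid search are my own choices): we maximize the summed log-likelihood over theta and find that the best Gaussian is the one centered at the sample mean.

```python
import numpy as np

# Maximum likelihood for a Gaussian with fixed unit variance (toy data, mine):
# the theta that maximizes the summed log-likelihood is the sample mean.
data = np.array([1.2, 0.7, 2.1, 1.5, 0.9])

def log_likelihood(theta):
    # sum over data points of log N(x | theta, 1)
    return np.sum(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi))

thetas = np.linspace(-2, 4, 6001)
best = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(best, data.mean())   # the grid maximizer agrees with the sample mean
```

For the conditional model in the lecture the argument is identical; the per-sample term just becomes log P(y | x, theta) instead of log P(x | theta).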
So the error gets smaller and smaller. And Q, which is 1 over n times the sum over i of m_i squared, measures how much the synapses have specialized to the values plus or minus 1 — m_i is the mean value of synapse i, and we normalize by 1 over n, the number of synapses. So if Q goes to 1, it means that each synapse has decided whether it should be plus 1 or minus 1, right? And here you see this Q over time: the rule decides, for each synapse, where it should go. So you can do this learning. Oh, I should tell you: the capacity of the perceptron with continuous weights was 2n, right? We saw that at the beginning of the lecture. For the binary perceptron, the capacity is actually 0.83n. That comes from a difficult statistical physics calculation that people did many years ago. But the good news is that it is basically only a factor of 2: you give up full continuous precision, reduce the weights to just bits, and you lose only about a factor of 2 in capacity. So that's actually quite good. Of course, learning gets harder as you get closer to this capacity limit, and that's what's shown here on the right. Here alpha is the load, the number of patterns per synapse, so the capacity 0.83 is somewhere here on this scale. And you see the performance of this network for different network sizes — 10,000 inputs, 1,000 inputs — in these different lines. And you see that basically you get good learning with this stochastic perceptron rule up to alpha of about 0.64, in that range. And that is actually pretty good for such a simple algorithm, because the state of the art is in the same range, using more sophisticated techniques like belief propagation methods — because, as I said, solving this problem exactly is in essence NP-hard.
So this stochastic rule is promising because it works well, it's very simple, and it's also possible to extend it to multi-layer structures. For instance, we have applied it to the MNIST data, where with three layers we can use this learning rule and get an error of about 1%. That's not a world record, but given that we don't use any convolutional layers, it's actually quite promising that we get this result. So that connects what I told you about all these basics to the current research that I'm doing, and this paper you can find online. If you have more questions, let me know. Yeah.