We want to continue. We talked a little bit about the perceptron and backpropagation, so now we want to review what problems we had, what problems the research community had, such that things basically came to a halt. So we want to talk about the problems so far. Not necessarily problems that we personally encountered, but problems the research community encountered using artificial neural networks, and partly support vector machines, and what was then perceived as: okay, we cannot really break through this.

One problem was that discriminative methods like SVMs and backprop generally need a long time to run, to train. You might say this is still a problem, but back then it was really bad, and it took a long time. We don't mean several minutes, we mean days and weeks, so it was simply not feasible; things were not working out. Maybe I should highlight what we mean by discriminative: we are talking about methods that classify. Neural networks are classifiers, support vector machines are classifiers. Whenever you train a classifier, it takes a long time, and again, a really long time, not several minutes.

Another problem was that backprop would get caught in local minima. This was a huge problem. If you plot your error here, and this is your training session basically, you try everything and you see something like this, but of course you want something like this, the absolute minimum, the global minimum. You want the least error, which gives you the best fit, the best model. That's the purpose of training. But you may get this, or this, or this, or this. All of these are local minima. They are minima, but of course the error surface is very complicated; it's not the two-dimensional case that I'm drawing here. We cannot easily follow the error surface; that's the purpose of training, to follow the error surface and really get into the basin of the error. So instead of the global minimum, which most of the time could be unique, but it could also be that there are several of them, or they're very close, it doesn't matter, I just need one of them, I get one of many, many other local minima. It looks like you reached a minimum, but when you come out and try to use the software, you see the response is not good. But we had a minimum! Yes, a local minimum. Backpropagation had a problem with that, and backpropagation was the best we had. So how do we do anything else if backpropagation doesn't work?

If we talk about SVMs, we would get a large number of support vectors for hard problems. When support vector machines came along around the mid-90s, we said, okay, now we have an alternative. Backpropagation I cannot really make deeper than three, four layers, so let's use support vector machines; they stand on solid mathematical ground, we have the proofs, everything should be fine. But then we would see that for difficult problems, and a difficult problem is something like this, let me draw something: if this is one class, and I'm just artificially creating a difficult decision boundary, and then I have some crosses as another class. This is not actually very hard, but okay.
Then you need this as a support vector, and this, this, this, this, and all the others. Along that margin that separates the two classes you need many, many support vectors, because if it were a line, I would take one here, one here, and I would be done. The more complicated the decision surface, the decision hyperplane, the more support vectors you need to separate things. But wouldn't that be overfitting? Isn't that overfitting? I could put a question mark here, because sometimes your problem really is difficult, you grab a lot of support vectors and you're going to be fine. But most of the time, if I have that many support vectors, you are basically just trying to make it work for the training set, and then you go into the real world, one of the crosses lands here, and boom, nothing works.

Why is that important? Again, with backpropagation we couldn't go deep. With shallow backpropagation we couldn't solve any major problem, and even with three, four layers we would get stuck in local minima, so it was not able to generalize. Then we got happy with Vapnik's SVM, and the SVM finds so many support vectors.

Another problem was that diminishing gradients prevent deep topologies. Well, what deep topologies did we even know back then? The autoencoder, nothing else. Nobody even dreamed of more; maybe some insane researchers ran some experiments with backpropagation with more than six layers, and then they waited two months, six months, nothing happened, and somebody just unplugged their computer. But with the autoencoder we were able to do some experiments with fake data, synthetic data: make it four layers, make it ten layers, make it twelve layers, make it twenty layers. Okay, you could do that, but then we realized: oh, if I really make it deep, the gradients diminish. I'm propagating the error from here to here to here to here, and at some point nothing is left. The neurons at the beginning don't get blamed at all, because the ones at the end of the network pay for everything. So how do we deal with that?

And there were more problems. It was not fun to be in machine learning. You would send a paper, it would get rejected. You would apply for money, nobody would give you money. You would write "I'm working on AI" in your resume, nobody would give you a job. A real problem. This is serious: I cannot pay my mortgage, I have to make AI work.

Another problem, of course, is that we need training data. This has nothing to do with the particular technique; whatever technique you use, I need to calculate the probability of the label given the data. Very important. So why is that a problem? Because we were not able to think outside of the box. We always say: I have to calculate the probability of the label given the data. What is the probability of cancer given the data? What is the probability that the direction of the robot is the right one given the measurements? What is the probability that the elevator at the fifth floor of E7 comes on time given that it is Monday morning? That probability means you need a lot of training data, a lot of measurements. This is our business as engineers, but measurements aren't cheap; it takes a long time to collect them. So it is generally also hard to learn a non-overfitted, and now we have to emphasize, a non-overfitted model just by relying on labels. I would say we still have this problem.
We have done some things to remedy that, but how can you come up with a good model that generalizes well if you are relying only on labels? What else is there? For example, why not look at inter-data patterns? What is that supposed to be? Why am I just focusing on digit recognition: this is a two, this is a five, this is a nine, this is an eight, this is a zero? What can I learn from five that makes recognition of eight easier? What can I learn from one that makes recognition of seven easier, maybe a better example? What is the relationship between the data? Are we exploiting that? Not really. Isn't that learning, pattern discovery, pattern recognition? Those were big words in the 70s and 80s; everybody would put it in his or her resume: I'm doing pattern recognition. You couldn't say AI; if you said AI, you were insane. Pattern recognition was more serious.

Okay, so we have a lot of problems. What do we do? How do we solve them? Let's have some rough ideas. How about we look at discriminate versus generate when we are talking about data? What's the difference? Yes. It's going to be faster than that. How long do you need? Okay, copy it from the person beside you, because we have to continue, or I can give you my notes.

So, discrimination versus generation. We have been focusing too much on saying: this is class A, this is class B. That is discrimination. What about generating the data? Doesn't that require more intelligence, actually? You want to say: this is Picasso, this is Van Gogh. Okay, but can you be Van Gogh? I would say that's more difficult. You're saying: this is William Faulkner, this is, I don't know, Romain Rolland. Can you write like William Faulkner? That would be more interesting. Generating data requires more intelligence. You can bring in any trainee and teach him or her within hours: this is this, this is this, this is this. But becoming a master in painting or writing takes years. So generating data requires more intelligence.

What does that mean for us? It means: do not calculate the probability of the label given the data, but calculate the probability of the data. Can you do that? You're saying: if I grab ten by ten pixels, what is the probability that this was painted by Picasso? Or, based on the brush strokes, what is the probability that it was Van Gogh? What is the probability that the stroke goes this way or that way, that it is this color or that color? It's not easy to convince people to do things differently when we have gotten used to something. So not Van Gogh versus Picasso, but just Van Gogh, or just Picasso. I don't have a preference. Why? The colors with Van Gogh are just insane, so I would rather go with Van Gogh; you can go with Picasso. So can I generate Van Gogh? Can I write poems like T.S. Eliot? Oh, take it easy; just try to imitate, just do the spam filter, you don't need to write poems like T.S. Eliot.

Okay. So this is, of course, discriminative, and this is generative. Which means we are saying: intelligence is understanding the data. Maybe intelligence is sometimes not separating data from each other, discriminating between instances, between classes of the data. Maybe intelligence is getting the data, getting the essence of the data. If you understand it, you can generate it.
You can probably in some cases add a "re" here, regenerate it, because you see some examples to learn from. You get some data, you have the probability of the data: you know how many times I get this color intensity, how many times the stock market values are here, how many times the elevator is here, how many times the robot makes a left turn when there is this type of obstacle, how many times I get cancer when I smoke. We have those numbers, and those numbers are actually much easier to come by compared to the case where you say: I did this, this, this, and then the output was this. That is classification, discrimination.

So the fundamental idea is to use just the data, which means no labels, and to use its energy. Here we bring a concept into the game which is a physical concept: energy. So what is energy? If I have a mass and I'm moving in a certain direction, what is the energy that is necessary? Think of minus mass times acceleration. Why minus? Because the question is: how much energy does it take to stop this guy and move it in the opposite direction? The concept of energy is closely related to work: how much work needs to be done? For data this can be a bit abstract and cause some headache for people who ask: what does energy have to do with data, with numbers? Well, nothing and everything.

Okay, let's have an example. We want to find out when there is a forest fire. That's a huge problem. We have forest fires, not so close to us but on the other side of the country, south of us, in many spots on the planet. Some of it is natural. We freak out because our lifespan is so short, and we say, oh my God, we lost so many hectares of land and trees; sometimes it's the way mother nature gives itself a break: let it cool down, come back in twenty years and it's beautiful. But twenty years is a lot for us; we want to go camping and enjoy it now, and then we freak out: oh my God, the world is burning. So we want to find a solution.

Basically you have a forest fire, and the question is: what causes a forest fire? Sometimes you have a storm and you get a forest fire; that's the observation. You get a storm, then there's a forest fire. Of course, when there is a storm, there is often lightning, and when there is lightning, of course, there can be a forest fire. Sometimes there is a storm and no forest fire; sometimes there is no lightning, but there is still a forest fire. If there is lightning, there is thunder. Thunder in itself cannot cause a forest fire, but lightning causes thunder. And we also see that when there is a storm, people go and make campfires, and when there is a campfire, there is forest fire. And where do the campfires come from? A bus of tourists comes, no Canadians in there, and they bring people and they make campfires. So what is the solution? Just shut it down, build a wall, and then there is no forest fire and we don't need AI.

So of course, there is a probability that when there is a storm, there is lightning. There is a probability that when there is lightning, there is thunder. There is a probability that when there is lightning, there is a forest fire. There is a probability that when there is a storm, there is a forest fire. There is a probability that when there is a storm, there is a campfire.
And there is a probability that when there is a campfire, there is a forest fire. And there is a probability that when people come, they make campfires. We can make observations and collect that data. We call this a belief network, and under some conditions we can call it a Bayesian belief network. Now, I don't want to do this manually. I don't want to sit down at the board, draw these ellipses and say: okay, this is this, this is this. I want to figure that out automatically; that's what I need the AI to do for me.

So how can we do that? Well, everything is doable, but it could be that we need to do some homework first. There are some potential models already in the literature. If I put some circles like this, around each other, and then I connect every circle to every other circle, everything connected to everything, then this could be the most general model I could use to come up with a network that represents our belief in an event, whatever the event is. We generally call these Boltzmann machines. They are one of the most generic types of machines that can possibly be used to learn something. They can be used, for example, in recurrent networks. They can be used in Hopfield networks, Hopfield networks with hidden neurons; we have not covered any of that. When we use a Boltzmann machine, stochasticity is always in the game. It's about probability: you cannot make an infinite number of measurements and observations, so whatever we do comes with a touch of uncertainty. Boltzmann machines are a special form of Markov random fields; they can be regarded as Markov random fields with some changes. And they use energy functions.

Can anybody tell me why they use energy functions, whatever the energy function means? We still don't know what we mean by energy. But you have just the data. You want to generate. You don't have the probability of the label given the data; you have the probability of the data. So I cannot calculate any error. There is no error, because there is no target, there is no desired output. It's just the data. And if there's just the data, the only thing I have is the data, so I can only calculate the energy of the data. So what is the energy of the data? Take it easy, let's go one step at a time. I don't have an error function, so I have to come up with something else, and historically we went with energy. Maybe later you'll see a little bit more why.

This was also called the Harmonium, by Smolensky, and Jeff Hinton turned it into the restricted Boltzmann machine, the RBM. So what was going on? Don't forget why we started this, and why I'm going a little slower than usual: because we have to get this. This is the beginning of deep learning. We have to get this. What happened? We had all those problems. We couldn't train, we would get stuck, we couldn't go deep, we would overfit, or we would become intractable. Okay, so we have to do something else. One of the ideas was: okay, don't discriminate, generate. This is one school of thought, the Torontonian school of thought, the UofT school of thought, the Jeff Hinton school of thought. Okay, so let's go back to the roots, take something like the belief network and make it practical. Okay, but we knew this, Mr. Hinton; we knew this doesn't work. And he said, yeah, okay. Give me ten years.
I will figure it out. Okay, how do you want to figure it out? What can you do with this? This doesn't even have the familiar structure: I have two input neurons, I have three hidden neurons, I have two output neurons, it goes forward, and then the error goes back into the network. That was the structure we had when we were using backpropagation networks. What is this guy telling us? No, I am in a mesh, which is a mess, and everything is connected to everything. Okay, let's take care of that first; we cannot deal with a Boltzmann machine where everything is connected to everything. Okay, I don't know where you're going with this, but I don't have anything better to do, nobody gives me a job because we are doing AI, so maybe we spend some time on this. One of the phrases we used back then to get a job was: I work on knowledge-based systems. Then it became: I am working on pattern recognition. Then, cautiously: I'm doing machine learning. And now we boldly say we are doing AI.

So what is restriction? Restriction is simplification. Restriction is one of the most powerful tools we have in computer science. When you restrict something, you make it more powerful. Sounds very strange. You restrict a list and make it a stack: wow, suddenly you can do parsing, build a compiler, do syntax checking. You had a list, you put a restriction on it, it becomes a stack. Put another restriction on it, make it a queue: now I can administer message transport and transmission in networks. Restriction creates specialization, and hence it becomes more powerful. We have the totally wrong image of restriction.

So, okay, we're saying: look, we have, let's say, a Boltzmann machine like this. Again, this guy is connected to O3, this guy is connected to O3, this guy is connected to this, this guy is connected to this, this guy is connected to this, and this guy is connected to this. I just redrew the Boltzmann machine, which historically we draw as a circle. I drew it this way because I still don't want to let go of backpropagation networks, multi-layer perceptrons; I drew it as if I have two layers. What? I don't have two layers, because these two guys are connected to each other, and these guys are connected to each other. So this is the Boltzmann machine, with interconnections within each layer. And that's not going to work, because this connection is a problem, this connection is a problem, the connections from these nodes to these nodes. And please keep in mind that putting these two at the bottom and these three at the top is completely arbitrary; it still doesn't mean anything. I'm just trying to do a gedanken experiment, that's it.

So Jeff Hinton was saying: okay, look guys, let's redraw this. We have these three, we have these two, and yes, I'm right with you, connect everything to everything. But let's stop right there. No connection here, no connection here, no connection here, no connection here. He removed these connections: this one, this one, this one, this one. He restricted the Boltzmann machine to not have connections within a layer. Which means he's assuming and saying: good, look, this is one layer for me, and this is another layer for me. Okay, whatever makes you happy. I don't see it, I don't know where you're going with this, but okay. So this would be our restricted Boltzmann machine, the RBM. Without RBMs we wouldn't be here, with respect to the renewed interest in deep learning.
So then, let's say, you would say: look, I have this layer, which is visible, and I have this layer, which is hidden. By definition? Okay, don't freak out: by my definition, this is my visible layer and this is my hidden layer. I can say that because I removed these connections; now I can assume that my topology is like this. Okay. Then we have a node here, with V1, V2, or X1, X2, it doesn't matter, and this is H1, H2, H3. Because I said visible, so V1, V2, and hidden, so H1, H2, H3. You make the restriction in the abstract, you make it practical, and then you start creating your terminology.

The RBM often uses binary neurons; it makes life a lot easier if it is binary. Which means the values of V1 and V2 are zero or one, and the values of H1, H2, H3 are zero or one. Oh, is that useful? We will get there. Perhaps it's not useful, but maybe we can make things work.

So then you define the RBM energy. The energy of the visible and hidden neurons, the energy that it takes to work on this belief network, on this restricted Boltzmann machine, is: minus the bias that I assign to the visible neurons, minus the bias that I assign to the hidden neurons, minus the hidden neurons times the weights times the visible neurons. Where are the weights? Well, we forgot the weights: these are the weights, this is the W. Learning is synaptic adjustment; if there are no weights, there is no learning. I need the weights. We didn't talk about them; we just assumed they are there. So this we understand. You can get rid of the bias and say I don't have a bias, or I can keep the visible and hidden neurons times their biases, or I can say I'm just working with the product of the visible and hidden neurons, mediated through the weights, of course. So if this is one and this is one, the only thing that counts is the weight; if one of them is zero, everything is zero, there is no energy. And why is it negative? Again, I want to move things from here to here; I look at the potential and take the negative amount, because I want to go in the opposite direction to move things around.

Okay. Then what? V and H are binary, and W is, of course, a real number that I can adjust; otherwise we have nothing. I need the synaptic values to be adjustable, I need to play with them.

Okay. So what is this? Relax. Can you understand this? I get, let's say, some books: book number one, book number two, book number three. Here I have: what type of book is it? Is it a novel, a collection of poems, science fiction? The genre. Then I have the author. Then: has it won the Nobel Prize? Then the number of pages; for example, I'm just picking features arbitrarily. And now I create my restricted Boltzmann machine, and now I can do magic. Let's say, I don't want to mention names of the books because I cannot write them here; any book, any book. Then we ask everybody: have you read those books? I ask him and he says yes, no, no. I ask her, she says no, yes, yes. Yes, yes, yes. And so on. Well, we have the data. We know this book is a novel, the author is William Faulkner, it did win the Nobel Prize, and it has 450 pages. So now you recognize inherent structure: what type of books does he like to read? They are usually books that are science fiction, less than 200 pages, and most of them are on the screen.
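To make the energy just defined concrete, here is a minimal sketch in Python; this is my own illustration, not from the lecture, and the function name, toy sizes, and random weights are made up purely for illustration.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v^T W h for a restricted Boltzmann machine.

    v : binary visible vector, shape (n_visible,)
    h : binary hidden vector, shape (n_hidden,)
    W : real-valued weights, shape (n_visible, n_hidden)
    b : visible biases, c : hidden biases
    """
    return -np.dot(b, v) - np.dot(c, h) - v @ W @ h

# Toy example matching the board drawing: 2 visible units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))
b, c = np.zeros(2), np.zeros(3)
v = np.array([1, 0])
h = np.array([0, 1, 1])
print(rbm_energy(v, h, W, b, c))  # one number; lower energy means a more probable pair
```

Note that with binary units the product term only counts the weights between units that are both "on", which is exactly the point made above.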
So this is the philosophy: we use binary neurons because we are just saying yes or no, and then we learn the associations. Why does this person, the first measurement, like book one? We look at it, and this lights up and this lights up: this person seems to like, and of course you need many measurements, books from, I don't want to do advertisement, so maybe the authors who are dead, I don't know, Victor Hugo, and 850 pages. So we can recognize relationships. Can we do that? How can I do that? Well, you cannot do this directly with gradient descent.

And by the way, if you do that, this is just one visible layer and one hidden layer. What is that good for? You were promising you would do deep learning. You are in such a hurry. We are not doing this because it is significant on its own; we want this as a component, as a building block. If I rotate this and put many, many of them behind each other, I want to construct my autoencoder. That's what I have in mind; I just don't tell people at the beginning. Otherwise, what can you do with just two layers, visible and hidden? Not much, not much.

Okay. If I ignore the bias, and we can ignore the bias to simplify things, just to see what's going on, then the energy of the visible and hidden neurons is minus the sum, over i and j, of V sub i, H sub j, W sub ij. These have binary states, otherwise things get really complicated, and this is, of course, the weight of the connection. If you do this just for the sake of doing it, you are wasting your time, but we are going somewhere with this.

Okay. The restriction was, again, important: no connections among the hidden units, and no connections among the visible units. Which means you have conditional independence between your visible units, and between your hidden units. You are making that assumption. It is a naive assumption, but I'm assuming that the fact that you like this book has nothing to do with the fact that you dislike that book; they are independent of each other. The fact that I have science fiction here has nothing to do with the number of pages; they do not depend on each other. If they do, things will get messed up, because we removed those connections precisely by saying there is no dependence. When I remove this, I'm saying there is no dependency between book one and book two. When I remove this, I'm saying there is no dependency between author and number of pages. Very important. Each one of these sentences that I'm uttering in a compact way, in my broken English, took the research community six, seven, eight years to figure out. Simple things.

So, okay, what else? You know, sometimes, have you been in the wilderness, in a spot for the first time, and you don't know where you're going, and you just try to guess? The forest is so dense you don't see anything, and then you have a feeling: we have to go in that direction. Go, go, go. And everybody says, you are insane, we have to go this way. And you say, no, I know, I know. And you get there, and they ask, how did you know? I don't know. Is it instinct? You're not able to explain everything, and we should not have to be.
So hence we can write: if I get rid of the bias and I assume there is conditional independence, then the probability of the hidden neurons given the visible neurons is the product, over all hidden units, of the probability of each hidden neuron given the visible neurons. The same the other way: the probability of the visible neurons given the hidden neurons is the product, over all visible units, of the probability of each V sub i given the hidden neurons. Because if they are independent and we are saying "and, and, and", I can just multiply the probabilities. What is the probability that tomorrow it rains and there is no bad news? Raining and bad news have nothing to do with each other, at least we think so, so there is conditional independence between raining and bad news. The probability of rain is 20%, the probability of bad news is 80%. So the probability that tomorrow it rains and there is no bad news is 20% times one minus 80%, so 20% times 20%, which is 4%. We are making things easy. We made it easy by removing the connections and making it a restricted Boltzmann machine, and then we make it easy by assuming there is no dependency. What happens if there is? I don't know; probably you are wasting your GPU hours.

Well, learning. Learning means to assign a probability to every possible pair of visible and hidden vectors via the energy function. So now I'm trying to formulate everything with the energy function. What is the restricted Boltzmann machine, and how does it learn? It learns to assign a probability to every pair: book one and genre, book one and author, book one and Nobel Prize, book one and number of pages. Every pair. I want to give a probability to every pair: how probable is it? We want to learn the probability of the data; generating patterns, not guessing them, not discriminating them from each other. Okay, hopefully it's slowly becoming apparent what we want to do, which may not be easily visible.

With that energy function, we can now write the probability of V and H, and again I have to remind you that these are vectors; out of laziness I just don't mark it on the board. Of course, every book is one, zero, one, zero, so V is a vector and H is a vector. That probability is the exponential function raised to minus the energy of V and H, divided by, normalized by, the sum over all V and H of e raised to minus the energy of V and H. We call that denominator the partition function, and it is quite challenging in our restricted Boltzmann machine; it's actually intractable. We cannot calculate it. So why are you writing it on the board? Formalism is one thing; practical considerations are another. We write it down. I said, if you have to go from zero to infinity, can you go to infinity? No. Well, that's your problem, it's not my problem.

So, the probability assigned to a visible vector: the probability of V, and again, of course, this is a vector, is one over Z, and this denominator is Z, times the sum over all hidden vectors of e raised to minus the energy of V and H. It's just a normalization, it's nothing. And if it's intractable, how would we do it? Just take what you can get: get an estimate, an expected value. That's the best we can do; we cannot wait for infinite observations.

Yes. So we start with the probability of the visible neurons, but we want to learn the mapping that gives me the hidden neurons, because I want to go from visible to hidden.
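As a sanity check on those formulas, here is a minimal brute-force sketch, again my own illustration with a made-up toy weight matrix, that computes the partition function Z and P(v) by literally summing over every binary configuration; the point is that the sum has 2^(n_visible + n_hidden) terms, which is exactly why the partition function is intractable for realistic sizes.

```python
import numpy as np
from itertools import product

def energy(v, h, W):
    # Bias-free RBM energy: E(v, h) = -sum_ij v_i h_j w_ij
    return -(v @ W @ h)

def partition_function(W):
    """Z = sum over ALL binary (v, h) pairs of exp(-E(v, h)). Only feasible for toys."""
    n_v, n_h = W.shape
    return sum(np.exp(-energy(np.array(v), np.array(h), W))
               for v in product([0, 1], repeat=n_v)
               for h in product([0, 1], repeat=n_h))

def prob_visible(v, W, Z):
    """P(v) = (1/Z) * sum over all hidden vectors h of exp(-E(v, h))."""
    n_h = W.shape[1]
    return sum(np.exp(-energy(v, np.array(h), W))
               for h in product([0, 1], repeat=n_h)) / Z

W = np.random.default_rng(1).normal(scale=0.5, size=(2, 3))
Z = partition_function(W)                      # only 2**(2+3) = 32 terms here ...
print(prob_visible(np.array([1, 0]), W, Z))
# ... but with 1000 visible and 500 hidden units the sum would have 2**1500 terms.
```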
But for my visible vector in itself, I want to know: what is the probability? If I ask you whether you read book one, book two, book three, what is the probability that the answer is zero, zero, one? It would be interesting to know, because I want to generate data, right? If you want to generate data, you should be able to say: what is the probability that everybody in this room likes book one? It would be a very small probability.

Okay, so why is that important? Let's say you have the visible layer at iteration zero, with two neurons, and I want to go to the hidden layer at iteration zero. So I will go here. Of course, I will not draw every connection individually, because it would get messy, and I will not write the weights. So I start. Now I'm starting to brainstorm: how can I learn with this? What is this, the generative model? It's a mystery, especially for newcomers to machine learning: what is a generative model? If I do this and I go from V to H, I should be able to come back from H to V. Can I do that? Based on your answers: that was a collection of poems, the author was T.S. Eliot, it won the Nobel Prize, did he? If he didn't, that was the biggest injustice, and 125 pages. Can I come back and say, okay, which book was it? Can I come back and get the visible neurons at iteration one?

So you're basically building a flip-flop: visible, hidden; hidden, visible; visible, hidden; hidden, visible. I'm trying to adjust the weights. And then, from this, I again go back up and say, okay, can I build the hidden ones at iteration one? Of course we want to do this, and at some point we get to V at infinity and we go to the hidden ones at infinity. So I'm going through iterations. This is very different, hence the relationship with recurrent networks, Hopfield networks. This is not feed-forward. In feed-forward, you start here, send the data, get the error, backpropagate the error, make adjustments, but the network itself is feed-forward. Here I'm building a flip-flop, going back and forth; this is recurrent, you come back. Coming back was a nightmare, and it is still a nightmare, because things can go wrong. Feed-forward is easier: easier to understand, easier to design, easier to train, everything is easier. So why is this guy taking us in that direction? Because he got stuck on that architecture.

Okay. I still don't know exactly what the algorithm to train this would be, but if you have something meaningful in place, perhaps you can figure it out; at the end, you have really good weights that give you the relationship between those books and the genre and the Nobel Prize and the author and everything.

Okay. So, because these are bits, zeros and ones, if I take the derivative of the log of the probability of V with respect to the weights W sub ij, that log-likelihood gradient is something like this: the expectation of V sub i H sub j under the training data, minus the expectation of V sub i H sub j under the model. And what does that mean? The angle brackets for us mean the expected value, the expectation under a distribution, by training or by model, whatever distribution you are assuming. Which means I want to minimize this difference; it plays the role of the error. I don't have an error here, but this difference between what I have in training and what I have at the end of the day is my error. What I have in training is here.
What I have at the end of the day is here. I don't think we'll get confused with this, because I'm going slowly, and I'm still skipping two or three of the functions and equations that we would need to get this in all detail; we don't need all the details to understand the concept.

So what does that mean? The energy for the visible and hidden neurons is, as we said, minus the sum of V sub i H sub j W sub ij, which means the derivative of that energy with respect to the weight is just the product of V and H. That's very convenient. What are V and H? Zeros and ones. That's what shows up in the log function, and it makes things a lot easier for me: if I build the derivative, the weight disappears and I'm left with the product. I'm pretty sure that when somebody starts on this, he or she has had this in mind from day one, without knowing how to get there: it would be nice if, for a restricted Boltzmann machine, the change in the weights, which is the intelligence, were just the product of the visible and hidden neurons. I bet, at least implicitly, Jeff Hinton was thinking like this, and then you work your way backwards to get there: what restrictions, what assumptions can I put in place to get here?

Okay. Which means, in this case, the amount of change that you need is basically a sort of learning rate, eta, times the expectation of V sub i H sub j at iteration zero, minus the expectation of V sub i H sub j at iteration infinity. And this is what makes it difficult, because you cannot go to infinity. Well, we don't need to. Depending on the problem, you may go half a million iterations and get there; things really stabilize. You really don't need to go to literal infinity, which nobody can do.

Which means our hidden vector at n plus one is, roughly, because I will be working with an estimate, I cannot go to infinity, approximately the response of a logistic function g that takes as its argument the transposed weight matrix times V of n, plus the bias c. And V of n plus one is approximately the same logistic function of the weight matrix times H of n, plus the bias b. These are the biases that we introduced before in the general equation for the energy: we had minus b times V, minus c times H, minus V times W times H. And we said, okay, I don't need them, I got rid of them; but now that the world was simple and easy for me and I established my ideas, I bring them back and put them through a logistic function. Of course, you can use a sigmoidal function, a nice exponential function that can be differentiated easily, because we want to calculate an error.

So right, nothing has changed: the only thing we have is calculating error. We just postpone calculating the error to the very end: the error between the model that I learned and the model that the data gives me, not the individual steps. That is the achievement of restricted Boltzmann machines.
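Written out compactly, this is my transcription of the formulas just spoken, using g for the logistic function, b and c for the visible and hidden biases, and the conventional orientation of W:

```latex
\frac{\partial \log P(v)}{\partial w_{ij}}
  = \langle v_i h_j \rangle_{\text{training}} - \langle v_i h_j \rangle_{\text{model}},
\qquad
\Delta w_{ij} = \eta\left(\langle v_i h_j\rangle^{(0)} - \langle v_i h_j\rangle^{(\infty)}\right),

h^{(n+1)} \approx g\!\left(W^{\top} v^{(n)} + c\right),
\qquad
v^{(n+1)} \approx g\!\left(W h^{(n)} + b\right),
\qquad
g(x) = \frac{1}{1 + e^{-x}}.
```

The iteration-infinity term is the part nobody can compute exactly; it is what the contrastive divergence procedure described next approximates with only a few flip-flops.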
If I wanted to go through the actual learning algorithm now, we would not make it in time, so I will upload just one page, and you can easily find it: a one-page pseudocode based on everything I said and some things that I didn't say. The learning method for this is contrastive divergence; it's not backpropagation. It was introduced by Jeff Hinton in 2002. That one page ties everything we talked about together and makes it a nice algorithm: you take V, you compute H, you go through the flip-flop, you learn.

Okay, assuming we read that: what is it that we want to do? Let's go back to the autoencoders. He wanted to do all of this for the autoencoders; now I'm speculating, but it was one of his motivations. We started with the idea that deep autoencoders cannot be trained with backprop. That was one of the problems we started with: if you have a deep multi-layer perceptron with more than four layers, five, six, seven layers, and nobody even dared to go beyond five or six layers, we cannot train it. It goes forever and nothing happens. That was the motivation. And now we are ready to deploy the first big idea and say: you can train a deep network like this. And this was a big deal, because from a formal and mathematical perspective, if an algorithm is intractable, it means that if you run it, it never comes back. But then you find a trick to make it come back. That's pretty amazing.

Deep networks before 2002, 2003, at the latest 2006, were like wanting to walk from here to the center of the Milky Way galaxy. There are some huge black holes at the center of the Milky Way; even if you make it there, you will be sucked in, you will not come back, and nobody even knows that you went there, because you cannot come back and say, oh, I was there. Deep networks with more than seven layers were like walking to the center of the Milky Way galaxy: I go and go, and nobody knows, did he make it to the center or is he still walking? And now, with restricted Boltzmann machines and contrastive divergence, we can act as if we go all the way to the event horizon of the black holes at the center of the Milky Way and come back. But we actually don't go. It's like the kernel trick.

So, okay, let's draw a simple but still challenging autoencoder. I start with this, and I get to this: with three layers, I get to the smallest layer, and then I build it back up. We said X goes in and X should come out. If that's the case, this is usually, I don't know, this is N, this is half of N, this is a quarter of N, this is N over eight, whatever. So you can really compress the data. Of course, if you want this to be N over 1000, how many layers do you need? You have to make it really deep. You can compress 10,000 numbers into 20. Well, PCA could do that. I know, but would you let me have fun? I want to do it with deep networks. And we said it doesn't work. And now Hinton is about to tell us his magic, how restricted Boltzmann machines fix it, and we say, ah, there he goes again with Boltzmann machines and graphs.

Well, one of the genius ideas was this. Here you have W1, here you have W2, here you have W3; you have your matrices, or tensors, of weights. He said, you know what, let's mirror this and not create new weights here: let's put here W3 transposed, W2 transposed, and W1 transposed. That is already a huge facilitation of fast learning, because you don't have six matrices to learn, you have only three. So, okay, that's a nice trick.
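Before stacking these into the deep autoencoder, here is a minimal sketch of what a single contrastive divergence update with one flip-flop (CD-1) might look like for one RBM. This is my own illustration under the lecture's assumptions (binary units, logistic function, learning rate eta); it is not the one-page pseudocode that will be uploaded.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, eta=0.1, rng=np.random.default_rng()):
    """One contrastive-divergence (k=1) update on a single binary training vector v0.

    W : weights, shape (n_visible, n_hidden); b, c : visible / hidden biases.
    Returns the updated W, b, c.
    """
    # Up: visible -> hidden (iteration 0)
    p_h0 = sigmoid(W.T @ v0 + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample binary hidden states

    # Down and up again: the "flip-flop" (one reconstruction step instead of infinity)
    p_v1 = sigmoid(W @ h0 + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(W.T @ v1 + c)

    # <v_i h_j> at iteration 0  minus  <v_i h_j> after the short chain
    W += eta * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += eta * (v0 - v1)
    c += eta * (p_h0 - p_h1)
    return W, b, c

# Toy usage: 2 visible, 3 hidden units, one made-up training vector.
W = np.random.default_rng(0).normal(scale=0.1, size=(2, 3))
b, c = np.zeros(2), np.zeros(3)
W, b, c = cd1_step(np.array([1.0, 0.0]), W, b, c)
```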
But we were not after that alone. So this is W and this is W transposed, basically. First, you train RBMs. You train RBMs on this, because you only need to learn these W's; the transposed ones you get for free. And after you have done that, you fine-tune with backprop. He was waiting all this time to tell us this. First, second, well, maybe I should call this zero: W1, W1 transposed. He was exercising self-control and not telling us, because the moment you give away your idea before you have the validation, people start taking it apart. People are brutal. If he had told us up front, before talking about restricted Boltzmann machines and contrastive divergence and this and that, "guys, I want to take half of this, mirror the weights, train this half, and then train everything together," we would have said: that doesn't work, it doesn't work because of this and because of that. People would come up with ideas to reject the idea, and professors are really good at that. A big part of it is ego. The good part is that science is a self-correcting system; over time it overrides my ego, but my ego can be very destructive in the short term. Big ideas always, always, always get rejected. Why do you think that is? It's because of our ego. We see it, and one part of us says: that's right, oh my God, what a fantastic idea, why didn't I have this idea? And then you look at the name and the university, okay, this is not my university; no, this will not work, and you write half a page and reject it, whatever. Aren't you disappointed? This is one of the biggest secrets that enabled us to embark upon deep learning.

So: take weights, do some weight sharing. Can I call it weight sharing? Some people may object and say, no, no, it's not exactly weight sharing. I know, but I want to start using this phrase, because it will help us again later. One of the problems of going deep is that you have many, many weights. Reduce the number of weights somehow. How? Share weights; collaborate between layers, between filters, between components.

Then, what does it mean to train RBMs? It means: first you train this with this. Then you train this with this. Then you train this with this. Isn't that genius? Somebody tell me this is not genius. And at the time, everybody was jumping on it: what, what is this? People get very creative in their word selection when they want to reject your idea: the network has to learn in its totality, you cannot do it layer-wise, pair by pair, that has no bearing, there is no theory for that, there is no justification. So: you train this, then you train this, then you train this, and then you put everything together and train with backpropagation. This first part is done with contrastive divergence; it's not backpropagation, it cannot be backpropagation.

We should reflect on this, I think. It's very important, because this is the Toronto school of thought. We have not yet started with the Montreal school of thought, which will give us CNNs and everything. This is mainly what was happening in the group of Jeff Hinton and many others; he's just the center of attention. I'm pretty sure he had this in mind all along, but he didn't know how to formalize it. It's like those poets: you ask them, how did you come up with a poem like this? And they say, I don't know, I was just watching the stars, and somebody knocked on the door, left the poem there, and disappeared, and I knew I had to write it down, otherwise it would disappear. So I just wrote it down. That's inspiration; nobody knows what happened.
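As a rough picture of that greedy, layer-wise recipe, here is a hypothetical sketch: train an RBM on the data, feed its hidden activations to the next RBM, then unroll into an encoder (W1, W2, W3) and a mirrored decoder (W3 transposed, W2 transposed, W1 transposed) to be fine-tuned with backprop. The layer sizes and the reuse of cd1_step from the sketch above are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(data, layer_sizes, epochs=10):
    """Greedy layer-wise pretraining: one RBM per pair of adjacent layers.

    data        : array of binary vectors, shape (n_samples, layer_sizes[0])
    layer_sizes : e.g. [n, n // 2, n // 4, n // 8], as drawn on the board
    Returns the list of weight matrices W1, W2, W3, ...
    """
    weights, x = [], data
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = np.random.default_rng(0).normal(scale=0.1, size=(n_in, n_out))
        b, c = np.zeros(n_in), np.zeros(n_out)
        for _ in range(epochs):
            for v in x:                         # train this layer pair as an RBM
                W, b, c = cd1_step(v, W, b, c)  # contrastive divergence (sketch above)
        weights.append(W)
        x = sigmoid(x @ W + c)                  # activations become "data" for the next RBM
    return weights

def autoencode(x, weights):
    """Unrolled autoencoder: encoder uses the W's, decoder reuses them transposed."""
    for W in weights:                 # encode
        x = sigmoid(x @ W)
    for W in reversed(weights):       # decode with W transposed (the weight sharing)
        x = sigmoid(x @ W.T)
    return x                          # after pretraining, fine-tune all W's with backprop
```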
We have to continue with CNNs, yes, but not today. So: you take this input, you train this layer with this one; then its output becomes the input for the next pair, and you train this with this; you take that output, and you train this with this. Take a look at the contrastive divergence pseudocode that I will upload tonight; if there are questions, we can go into detail. We are two minutes over time. On Thursday we will continue with the second wave of deep learning, which starts with CNNs.