Okay. So we're going to talk about a number of different topics today. The first series of topics I'll probably go through all of, and then the next topic I'll go through completely. The first one is about, basically, extending deep learning with new architectures so that deep learning systems can do things like remember facts, learn large collections of facts, and then perhaps reason. And we'll talk about transformer architectures, because you're going to have a bunch of guest lectures in the next few weeks that will probably talk about this a lot, and it's become a very important set of architectures in deep learning. How do we get deep learning systems to reason and to use memory, right? When we reason, we have a kind of working memory. When we do any kind of activity, we have a sort of temporary storage of the things we're working on. It could be linguistic, like remembering names of people, phone numbers, things we just wrote. But it can be completely nonverbal: you're building a widget, for example, or you're driving, or flying an airplane, or sailing. You remember a lot of facts about the world, you act on those representations of the world, you update them with your new percepts, and you reason with them to plan ahead, right? So far, what we've seen in deep learning are systems that are basically feed-forward. I mean, there are recurrent nets, but essentially you plug in an input, you propagate signals through the system, and you get an answer. And that's good for perception, for things like vision, audition, etc. It's good also for reactive action-taking. So basically, you're playing a kind of mindless video game, a shoot-'em-up or something, or one of the old traditional Atari games.
You don't have to do a lot of planning. After a while, once you've played with it and trained yourself, you can play reactively: you just look at the screen and you know what to do. Maybe when you learn to play that game, you need to reason, to figure out what sequence of actions to take so that your characters survive, things like that. But after you've trained yourself for a while, you can play reactively. You see this also with grandmaster chess players. You play chess against a challenging opponent, and you have to think really hard; you have to plan ahead, like a combination of moves. But if you play against a master or grandmaster, they will play within seconds. They will just look at the board. And because they know you're not a very challenging opponent, they'll just play reactively. They've sort of internalized this capacity of playing directly from looking at the board, without having to reason very much. It's only when they're facing a challenging opponent that they need to reason. So there are a lot of tasks like this that people do reactively. In fact, in psychology, the famous psychologist, also a Nobel Prize winner in economics, Daniel Kahneman characterizes those two types of thinking as System 1 and System 2, right? System 1 is what you do sort of instinctively, without having to reason and plan. And System 2 is the more deliberate kind of planning: trying to figure out what sequence of actions to take so as to optimize a particular objective, things like that. What we've talked about so far in deep learning is very much System 1. And the question is, how do we do the System 2 type of things?
And I must tell you in advance, this is not a completely solved problem. It's a very active topic of research, and it's not clear that we have an answer to it. It's not clear how many animal species can do this; probably not that many. And probably not all humans can do it very well either. Now, the systems that we've seen so far can learn hierarchical representations of the perceptual world, right? If you train a convolutional net on vision, supervised or self-supervised, it learns good hierarchical representations of percepts, in a way that is invariant to irrelevant transformations. But how do we get deep learning systems to use a working memory? So basically, perform long chains of reasoning, by updating a list of facts maybe, or by simulating the world, right? Say you're building a widget with your hands, out of wood, and you're cutting wood and drilling holes and hammering nails. There's a lot of background knowledge that you have to use to do this. But you also have to plan ahead and have some idea, some physical intuition, about how things work. You're building a model airplane, right? You have to know a little bit about aerodynamics, about stability, things like this. And there are some very basic rules that are very intuitive; nobody necessarily needs to have explained them to you. Or you're sailing. I don't know if any of you has ever tried sailing, but if you're a really good sailor, what you have in your head is some sort of intuitive physical simulation of fluid dynamics. You have an intuitive sense of how the air bounces off the sail, and how the forces balance between the dagger board or the keel that's in the water and the sail, so that the boat can go forward. You don't have to know all of those things, but you sail a lot better if you do. Same for flying.
You have some physical intuition of how an airplane flies. You know, either because you were told or because you have this kind of physical intuition, that if the plane slows down too much, the angle of attack will have to increase if the plane needs to maintain altitude. And past a point, the airplane will stall, which means the air is not going to flow fast enough over the wing to keep it in the air. It's just going to go into a spin, or basically fall out of the sky, and you'll have to recover from that. So we have all of those models in our heads. Some of them are linguistic and some of them are non-linguistic. They are physical intuition models, or models of how other people behave, or animals, which to some extent are much, much more complex than physical models. So, how do we use a working memory? How do we perform long chains of reasoning that require taking into account a lot of different facts, and rules perhaps? That includes logical reasoning, but other types of reasoning as well. How do we remember massive amounts of factual knowledge? We know a lot of stuff that we just store in our factual memory, if you want. And a lot of this is not actually stored in our cortex; it's stored in our hippocampus, which is a special piece of the brain, kind of in the center of the brain, that connects to every part of the cortex and that we use as a short-term memory. So regardless of where you are, you probably remember what you ate for breakfast, if it's past breakfast for you. And you know where the door of the building you're in is, even if you can't see it; you know how to get there. All of this is stored in either your short-term or long-term memory, but it's probably stored in your hippocampus, not your cortex.
So the hippocampus is capable of storing facts and acquiring them very quickly. This is different from learning a skill and compiling it into the weights of a neural net, if you want. It's more like storing a fact immediately, right? One-shot learning, if you want. How do we plan complex sequences of actions? And how do we learn hierarchical representations of action plans? So we'll talk about the first four items in this list here. I'm not saying we have complete solutions to all of those, but there is a lot of interesting work there that is practical. The fifth one is really something that nobody knows how to do. Okay, so I'm not going to talk much about it, but it is a major issue in AI research. Okay, so what is reasoning? One way to view reasoning, and this is not all types of reasoning, but one possible view that encompasses a lot of different forms of reasoning, is reasoning as constraint satisfaction, or energy minimization. That's one of the reasons why we talked about energy-based models so much: an energy-based model allows you to encompass the process of reasoning inside of a learning machine, essentially. The energy function represents the constraints between observed and unobserved variables, or among the unobserved variables themselves. So here's a classic example, which is used very often in lectures on graphical models, probabilistic graphical models, or factor graphs, and I've used that example before. Let's say we have a variable, okay, which can be true or false, which is that your house, your apartment building, just jolted during the night; it woke you up.
And you happen to be in California, or Taiwan, or Japan, or Southern Italy, or someplace where you have earthquakes. You can have two hypotheses: either a truck ran into your house, your apartment building, or there's been an explosion or an earthquake, let's say, all right? Both of those reasons to explain why the house just jolted are pretty unlikely, right? They're rare events. And so the prior probability you will give to them is very low, which is another way of saying the energy you will give to them being true is very high. Okay, so imagine that those variables are binary variables, and you have an energy function that makes you pay a high price for setting a variable to one, because you know a priori that the event is rare. That would be those red boxes here at the bottom. A truck hitting your house, that's very rare. If you're in California, or one of those places where you have earthquakes, earthquakes are rare, but not very rare. And then you take those two variables, which are either true or false, together with the fact that you know that the house jolted, and you plug them into an energy function that basically computes the compatibility between those variables. So it knows that the variable "house jolted" is equal to one, and it's trying to figure out the values of the other variables that it has not observed yet: I don't know if a truck hit the house or if there was an earthquake. Now, because the prior for "earthquake" being true is higher, or the energy is lower, than for a truck hitting the house, which is very rare, you're probably going to infer that there was an earthquake. All right. Now you wake up, get out of bed, look out the window, and you see that a truck actually hit your house. Immediately, you infer that there was no earthquake.
Okay, because now your observation has been explained away by the new observation you just made, that a truck actually hit the house, so you don't need the earthquake explanation anymore. This is called explaining away. And energy minimization inference will tell you this, right? With energy minimization, if you only observe that the house jolted, you're going to conclude with some level of confidence that there was an earthquake, but with a little bit of doubt that maybe a truck hit the house; there could be other variables coming into this thing. Then you observe that the truck actually hit the house. So now this becomes a gray variable that is observed, and all of a sudden the earthquake becomes unlikely, because you explained away the observed variable. Okay, so that's a form of inference, or reasoning, through energy minimization, where the rules that apply to the world are basically seen as energy terms. There's a long tradition in traditional AI of doing this kind of reasoning in terms of constraint satisfaction; there are even programming languages that have been invented to specify things like this. And the whole subfield of graphical models, probabilistic graphical models, Bayesian networks, or factor graphs, actually resulted from attempts to model reasoning as some sort of likelihood maximization or energy minimization. And because you have energies that you can turn into probabilities, you can compute marginal probabilities of all those variables, right? If you have an energy for a particular configuration of variables, you take e to the minus this energy, divided by the sum over all possible configurations of the variables, and you get a probability for that particular configuration. You can marginalize over the variables you don't know.
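To make this concrete, here is a tiny sketch of the earthquake/truck example. All the energy values are hand-picked illustrative assumptions, not numbers from the lecture: a high prior energy for "truck", a somewhat lower one for "earthquake", and a compatibility term that penalizes configurations where the jolt disagrees with "truck OR earthquake". Turning energies into probabilities with e^(-E) and marginalizing shows the explaining-away effect.

```python
import math
from itertools import product

# Hand-set prior energies (illustrative assumptions):
# higher energy = less likely a priori.
E_TRUCK = 5.0      # a truck hitting the house: very rare
E_QUAKE = 3.0      # an earthquake: rare, but less so
E_MISMATCH = 10.0  # penalty when jolt != (truck OR quake)

def energy(truck, quake, jolt):
    e = truck * E_TRUCK + quake * E_QUAKE
    if jolt != (truck or quake):
        e += E_MISMATCH   # the compatibility (rule) term
    return e

def p_quake(observed):
    """Marginal P(quake=1) given a dict of observed variables,
    using P(config) proportional to exp(-energy)."""
    num = den = 0.0
    for truck, quake in product([0, 1], repeat=2):
        if 'truck' in observed and truck != observed['truck']:
            continue
        w = math.exp(-energy(truck, quake, observed['jolt']))
        den += w
        if quake:
            num += w
    return num / den

print(p_quake({'jolt': 1}))             # jolt alone: earthquake is likely
print(p_quake({'jolt': 1, 'truck': 1})) # truck observed: explained away
```

Observing only the jolt makes the earthquake probable; once the truck is observed, the same inference makes the earthquake improbable, exactly the explaining-away pattern described above.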
Okay, so inference, reasoning, particularly probabilistic reasoning, but also logical reasoning, can be seen as a form of energy minimization. So essentially, if you have a large set of problems that you want your machine to be able to solve, what you would have, and this is turning deep learning into something that's very classical in traditional AI, good old-fashioned AI, is a list of variables. It's called a knowledge base. It's a list of predicates, and those predicates can be true or false. They can have a value attached to them, which is basically an energy; you can think of it this way, right? The values of some of those are known, those are the ones that are observed, and some are unknown. And then separately, you have a database of rules, and rules are just energy terms. Each of those energy terms will take a subset of the variables in the long list of variables and compute a compatibility value, for whether those variables are compatible or not. So for example, in the simple example of the earthquake or the truck, you could take those three binary variables, "the house jolts", "there was an earthquake", and "a truck hit the house", and you can plug them into a rule, which is just an energy term that computes the logical OR of the first input and the second input, so "a truck hit the house" OR "there's an earthquake". And if the logical OR of those two variables is true, then "the house jolts" is also true, and vice versa. So it's a constraint between "the house jolts" and either "a truck hit the house" or "there was an earthquake", or both; the OR here is actually an inclusive or, right? So you can think of this as a constraint. If those variables were continuous, or distributions, or things like this, then you could have an energy function.
If they are binary, you can think of this as a rule, essentially, that implements a constraint between those variables. And you can perform the inference in any direction, right? If you know that there was an earthquake, you can deduce that the house jolts, if the earthquake is kind of local, you know, things like that. So you could view this form of inference by energy minimization as a kind of weird, continuous, smooth form of traditional logic-based reasoning, where you have a list of variable values, some of which are known and some of which are unknown or unspecified, and a list of rules, which basically are energy terms, each of which takes a subset of those variables into account. And what you do is look for rules for which a subset of the variables are known, and you apply the rule, which means you infer a value, or a list of energies for each of the possible values, for the variables that are unknown. If a variable is binary, you just have to remember two energies for it. And you keep doing this with all the rules until the variable values stabilize. You can think of this as a sort of probabilistic logical inference, if you want. Now, I can't say that there are a lot of practical implementations of this where deep learning is used to learn the rules. This is a thing that people are working on; it's very much at the research level, and there is interesting work in this area, but I can't say that it's very practical yet. Those kinds of things have been practical in the past in the context of graphical models, probabilistic graphical models, factor graphs, and things like this.
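As a rough illustration of that loop, here is a sketch on the same earthquake example: variables with prior energies, a rule as an energy term over a subset of them, and repeated application of the rules until the values stabilize, using coordinate-wise minimization (iterated conditional modes). All energy values and the update order are illustrative assumptions; note that this greedy scheme can get stuck in local minima depending on the update order, which is one reason real systems use more careful inference.

```python
# Variables with unary prior energies for values [0, 1]; all numbers assumed.
prior = {'quake': [0.0, 3.0],
         'truck': [0.0, 5.0],
         'jolt':  [0.0, 0.0]}

def or_rule(assign):
    # Constraint: jolt == (truck OR quake); pay 10 if violated.
    ok = assign['jolt'] == (assign['truck'] or assign['quake'])
    return 0.0 if ok else 10.0

rules = [or_rule]
observed = {'jolt': 1}

# Start from the observed values; unknowns default to 0.
assign = {v: observed.get(v, 0) for v in prior}
changed = True
while changed:                       # apply rules until values stabilize
    changed = False
    for v in prior:
        if v in observed:
            continue
        energies = []
        for val in (0, 1):           # two energies per binary variable
            trial = dict(assign, **{v: val})
            energies.append(prior[v][val] + sum(r(trial) for r in rules))
        best = 0 if energies[0] <= energies[1] else 1
        if best != assign[v]:
            assign[v] = best
            changed = True

print(assign)  # quake=1, truck=0: the earthquake explains the jolt cheaply
```

With these energies, setting "quake" to 1 satisfies the OR constraint at a lower total energy than setting "truck" to 1, so the loop settles on the earthquake explanation.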
But the rules and the facts are basically written by hand, if you want, including the energies or log probabilities or things like that, right? They're all pretty much written by hand. So learning things like this, although in principle possible, might be difficult. Okay, here is another form of reasoning through energy minimization, and that would be planning. So you're facing a situation where you want the piece of the world in front of you to end up in a particular state. You're building a widget, for example; you have to figure out what sequence of actions to take to arrive at the result. Or you're planning to go to class at NYU: there's a whole sequence of things you'll need to do. You have to figure out how to plan this. I have to get out of my building, so I need to walk towards the door; where is the door? I remember where the door is, I need to stand up and walk towards the door, open the door, etc. And then you have to decompose all of those acts into kind of millisecond-by-millisecond control of your muscles. So this is a kind of reasoning and planning that is classical in AI as well. It's also classical in a field of engineering called optimal control, or control theory. And the way this is done in robotics, for example, or in other types of control situations, is that you observe the world. The world is X, okay, that's your observation of the world. You run it through a perception module, and that perception module extracts some idea of the state of the world, and you call that s. It's going to be incomplete, because you don't have a perfect observation of the entire world; you just have an approximate observation of the part of the world around you that goes into your sensors. So you have some estimate of the state of the world, s.
Let's say s of t at time t, because we're going to run this over several time steps. So s(t) is your estimate of the state of the world at time t, or at least the relevant part of the state of the world at time t. And you have an internal model of the world, a predictive model of the world, which is your ability to predict what the consequences of your actions are going to be on the world if you take a particular action. If you take a step forward, you know that you're going to move forward; you're not going to be immediately transported to the other side of the world, right? You know that your trajectory as a physical object is somewhat continuous; you can't just snap your fingers and jump to another place. So you know that if you take a step in a particular direction, you're going to go in that direction by the length of a step, right? That is included in your model of the world and your model of your own dynamics. So given an action you're taking, a(t), you feed this into the model of the world together with the previous state of the world, and you get the next state of the world. Okay, but there is an issue here, which is that the world is not entirely predictable. There's a lot about the world that you're not observing. We're going to represent this by a latent variable z(t), and you may have some idea of the prior over that latent variable. So for example, I'm in a room right now that has a door; I actually don't know if the door is closed or open. I think I remember that I closed it, so my prior is going to be that the door is closed, but maybe it's not. Depending on whether it's closed or not, I'm going to have to decide whether to open it once I see it. That will change my plan, if you want.
So there may be a lot of things that happen in the world that you just cannot possibly predict, because you don't have complete observations. This is called epistemic uncertainty, which means your uncertainty about the world is due to the fact that you don't have complete knowledge of the world. There's another type of uncertainty, called aleatoric uncertainty, which is due to the fact that the world maybe is intrinsically unpredictable or stochastic. Okay. So this may take us to a kind of philosophical question here. Even if you assume that the world is completely deterministic, which actually physicists tell us it is, at least in some interpretations of physics, we still may not be able to predict what the future is going to be. Let me take an example. You can play heads or tails with a coin. If I throw a coin in the air, catch it in my hand, and ask you whether it's heads or tails, you basically don't have much information to decide; you're going to assume the coin is fair, and so you give probability one half to each of the two outcomes. Now, imagine you have access to a ridiculously powerful supercomputer, as well as ridiculously powerful perception systems or sensors, that basically give you the state of the entire chunk of the universe within a cubic kilometer around me. That includes the entire state of my brain, right? That would be an enormous amount of information, because you'd have to know the position of every atom and molecule, and the entanglements between particles; it would be just insane. But imagine this is possible, and you have a supercomputer the size of our entire universe that's capable of simulating this.
Then there is no uncertainty, or very little uncertainty, about whether the coin is going to end up heads or tails, because you can probably just simulate the entire thing and predict the outcome. So why am I telling you this? Because it's not entirely clear what is aleatoric and what is epistemic uncertainty. It depends on your knowledge, and it depends on your computing power, if you want, your computing ability. So for one person, a phenomenon might look completely random; for another person, that same phenomenon might look organized. Let me take an example. If you're used to listening to Western music, and all of a sudden you listen to, I don't know, Indian classical music, it sounds very strange, right? There's a lot of things that you just can't fathom; basically, you can't predict where the music is going to go. Or you're used to listening to a particular type of folk music, and you're exposed to, say, 1960s jazz improvisation. To you, it may look completely random. But it's not; it's actually very organized. It's just that there is structure that you don't perceive, right? So a lot of things like this depend on your training. But so much for the little bit of philosophy and epistemology. So, we have a model of the world that takes three inputs: the current estimate of the state of the world; the latent variable, which maybe represents what we don't know about the world and that we may draw randomly; and an action we're taking. And our model of the world predicts the next state. We can use this for planning. So let's imagine that the world is essentially deterministic, so we don't have much of a latent variable.
Let's say we are launching a rocket, and we want the rocket to rendezvous with the International Space Station or land on the moon. We can completely plan a trajectory. We have a complete dynamical model of the rocket. We know exactly how much it weighs; we know how much fuel it burns per second for a particular throttle; we can control the nozzle orientations and the thrust; we know the density of the atmosphere wherever the rocket is; we know the force of gravity. We know a lot of that stuff, right? So we can write down the physical model of what the state of the rocket is going to be in one millisecond, knowing the state of the rocket right now and the action I'm taking, which is the thrust of the engines and the direction of the controls. And the other variables that come in are the environment variables: how high I am, what the density of the atmosphere is, the temperature, or whatever, right? So what I can do is imagine a sequence of actions, run this model forward, and compute the trajectory of the rocket. After a while, the rocket is going to be at some location, in orbit or some place, and I can compute the squared error, let's say, which would be the squared distance between the rocket and the space station, both in terms of position and velocity. And I'm going to put this into this C function. So the C function takes the state as an input variable and computes a cost: how satisfying is my state, right? I can compute the sum of those cost functions over a trajectory, which would be, for example, the sum of the squared distances of the rocket to the space station over the trajectory.
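This scheme, unrolling a known dynamics model G over a sequence of actions, summing a cost C(s) over the trajectory, and then backpropagating through the unrolled model to improve the actions, can be sketched in a few lines. The rocket is shrunk here to a 1-D point mass; the dynamics, the cost weights, and all the constants are illustrative assumptions, not a real trajectory-optimization setup.

```python
import torch

def rollout(actions, dt=0.1, target_pos=10.0):
    """Unroll the assumed point-mass dynamics and accumulate the cost."""
    pos = torch.tensor(0.0)
    vel = torch.tensor(0.0)
    cost = torch.tensor(0.0)
    for a in actions:
        vel = vel + a * dt                  # s[t+1] = G(s[t], a[t])
        pos = pos + vel * dt
        # running cost: squared distance to the target position, a weaker
        # velocity penalty, and a small "fuel" penalty on the action
        cost = cost + (pos - target_pos) ** 2 + 0.1 * vel ** 2 \
                    + 0.01 * a ** 2
    return cost, pos, vel

# The plan: one thrust value per time step, optimized by backprop
actions = torch.zeros(50, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)
for _ in range(1000):
    cost, _, _ = rollout(actions)
    opt.zero_grad()
    cost.backward()                         # backprop through the unrolled model
    opt.step()

final_cost, final_pos, final_vel = rollout(actions.detach())
print(float(final_cost), float(final_pos))  # cost drops; position nears the target
```

Because every step of the rollout is differentiable, the gradient of the summed cost with respect to the whole action sequence comes out of a single backward pass, which is exactly the backprop-through-a-dynamical-model idea described here.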
And then what I can do, through energy minimization, by minimizing the sum of those costs over the trajectory, is find a sequence of actions that will minimize that sum. And that minimization I can do by backprop: I can just backpropagate a gradient through this entire system and figure out the sequence of actions that will minimize that cost. If the cost is the squared distance of the rocket to the space station in terms of position and velocity, the effect of this planning is going to be to get the rocket as close as possible to the position and velocity of the International Space Station as fast as possible, because it minimizes the integral over the trajectory. So it's going to try to get there as fast as possible. You can put other costs in there, things like fuel consumption, or constraints, like constraints on how high I want the throttle to be; all kinds of things you can wrap into this cost function C. But the point is, I can do planning this way by basically minimizing an energy, and that's pretty complex planning. In fact, this is the kind of planning that people cannot do by hand to plan the trajectory of a rocket; you have to use computers to do it, for that reason. And that's why people in various space agencies around the world, though in the sixties it was mostly the U.S.
and Russia, were doing this initially by hand, and then eventually using what is now called model predictive control, which is this process of basically unrolling a dynamical model, a mathematical model of the dynamics of the rocket, and then, by gradient descent or some other minimization method, figuring out the sequence of actions that will minimize a particular cost function that characterizes good trajectories versus bad ones. A lot of this was developed in the 1950s and 60s in Russia and the U.S. and other places, and they were really active on it because of the space race, right? And in fact, the idea of backpropagating gradients through a structure like this, which you can think of as basically a recurrent neural net, to find a sequence of actions, goes way back. In the U.S. it is known as the Kelley-Bryson algorithm, and it goes back to the early 1960s. And in Russia, there was theoretical work along those lines by the mathematician Pontryagin; there's something called Pontryagin's maximum principle. So things like this were developed back then; in a sense, backprop was invented by these guys, okay? They didn't use it for learning; they used it for model predictive control, optimal control. Okay, so there's a very nice movie called Hidden Figures, which some of you may have seen. It's the story of mathematicians who worked at NASA in the late 1950s and early 60s. Those were mostly Black women who were trained in mathematics, and they were employed by NASA to compute trajectories for rockets before there were computers.
In fact, their job title was "computers"; they were mathematicians, but that's what NASA called them. Before electronic computers, they had calculating machines and things like this, and they worked by writing down equations and solving differential equations, basically, solving them numerically when they couldn't solve them analytically, and computing trajectories for all the rockets. And then in the 60s, they trained themselves to program in Fortran, because computers became available at NASA, and they were probably among the first to implement the kinds of algorithms I just talked about. Did you watch that movie? It's a wonderful movie, really very interesting. It tells a lot about the social history of the U.S., which is complicated. There is a question about this diagram. Okay. "Do I understand it correctly? So, A is the control, and Z is like noise, uncertainty in the system?" Right. So A is the control, the command, if you want. For a rocket, it would be the thrust of the engines, the control of the nozzles, the direction of the nozzles, maybe the other jets; for some spaceships, it would also be the fins, if it has fins. Anything that you can use to affect the state of the system, essentially. In a car, you essentially have two things: you have the angle of the wheel, and you have the position of the pedals, the accelerator and brake; those are the two controls in a car. Anything that you can use to change the state of the system under consideration.
In a video game, it's the joystick position and the button actions or whatever, right. And Z is the part of the state of the world that you think is relevant for the problem but that you do not directly observe, and that you may have to infer from observation or just draw randomly. And if you have to draw it randomly, then you may have to run this multiple times for multiple drawings of the Z variable, which may result in different outcomes. Okay. So for example, suppose you use this for, I don't know, financial investment. Then A is whether you buy or sell a particular financial instrument, G would predict the next state of the stock market given the action you took, and Z would represent the actions that everybody else is taking — a huge vector. And you assume that those people are rational, so they optimize their own objectives. So there would be what's called a multi-agent system, where you don't have one copy of this but as many copies as there are important players on the stock market, if you want. And then you can run this forward, but what's important there is that you don't know what the other guys are doing, so you have to infer it from the observations you're making at every time step. And you have to hedge your bets. If your prediction is very uncertain — if there's a huge amount of uncertainty about where the world is going, where the stock market is going — then there may not be a definite optimal action sequence you can take. Okay. There's going to be a risk attached to it that you may need to estimate as well. So for multiple drawings of the Z variables, you're going to have multiple outcomes, right? 
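The "multiple drawings of Z" idea can be made concrete with a tiny Monte Carlo sketch: run the world model forward many times with different samples of the latent variable and look at the spread of outcomes. The dynamics function `g` and the Gaussian model for Z below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(s, a, z):
    # next state depends on our action a and on what "everybody else"
    # does, summarized by the unobserved latent z (toy additive model)
    return s + a + z

def outcome_distribution(s0, action, z_std, n_draws=10_000):
    """Draw Z many times, roll the model forward once per draw,
    and summarize the resulting distribution of outcomes."""
    zs = rng.normal(0.0, z_std, size=n_draws)   # multiple drawings of Z
    outcomes = g(s0, action, zs)
    return outcomes.mean(), outcomes.std()

# a "calm" latent (the car driving straight) vs a "noisy" one (swerving)
mean_calm, risk_calm = outcome_distribution(0.0, 1.0, z_std=0.01)
mean_noisy, risk_noisy = outcome_distribution(0.0, 1.0, z_std=1.0)
```

The standard deviation of the outcomes is a crude stand-in for the "risk" attached to the action: a high-variance Z means the same action can lead to very different futures.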
So it's very similar to when you play chess: there is a Z variable, which is what the other player is going to play in response to your move, right? You're going to play, your opponent is going to play, and you don't know what your opponent is going to play, so there are multiple options. Then for each of those options, you're going to play something, and your opponent is going to play something else, but you don't know what. So you need to explore that tree of possibilities. In this case, the Z variable is discrete — it's what your opponent is playing — and every time you take a step, the opponent can take something like 36 different actions. So you're going to have 36 possible values for s(t+1), right? And it grows exponentially as you go down. Okay, so that's a discrete situation, where you have to do tree exploration, essentially. You can also imagine continuous situations. For example, you're driving on the highway and there's a car right next to you, and you observe the car swerving a little bit — probably the driver is looking at a smartphone, or not paying attention, or fiddling with the radio. So you have a lot of uncertainty about the position and the future position of that car. As a consequence, to hedge your bets, you're probably going to change lanes so that you stay away from this car, and you're going to pass it pretty quickly, so that if by any chance the car swerves again, you're not going to be hit by it. If, on the other hand, the car is driving really straight, your uncertainty about where this car is going to be next is relatively small: the latent variable of what the driver is doing, if you want, has a fairly small variance. 
You pretty much know the car is going to stay in its lane and not swerve too much, and so it's pretty safe to just drive past it in the next lane, right? But if the car is acting weird, you're going to stay away from it. So that's an example of planning ahead by having a model of the world, where you take into account the uncertainty about your prediction to pick a course of action that will minimize your cost — your cost being high if you hit another car — and where taking the uncertainty into account keeps the likelihood of hitting another car small. We'll see a more complete example of this, and I'll come back to it in a practicum. Okay. So what I'm saying here is that this process of inference by energy minimization can be used for all kinds of different things: for planning, for logical reasoning, probably for psychological reasoning, for constraint satisfaction of various kinds, problem solving of various kinds. It's a pretty general thing. In practice, it comes down to: do I have a good model of the world, first of all? What is the uncertainty, and how do I make models that allow multiple predictions? Which is why we've talked about latent-variable energy-based models, essentially. And then, can I, by gradient descent or some other process, find a sequence of actions that will minimize the cost? So we've talked about inference by energy minimization, but there are other types of inference, which to some extent can be reduced to that, but which people tend to think of in another way. And this is the idea of basically having a working memory. 
So it's similar to the situation I was describing here: you have a kind of working memory that contains variables and values, and you have rules that you apply recursively. Every time you apply the rules, you take facts from the memory, compute new facts that may be true or false, and then write those values back into the knowledge base. So you could think of an architecture to do this that would be what's called a memory-augmented network. It's essentially a recurrent neural net — you can think of it this way. So this could be a few layers of a neural net, or a very complex one, and you take the output of this neural net and feed it back to the input. But what this neural net does is that at every time step it reads from a memory — perhaps existing facts, right, the working memory, or statements, or words, whatever. So it reads from this memory, then it crunches on it in one cycle, and then it writes stuff back to that memory. So essentially you'll have one step of a recurrent net, and that recurrent net will produce an address for a memory, and the memory is just another module, okay? So this is the memory, and this is the address of the memory. You feed a vector to this memory, the memory uses that vector as a kind of address and returns another vector, and that goes into your recurrent net again, okay? So there's also an input here, and the recurrent net also has an internal state that it feeds to the next step. So we can call this s(t), we can call this q(t) for query, and we can call this v(t) for value, coming out of the memory. And if we have a memory that we don't write into, we can just repeat the process at the next time step, et cetera. 
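One step of that loop — state produces a query, the memory returns a value, and the state is updated from the input and the value — can be sketched as follows. All the shapes, the random weights, and the tanh update rule are arbitrary choices for the sketch, not a specific published architecture; the memory read here is the soft lookup described in a moment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # state / vector dimension
n_slots = 4                # number of memory slots

keys = rng.normal(size=(n_slots, d))      # memory addresses
values = rng.normal(size=(n_slots, d))    # memory contents
W_q = rng.normal(size=(d, d)) / np.sqrt(d)           # state -> query q(t)
W_s = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)   # state-update weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(q):
    # soft, differentiable lookup: match the query against every key
    c = softmax(keys @ q)
    return c @ values

def step(s, x):
    q = W_q @ s                      # query q(t) from the current state
    v = memory_read(q)               # value v(t) read from the memory
    s_next = np.tanh(W_s @ np.concatenate([x, v]))   # crunch in one cycle
    return s_next

s = np.zeros(d)
for t in range(3):                   # repeat the read/crunch cycle
    s = step(s, x=rng.normal(size=d))
```

A writable memory would add a symmetric write step (producing new keys/values from the state); here the memory is read-only, matching the simpler case mentioned above.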
So here's an example of how this has been used. This is relatively old work from some of my colleagues at Facebook, called memory networks. What they did was write into the memory a number of statements that correspond to a story, okay? So for example: John goes to the kitchen, John picks up the milk, John moves to the backyard, he drops the milk there, then Jane goes to the bedroom, blah, blah, blah. So you have a sequence of statements like this, each encoded into a vector that is stored in the memory, essentially. Every statement in the story is basically a vector in the memory. And the way you train the system is, you unroll it in time three or four times, and you give it a question here — so there would be a network here that encodes the question into the state — and then you have another network here that produces an answer, a text, which could be yes or no. The question could be: how many people are in the kitchen right now? And the answer could be three. Or it could be: where is the milk? And the answer would be: in the backyard. Or the question could be: were John and Jane ever in the same room at the same time? Something like this, right? So you put a question here, run it through the system — and the system has access to the whole list of events in the story, in this memory — and then in the end, you train the system supervised, by telling it the answer to that question. You backpropagate gradient through all of this, and the system learns to change the parameters of the memory — I'll tell you in a minute how it's implemented — as well as the parameters of the recurrent net, in such a way that for any question, you get the correct answer. 
And you have to train this on a lot of data, so it only works with relatively artificial toy problems where you can generate as much data as you want. Okay, this is called a memory network, and it's a paper by Jason Weston and his colleagues at Facebook — one of the early papers coming out of Facebook AI Research. A paper that appeared shortly thereafter, also by people from Facebook AI Research — Joulin and Mikolov — is on stack-augmented recurrent neural nets, where the memory is a stack: you push facts onto it and you can pop them. This is good for parsing, for example. Almost simultaneously, or a little after, a couple of papers from DeepMind came out — the Neural Turing Machine and the Differentiable Neural Computer — which also had this idea of a recurrent net that talks to some sort of differentiable memory. Okay, so now, how do you implement a differentiable memory? This is an idea that at the time was very new and not very popular, but it then completely took over the space of things like NLP, and now even computer vision: the idea of having a soft, differentiable associative memory inside a neural net. So how do you do this? You build the memory this way. You have an input vector, which you can view as an address, right? Think of how a computer memory works: you give an address, which is a string of bits. That address is compared with a bunch of binary templates which represent all 2^N binary combinations, right? So if your memory has 64 kilobytes, for example, the address is 16 bits, and there's a circuit that compares the 16 bits of the address to every possible combination of 16 bits, of which there are 65,536. One of them is going to match, and your memory chip outputs the eight bits that are at that location, okay? 
That's how a RAM chip works, basically — of course, RAM chips are much bigger than 64 kilobytes nowadays. So here we're going to do the same thing, but in a soft, continuous version. Okay, so we take a continuous vector of some dimension. We compare that vector with a bunch of so-called key vectors by computing dot products, okay? The dot product is going to be large if the two vectors are aligned, essentially zero if the two vectors are orthogonal, and minus one if the two vectors are opposite — well, not exactly minus one, because they're not necessarily normalized, but negative. Okay, now you take all of those dot products and plug them into a softmax. So now what you get is a bunch of numbers that are between zero and one and sum to one, okay? And what you do next is, you have a bunch of so-called value vectors, which basically represent the content of the memory, and what you compute is the weighted sum of those values with those coefficients, okay? So imagine that one of the key vectors is exactly identical to the input vector — say this guy — and orthogonal, let's say, to all the other ones. Then here you're going to get a positive dot product, and for all the other guys you're going to get zero. You plug this into a softmax, and you get a high coefficient for this guy and much smaller coefficients for all the other guys — how much smaller depends on the beta in your softmax and on the lengths of the vectors as well. So on the output, because all those other guys basically don't count — their coefficients are small — you're only going to recover the corresponding v_i, okay? And that's basically a RAM chip. That's how a RAM actually works. 
It does compute a sum, actually, but because only one of the values has a coefficient of one, you only see that one; you don't see the other ones. Here, because those coefficients are continuous, we're going to see some linear combination of all those value vectors at the output of our memory. Now, the cool thing about this is that it's all completely differentiable. You can backpropagate gradient from the output all the way down to the query, to the input, to the address. You can compute the gradient of the output with respect to the value vectors — you get a large gradient for the value that had a high coefficient and smaller gradients for the ones with small coefficients. You can even backpropagate all the way to the keys, so you can actually learn keys that will produce an output that is useful subsequently, okay? So that's the formula here: you compute the c_i's, which is a softmax applied to the dot products of the input address — called the query — with the key vectors. That gives you a bunch of coefficients between 0 and 1 that sum to 1. Then you compute the linear combination of the value vectors, where the coefficients are those c_i's, and that's the output, okay? So that was the idea in this memory network, and since then it has been reused a lot, for all kinds of situations. This mechanism of deciding what input or what vector to pay attention to, based on the dot product of a query with a bunch of keys, is called the attention mechanism. People now call this attention, or soft attention. I don't know if it's an appropriate term, but that's the term that is now accepted for this. What it does is basically cause the network to pay attention to essentially one, or a subset, of those Vs. Now, imagine that those Vs are not stored vectors, but are themselves outputs of another neural net — the lower layers of a neural net. 
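The soft associative memory just described fits in a few lines of numpy: coefficients c_i = softmax(beta * k_i . q), output y = sum_i c_i v_i. The beta parameter sharpens the softmax; a large beta approaches a hard RAM-style lookup. The toy keys and values below are chosen so that the query exactly matches one key and is orthogonal to the others, the case walked through above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_read(q, keys, values, beta=10.0):
    c = softmax(beta * (keys @ q))   # one coefficient per key, sums to 1
    return c @ values                # weighted sum of the value vectors

# orthogonal keys: a query matching key 0 should recover value 0
keys = np.eye(3)
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])
y = attention_read(np.array([1.0, 0.0, 0.0]), keys, values)
```

Every operation here (dot products, softmax, weighted sum) is differentiable, so gradients flow to the query, the keys, and the values — which is exactly why the keys and values can be learned.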
Then what a system like this can do — or maybe the Vs are the sentences of a story, or the words of a sentence — is essentially dynamically choose to pay attention to a particular output of the previous layer, or a particular word of the input sentence. And that proved to be revolutionary in the context of things like NLP, and particularly translation. You've probably heard of Professor Kyunghyun Cho, who is at NYU. When he was a postdoc at the University of Montreal, together with Yoshua Bengio and Dzmitry Bahdanau, he published a paper on this idea of using attention for translation, and it basically revolutionized the field. That's what he is most famous for. Okay, the idea he proposed was: let's say you have an input sentence. You have a particular word, and you represent this word by a very large vector, which is basically a one-hot vector — a vector the size of the vocabulary. For English, that could be 100,000 or 200,000 or something. This vector has one location that represents which word it is; let's say it's the word "cat". Wherever "cat" appears in the vocabulary, you set that component to one and all the other ones to zero. And then you multiply this by some matrix, and with this you produce a vector of dimension, let's say, 500 or something. You're going to learn this matrix, of course — it's part of the network — and you do this for every word in your sentence. So a sentence is basically a sequence of those things; you run each of them through the same so-called embedding matrix. Okay, and multiplying a one-hot vector by a matrix is very simple, because it consists of just selecting the row or column of that matrix for which the component is one — you don't need to actually do the product. There are special functions in PyTorch to do this. Right. So what you get now is a sequence of vectors that represents the input sentence. And let's say you want to do translation. 
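That shortcut — one-hot times embedding matrix equals a row lookup — is easy to verify. The vocabulary and embedding sizes below are toy numbers; this is what `nn.Embedding`-style lookups in PyTorch do under the hood, without materializing the one-hot vector or the full product.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # learned embedding matrix

cat_index = 3                     # position of the word "cat" in the lexicon
one_hot = np.zeros(vocab_size)
one_hot[cat_index] = 1.0

via_product = one_hot @ E         # the full matrix product
via_lookup = E[cat_index]         # the row selection a lookup table does
```

The two results are identical, which is why no framework ever does the actual multiplication for embedding layers.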
So this is a sentence in, say, Mandarin, and you want to translate it into English — or let's say it's English to German. Okay, so this is an English sentence that you want to translate into German. Now, there's an issue with German, which is that the word order in German is very different from the word order in English. What's more, German has a lot of compound words: they look like a single word, but they're actually multiple words stuck together. And so to find an appropriate translation for an English sentence, when you're about to produce the first output word, you have to figure out: what is the corresponding word, or expression, or combination of words in the input sentence that I need to pay attention to in order to translate? And that's where attention can play a role. So there's going to be some big neural net here, but there's also going to be another neural net, and this neural net is going to take into account all of those vectors, or a big subset of them. It computes the dot products of those vectors with some vector, and then produces coefficients — attention coefficients. These are numbers between zero and one; you multiply those vectors by those coefficients, compute their sum, and feed that to a few layers of a neural net that is going to predict what the first word is. What this guy — this red guy — can do is essentially choose, at any particular point, which of the input words is relevant for producing the corresponding output word. Now, this word is produced. And so what you do next is take this word and feed it to the next copy of this network. So you replicate the network, if you want; the weights can be shared. There are various architectures to do this, but you can imagine doing it this way. This copy is going to take the previous word into account and then produce the second word. 
And you keep doing this. So this was not the first but the second successful attempt at doing language translation with neural nets. The first attempt was a gigantic multi-layer LSTM, by Ilya Sutskever, who was at Google at the time. But then the Montreal group, with Kyunghyun Cho, was able to get really good results with a much more compact network that used this notion of attention. That completely revolutionized the field in just a few months. A team from Stanford implemented this idea and won a big language translation competition with it, and all of a sudden the entire industry jumped on the idea and started implementing translation systems based on attention. A lot of people in natural language processing said: this works for translation, it might also work for other things in language interpretation. And then there was a paper from a team at Google whose title was "Attention Is All You Need" — this was a few years ago. Basically what they said is: you can build a neural net entirely out of modules that are essentially associative memories, very similar to the ones we just talked about. Their entire network was built out of modules of this type, and they called it a transformer. Before I go there, I want to talk about something else. This is less than two years old. This is Sainbayar Sukhbaatar — he was a PhD student here at NYU with Rob Fergus and me, and he's now a research scientist at Facebook in Paris. He did this work at Facebook. It's basically one of those associative-memory networks, where you have a feedforward network that produces inputs to itself — think of it as a recurrent net — as well as an associative memory here with this attention mechanism. And he came up with several architectures to use this, where basically the entire network is based on this kind of attention. 
You can think of this as a neural net that has memory, because every time you backpropagate gradients with respect to the values that are in the memory, those values change — they can change a lot — and so the system can simply remember things. You can just store things in memory. So perhaps this could be a model for how the hippocampus in the brain stores memories. Memories in the hippocampus are stored in fast-changing weights of neurons — weights that can change much faster than the synaptic weights in the cortex — and they're used essentially for short-term memory. There are people, older people for example, who basically lose short-term memory completely, and it's essentially because their hippocampus shrinks. So they'll tell you a story, and then you see them the next day and they may not remember that they saw you the day before; they'll tell you the same story again. And the story they tell you is probably an old one, because if it's something that happened last week, they probably don't remember it. So if you don't have a specific module in your brain to store facts — short-term memory — you cannot remember things for more than about 20 seconds. There is some memory in your cortex, because the activity of the neurons in your cortex has a state, and if your cortex were some sort of recurrent net, you could think of that state as a kind of memory. But it's been shown that the state of the cortex basically becomes independent of its initial state within about 20 seconds. So you cannot remember things with the state of the neurons in your cortex for more than about 20 seconds; if you want to remember things for longer than that, you need your hippocampus. That's your RAM, if you want. Now we're going to talk about transformers, particularly transformers that are pre-trained in a self-supervised manner. And we've talked already about denoising autoencoders. 
So this is going to use the technique of denoising autoencoders, except it's called a masked autoencoder — but it's really the same thing. And that led to another paper by Google, with a model called BERT, which stands for Bidirectional Encoder Representations from Transformers. The name really doesn't matter, but there was a sort of tradition in the field of finding acronyms that corresponded to Sesame Street characters. If you're from outside the U.S. and you don't know what Sesame Street is, it's a TV show for kids, with various characters called Bert and Ernie and so on. And so there's a whole series of deep learning models in natural language processing that are named after Sesame Street characters, and BERT is probably the most famous one. Okay, so you've heard about denoising autoencoders — I believe you heard about this last Thursday from Alfredo. I talked about it too, but Alfredo talked a bit more about it, right? So you start from a Y, which comes from your data. You corrupt it, which means you block certain pieces of it, or you add noise to it, or something like that — you corrupt it in some way. Then you run it through a neural net and you measure the reconstruction error. So you're training the system to reconstruct the uncorrupted Y from a corrupted version. And to be completely consistent in notation, this X here should not be X — it should be Y hat, with the Y tilde on top for the corrupted version, as on my slides. So just pretend this is Y hat. And the idea of training — so the transformer came with this paper, "Attention Is All You Need", which came out from Google, and then there was the following paper by Devlin, which had the idea of training a transformer using the denoising autoencoder idea, or, if you want, a kind of contrastive energy-based training. 
You can think of it this way — you can interpret it this way. The idea goes back a long time: denoising autoencoders go back to Pascal Vincent. In fact, I had some stuff on this in my PhD thesis in 1987, but in a different context. Collobert and Weston, in the context of NLP, actually used a contrastive training idea to pre-train their system to represent text as well, so they were really pioneers there. But it really became popular with this paper on the BERT model. So: take a piece of text, remove some of the words from that text — typically 10 to 15% of the words — and replace them with a blank marker. In some cases you actually replace them with another word, just a random word of some kind. You run this through the system and train the system to predict the words that are missing. You tell the system which words are missing, and you train it to predict them. Now, obviously, the system cannot do a perfect job at this. If I say "the cat chases the blank in the kitchen", you can probably guess that the blank is "mouse". Or if I say "the blank chases the mouse in the kitchen", you can probably say that the blank is "cat". But if I say "the blank chases the blank in the blank", you may have no idea. And if I say "the blank chases the blank in the savannah", you can probably guess it's neither a mouse nor a cat — it's probably a lion and a zebra, or a wildebeest, or whatever it is that lions eat in the savannah. Because you know how the world works and you have some background knowledge, you can fill in those blanks. So the idea — and this is the whole idea of self-supervised learning — is that by training a system to fill in those blanks, the system will learn the role of words in the sentence, their grammatical role, as well as some semantics. It will know that a cat can chase a mouse, and that a lion could chase a mouse but probably won't, things like that. 
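Generating the training pairs for this fill-in-the-blanks objective is simple to sketch. Below, roughly 15% of the tokens are blanked out and the original words are kept as prediction targets; the whitespace tokenization and the `[BLANK]` marker are simplifications of what BERT-style pipelines actually do (subword tokens, a special mask symbol, occasional random-word replacement).

```python
import numpy as np

rng = np.random.default_rng(1)
MASK = "[BLANK]"

def mask_sentence(sentence, p=0.15):
    """Corrupt a sentence by blanking tokens with probability p.
    Returns the corrupted token list and a dict mapping each masked
    position to the word the net must be trained to predict there."""
    tokens = sentence.split()
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            corrupted.append(MASK)
            targets[i] = tok       # the word to fill in at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = mask_sentence("the cat chases the mouse in the kitchen")
```

Training then consists of running the corrupted sequence through the network and applying a per-position classification loss only at the positions stored in `targets`.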
So what the system produces is not a single word; it produces essentially an energy for every word in your vocabulary — or, if you run this through a softmax, a probability for each word in the dictionary — which you can train using cross-entropy, as a classifier, essentially. So: here is the correct word that appeared in this position, here are the logits coming out of the weighted sums of the last layer of my network, and I compute the negative log-likelihood cost between them, which is the cross-entropy between the one-hot vector 0, 0, 1, 0, 0, 0 and whatever comes out of my softmax — just like a regular classifier. And I do this independently for all the words that are missing — which is, by the way, incorrect, because if you do this independently for all the missing words, you're basically assuming that the missing words are independent of each other, which they are not. But regardless, this works amazingly well. You can pre-train the system to learn representations of text by training it to fill in the blanks, and if you take the representation of the text somewhere inside the network, it works really well — particularly well if the architecture you train is a transformer architecture. Okay, so this was the paper by Devlin in 2018, and it really caused a revolution. The paper was put on arXiv after it was submitted to the ICLR 2019 conference — or maybe it was 2018, I can't remember. And within the six months between the time it was posted on arXiv and the time it was presented, it had 680 citations or something like that. So this really took the world by storm, even before it was officially presented at the conference, just because of the arXiv publication. There were really quick follow-ups to this: RoBERTa is a version of BERT that was built at Facebook, which is open source. And since then there have been lots and lots of contributions. 
I kind of screwed up the reference here — this one should be Devlin. One of these is "Attention Is All You Need"; the other is the self-supervised pre-training of transformers. Okay, so what is a transformer? There's a question first. Yes. "So transformers are basically a differentiable hash table. Are there other data structures which are differentiable and used in other types of architectures?" Well, yes and no. Yes, in the sense that if you can come up with differentiable versions of a lot of interesting functionalities that are normally used in computers or algorithms, then that's great. But there isn't a huge number of those; there's only a small set of classes of architectures, which we've talked about, right? There's this differentiable associative-memory type of architecture that we just talked about. There are recurrent nets, of which LSTMs and GRUs are special cases that are particularly useful. And there are convolutional nets, of which ResNets are a particularly useful instance — ResNets use the same idea as LSTMs: you have connections that skip layers so that the system doesn't get stuck if one of the layers behaves badly. And then you have things like transformers, which basically combine some of those elementary objects into particular architectures. You also have the mixture of experts, which uses attention as well. If you remember, I talked about mixtures of experts: a mixture of experts is a network where you have multiple expert networks, and then you have a sort of gating network that computes the coefficients with which to combine the outputs of the multiple experts. So it basically chooses which of the individual networks is the relevant expert for the particular sample we're seeing. That's a form of attention as well. 
So anywhere you have multiplicative interactions — where the weights of a network, or of a piece of a network, are the output of another network — you have this kind of mechanism. If those weights are between zero and one and sum to one, you can call it attention, all right? But you can imagine all kinds of other forms of those things that are all differentiable. Anyway, I'm not sure I answered to your satisfaction, but that's as much as I can do. Okay, so here is what a transformer is. You take an input — typically it's text, but people are increasingly applying this to what essentially amounts to image patches, okay? If your input is text, you represent the words as one-hot vectors and run them through a matrix that turns them into embedding vectors. So your input now is a sequence of embedding vectors, and you're going to learn that matrix, of course. This is similar to what I talked about earlier. You can combine those representation vectors with other data — for example, the position within the sentence — and you can encode the position in various ways. A common way of encoding position is through the values of sinusoids. I'll tell you in a minute why you need to encode the position, but you do need to encode it, okay? So you have a word on the input, and it's part of a sequence of words, and those are obtained from one-hot vectors using the embedding matrix — the same diagram as here. What you do is basically number the position of each of those vectors with a number between zero and one. So let's say you have five here: this would be 0, this would be 0.2, then 0.4, 0.6, 0.8 — maybe put another one so we go to 1, all right? This is one particular way of doing it; people do it in various ways. And we're going to add another vector to this embedding vector. 
And that vector is going to encode the position number in the following way. Take the position number, call it p. We run it through a sinusoid, and that gives the first component of the vector. So if the position is 0.2, we're here, and that height goes into the first component. Then we have a second sinusoid, and that second number goes into the second component. Then a third one, which goes into the third component, et cetera. And we use as many of those as required to get the required precision. The sinusoids are of frequency 1, 2, 3, 4, et cetera, up to whatever is required to encode all the positions accurately. So that gives a vector that encodes the position in a continuous, differentiable way, if you will. That's the positional encoding, and that's why there's a little sinusoid in the diagram. Then you go through a module called multi-head attention, which I'll describe in a second. You add its output to the input, normalize, go through a couple of layers of feedforward net, also with skip connections like ResNet, add and normalize. And you keep doing this: you take the output, feed it to the next layer, and stack multiple layers of this. A typical transformer will have something like 40 layers, something like that. And that's just the first part. Then there is the decoder, which produces the output. So let's say you want to produce text one word at a time. You take the output of those first 20 layers, or whatever it is, plug it into another one of those multi-head attention modules, add and normalize, feedforward, add and normalize. Run through a linear classifier and a softmax, and you output probabilities for each word. That produces the first word: you sample from that softmax, and that's the first word that comes out.
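The sinusoidal scheme described above can be sketched as follows. Note this follows the lecture's simplified description (sinusoids of increasing integer frequency evaluated at a position normalized to [0, 1]); the original transformer paper uses interleaved sines and cosines with geometrically spaced frequencies instead.

```python
import numpy as np

def positional_encoding(p, n_components):
    """Encode a scalar position p in [0, 1] as a vector of sinusoid
    values of increasing frequency: component k is sin(2*pi*(k+1)*p)."""
    freqs = np.arange(1, n_components + 1)  # frequencies 1, 2, 3, ...
    return np.sin(2 * np.pi * freqs * p)

# Five tokens at normalized positions 0, 0.2, 0.4, 0.6, 0.8
positions = np.linspace(0.0, 0.8, 5)
pe = np.stack([positional_encoding(p, 8) for p in positions])
# Each row is the position vector that gets added to that token's embedding.
```

Because the encoding is built from smooth functions of p, it is continuous and differentiable, which is the property the lecture emphasizes.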
You can take the word with the highest probability, or you can sample from the distribution produced by the softmax. Now you take that word and you feed it back to the input, okay? To produce the next word. So that word now goes through the embedding. It also gets a positional encoding, so you tell the system this is the first word that you're generating. It goes through a few layers of the network; this part also has multiple layers, replicated n times, 20 times or something. At the output, run through the linear module and softmax, produce a second word, feed that back to the input, and do it again. And you do this sequentially to produce all the words in your output. Text generation is quite expensive because of this so-called autoregressive way of generating words. Okay, so this is the general architecture of a transformer that transforms text into text. And I don't like the name very much, because "transformer" is a little too generic. There's also something called graph transformer networks, which you will learn about through a guest lecture, and which has nothing to do with this. Next week there is Ishan, so this is the week after next; then it's Awni, Awni Hannun. He's going to tell you about graph transformer networks. Do not confuse transformers and graph transformer networks. Okay, so what is this multi-head attention thing that we just talked about? You take multiple embeddings, which are vectors that represent your input, all right? And then you run them through one of those modules. And what one of those modules does is so-called self-attention. Okay, so what does that mean? It means that one input is used as a query while all the other inputs are used as keys. So instead of having an associative memory where the keys are stored in the memory, the keys are themselves inputs, right?
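The autoregressive generation loop just described can be sketched like this; the `decoder_step` function is a stand-in for the whole embedding-plus-decoder stack, returning a made-up distribution over a toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(tokens):
    """Stand-in for the decoder stack: given the tokens generated so far,
    return a probability distribution over a toy 5-word vocabulary."""
    logits = rng.normal(size=5)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(n_words, greedy=False):
    tokens = []
    for _ in range(n_words):
        probs = decoder_step(tokens)
        # Either take the argmax or sample from the softmax distribution,
        # then feed the chosen word back in for the next step.
        nxt = int(np.argmax(probs)) if greedy else int(rng.choice(len(probs), p=probs))
        tokens.append(nxt)
    return tokens

out = generate(4)
```

The sequential feedback is why generation cost grows with output length: each new word requires another full pass through the decoder.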
So this is basically an attention mechanism, right? We have an input, this particular one, but we have other inputs as well. We're just going to consider this particular guy. This one we view as a query, okay? And we're going to call it q_i. And whatever comes out of all these other guys, we call them keys. And we're going to compute, in a module here, the dot product of that query with all of the keys, and then we're going to softmax them. So here we get a bunch of coefficients; call them c_ij. So c_ij is equal to the dot product between q_i transpose and k_j, all right? And then we plug them into a softmax, a softmax over j, right? And we're going to do this for every input. So this guy here has a module of its own that produces c_{i-1,j}, and the same for this one, and this one, et cetera. So what we get in the end is a matrix c_ij, and this matrix is the result of computing all the dot products of all the q_i with all the k_j. And the q_i and the k_j can be the same or can be different: there can be two vectors attached to every location, a q and a k, or the k of one location can actually also serve as its q, okay? And that matrix is called a self-attention matrix. It's a big matrix where you normalize the rows, so that for a particular i, the numbers on that row are normalized through a softmax. So you apply a softmax to each row, and you get a bunch of coefficients between 0 and 1 that sum to 1, one set for every location. Now here is a very interesting characteristic of this. So this is one head of multi-head attention, and what makes it multi-head is that you do this multiple times with multiple vectors k, essentially.
And then in the end, you compute a weighted sum of a bunch of values, and you can have multiple sets of those values. That's the multi-head part, okay? So each of those inputs has vectors v, which I should represent as bubbles. You multiply the coefficients coming out of the attention with each of the v vectors, the symbols here are difficult to read, you sum them up, and you do this multiple times with multiple v's; that's multi-head. So at the output, you get multiple vectors here that are basically combinations of all the v's, with coefficients that are those self-attention coefficients. This is all differentiable. It's like an associative memory, okay? You can learn the v's, you can learn the q's, the k's, et cetera. Now here is a very interesting property of this system. Okay, so I should say you don't get a single output, you get one output per input, right? Because every one of those inputs produces c_i's with which you can compute a linear combination of the v's. So for each input you're going to get an output vector, right? You'll get as many output vectors as you have input vectors; they may be of different dimensions, but generally they're the same size. The interesting property is that this is equivariant to permutation. So if you change the order of the inputs, the outputs also change their order, but are otherwise unchanged, okay? And it's a crucial property, because what you now have, thanks to the self-attention, is a neural net module that computes an operation on a set. It doesn't care about the order in which the elements come in. It doesn't even care about how many of them there are, because you can run this self-attention with as many inputs as you want. So it's variable-size input, if you wish, assuming you have the v's that correspond to them. But it's equivariant to permutation, okay?
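Here is a single self-attention head in numpy, checking the permutation equivariance just described. The dimensions and the projection matrices Wq, Wk, Wv are arbitrary stand-ins for learned parameters.

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(Q, K, V):
    # c[i, j] = q_i . k_j, each row normalized with a softmax;
    # output i = sum_j c[i, j] * v_j  -- one output per input.
    C = softmax_rows(Q @ K.T)
    return C @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))             # 6 input vectors of dimension 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Y = self_attention(X @ Wq, X @ Wk, X @ Wv)

# Permutation equivariance: permuting the inputs permutes the outputs
# the same way, but leaves them otherwise unchanged.
perm = rng.permutation(6)
Yp = self_attention(X[perm] @ Wq, X[perm] @ Wk, X[perm] @ Wv)
assert np.allclose(Yp, Y[perm])
```

Nothing in the computation refers to an input's index, only to its content, which is why the module behaves as an operation on a set.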
It only cares about the elements that are on the input. If you change their order, it changes the order of the outputs but doesn't otherwise change the result. This is a very interesting property, because it means that if you need to process something independently of order, independently of relative positions, that's really great. Now, if you do want to put in some information about position, you use this positional encoding that I was telling you about, okay? The position of a word in the sentence actually matters in English, okay? There are some languages in which the position of a word within the sentence doesn't matter that much. In English it does, because the grammar in English is pretty weak. But there are languages like German or French where the grammar is a bit stronger, and so the order of words doesn't matter nearly as much. So you may need to encode the position. You may need to break this permutation equivariance of the transformer, because it really matters where a word appears. But the point is that the overall function is otherwise equivariant with respect to permutations. So this works just ridiculously well, particularly in the context of the BERT model. Okay, so now I have the correct references here. Multi-head attention, which underlies the transformer architecture, was from 2017. And then the BERT model, from Devlin et al., is from 2018. These are two groups at Google. And BERT used unsupervised fill-in-the-blank training, a masked autoencoder if you want. So this really revolutionized natural language processing completely. And now people are trying to apply this to images, and it's working pretty well, at least in certain situations. To the extent that some people are saying we're not going to use convolutional nets in the future, which may or may not be true.
Okay, so standard applications of this include things like translation, and text generation in various forms, although that doesn't work that well, at least when the models are trained this particular way. And representing text in general, so that you can then train a supervised downstream task on top, like say classifying the topic of a text, or determining the tone of a text. Is it positive or negative? Is it judgmental? Is it bullying? Is it hate speech? Is it a call to violence? You know, things like this. So those systems are used a lot, by both Google and Facebook, for content moderation, essentially. And it's only in the last year or two, because those models have only appeared in the last year or two. Some of those models are multilingual, so you train them with input sentences from multiple languages, and they automatically learn to detect the language and compute the appropriate representation. So what you get in the end is a representation of text that is independent of the language. And that's very useful, because it means you can now train, say, a hate-speech detector for whatever language your transformer was trained to represent. And that's very important when you want to do content moderation in a lot of different languages for which you may not have a lot of training data. And this is, again, very widely used by both Google and Facebook and various other companies. So here's an interesting example. This was a system that was initially proposed by Guillaume Lample and Alexis Conneau, who were both PhD students at Facebook in Paris. They proposed to train a transformer system to translate in this self-supervised manner. And the way they did it was: you take a sequence in English and the same sequence in French, the same sentence, meaning the same thing, more or less. And you remove some of the words.
So here you remove the word "curtains" and you remove the word "were". The sentence in English is "the curtains were blue"; in French it's "les rideaux étaient bleus", which is the translation. So in the English version you remove "curtains" and "were", and in the French version you remove "les", which is the article, and "bleus", which is the color blue. And you train the system to predict those missing words. And as a consequence of this training, when the system wants to produce the word "bleus" in French, it doesn't have that word in the French input, but it has the corresponding word "blue" in the English version. So it learns the translation automatically. It learns to pay attention to the corresponding word in the other language. So you train the system that way, to fill in the blanks in sentences, basically, and what you get at the end is a system that can translate. I mean, you have to do a few more things to get it to work to any kind of state of the art, and this also requires positional encodings in the two sentences, but it's pretty amazing. You can, of course, train each half separately to represent text in any language, and you can do this for any number of languages. So there are systems now that can translate between a couple of hundred languages this way. Guillaume Lample, the same guy again, this time with François Charton, also at Facebook in Paris, did something else last year; in fact, they open-sourced their code just last week. They trained a transformer system, supervised in this case, but it's a transformer architecture, to solve differential equations and compute integrals symbolically. So you give it an equation like this: y prime equals blah, blah, blah. Okay, that's a differential equation. And it gives you the solution, symbolically, represented as a kind of tree, if you want, which is encoded in the input sequence of the transformer.
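The fill-in-the-blank training data can be produced by a masking function like this sketch; the mask probability and the `<mask>` token are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = "<mask>"

def mask_words(tokens, p=0.3):
    """Replace a random subset of words with a mask token.
    The training targets are the words that were removed."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

en = "the curtains were blue".split()
fr = "les rideaux étaient bleus".split()
# Concatenate the two versions of the sentence and mask words in both;
# to fill a blank in one language the model must attend to the other.
pair, targets = mask_words(en + fr)
```

This is the mechanism that forces cross-lingual attention: when "bleus" is masked on the French side, the only place the information survives is "blue" on the English side.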
And the way they do this is that they start from formulas and then generate differential equations for which those formulas are solutions. They do the same with integrals. I mean, an integral is really the same thing, right? If you have something like this, solving the differential equation is just computing an integral. And for computing integrals, you can easily generate data: just generate a random formula, differentiate it symbolically, which you can do automatically, simplify or not the resulting formula, and then train the transformer to turn the derivative back into its integral. Okay, and you can generate as much data as you want. These are the old results; they actually have better results now, but you can basically get symbolic solutions to ordinary differential equations of the first and second order, as well as integrals. And this works really well. It solves these problems better than the traditional approaches that are used in Mathematica and Matlab and Maple. Here's another example of a use of a transformer. This one is a dialogue system. So here, the system is trained to basically hold a conversation with someone. This particular one is called MM Blender Bot. So Blender Bot is a dialogue system that was built at Facebook, and the MM version is one that is multimodal, so it can look at pictures. So here, both a human and the bot look at this picture and talk to each other about it. First: "I would love to take this teddy bear for a spin on my motorcycle." "How would the teddy bear ride, in the basket in front of you or in the sidecar?" "In front of me, of course. I don't want to get hit by a car." "Yeah, it's a good point. The sidecar would take up a lot of room. I think it would be fun to ride on the back of a motorbike with a stuffed animal in the basket."
"Do you ride your motorbike often?" "I used to ride a lot when I was younger, but I haven't been on one in a long time." So you have to figure out which of those is the human and which is the machine. I think the machine is the gray one and the human is the blue one, if I remember correctly. So, same thing here. This is a system that has a convolutional net that looks at the picture and extracts a representation of it; that's one input to the transformer, and the transformer has been trained on dialogues of people actually commenting on an image. This was collected through Amazon Mechanical Turk, and the system basically learns to emulate human dialogue. So this is an entertaining and fun chatbot. Making a chatbot that's actually useful is much more difficult, and people basically can't do it at the moment, other than by building them by hand. So this is a big challenge for the next few years: figuring out how to build or train chatbots that are useful. Now, using those transformers and memory networks and various tricks around those ideas, you can basically compile the entire knowledge of Wikipedia into a collection of associative memories inside a neural net. And then you can ask questions of a system like that, which would have hundreds of billions of connections, of weights or parameters. You can ask any question, and if the answer is somewhere in Wikipedia, that system will probably be able to answer the question. Some of you may have seen GPT-3. So GPT-3 is a slightly different type of model, which just tries to predict the next word in a text using some sort of context. And it stores so much; it has so many parameters, 175 billion or so.
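Going back to the symbolic integration system for a moment: its data-generation trick, generate a formula, differentiate it, then train the net to invert the derivative, can be caricatured like this. A real system generates random formula trees and differentiates them symbolically; here a tiny hand-written lookup table of (function, derivative) pairs stands in for that machinery.

```python
import random

# Tiny table of (function, derivative) pairs, written as strings.
# Stand-in for symbolic differentiation of random formula trees.
PAIRS = [
    ("sin(x)", "cos(x)"),
    ("x**2", "2*x"),
    ("exp(x)", "exp(x)"),
    ("log(x)", "1/x"),
]

def make_example(rng):
    f, df = rng.choice(PAIRS)
    # Training input: the derivative; training target: its integral.
    return df, f

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(3)]
```

The key point is that supervision is free: differentiation is mechanical, so arbitrarily many (derivative, integral) pairs can be generated and fed to the transformer as sequence-to-sequence training data.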
There are so many parameters that, whatever text prompt you give it, it has something similar in its associative memory, and essentially just by running the prompt through the network it will generate text that sounds like a plausible continuation of that initial prompt, including when the prompt is the specification of a problem, which is kind of surprising. So it's interesting, but it's not ready for practical applications yet. There are studies showing that it's not particularly reliable. So building a chatbot that is impressive or entertaining is easy. Building one that's useful is much tougher. Take that Wikipedia question-answering system I was telling you about. If you ask a question like "what is the population of Germany?", it will answer easily. But if you ask a question like "which country that shares a border with Germany has the largest trade with China?", then it can't do it, because what the system would have to do is go through a lot of different Wikipedia articles, read tables, sort the numbers in the tables, cross-reference the values in the tables to figure things out, and then look at a map and figure out which countries have a border with Germany. So it requires a sequence of actions. It requires complex planning that none of these systems can do at the moment. So this is a big challenge for the next few years: getting machines like this to reason, and to basically plan the sequence of actions needed to answer a question, for example, or solve a problem for people. And that's where, if we get there one day, we'll have virtual assistants that can do a lot for us. Here's another example of a transformer system, called DETR. This was proposed by Nicolas Carion and colleagues, also at Facebook AI Research in Paris. Nicolas Carion is actually a postdoc at NYU right now. And it's open source; you can play with it if you want.
And it's basically a vision system that combines a convolutional net with a transformer. This is where the field is going: the latest vision systems basically do that now. They combine ConvNets and transformers. Some of them actually don't have any ConvNet anymore; they just use a transformer all the way down. So they basically break up the image into patches, overlapping or non-overlapping, and then run those patches through an embedding matrix or through a couple of layers of a ConvNet to produce a representation. And then everything else is done by a transformer. Okay, and those seem to work pretty well. But the prevailing approach at the research level at the moment is to use a combination of the two: a convolutional net for the low layers and a transformer for the high layers. So here's how it works. You take an image and run it through a convolutional net, which produces feature maps, basically dense image features. Okay, and you can pre-train this or not, supervised. Then you take that and feed it to a transformer, which you can think of as a kind of encoder-decoder. That transformer has a bunch of slots that correspond to objects in the image. Typically it has 100 slots, okay, which are 100 different inputs, if you want. And the feature vectors that come out of the convolutional net are inputs to that transformer. So that transformer essentially produces a list. And remember, a transformer is equivariant to permutation, so it doesn't matter in which order you show the features to the transformer. I mean, you do positional encoding, of course, to indicate at which location each of those feature vectors sits. You run through the transformer, and the output of the transformer is a bunch of vectors, each with a softmax vector giving probabilities over categories, together with the position of a bounding box for that object.
Okay, and again, because the transformer is equivariant to permutation, it's going to produce these in whatever order is natural for it, okay. So it has 100 empty slots and it fills them with different objects, okay. And you train that system supervised. So the thing is, when the system produces those boxes, you don't know which object it produces corresponds to which actual object in the image for supervision. So you use something called a bipartite matching loss. So you say, okay, I know there is a seagull in this image. Let me look through all the slots the system produced for something that's close to a seagull. Oh, here is a box with a high score for seagull. The bounding box looks similar, the score is high for seagull, so I'm going to match this slot with that label, right. Same for this one; it's going to be easy because the bounding boxes are slightly different, so that's another seagull at a different location. But then the system produces another box here, with some score, which has no equivalent in the ground truth. So the bipartite matching has to take into account that some objects may appear in the output that don't appear in the ground truth, and some ground-truth objects may have no corresponding slot. So this bipartite graph matching figures out the best pairing. And you can think of this as finding the minimum of some energy over a latent variable, okay, the latent variable being the pairing. So here you see the overall architecture. The convolutional net produces dense image features. You do positional encoding to indicate the location of each of the features. You feed this through a transformer that has as many inputs as you have locations in the feature maps, if you want. Okay, so each of those inputs is the vector of values of all the features at one location.
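The bipartite matching step can be sketched like this. A brute-force search over pairings stands in for the Hungarian algorithm that DETR actually uses; the cost matrix entries would normally combine classification score and bounding-box error.

```python
import itertools
import numpy as np

def bipartite_match(cost):
    """Find the assignment of predicted slots to ground-truth objects
    minimizing the total matching cost.  cost[i, j] is the cost of
    matching prediction i to target j.  Brute force over permutations;
    a real system uses the Hungarian algorithm instead."""
    n_pred, n_tgt = cost.shape
    best, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_pred), n_tgt):
        total = sum(cost[perm[j], j] for j in range(n_tgt))
        if total < best:
            best, best_perm = total, perm
    return best_perm  # best_perm[j] = prediction index matched to target j

# 4 predicted slots, 2 ground-truth objects: the two unmatched slots
# should be trained to predict "no object".
rng = np.random.default_rng(0)
cost = rng.random(size=(4, 2))
match = bipartite_match(cost)
```

This is the "minimum of an energy over a latent variable" view from the lecture: the pairing is the latent variable, and the matching cost is the energy being minimized.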
And you have one of those for each location in the feature maps. You run through that, and you get a representation that's used as a contextual input for a second transformer, okay, which is the decoder. And this one has the object slots. So it has typically 100 slots, and you put object queries into it. Those are fixed vectors that are learned. You can think of them as weights, actually, that are learned by gradient descent. But from the point of view of the transformer, they are inputs. They are like input embeddings. They're learned through gradient descent, right? So they're learned as parameters, but to the transformer architecture they look like inputs. You run through this transformer; I don't know how many layers it has, maybe 20, I can't remember. And then you do this matching with the categories. So you run through a feedforward net to compute basically a score for each category, as well as a bounding box. Okay. And then, to train the system, you do this pairing of the targets to whatever comes out, and you backpropagate the gradient. So you compute the cross-entropy with the desired category, which is seagull. You compute the error on the bounding box; that's a regression problem. You backpropagate through all the slots that have a pair, back through the transformer, back to the embedding vectors here, updating them, and back through the encoder. And I think all the way back to the ConvNet; I'm not sure, actually, whether they backpropagate all the way to the ConvNet, or whether the ConvNet is just pre-trained. I think they backpropagate all the way. So basically, you may need to pre-train some of those pieces, but essentially you get an end-to-end system. Now, what the transformer is doing here is the equivalent of something that used to be done by hand, called non-maximum suppression.
And what's cool about the transformer is that it essentially does object-based reasoning, right? So it basically reasons about objects. It can say things like: here there are two elephants, one in front of the other; I know it's an elephant because it has a trunk and four legs and a tail; it's smaller than this one, so this one is obviously behind it; I'm not seeing all of it, but I'm paying attention to the trunk here. So those highlighted areas come from running through the encoder and looking at what the self-attention circuits pay attention to, which part of the input they attend to, okay? And you highlight that on the image. And that tells you what the system uses to arrive at the answer it arrives at, okay? Which is pretty cool. Same for zebras. It can take apart multiple zebras, and zebras have evolved to be in the business of confusing predators about their numbers, basically, or about where they are. That's the role of the stripes. So this works really amazingly well. There are competing approaches that work similarly; they also use transformers on top of ConvNets, trained slightly differently, using slightly different principles, but this is really kind of revolutionary, in my opinion, and really recent. Instead of training the system to just produce bounding boxes, you can train it to produce masks for every object. So the output of this feedforward net here is not just a softmax vector, but actually an image, like a binary image of where the object is. And this works really well too, for panoptic segmentation. So this uses the multi-head attention transformer to produce attention maps, and then you run those through a convolutional net trained to produce masks.
And so it can basically not only detect and recognize every object in the image, but also draw an outline of every one of them. So a lot of progress in computer vision is due to that. So in the five minutes that are left, and I'm sure we'll talk about this more in the future, I wanted to come back to this idea of planning I mentioned earlier. This is the setting of the Kelley-Bryson, or adjoint state, method that I told you about: you have an observation of the state of the world, which you run through a perception module that basically estimates the state of the world. And then you have a model of the world that gives you the state of the world at time t plus one as a function of the state of the world at time t, the action you take, and perhaps some latent variable that represents everything you don't know about the world that may occur. Okay, and you may need to sample multiple values of that latent variable to generate multiple possible futures, or enumerate them exhaustively if it's a discrete variable. Okay, so model predictive control is this idea that by some minimization algorithm, which could be gradient-based or not, could be based on dynamic programming, you find the sequence of actions over time that minimizes the overall cost computed from the states. Okay, so you might be wondering: okay, you're doing control; I heard that people use reinforcement learning for this. This is not reinforcement learning. This is optimal control. Okay, the difference between optimal control and reinforcement learning is twofold. So first of all, if we were doing reinforcement learning, this would be called model-based reinforcement learning, because inside of this we have a model of the world that predicts the next state from the previous state. But in reinforcement learning, you don't know what this cost function is. Okay, you're not told what this cost function is.
The only way you know the value of the cost function is to take an action and then wait for the world to tell you whether the outcome of that action was good or not. And you may not get a response for every action you take. You might only get it at the end. So you play a game of chess or Go; you only get the answer at the end, when you lose or when you win. You're not told in the meantime whether each action you took was good or bad. Okay, so the difference between optimal control and reinforcement learning is that in optimal control, you know the cost function and you can backpropagate gradients through it, or at least compute it. And so you know in which direction to change your actions so that the cost goes down. In reinforcement learning, you don't know the cost function. You only know the value of the cost function by taking an actual action in the world and then waiting for the world to tell you whether that action was good or bad. And it may only tell you this at the end of a sequence, a long sequence. And so you get very, very sparse rewards or punishments, which you can think of as values of a cost. And so reinforcement learning is much more difficult, because you don't have what's called intrinsic motivation: a cost that you compute yourself. Humans have this. At the base of our brain, we have something called the basal ganglia, and that's where our brain computes whether we are comfortable, happy, hungry, thirsty, whether we hurt or not. And the rest of our brain is basically there to satisfy that piece of the brain. That piece of the brain is our cost function. Okay, it's the thing that tells you you're in a good state or a bad state. All the actions we take are basically to minimize the expected value of that cost over the long term. And the better the model of the world we have for doing this, the better we can do it. Okay, so let's say we want to train a car to drive itself.
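The optimal-control loop just described can be sketched with a made-up one-dimensional "car" model and a known, differentiable cost; numerical gradients stand in for backpropagation through the model, but the point is the same: because the cost function is known, we can directly compute in which direction to change each action.

```python
import numpy as np

# Toy world model: state = (position, velocity), action = acceleration.
# Dynamics and cost are invented for the example.
def step(state, action):
    pos, vel = state
    return np.array([pos + 0.1 * vel, vel + 0.1 * action])

def cost(state, target=5.0):
    return (state[0] - target) ** 2  # known, differentiable cost

def rollout_cost(actions, state):
    total = 0.0
    for a in actions:
        state = step(state, a)
        total += cost(state)
    return total

# Model predictive control: optimize the action sequence itself.
actions = np.zeros(20)
state0 = np.array([0.0, 0.0])
for _ in range(200):
    grad = np.zeros_like(actions)
    for i in range(len(actions)):
        eps = np.zeros_like(actions)
        eps[i] = 1e-4
        grad[i] = (rollout_cost(actions + eps, state0)
                   - rollout_cost(actions - eps, state0)) / 2e-4
    actions -= 0.1 * grad  # gradient step on the actions
```

In reinforcement learning, by contrast, `cost` would be a black box queried by acting in the world, so this direct gradient computation would not be available.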
We need to have a way of predicting what the cars around us are going to do. So we observe, let's say this is a top-down view of a piece of highway, and our car is the blue car. And we've observed what happened around us for a while with some sensor, LiDAR or a camera or whatever. What we need is to be able to predict what's going to happen next in the world. Okay, and an architecture we can use is one that we've already studied, which is a conditional variational autoencoder. Okay, so a conditional variational autoencoder, what is it? You observe x, you run it through an encoder, you get a hidden representation of the past, call it h. That h goes into a decoder. The decoder combines this h with a latent variable z, which represents what we don't know about what's going on in the world. And then the decoder makes a prediction of what the next state of the world is going to be, in the form of an image in this case. All right, we can train the system by minimizing the prediction error over recorded sequences from a camera looking down at cars on a highway. Now, because inference of this latent variable may be hard, and we may need to marginalize over it, we're going to use a variational autoencoder. So we're going to run y and h through an encoder to produce a guess as to the best value of z. And then we're going to sample z from a Gaussian distribution whose mean is produced by that encoder; the standard deviation may be another output of the encoder, or not. But we're also going to have a prior here for z, saying that it should be mostly zero, or close to zero, let's say. And then we run that through the decoder. So once the system is trained, of course, when we want to do inference, to predict what y is going to be for a given x, we don't have access to y.
And so we have to sample Z from a Gaussian distribution, or whatever distribution this prior specifies, and run it through. And that allows the system to make multiple predictions of the future for a single observation of the past. Here's a demonstration of that. This is a project that Alfredo was involved in, as well as Mikael Henaff, a former student of mine. On the left is a recorded sequence from a camera looking down at a highway. What you see immediately to the right of the recorded sequence is what happens when you don't have a latent variable: you set it to zero all the time and just run through the encoder and decoder. What you get is increasingly blurry predictions, because the system doesn't really know whether the other cars are going to accelerate or brake or whatever, so it predicts the average, and that's a blurry prediction. The four columns on the right are different predictions made for different draws of the latent variable Z. The variable Z here is a sequence: you draw the Z's over the sequence, and for different draws of that sequence, you get different predictions of the future. The squares and circles indicate the same car being tracked, behaving differently for the different values of the latent variable. So that's a good example showing how one of those latent variable models can make multimodal predictions. And you can use this to do model predictive control: run this for a couple of seconds. The prediction step is every tenth of a second, every 100 milliseconds, so you can run it 20 times or so, you get a two-second segment, and you can plan over two seconds. We could run it for longer and plan over a longer horizon. That would be model predictive control.
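Model predictive control with a learned forward model can be sketched very compactly. The version below uses the simplest possible planner, random shooting (sample candidate action sequences, roll the model forward, keep the cheapest); the linear dynamics and quadratic cost are toy stand-ins for the learned model and driving cost, chosen only so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Toy stand-in for the learned forward model: next state after 100 ms."""
    return state + 0.1 * action

def cost(state):
    """Toy stand-in for the driving cost (e.g. distance to cars, lane keeping)."""
    return float(np.sum(state ** 2))

def mpc_plan(state, horizon=20, n_candidates=256):
    """Random-shooting MPC: sample action sequences, roll the model forward
    over the horizon, and return the sequence with the lowest total cost."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = rng.normal(size=(horizon, state.shape[0]))
        s, total = state.copy(), 0.0
        for a in seq:
            s = world_model(s, a)
            total += cost(s)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost

s0 = np.array([1.0, -0.5])
plan, plan_cost = mpc_plan(s0)   # 20 steps of 100 ms = a two-second plan
```

In practice only the first action of the plan is executed, then the plan is recomputed from the new observation, which is what makes this expensive compared to a direct policy.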
It's kind of expensive. And the cost function, by the way, is something that measures the distance of your car to the other cars and whether your car is in its lane, something like this. So another thing you can do, instead of having to reason and infer the sequence of actions that optimizes that cost, is train a neural net called a policy network to predict the action directly. And it's basically just backprop: you unroll your model of the world and your cost function, initialize it with an initial observation, run through this, and then backpropagate through the entire thing so that you adjust the weights of the policy network in such a way that it takes actions that over time will minimize the objective. So this is gradient-based policy learning. Again, it's not reinforcement learning, because you know the cost function and you can differentiate it. It's learning a direct controller by backpropagation through time, assuming you have a good model of the world. And this actually works. Okay, I'll conclude, because I'm out of time anyway. So we talked about self-supervised learning. We talked about the fact that we can train very large networks with self-supervised learning. This is clearly the future of AI, there's no question. And one of the challenges is handling uncertainty in the prediction, and we can use energy-based models for this: latent variable models if we really want to do prediction, or joint embedding methods if we only want to learn representations. We can do reasoning and planning through energy minimization, and these models, basically energy-based models with latent variables, allow us to do this. They give us a framework for how to do this properly. We don't do logic and symbols: we basically replace symbols by vectors and we replace logic by continuous energy functions whose gradients we can compute, essentially. But that's how we do reasoning.
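Backpropagation through time through a world model can be written out by hand on a toy problem. Below, the world is one-dimensional (`s_{t+1} = s_t + a_t`), the policy is a single weight (`a_t = w * s_t`), and the cost is the sum of squared states; this is a minimal sketch of the idea, not the lecture's actual system.

```python
# Gradient-based policy learning on a toy 1-D world, backprop written by hand.
# World model: s_{t+1} = s_t + a_t ; policy: a_t = w * s_t ; cost: sum of s_t^2.

def rollout(w, s0, T):
    """Unroll the world model with the current policy, storing all states."""
    states = [s0]
    for _ in range(T):
        s = states[-1]
        states.append(s + w * s)          # s_{t+1} = s_t + a_t, with a_t = w * s_t
    return states

def cost_and_grad(w, s0, T):
    """Total cost J = sum_{t>=1} s_t^2 and dJ/dw via backprop through time."""
    states = rollout(w, s0, T)
    J = sum(s * s for s in states[1:])
    g_s, g_w = 0.0, 0.0
    for t in range(T, 0, -1):             # reverse pass through the unrolled steps
        g_s += 2.0 * states[t]            # gradient of the cost term at step t
        g_w += g_s * states[t - 1]        # through a_t = w * s_{t-1}
        g_s *= (1.0 + w)                  # through s_t = (1 + w) * s_{t-1}
    return J, g_w

w, s0, T, lr = 0.0, 1.0, 10, 0.01
J0, _ = cost_and_grad(w, s0, T)
for _ in range(200):                      # adjust the policy weight by gradient descent
    J, g = cost_and_grad(w, s0, T)
    w -= lr * g
J_final, _ = cost_and_grad(w, s0, T)
```

Gradient descent drives `w` toward -1, the policy that cancels the state in one step, and the total cost drops accordingly; with a neural world model and policy network, an autodiff framework performs exactly this reverse pass for you.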
We don't know how to learn hierarchical representations of action plans yet. I have no idea how. Now, perhaps this is a blueprint for a kind of future autonomous intelligence: system-two AI, if you want, or what some people call AGI, artificial general intelligence. I do not believe in the concept of artificial general intelligence, because I don't think even human intelligence is particularly general. I think it makes sense to talk about human-level AI, but not about general intelligence. Human intelligence is very, very specialized. Less specialized than that of cats or rats or dogs, but still very specialized. So perhaps one day we'll be able to put this whole thing together: a system that has perception, which in our brain happens in the back of the brain; a cost function, which is in the base of the brain; a model of the world that allows us to predict what's going to happen in the world as a consequence of our actions, or just because the world is being the world, and to predict multiple outcomes. We may have a critic, and the critic is basically another neural net, a module that predicts what the future value of the cost is going to be. So if I come to you and unexpectedly pinch your arm, you're going to be surprised. You'll probably step back and wonder what came over me. The next time I come to see you, you may be a little careful, because you'll wonder if I'm going to pinch your arm. That's basically your critic predicting what your level of pain is going to be. You felt pain; that was an immediate cost. But you have this neural net that predicts, for a given situation, what the expected value of that cost is going to be in the future. Now that you know I'm a pinching guy, you're going to step back because of this. This is perhaps the source of a lot of emotions. Emotions, basically, are anticipations of good or bad outcomes.
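A critic of this kind is usually trained by bootstrapping its predictions against observed costs. Here is a temporal-difference (TD(0)) update on a toy two-state version of the pinching story; the setup and numbers are made up purely for illustration.

```python
# Toy critic learning: state 0 = "the pinching guy approaches",
# state 1 = "arm gets pinched" (immediate cost 1), then the episode ends.
# TD(0) update: V(s) <- V(s) + alpha * (c + gamma * V(s') - V(s)).
gamma, alpha = 0.9, 0.5
V = [0.0, 0.0]                  # predicted future cost for each state

for _ in range(50):             # replay the episode many times
    # transition 0 -> 1: no immediate cost yet, but future cost ahead
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # transition 1 -> end: immediate cost 1, no future
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
```

After training, `V[1]` is close to 1 (the immediate pain) and `V[0]` is close to 0.9 (the anticipated pain): merely seeing the pinching guy approach now predicts a cost, which is exactly the "step back" signal described above.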
And the fact that we can predict in advance whether an outcome is going to be good or bad creates things like fear, elation, and so on. There are more immediate emotions, like hunger and thirst, and those come directly from the cost. But packing your lunch for the day because you're going on a hike and know you won't have access to food, that's basically your critic telling you: I know I'm going to be away from any food source for a long time, so I need to pack my lunch beforehand. Now, what we also have here is that humans can essentially do only one thing at a time. We can pay attention to only one thing at a time, and we can deliberately think about essentially only one thing at a time. Perhaps the reason for this is that we have only one engine to be the model of the world, and that engine is configurable to the situation at hand. But we have only one. So we need a way to configure that engine to handle the situation at hand, whether we are sitting, whether we are building something, whether we are talking to someone. We need to configure our model of the world to the situation at hand, and that's probably done by this kind of configuration engine. And maybe that's the source of consciousness, of what we call consciousness: it's a consequence of a limitation of our brain, that we have only one model of the world, essentially one engine. And the actor is the module that figures out what sequence of actions to take so that, given the model of the world that I have, I minimize the cost predicted by my critic, or computed by my cost module from my current perception. Okay. So that may be the architecture of future autonomous AI systems. Nobody has really built anything close to this; some reinforcement learning systems have some of the components, but not all of them. Okay. Thank you very much. Sorry for going over time.