So, I want to do two things: talk a little bit about some more ways to use convolutional nets, continuing from last time, and also talk about different types of architectures, some of which are relatively recent designs that people have been playing with for a while. Last time, when we talked about convolutional nets, we stopped at the idea that we can use convolutional nets with a kind of sliding window over large images, which consists in just applying the convolutions to large images; it's a very general method. So we're going to see a few more things on how we use convolutional nets, and to some extent I'm going to rely on somewhat historical papers and things like this to explain simple forms of all of those ideas.

As I said last time, I had this example where there are multiple characters on an image, and you have a convolutional net whose output is also a convolution (every layer is a convolution), so you can interpret the output as giving you a score for every category and for every window on the input. And the framing of the windows, that is, the windows that the system observes when you back-project from a particular output, steps by the total amount of subsampling you have in the network. So if you have two layers that each subsample by a factor of two (two pooling layers, for example, each subsampling by a factor of two), the overall subsampling ratio is four, and what that means is that every output is going to look at a window on the input, and successive outputs are going to look at successive windows that are separated by four pixels, okay? It's just the product of all the subsampling ratios.

So this is nice, but then you have to make sense of all those steps on the input. How do you pick out objects, objects that overlap each other, et cetera? One thing you can do for this is called non-maximum suppression, which is what people use in object detection. Basically, what it consists in is that if you have outputs that are more or less in the same place, or in overlapping places, and one of them tells you "I see a bear" and the other one tells you "I see a horse", one of them wins, okay? One of them is probably wrong; you can't have a bear and a horse in the same place, so you do what's called non-maximum suppression: you look at which of those has the highest score and you pick that one, or you see if any neighbors also recognize it as a bear or a horse and you make a local vote, if you want, okay? I'm not going to go into the details of this; these are just rough ideas. A lot of this is already implemented in code that you can download, and it's also the topic of a full-fledged computer vision course. Here we just allude to how we use deep learning for this kind of application.

Let's see. So here, again, going back to history a little bit: some ideas of how you use neural nets, convolutional nets in this case, to recognize strings of characters, which is really the same problem as recognizing multiple objects. So you have the image at the top, 23206: it's a zip code, and the characters touch, so you don't know how to separate them in advance. So you just apply a convolutional net to the entire string. We don't know in advance what width the characters will take.
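As a side note, here is a minimal sketch (in PyTorch, with made-up layer sizes) of the subsampling arithmetic above: two pooling layers that each subsample by two give an overall stride of four, so widening the input produces one extra output column for every four extra input pixels.

```python
import torch
import torch.nn as nn

# Stand-in network: every layer is a convolution, including the "classifier".
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),    # 32x32 input -> 28x28
    nn.MaxPool2d(2),                   # subsample by 2 -> 14x14
    nn.Conv2d(8, 16, kernel_size=5),   # -> 10x10
    nn.MaxPool2d(2),                   # subsample by 2 again -> 5x5
    nn.Conv2d(16, 10, kernel_size=5),  # last layer seen as a convolution -> 1x1
)

small = net(torch.zeros(1, 1, 32, 32))  # one 32x32 window -> one output column
wide = net(torch.zeros(1, 1, 32, 64))   # wider image -> more output columns
print(small.shape)  # torch.Size([1, 10, 1, 1])
print(wide.shape)   # torch.Size([1, 10, 1, 9]): 32 extra pixels / stride 4 = 8 more
```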
And so what you see here are four different sets of outputs of the convolutional net, each of which has ten rows, and the ten rows correspond to the ten categories. If you look at the top block, for example, the white squares represent high-scoring categories. What you see on the left is that the number two is being recognized: the window looked at by the output units in the first column is on the left side of the image, and it detects a two. The rows are zero, one, two, three, four, et cetera, so you see a white square that corresponds to the detection of a two. Then, as the window is shifted over the input, there's a three, a low-scoring three, that is seen. Then the two again: there are three detectors that see this two. And then nothing, then the zero, and then the six.

Now, this first system looks at a pretty wide window. So when it looks at the two that's on the left, for example, it actually sees a piece of the three with it; it's in the window. The different sets of outputs here correspond to different sizes of the kernel of the last layer. In the second block, the size of the kernel is four in the horizontal dimension; the next one is three, and the next one is two. What this allows the system to do is look at regions of various widths of the input without being too confused by the characters that are on the side, if you want. So for example, the zero is very high-scoring on the second, third, and fourth maps, but not very high-scoring on the top map. Similarly, the three is high-scoring on the second, third, and fourth maps, but not on the first map, because the three overlaps with the two, and so the system wants to look at it in a narrow window to be able to recognize it.

OK, yes? It's the size of the white square; that indicates the score, basically. So look at this column here: you have a high-scoring zero here, because the first row corresponds to the category zero, but it's not so high-scoring on the top map, because that output unit looks at a pretty wide input and gets confused by the stuff that's on the side.

OK, so you have something like this, and now you have to make sense of it and extract the best interpretation of that sequence. It's true for zip codes, but it's true for just about every piece of text: not every combination of characters is possible. When you read English text, there is an English dictionary, an English grammar, and not every combination of characters is possible. So you can have a language model that attempts to tell you the most likely sequence of characters we're looking at here, given that this is English or whatever language. Or, given that this is a zip code, not every zip code is possible, so there's some possibility for error correction. So how do we take that into account? I'll come to this in a second. But here, what we need to do is come up with a consistent interpretation: there's obviously a two, a three, a zero somewhere, another two, et cetera. How do we turn this array of scores into a consistent interpretation? Yes? It's the horizontal width of the kernel of the last layer.
Which means that when you back-project onto the input, the viewing window on the input that influences a particular unit has various sizes, depending on which set of outputs you look at. Yes? The width of the block? It corresponds to how wide the input image is, divided by four, because the subsampling ratio is four: you get one column of those every four pixels.

So remember we had this way of using a convolutional net, which is that you basically make every convolution larger, and you view the last layer as a convolution as well. And now what you get is multiple outputs. What I was representing on the slide you just saw is this 2D array on the output, where the rows correspond to categories and each column corresponds to a different location on the input. And I showed you those examples here. This is a different representation, where the character displayed just before the title bar indicates the winning category; I'm not displaying the scores of every category, just the winning category. Each output looks at a 32 by 32 window, and the next output looks at a 32 by 32 window shifted by four pixels, et cetera. So how do you turn this sequence of characters into the fact that it's either 3-5 or 5-3?

So here, the reason why we have four of those is that there are four different last layers, if you want, each of which is trained to recognize the 10 categories. And those last layers have different kernel widths, so they essentially look at windows of different widths on the input. You want some that look at wide windows so they can recognize wide characters, and some that look at narrow windows so they can recognize narrow characters without being perturbed by the neighboring characters.

So if you know a priori that there are five characters here, because it's a zip code, you can use a trick. There are a few specific tricks I could explain, but I'm going to explain the general trick, if you want. I didn't want to talk about this actually, at least not now. OK, here's the general trick; the general trick is actually kind of a somewhat specific trick. Oops, I don't know why I keep changing the slide. You say: I know I have five characters in this word. So let's say I have one of those arrays that produces scores; for each category (let's say I have four categories here) and each location, there's a score. And let's say I know that I want five characters out. I'm going to draw them vertically: one, two, three, four, five, because it's a zip code. So the question I'm going to ask now is: what is the best character I can put in the first slot? And the way I'm going to do this is to draw an array, where the score at every cell is the score of putting a particular character at that location, given the scores I have at the output of my neural net. Since I have fewer characters on the output of the system (five) than I have viewing windows and scores produced by the system, I'm going to have to figure out which one to draw. What I can do is build this array, and what I need to do is go from here to here by finding a path through this array in such a way that I take exactly five steps, if you want. So each step corresponds to a character.
And the overall score of a particular string is the sum of all the scores along this path. In other words, if I get three instances here, three locations where I have a high score for this particular category, category one, I'm going to say: this is the same guy, and it's a one. And here, if I have two guys that have a high score for three, I'm going to say those are the three. And here I have only one guy that has a high score for two, so that's a two, et cetera. This path has to be continuous; I can't jump from one position to another, because that would break the order of the characters. And I need to find a path that goes through high-scoring cells, if you want, that corresponds to high-scoring categories along the path. It's a way of saying: if those three cells here all give me the same character, it's only one character, and I'm just going to output one here. Those three guys have a high score; I stay on the one. Then I transition to the second character, so I'm going to fill out this slot. This guy has a high score for three, so I put a three here. And this guy has a high score for two, so a two, et cetera.

The principle for finding this path is a shortest-path algorithm. You can think of this as a graph where I can go from the lower-left cell to the upper-right cell by either going to the right or going up and to the right. For each of those transitions there is a cost, and for putting a character at a particular location there is also a cost, or a score, if you want. So the overall score of the one at the bottom would be the combined score of the three locations that detect that one, because all three of them are contributing evidence to the fact that there is a one. When you constrain the path to go from the bottom left to the top right, it has to take exactly five steps up; there's no choice. That's how you force the system to give you five characters, basically. And because the path can only go from left to right, it has to give you the characters in the order in which they appear in the image. So it's a way of imposing the order of the characters and imposing that there are five characters in the string.

Yes? Okay, in the back, yes? Right. Yes. Well, if you have just a string of ones, you have to have trained the system in advance so that when it's in between two ones, or two characters whatever they are, it says nothing; it says "none of the above". Otherwise you can't tell, right? Yeah, a system like this needs to be able to tell you: this is none of the above, it's not a character, it's a piece of one, or I'm in the middle of two characters, or I have two characters on the sides but nothing in the middle. Yeah, absolutely, it's a form of non-maximum suppression. You can think of this as a smart form of non-maximum suppression, where you say that at every location you can only have one character, and the order in which you produce the five characters must correspond to the order in which they appear in the image. What you don't know is how to warp one into the other: how many detectors are going to see the number two? It may be three of them, and we're going to decide they're all the same. So the thing is, for all of you who have learned computer science, which is not everyone, the way you compute this path is just a shortest-path algorithm.
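Here is a minimal NumPy sketch of that idea, a small Viterbi-style dynamic program; the array shapes and random scores are made up, and the real systems of that era used more elaborate machinery, but the path constraint is the same: move right to stay in the current slot, or up-and-right to open the next one.

```python
import numpy as np

def best_string(scores, n_slots):
    # scores[c, t]: score of category c at window position t (higher is better).
    # Assign the T positions, in order, to n_slots output slots; every position
    # inside a slot must vote for the same character.
    C, T = scores.shape
    dp = np.full((n_slots, T, C), -np.inf)         # best path score so far
    moved = np.zeros((n_slots, T, C), dtype=bool)  # True if a new slot opened at t
    prev = np.zeros((n_slots, T, C), dtype=int)    # best character of the slot we left
    dp[0, 0, :] = scores[:, 0]
    for t in range(1, T):
        for j in range(n_slots):
            dp[j, t, :] = dp[j, t - 1, :] + scores[:, t]        # stay in slot j
            if j > 0:
                adv = dp[j - 1, t - 1, :].max() + scores[:, t]  # open slot j here
                better = adv > dp[j, t, :]
                dp[j, t, better], moved[j, t, better] = adv[better], True
                prev[j, t, :] = dp[j - 1, t - 1, :].argmax()
    # Backtrack: emit a character each time the path opened a new slot.
    c, j, out = int(dp[-1, -1, :].argmax()), n_slots - 1, []
    for t in range(T - 1, 0, -1):
        if moved[j, t, c]:
            out.append(c)
            c, j = int(prev[j, t, c]), j - 1
    out.append(c)                                  # the first slot's character
    return list(reversed(out)), float(dp[-1, -1, :].max())

scores = np.random.randn(10, 12)                   # 10 digits, 12 window positions
print(best_string(scores, 5))                      # best 5-digit reading of the scores
```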
You do this with dynamic programming: find a shortest path from bottom left to top right by only taking transitions to the right or diagonally, minimizing the cost if you think of the cells as filled with costs, or maximizing the score if you think of them as scores, as probabilities for example. It's just a shortest-path algorithm in a graph. Some of the early methods of speech recognition worked this way, by the way; not with neural nets, though. People hand-extracted features, but then basically matched the sequence of vectors extracted from the speech signal to a template of a word, trying to see how to warp time to match the word to be recognized to the template. You had a template for every word, of a fixed size. This was called DTW, dynamic time warping. There's a more sophisticated version of it called hidden Markov models, but it's very similar. People still do this to some extent.

OK, so detection. If you want to apply a convolutional net to detection, it works amazingly well, and it's surprisingly simple. Let's say you want to do face detection, which is a fairly easy problem; it's one of the first problems that computer vision started solving really well for recognition. You collect a dataset of images with faces and images without faces, and you train a convolutional net whose input window is something like 20 by 20 or 30 by 30 pixels to tell you whether there's a face in it or not. Now you take this convolutional net and apply it to an image, and if there is a face that happens to be roughly 30 by 30 pixels, the convolutional net will light up at the corresponding output, and not light up where there is no face.

Now, there are two problems with this. The first problem is that there are many, many ways a patch of an image can be a non-face, and during your training you probably haven't seen all of them; you haven't even seen a representative set of them. So your system is going to have lots of false positives. That's the first problem. The second problem is that in a picture, not all faces are 30 by 30 pixels. So how do you handle size variation?

One way to handle size variation, which is very simple (though mostly unnecessary in modern systems, or at least not completely necessary), is a multi-scale approach. You take your image and run your detector on it; it fires wherever it sees a face, and it will detect the faces that are small. Then you reduce the image by some scale, in this case I think a square root of 2, and apply the convolutional net again on that smaller image. Now it's going to be able to detect faces that were larger in the original image, because what was 30 by 30 pixels is now about 20 by 20 pixels, roughly, okay? But there may be bigger faces still, so you scale the image again by a factor of square root of 2; now the image is half the size of the original one. You run your convolutional net again, and now it's going to detect faces that were 60 by 60 pixels in the original image but are now 30 by 30, because you reduced the size by half.

You might think that this is expensive, but it's not: half of the expense is the finest scale, and the expense of all the other scales combined is about the same as the finest scale. That's because the cost of running the network goes as the square of the size of the image on one side, and so when you scale down the image by a square root of 2, the network you have to run is smaller by a factor of 2.
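A minimal sketch of that multi-scale loop, assuming some trained sliding-window `detector` (a stand-in, not a real API) with a 30 by 30 window and an output stride of 4; the threshold and sizes are made up.

```python
import torch
import torch.nn.functional as F

def detect_multiscale(detector, image, threshold=0.0):
    # image: 1x1xHxW tensor; detector(img) -> 1x1xhxw score map,
    # one score per 30x30 window, one window every 4 pixels (assumed).
    hits, scale = [], 1.0
    while min(image.shape[2] * scale, image.shape[3] * scale) >= 30:
        size = (int(image.shape[2] * scale), int(image.shape[3] * scale))
        scaled = F.interpolate(image, size=size, mode='bilinear',
                               align_corners=False)
        scores = detector(scaled)
        for y, x in (scores[0, 0] > threshold).nonzero():
            hits.append((float(scores[0, 0, y, x]),   # score
                         int(4 * x / scale),          # x in original image
                         int(4 * y / scale),          # y in original image
                         int(30 / scale)))            # face size in original image
        scale /= 2 ** 0.5                             # shrink by sqrt(2) per level
    return hits                                       # then non-max suppression
```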
So the overall cost of this is 1 plus 1/2 plus 1/4 plus 1/8 plus 1/16, et cetera, which is 2. You waste a factor of 2 by doing multi-scale, which is very small; you can afford a factor of 2.

So this is a completely ancient face detection system from the early 90s, and the maps you see here indicate the scores of the face detectors. The face detector here, I think, is 20 by 20 pixels, so it's very low-res. And it's a big mess at the fine scales: you see high-scoring areas, but nothing really definite. But you see more definite things down here: a white blob here, a white blob here, a white blob here. Same here: a white blob here, a white blob here. And those are faces. And that's where you need to do non-maximum suppression to get those little red squares, which are the winning locations, if you want, where you have a face.

Non-maximum suppression in this case means: I have a high-scoring white blob here, which means there's probably a face underneath, roughly 20 by 20. If there is another face within a window of 20 by 20, one of those two is wrong. So I'm just going to take the highest-scoring one within the window of 20 by 20 and suppress all the others. And you suppress the others at nearby locations at that scale, but also at other scales. So you pick the highest-scoring blob, if you want, for every location and every scale, and whenever you pick one, you suppress the other ones that could conflict with it, either because they are at a different scale in the same place, or at the same scale nearby.

OK, so that's the first problem. The second problem is the fact that, as I said, there are many ways to be different from a face, and most likely your training set doesn't have all the non-face things that look like faces. The way people deal with this is what's called negative mining. You go through a large collection of images where you know for a fact that there is no face, you run your detector, and you keep all the patches where your detector fires. You verify that there are no faces in them, and if there is no face, you add them to your negative set. Then you retrain your detector, and you use the retrained detector to do the same: go again through a large dataset of images where you know there is no face, and whenever your detector fires, add that as a negative sample. You do this four or five times, and in the end you have a very robust face detector that does not fall victim to negative samples. There are a lot of things in natural images that look like faces but are not faces. This works really well. This is over 15-year-old work. This is my grandparents' wedding, by the way.

OK. So here's another interesting use of convolutional nets, and this is for what came to be called semantic segmentation. I alluded to this in the first lecture. What is semantic segmentation? It's the problem of assigning a category to every pixel in an image: every pixel is labeled with the category of the object it belongs to. You can imagine this would be very useful if you want to, say, drive a robot in nature. So this is a robotics project that my students and I worked on a long time ago.
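Before moving on, a minimal sketch of that negative-mining loop; `train` and `score_patches` are stand-ins for the training routine and the sliding-window detector, not a real API.

```python
# Hard-negative mining: repeatedly harvest false positives from face-free images
# and retrain with them as negative samples.
def mine_negatives(train, score_patches, positives, negatives,
                   face_free_images, rounds=5, threshold=0.0):
    """train(pos, neg) -> detector; score_patches(detector, img) -> [(patch, score)].
    face_free_images are images known to contain no faces."""
    detector = train(positives, negatives)
    for _ in range(rounds):
        for image in face_free_images:
            for patch, score in score_patches(detector, image):
                if score > threshold:            # the detector fired on a non-face:
                    negatives.append(patch)      # it becomes a new negative sample
        detector = train(positives, negatives)   # retrain with the mined negatives
    return detector
```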
And what you'd like is to label the image so that the regions the robot can drive on are indicated, and the areas that are obstacles are also indicated, so the robot doesn't drive there. OK, so here the green areas are things the robot can drive on, and the red areas are obstacles, like tall grass in this case. The way you train a convolutional net to do this kind of semantic segmentation is very similar to what I just described. You take a patch from the image (in this case I think the patches were 20 by 40 or something like that, relatively small) for which you know whether the central pixel is traversable or not, whether it's green or red; either it's been manually labeled, or the label has been obtained in some other way. And you run the convnet on this patch and train it: tell me if it's green or red, tell me if it's drivable area or not. Once the system is trained, you apply it to the entire image, and it outputs green or red depending on where it is. In this particular case there were actually five categories: super-green, green, purple (which is the foot of an object), red (which is an obstacle), and super-red (which is a definite obstacle). But here we're only showing three colors.

Now, in this particular project, the labels were actually collected automatically; you didn't have to manually label the images and the patches. What we would do is run the robot around and then, through stereo vision, figure out whether a pixel corresponds to an object that sticks out of the ground or is on the ground. So the middle column here says "stereo labels": the color, green or red, is computed from stereo vision, from 3D reconstruction, basically. You have two cameras, and the two cameras can estimate the distance of every pixel by basically comparing patches. It's relatively expensive, and it's not completely reliable, but it sort of works. So now, for every pixel, you have a depth, a distance from the camera, which means you know the position of that pixel in 3D, which means you know whether it sticks out of the ground or is on the ground, because you can fit a plane to the ground. The green pixels are the ones that are basically near the ground, and the red ones are the ones that stick up. So now you have labels, and you can train a convolutional net to predict those labels.

Then you might ask: why would you want to train a convolutional net to do this, from stereo? And the answer is that stereo only works up to about 10 meters. Past 10 meters, using binocular vision and stereo vision, you can't really estimate the distance very well. And driving a robot by only looking 10 meters ahead of you is not a good idea; it's like driving a car in fog, it's not very efficient. So what you use the convolutional net for is to label every pixel in the image up to the horizon, essentially.

The cool thing about this system is that, as I said, the labels were collected automatically, and the robot also adapted itself as it ran, because it collected those stereo labels constantly: it could constantly retrain its neural net to adapt to the environment it's in. In this particular instance of the robot, it would only retrain the last layer; the first N minus 1 layers of the convnet were fixed, trained in the lab, and then the last layer was adapted as the robot ran.
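A minimal sketch of that last-layer adaptation, with a stand-in frozen feature extractor and made-up patch sizes; only the final linear classifier gets gradient updates, driven by the stereo-derived labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

features = nn.Sequential(                      # frozen first N-1 layers (stand-in)
    nn.Conv2d(3, 20, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(20 * 10 * 18, 100))
for p in features.parameters():
    p.requires_grad = False                    # trained in the lab, then fixed

classifier = nn.Linear(100, 5)                 # 5 traversability categories
opt = torch.optim.SGD(classifier.parameters(), lr=0.01)

def adapt_step(patches, stereo_labels):
    # patches: N x 3 x 24 x 40 image patches; stereo_labels: N labels (0..4)
    # computed from stereo within ~10 meters of the robot.
    with torch.no_grad():
        f = features(patches)                  # the features stay fixed
    loss = F.cross_entropy(classifier(f), stereo_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)
```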
This allowed the robot to deal with environments it had never seen before, essentially. That's the long-range vision: the input to the convnet was basically multi-scale views of bands of the image around the horizon. I won't go into the details. It's a very small neural net by today's standards, but that's what we could afford. I have a video; I'm not sure it's going to work, but I'll try. Oh yeah, it works. That's amazing.

So I should tell you a little bit about the cast of characters here. Pierre Sermanet and Raia Hadsell were two PhD students working with me on this project. Pierre Sermanet is at Google Brain; he works on robotics. Raia Hadsell is director of robotics at DeepMind. Marco Scoffier is at NVIDIA. Matt Grimes is at DeepMind. Jan Ben is at Mobileye, which is now Intel. Ayse Erkan is at Twitter. And Urs Muller is still working with us; he's actually head of a big group that works on autonomous driving at NVIDIA, and he's collaborating with us. Actually, Alfredo works on this project.

So this is the robot. It can drive at about fast walking speed, and it's supposed to drive itself in nature. It's got this mast with four eyes: two stereo camera pairs. Three computers in the belly, so it's completely autonomous; it doesn't talk to the network or anything. And that's me on the left, when I had a ponytail. Okay, so here the system is crippled: we didn't turn on the neural net, so it's only using stereo vision. And now it's using the neural net. So it's pretty far away from this barrier, but it sees it; and on the right side, it wants to go to a goal, a GPS coordinate, that's behind the barrier. Same here: it wants to go to a GPS coordinate behind, and it sees right away this wall of people it has to deal with. The guy on the right here is holding a transmitter; he's not driving the robot, but he's holding the kill switch.

And so that's what the convolutional net looks like: really small by today's standards. It produces, for every location, every patch on the input, a 100-dimensional vector at the second-to-last layer, which goes into a classifier that classifies into five categories. Once the system has classified the image into those five categories, you can warp the image into a map that's centered on the robot, and you can do planning in this map to figure out how to avoid obstacles and things like that. So this is what this thing does; it's not important for now.

Now, because this was 2007, the computers were slow and there were no GPUs, so we could run this neural net at only about one frame per second; as you can see here at the bottom, it updates about once per second. And so if someone walks in front of the robot, the robot won't see them for a second and will, you know, run over them. That's why we have a second vision system here at the top. This one is stereo-based; it doesn't use a neural net. And this is the controller, which is also learned, but we don't care about that here. And this system here, again, has its vision crippled: it can only see up to about two and a half meters, so it's very short-sighted, but it does a decent job. This is to test the fast-reacting vision system. So here Pierre Sermanet is jumping in front of it, and the robot stops right away.
So that's the full system, with long-range vision and annoying grad students. Right, so it's kind of giving up. Okay.

Okay, so that's called semantic segmentation, but the real form of semantic segmentation is one in which you give an object category for every location. That's the kind of problem we're talking about here, where every pixel is either building, or sky, or street, or a car, or something like this. Around 2010, a couple of datasets started appearing, with a few thousand images, on which you could train vision systems to do this. And the technique here is essentially identical to the one I described; it's also multi-scale. You have an input image, and you have a convolutional net with a set of outputs, one for each category of objects for which you have labels, which in this case is 33. And when you back-project one output of the convolutional net onto the input, it corresponds to a 46 by 46 input window. So it's using a context of 46 pixels to make the decision about a single pixel; at least, that's the neural net at the bottom.

But it turns out 46 by 46 is not enough if you want to decide what a gray pixel is. Is it the shirt of a person? Is it the street? Is it a cloud, or a pixel on a mountain? You have to look at a wider context to be able to make that decision. So we use, again, this kind of multi-scale approach, where the same image is reduced by a factor of two and a factor of four, and you run those two extra images through the same convolutional net: same weights, same kernels, same everything. Except the last feature maps, you upscale so that they have the same size as the original one. And now you take those combined feature maps and send them to a couple of layers of a classifier. So now the classifier, to make its decision, has several 46 by 46 windows on images that have been rescaled, and the effective size of the context is a 184 by 184 window, because the coarse-scale network basically looks at more or less the entire image. Then you can clean it up in various ways; I'm not going to go into the details, but it works quite well.

So this is the result. The guy who did this in my lab is Clément Farabet; he's a VP at NVIDIA now, in charge of machine learning infrastructure for autonomous driving, not surprisingly. And so that system... this is Washington Square Park, by the way; this is the NYU campus. It's not perfect, far from it: it identifies some areas of the street as sand or desert, and there is no beach I'm aware of in Washington Square Park. But at the time this was the best system of its kind, and the number of training samples was very small, about 2,000 or 3,000 images, something like that.

So: you take the full-resolution image, you run it through the first N minus 2 layers of your convnet, and that gives you a bunch of feature maps. Then you reduce the image by a factor of two and run it again; then again, reducing by a factor of four; you get smaller feature maps. Now you take the small feature maps and rescale them, upsample them, so they're the same size as the first ones. You stack all those feature maps together, and that you feed, for every patch, to a two-layer classifier. Yeah, the paper was rejected from CVPR 2012, even though the results were record-breaking, and it was faster than the best competing method by a factor of 50, even running on standard hardware.
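A minimal sketch of that multi-scale, shared-weight scheme, with a stand-in two-layer feature extractor; the real system had more layers and some post-processing, but the structure is the one described above: the same convnet at scales 1, 1/2 and 1/4, upsample the coarse maps, stack, and classify per pixel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

convnet = nn.Sequential(                      # stand-in shared feature extractor
    nn.Conv2d(3, 16, 7, padding=3), nn.ReLU(),
    nn.Conv2d(16, 64, 7, padding=3), nn.ReLU())
classifier = nn.Conv2d(3 * 64, 33, 1)         # 33 categories, decided per pixel

def segment(image):                           # image: 1x3xHxW
    H, W = image.shape[2], image.shape[3]
    maps = []
    for s in (1, 2, 4):                       # scales 1, 1/2, 1/4
        x = image if s == 1 else F.interpolate(
            image, scale_factor=1 / s, mode='bilinear', align_corners=False)
        f = convnet(x)                        # same weights at every scale
        maps.append(F.interpolate(f, size=(H, W), mode='bilinear',
                                  align_corners=False))
    return classifier(torch.cat(maps, dim=1)) # 1x33xHxW score map

print(segment(torch.randn(1, 3, 64, 96)).shape)  # torch.Size([1, 33, 64, 96])
```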
But we also had an implementation on special hardware that was incredibly fast. And people didn't know what a convolutional net was at the time, and so the reviewers basically could not fathom that a method they had never heard of could work so well; they said it was probably wrong. There's way more to say about convnets, but I encourage you to take a computer vision course to hear about it.

Yeah, this is... okay, this particular dataset we used is a collection of street images that was collected mostly by Antonio Torralba at MIT. He had a tool for labeling, so you could draw the contour of every object, label it, and it would fill up the object. Most of the segmentations were done by his mother, who is in Spain; she had a lot of time to spend doing this. Huh? Yes, his mother labeled that stuff. This was in the late 2000s.

Okay, now let's talk about a bunch of different architectures. As I mentioned before, the idea of deep learning is that you have this catalog of modules that you can assemble into different graphs and connect together to do different functions. A lot of the expertise in deep learning is designing those architectures to do something in particular. It's a little bit like the early days of computer science: coming up with an algorithm to write a program was a new concept; reducing a problem to a set of instructions that could be run on a computer was something new. Here it's the same problem: you have to imagine how to reduce a complex function to a graph, possibly a dynamic graph, of functional modules whose function you don't need to specify completely, because it's going to be finalized by learning. That's very, very important, of course, as we saw with convolutional nets.

The first important category is recurrent nets. When we talked about backpropagation, there was a big condition: the graph of interconnections of the modules could not have loops. It had to be a graph for which there is at least a partial order on the modules, so that you can compute the modules in such a way that when you compute the output of a module, all of its inputs are available. But a recurrent net is one in which you have loops. How do you deal with this?

So here is an example of a recurrent net architecture. You have an input that varies over time, x(t), that goes through a first neural net, let's call it an encoder, which produces a representation of the input, let's call it h(t). The encoder also has trainable parameters, but I didn't mention them. That representation goes into a recurrent layer: a function g that depends on trainable parameters w. The recurrent layer takes into account h(t), the representation of the input, but it also takes into account z(t-1), which is a hidden state: its own output at the previous time step. This g function can be a very complicated neural net inside, a convolutional net, whatever; it can be as complicated as you want. But what's important is that one of its inputs is its own output at the previous time step, z(t-1). That's what the delay indicates here: the input of g at time t is actually z(t-1), its output at the previous time step.
Then the output of that recurrent module goes into a decoder, which basically produces an output: it turns the hidden representation z into an output. So how do you deal with the loop? You unroll it. This is basically the same diagram, but now I've unrolled it in time. At time 0, I have x(0); that goes through the encoder and produces h(0). Then I apply the g function, starting with an arbitrary z, maybe 0 or something, and I get z(0), which goes into the decoder and produces an output. Then, at time step 1, I can use z(0) as the previous state. The input is now x(1); I run it through the encoder, through the recurrent layer (which is now no longer recurrent), through the decoder, and then on to the next time step, et cetera, okay?

This network, unrolled in time, doesn't have any loops anymore, which means I can run backprop through it. So if I have an objective function that says the last output should be a particular one, or maybe the whole trajectory of outputs should be a particular one, I can just backprop gradients through this thing. It's a regular network with one particular characteristic: every block shares the same weights. The three instances of the encoder are the same encoder at three different time steps, so they have the same weights; the g functions have the same weights; the three decoders have the same weights.

Yes? It can be variable; you don't have to decide in advance. It depends on the length of your input sequence, basically. You can run it for as long as you want: it's the same weights all over, so you can just repeat the operation. This technique of unrolling and then backpropagating gradients through time is called, unsurprisingly, BPTT: backprop through time. That's all there is to it.

Unfortunately, recurrent nets don't work very well, at least not in their naive form. A simple form of recurrent net is one in which the encoder is linear, the g function is linear followed by a hyperbolic tangent or a sigmoid, or perhaps a ReLU, and the decoder is also linear, maybe with a ReLU, something like that. It can be very simple. And you get a number of problems with this. One problem is the so-called vanishing gradient problem, or exploding gradient problem. It comes from the fact that if you have a long sequence, let's say 50 time steps, every time you backprop gradients through a time step, the gradients get multiplied by the weight matrix of the g function. Now, imagine the weight matrix has small values in it: every time you take your gradient and multiply it by the transpose of this matrix to get the gradient at the previous time step, you get a smaller vector, and as you keep going, the vector shrinks exponentially. That's the vanishing gradient problem: by the time you get to the 50th time step back, which is really the first time step, you don't get any gradient. Conversely, if the weight matrix is really large and the nonlinearity in your recurrent layer is not saturating, your gradients can explode.
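A minimal sketch of such a simple recurrent net, unrolled by an ordinary Python loop, with made-up sizes; the same three modules are reused at every step, and calling backward() on a loss at the last step is exactly backprop through time.

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, d_in, d_hid, d_out):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hid)      # encoder
        self.g = nn.Linear(2 * d_hid, d_hid)   # recurrent layer g(h_t, z_{t-1})
        self.dec = nn.Linear(d_hid, d_out)     # decoder

    def forward(self, xs):                     # xs: T x d_in
        z = torch.zeros(self.g.out_features)   # arbitrary initial state
        outs = []
        for x in xs:                           # the unrolling loop
            h = self.enc(x)                    # same encoder weights every step
            z = torch.tanh(self.g(torch.cat([h, z])))  # uses the previous state
            outs.append(self.dec(z))
        return torch.stack(outs)

net = SimpleRNN(8, 16, 4)
ys = net(torch.randn(50, 8))                   # 50 time steps
ys[-1].sum().backward()                        # backprop through time
```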
If the weight matrix is large, every time you multiply the gradient by the transpose of the matrix, the vector gets larger, and it explodes, which means your weights are going to diverge when you take a gradient step, or you're going to have to use a tiny learning rate for it to work. So you have to use a lot of tricks to make it work.

Here's another problem. Why would you want to use a recurrent net in the first place? The purported advantage of recurrent nets is that they can remember things from far away in the past. Imagine that the x's are characters that you feed in one by one, characters that come from, I don't know, a C program or something like that, and your system reads a few hundred characters corresponding to the source code of a function. At the end, you want to train your system to produce +1 if it's a syntactically correct program and -1 if it's not. A hypothetical problem. Recurrent nets won't do it, okay? At least not without tricks. There's a big issue here, which is that, among other things, a correct program has to have balanced braces and parentheses. So the system has to have a way of remembering how many open parentheses there are, so that it can check that you've closed them all, and how many open braces there are, so that all of them get closed. It has to store, essentially within its hidden state z, how many braces and parentheses were opened, if it wants to be able to tell at the end that all of them have been closed. It has to have some sort of counter inside, right? Yes; it's going to be a topic tomorrow.

Now, if the program is very long, that means z has to preserve information for a long time, and recurrent nets give you the hope that maybe a system like this can do it. But because of the vanishing gradient problem, they actually don't, at least not the simple recurrent nets of the type I just described. So you have to use a bunch of tricks. These are tricks from Yoshua Bengio's lab, but a bunch of them were published by various people, like Tomas Mikolov and others. To avoid exploding gradients, you can clip the gradients: if the gradients get too large, you just squash them down, normalize them. Leaky integration, momentum, I'm not going to go into those. Good initialization: you want to initialize the weight matrices so that they preserve the norm, more or less; there's actually a whole bunch of papers on this, on orthogonal and invertible recurrent nets. But the big trick is LSTMs and GRUs.

Okay, so what is that? Before I talk about it, I'm going to talk about multiplicative modules. What are multiplicative modules? They're basically modules in which you multiply things with each other: instead of just computing a weighted sum of inputs, you compute products of inputs, and then a weighted sum of those. You have an example of this at the top. There, the output of the system is just a weighted sum of weights and inputs, classic. But the weights themselves are actually weighted sums of weights and inputs: w_ij, the ij-th term in the weight matrix of the module we're considering, is itself a weighted sum of the slices of a third-order tensor u_ijk, weighted by variables z_k. Okay?
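A minimal sketch of that quadratic form of a multiplicative module, with made-up sizes; the einsum spells out exactly the double sum discussed next.

```python
import torch

d_out, d_in, d_z = 5, 8, 3
U = torch.randn(d_out, d_in, d_z)   # the third-order tensor u_ijk (learned)
x = torch.randn(d_in)               # ordinary input
z = torch.randn(d_z)                # control input, e.g. another net's output

W = torch.einsum('ijk,k->ij', U, z) # w_ij = sum_k u_ijk z_k: z picks the matrix
s = W @ x                           # s_i = sum_j w_ij x_j
assert torch.allclose(s, torch.einsum('ijk,k,j->i', U, z, x), atol=1e-5)
```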
So basically what you get is that w_ij is a weighted sum of matrices u_k, weighted by coefficients z_k. And the z_k's can change; they are input variables in the same way. In effect, it's like having a neural net with weight matrix w, whose weight matrix is itself computed by another neural net. There's a general form of this where you don't just multiply matrices: you have a neural net that is some complex function turning x into s, some generic function, could be a convnet, whatever. And the weights of that neural net are not variables you learn directly; they are the output of another neural net, which takes maybe another input into account, or maybe the same input. Some people call those architectures hypernetworks: networks whose weights are computed by another network. What's here is just a simple form of it, a bilinear or quadratic form. Overall, when you write it all down, s_i is equal to the sum over j and k of u_ijk times z_k times x_j; it's a double sum. People used to call these sigma-pi units.

Yes? What's the motivation? We'll come to it in just a second. Basically, if you want a neural net that performs a transformation from one vector into another, and that transformation needs to be programmable, you can have the transformation be computed by a neural net whose weights are themselves the output of another neural net that figures out what the transformation is. That's the more general form. More specifically, it's very useful if you want to route signals through a neural net in different ways, in a data-dependent way. In fact, that's exactly what is mentioned just below: an attention module is a special case of this. It's not a quadratic layer, it's a somewhat different type, but it's a particular type of architecture that basically computes a convex linear combination of a bunch of vectors.

So x1 and x2 here are vectors, and w1 and w2 are scalars. What the system computes is a weighted sum of x1 and x2, weighted by w1 and w2. Now imagine that those two weights are between 0 and 1 and sum to 1: that's what's called a convex linear combination. If they sum to 1 because they're the output of a softmax, then w2 is equal to 1 minus w1, as a direct consequence. So by changing the relative sizes of w1 and w2, you can switch the output to being either x1, or x2, or some linear combination of the two, some interpolation between the two. And you can have more than just x1 and x2; you can have a whole collection of x vectors, and the system will choose an appropriate linear combination, or focus. It's called an attention mechanism because it allows the neural net to focus its attention on a particular input and ignore the others. The choice is made by another variable, z, which itself could be the output of some other neural net that looks at the x's, for example. And this has become a hugely important type of function; it's used in a lot of different situations now. In particular, it's used in LSTMs and GRUs, but it's also used in pretty much every natural language processing system nowadays that uses either transformer architectures or other types of attention. They all use this kind of trick. So you have a vector z; you pass it through a softmax, and you get a bunch of numbers between 0 and 1 that sum to 1.
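A minimal sketch of that attention mechanism, with made-up sizes: a softmax over scores produces the convex combination weights, and scaling the scores up makes the selection nearly hard.

```python
import torch

xs = torch.randn(4, 16)               # candidate vectors x_1 .. x_4
z = torch.randn(4)                    # scores, typically produced by another net
w = torch.softmax(z, dim=0)           # coefficients in (0, 1) that sum to 1
output = w @ xs                       # convex combination of the x's
w_hard = torch.softmax(10 * z, dim=0) # sharper scores: nearly one-hot, so the
output_hard = w_hard @ xs             # output is (almost) a single chosen x_i
```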
You use those as coefficients to compute a weighted sum of a bunch of vectors x_i, and you get the sum weighted by those coefficients. The coefficients are data-dependent, because z is data-dependent.

All right, so here's an example of how you use this. Whenever you see this symbol, the circle with the dot in the middle, that's a component-by-component multiplication of two vectors; some people call it the Hadamard product. Anyway, it's term-by-term multiplication. So this is a type of functional module called the GRU, the gated recurrent unit. It was proposed by Kyunghyun Cho, who is a professor here, and it's an attempt at fixing the problems that naturally occur in recurrent nets, which I mentioned: the fact that you have exploding gradients, and the fact that recurrent nets don't really remember their state for very long; they tend to forget very quickly. So it's basically a memory cell. And I have to say, this is the second big family of recurrent nets with memory; the first one is the LSTM, but I'm going to talk about that just afterwards, because this one is a little simpler.

The equations are written at the bottom here. There is a gating vector z, which is simply the application of a sigmoid function to the sum of two linear layers and a bias. Those two linear layers take into account the input, x(t), and the previous state, which they denote h, not z like I did. So you take x, you take h, you pass them through matrices, you add the results, you pass that through a sigmoid, and you get a bunch of values between 0 and 1, because the sigmoid is between 0 and 1. That gives you coefficients. And you use those coefficients, as you see in the formula at the bottom: z is used to compute a linear combination of two things. If z is equal to 1, you basically only look at h(t-1). If z is equal to 0, then 1 minus z is equal to 1, and you look at this other expression, which is some weight matrix multiplied by the input, passed through a hyperbolic tangent; it could be a ReLU, but it's a hyperbolic tangent in this case. And that is combined with other stuff here that we can ignore for now. So basically, what the z value does is tell the system: if z equals 1, just copy the previous state and ignore the input. It acts like a memory, essentially; it just copies its previous state to its output. And if z equals 0, then the current state is forgotten, essentially, and you just read the input, multiplied by some matrix; that changes the state of the system.

Sorry, could you clarify one more time? Yes, this is done component by component, essentially. So the 1 here is essentially a vector of ones; yes, exactly. How will the derivatives look in the case of element-wise multiplication; do the gradients also get element-wise multiplied in backprop? Well, it's just like a number of independent multiplications, right? The derivative of some objective function with respect to one input of a product is the derivative of that objective function with respect to the product, multiplied by the other term. And the memory behavior comes from the fact that z is, by default, more or less equal to 1: by default, the system just copies its previous state.
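A minimal sketch of that update gate, using the notation above and omitting the reset gate (the "other stuff" mentioned, which comes next): z near 1 copies the previous state component-wise, z near 0 overwrites it from the input.

```python
import torch

def gated_update(x, h_prev, Wz, Uz, bz, Wh):
    z = torch.sigmoid(Wz @ x + Uz @ h_prev + bz)  # gate in (0, 1), per component
    h_new = torch.tanh(Wh @ x)                    # candidate state from the input
    return z * h_prev + (1 - z) * h_new           # z=1: copy memory; z=0: overwrite

d = 8
x, h = torch.randn(d), torch.randn(d)
Wz, Uz, Wh = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
h_next = gated_update(x, h, Wz, Uz, torch.zeros(d), Wh)
```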
And if z is just a little less than 1, it puts a little bit of the input into the state, but doesn't significantly change the state. What that means is that it preserves the norm and it preserves information, right? It's basically a memory cell that you can change continuously. Why a sigmoid? Because you need something between 0 and 1; it's a coefficient, so it needs to be between 0 and 1, and you want it monotonic and differentiable; that's why people use sigmoids. There are lots of sigmoid-like functions, and why not others? There are some arguments for using others, but it doesn't make a huge amount of difference.

Okay, in the full form of the GRU, there's also a reset gate, this guy here. So r is another vector, also computed as a linear combination of the input and the previous state, and it serves to multiply the previous state. If r is 0 and z is 0, the system is basically completely reset, because that term is 0, and it only looks at the input.

This is basically a simplified version of something that came out way earlier, in 1997, called LSTM, Long Short-Term Memory, which was an attempt at solving the same issue: that recurrent nets lose their memory too quickly. So you build them as memory cells that, by default, preserve their information. It's essentially the same idea here; the details are slightly different. Here they don't draw the dots in the middle of the round shapes for the products, but it's the same thing, and there are a few more moving parts. Basically, it looks more like an actual RAM cell: it's like a flip-flop that can preserve information, with some leakage that you can have, and you can reset it to 0 or set it to 1. It's fairly complicated. Thankfully, people at NVIDIA, Facebook, Google, and various other places have very efficient implementations of those, so you don't need to figure out how to write the CUDA code or the backprop for it.

It works really well and is quite widely used, but it's used less and less, because people are moving away from recurrent nets. People used to use recurrent nets mostly for natural language processing and things like speech recognition. Speech recognition is moving towards convolutional nets, temporal convolutional nets, while natural language processing is moving towards what's called transformers, which you'll hear a lot about tomorrow, right? No? When? Soon, okay. So, I'm not going to talk about transformers just now, but basically transformers are a generalization, a generalized use of attention, if you want: big neural nets where every block uses attention. And that tends to work so well that people are basically dropping everything else for NLP. The problem is that systems like the LSTM are not very good at this sort of thing; transformers are much better. The biggest transformers have billions of parameters; the biggest ones were around 15 billion, that order of magnitude, the T5 from Google or whatever it's called. That's an enormous amount of memory, and because of the particular type of architecture used in transformers, they can actually store a lot of knowledge, if you want. So that's the stuff people use for the things you're talking about: question answering systems, translation systems, et cetera. They all use transformers.
So, because the LSTM was one of the first recurrent architectures that kind of worked, people tried to use it for things that at first you would think are crazy, but that turned out to work. One example is translation; it's called neural machine translation. There was a paper by Ilya Sutskever and colleagues at NIPS 2014 where they trained a giant multilayer LSTM. So what's a multilayer LSTM? At the bottom here you have an LSTM (this is the unfolded version), unfolded here for three time steps, but it would really be unfolded for the length of the sentence you want to translate, a sentence in French, say. Then you take the hidden state at every time step of this LSTM and feed it as input to a second LSTM; I think in that network there were actually four layers of that. So you can think of this as stacked LSTMs, each of which is recurrent in time, but which are stacked like the layers of a neural net. At the last time step of the last layer, you have a vector which is meant to represent the entire meaning of the sentence, okay? It can be a fairly large vector. And then you feed that to another multilayer LSTM, which you run for an undetermined number of steps, and whose role is to produce words in the target language, say German if you're doing translation. So this system takes the state, runs it through the layers of the decoder LSTM, produces a word, and then takes that word and feeds it back as input at the next time step, so that you can generate text sequentially: run through it, produce another word, feed that word back to the input, and keep going.

If you do this for translation, you get this gigantic neural net, and you train it. A system of this type is the one that Sutskever presented at NIPS in 2014. It was the first neural translation system whose performance could rival more classical approaches not based on neural nets, and people were really surprised that you could get such results. That success was very short-lived.

Yes? So, the problem is that the word you're going to say at a particular time depends on the word you just said, right? If you ask the system to just produce a word, and you don't feed that word back to the input, the system could produce a next word that is inconsistent with the previous one. So the information should travel between the two time steps through the hidden state? It should, but it doesn't; I mean, not well enough that it works. So this kind of sequential production is pretty much required. In principle you're right; it's not very satisfying.

So there's a problem with this, which is that the entire meaning of the sentence has to be squeezed into that hidden state between the encoder and the decoder. That's one problem. The second problem is that, despite the fact that LSTMs are built to preserve information, basically as memory cells, they don't actually preserve information for more than about 20 words. So if your sentence is more than 20 words long, by the time you get to the end of it, the hidden state will have forgotten the beginning. The fix people use for this is a huge hack called the BiLSTM, and it's a completely trivial idea: you run two LSTMs over the sentence in opposite directions.
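A minimal sketch of the encoder-decoder idea, with toy vocabulary sizes and an assumed end-of-sentence token; the real system stacked four LSTM layers and was trained end to end, but the shape of it is this: encode the source, then generate greedily, feeding each produced word back in.

```python
import torch
import torch.nn as nn

V_SRC, V_TGT, D, EOS = 1000, 1000, 256, 0   # toy vocab sizes; EOS = token 0

embed_src = nn.Embedding(V_SRC, D)
encoder = nn.LSTM(D, D, num_layers=2)       # reads the source sentence
embed_tgt = nn.Embedding(V_TGT, D)
decoder = nn.LSTM(D, D, num_layers=2)       # generates the target sentence
readout = nn.Linear(D, V_TGT)

def translate(src_tokens, max_len=50):      # src_tokens: LongTensor of shape (T,)
    _, state = encoder(embed_src(src_tokens).unsqueeze(1))  # "meaning" state
    word, out = torch.tensor([EOS]), []
    for _ in range(max_len):
        o, state = decoder(embed_tgt(word).unsqueeze(1), state)
        word = readout(o[0]).argmax(dim=-1) # greedy choice of the next word
        if word.item() == EOS:
            break
        out.append(word.item())             # the word is fed back at the next step
    return out

print(translate(torch.randint(1, V_SRC, (7,))))
```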
You get two codes: one vector from running an LSTM from the beginning of the sentence to the end, and a second vector from running an LSTM in the other direction. Together, that's the meaning of your sentence. You can basically double the length of sentence you can handle without losing too much information this way, but it's not a very satisfying solution. So if you see "BiLSTM" somewhere, that's what it is.

As I said, the success was short-lived, because in fact, before the Sutskever paper was even presented at NIPS, there was a paper by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, published on arXiv in September 2014, that said: we can use attention. So, the attention mechanism I mentioned earlier. Instead of having those gigantic networks and squeezing the entire meaning of a sentence into a small vector, it would make more sense, when we want to produce a word in French corresponding to a sentence in English, to look at the location in the English sentence that has that word, okay? Our decoder is going to produce French words one at a time, and when it comes to produce a word that has an equivalent in the input English sentence, it's going to focus its attention on that word, and then the translation of that word from English to French is simple. It may not be a single word; it can be a group of words, because very often you have to turn a group of words in English into a group of words in French to say the same thing. And if it's German, you have to put the verb at the end.

So, basically, you use this attention mechanism. This attention module here is the one I showed a couple of slides earlier: it decides which of the time steps, which hidden representation for which word in the input sentence, the system is going to focus on to produce the current word at a particular time step. Essentially, there's a small piece of neural net that looks at the inputs and produces an output that goes through a softmax, producing a bunch of coefficients between 0 and 1 that sum to 1, and those are used to compute a linear combination of the states at the different time steps. By setting one of those coefficients to 1 and the others to 0, it focuses the attention of the system on one particular word. The magic of this is that the neural net that computes those coefficients through the softmax can actually be trained with backprop: it's just another set of weights in the neural net, and you don't have to build it by hand; it just figures it out. This completely revolutionized the field of neural machine translation, in the sense that within a few months a team from Stanford won a big competition with this, beating all the other methods, and within three months every big company that works on translation had basically deployed systems based on it.
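A minimal sketch of attention over encoder states at one decoder step, roughly in the spirit of Bahdanau et al.; the scoring net here is a stand-in, and the sizes are made up.

```python
import torch
import torch.nn as nn

T, D = 12, 256
enc_states = torch.randn(T, D)        # one hidden state per source word
dec_state = torch.randn(D)            # decoder state at the current step
score_net = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh(), nn.Linear(D, 1))

pairs = torch.cat([enc_states, dec_state.expand(T, D)], dim=1)   # T x 2D
weights = torch.softmax(score_net(pairs).squeeze(1), dim=0)      # T coefficients
context = weights @ enc_states        # which source words to look at right now
# `context` then feeds the decoder to produce the next target word; score_net's
# weights are learned by backprop like every other set of weights.
```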
So this just changed everything, and then people started paying more attention to attention, in the sense that there was then a paper by a bunch of people at Google whose title was Attention Is All You Need, and it was basically a paper that solved a bunch of natural language processing tasks by using a neural net where every layer, or every group of neurons, basically was implementing attention, or something called self-attention. That's what a transformer is. Yes. It's the number of inputs that you focus attention on.

Okay, I'm going to talk now about memory networks. So this stems from work at Facebook that was started by Antoine Bordes, I think in 2014, and by Sainbayar Sukhbaatar, I think in 2015 or 2016, called End-to-End Memory Networks. Sainbayar Sukhbaatar was a PhD student here, and he was an intern at Facebook when he worked on this, together with a bunch of other people at Facebook. And the idea of a memory network is that you'd like your neural net to have a short-term memory, a working memory. Okay, if I tell you a story: John goes to the kitchen, John picks up the milk, Jane goes to the kitchen, and then John goes to the bedroom and drops the milk there and then goes back to the kitchen, and I ask you, where's the milk? Okay, so every time I told you a sentence, you kind of updated in your mind a current state of the world, if you want. And so by telling you the story, now you have a representation of the state of the world, and if I ask you a question about the state of the world, you can answer it. Okay, you store this in a short-term memory. There are a number of different parts in your brain, but there are two important parts. One is the cortex. The cortex is where you have long-term memory, where all your thinking is done, and all that stuff. And there is a separate chunk of neurons called the hippocampus, which is kind of two formations in the middle of the brain, and they send wires to pretty much everywhere in the cortex, and the hippocampus is thought to be used as a short-term memory. So it can, you know, remember things for a relatively short time. The prevalent theory is that when you sleep and you dream, a lot of information is being transferred from your hippocampus to your cortex to be solidified in long-term memory, because the hippocampus has limited capacity. When you get senile, like when you get really old, very often your hippocampus shrinks and you don't have short-term memory anymore, so you keep repeating the same stories to the same people. Okay, it's very common. Or you go to a room to do something, and by the time you get to the room you've forgotten what you were there for. This starts happening by the time you're 50, by the way, so I don't remember what I said last week or two weeks ago.

Okay, but anyway, so here is the idea of a memory network. You have an input to the memory network, let's call it x, and think of it as an address for a memory. What you're going to do is compare this x with a bunch of vectors we're going to call k: k1, k2, k3. Okay, so you compare those vectors, and the way you compare them is through a dot product, very simple. Okay, so now you have the three dot products of all three k's with x. Those are scalar values. You plug them through a softmax, so what you get are three numbers between 0 and 1 that sum to 1. What do you do with those? You have three other vectors that I'm going to
call v: v1, v2, v3. And what you do is multiply those vectors by those scalars, so this is very much like the attention mechanism that we just talked about, okay, and you sum them up. So: take an x, compare x with each of the k's (those are called keys), you get a bunch of coefficients between 0 and 1 that sum to 1, and then compute a linear combination of the v's (those are the value vectors) and sum them up. Okay, so imagine that one of the keys exactly matches x. You're going to have a large coefficient there and small coefficients elsewhere, so the output of the system will essentially be that key's value: if k2 matches x, the output will essentially be v2. Okay, so this is an addressable associative memory. An associative memory is exactly that: you have keys and values, and if your input matches a key, you get the value. Here it's kind of a soft, differentiable version of that, so you can backpropagate through it. You can write to this memory by changing the v vectors, or even changing the k vectors. You can change the v vectors by gradient descent. Okay, so if you wanted the output of your memory to be something in particular, by backpropagating gradient through this, you're going to change the currently active v to whatever it needs to be for the output.

So in those papers, I mean, there's a series of papers on memory networks, but what they did was exactly as I just explained. You kind of tell a story to the system, so you give it a sequence of sentences. Those sentences are encoded into vectors by running them through a neural net, which is not pre-trained; you know, just through the training of the entire system, it figures out how to encode this. And then those sentences are written to a memory of this type. And then when you ask a question to the system, you feed the question to the input of a neural net, the neural net produces an x for the memory, the memory returns a value, and then you use this value and the previous state of the network to re-access the memory. You can do this multiple times, and you train this entire network to produce an answer to your question. And if you have lots and lots of scenarios, lots and lots of questions, lots of answers, which they did in this case by artificially generating stories, questions, and answers, this thing actually learns to store stories and answer questions, which is pretty amazing. So that's the memory network.

Can you just say one more time what happens in the first step? So the first step is you compute alpha_i = k_i^T x, just a dot product. And then you compute the vector c, which is the softmax function applied to the vector of alphas, so the c_i are between 0 and 1 and sum to 1. And then the output of the system is sum_i c_i v_i, where the v_i are the value vectors. That's the memory.
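In code, that whole addressing step is only a few lines. Here is a bare sketch of the mechanism as just recapped, the dot-product lookup itself, not the full End-to-End Memory Network training setup; the shapes and names are made up for the example:

```python
import torch

def memory_lookup(x, keys, values):
    """Soft associative memory: alpha_i = k_i^T x, c = softmax(alpha), out = sum_i c_i v_i."""
    alphas = keys @ x                      # (N,) dot product of each key with the address x
    coeffs = torch.softmax(alphas, dim=0)  # (N,) between 0 and 1, sum to 1
    return coeffs @ values                 # (dim,) linear combination of the values

# If one key closely matches x, its coefficient dominates the softmax and
# the output is essentially that key's value. Everything here (keys,
# values, x) is differentiable, so it can all be trained by backprop.
keys = torch.randn(3, 8)              # k1, k2, k3
values = torch.randn(3, 8)            # v1, v2, v3
x = keys[1] + 0.01 * torch.randn(8)   # an address close to k2
out = memory_lookup(x, keys, values)  # approximately v2
```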
Yes, yes, absolutely. Not really, no. I mean, all you need is for everything to be encoded as vectors. So run an image through your favorite convnet, you get a vector that represents the image, and then you can do VQA, visual question answering. You can imagine lots of applications of this. So in particular, one application is: you can think of this as kind of a memory, and then you can have some sort of neural net that takes an input, produces an address for the memory, gets a value back, and then keeps going and eventually produces an output. This looks very much like a computer, where the neural net here is the CPU, the ALU if you want, and the memory is just an external memory you can access whenever you need it, or write to if you want. It's a recurrent net in this case; you can unfold it in time, which is what these guys did. And so then there are people who kind of imagined that you could actually build differentiable computers out of this. There's something called a Neural Turing Machine, which is essentially a form of this where the memory is not of this type; it's kind of a soft tape. The Neural Turing Machine, that's work from DeepMind. There's an interesting story about this, which is that the Facebook people put out the paper on memory networks on arXiv, and three days later the DeepMind people put out a paper about the Neural Turing Machine. And the reason they put it out three days later is that they had been working on the Neural Turing Machine, and in their tradition they kind of keep things secret until they can make a big splash, but then they got scooped, so they put the paper out on arXiv. Eventually they made a big splash with a paper, but that was a year later.

So what's happened since then is that people have taken this module, this idea that you compare inputs to keys, and that gives you coefficients, and you produce values, as kind of an essential module in a neural net. And that's basically what a transformer is. So a transformer is basically a neural net in which every group of neurons is one of those; it's a whole bunch of memories, essentially. There are some more twists to it, okay, but that's kind of the basic idea. But you'll hear about this in a week. Oh, in two weeks. One week? Okay. Any more questions? Cool. All right, thank you very much.