Okay, so good morning everyone. This is going to be a short lecture today; Alfredo is going to take over for the second half. I'm going to be talking about the practice of convolutional nets, how they are used in various ways. And the first thing I should say is that they are used a lot. They're used almost everywhere there is image recognition to be done, but also in a lot of other applications. We'll go through some of them, not entirely today but perhaps in subsequent lectures; today I'm going to be speaking mostly about computer vision. And I'm going to be illustrating some of those uses with a little bit of the history of how those things were proposed, going back to the early 1990s. There are tens of thousands of papers published every year on using convolutional nets for various things, so I cannot possibly do an exhaustive overview of everything. Just looking at the CVPR conference, which is the main computer vision conference, there are literally thousands of papers every year on just this topic, so much that nobody can really keep track of it, and certainly not me. Okay. So we talked about the basic modules that form a convolutional net last week, and the very basic architecture of a convolutional net, where you stack convolution operations, or multiple convolution operations, interspersed with nonlinearities and pooling operations where the spatial or temporal dimension is reduced. And we can ask ourselves: what are convolutional nets good for? The design comes from the idea that features can appear anywhere on the image. So if you want a feature detector at one location, it's a good idea to have the same feature detector at another location. There comes the idea of local feature detectors, and then replicating them over the field of view, essentially, which is the idea of correlation. Now, this is due to properties of natural signals in particular: things that come to you in the form of an array, a multi-dimensional array or a single-dimensional array. So it can be a time series. It could be multiple time series, so a multi-channel one-dimensional array if you want, in effect a two-dimensional array. It could be an image, which is a two-dimensional array. It could be a color image or a multi-spectral image, an image that has more than three colors; there are sensors that can do this, for example with bands in infrared and ultraviolet. There is a lot of satellite imaging, for example, that has hundreds of bands. So that would be a three-dimensional array where two of the dimensions are spatially organized and the third is not necessarily; those could just be channels that don't have any structure or order to them, if you want. And then you can have four-dimensional arrays as well; another example of a three-dimensional array would be volumetric images, the kind of image you get out of an MRI machine, for example, or in some cases from sensors like LiDAR, which basically gives you an XYZ location for every point it looks at. So it gives you not just an image with grayscale; you could think of the grayscale being replaced by depth, right?
And you could interpret this either as a 2D image with different values, or as a 3D image where each voxel indicates the presence or absence of an object in it. There are actually libraries to deal with this; there's something called PyTorch3D, which does a bit of that. Then other 3D data includes video, and 4D data would be color video, right? And you can imagine all kinds of signals that have more dimensions: a video of volumetric images, like what you get from an ultrasound, for example, or functional MRI images of the brain, things like that. Okay. So that's the first property: the data comes to you in the form of an array. The second property is that the signals have strong local correlations, which means that neighboring values, pixels, voxels, whatever, are likely to take similar values. This is the case for an image. If you take neighboring pixels in an image and compute the statistics of how different two neighboring pixels are, they're generally not very different, unless you have an edge in the image, but edges are rare, actually. But as the distance between the two pixels you're measuring grows, the likelihood that those two pixels have similar color decreases. So what that suggests is that you have strong local correlations, and that there are patterns: when you take a block of pixels, five by five pixels, for example, or maybe more, the type of patterns you observe does not cover the set of all possible combinations of pixels, because of those correlations. So it's interesting to detect particular combinations of those pixels: flat uniform areas, soft gradients, edges, maybe gratings, things like that. What that means is that you can abstract the content of such a patch in an image by a list of presences or absences of particular features. That's kind of the idea. But that's due to the fact that you have strong local correlations. If you take an image and randomly permute the pixels so that you break the local correlations, and you plug your convolutional net into it, the convolutional net won't do very well. A fully connected network will work exactly the same: a fully connected network doesn't care about the topology of the input. But it comes at a price, because you have to connect everything to everything, so it's not always a good idea to use that. So convolutional nets exploit this local correlation. And the weight sharing exploits the fact that features can basically appear anywhere, because you can move your eyes or your camera and you never know where a particular object or a particular feature will appear. And then there's basically a fourth property: signals in which objects are subject to translations, distortions, et cetera, and you want the system to be able to recognize or classify those objects independently of those distortions. That's provided, hardwired a little bit, by the pooling operation. Okay. So 1D ConvNets are used when the signals include text, music, audio, speech, time series, and things of that type; 2D ConvNets for images, as I said, but also for time-frequency representations of speech and audio. This is basically a way of turning an audio or speech signal into an image where one dimension is time.
And the other dimension is frequency, and every value indicates the energy of the signal in that particular frequency band at that particular time. The spectrogram is an example of this. This kind of representation is used for recognition, localization, detection, things like that. And then 3D ConvNets are used for video, biomedical images, hyperspectral images. This is the example I was citing of satellite images where you don't have just three colors; you might have 100 different channels of different frequency bands. That's used for astronomy as well, spectroscopy basically. So the color, if you want, is a third dimension. There's a lot of information in every pixel, and that changes things. So, why do we need to stack layers in ConvNets? Why is it interesting to stack layers? For the same reason that deep learning is a good idea: we need to stack layers because the world is compositional, in a way. I alluded to this a little last time. So in a convolutional net, when you train a convolutional net to do image recognition, to classify objects, for example, what you see (and this is a visualization by Rob Fergus and his former student, Matt Zeiler) is feature detectors at a low level that detect simple motifs: oriented edges, color gradients, and things like this. Essentially what I was just saying: motifs that appear surprisingly often in images and are informative for whatever task. And it doesn't matter what task you train this ConvNet to do, it will almost always come up with things like this. They will always be a bit different, but things like this, essentially. Now what you can try to do is figure out what input pattern will maximally activate a particular unit at a particular layer. The network that was tested here has something like 20 layers, and I'm only representing three of them. But at some level, somewhere, there are feature detectors that detect things like circles and gratings and corners, simple local shapes and motifs, which in subsequent layers are assembled to basically form parts of objects. So it's sort of a weird visualization here, but those are essentially the patterns that will maximally activate a particular detector in a particular layer in a convolutional net. And this is appropriate for natural data because the world is compositional, in the sense that in physics, it's clearly the case: we have elementary particles at the low level, or strings at an even lower level (although that's not verified), elementary particles like quarks and bosons and so on. And those assemble to form other particles, like neutrons and protons. And then those assemble to form atoms, and those assemble to form molecules, and then materials, and then parts of objects, collections of objects, planets and scenes and things like that. So there is this hierarchy in physics, or in natural science, and every level has a different name, right? There is high-energy physics at the bottom, and then there is solid-state physics or condensed-matter physics, and then there is chemistry or physical chemistry, then organic chemistry where you have big molecules, and then biochemistry, and then biology; and then you need multiple levels of description.
And then when we talk about human society, there is economics and sociology and psychology and things like that. So at every level, we have a different level of description of reality that uses different concepts. And it's this idea of hierarchy. So that's one reason why perhaps this idea of hierarchy is a good idea. One of the big mysteries of the universe is: why is it that the world is organized in such a way that we can actually understand it? Because it could be a complete mess that we could not possibly understand. There's a famous quote by Albert Einstein, which I don't remember if I told you already (maybe Alfredo does). He said: the most incomprehensible thing about the world is that the world is comprehensible, or understandable. And the reason is that it's compositional. So we have levels of description that can describe, at some abstract level, what happens at the lower level, ignoring the details. In fact, there are explicit models in physics, things like tensor networks, whose architectures for representing the interactions between different quantities look very much like convolutional nets. There are two techniques: one is called the renormalization group, which describes a complex system by abstracting away the local details, if you want; this is used in condensed-matter physics. And then there is something called the multi-scale entanglement renormalization ansatz, also called MERA, which looks very much like a convolutional net and is designed to represent the complex interactions that occur between different quantities. So there is probably some sort of deep cosmic reason for multilayer systems to be useful, and probably why we have them in our brain. Okay, let's come back to Earth a little bit and talk about the practicality of convolutional nets. We talked about convolutional nets for single objects, but it turns out you can use convolutional nets as a detector, as a way of recognizing multiple objects. We'll talk about the more complex way of doing this in a subsequent lecture on structure prediction, but I'm going to tell you the basic idea here. So pretty early on, in the early 90s, we realized that we could apply a convolutional net to recognize not just a single object but multiple objects. What you see here is a convolutional net that has been trained to recognize a 32 by 32 image; every output, if you want, is influenced by a 32 by 32 input. But in fact, after training, the convolutional net has been made bigger: the input was extended to 32 vertically by 64 horizontally. And what happened to the convolutional net is that we replicated it over multiple windows of 32 pixels, and I'll tell you in a minute how that's done. So imagine that you take a 32 by 32 window and run your favorite convolutional net for character recognition; it gives you a score for each category. Then you shift this window by a few pixels, let's say four pixels, and it gives you another answer. And then you shift it by another four pixels. So whenever a character is more or less centered in that window, this particular ConvNet will produce a reliable answer as to what it's seeing within this window. You can train the system for this. When you train it, you show it characters that are centered and tell it what they are: you show a five and you say that's a five.
And then you show a five that's slightly off-center and you tell it not to answer anything, because it's not centered and you can't tell what it is. So you train it to explicitly not produce anything; there's no softmax on the output, basically, you just turn off all the outputs. What you can also do is show a character in the center during training and then show other characters on the side, which you can bring more or less close to the character in the center, but train the system to only recognize the one in the center, and to ignore the ones on the side. And then you can also show images where you have two characters on the sides and nothing in the center, and again train it to produce "none of the above." So when you take this neural net and shift it over the input, it's going to turn on and detect a character whenever there is one that's more or less centered in it, and it will say "none of the above" whenever there is no decent character being centered. This is an example of this, where this ConvNet has been shifted every four pixels; the input window is 32 pixels. What you see here is the winning category, where the intensity of the grayscale indicates the score, if you want, of the winning category. So this system detects the five in multiple places, wherever it looks like a five; the one is detected at basically only one location. And what you see is that the system doesn't quite care where the characters end and begin; it basically figures out which parts of the image fit with each other to produce a score. As I said, this is an old idea that was developed in the lab I was working in at Bell Labs at AT&T. And here is how you do it in practice. It turns out to be extremely cheap to do this: replicating a convolutional net over a larger input and handling variable-size inputs. We used to call this SDNN, which stands for Space Displacement Neural Net, but it's really just a convolutional net. So imagine that you've trained a single-character recognizer whose input window is 32 by 32, or in this case maybe a little more, and you want to apply it as a sliding window over a bigger input representing, for example, a written word. The way you do this is that you do not recompute the convolutional net at every location. You don't need to, because when you shift the convolutional net by a few pixels, there is a whole bunch of activations you have already computed, since you're using the same weights all over. So the basic idea is that you extend the size of the convolutions, which you can do without retraining anything: you use the same weights that you had for this network, you just make the input bigger and apply the convolutions, and the output gets bigger automatically, and every layer gets bigger accordingly. And now, about the fully connected layers that I told you last week we put on top of the convolutional net for the last few layers: I lied to you. They're not actually fully connected layers. They are actually convolutional layers as well, with a convolution kernel of size one by one. Okay, so I need to unpack this a little bit.
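To make that one-by-one convolution point concrete, here is a minimal PyTorch sketch of the trick: the "fully connected" head is written as convolutions, so the same trained weights can be slid over a wider input with no retraining. The layer sizes are illustrative, not the exact LeNet dimensions.

```python
import torch
import torch.nn as nn

# A toy 32x32 character recognizer: conv/pool trunk plus a "fully connected"
# head written as convolutions (the last one is 1x1), so the same trained
# weights can be applied to a wider input without any change.
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),   # 32 -> 28 -> 14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),  # 14 -> 10 -> 5
    nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # acts "fully connected" on a 32x32 input
    nn.Conv2d(120, 10, kernel_size=1),             # per-class scores: a 1x1 convolution
)

x32 = torch.randn(1, 1, 32, 32)   # one centered character
x64 = torch.randn(1, 1, 32, 64)   # same height, twice the width
print(net(x32).shape)   # torch.Size([1, 10, 1, 1]) -> one answer
print(net(x64).shape)   # torch.Size([1, 10, 1, 9]) -> 9 answers, one every 4 input pixels
```

The nine output columns are exactly what the 32 by 32 recognizer would compute on windows shifted by four pixels each, four being the product of the two pooling strides, which is the subsampling-ratio point explained just below.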
Let me walk through an explicit example, the little demo I showed you earlier. When you take a particular output, let's say the red output here, which detects either a five or a six or a three, and you back-project it onto the previous layer, you get a feature vector, which is basically a column of the feature maps here. I'm representing only one value here as a little red square; it's just a single value, but because there are multiple channels, you have 120 of those, because there are 120 channels. Those are connected to those red squares on the previous layer, one in every feature map, represented by this. The size of this, I believe, is five by five. And then you go back to the previous layer, which, I don't remember if this is before or after pooling; I think it's after pooling. So the input window you get here is something like 10 by 10, I believe, or something like that. And then there is a pooling, which is two by two, and then a convolution, so you get to the 32 by 32 input. You draw the diagram of this, and those sizes just come out of it. Now, when you make the input to the convolutions larger, the next output, because of the various subsampling and pooling layers, is going to be shifted by four pixels, and I'll make a drawing in a minute that explains why. So each of these guys, the green output here, looks at a window that is shifted by four pixels from the previous one. Why four? Because there are two layers of two-by-two subsampling, so the overall subsampling ratio is four. And what that means is that when you shift the output by one, you shift the input by four; or when you shift the input by four, you shift the output by one. But again, I'll draw this so you get a better idea. Okay, so now what we have is basically a collection of answers with various scores. What we have to do is what's called non-maximum suppression, NMS. This is used universally in computer vision. What that means is that you see at one point that there are three detectors, I mean three instances of the output, that detect the number five. For one of them, the five is more or less centered; for the other ones, it's kind of on the side, but they see it's a five, so they say it's a five. Now you have to have some post-processing that says: this is only one five. That's called non-maximum suppression. Basically you take the scores of all the categories within a small window and you ask: who is winning? Which category has the highest score within this window? You use that one as the winner and suppress everybody else. So you know there is only one five, even though there are three "five" detectors that turn on, because they're all seeing the same five, really. Same for the seven here: it appears twice, and you keep only one of them. So this is basically what happens between this list of detections here and the final answer, which only contains two characters: a kind of non-maximum suppression. Here it's done with what's called a weighted finite-state machine, but I'm not going to go into the details of that until we talk about structure prediction. There are many ways to do non-maximum suppression, and in fact it's going away a little bit, because now people are replacing this handcrafted process with a trainable neural net as well: trainable non-maximum suppression, if you want. Okay.
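To make that step concrete, here is a minimal sketch of one simple way to do greedy non-maximum suppression along one axis; the positions, scores, and window size are made up for illustration. For bounding boxes, torchvision ships an equivalent operation, torchvision.ops.nms.

```python
import torch

def nms_1d(positions, scores, window=32):
    """Greedy non-maximum suppression for detections along one axis.

    positions, scores: 1-D tensors of the same length.
    Detections closer than `window` pixels are assumed to be looking at the
    same character; only the highest-scoring one in each group is kept.
    """
    keep = []
    order = torch.argsort(scores, descending=True)
    suppressed = torch.zeros(len(scores), dtype=torch.bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i.item())
        # suppress every other detection whose window overlaps this one
        suppressed |= (positions - positions[i]).abs() < window
    return keep

# Three detectors fire on the same "5" (positions 12, 16, 20) and two on a "7".
pos   = torch.tensor([12., 16., 20., 52., 56.])
score = torch.tensor([0.7, 0.9, 0.6, 0.8, 0.5])
print(nms_1d(pos, score))   # -> [1, 3]: one winner per character
```

The same idea extends to two dimensions and to multiple scales; the weighted finite-state machine mentioned above is a more structured way of making the same kind of decision.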
So here's another example of an early object detection system with a convolutional net, probably the first one actually, from a paper that I co-authored with a few people in 1993, although the work was done in 1991. It's the same idea, but for face detection. The datasets at the time were really small; we had to collect our own, and it was only a couple thousand images or something. The idea for detection is that you take a collection of images, some of which have faces and some of which have no faces, the no-face ones coming from whatever photos you took that you're pretty sure have no people in them. Then you train a neural net, which in this case had an input of 20 by 20 pixels, to turn on the output whenever there's a face and turn off whenever there is no face, and you do this only for input windows of 20 by 20 pixels. Then, to do face detection, you slide this convolutional net over an image, and whenever there's a face that is roughly 20 by 20 pixels, it's going to turn on. Now the problem is that faces can appear at different sizes. So what you do is you take your image, you subsample it by some factor to another scale, you apply your network again at that new scale, and then you do it again for yet another scale. Eventually there's going to be one scale at which the faces appear to be roughly the right size and at which your detector is going to turn on. These are the detection maps that you see here. I think bright means a high score for detection and dark means no detection. Some spots appear bright but are not actually that good a detection; you need a blob of activity for the detection to be high-scoring. And you see them here at, say, scale eight, or scale seven, scale six: you have blobs of high-scoring activity that represent the presence of a face. So here you do non-maximum suppression again. You ask, for every location and scale, which blob has the highest overall score, you decide that's the winning candidate, and you suppress the other ones. That's the result you get in the end, on the right, with the four winners. Now let me tell you a bit about this replication over the field of view that I was just telling you about. Let's say you have an input image, and you're going to apply a first convolutional net. I'm going to draw it in red. It's a very simple convolutional net: it's got a kernel of size three, then pooling, and then another kernel. So this is convolution, pooling, convolution; the pooling is by two and the convolutions are by three. You get an output. Now, imagine you want to apply the same convolutional net shifted by, let's say, two pixels. I'm going to draw that convolutional net again: this is the very same convolutional net applied to a window shifted by two pixels. Now, notice that this part of the network is shared between those two instances of the convolutional net. I do not need to recompute those pink units here, because I already computed them. It would be stupid to recompute the entire network since I already computed part of it. And if I keep going, you can imagine, of course, that if I shift by two pixels again, I'm going to get a third network. Sorry, forgot one. Here we go.
And again, I don't need to recompute those pink guys; they are in common. So what do I have to do in the end? In the end, it turns out to be super simple: I just take my entire input and compute a convolution over the entire input. I just apply my convolution, and I get a layer here that is whatever size corresponds to the size of the input when I apply the convolution. So I don't need to do anything special, really. I just do that convolution and I get the result I want. Then I apply the next layer, which is a pooling of two by two, over the entire input, and then I apply the next convolution again. Now, let's say I also have a fully connected layer here; I'm going to add a fully connected layer to my network, what you would call an FC layer. But in fact, it's not a fully connected layer, because to be able to compute this extended convolutional net, what I need to do is use the same weights at every location here for this operation. And what it comes down to is a one-by-one convolution, essentially. So I still have shared weights, which is one of the properties of convolution; it's just that now my convolution kernel only looks at one value, as if I had a convolution kernel of size one by one. But of course, you have multiple channels here and multiple channels here, and there's a whole matrix in between; it's just that this matrix is shared across the different locations. Okay, now let's take those two guys and see, when we back-project, what they look at. This guy, the red guy here, looks at this input of size six. And the green guy looks at that input, also of size six, and the orange guy at... I hope I did it right. I think I screwed up. Okay, I think this guy actually is like that, and then this guy is here, and this would be ignored. Okay. So they're shifted by two pixels, because of this pooling layer that subsamples by a factor of two. If we didn't have any pooling layers but only had convolutions with a stride, we would have the same effect: the window would be shifted by the product of the strides of the operations we're doing at the various layers. So if we have a ConvNet with various dense convolutions, and then one layer of pooling and subsampling with a subsampling ratio of two, and then another layer of pooling and subsampling with a subsampling ratio of three, the overall subsampling ratio is six, which means every output will see a window that is shifted by six pixels compared to the previous one. Okay. Any questions at this point? Yeah, there are plenty of questions here. For the animation you showed before, with the three and the one that were moving, where you had the multiple digits, this one: we actually had three questions, which are basically the same. How do you deal with multiple characters? Someone is asking, do we need knowledge of how many characters are in an image for these methods, or is the number of characters also limited by the system? So there are several questions like that; there's some confusion about the multiple characters.
Okay, the short answer is that you don't need to know how many characters there are in the first place, but the proper way to do this requires something called structure prediction. I'm not going to talk about this today; we'll talk about it in a future lecture, and there might even be a homework about it. So you'll learn about this, I promise. But the short answer is you don't need to know the number of characters in advance if you design the system properly, or train it properly. So you can apply this to any kind of detection: face detection, pedestrian detection, whatever detection you want, detection of tumors in medical images, things like this. And the idea goes back a very long time, as I showed you. You can simultaneously do pose estimation as well. The example you see at the top left here is some work I did at the NEC Research Institute before I came to NYU in 2003, which was published a couple of years later. This does pose estimation at the same time: the output is not just whether there is a face or not, but also what the orientation of the face is. And this is slightly more recent work. All of those works preceded the new craze around deep learning of the last eight years or so; this is work from around 2010, 2012. Yeah, another example of this pose-estimation face-detection system. This is my grandparents' wedding, actually; those are my grandparents. Okay. So if you can detect objects, you can also segment images. What does it mean to segment an image? It means basically labeling every pixel in an image as to whether it belongs to an object or not. You may not care what the object is, but you care about separating an object from its background, or classifying each region in a different way. So this is, again, some relatively old work where we trained a convolutional net to indicate, on those microscope images, whether a particular region or a particular pixel belongs to a cell. This is an embryo of the worm C. elegans. You can see the nucleus of the cell and the cytoplasm and then the outside. And we trained this ConvNet to take a little region, very small, a few pixels by a few pixels, and classify the central pixel of that window as to whether it's cytoplasm, outside, cell membrane, nucleus, or the wall of the nucleus, if you want. The idea was that you could count how many cells there are (these are videos, actually) and see automatically, more or less, whether the embryo is developing normally. This is for developmental geneticists. And those are some of the raw results, where each color indicates a category. Those are very noisy images, so you don't get a perfect result, but then you can clean them up with something a little bit like non-maximum suppression, with a little bit of post-processing. And so you get a fairly reliable idea of how many cells are in the image, whether the nuclei have the right shape, and things like that. Here's another example, which is more impressive. This is also relatively old work, which I was not involved in, by Sebastian Seung, who at the time was at MIT and is at Princeton now. He's been interested for the last 15 years in what's called connectomics. This is basically taking a piece of brain tissue, slicing it into very, very thin slices, and then analyzing the images so that you can reconstruct the graph of connections between the neurons.
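Both segmentation setups just described, the embryo images and the connectomics volumes, boil down to the same recipe: look at a small window and classify its central pixel (or voxel). Here is a minimal 2D sketch; the class names, patch size, and architecture are made up for illustration and are not those of the original systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny classifier that looks at a small patch and labels its *central* pixel.
CLASSES = ["outside", "membrane", "cytoplasm", "nucleus"]
patch_net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.ReLU(), nn.MaxPool2d(2),   # 29 -> 25 -> 12
    nn.Conv2d(8, 16, 5), nn.ReLU(),                   # 12 -> 8
    nn.Flatten(), nn.Linear(16 * 8 * 8, len(CLASSES)),
)

def segment(image, patch=29):
    """Slide a (patch x patch) window over `image` and label each center pixel."""
    h = patch // 2
    padded = F.pad(image, (h, h, h, h), mode="reflect")
    labels = torch.zeros(image.shape[-2:], dtype=torch.long)
    with torch.no_grad():
        for i in range(image.shape[-2]):
            for j in range(image.shape[-1]):
                window = padded[..., i:i + patch, j:j + patch]
                labels[i, j] = patch_net(window).argmax()
    return labels

img = torch.randn(1, 1, 64, 64)   # a fake grayscale micrograph
print(segment(img).shape)         # torch.Size([64, 64]) -> one label per pixel
```

The double loop is only there to make the recipe explicit; as with the SDNN trick described earlier, in practice you would apply the network convolutionally over the whole image instead of looping over windows.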
So this is a video, an old video from the late 2000s, probably about 13 years old, that represents only a small proportion, a small percentage, of all the neurons that have been identified. Each neuron gets a different color: they run a segmentation algorithm which, after running the convolutional net to identify the boundaries between neurons and other neurons, or between neurons and whatever is in between, reconstructs the 3D volume of all the neurons and the dendrites and the axons; and if they can identify the contacts between neurons, they can also basically reconstruct the wiring diagram, if you want, of a piece of brain. The data was actually annotated using crowd-sourcing at this website, eyewire.org; you can play with that if you want. And there's a recent review paper that I'm a co-author on, called "The Mind of a Mouse", that basically advocates doing this for the entire brain of a mouse. It's been done for the entire brain of the fruit fly, which is very small, but right now we just don't have the technology to do it for a brain that's larger than that. So this is basically a call for a kind of worldwide or national project to figure out the entire wiring diagram of the brain of a mouse. The amount of data that would be required for this is absolutely staggering. A mouse has maybe 50 million neurons, so it's actually smaller than some of the neural nets that are being used in practice today, but it's absolutely gigantic in terms of connections and complexity. So that's a really interesting application of convolutional nets to science, and to neuroscience in particular, and it goes back 15 years or so. This is an old video; there are more recent ones that you can find by looking for Sebastian Seung's lab at Princeton. Here's another application of image segmentation. This is the same idea, where you also take a window over an image; here it was a three-dimensional window, in this example. You train a convolutional net looking at this little window to tell you whether the voxel you're looking at is the inside of a neuron, the outside of a neuron, or the membrane, if you want, so that afterwards you can post-process and do the segmentation. This is the same idea here, but in two dimensions; this was done at roughly the same time, maybe within a couple of years. And this is the idea of what in computer vision is called semantic segmentation. Here you run a convolutional net with a sliding window over an entire image. The input window that corresponds to a particular output of this convolutional net, in this case, is something like 40 pixels by 40 pixels. And you train the system to label the central pixel as to whether, in this case, it's something the robot can drive over, or whether it's an obstacle that the robot would bump into and cannot traverse, like tall grass and bushes and things like that. This was a project done in my lab at NYU between 2005 and 2008.
It involved Raia Hadsell and Pierre Sermanet, who were the two main contributors to that project in my lab, but there was a large cast of characters; it was in collaboration with a startup company in New Jersey called Net-Scale Technologies, which now actually belongs to NVIDIA, so they work on autonomous driving at NVIDIA. Raia Hadsell is now head of robotics research at DeepMind, and Pierre Sermanet is also working in robotics, but at Google Brain. So anyway, here the data was actually collected automatically: we just ran the robot in nature, and we used a stereo vision system to figure out whether a particular pixel is on the ground or above the ground, and from that we can derive a label. This is the kind of labeling, here at the bottom in the center, that you get from the stereo vision system. The problem with stereo vision is that it only works at a limited range: it works up to about 10 meters, and beyond 10 meters you can't really tell from triangulation with two cameras whether a particular pixel you're looking at is on the ground or above the ground. So it has limited range, and you can't really drive a robot in nature if your vision stops at 10 meters. You can avoid obstacles, but you can't plan a long-range path. So what we do is use those labels to train a convolutional net to label the pixels, but the convolutional net only needs a single camera image, so it doesn't rely on stereo and it's not limited in range. And when we apply the convolutional net to the entire image, it tells us, well, the path continues. In fact, this convolutional net is not just trained in the lab on collected data; it also kind of trains itself on the fly as it drives. The last layer is actually trained online, but I'm going to spare you the details. This is the architecture of the convolutional net. It looks at bands of the image, if you want, more or less centered on the horizon, because it doesn't need to look at the sky, and it processes them at multiple scales. So it uses this multi-scale idea that I told you about earlier. And this is a video that shows how the system works, more or less. You get labels from stereo vision, you also get labels from the neural net, and you can reconstruct where every pixel is in the world, because you know roughly the distance, so you can put this in a map. This map is represented here at the top. On the map, green indicates that the robot can probably drive there; purple means it's probably an obstacle; red is also an obstacle; and blue just means "I haven't seen this, the view is obstructed by an obstacle, so I can't tell, and I'm going to avoid going there." And then you can run a shortest-path planning algorithm to get the robot to go somewhere. So this is the robot being annoyed by graduate students: Raia Hadsell on the right and Pierre Sermanet on the left, and they are making the life of this robot pretty impossible. But they're entitled to do this because they actually wrote the code for it; they didn't build the robot, which was given to us by DARPA. And they were pretty confident the system was working well, because if the robot didn't stop before reaching their legs, it could break their legs. Okay.
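To make the multi-scale idea concrete: the face detector from earlier and this scene-labeling system both apply the same network, which is fully convolutional and therefore accepts any input size, to subsampled copies of the input. Here is a minimal sketch with a stand-in, untrained scorer network; the sizes and scales are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A stand-in fully convolutional scorer: because it contains only convolutions
# and pooling, it accepts inputs of any size and returns a map of scores.
scorer = nn.Sequential(
    nn.Conv2d(3, 16, 7, padding=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 1, 1),   # one detection score per output location
)

def pyramid_scores(image, scales=(1.0, 0.5, 0.25, 0.125)):
    """Run the same network on successively subsampled copies of the image.

    Objects show up as blobs of high score at the scale where they roughly
    match the size of the network's training window.
    """
    maps = {}
    with torch.no_grad():
        for s in scales:
            scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                                   align_corners=False)
            maps[s] = scorer(scaled)
    return maps

img = torch.randn(1, 3, 256, 256)
for s, m in pyramid_scores(img).items():
    print(s, tuple(m.shape))   # the score maps shrink along with the scale
```

Non-maximum suppression across locations and scales, as described above, then turns these score maps into a final list of detections.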
A couple of years later, a few datasets appeared that had a small number of images, something like 2,000 images or so, where people had labeled every pixel with a category: not just whether it's traversable or not, but is it a window, a door, a sidewalk, a road, a car, a person, a tree, et cetera. That's the problem of what's called category-level semantic segmentation. We built one of the first systems, really the first system, using convolutional nets for this, around 2010 or so, and published it in 2012. The computer vision community at the time, in 2011, was extremely skeptical of convolutional nets, and even though this paper beat the record on the dataset, it was actually rejected from CVPR; we published it in a machine learning conference, ICML, six months later. So there was a lot of skepticism about those methods back then. People just didn't believe that the results could be so good, because they had never heard of the method. So it's interesting to reflect on the history of this. A certain degree of skepticism is a good thing in science, because you have to be skeptical about surprising results and confirm them; and it didn't take that long for this to be confirmed. So that's the architecture of the network. It's again a convolutional net that is applied at multiple scales: the same image is subsampled. The input window of this neural net is 46 by 46, so every output here is influenced by a 46 by 46 window, and it's applied with a sliding window. Then you apply the same convolutional net at multiple scales, but the features from all of those scales are combined before going to the final classification. There are a lot of ideas along those lines, a lot of different ways of doing this that people use now; I'll show some examples later. But if you want more details about this, you should really take a computer vision class, Rob Fergus's computer vision class; it goes into the gory details of how you do this. That's an example of this system running. This could not run in real time on the CPUs of the time, and GPUs were not popular yet, so we implemented a special piece of hardware on what's called an FPGA, which is a sort of configurable hardware, so that we could run this. Alfredo is familiar with this because he worked with it as well. And these were the results. This is Washington Square Park, if you don't recognize it. The system is far from perfect: it recognizes those bright spots here as desert or sand, and this is the middle of Manhattan, there is no sand, and certainly no desert. But that was as good as it got at the time, and we could run it at about 20 frames per second on a special piece of hardware that was about this big. So this was the groundwork that convinced some people at the time that you could actually implement those things in hardware, and that you could pretty much have this in your car and it could help you drive. There's a long history of neural net hardware, which I'm not going to go into; I might go into it in a later lecture, but people started working on neural net hardware back in the 1980s. Okay. So what happened around 2012, 2013 is that our friends at the University of Toronto in Geoff Hinton's group, Ilya Sutskever
and Alex Krizhevsky, had a very good implementation of convolutional nets on GPU, and that allowed them to train a very large convolutional net, on a single GPU at the time (or two GPUs, actually), on the ImageNet dataset. The ImageNet dataset was the first dataset in computer vision that had a large number of training samples and a large number of categories: it had roughly 1.3 million training samples and 1,000 categories. And it turns out that's really what convolutional nets love: if you have many categories and many training samples, they really shine, and they basically beat everything else. What you see here is the error rate, what's called the top-five error rate, on this ImageNet dataset. In 2010 and 2011, people used more classical approaches where a lot of the things were handcrafted; the architecture looked a bit like a convolutional net, but it was basically handcrafted and not trained with backprop, and the best people could get was 25.8% error. AlexNet, in 2012, got this down to 16%, and that was a watershed moment: everybody in computer vision essentially stopped what they were doing and switched to using convolutional nets, and the people who didn't basically regretted it and did it the year after. Every alternative approach to object recognition was basically abandoned within two years, and the error rate kept going down. Now it's essentially below human error for top five, but people still use ImageNet as a benchmark. What happened during that period is that the number of layers of those networks grew dramatically, particularly with the invention of the ResNet architecture. This was a paper by Kaiming He and colleagues in 2015; the paper was actually published in 2016, but it was on arXiv in 2015. I'm going to tell you a little bit about this. The basic idea of ResNet is to make a group of two or three layers compute, by default, the identity function: it basically copies its input to its output, you don't change the size, you don't do anything. And then you have a few layers of a neural net that compute the deviation from the identity, the so-called residual. There are various ways of building those blocks, and you stack them up. This particular network is called ResNet-34, but I would say the workhorse of image recognition, the thing that everybody uses or compares themselves to, is ResNet-50. This particular architecture is an object, a class, in PyTorch; you can just build a ResNet-50 without thinking about it (there is a small sketch of the residual-block idea at the end of this transcript). There are a lot of variations of this, and you can go to this URL and check the current state of the art on ImageNet, on Papers with Code, a small company that was recently acquired by Facebook and that basically organizes all the results and papers that people have. This is a chart that Alfredo put together; it's about four and a half years old, so Alfredo should probably make a more recent version of it. It indicates the number of billions of operations on the x-axis, and on the y-axis the top-one accuracy on ImageNet, and there are various versions here of ResNet and Inception, which is a Google-proposed architecture. There are a lot more now, and people are up to 90% now, but they use something else on top, like self-supervised learning. Okay, I've got to drop out, so I'm going to stop here and let... Actually, before you disappear, I wanted to wish a happy birthday to our TAs.
They were born on the same day, so that's one fewer date for us to remember. Amazing. All right, so good luck with your presentation on the other side, and I'll keep the audience here. Enjoy it. Okay, great. Take care, everyone.
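To make the residual-block idea from the ResNet discussion concrete, here is a minimal sketch of a basic residual block (a generic version, not the exact bottleneck block used in ResNet-50), along with the off-the-shelf ResNet-50 class from torchvision.

```python
import torch
import torch.nn as nn
import torchvision

class ResidualBlock(nn.Module):
    """Two conv layers that learn a correction (the 'residual') to the identity."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The output is the input plus a learned correction, so the block can
        # trivially represent the identity and only has to learn the deviation.
        return self.relu(x + self.body(x))

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])

# The full architecture is a ready-made class; pretrained ImageNet weights
# can also be requested.
resnet50 = torchvision.models.resnet50()
print(sum(p.numel() for p in resnet50.parameters()))   # ~25.6 million parameters
```

Because each block reduces to the identity plus a small learned correction, stacking many of them stays easy to optimize, which is what allowed the number of layers to grow so dramatically.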