He's got a long list of awards that he has won. I'm not going to read them all, but he is the President's Gold Medalist from IIT Kanpur, and being from IIT, I know how difficult and how prestigious that is. He is a fellow of the ACM and the IEEE. He's the Arthur J. Chick Professor of computer science at Berkeley. It's probably fair to say he is the top researcher in computer vision today; certainly his group has been the most prolific in the field over the last 10 or 15 years. So we are very happy to have Professor Malik here. Okay, thank you, Venu. Thank you, it's a pleasure to be here, and thank you very much for inviting me. Okay, I'm trying to figure out the right place for me to stand. Is there a pointer of some sort, like one of these electronic pointers? Yeah, yeah, exactly, okay. Usually somebody in the audience is very well equipped, so you are the one. Okay, thank you. Great. So good afternoon. I'm here to tell you about computer vision, and this talk is, in a sense, a survey of the field: where does the field stand today, what are the problems we can solve, what are the problems we can't solve, and some sense of where we are going, or should be going. So let's start with the obvious propaganda for the importance of the field. The most popular buzzword these days in computer science, and in fact in the popular press as well, is big data. Well, big data is increasingly visual data, and here are some numbers on the kinds of images. There are talks I've seen with titles like "what do we do with a trillion photos?" So these numbers suggest that big data is really visual data, images and video, and we need to figure out how to make sense of it. The first stages of interest here are obviously to do with storage, communication, compression and so forth, and our colleagues in electrical engineering have made significant progress on that; we couldn't have done this without all these compression schemes, et cetera. But we have that now, and the point now is: how do we do interesting analyses of this data? How do we pull things out? It's not just about storage and communication, it's also about inference. I should mention that much of the data in science and engineering and medicine is in the form of images. It's often not just x, y, but x, y, z, or x, y, z and t, and many of the techniques developed in computer vision for natural images end up applying there. So what we do applies there as well. So I like to divide up the task into three sub-problems. There is the problem of recognition, and recognition is attaching semantic labels: I want to be able to say this is a dinosaur, this is a pear, this is a piece of crumpled paper. Then there's the task of reconstruction. Reconstruction typically means figuring out what gave rise to the image. There is a reality out there, an external world, and that external world projects to some image; we want to solve the inverse problem, and that's reconstruction. Classically it means recovering depth, the z value at every point, but actually it's more. It's not just the z values, which are indicated here in pseudo-color, but also reflectance properties (what's the color at every pixel?) as well as illumination (how was this object illuminated?). So reconstruction is really solving the inverse problem of the physical world that gave rise to the image.
Re-cognition means, literally, to know again, so it's really about contact with memory. And then there is what I call reorganization. Normally people call this segmentation or grouping, but I want to call it reorganization so that it begins with the letter R, and I can call the slide the three Rs. Traditionally this means that you divide up the image into sets of pixels: all these pixels belong together because they belong to one object, the dinosaur, these pixels belong together, and so forth. So we have to group pixels into entities. And I want to submit to you that this is the best way to think about the field of computer vision. This is not the way computer vision is traditionally taught. Classically, you start from the image, so you deal with image processing operations: you first teach students about convolutions and things like that, edge detection. Then you move on to some segmentation, mid-level vision, and then you teach them about classifiers and pattern recognition, or you might teach them about geometry. That's the usual structure of a course, and it follows what Marr once proposed, this separation into low-level, mid-level and high-level vision, operating in a feed-forward way. But we have plenty of evidence from biology that this is not all that happens. Top-down information can very significantly influence how we interpret the bottom-up signal. So really these things should be thought of as all interconnected, and I think this diagram captures where we need to be going. I'll also argue that in the future we should not be working on these problems in isolation; the coupling among these problems is what is interesting, and that's where future progress is going to occur. So that's this diagram. I've tried to draw six arcs here. Essentially I'm going to argue that there are examples where recognition has to help with reorganization. A good example is when you detect a face: then you automatically know these are the pixels corresponding to the face. That's recognition influencing reorganization. Reconstruction influencing recognition: if you know 3D information, such as with a Kinect sensor, that might make recognition easier. But you can also have cases where recognition enables you to do reconstruction. A classic example is deformable face models, where once you know that you're looking at a face, you are in a small restricted class, and what you do is map the image to the three-dimensional Z profile of the object. So all of these interactions, in fact, I can construct concrete examples of. Much of the research in computer vision has been in each of these as separate silos. And it was needed; if we didn't do that, we wouldn't be where we are now. But that has been accomplished, and I think that in the future we really need to focus on all the interactions. So for the benefit of people in the room who are not computer vision aficionados, or even people who are in computer vision but are somewhat younger than I am, it's good to review the history of the field, because there is this great saying: those who don't know the past are condemned to repeat it. I think our field has existed for 50 years. In fact, I want to say exactly 50 years, because there is this thesis from MIT by Larry Roberts in 1963 which is regarded as the first thesis in computer vision proper.
Of course there was related work which had happened before. There were three different fields, artificial intelligence, image processing and pattern recognition, which all had something to do with vision. But there were very few people in the field at that time, and computers were very weak; you couldn't store much, you couldn't do much. And yet people were trying, and in fact a lot of interesting theoretical ideas were developed in that era. A lot of foundational work on the mathematics and physics of image formation, by Horn, Koenderink and others, happened in the 70s, because to do math you don't need computers; vision is applied mathematics. I did my PhD in 1985, and this was, I think, the time when the S-curve of vision started to take off. A whole lot of different kinds of applied mathematics got introduced into computer vision, ranging from geometry, multi-scale analysis, control theory, optimization, probabilistic modeling and so on. And oftentimes these papers looked like: "I now know this little bit of applied math, let's find some vision application for it." So it was kind of an exploration of that space. The 90s I regard as a success in the sense that the geometry part of vision, at least its basic math and science, came to be understood. In the 1990s you had the first systems which could really solve the inverse problem of structure from motion. In the 80s people used to think it was a hopeless problem; in the 90s we discovered that we could actually do it, and today this is a major success, with lots of applications fielded on it. The 2000s I would regard as an era of significant progress in visual recognition. Somehow fields which had been a bit separated came together: object recognition, which in the AI tradition was being done in a geometric setting rather than with pattern recognition techniques, and pattern recognition itself. People realized they had to use those techniques, and the two were pulled together. So we are at an exciting time, and I'll try to take you through what we can do these days. And again, staying at the general level, I want to point out that computer vision is not the only field which studies vision. There are biologists and psychologists who also study vision. Perception is how psychologists study vision, and these are all equally useful ways of studying the field. They study what humans perceive when they see a particular image, and in computer vision we have gotten a lot of insight from that, because we are often able to figure out how much information is sufficient for a task. If you can recognize a face at 20 by 20 pixel resolution, that tells you there is significant information in the low-frequency components of the image, and many other things. If you can do recognition in 100 milliseconds or less, that means there must be a purely feed-forward pathway which works, at least when the signal is not degraded. Neuroscientists understand something about how things are computed in the brain, but not enough that we could just simulate it. And of course the laws of optics and the statistics of the world are equally applicable to computer vision and to human vision. So at the level of function we are playing exactly the same game; that part of human vision and computer vision is exactly the same.
Just as for flight, Bernoulli's principle applies equally to a bird and to an airplane, but we build airplanes without muscles and flapping wings and so forth; there is some essence in common, and that's Bernoulli's principle. In the case of vision, it is the laws of image formation and the statistics of the visual world which have to be the same. Okay, so now let's take these three Rs in turn, and I'm going to take you through a little bit of what has been achieved in each of these areas. Reconstruction: a major success story. This is really out in applications, and companies such as Google are using it. I would say this is a clear success, and at this point the remaining research is engineering; the scientific and mathematical problems have more or less been solved. In the case of recognition, I would distinguish between 2D and 3D. We've had much greater success on the 2D version of the problem than the 3D version. We don't want to say anything is completely solved, but problems such as handwriting recognition, with prominent work from Buffalo, and problems such as face detection (a measure of that is that these cameras have face detectors built into them, right?) have been quite successfully fielded in applications, and we have had partial progress in 3D object category recognition. I'll try to show you some slides. I regard this as a glass which is either half empty or half full, depending on whether you are a pessimist or an optimist. Reorganization: we have had a lot of success with bottom-up segmentation, but there's only so far you can go with bottom-up segmentation. So I believe that the key problem there is semantic segmentation, and I'll come to that in a second. Okay, so back to the three Rs. Let's focus on reconstruction. I want to highlight some of the work here; as I said, a lot had been achieved by the 1990s, and I'm going to show you just some slides from projects in my group from about 15 years ago. There are a number of cues for recovering the three-dimensional structure of the world. Having multiple images is one of them; knowing something about the model class is another. So we took these pictures, which are from the Berkeley campus, the Campanile, and these pictures are actually from a camera mounted on a kite, and from this one could do a reconstruction of the entire central Berkeley campus. Of course you also had texture maps which you could map onto it, and what was impressive about this demo from 15 years ago was that the images looked very photorealistic. You could look at them and not be able to tell whether this was a synthetic image or a real photograph, and this technology immediately had a big impact in graphics; the graphics people really cared about this. This is a reconstruction, again from one picture, of the Arc de Triomphe; essentially you're exploiting the symmetry of the building, which gives you multiple views in one view. And here's a 3D reconstruction of the Taj Mahal, done by George Borshukov. In fact, both of the students who worked on this, Borshukov and Paul Debevec whom I mentioned earlier, went on to win Oscars, because this work on image-based modeling and rendering immediately got applied.
It became part of graphics, and Hollywood found it very useful for visual effects; this was in the mid-to-late 90s, and then it really took off. George Borshukov did at least some of the special effects for The Matrix, and he got an Oscar for that. Paul Debevec got an Oscar later, for other work as well as this. So now photorealism in movies is something we just take as a given, but it arose from work which came originally out of computer vision some 15 years ago. Now, in the work that we showed, there was a part done by the computer and there was some role for human operators; humans solved certain correspondence problems and so on. The techniques that people have developed in the last decade have enabled much further progress, or much greater automation is the way I would put it. If previously you had a system where 90% was done by the computer and 10% by a human, now increasingly we have systems where 99.99% is done by the computer. So you have reconstruction: this is work from Microsoft and UW, Steve Seitz's group and Rick Szeliski, Agarwal et al., where they could do reconstructions of the Colosseum in Rome just from a collection of Flickr photographs. And there's more than that: there is work from Marc Pollefeys's group where they've done reconstructions of entire cities, just from collections of photographs, into point clouds. That's if you just start with images; you can also make use of sensors, such as the Kinect. This essentially has two cameras, and then you can project a pattern, which makes the correspondence problem easier, and this is available now; you can buy it effectively for $10 or so, as part of the Xbox game system. I just want to get a sense: how many people have seen the Xbox system with the Kinect? This is a device which sold millions of copies within a few weeks. Once you collect the data this way, it gives you depth, and then after that you have to do computer vision to estimate the pose of the body. And it works; otherwise people would not pay dollars for this. In an outdoor setting the Kinect doesn't work well, because the projected pattern gets washed out by sunlight. In outdoor settings you have these more expensive devices, and these are behind the success of the Google cars and so on; the automated driving is really based on these Velodyne-type lidars. If the Kinect is in the $10 to $100 range, this is more in the $10,000 to $100,000 range. But these will get cheaper, and these range sensors will be widely available. If home robots happen, if personal robots happen, I think it will be largely because the price point of these has been driven down sufficiently low. So at this point we can produce point clouds, but we need to do semantic segmentation. This is where I think my three Rs story connects up. We need to do more than reconstruction; we really need to know, for these point clouds, that this is one object, this is a chair, this is a table, and things have to be linked together. Otherwise you just have points: instead of just having pixels, you have pixels with depth. But that's not enough, you want more.
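To make the "pixels with depth" idea concrete, here is a minimal sketch (not the actual Kinect pipeline) of how a depth image becomes a point cloud with crude surface normals. The pinhole intrinsics and the synthetic depth map below are made up for illustration; a real system would use the sensor's calibration and fit local planes for normals.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into a 3D point cloud,
    assuming a simple pinhole camera with intrinsics fx, fy, cx, cy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)   # shape (h, w, 3)

def estimate_normals(points):
    """Crude per-pixel normals from finite differences of the point cloud;
    real systems fit local planes to a neighborhood instead."""
    dx = np.gradient(points, axis=1)
    dy = np.gradient(points, axis=0)
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9)

# Toy usage with a synthetic slanted plane standing in for a depth image.
depth = 1.0 + np.linspace(0, 0.5, 480)[:, None] * np.ones((480, 640))
pts = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
normals = estimate_normals(pts)
print(pts.shape, normals.shape)
```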
So I want to point to a little bit of work that we have done in my group which tries to get at the following problem. When you have multiple views, we kind of know how to do this. But humans have this remarkable ability to solve the problem even with one image. If I just gave you this image (I guess I'm showing you the factorized image), from one image of the scene we can guess what the shape is. You sort of know which parts are nearer, what the surface normal is, what the lighting is, and that these black marks are places where the paint is darker than the paint elsewhere. How do we do this? This is of course a classic problem in computer vision, the shape-from-shading problem, which has been studied since the 1970s. And because it is so ill-posed (some combination of the albedo, the shape and the lighting gives rise to the image; the forward direction is the graphics problem, and the inverse problem is horribly ill-posed), what the field of physics-based vision did was to simplify the problem. But there is some work that a student of mine, Jon Barron, and I have done where we have shown that you can in fact recover all of these, shape, albedo and illumination, from a single image, and this is illustrated in the next slide. The basic secret of success here is: do not rely just on physics. By and large, reconstruction has been based on physics and geometry; no statistics comes into it, unlike recognition, where it's all about statistics and pattern recognition. But in a problem like this, which is so ill-posed, where some combination of shape, lighting and reflectance gives rise to the image, you need a probabilistic model. You need to ask: of all the possible combinations of worlds, objects, lighting and so forth that could give rise to the image, which is the most likely? So it becomes an optimization problem: you maximize the probability of a certain combination of albedo, shape and lighting, subject to the constraint that it must explain the image. And now you need to set up probability distributions for shape, probability distributions for lighting, probability distributions for reflectance, which sounds big and hairy, but we were somehow able to do that. I'll just show you some reconstructions. This is work of the last year or two, and its applications are down the road. But here you see this reconstruction, where this movie is essentially the optimization process proceeding. When it concludes, which it has here (this is a very sped-up movie), you've recovered the shape. This is the surface normals, this is what it thinks the reflectance changes are (so all the paint marks should be in this layer), and this is what it guesses as the illumination. All five of these are outputs of the system, and this is the only input. It is not going to be as metrically accurate as classical multi-view stereo, but the fact that you can do this at all is itself interesting. And the reason you can is that the visual world is not crazy: there are probabilistic distributions you can put on the world. It consists of objects; surfaces are more or less smooth; reflectance changes happen, but rarely; lighting is smooth; things like that. If you can capture that mathematically, then you can go in and solve it. Here's another demo, but I think I should skip it in the interest of time; again it's the optimization leading to the most probable solution.
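The actual Barron and Malik model is much richer, but a toy one-dimensional version conveys the idea: explain the observed log-image as reflectance plus shading, and let priors (smooth shading, rare reflectance changes) pick the most probable decomposition. Everything below, the synthetic signal, the priors, and the weights, is invented purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D intrinsic-image decomposition: observed log-intensity is modeled
# as log-reflectance + shading.  Priors: shading varies smoothly,
# reflectance changes rarely (sparse gradients).
n = 100
true_shading = np.sin(np.linspace(0, np.pi, n))          # smooth
true_reflect = np.where(np.arange(n) < 60, 0.0, -0.7)    # one "paint" edge
image = true_shading + true_reflect                       # the only input

lam_smooth, lam_sparse, eps = 10.0, 1.0, 1e-4

def cost(x):
    r, s = x[:n], x[n:]
    data = np.sum((r + s - image) ** 2)                          # explain the image
    smooth = lam_smooth * np.sum(np.diff(s) ** 2)                # shading prior
    sparse = lam_sparse * np.sum(np.sqrt(np.diff(r) ** 2 + eps)) # reflectance prior
    return data + smooth + sparse

res = minimize(cost, np.zeros(2 * n), method="L-BFGS-B")
r_hat, s_hat = res.x[:n], res.x[n:]
print("reflectance edge recovered near index",
      int(np.argmax(np.abs(np.diff(r_hat)))))   # should be near 60
```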
And then here are some results on real images. So these are images, and this is the output: just a single image of an object. There is clearly work to be done. This is my face, and it's been messed up because this program knows nothing about specularities. When there is a bright highlight on my bald head, what the program does is try to explain it by making a patch of the surface point towards the light source, because that way you can explain the brightness away. But these things will evolve, they will get better. So, interestingly, even for reconstruction from a single image there is still work to be done. Okay, so back to the big picture. Now let me turn to the problem of recognition. I want to start with this slide, from Fei-Fei Li and Perona, on what humans can do. What they did was show images like this to people for short periods of time, either 40 milliseconds, 80 milliseconds, 107 milliseconds and so on. They flashed the image, then turned it off, and then asked people to write down what they saw. At short exposures people can't say much: maybe "an outdoor scene", something like that. If you give them enough time, they can generate essentially sentences, paragraphs, maybe even novels: they seem to be playing some kind of game or fighting, two groups of men, et cetera, et cetera. That is what computer vision should be able to achieve. This is what human vision can achieve, human vision and cognition together, so in a sense this is our goal. There is right now a lot of work on vision and language together (Jason here, his group is doing a lot of interesting work on this), and basically it's because someday we want a function which is image-to-text, or image-to-essay, or video-to-essay. That's the function we want to create. Okay, so there are many different aspects of recognition. I want to start with some measure of how much progress has occurred in this field. This is a data set from about 2004 from Fei-Fei and Perona. If you look at papers from the 90s in computer vision, there would be very simple data sets; I'm talking largely about the 3D problem here. There would be, say, two objects that you have to distinguish among, or five objects, something like that. (Did I say 2D? I meant the 3D aspect of the problem; the 2D work on handwriting and character recognition was much more sophisticated.) So what they did was collect images from the web: there is a category of faces, so they have lots of pictures of faces, or pianos, and they have lots of pictures of pianos. And the goal was, given an unknown image, to classify it into one of these 101 categories. When the first program for this was written, the performance was just 15% correct (15, meaning 85% error), or I think it was 16% correct. And that's terrible, right, compared to the errors you have in digit classification. Their justification was: hey, random chance would be 1%, because you have 101 categories, so I'm doing 16 times better than chance. But here is the thing. These are curves where on the x-axis you have the amount of training data and on the y-axis the percentage correct.
And this red point is the original performance of the first system, 16%. But within a year or two or three, there were people producing curves like this. Just focus on one point here: for 15 training examples, the performance had jumped up to about 65% correct. (My phone is ringing, okay, I'll remove my phone. Is that it? Oh, okay. I'd better check that, yeah.) So within a few years, between 2004 and 2007, the performance had gone up to 65% correct. That's a factor of four or five improvement, say from 15 to 65. Some of these curves are from work in my group, some from other groups. But that's incredible, right? This is like what happened in speech in, say, the 90s, where suddenly people became able to do a task which before was simply not doable, and techniques were developed which have stood the test of time. So this is work from Lazebnik, Schmid and Ponce, the so-called spatial pyramid matching technique. The details don't matter, but I want to mention one idea which was used here, which was to convert this into a problem where we could use techniques that had been developed in document analysis. Documents have words, and for retrieval, oftentimes a very important descriptor is the so-called bag of words: you have a count, for a document, of how many times the word "cancer" occurs, how many times the word "lawsuit" occurs, and so on. From that vector of word counts you have a pretty good sense of whether it's a legal document or a medical document or an engineering document or whatever. In images we don't have words, but we can create words. Actually, I think one of the earliest examples of this in recognition that I know of was work from our group, Leung and Malik 99, which we used for distinguishing among different kinds of textures. We took little patches of the image, we had various filter outputs, and then we vector-quantized them by K-means and so on, and we showed that the histogram of those was a way to recognize various kinds of materials. So this bag-of-words idea is essentially what is behind the success of many of the current systems in computer vision, particularly for classification of scenes.
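To make the bag-of-visual-words idea concrete, here is a minimal sketch in the spirit of that texton work: quantize local descriptors with k-means to form a vocabulary, then represent an image by its histogram of visual-word counts. The descriptors here are random stand-ins; a real system would compute filter-bank or SIFT-like descriptors from densely sampled image patches.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in local descriptors (e.g. filter-bank responses) from training images.
train_descriptors = rng.normal(size=(5000, 64))

# 1. Build the visual vocabulary ("textons" / visual words) by k-means.
k = 50
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)

def bag_of_words(descriptors, vocab):
    """L1-normalized histogram of visual-word counts for one image."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

# 2. Represent one "image" (its set of descriptors) as a word histogram;
#    these histograms then feed an ordinary classifier.
image_descriptors = rng.normal(size=(300, 64))
print(bag_of_words(image_descriptors, vocab)[:10])
```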
In the work I talked about earlier, Caltech 101, you were trying to classify an image with just one object in the middle. But in general you're going to have many objects in the scene, and your goal is to say not just that this image contains a horse, but where the horse is; you need to locate the horse, and here that's shown by drawing a bounding box. So how do we do that? That's a harder task. The PASCAL challenge has been run for the last five years, with people trying to improve their performance on this task, and again it has become reasonably good. These are average precision numbers; if you don't know what that means, it's fine, it's just a measure of how well your system does. You compute a precision-recall curve and take the area under the curve, so 1, or 100%, is the best. For different categories, motorbike, bicycle, bus, aeroplane, et cetera, the performance is better than 50% for some of them. Some are hard, like bird, chair, boat, potted plant, where the performance is more like 20%. But the data set is truly collected in the wild; it's not cheating. It's just general images of buses or bicycles and so on. So we have made a fair amount of progress. That is how these precision-recall curves work, and I think I'll skip talking more about this. Again, I want to say what the big ideas behind the success here are. It's a lot of machine learning, obviously, and we now have large data collections which enable that, and the computing power to try these things out. But it's also good features; that's where a lot of the art is. And good features in this setting are some kind of orientation histogram: you try to find, in each block, how many times you have vertical edges, how many times you have 45-degree edges, how many times you have horizontal edges, and you collect these statistics within a block. Then the whole object is divided into a number of blocks, and for each of these you have such a histogram. This nowadays plays the role that local Fourier transforms played in speech recognition: there, you take 10-millisecond windows, do a short-term Fourier transform, apply some non-linearities as well, and get a bunch of coefficients, which then go into your learning machinery. So people arrived at some features, and these orientation histograms are the kinds of features which are very popular in computer vision now. In particular, I want to mention this work from Felzenszwalb et al., which is kind of the leading technique for object detection; it takes these kinds of templates, but on top of that it has a global template for an object and smaller templates for parts, and everything is learned.
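Here is a bare-bones sketch of such orientation-histogram features: per-cell histograms of gradient orientation weighted by gradient magnitude. Real HOG implementations add bin interpolation and block normalization, which are omitted here for brevity.

```python
import numpy as np

def orientation_histograms(img, cell=8, bins=9):
    """HOG-like features: for each cell, a histogram of unsigned gradient
    orientations weighted by gradient magnitude, concatenated into one
    feature vector for the whole detection window."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    feats = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            feats[i, j] = np.bincount(bin_idx[sl].ravel(),
                                      weights=mag[sl].ravel(),
                                      minlength=bins)
    return feats.ravel()

img = np.random.default_rng(1).random((64, 64))
print(orientation_histograms(img).shape)    # (8 * 8 * 9,) = (576,)
```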
So I want to turn a little bit to work on people, which is what my group has focused on, because of all the categories of objects, the most important is obviously people. We are people, so we care about people: we want to observe them, describe them, and so on. And people are hard to analyze, because look at this variation in appearance; the variation in appearance from one car to another is nothing compared to the variation among people. And we care about this. There are versions of problems related to people for which we have good solutions, such as face detection. This is a result from a program by Schneiderman and Kanade, and this started to succeed in the mid-to-late 90s: Viola-Jones, Schneiderman and Kanade. People were able to speed these up, so eventually, of course, these detectors are in all your cameras today. They work particularly well for frontal faces and some side views; if you really mess up the lighting and the pose a lot, these programs will fail, but they are quite useful, because normally, when you take a picture, the other person wants to be photographed in a decent pose. They're not intentionally trying to defeat the face detectors, so these things work. So face detection is a success story. But when you look at the whole body, there's an old way of thinking about the problem, and I'll come to the old way in a little bit. Let me first lay out what I want to be able to discover from an image. I want to be able to do object recognition (I'm trying to expand the meaning of the word recognition), so you want to say person, van, person, and so on. I want to do semantic segmentation: I want to mark the pixels corresponding to each of the objects. I want to estimate their pose: I want to say this person is facing away, this person is looking to the left, and so on. You want to characterize their actions: walking away, talking. You want to describe people; this is important. We don't want just nouns; we don't even want just nouns and verbs; we want nouns, verbs, and adjectives. So you want to be able to say "an elderly white man with a baseball hat." And finally, we want to produce descriptions like this. That's our goal, right? Now back to what people in computer vision had been doing. There is a long-standing idea, probably pushed earliest by my former advisor, Tom Binford, with generalized cylinders: you try to model objects as stick figures, and each of the parts, like the limbs and the torso, is modeled as a generalized cylinder. Binford's is really the original version, and there are later versions from other people. But actually, this is the wrong way to go about things, because when you try this approach in practice, each limb is really two parallel lines, right? So how are you going to implement it? You'll implement a detector which looks for parallel lines. Well, you implement this detector and you run it all over an image, and what happens? (I need to go back; it has developed a life of its own.) It finds parallel lines everywhere, but these are not the right parallel lines. What has happened is that we have confused the ontology with what can be extracted: yes, from the ontology point of view the body does contain limbs, but we need to match that with what you can actually pull out of the pixels. We must never get ahead of ourselves; we need to pull something out of the pixels, and therefore there has to be a match between what is statistically extractable from the pixels and what you want to get at. So I'll show you what we have done. We have developed this notion of a poselet; this is work by Lubomir Bourdev and me. In each of these rows, you have some part of the body, or actually it may be more than one part, and there is some characteristic visual pattern associated with it. Think of it this way: a face is a certain pattern of light and dark, and we can build detectors for that, and if you run that detector over an image, wherever it fires you have found a face. But as soon as you have found a face, you have found a person. And the face is not the only pattern which triggers the detection of a human. This pattern, which is somebody standing with their hand like this; or this pattern, which is just somebody's legs; or this pattern, which is the back view of my head: these are all characteristic patterns which, when you detect them, let you say, hey, that's a person. So what we need is to build zillions of these little detectors for different parts, and if any of them fires (roughly speaking; of course we'll try to be more sophisticated than that), we know we are looking at a person. Now the problem is: how do we construct them? We're going to use machine learning techniques to build a detector for each of these. So we build a detector for this one: these are all positive examples for the detector, the positive examples are back views of people's heads, and the negative examples are images which don't contain a head, images of grass, dogs, cats, whatever. But how do I get this training set? That's the question.
If I can get this training set, then I build a detector by the usual techniques: positive examples, negative examples, and we train. But how do I get this data set? Let me illustrate with this example. Here is an image, and I want to build a detector for this configuration. We use the term poselet, by the way, where "pose" refers to the pose of the human body and the suffix "-let" means something smaller: we're not looking for a universal detector for the whole pose but for a part of it, because the parts repeat and are therefore more meaningful. I want to build a detector for this configuration, the two arms, and I need to find other instances of it in the training data. This is all at training time. So let's say here's another image. In this image, can I find my target configuration? Is there a configuration like that in this image? Can you see it? Where? This guy, right? Now, that's you as humans. How can a computer program figure this out? Because, and this is before we have succeeded in building our detectors, a computer program needs to automatically find that these two are instances of the same pattern. And there's a chicken-and-egg problem: once my detector is trained, this is easy, but my detector has not been trained yet; I need to isolate this data set first. The secret is that we need to do a little more work. Obviously, to train person detectors we need an annotated data set which says these are images of people; to build a dog detector, you need images labeled as dog images; to build a detector for the digit three, you need images which you know contain the digit three. Here I need to have this patch. Now, how am I going to annotate this? Here is our insight: just mark the key points of the object. The key points are shoulders, elbows, wrists, hips, things like that. In the case of a human it might be some 10 or 20 such key points, which are well defined, and we can get this done cheaply: we can put it out on Amazon Mechanical Turk, give people 5 cents each, and ask them to mark these images. Now, in the space of key points (this has been done on the training set), once you look at these key points and their geometric configuration, that configuration is similar to the other one, because the points correspond. Therefore I can just pull it out. So actually, when I want to find corresponding patches, I don't need to look at the pixels at all; I just look at the geometric locations of these key points and of those key points, and they are similar, corresponding. That's basically how we can find corresponding patches. So here's the setup: if I want a poselet classifier for this configuration, for every other image of a person I'll try to find the corresponding patch. This one is a good corresponding patch; another good one; another, another, another. And sometimes you will get ones which are not so good; this one is not so good, and we can score, from the geometric configuration, that it is not so good. So what we do is sort them, and then we decide that beyond a certain threshold we don't care, so we throw those away.
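A minimal sketch of the patch-harvesting step just described: compare keypoint configurations after normalizing away position and scale, rank candidate patches by that distance, and keep the close ones as positive examples. The keypoints and distances below are toy values, and the normalization is simpler than what the real poselet pipeline would use.

```python
import numpy as np

def normalize_config(pts):
    """Center and scale a set of 2-D keypoints so configurations can be
    compared independently of where, and how big, the person is."""
    pts = pts - pts.mean(axis=0)
    scale = np.sqrt((pts ** 2).sum(axis=1)).mean() + 1e-9
    return pts / scale

def config_distance(seed_pts, other_pts):
    """Mean squared distance between corresponding, normalized keypoints."""
    return np.mean(np.sum((normalize_config(seed_pts) -
                           normalize_config(other_pts)) ** 2, axis=1))

# Toy data: (x, y) for [L shoulder, L elbow, L wrist, R shoulder, R elbow,
# R wrist] in a seed image and two other annotated training images.
rng = np.random.default_rng(2)
seed = rng.random((6, 2))
candidates = {
    "image_A": seed * 2.0 + 5.0 + rng.normal(0, 0.02, (6, 2)),  # same pose, bigger
    "image_B": rng.random((6, 2)),                               # unrelated pose
}
for name in sorted(candidates, key=lambda k: config_distance(seed, candidates[k])):
    print(name, round(config_distance(seed, candidates[name]), 3))
# Patches whose distance falls below a threshold become positive training
# examples for this poselet; the rest are discarded.
```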
So as a result, by a computer program, automatically, we can find instances of each poselet, and then we can train a detector for them. So we didn't have to do work specific to each poselet; generically, you just label all the key points once. And we could do this easily: we have something like 10,000 images which have been labeled like this, and it doesn't cost much, a few cents per image. Once you have these detectors, you can use them to do some interesting things. Here's a task, the task of attribute classification: I want to label male or female. There's been a lot of work on classifying male versus female from the face, and there are subtle differences. But actually humans can do this (the top row here are women and the bottom row are men) without necessarily seeing faces. Something about the clothing, the dress, the way people stand, all of that reveals information. So how do we do this? The traditional machine learning approach would be: take the whole image, turn it into a feature vector, take all these examples, positive and negative, and feed them to your classifier. And basically it will not learn anything sensible, because there is much too much variation within the category. For machine learning techniques to work, the category has to be somewhat coherent. There is so much variation here that two examples within the class may look less alike than examples across classes; it's not going to work. However, at the level of a poselet, life is easy. These are the face poselets: male versions, female versions. Male versions of the back poselet, female versions. Male versions of the legs poselet (this will capture, generally, the difference in appearance of shoes and skirts and so forth), female versions. At this level, where these are, say, the male examples and these are the female examples, you can hope to have your classifier do the job. For the whole person, it wouldn't work. So for every poselet we can train a male-versus-female classifier, and I'll just show you some results. The top row is what it thinks are the most male, and the bottom row the most female; there are some errors here, these are errors. Top row is what it thinks are people with long hair, bottom row people without long hair. Can you guess what's going on under the hood? Basically, it's glomming onto the face and head area; once it has located that, then among different examples you'll see a difference, and basically that's how it's working. Wears a hat, doesn't wear a hat. Wears glasses, doesn't wear glasses. Wears long pants, doesn't wear long pants. Long sleeves. So these are quite subtle things which we can capture; if you were working with the whole body, without any semantics, you couldn't do it.
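A toy sketch of this idea: one linear classifier per poselet, trained on that poselet's aligned patches, with the per-poselet scores averaged into a person-level decision. The features, data, and the way scores are combined are all invented stand-ins for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)

# Stand-in features: HOG-like vectors for patches found by two poselets
# ("face", "legs"); labels are 1 = male, 0 = female, with a weak synthetic
# class signal so the example trains.
def fake_patches(n, dim, offset):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, dim)) + offset * y[:, None]
    return X, y

poselet_data = {"face": fake_patches(200, 32, 0.5),
                "legs": fake_patches(200, 32, 0.3)}

# One classifier per poselet: within a poselet the patches are aligned,
# so the category is coherent enough for a linear model.
classifiers = {name: LinearSVC(C=1.0).fit(X, y)
               for name, (X, y) in poselet_data.items()}

def person_score(detections):
    """Average per-poselet scores over whichever poselets fired on this
    person (a real system would weight by detection confidence)."""
    scores = [classifiers[name].decision_function(feat[None])[0]
              for name, feat in detections]
    return float(np.mean(scores))

test = [("face", rng.normal(size=32) + 0.5), ("legs", rng.normal(size=32))]
print("male" if person_score(test) > 0 else "female")
```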
We can even use these for action recognition; it requires the images to be big enough for this to work, but these are characteristic poselets for phoning, running, walking, and so forth. And from this we can actually generate descriptions of scenes, which is kind of cool. I want to now turn to what we can't do; I just have a few slides on that. This is the category of boat detection. Current technology for this gets an average precision of 0.23, which is pretty bad (100% is the best, zero is the worst), because just think of the variety here. This variety is simply not captured by our templates, so there are clearly open problems here. Here's another example. Here I haven't told you what these are; these are all positive examples for something, and it's not "person." What do you think it is? These are all positive examples of chairs. So my training set for chairs has all of these as positive examples, and now you try to train a detector for chairs from this. That detector, based on the HOG template, has to somehow glom onto this edge and this edge and figure out that that's the signal it has, and the rest is all noise. That isn't going to work, right? And it doesn't work. Clearly what is needed here is a lot more use of context: once you know that it's a person, and the person is sitting, there's all that inference which is needed. People are working on this, but it is by no means a solved problem. This is why I think that in recognition, even if we are at 50% for some common categories, every stage, getting to 70, 80, then to 90, then 95, we'll have to sweat for each of those steps, and our models will have to get more and more sophisticated; they have to capture more and more knowledge. In some sense, about 30% of the brain is devoted to vision, so we will need a lot of sophistication in our models to handle all this. So here's a summary. Performance on 3D recognition is quite poor compared to 2D recognition. I work on this 3D recognition problem, and I want both points to go through: we have had significant progress, and yet there is a long way to go; both are true. We can't do pose estimation too well. Progress seems to have slowed down: the ideas of HOG and DPM have now been around for four or five years, and that's kind of the basic machinery that everybody is using for object detection. Clearly we are ready for another change, and I hope that somebody, perhaps in this room, will come up with it. Next steps in recognition; these are just my speculations on how we'll do better. I think too much of recognition right now is based on appearance, color and texture, and not enough on shape; we know from studies of human vision that shape is important. We know that recognition, segmentation, pose estimation and so forth are all coupled, and as we pull them together, we'll do better. There is probabilistic machinery, graphical models, structured learning and so on, which will be needed. Recognition and figure-ground inference need to co-evolve. Occlusion is signal, not noise: currently, for a template, any part which is occluded just shows up as missing features, and you regard that as noise, but actually you have two objects in a lawful interaction, and we should be capturing that. Okay, so I'm getting ready to conclude, and I want to give you one example. I showed you this diagram where there were interactions among all of these, and yet mostly I talked about each of them in isolation. So I want to give you at least one baby example where we use some of these pathways. This is one where we are trying to use segmentation for recognition, and this is (it doesn't matter, it says University of California), okay, this is using RGB-D images to semantically parse scenes.
And work like this, I think, is going to be the basis of your robot in the home. There are now robots that people are producing for something like $25,000, and at some point these will become like $5,000, and at that point, if it does something useful, people will buy it; people will never buy a $250,000 robot for the home. So here, what we do is assume we have a Kinect sensor, so you directly know depth. There's an image of an indoor scene, and this is the depth; blue means closer, yellow means further away. And in this image what has been coded in pseudo-color are the surface normals: this wall has its surface normal facing one way, blue means the surface normal is pointing up. This you get from the sensor: the sensor gives you the depth, and then you can compute the normals by fitting a local patch and computing the surface normal. What we wanted to do was to use this for semantic segmentation. What do I mean by that? The input is this image, plus the depth image and the normal image. We constructed a pipeline; this is a paper which will come out at CVPR 2013. We start out by segmenting the image into superpixels. This is a notion we have been pushing for some time: the idea is that the right granularity is not image pixels, of which there might be millions, and which depend on the resolution. A superpixel is a unit where you try to get as big a group as possible without screwing up; it's an over-segmentation, where you try to create little groups in which all the pixels belong to only one object, so you are conservative in grouping. That's the idea of superpixels. So you do a bottom-up segmentation, and then we can do long-range linking: for example, we might link together the different parts of the table which get interrupted by objects resting on it, because you want to say this part of the table and that part of the table are still the same object. And after that, we want to do semantic segmentation, which means classifying every pixel as belonging to some category: once we have the superpixels, we try to label them, this superpixel belongs to a chair, this one to a table, this one to the floor, this one to the wall, and so on. So that's the pipeline, and I'll just take you through it. The bottom-up segmentation is work we have done before, where we were using brightness, color and texture; these are the basic signals for segmenting out regions in ordinary images. This is essentially a fancy way of doing edge detection, but the edges are based on multiple cues, and then we can construct regions, these superpixels, out of them. Now, when you have depth data, depth data makes life much easier. When you're trying to segment an object, there can be regions of different color which belong to different objects, and regions of different color on the same object: on me, my jacket and my sweater are different colors. So when you're relying on just color, I will make a boundary there, but I may not make a boundary between me and the background if we are the same color. Now imagine that you have depth. The Z values for me are five meters and the Z values for the whiteboard are seven meters: a very strong signal for segmenting one object from the background. In fact, if you also have surface normals, that's another signal: you can find convex edges and concave edges. So if you incorporate this information, the segmentation machinery does much better and you can do a much better job of finding superpixels; we show that in various quantitative ways. This row shows you the segmentation if you just had brightness data, and this row shows what you get when you do have depth data. Essentially, the availability of depth data makes the segmentation much better.
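A small sketch of that intuition: combine brightness, depth, and surface-normal discontinuities into a single boundary-strength map, so that a depth edge can separate objects even when their colors match. The weights and the toy scene are made up; the actual pipeline learns how to combine many such cues before building superpixels.

```python
import numpy as np

def boundary_strength(rgb, depth, normals, w_color=1.0, w_depth=2.0, w_normal=1.0):
    """Combine color, depth and surface-normal discontinuities into one
    boundary map (illustrative weights, not the learned combination)."""
    gray = rgb.mean(axis=-1)
    gc = np.hypot(*np.gradient(gray))                  # brightness edges
    gd = np.hypot(*np.gradient(depth))                 # depth discontinuities
    # Normal discontinuity: angle change between neighboring normals.
    dn_y = 1.0 - np.abs((normals * np.roll(normals, 1, axis=0)).sum(-1))
    dn_x = 1.0 - np.abs((normals * np.roll(normals, 1, axis=1)).sum(-1))
    return w_color * gc + w_depth * gd + w_normal * (dn_x + dn_y)

# Toy scene: a box one meter in front of a wall; the depth edge dominates
# even though the box and the wall have identical color.
h, w = 120, 160
rgb = np.full((h, w, 3), 0.5)
depth = np.full((h, w), 3.0)
depth[40:80, 50:110] = 2.0
normals = np.zeros((h, w, 3)); normals[..., 2] = 1.0
print(boundary_strength(rgb, depth, normals).max())
```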
Then we do what's called amodal completion. For example, there's the wall surface here, and there's a little bit of wall over there. Now, this little bit of wall, if you try to recognize it in isolation, is going to be tough, because it could just as well be the surface of a desk of that color. But once you see that this part and this part are really a continuation, a geometric continuation of the same surface (for example, this piece and this piece are a geometric continuation of the same surface), that gives you a much better signal, because if that surface continues up to a height of 10 feet, it's likely to be a wall, whereas if it ends at 2.5 feet, it's not likely to be a wall. I'll skip the equations here. The hard work is in finding these superpixels accurately; once you have them, each one has some depth and surface normal information and you can do some clustering. And these are some results, outputs of the program: you see that the wall has been linked, and the different parts of the table have been linked. This is part of our understanding; we need to know that this is one table, even though there are different objects resting on it. You want to complete behind the occluder. Here are more results, which I'll skip. So then, recognition. Recognition is attaching semantics: we now need to say wall, floor, ceiling, table, chair, et cetera. You may not be able to read this, but this says picture, this says chair, this says blinds, like Venetian blinds, this says chair, this says table, and so on. This is an error, by the way: this little box of tissues is labeled as a sofa. Something we found very useful was to put everything in a geocentric frame. It turns out you can solve a nice problem to estimate that frame, which I will skip; it works out okay. Then you just train your classifier: you have some features from each of these parts, and you train through machine learning on them. These are the kinds of outputs: the classifier is trying to classify the superpixel, and it says the probability that this is wall is 0.9, cabinet 0.05, window, and so on, using features which capture orientation. I think it's very important to use a geocentric frame, because that's the absolute truth. If you know the lowest point, that's the ground, and that's likely to be floor. Gravity is very important: the most important surfaces are those where the surface normal is aligned with gravity, because those are the surfaces on which you can put objects; and then there are surfaces which are vertical, which are used to support the horizontal surfaces. Walls are vertical because they have to support horizontal surfaces; those are the ones where the surface normal is perpendicular to gravity. And in any scene you will find that most of the surface normals belong to one of these two categories. So once you capture that and put everything in this geocentric frame, the features are much more natural and better.
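A toy sketch of such geocentric features: for each superpixel, the height along the up direction and the angle between its mean normal and the vertical, followed by a trivial stand-in for the learned classifier. The gravity direction, thresholds, and data here are invented for illustration; the real system estimates gravity from the scene and learns the classifier from labeled data.

```python
import numpy as np

def geocentric_features(points, normals, gravity=np.array([0.0, -1.0, 0.0])):
    """Features for one superpixel in a gravity-aligned ("geocentric") frame:
    mean height along the up direction, and angle between the mean surface
    normal and the vertical.  Assumes the floor sits at height 0; a real
    system estimates the gravity direction and the floor height."""
    up = -gravity / np.linalg.norm(gravity)
    height = float((points @ up).mean())
    mean_n = normals.mean(axis=0)
    mean_n = mean_n / (np.linalg.norm(mean_n) + 1e-9)
    angle = float(np.degrees(np.arccos(np.clip(abs(mean_n @ up), 0.0, 1.0))))
    return height, angle

def label_superpixel(height, angle):
    """Toy stand-in for the learned classifier: horizontal surfaces near the
    ground are floor, horizontal surfaces higher up can support objects,
    vertical surfaces are wall-like."""
    if angle < 20:
        return "floor" if height < 0.1 else "horizontal support surface"
    if angle > 70:
        return "wall-like"
    return "other"

# Toy superpixel: a patch of points about 75 cm up with upward-facing normals.
rng = np.random.default_rng(4)
pts = rng.normal([0.0, 0.75, 2.0], 0.01, (200, 3))
nrm = np.tile([0.0, 1.0, 0.0], (200, 1))
print(label_superpixel(*geocentric_features(pts, nrm)))   # horizontal support surface
```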
So I'm going to skip more details here and just show you some results. It's important to use these kinds of features; in every domain, part of what you have to do is analyze the domain and figure out what features make sense. So here are some results. This is a dataset collected by the NYU people; they had some previous numbers on accuracy, and these are our numbers, which are significantly better on virtually every category. And it is a pretty good dataset, because they just went into arbitrary rooms, and they have 700 rooms or so. So it's a rich dataset, and somebody painstakingly went and labeled every region as sofa, ground, et cetera. This is what I mean by semantic segmentation, and these are outputs of the system: input, output. Where this is going is, I think, that future robotics is really perceptual robotics. There is robotics in the sense of outdoor scenes, driving around; well, 10 or 15 years ago that was regarded as totally crazy, but now there are these Google cars which have logged 200,000 miles. I think the next stage in robotics is really indoor robotics, and indoor robotics cannot be blind robotics; it has to be driven by perception. The robot needs to know what is the door, what is the door handle, what is the wall. Just recovering the geometry, this point cloud, is not going to be enough; you need this kind of semantic segmentation. And I think now we are getting the technology. These numbers are now around 60%, but they'll improve; they'll become 70, they'll become 80, we'll add our tricks. And actually you make more errors on small things far away; nearby things you mostly get right. Results like this will, I think, enable us to have robots which can operate in the home, to which you can say, hey, go to the refrigerator and bring me a can of Coke. If you want something like that, the robot has to know about refrigerators, cans, and so forth; we can't just treat it as a navigation problem. Thank you. Yes, yes. So the question is: humans can do many of these tasks zero-shot or one-shot, even young three-year-olds can do this, so is data-driven machine learning the right thing for the real cognitive vision problem, or is there some other way? No, I think both have a role, and I agree that our current learning techniques need way more data than they should. I think of it like the following. Again, if you look at the analogy of child development, children are born with some abilities. For example, segmentation based on motion they are just born with: when some object moves, they put all the pixels of that moving object together. With that, they are able to bootstrap other forms of segmentation, and then they have some notion of visual objects. On top of that there is what people in the psychology world call core knowledge: core knowledge that objects exist, that they are spatio-temporally coherent, and so on. Then after that there is understanding of occlusion and so on. So I think it has to be built up on that foundation; that's how humans do it.
What we do in computer vision, when we are trying to build a classifier for a dog, is to short-circuit all of that. You've got a little patch, and the computer is starting tabula rasa, with a blank slate, and all it has is this feature vector for the dog. It has no other visual knowledge; it does not know about the visual world. And naturally you then have to give it a lot more training data. That's fine, because right now we can solve problems that way, so it's a useful thing to do. But I think that in parallel we should be evolving approaches which are much more comprehensive in their understanding of the world. This is why I tell this reconstruction, recognition, reorganization story: because I think that when we have all these modules operating, learning a new category should take much less time. And I think we'll get there. I accept the criticism, but look, we've made some progress with this paradigm; it's just not enough. Yes. There's no reason why not; I should talk to Jason, because he works on superpixels and video, but we haven't worked on that. The idea totally applies there; in any of these settings these ideas apply. We just haven't worked on it. Yeah. So the point is, I want to talk about this as a semantic segmentation problem. If there are organs, if there are particular kinds of tissues, biologists can label those; there's a precise notion of when you are in the liver and outside the liver. The semantic segmentation task is exactly to take the voxels and label them: these are liver, these are not. And then of course you can continue recursively down to finer parts. Actually, I should mention that we're doing some work, not in a medical setting, but helping some biologists who are analyzing fly embryos, Drosophila embryos, and we're trying to do classification of various tissues there. These techniques apply there, and there you have 3D data and so on, because you have confocal microscopy and so forth. Have you seen it? Actually, I haven't, unfortunately. Yeah, the company that did all the visual effects. Yeah, that could be. Well, capitalism is always about creative destruction; it's not the first time this would have happened. In the old days in graphics, they would just sit down with a computer program and create these objects. If you look at the movies from the 70s or early 80s, there was a lot of work in building things, but they never looked like real things. The difference was that once computer vision got into the picture, you could capture reality in a computer model which you could then render with graphics, and this is now an amazing success. Now, I think the problem is that I actually find many of these movies not so interesting. That's not true of Life of Pi, which has a great story, but many of them have lots of great special effects and no good story. And I always think that instead of hiring these computer graphics PhDs, they may as well hire some English PhD, who is probably unemployed, who could write them a good story and would charge them a lot less. Thank you.