Is this on? Hello? Can you hear me in the back? Great. Can I get a pointer? Does somebody have a pointer? Yeah. So I'm in the Electrical Engineering and Computer Science department at Berkeley, but this year I'm spending at Google, hence my double affiliation.

OK. So let's start at a very big picture. Look at this scene. What do you see in it? You look at it with your eyes, and you make a lot of inferences. Here are examples of the kinds of inferences we can make. You can segment out people. You can recognize that they are people. You can say, for example, that that's a person walking away carrying three bags, or that this person is looking at this other person, and here are his arms and legs. And here is an object: this is a bench, and this bench is the same as this particular object for which we have a CAD model, and it's arranged in the scene in a certain pose. All of these are inferences we can make. In fact, we can go further. We can even speculate on answers to questions like this: we see that this is a person, we see that she's a musician, because underneath here there is a tip bag and she's got an accordion. And we can speculate about whether this other person is going to put some money there, or is he a very thrifty person who is going to just listen to the music and walk away? Computer vision should deliver on all of these. I have been working on computer vision for 30 years with the goal of delivering on all of these. And we can now do quite a bit of this.

To connect it up to the talk you heard previously from Dhani: image processing problems live at a whole range of scales, from nanometers to very large astronomical scales, and with very wide-ranging time constants. What I'm going to talk about is all in the range where time scales are seconds and spatial scales are meters. It's the everyday world, the human world that we live in. But the techniques are of a general nature, and they could be pushed to smaller spatial scales or larger spatial scales. I have myself worked quite a lot on biological images and microscopy images and so on. However, that's not what I will talk about today. You will get a flavor of the techniques and how they apply at human scale, so scales of meters and times on the order of seconds.

So this is our grand goal. And the amazing thing is that we can now do quite a bit along these lines. In fact, just to preview what's coming up, we can solve all of the inferences that are listed, with the exception of the one at the bottom. That one we can't do. But the others, such as detecting people, segmenting them out, determining their 3D pose, finding stick figures corresponding to the person, all of these are things that computer programs today can achieve. And as a starting point, that's amazing. I mean, 30 years ago when I started in this field, I did not know that we would ever be able to do this. In these 30 years, of course, computing power has gone up by enormous amounts. If you just think about Moore's law, a factor of 10 every five years or so, that means that over the course of my career, computing power has gone up by a factor of a million. We always talk about this operating in the forward direction: yes, five years from now, things will be faster. But to appreciate what it was like for those of us working on computer vision in the 80s, think about it backwards.
We were stupid or foolhardy enough to try to work on these problems with machines that were a million times slower, and storage which was correspondingly limited, et cetera, et cetera. That was the regime in which techniques were developed. We now have much greater computing power, much greater storage, and much more data available, which you can exploit for training. So we are going through a transition in the field which exploits this data richness and compute power that we have. For many of the techniques from the past, there might now be new alternatives coming up, but you can understand those older techniques well by considering them as attempts to deal with the fact that we only had so little computing and so little data. So let me take you through this journey.

What is visual understanding about? Again, we are talking about this kind of 3D world at human scale. Essentially, it's pretty straightforward. There are objects; objects can include people. There are actions, which are performed by people. So if you consider space-time, there are triplets which correspond to people, actions, and objects. I can cut a slice of bread with a knife: that corresponds to a person, an action, and an object. There are events, like this is an event: there's a seminar in a certain room, and so forth. Ultimately, that is our goal. It's built up from lower-level understanding, which is segmenting out an object, determining its 3D pose, but the highest level is this abstract understanding. Vision is our window into the world. There is an external world, and vision helps create an internal world inside your brain, in your mind, which is some version of that external world. And this enables you to act in the external world. It's a remarkable sense. It arose, say, 540 million years ago around the Cambrian explosion. And this is what enabled animals to get a leg up on plants: animals can move around, and they can perceive their environment, and that gave them the advantage of finding food in different places.

Now I'll do a little bit of history, because I can't get into too much of the technical stuff. But it's good, because you're all either already experts in image processing or becoming experts in image processing, so I want to historically relate the different ideas that have been around in the field. We often think of the field as starting in 1963, with a particular PhD thesis by someone called Larry Roberts, who went on to play a major role in the development of the internet. In that thesis, what he did was: there were images, I mean, 1963, rather feeble images, but in those images, you could recognize whether there was a cube or a pyramid. So it was a two-class problem from this very, very low-resolution image. And that is the beginning of computer vision. In that thesis, he has a picture to give you this idea, these two objects. That's what he was trying to distinguish between. Now if you think back to that time, remember computing power is a million times slower. So what's the first thing you're going to do? You can't deal with all these pixels. Today we talk about gigabytes and terabytes with no sweat; for these people, a kilobyte was a killer. So what are you going to do? You will immediately get rid of the pixels. You do it by detecting edges: you look for places where there are sharp changes in brightness. So now you've got a very sparse representation, and your data has been compressed.
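To make that recipe concrete, here is a minimal sketch of the generic idea, not of Roberts' actual system or of any particular detector from that era: convolve the image with derivative filters, take the gradient magnitude, and keep only the pixels where the brightness changes sharply. The filter choice and the threshold value are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def edge_map(image, threshold=0.2):
    """Crude edge detector: keep only pixels where brightness changes sharply.

    `image` is a 2D float array scaled to [0, 1]; `threshold` is an
    illustrative cutoff on gradient magnitude, not a value from the talk.
    """
    # Sobel-style derivative filters in the x and y directions.
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = convolve(image, kx)       # horizontal brightness changes
    gy = convolve(image, ky)       # vertical brightness changes
    magnitude = np.hypot(gx, gy)   # strength of the local change
    return magnitude > threshold   # sparse boolean map of edge pixels
```

A binary map like this occupies a tiny fraction of the original data, which is exactly why edges were so attractive when a kilobyte was expensive.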
Those edges capture significant aspects of the objects, because they are the geometric discontinuities, and from them you start to recognize. And we've gone through, so this is '63, then the 90s. These are some leading projects from that time trying to do object recognition: this is work out of MIT, this is work out of Stanford, this is from Oxford. Edges and corners: in all the early literature in computer vision, you'll always see such-and-such edge detector, the Canny edge detector. Why? Because we want to get rid of the pixels, because we can't afford to store them or compute on them. We kind of hope that we haven't thrown out the baby with the bathwater, that there will be enough information left in the edges, and we'll try to work with that. So this is what people could do for the first 20 years of computer vision. It was all about edges and corners.

By the time it came around to the 80s and 90s, computing power had gone up, so now you could hope to keep more information around. And what we had, and here I'm referring to work from our group at Berkeley, were linear filters, convolution with linear filters. Mostly these were inspired by what happens in the first stage of processing in the brain, various kinds of oriented edges, and you can model these with Gabor filters, Gaussian derivatives, and so on. So these are filters sensitive to vertical edges, horizontal edges, et cetera. But they also fire on textured patches, so they're not necessarily responding to a few isolated edges. And now you have this rich representation which arises after convolution with a certain number of filters, and on top of that you do processing. So we showed that we could do texture analysis; there's work on face detection, image retrieval.

I have a story here. At that time, back in the days when we were in Cory Hall, where my lab was, we used to run these filters; the image is going to be convolved with some number of these filters, probably on this order. And then we would go to Caffè Nefeli, which still exists on Euclid and Hearst, and Pietro and I would spend an hour there, because we needed an hour for this image to get convolved with all these filters. That's the amount of time it would take. So we would have a cappuccino, we would discuss science, politics, everything, come back, and by then the image convolutions would have been completed. This is the era you have to think about to understand why certain techniques from that era were the way they were. So that is era two, the era of filters. And by the end of this era, you could do things like face detection.

Era number three started in roughly the 2000s. If you will, the second era was about these filters, which you can think of as the simple cells of V1. The next era is histograms everywhere. The idea was that you have filters, which are at a point, but then you want to describe an image patch, and the way you describe an image patch is by the distribution of outputs of these filters. Some of the names from that era are SIFT, which is due to David Lowe, and HOG, histograms of oriented gradients; this result here is based on HOG. These are kind of like complex cells, in some crude way. And there was other work from Berkeley on looking at these histograms, and this is a paper from Lazebnik, Schmid, and Ponce, which was trying to do recognition of objects by looking at this distribution of outputs. So you're trying to say: how many times are there vertical edges, how many times horizontal edges, how many times 120-degree edges?
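As a rough sketch of that idea, in the spirit of descriptors like SIFT and HOG but not a faithful reimplementation of either, here is a patch descriptor that simply bins edge orientations weighted by edge strength; the bin count is an arbitrary illustrative choice.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Describe an image patch by the distribution of its edge orientations.

    A crude stand-in for SIFT/HOG-style descriptors: how often does the patch
    contain vertical edges, horizontal edges, 45-degree edges, and so on.
    """
    gy, gx = np.gradient(patch.astype(float))     # brightness gradients
    magnitude = np.hypot(gx, gy)                  # edge strength at each pixel
    angle = np.arctan2(gy, gx) % np.pi            # orientation, folded into [0, pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # vote, weighted by edge strength
    return hist / (hist.sum() + 1e-8)             # normalized distribution
```

Two patches with similar distributions of orientations look alike to such a descriptor even if their pixels differ, which is what makes the bag-of-words analogy coming up next work.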
You capture this distribution, and this distribution is going to be different in different places. It's somewhat analogous to the bag-of-words representations for documents. You want to distinguish a document on banking from a document on, I don't know, real estate. Well, the kinds of words that will show up are going to be different, and without understanding anything of the text, just the distribution of words gives you a clue as to which kind of document it is. And that's, roughly speaking, what Google uses for giving you the right kinds of documents when you type some search query. There are many more things going on, but this is the idea. So this is the third generation of features.

There's a fourth generation of features, but before I get to that, I want to go back in time to some findings from the 1960s in neuroscience, from Hubel and Wiesel. They did this first on cats and later on monkeys. They were neuroscientists, trying to understand what happens in visual cortex. There's a visual cortex in humans and monkeys and cats and so forth, and the first stage of processing is essentially responding to oriented stimuli: whether you have vertical edges or horizontal edges and so forth. So this is an example of a kind of neuron. This neuron likes an edge or a bar at a certain orientation in the image; then it fires a lot. If you displace the stimulus, it doesn't fire so much, and if it's at the wrong orientation, it does not fire at all. These discoveries were what eventually got them a Nobel Prize in 1981.

Hubel and Wiesel had found these simple cells, which are what I showed you on the previous slide, and also what are called complex cells, which respond to a certain orientation but over a certain extended region of space. It's almost as if you are pooling over some area and saying, anywhere you have a 45-degree edge, I'm going to fire, whereas the first kind of cell needed a specific location. And then the idea was that the cells in the visual pathway start out corresponding to edges, but at later stages they might correspond to parts of objects, and at the highest stages they might correspond to very specific things, like Bill Clinton's face or a particular location and so on. So there was this hierarchy going from simple, generic kinds of filters to more and more specific filters, which fire for very, very specific kinds of objects.

Inspired by what Hubel and Wiesel had found, this hierarchy is shown here in this paper from Fukushima from 1980, where you start from an image and you have alternating layers of simple cells and complex cells. And these are really filters; they are all doing convolutions. Between two layers, you have some nonlinearity. This is important, because if you compose linear filters with linear filters, you just get another linear filter, so you don't get any increase in representational power. In between, you need to have a nonlinearity. In fact, the most common nonlinearity in biology is half-wave rectification; today it's called a rectified linear unit, but it's an idea going back to then. And that's it. He also said that you want to recognize the same object independent of position: if there is a chair here or a chair there, my network should still recognize it as a chair. So you want independence of position.
You build in these things: you build in shift invariance, you have these alternating layers of simple cells and complex cells. And this is a paper from 1980. He did it just by modeling what Hubel and Wiesel had found; he put it in the form of a neural network. Now what he didn't have was a good training technique. How do I figure out the weights by which these neurons connect to neighboring neurons, the synaptic weights, the w_ij's in neural network terminology? He wanted to do it with some Hebbian training, and that didn't work so well. In the early to mid-80s, the backpropagation algorithm was invented by Rumelhart and Hinton and others; you can argue about that, since one particular idea like this was probably invented or reinvented by many different groups. Once you had backpropagation as a way of training these networks, it was natural to say, let's apply backpropagation to these particular networks. And Yann LeCun did that, and that's the so-called convolutional neural network architecture. So I hope you see the story of how the convolutional neural network architecture came about: it was taking biology, creating a model, and then adding this learning machinery of backprop, which had been invented in the mid-80s. And it was pretty effective at that time for tasks like handwritten digit recognition and so on.

It did not catch on for the rest of computer vision. Why? Because the kinds of networks they could afford to have were very small, given the computing power of the time. So techniques which incorporated much more prior knowledge, your insight about a domain, or ways to simplify the problem for particular settings, that is what we had to do, because we could not afford to train very big networks. If you have very big networks, you also need very big data, because you have lots and lots of parameters to train, and that was not available either. For digits, there was this: digits are 32 pixels by 32 pixels, and there were 60,000 digits which had been hand-labeled by some high school kids, and that was the training set available. So you could do reasonably well on digits. But for the complexity of the world that we live in, there was just not the kind of data to train effective models.

So the mainstream computer vision community was doing the things that I told you about: first they were doing edges, then they moved to filters, then they moved to histograms. This was like a parallel track. And then there were these neural network junkies who were living in their own world. These two worlds were looking at each other, but each had some reason to sort of denigrate the other. The neural network people thought that they were doing things in a very principled way, they were not hacking around, they had a very principled approach to learning all the features, but they couldn't work on complex scenes. The other world could work on complex scenes, but only with lots and lots of hand coding and hand design.

Well, all this changed around 2012. By then, computing power had obviously been getting better and better over the years, and by the 2010s, particularly once GPUs came onto the scene, there was computing power available. There were data sets available: ImageNet was a data set which came out of Stanford, where they took a million images and had them hand-labeled by Amazon Mechanical Turk workers.
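To make the architecture we have been describing concrete, here is a minimal convolutional network in that LeNet spirit: alternating convolution (simple-cell-like filters), a rectifying nonlinearity, and pooling (complex-cell-like position tolerance), sized for 32-by-32 digit images with 10 classes. It is a sketch of the idea, not LeCun's actual network; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Alternating convolution / nonlinearity / pooling, then a classifier."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # "simple cells": learned oriented filters
            nn.ReLU(),                        # half-wave rectification
            nn.MaxPool2d(2),                  # "complex cells": pool over position
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 5 * 5, n_classes)

    def forward(self, x):                     # x: (batch, 1, 32, 32) digit images
        h = self.features(x)
        return self.classifier(h.flatten(1))  # class scores
```

The weights are then fit end to end with backpropagation on labeled examples, which is exactly the piece Fukushima's model was missing.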
So there were now a million images available. Now you have the training data to train those big networks with millions, or hundreds of millions, of parameters. And they showed, on this task of whole-image recognition, in this image, is there a dog or a cat, things like that, that whereas the previous system from computer vision had an error rate of about 25%, this technique, which was really an old technique just done with modern hardware, got the error rate down to 15%. And suddenly the computer vision community had to stop and take notice. We had to say, OK, this is important, we need to do this. There was still a debate as to whether this applied only to one particular domain, which was whole-image recognition, or whether it applied to different tasks in computer vision, so there was some uncertainty about that. I'll talk about something out of Berkeley which showed that this technique also applied to individual object recognition. But really there were a ton of papers which came out over the last few years which have shown that, in fact, this neural network paradigm is quite a general paradigm, and it applies to many different problems.

So in the mainstream computer vision community, this battle between SIFT and HOG versus convnet features is over. ConvNets won, period. It's not a halfway thing where you say, oh, maybe for this or for that. I think that, in general, if you have enough data, this holds; the caveat is that if in your domain you have very little data, then you have to continue to be clever. It might be that you are lucky and there is a lot of data in some nearby domain, and you could try to train a network on that and then do transfer learning, and that might work sometimes. If you have very little data, period, you have to take recourse to classical cleverness. But if you have enough data, then this is the method of choice. You still need some ingenuity, but that ingenuity is in designing the architecture of the neural network.

So this is how we have shifted. Again, take a history lesson. Fifty years ago, if you wanted to understand some problem as a physicist, what did you do? You tried to model it. Maybe you found the differential equations. You couldn't solve them in general, but maybe in some regime you could construct approximations to the solutions. That's how an analytical, mathematical physicist operated. Thirty years ago, what would you do? You have a computer, you have some way to simulate the phenomena, so you simulate it; you just run numerical simulations. Well, we have now come to the next generation, where you have lots and lots of input-output pairs, examples which have been labeled, and in between there is this black box, the neural network. You provide those input-output pairs and try to train this black box to spit out the right output for the appropriate inputs. So at every stage in this process, you have somehow lost control, and you have to learn to live with it. I'm sure there was a generation of physicists who were not happy when numerical simulations took over, and the cleverness of people like Lord Rayleigh and so on was no longer needed; you just needed some guy who could hack and write a program which simulated the situation. Well, there is that same feeling among people like me.
All my cleverness in designing features is now obsolete, because some kid with enough data and GPUs can beat me; with enough data, you can train a network which might do a better job of designing these layers of features. But I have grown to get used to it, and I encourage you to do the same. How much time do I have, by the way? Another 10 minutes? How much? 20 minutes? 10. Somewhere like that. I'm proceeding at a slow pace because I'm trying to do the history lesson. After that, there'll be technical stuff which you will eventually not care about. So let me just take you through some examples of how this neural network revolution unfolded over the last two, three years. This is work from Berkeley on regions with CNN features, which was trying to solve the problem of: what are the objects in an image?

Actually, I'm going to jump to the next slide first. This is the output of a computer vision system from last December, from Microsoft Research, which won the challenge called COCO. Look at the results of the system; this is the winning system. You probably can't read this in the back, but I'll read it out. There is a rectangle here which says person, 0.998, and another person, 0.987. Those numbers are probabilities, or scores, that this is indeed what you claim it is. Somewhere here there is a wine glass, 0.982, a knife, 0.997. This whole thing is a dining table, that's listed here, person, and so on. So this is the output of a computer program. We can do this today. It is analyzing this image and basically detecting the objects, localizing them with rectangles, and so on. This is why people like me, who have worked in this field for 30 years, are so excited that we have achieved this level of performance. And how it came about: I will go to the past and then the future.

The technique is this R-CNN framework, where you don't try to classify the image as a whole; objects are going to be spatially localized, so there is some bottom-up process by which you come up with candidate proposals for objects. That's some kind of segmentation procedure. You come up with candidates, and some of these are going to be ones that make sense and others which don't make sense. Then, on that support, the pixels in each rectangle generated by this bottom-up segmentation procedure, you compute features by running them through a convolutional network, and then you try to say whether it's an airplane, a person, or a TV monitor. This was work from two years ago now, and there was a competing proposal from Google. That was the basic machinery. So that was R-CNN, and then there was Fast R-CNN, and then there is Faster R-CNN; this is a result from Faster R-CNN.

Accompanying this was a revolution in the number of layers of these neural networks. The early approaches, in fact HOG and DPM, these are more classical computer vision techniques from the 2000s, you could say are maybe like two layers. The 2012 network from Hinton's group was at eight layers, then came 16 layers, and now, in fact, there are networks with 100 layers. And the numbers that you see here, 34% and so on, these are accuracy numbers; ideally we want to be at 100%, or 99%, whatever. The classical computer vision techniques were at about 34%. That went up to 50%, then 66% with the best R-CNN techniques, and now the numbers are at 86%.
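The R-CNN recipe in that story is simple enough to sketch: a bottom-up procedure proposes candidate boxes, each box is cropped and warped to a fixed size, pushed through a convnet, and scored per category. This is a schematic of the idea only, not the actual R-CNN code; the proposal method, the cropping helper, and the classifiers are all passed in as placeholders.

```python
def detect_objects(image, propose_regions, crop_and_resize, cnn_features,
                   classifiers, threshold=0.5):
    """Schematic R-CNN-style detector (a sketch of the idea, not the original code).

    propose_regions(image)      -> candidate boxes from bottom-up segmentation
    crop_and_resize(image, box) -> the pixels inside the box, warped to a fixed size
    cnn_features(crop)          -> feature vector from a convolutional network
    classifiers                 -> dict: category name -> scoring function on features
    """
    detections = []
    for box in propose_regions(image):                 # candidate object locations
        feats = cnn_features(crop_and_resize(image, box))
        for category, score_fn in classifiers.items():
            score = score_fn(feats)
            if score > threshold:                      # keep confident candidates only
                detections.append((box, category, float(score)))
    return detections
```

Fast and Faster R-CNN then move the convnet work out of the per-box loop, sharing one feature map across all the proposals, which is where most of the speedup comes from.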
So in a period of about five years, the accuracy has gone up from about 34% to 86%. Every scientific field has an S-curve. There is an early stage where you struggle for years and years, progress is slow and painful, you can't really do the problem, nobody can. Then there is a middle phase, the revolutionary phase, where progress is very fast and you are suddenly able to do much better. And after that, things slow down again; there's a consolidation phase, and there's a leveling off of the S-curve. Computer vision is now in the middle of this middle phase, this very rapid improvement in performance. This has happened in other technologies. In computer networking, I mean, the internet was not invented in 1991, it had been around for some time, but the boom era was, say, the 90s. Speech, again: people have been talking about interfaces using speech recognition forever, but if the accuracy is very feeble, you can't field any real application. Today, is it perfect? No. But it's good enough that you have interfaces such as Siri or whatever on your mobile phone, which you can use effectively. So that's what I'm saying: here we are talking about this kind of transition in computer vision.

So let me talk a bit about the kinds of applications we can do now. We can do this: this is a segmentation task. It's not enough to say, here is a box which contains the object; I want to mark the pixels of that object. Classically, this is done bottom-up. You look for patches or regions which have roughly the same color or texture, and you use that to separate them out from the background. You don't use top-down knowledge. So if you want to segment out me as a person, you want to separate me from the background. If you only used color and texture, you can see that my face is skin-colored, so that is one region; my shirt is a different color, so you'll get that as another region; and my trousers are another color, so that's a third region. However, if you know about what people are, you can put together these components, because you know the shape of a person, what people look like. So segmentation is really a combination of a top-down problem, which is knowing what tumors look like or knowing what people look like or knowing what cars look like, and a bottom-up problem, which is that pixels of parts of the object are going to have coherent color and texture. Classically, we focused on the bottom-up side, and bottom-up is a big part of the answer, but it is not the whole answer. The answer is the right way of combining top-down and bottom-up knowledge, and that's what we show that we can do.

Again, the machinery is not so important. We first detect objects, using the techniques described earlier, and then we try to classify every pixel. In these machine learning techniques, you need to have example input-output pairs, some training data and some test data. So at training time, what you're doing is trying to say, for every pixel, does this belong to the car or not? It's a binary classification task. And this is the result: it enables you to segment out the object. Here are some results; for example, you're pulling out an individual ketchup bottle. And you can see the relevance of this for biological tasks. I mean, there are zillions and zillions of applications where you want to just count cells, and this is a way to do that.
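As a sketch of that "classify every pixel" step: given convnet features for a detected box, a small head scores each pixel inside it as object or background, and thresholding those scores gives the mask. Everything here, the layer sizes and the name MaskHead, is illustrative rather than taken from the system in the talk.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Per-pixel binary classifier: for each pixel in a detected box, object or not."""

    def __init__(self, in_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),   # one foreground score per pixel
        )

    def forward(self, box_features):           # (batch, in_channels, H, W) convnet features
        return torch.sigmoid(self.net(box_features))  # per-pixel probability of "object"

# At training time, each pixel is a labeled example (object or not), so the loss is
# per-pixel binary cross-entropy against the hand-marked masks, e.g.:
#   loss = nn.functional.binary_cross_entropy(mask_head(features), ground_truth_mask)
```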
Now, what you would be doing with this machinery is not just relying on the fact that the cell has a different color from the background; you can also try to make use of its shape and appearance. That's the power of this technique. Now, there's no free lunch. You need to start by having lots and lots of labeled data, and that's not necessarily easy to acquire. You first need to have some colleagues who are willing to sit down and mark lots and lots of examples of the object of interest, let's say cells of a certain type. But once you have that training data, then you can train this network.

What else can you do? If it's a 3D object, you care about the pose: what is the position and orientation of that object in 3D space? There, the output of the network is the pose, the Euler angles and so forth of the object, and it turns out you can do this with a network too. This enables us to do something like building 3D models of objects. This arises in microscopy and in problems like cryo-EM and so on: you have some object for which you have many different views, and you're trying to build the 3D structure. What we showed here, and I'm going to skip the details, is how to build 3D models of cars given many images of cars. You analyze individual images, you segment them out, you estimate their pose, and then you try to put them together into a consistent 3D model. And skip, skip. So these are the kinds of models that can be built up just from images on the web.

Let me turn to video understanding. So now you have images; in images, we want to find objects, like a dog. If you have a video stream, what is it? It's now space-time. So what's the corresponding problem there? In images, the problem might be, is there a dog and where is it in the image? In video, we might have a problem like, is there a person diving, and where is he or she in the video? It's going to be a volume, in x, y, and time; there's going to be some volume swept out by this person performing an action. We call that an action tube. And essentially there are techniques for this kind of data that are isomorphic to what we did for images. So again I'll skip the details, but we can classify these actions, and these are the kinds of results we can get. So I'll take five minutes. That's good. These are some of the actions that you can recognize: kicking, running, walking, and so on.

I'll show you some other work on action, which is trying to work from a single image. Again, I'm going to skip all the math; skip all this, let's get to this result. This is trying to find actions such as jumping, phoning, playing an instrument, reading, and so forth. And I'm going to skip the machinery, but the key idea here is that you need to pay extra attention to some part of the image. So if you want to detect whether I'm phoning, think of this in the era of cell phones: my hand is near my ear, and I have something there, this kind of a sign, and that tells you that I'm talking on a cell phone. So you don't want to just look at all of me; you also want to zoom into this area, and that will help you distinguish. So in this work, you see green boxes and red boxes everywhere. The red boxes correspond to the person, and the green boxes correspond to some region of interest which gives you extra information. It's where you zoom in to get extra information.
So if your task is to recognize playing an instrument, then there is going to be a zoomed-in area which likely corresponds to an instrument; for riding a bike it will be the bike, for reading it will be the book, or something like that. But here's the magic: I don't have to build this in. I'm not building in that, ah, if you want to recognize playing a musical instrument, you run one detector for a person and another detector for the musical instrument and you put them together. The interesting box comes out purely from the desire to do well at the primary task. And that's what all this mumbo jumbo does, which I will skip. You're trying to train with this loss function: you detect a person, you're trying to detect the action, and you use the most informative other region. That "most informative" is defined by the task; it does not have to be specified by a human. So here are some more results, skip the numbers.

These kinds of things, which use feed-forward convnets, are now pretty straightforward technology. More interesting are problems where feedback is required, and this is the kind of stuff which is more researchy these days. Here's an example where we are trying to find stick figures. You're trying to estimate, for this person, where are the person's shoulders, hips, elbows, and so forth. This is what we want to get at. The way the algorithm works is that it has an initial guess of the stick figure, and it tries to predict the correction to that guess. That's the result of one iteration, and then another iteration, and so forth, and after three iterations, it's done. This enables you to use context from the rest of the figure, as opposed to a purely feed-forward architecture. And this is how it might operate; again, I don't expect you to get all of that. You take an image and your current guess of the solution, and you try to predict a better guess. It's kind of reminiscent of how feedback control works. Skip, skip. But if you look at this column, the last column is the ground truth, and the second-to-last column tells you what the computer program can find. So it can find pose. Now, pose can show up in different ways; if you have a molecule, pose means something different, but these techniques are relevant.

Tracking, instance segmentation: this feedback paradigm is a general one. Here are examples where you apply it to a task such as segmentation, where if you just use feed-forward machinery, it makes certain kinds of mistakes, because locally this second person's face is similar to a face, and therefore it could be marked as part of the foreground. However, in an iterative framework, you can parse this correctly: once you've decided that it's really this particular person, then that face cannot belong there. And I'm going to skip those kinds of details. We can do things like predict hidden parts of objects, so not just the visible part of a person, but all the hidden parts, what you can't see, because once you have a model that this is a person, you know what a person looks like. And this enables us to do things like guess where they are in space, because when you can see the person's foot and you can say they are on a ground plane, you can estimate distances.

So where do we go next? I think I've given you the flavor of all of these problems. We can do detection, which is objects, segmentation, pose estimation.
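The feedback idea behind those stick figures can be sketched as a small loop: keep a current guess of the keypoints, let a network look at the image together with a rendering of that guess, predict a correction, apply it, and repeat a few times. The helper names, the fixed three steps, and the network itself are placeholders, not the actual system.

```python
import numpy as np

def estimate_pose(image, render_guess, predict_correction, initial_keypoints, n_steps=3):
    """Iterative error feedback, sketched: refine a pose guess a few steps at a time.

    render_guess(keypoints)              -> an image-like encoding of the current guess
    predict_correction(image, rendering) -> a small update to every keypoint location
    """
    keypoints = np.asarray(initial_keypoints, dtype=float)   # e.g. a mean pose to start
    for _ in range(n_steps):
        rendering = render_guess(keypoints)                  # show the network its own guess
        correction = predict_correction(image, rendering)    # bounded step toward the truth
        keypoints = keypoints + correction                   # apply the predicted correction
    return keypoints
```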
We can, in fact, infer actions, such as this person is walking versus running. On the problem of prediction, we can't do that particular prediction task; that was just a reach goal, will this person put money in there or not. But we can do something which is easier and equally fun, which is: who will possess the ball next? Suppose you see this game here; if you like basketball, just imagine this as a basketball scene. In this work, again, there's some analysis, you construct the top-down view, and so on, and skipping all the details of how the machinery works, at the end we can actually make predictions. You can look at the players: you can see which player is free, which player doesn't have an opponent on them, which player is closer to the goal. This is the kind of information that a player in the sport is going to have, and the algorithm can try to infer all of that from visual analysis of the image. And we have training data without manual supervision, because if you just let the video run for a little bit longer, you know who actually got the ball, so that provides self-supervision. We cannot do as well as human experts, but we can get reasonably close to human non-experts: we get to about 44% accuracy, and human non-experts are at about 55%. I will end here. Thank you.