to work on joint training of a convolutional network and a graphical model for human pose estimation. So we train a convolutional network for the part detectors and then add some structure on top of it. Let's get into the details.

So what do I mean by human pose estimation? Well, given an image such as this one, what we want to be able to do is detect joint locations such as the point between the eyes, the shoulder, elbow, wrist, et cetera. And why would anybody want to do this? Well, it turns out it is a basic problem in computer vision, and if we can solve it well, there are a lot of things we can do with it. For example, we can do action recognition, which can then be used for security and intelligence. We can do clinical analysis of gait pathologies and maybe help patients suffering from diseases such as Parkinson's. We can do human-computer interaction, so controlling software on devices with cameras; for example, a swipe gesture can take you to the next track in a car. We can do markerless motion capture, which can then be used for visual effects, and it can also be used for gaming, et cetera.

But it turns out it is also a very hard problem. For example, in an image such as this one, even I, as a human, cannot tell you with confidence where his right knee is. Now, why is this problem hard? First, due to the high-degree-of-freedom configuration of the human body: if we have 14 joints in 2D, we have 28 parameters that we need to optimize over. Also due to complex self-occlusion patterns that arise from the highly articulated nature of the human body, so different body parts get covered by others. Also due to the self-similarity of body parts; for example, the two hands can look very similar in an image. The detector needs to be invariant, yet discriminative, to large variations in body type, clothing, lighting, and viewpoint. Also, we do not assume any priors about the scale or the position of the human in the image. And finally, due to limited training data: of all publicly available datasets, the largest one, the human pose dataset from the Max Planck Institute, has only 40,000 labeled images.

Now, we started to apply deep learning to this problem in 2013. Until then, histogram-of-oriented-gradients (HOG) based part detectors or shape-context-based part detectors, followed by a higher-level spatial model, were what people usually did. But as you might know, HOG only considers gradient information, and thus we lose out on information such as color, texture, et cetera. Also, these models were usually ad hoc and not jointly trained: you would have a part detector, you would have a spatial model, you would have a weight that you would have to tweak, and so on. And finally, given the recent successes of deep learning for classification and other tasks, we decided to try using deep learning for human pose estimation.

So we started with a very simple convolutional neural network based architecture. The input to the network is a patch, and the output is four confidence values corresponding to the four parts that we want to detect; in this case, the shoulder, wrist, elbow, and the point between the eyes. At training time, we show the network different patches. If the patch is centered at the face, we expect the red value to be high. If the patch is centered, say, at the elbow, we expect the green value to be high. And if the patch is centered at the wrist, we expect the blue value to be high. If the patch is centered at the background, we expect all these values to be low. So this is how it is trained.
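To make this concrete, here is a minimal sketch of what such a patch-based part detector might look like. The layer sizes, names, and activations are illustrative assumptions, not the exact architecture from the talk:

```python
# Minimal sketch of a patch-based part detector (layer sizes and names
# are illustrative assumptions, not the architecture from the talk).
import torch
import torch.nn as nn

class PartDetector(nn.Module):
    def __init__(self, num_parts=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                 # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 256), nn.ReLU(),
            nn.Linear(256, num_parts),       # one confidence per part
        )

    def forward(self, patch):                # patch: (N, 3, 64, 64)
        return torch.sigmoid(self.classifier(self.features(patch)))

# A face-centered patch should drive the "face" output high and the
# other three low; a background patch should drive all four outputs low.
```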
And then at test time, we run this detector in a sliding-window fashion over the entire image. At each position of the patch, we run the detector and get four confidence values corresponding to the probability of each joint being present or not. When we slide this detector throughout the image, we end up with four response images.

Now, the next thing that we did was to fold this sliding-window architecture inside the convolutional model, the ConvNet itself. This is done by replicating the fully connected layers across the image using 1×1 convolutions, which is a very common technique that has been used since the '70s (a small sketch of this appears below). But why do we do this? Because now the input is an image, and the output, in one shot, is the four heat maps. Also, when learning, we get to see all the negatives, so gradients from all the negatives, and that helps training.

Then the next thing we did was to use multi-resolution patches. So instead of taking only the 64×64 patch, we also take a larger 128×128 patch around it and decimate it: a Gaussian blur followed by decimation, which is in some sense an approximation of a Laplacian pyramid. So the smaller patch carries the high-frequency information, and the larger one corresponds to the low-frequency information. The next thing we did was, again, to do the one-shot version of this multi-resolution architecture. But it turns out to be quite involved, because you need to interleave the responses before feeding them into the 1×1 convolutions. So instead, we came up with an approximation: rather than interleaving, we simply used one of them, replicated four times. This architecture actually led to the state of the art in 2014, beating DeepPose by a big margin; we'll come to this in the results.

Now, because we are detecting a human body, there is a very strong correlation between the locations of the different body parts. Thus we also have a spatial model which sits on top of the ConvNet part detector, and we train the two jointly, because the spatial model also happens to be convolutional. So let us see how the spatial model works, using this example. Suppose we want to detect where the face is, and we also have tentative locations for the shoulder. As you can see, the face has a false positive, and so does the shoulder, and we want to use the shoulder information to pick out the right face. This can be done by training a filter such as this one, where if the shoulder were at the center, the face would be somewhere on the top left. If you convolve the shoulder unary with this filter, you get a response map such as this one; this, in some sense, is the message from the shoulder to the face. Then you can do a point-wise multiplication between the face unary and the message from the shoulder to the face to get the correct face location. This can be expressed in terms of energies: in the log domain, the final face probability is log p(face) = log p_unary(face) + log( p(face | shoulder) ∗ p_unary(shoulder) ), where ∗ denotes convolution. This can be realized in terms of convolutions; we have to apply a softplus first, because the log is only defined for positive values.
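To make the spatial model concrete, here is a minimal sketch of a single shoulder-to-face message in the log domain. The function and variable names are illustrative assumptions, and in the actual system these pairwise filters are learned jointly with the part detector:

```python
# Minimal sketch of one pairwise message in the spatial model, in the
# log domain (names and sizes are illustrative assumptions).
import torch
import torch.nn.functional as F

def face_marginal(face_unary, shoulder_unary, face_given_shoulder):
    """face_unary, shoulder_unary: (1, 1, H, W) heat maps in [0, 1].
    face_given_shoulder: (1, 1, kh, kw) learned displacement filter."""
    # Message from shoulder to face: convolve the shoulder unary with
    # the conditional "where is the face given the shoulder" filter.
    msg = F.conv2d(shoulder_unary, face_given_shoulder,
                   padding=(face_given_shoulder.shape[-2] // 2,
                            face_given_shoulder.shape[-1] // 2))
    # Softplus keeps everything strictly positive so the log is defined;
    # the point-wise product of unary and message becomes a sum of logs.
    return torch.log(F.softplus(face_unary)) + torch.log(F.softplus(msg))
```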
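And going back to the fully-convolutional trick mentioned above: under the same illustrative assumptions as the patch-detector sketch, folding its fully connected layers into convolutions looks roughly like this. The first linear layer becomes a large convolution applied at every spatial position, and the second becomes a 1×1 convolution:

```python
# Sketch of folding the sliding window into the network: the two
# nn.Linear layers from the earlier sketch become convolutions, so a
# full image goes in and four heat maps come out in one shot.
import torch.nn as nn

fully_conv_head = nn.Sequential(
    # Equivalent of Linear(32*16*16, 256): a 16x16 convolution applied
    # densely at every position of the 32-channel feature map.
    nn.Conv2d(32, 256, kernel_size=16), nn.ReLU(),
    # Equivalent of Linear(256, 4): a 1x1 convolution.
    nn.Conv2d(256, 4, kernel_size=1),
)
# Composing the earlier `features` trunk with this head maps an
# (N, 3, H, W) image to (N, 4, H', W') heat maps, and training now
# receives gradients from every background position at once.
```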
Now, here we see some qualitative results. So the top one is the FLIC dataset, which comprises of... actually, my icons are swapped; it should be the other way around. So the top one is the Hollywood dataset, FLIC, which comes from movies. It is less challenging than the LSP dataset, which is a sports dataset, so people are doing, maybe, acrobatics that are more difficult to detect. But obviously these are the cherry-picked good results.

Now, we also used this to do 3D reconstruction from very few cameras. This was done using two cameras, and it can handle simple tasks such as walking and also jogging. In the next one, we have a slightly more difficult example where he is juggling and also doing a cartwheel. If you look closely, it fails where the person is completely upside down, and this is only because we do not have enough such examples in our dataset.

Now let us look at some quantitative results. This graph is for the wrist on the FLIC dataset; the wrist happens to be the most difficult joint to detect. On the y-axis is the percentage of the test set, and on the x-axis is the normalized distance error, which means: if the torso were scaled to 100 pixels, how many pixels of error there were (a small sketch of this metric appears at the end). By the way, these are all shallow approaches, so up until 2013. Take the black curve, MODEC: it says that roughly 50% of the test set was correctly detected within about 10 pixels of error. Now, this was the result of the first, simple convolutional network. This was DeepPose by Google, I mean Toshev et al. And this is ours using the FLIC dataset, and this is ours using the FLIC-plus dataset; we do not change the model, we simply increase the dataset size. These are similar results on the LSP dataset for the wrist and elbow, where the wrist is the solid line and the elbow is the dotted line. Again, ours is much better than the previous approaches.

Currently, we are working on embedded applications: the same model, but made to work on embedded devices such as mobile phones or cars. So we are working on model compression, and on lower latency and better hyperparameters; I mean, this is always a constant search, right? We are also working towards stronger and more sophisticated spatial models. And finally, we are doing online learning with Amazon Mechanical Turk in the loop. So you install a camera, and we install our best model; as and when the model makes a mistake, and we somehow detect that it is an error case, we get it automatically labeled by Mechanical Turk. Once we have accumulated enough annotations, we push a new model onto the device.

With this, I would like to thank you, and we will open it up to questions. And by the way, we are hiring, so if anybody is interested, please email me at arjunjan.gmail.com. Thank you so much.
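As a footnote on the evaluation protocol used in the plots above, here is a minimal sketch of the normalized distance metric. The function names and the per-image torso height input are illustrative assumptions:

```python
# Minimal sketch of the normalized distance error used in the curves
# above (names are illustrative assumptions).
import numpy as np

def normalized_distance_errors(pred, gt, torso_heights):
    """pred, gt: (N, 2) predicted / ground-truth joint positions.
    torso_heights: (N,) per-image torso sizes in pixels.
    Returns errors as if every torso were scaled to 100 pixels."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return 100.0 * dists / torso_heights

def detection_rate(errors, threshold_px):
    """Fraction of the test set detected within threshold_px pixels of
    normalized error: one point on one of the curves in the talk."""
    return float(np.mean(errors <= threshold_px))
```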