The recording is running. So as you can see, today we have a guest lecturer: Ishan Misra. Ishan is a research scientist at Facebook AI Research (FAIR), where he works on computer vision and machine learning. His research interest is in reducing the need for supervision in visual learning. He finished his PhD at the Robotics Institute at Carnegie Mellon University, where he worked with Martial Hebert and Abhinav Gupta. His PhD thesis, titled Visual Learning with Minimal Human Supervision, received the SCS Distinguished Dissertation Award in 2018. So, without further ado, let's get started. We cannot have a real round of applause, so can we have a round of applause in the chat for our speaker?

Hi everyone, my name is Ishan. I'll be talking about self-supervised learning in computer vision today. A lot of the focus is going to be on the discriminative style of approaches rather than on generative approaches, and I'll explain what that means as I go through the talk.

The success story of representation learning in computer vision so far has really been this pre-training step — the "ImageNet moment" of computer vision. What has worked really well is that when we have a large labeled dataset like ImageNet, we can learn a representation by performing an image classification task on it. And what is very useful is not just performing this particular task, but taking the representations you learn and using them for downstream tasks where you may not have enough labeled data. This has worked really, really well, and it's the most standard recipe for success.

Now, this involves collecting a large dataset of supervised images: you need a large set of diverse images labeled with a large set of diverse concepts. So let's first see whether we can actually collect these labels and what the difficulties are in doing so. The ImageNet dataset is actually a small dataset in the grander scheme of things: it has just 14 million images and roughly 22,000 concepts, and if you look at the amount of effort spent, labeling this entire thing took about 22 human-years. In contrast, a lot of people are looking at alternative supervision approaches where you predict something that is not a pristine, clean label but is much easier to get — for example, predicting hashtags or the GPS locations of images. And what we are really going to focus on in this lecture is self-supervised learning, which uses the data itself.

The first question I always like to start with is: why don't you just get labels for all your data? Why even invent this entire line of research? So I did a small exercise where I plotted the amount of supervision we have across vision datasets. First, I looked at all the images that have bounding boxes — images where you know which concepts are in the image and you also have a box drawn around them. This is the standard annotation for something like an object detection model.
If you look at all the datasets in vision that have bounding boxes, you get roughly a million or so images. Now relax this constraint and say: I don't care where the object is located, all I care about is which objects are present in the image. With that, you immediately get an order of magnitude more data — about 14 million images or so. If you relax the constraint further and say you don't care about image-level supervision either, all you care about are the images present on the internet, you get about five orders of magnitude more data. So if you look at this plot, you can see immediately that the amount of data labeled even at a bounding-box or image level is basically nothing compared to what exists at internet scale. And I haven't forgotten the bars on the left-hand side — they just completely disappear; you would need to make this a log plot for those bars to even show up.

Of course, internet photos do not represent everything about the world. There are things that really require motion, or other physical senses, to learn. In the real world there are far more things you actually experience, far more sensory inputs you can get, and it's really hard to obtain labels for all of this data. Again, to put things in perspective: ImageNet, with just 14 million images and a very small number of concepts, required a lot of time to label. Clearly, labeling is not going to scale to all of the internet's photos, let alone the real world.

The other problem with labeling is that for complex data like video, it's just really hard to scale. A second problem is that rare concepts are really hard to label. For example, this is one of the popular image datasets, called LabelMe. If you look at the kinds of concepts it contains, a lot of them are so rare that you would have to label a lot of data to get even a few instances. In this dataset, 10% of the classes account for more than 93% of the data, which already tells you that to scale labeling to more and more concepts, you will need a lot more data with very diminishing returns. This is the standard long-tail problem.

And of course, pre-training is not always the right thing to do. If you completely change your domain — say, to medical imaging — it's not clear whether ImageNet pre-training is right for the task. Or if you do not know the downstream task a priori, how do you collect a big dataset and run this entire pre-training plus downstream fine-tuning recipe?

Self-supervised learning comes in here: it tries to give you an alternate way to pre-train your models — to learn from data or from experience — without requiring pristine supervision. There are two simple definitions you can come up with for self-supervised learning. The first is from a discriminative, supervised-training perspective. In ImageNet, for example, you have an image and it can be classified into one of 1,000 labels.
Self-supervised learning, by contrast, can be thought of as a way to obtain labels from the data itself using an automatic process — one that does not require much human intervention. Once you get these automatic labels, you can go ahead and train your model with them. The other way of thinking about self-supervised learning is as a prediction problem, where you try to predict one part of the data from other parts. You have some observed data and some hidden data, and you formulate a task where, given the observed data, you predict either the hidden data or some property of it. Pretty much all of the self-supervised techniques can be viewed in this framework.

For the term "self-supervised learning", I really like an analogy from Virginia de Sa, which distinguishes between the three terms supervised, unsupervised, and self-supervised. In supervised learning, you have an input — say a cow — and you're given the exact target for it, the label "cow". In unsupervised learning, you're given the input but it's not clear what the target is, or what exactly the objective function should be. Self-supervised learning is the term that is now preferred more and more, and the idea is that the label comes from a co-occurring modality or a co-occurring part of the data itself. So really all of the power is in the data, and you're trying to predict either parts of it or properties of it.

A very standard and successful example of this is the word2vec model. Given a sentence — for example, "the cat sits on the mat" — you observe part of the sentence, labeled as the context or the history, and you hide a part of it, a word in this case. Given the context, you ask the model to predict this target. You have a self-supervised objective, you minimize it, and you learn a representation for your input data. word2vec has shown a lot of promise in a variety of applications, and this predictive style of model has inspired a lot of work in computer vision as well. In fact, the success of self-supervised learning is by now undebatable in natural language processing. In 2018 there was this really successful model called BERT, which is basically a form of masked autoencoder, and it has revolutionized what you can do in NLP with a limited amount of data. A lot of people call this the "ImageNet moment" of NLP.

So in this talk — again, to motivate why we want self-supervised learning — we are going to focus on how you can look at data and use observations of, and interactions with, the data to formulate self-supervised tasks, and how you can leverage multiple modalities (I'll say more about what "modalities" means) or structure in the data to learn representations. Let's move to the context of computer vision, where I'll try to define the things I've been discussing at a high level in a more concrete fashion. First question.
So self-supervised learning is basically just unsupervised learning, right?

Yes — I mean, yes. The main difference is that "unsupervised" is a very poorly defined term. There is "supervised", but what is unsupervised? The analogy given by Jitendra Malik is: there is a cat, but there is no category called "un-cat". So that's the reason to prefer this term more and more — it's really about using the data, or the properties of the data itself, to come up with supervision. Hence, self-supervision.

Someone suggests that it's a subset?

Yes, I guess so. — You know, my reason for calling it this is that the algorithms are essentially the same as supervised learning algorithms, with some modifications, because you're training the system to learn part of its input from another part of the input. So it's very similar to supervised learning in many ways, except that you need to handle uncertainty better, and the negative category, if you want, may be much larger, which is kind of an issue. But unsupervised learning is really not very well defined. Self-supervised learning is a rather different concept, and it's not entirely clear that it's a subset of unsupervised learning.

Moving ahead, I'll now talk about self-supervised learning in the context of vision. In vision, a lot of these prediction problems have been framed as pretext tasks. The term comes from 2015, from a particular paper by Carl Doersch. The idea is that you have a task you really care about at the end — like image classification — but you don't have a lot of data for it. So you solve a different task before going to the real task: a "pre-text" task. This pretext task is a prediction task that you solve, but it's often not the real task you care about. You solve it to learn a representation, and then on your downstream task you use this representation to do something meaningful.

These pretext tasks are often funny — people got very creative in coming up with them. So let's look at how you can define a bunch of pretext tasks and what each of them is trying to do. You can use images, video, or video and sound. In each case, you have some observed data and you try to predict either hidden data or some property of the hidden data, and this is what distinguishes the different approaches.

Let's look at how you can use images to define a pretext task. The paper that introduced the term came up with a fairly funny method: you take two image patches and ask the network to predict the relative position of one patch with respect to the other. Say I first sample a blue patch, and then I sample another, red patch. I feed both patches through a ConvNet, and I have a classifier that solves an eight-way classification problem. And how do I get the label for this classification problem? I just look at where the red patch is located with respect to the blue patch — and that's it.
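To make this concrete, here is a minimal sketch of how such a training pair could be constructed. This is illustrative code, not the authors' implementation: the function name, the 96-pixel patch size, and the gap between patches are assumptions on my part (the gap is there so the network can't cheat using low-level edge-continuity cues at the patch boundaries).

```python
# Sketch of the relative-position pretext task (Doersch et al., 2015).
# All names and hyperparameters here are illustrative, not from the paper's code.
import random
import torch

def relative_position_pair(image, patch_size=96, gap=16):
    """Sample an anchor patch and one of its 8 neighbors from a (C, H, W) tensor.

    Returns (anchor, neighbor, label), with label in {0..7} encoding the
    neighbor's position on the 3x3 grid around the anchor.
    """
    # (row, col) offsets of the 8 neighbors, skipping the center cell.
    offsets = [(r, c) for r in (-1, 0, 1) for c in (-1, 0, 1) if (r, c) != (0, 0)]
    label = random.randrange(8)
    dr, dc = offsets[label]

    step = patch_size + gap          # the gap discourages trivial boundary cues
    _, H, W = image.shape            # assumes H, W >= 2 * patch_size + gap
    # Place the anchor so that the chosen neighbor stays inside the image.
    top = random.randint(max(0, -dr * step), min(H, H - dr * step) - patch_size)
    left = random.randint(max(0, -dc * step), min(W, W - dc * step) - patch_size)

    anchor = image[:, top:top + patch_size, left:left + patch_size]
    neighbor = image[:, top + dr * step:top + dr * step + patch_size,
                        left + dc * step:left + dc * step + patch_size]
    return anchor, neighbor, label
```

The two patches then go through the shared ConvNet, their features are concatenated, and the eight-way classifier is trained with cross-entropy on this label.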
So at the end of it, you're just solving an eight-way classification task. You got your labels by exploiting this property of the input data, and that's it. Now you can use this to train the entire ConvNet. Looked at another way, the network is only solving a very small classification problem — just eight possible locations.

Surprisingly enough, doing this pretext task actually learns something fairly reasonable. One way to look at what the network has learned is to look at what it considers nearest neighbors in its learned representation. To explain this plot a little more: on the left-hand side you have the input patch. You feed this input patch through the CNN; you also extract a bunch of patches across your dataset — in this case ImageNet — and compute feature representations for each of those patches. Then, for the particular input patch you sent through the ConvNet, you compute its nearest neighbors among all the patches from the dataset. You can use three different networks to compute the feature representations: the first column is the relative-position pretext task, the second column is a randomly initialized AlexNet, and the third column is an ImageNet pre-trained AlexNet.

If you look at what the relative-position task is capturing, it's really able to find very good patches — patches that are identical or very close to the input patch. You also see, for example in the row with the cat (the fourth row), that it's somewhat invariant to color: the input cat was black and white, but it's able to pick out cats which are not just black and white. So it's really doing something — it's at least able to reason about patches as a whole.

Why should this representation learn anything semantic?

The nearest-neighbor visualization technique is good at telling you what the representation space has captured. In this case, what we can confidently say is that this relative-position representation has learned to associate local patches that have roughly the same appearance. And because it is able to reason about these local patches, maybe it's able to reason about the image, because an image can be viewed as a collection of local patches. So it's able to put these patches together in one part of the representation space.

Now, like I said, people have gotten fairly creative with the kinds of pretext tasks they design. Another popular pretext task is predicting rotations of an image. This task is very straightforward: you take an image and apply a rotation of 0, 90, 180, or 270 degrees to it. You send in the rotated image and ask the network to predict which rotation was applied — so it just solves a four-way classification problem: 0, 90, 180, or 270. This pretext task is actually one of the most popular ones now because it's so easy to implement: you just take an image. It's very, very simple — you don't need to sample patches or solve anything complicated.
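As a concrete illustration, here is a minimal sketch of this rotation task. The names are assumptions on my part, and `backbone` stands for any ConvNet that ends in a 4-way classifier.

```python
# Sketch of the rotation-prediction pretext task (in the spirit of RotNet).
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """Rotate each image in a (B, C, H, W) batch by a random multiple of 90 degrees.

    Returns the rotated batch and labels in {0, 1, 2, 3} for 0/90/180/270 degrees.
    Assumes square images so every rotation has the same spatial shape.
    """
    labels = torch.randint(0, 4, (images.size(0),), device=images.device)
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

def rotation_loss(backbone, images):
    rotated, labels = rotation_batch(images)
    logits = backbone(rotated)       # (B, 4) logits over the four rotations
    return F.cross_entropy(logits, labels)
```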
The architecture is completely standard, and the task has become fairly popular.

So the network is basically trained — the features are trained — in order to solve this rotation problem, right? So the output will somehow depend on the specific task someone picks?

Yes. Again, this is a pretext task: we are not actually interested in predicting the rotations of an image. We are just using this task as a proxy to learn features, so that on the downstream task — say, when someone gives us a thousand labeled images of cats — we can use this pre-trained feature representation. These pretext tasks often don't make a lot of semantic sense on their own, and that's probably the reason for calling them "pretext": you have a downstream task where you actually have labels that you know are good.

Thanks.

Why would predicting rotations give us any sort of useful representations?

When this paper came out, this was the question of many, many people — it was my question as well. Empirically, it actually works really well. My intuition has been that in order to predict the rotation of an image, the network needs to roughly understand the boundaries or some of the concepts in the image. For example, to predict that a particular image is rotated by 180 degrees, it needs to at least segregate the sky from the sand, or the sky from the water, or at least understand that for a tree the leaves are generally not below the trunk — trees don't grow downwards, they grow upwards. So it needs to reason about something implicitly. It's not super clear what exactly it needs to do, but the task empirically works very well.

Has this only been tried as a discrete classification task, or has it been tried on a continuous scale of rotation angles?

You can do both versions. You can increase the number of bins, and as you make that very large, you approach a regression problem over a continuous variable. In practice, these four angles work pretty well; increasing the number gives marginal benefits.

And there's a question about the previous slide: how do the nearest neighbors work in this context? Do you run each patch through the CNN?

Yes — this is just for visualization, just for understanding. Although it is expensive to compute, it gives you a very good idea of what the representation has learned. What the authors did was extract a bunch of patches from each image, on a small set of images — I think in this case 50,000 to 100,000 images — and then compute nearest neighbors over the patches of those images. Is that clear? Yeah, that's it.

Another task which is also fairly popular is called colorization. In this case, given a grayscale image, you try to predict the colors of that image. You can formulate this task for any image: take an image, remove its color, and ask a network to predict the color from this black-and-white, grayscale version.
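Here is a minimal sketch of one possible colorization objective. For simplicity it regresses RGB values from a luminance channel; the actual colorization work instead classifies over quantized color bins in Lab space, so treat this purely as an illustration. `colorizer` is an assumed name for any network mapping a 1-channel input to a 3-channel output.

```python
# Sketch of a colorization pretext objective (regression variant, for illustration).
import torch
import torch.nn.functional as F

def colorization_loss(colorizer, rgb):
    # Luminance as a stand-in for the L channel of Lab color space.
    weights = torch.tensor([0.299, 0.587, 0.114], device=rgb.device).view(1, 3, 1, 1)
    gray = (rgb * weights).sum(dim=1, keepdim=True)   # (B, 1, H, W) input
    pred = colorizer(gray)                            # (B, 3, H, W) predicted colors
    return F.mse_loss(pred, rgb)
```

Note, as the discussion below points out, that a plain regression loss like this averages over the plausible colorizations and tends to produce desaturated predictions.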
This task is actually useful in its own right, in some respects. A lot of old movies — movies shot in, say, the 30s or 40s, before there was much color technology — can be colorized this way. So in some ways it's more useful than other pretext tasks. And why does this task learn something meaningful? Well, the network needs to recognize that trees are green; it needs to understand what the object categories are in order to color them well. In practice, this has now been extended to the video domain, and there is a lot of follow-up work on colorization itself.

It's interesting, because I think the color mapping is not deterministic, right? There are several possible true solutions?

It's not deterministic, yes. The initial paper proposed a deterministic mapping: you were solving either a classification or a regression problem, so you could only ever predict, say, a blue-colored umbrella and never a gray-colored one. What ended up happening was that for categories which appear in many different colors — for example, a ball that can be red, blue, or green — the network would predict gray, because that is the mean of all of these options; that's the best it can do. There was follow-up work from David Forsyth's group at UIUC which used variational autoencoders, so you had a latent variable and could produce diverse colorizations. With approaches like that, you can actually get a green-colored ball, and because you're doing it for the entire scene, you can get consistent colorings of the whole scene.

Yeah, that's what we've been discussing with Yann: whenever we have a mapping that goes from one to many, we should use a latent-variable model, which allows us to produce multiple solutions given the same input.

Right, right. I think the reason people did not focus a lot on that aspect at the time was, first, that it was not clear what was working and what was not; and second, this was still a pretext task, and people were more concerned about the representation quality than the colorization quality. But I think now a lot of us understand that the two are fairly tied to one another — you really do need this non-deterministic mapping to get more out of the data.

I see, thanks.

And finally — I apologize for this picture; it's from the paper and I think it was low resolution — another task is context autoencoders. The idea is borrowed pretty much from word2vec: you hide a part of the image, and given the surrounding parts, you need to predict what was hidden. It's really a fill-in-the-blanks task. And why should this work? Well, the network is at least trying to reason about what objects are present: cars run on roads; buildings consist of windows and, closer to the ground, doors; and so on.
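A minimal sketch of this fill-in-the-blanks objective might look as follows. The fixed central mask and the names are my own simplifications, and the context-encoder work also adds an adversarial loss, which is omitted here.

```python
# Sketch of a context-autoencoder (inpainting) objective.
import torch
import torch.nn.functional as F

def inpainting_loss(inpainter, images, mask_frac=0.25):
    """Hide a central region of each image and penalize reconstruction error there."""
    B, C, H, W = images.shape
    mh, mw = int(H * mask_frac), int(W * mask_frac)
    top, left = (H - mh) // 2, (W - mw) // 2

    masked = images.clone()
    masked[:, :, top:top + mh, left:left + mw] = 0.0   # hide the central region

    pred = inpainter(masked)                           # reconstruct the full image
    # Only the hidden region contributes to the loss.
    return F.mse_loss(pred[:, :, top:top + mh, left:left + mw],
                      images[:, :, top:top + mh, left:left + mw])
```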
By performing this fill-in-the-blanks task, the network needs to learn something about the implicit structure of the data.

That was all about images. Now I'll talk about the tasks you can do with video. In video, the main source of supervision is the sequential nature of frames: frames have an inherent order, and you want to use that order to get something — for example, predict the order of frames, or fill in blanks, or a bunch of other pretext tasks that all depend on sequentiality.

Here I'll talk about one of the works I did in 2016, which was about predicting the temporally correct or incorrect order of frames. This is very much inspired by earlier work that Yann and others did on sequential ordering of frames through contrastive learning, and I'll talk about those towards the end, when I get to contrastive learning. In this work, we were again inspired by the pretext-task style, and we posed a binary classification problem: from a video we extract three frames; if we extract them in the right order, we label the triplet 1, and if we shuffle them, we label it 0. So now the network solves a binary classification problem: predict whether the frames are shuffled or not. The reason this works is that, given three frames — think of them as start, middle, and end — the network really tries to learn, given the start and end points, whether the middle point is a valid interpolation between them. So it tries to interpolate these features smoothly given the visual input.

The network is fairly straightforward: it's a triplet Siamese network. You have three frames; you feed each one forward independently, concatenate the features you obtain from the three frames, and then perform the binary classification — correct or incorrect, shuffled or not shuffled. You can minimize this with a cross-entropy loss and train the entire network end to end.

Again, as I mentioned earlier, nearest neighbors are a good way to visualize what these networks are learning. We followed prior work and looked at the nearest neighbors of frames: on the left-hand side you have a query frame; you feed it forward, get a feature, and look at the nearest neighbors in that feature representation. We did that for ImageNet, Shuffle & Learn, and random features, and what you observe is a very stark difference between what the three give you. In the first row — the gym scene — ImageNet is really good at figuring out that it's a gym scene. The nearest neighbor it retrieves looks quite different from the query image: the floor is much better lit, whereas in the query the floor was actually black, and the exact exercise being performed is not the same. But ImageNet is really good at collapsing this entire semantic category and bringing various different gym scenes close together in the representation space. The same goes for the row below: you have an outdoor scene, and ImageNet immediately picks up on the outdoor aspect of it.
It figures out that there is grass and so on, and it brings those two points together in feature space. If you look at the right-most column — the nearest neighbors retrieved by the random network — you see that it really focuses on color. In the top row it focuses on the black floor; it's looking at the black color in the image, and that's how it retrieves its nearest neighbor.

If you look at Shuffle & Learn, the nearest neighbors are fairly odd at first: it's not immediately clear whether it's focusing on the color or on the semantic concept. On further inspection, after looking at a lot of these examples, we figured out that it was really looking at the pose of the person. In the top row, the person is upside down, and so is the person in the retrieved nearest neighbor. In the second row, the person has their feet positioned in a particular way, and the nearest neighbor matches that while largely ignoring the scene — it's not really focused on the background.

When we thought about why a network would do something of this sort, we went back to our pretext task: predicting whether frames are in the right order. To do this, you really need to focus on what is moving in the scene — in this case, the people. If you focus on the background, you'll never answer this question well, because the background barely changes between three frames taken close together in a video; the only thing that changes is the person, or whatever is moving. So somewhat accidentally, we trained a network that looks at things that move, and it ended up focusing on the pose of people.

Now, of course, this was my interpretation, and we wanted to verify it quantitatively. So we took our representation and fine-tuned it on the task of human keypoint estimation. In this task, given a human, you need to predict where certain keypoints are. The keypoints are predefined — the nose, the neck, the left and right shoulders, the elbows, the wrists, and so on — and you train a network to predict them; this is really useful for something like tracking or pose estimation. We took our Shuffle & Learn self-supervised method and fine-tuned it on two datasets, FLIC and MPII, and did the same with an ImageNet-supervised network. This was back in the days of AlexNet, so AlexNet was the architecture we used. Fairly surprisingly, we found that the self-supervised representation was very competitive with — or even slightly better than — the ImageNet-supervised representation at keypoint estimation. The metric here is AUC, the area under the curve, so higher is better. And you can see it performs fairly well, which was very surprising to us, because we hadn't thought about this task at all when designing the pretext task — we really thought that solving it would help us understand actions better.
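Before moving on, here is a minimal sketch of the Shuffle & Learn setup described above: a shared-weight encoder applied to three frames, with a binary shuffled/not-shuffled head. The class and argument names are mine; the original work used AlexNet towers.

```python
# Sketch of the Shuffle & Learn triplet Siamese network.
import torch
import torch.nn as nn

class ShuffleAndLearn(nn.Module):
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder                        # shared-weight frame encoder
        self.classifier = nn.Linear(3 * feat_dim, 2)

    def forward(self, f1, f2, f3):
        # Each frame is fed forward independently through the same encoder.
        feats = [self.encoder(f) for f in (f1, f2, f3)]
        return self.classifier(torch.cat(feats, dim=1))   # (B, 2) logits

# Training data: (start, middle, end) triplets are labeled 1 (correct order);
# shuffled triplets such as (start, end, middle) are labeled 0.
# Train end to end with cross-entropy.
```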
But it turns out you can get surprising outcomes depending on what you end up creating as a pretext task — and in this case, that was pose estimation.

For this example, you said you fine-tuned on human keypoint estimation. Is that a supervised step, once you have your pretext representations?

Yes. The pipeline generally goes: you do a pre-training step — that can be, say, ImageNet supervision, predicting 1,000 classes — and then you have a downstream task where you have a small amount of labels; in this case, predicting the human keypoints. This way of evaluating takes a bunch of pre-trained networks and fine-tunes them using the same supervised data at the end. So what you're evaluating is how well you do at keypoint estimation if you start from, say, an ImageNet-supervised network versus a Shuffle & Learn network.

Okay, thank you.

Isn't it strange that it did this well, since Shuffle & Learn focuses on the background?

It actually focuses a lot on the foreground — that's what I was trying to show with the nearest-neighbor example: it's looking at the upside-down person to find its nearest neighbor. And the reason is that to reason about the ordering of frames, the network needs to focus on things that move, and in these videos, people are the things that move. If it focused on the background, it would not be able to solve the Shuffle & Learn task. So this was surprising, and it goes to show that if you design your pretext task well, it will work well for a certain set of downstream tasks.

There have been fairly nice methods since then that also use sequentiality and predict whether things are in the correct order. One is Odd-One-Out networks: rather than solving a binary classification problem, the network tries to predict which of the frames is the odd one out — the one that is shuffled. Because you're increasing the amount of information predicted at the output, this network ends up doing better and better, and it also reasons about more frames at a time.

So far you've seen images and video. There has also been a lot of creative work in the multimodal setting, where you have two modalities or two sensory inputs — video and sound — and some very nice work has come out of this regime. The key signal in these works is predicting whether an image, or a video clip, corresponds to an audio clip. The way you construct the task is to take a video, sample any frame from it, and similarly take an audio track and sample any part of it; the problem is to predict whether the two correspond. Given a video of, say, a drum, you can sample a frame and the corresponding audio and call that a positive; then you take a frame from a different video, pair it with the audio from the drum video, and that becomes your negative.
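A minimal sketch of this correspondence training, with assumed names throughout: positives pair a frame with audio from the same clip, and rolling the audio within the batch is one simple way to produce mismatched negatives.

```python
# Sketch of audio-video correspondence training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCorrespondence(nn.Module):
    def __init__(self, vision_net, audio_net, embed_dim=128):
        super().__init__()
        self.vision_net, self.audio_net = vision_net, audio_net
        self.head = nn.Linear(2 * embed_dim, 2)    # fuse, then binary classify

    def forward(self, frames, spectrograms):
        v = self.vision_net(frames)                # (B, 128) visual embedding
        a = self.audio_net(spectrograms)           # (B, 128) audio embedding
        return self.head(torch.cat([v, a], dim=1))

def av_loss(model, frames, specs):
    # Matching pairs are labeled 1; rolling the audio by one within the batch
    # pairs each frame with audio from a different clip, labeled 0.
    B = frames.size(0)
    logits = torch.cat([model(frames, specs),
                        model(frames, torch.roll(specs, shifts=1, dims=0))])
    labels = torch.cat([torch.ones(B, dtype=torch.long, device=frames.device),
                        torch.zeros(B, dtype=torch.long, device=frames.device)])
    return F.cross_entropy(logits, labels)
```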
So again, you solve a binary classification problem over these positives and negatives. The architecture is fairly straightforward: you take the image and pass it through a vision subnetwork; you take the audio and pass it through an audio subnetwork; you get 128-dimensional features — embeddings — for each; then you fuse them and solve a binary classification problem: do these correspond or not? At the end of it, the network is just solving a single binary problem.

What this shows is that you can do a bunch of nice things when you train networks this way. You can answer the question: what is making the sound? Because to predict whether the sound is coming from this video, the network also needs to identify what in the video might be making the sound. If it's the sound of a guitar, it needs to understand roughly what a guitar looks like; if it's a drum, it needs to roughly identify what a drum is. In this particular case, the authors looked at visualizations for two instruments, a piano and a flute. Using just the video information and nothing else, the network already puts a very high visual importance on the piano and on the flute, because when you feed forward this image, it knows these are the kinds of objects that can produce sounds. So it learns to identify these kinds of objects automatically.

About the slide before — for the convolutional net over the spectrogram, do you know what the kernel size is for the audio component? I'm interested in whether it makes sense to have a rectangular versus a square kernel.

These are square kernels. There are now more improved models; this one operates on the log spectrogram, so in some ways it's still handcrafted — you need to decide exactly how you compute that spectrogram. People have since figured out that you can use raw audio and apply convolutional filters directly on the raw signal. And the window is generally small; it depends on the corresponding video you're using — roughly a second's worth of audio and a second's worth of video.

Gotcha. Cool.

Now that I've shown you multiple creative ways of defining a pretext task, let's try to see what a pretext task learns — and, if I give you 25 different pretext tasks, how you can decide a priori which one you want to use and what each one is going to learn. The first thing is that pretext tasks are actually complementary. There was a really nice paper in 2017 that looked at two of these tasks: relative position — the first pretext task I talked about, where you take two patches and predict their relative position with respect to one another — and colorization, taking a grayscale image and predicting its colors. What these authors showed is that if you train a single network to do both tasks — to predict both the colorized output and the relative position — you actually get gains in performance. Again, this is evaluated the same way I described earlier.
You have a pre-trained network and you evaluate it on an end task — in this case, ImageNet classification and a detection benchmark. In both cases you get gains by performing both tasks: you get the best of both worlds. In some ways, this also shows that a single pretext task may not be the right answer: predicting just color, or just relative position, may not be the right way to learn self-supervised representations.

In fact, if you reason about how much information is being predicted, it varies a lot across tasks. Starting with the relative position task: you're predicting fairly little information — just eight possible locations, an eight-way classification problem. For the Shuffle & Learn problem, you're predicting whether frames are shuffled or not — a simple binary problem, so again very little information. Whereas on the extreme right, if you're trying to predict what is missing in an image and reconstruct the pixels, you're predicting a lot of information: that entire hidden box can have very different appearances, and with n pixels there are a huge number of possible values for the predicted region. So this is one simple way of thinking about pretext tasks: how much information are you predicting? That can already give you a good idea — if you're predicting a lot of information, the representation is probably going to be better.

In general, this is going to guide the next part of my talk. You can put this "predicting more information" idea on an axis and divide methods into categories along it. Pretext tasks, which I've been talking about until now, solve simple classification problems — different degrees of rotation and so on. Then I'll move to contrastive methods, which predict much more information than these pretext tasks. In this talk I'm not going to cover generative models, but generative models predict even more information than a typical contrastive method. So this is one way of thinking about these classes of methods.

Question: how do we train multiple pre-training tasks? Do we shuffle data for both tasks? If trained individually, won't it lead to catastrophic forgetting?

Right, so the simple way of doing it is to alternate batches. You have the same network; in one batch you feed it black-and-white images and ask it to predict the colors, and in the next batch you feed it patches and ask it to do the relative position task. You have two different heads — fully connected layers — at the top, and you alternate between the tasks. What the authors of the paper actually did was slightly more sophisticated: they had a multi-task network with three or four branches, depending on the number of pretext tasks, and they solved all of the tasks at once, with more involved weight sharing across the different task networks.
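Here is a minimal sketch of the alternating-batch scheme just described: one shared trunk with a separate head per pretext task. For concreteness it uses two classification-style heads and glosses over each task's input pipeline (relative position, for instance, actually consumes patch pairs); the paper's real scheme involved more elaborate weight sharing.

```python
# Sketch of multi-task pretext training with alternating batches.
import itertools
import torch.nn as nn
import torch.nn.functional as F

class MultiPretext(nn.Module):
    def __init__(self, trunk, feat_dim):
        super().__init__()
        self.trunk = trunk                                 # shared feature extractor
        self.heads = nn.ModuleDict({
            "relative_position": nn.Linear(feat_dim, 8),   # 8-way task
            "rotation": nn.Linear(feat_dim, 4),            # 4-way task
        })

    def forward(self, x, task):
        return self.heads[task](self.trunk(x))

# Alternate tasks batch by batch so neither head (nor the trunk) forgets:
#   tasks = itertools.cycle(["relative_position", "rotation"])
#   for task, (x, y) in zip(tasks, loader):
#       loss = F.cross_entropy(model(x, task), y)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
```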
Got it.

Hi, I have a question. About the pretext tasks: what performance should we aim for on a pretext task? When do we know that it's enough, or when can we stop — since ultimately we care about the performance on the downstream task? That's question one. And question two: you were speaking about low information versus more information. In the case where you were predicting whether a sequence is in the correct order or not, you could also have predicted the actual permutation of the frames, right? So how do you decide which task to use, and based on what?

Okay. The second part of the question is actually coming up in a couple of slides, so I'll defer that one. For the first part — how long do you train the model on a pretext task? A good sign of a pretext task is that as your accuracy on the pretext task improves — as you get better at predicting whether things are shuffled, or at predicting rotations — your accuracy on the downstream semantic task also improves. So a good rule of thumb is to make the pretext task as difficult as possible and then optimize it — reduce the loss on the pretext task — so that your final downstream accuracy improves. The two are very correlated.

Right. So in practice, you'd train the entire pipeline each time — the pretext and the downstream — and measure the performance? It's not that you stop the pretext at a certain point and then switch over to only the downstream? I guess when you're developing, you'd run this pipeline multiple times.

These methods are trained like this: you do your pretext task, then you stop, and then you perform your downstream evaluation task, which gives you the final measurement of how good your pretext task was. And that's it — you do this entire thing once.

All right, thank you.

And about the second part of your question — the "more information" part, the permutations and so on — I'll come to it now. These are the three main buckets, and the first two are what will be covered from here on. This was another work we did, about scaling self-supervised learning. In this work we focused on two problems. One was the colorization problem I talked about earlier. The second is a "more information" variant of the relative position task, called jigsaw puzzles. The idea is that you take an image, split it into multiple patches, shuffle the patches by a permutation, and then predict exactly which permutation was applied to the input — which is very similar to what Shreyas was suggesting earlier. The way you solve this problem is: you take the patches — say, in this illustration, three of them — feed each one forward independently, concatenate their features, and then classify which permutation was used to permute the input patches. The authors used nine patches, which gives nine factorial — 362,880 — possible permutations.
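A minimal sketch of constructing one jigsaw training example is below. The names are illustrative, and — as explained next — `permutations` would in practice be a small pre-sampled subset of the 9! orderings, with the label being the index within that subset.

```python
# Sketch of jigsaw-puzzle example construction.
import random
import torch

def jigsaw_example(image, permutations, grid=3):
    """Split a (C, H, W) image into grid x grid tiles, shuffle them with a
    randomly chosen permutation, and return (tiles, label), where label is
    the index of the permutation used."""
    _, H, W = image.shape
    th, tw = H // grid, W // grid
    tiles = [image[:, r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(grid) for c in range(grid)]
    label = random.randrange(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return torch.stack(shuffled), label    # (9, C, th, tw) tiles and an int label
```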
Of course, classifying among all nine factorial permutations would mean a fully connected layer with 362,880 output neurons, which is a very large number. So in practice, the authors used a subset of permutations: say, sample 100 permutations out of the nine factorial and perform a 100-way classification. In some ways, you can view the size of this subset as the problem complexity — the amount of information you're predicting. If you predict over the full nine factorial set, you're predicting a lot of information at the output; if you subsample only two or three permutations, you're predicting very little. The problem gets harder and harder as the size of the subset increases. In this paper, we wanted to study the role of how much information you predict and how good the final representation you learn is.

In terms of evaluation, there are two ways to evaluate a self-supervised pre-trained network, and there's still a lot of debate about which is exactly the right method. The first is to fine-tune all the layers of the network: you take a downstream task — say pose estimation or image classification — and you train the network, updating all of its parameters for that task. The second is to use the network purely as a feature extractor: you run your images through it, get the feature representation, and train only a linear classifier on top of that fixed feature representation. In this particular work, we argued that a good representation should transfer with very little training, so we opted for the second option — training a linear classifier on top of a network treated as a fixed feature extractor.

There are of course pros and cons to both methods. The first method — fine-tuning all the layers — treats the self-supervised network as an initialization: if your downstream task has, say, one million images, you update the entire network on those one million images. In the second case, you train a very limited number of parameters on top of the fixed feature extractor, so it more directly measures how good the features you've learned actually are.

The other thing that is critical when evaluating self-supervised methods is to evaluate them on a bunch of different tasks. Earlier, when I talked about the Shuffle & Learn work, I only showed results on pose estimation. It did really well on pose estimation, but it actually did not do well on other tasks, like action recognition. In this evaluation, we wanted to correct that mistake and focus on multiple different tasks: image classification, few-shot learning, object detection, 3D understanding, navigation, and so on — a set of nine different tasks in all. The way we evaluate the representations is to extract fixed features, and you can extract these fixed features from different parts of the network.
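As a sketch of the second evaluation mode — the network as a fixed feature extractor with a linear classifier on top — one could write something like this (names assumed):

```python
# Sketch of frozen-feature linear evaluation.
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes):
    for p in backbone.parameters():
        p.requires_grad = False        # backbone stays a fixed feature extractor
    backbone.eval()
    return nn.Linear(feat_dim, num_classes)

# Training loop (sketch): extract features under torch.no_grad(), then optimize
# only the linear layer with cross-entropy on the downstream labels.
```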
These features can come from a layer very close to the input or from a high-level layer very close to the output, so in this way you're measuring the "semanticness" of each of the different layers. The standard thing we did for a lot of these experiments was to use an image classification task to understand what's going on. This task uses the VOC dataset, which is fairly standard for detection and classification, and the idea is to predict which of 20 classes are present in an image. An image can actually have more than one class — for example, a picture of a person with a dog contains both "person" and "dog", and the network needs to recognize both objects. So it's slightly harder than ImageNet, where you only need to identify one key object per image.

The first thing we did was verify the hypothesis that increasing the amount of information predicted results in better representations. On the x-axis, we increase the number of permutations used to train the network, going from 100 to 10,000. On the y-axis, we measure the downstream transfer performance of these pre-trained representations, using a metric called mAP — mean average precision. Because this is a multi-label classification problem, you measure the average precision for each of the 20 classes and then compute the mean; higher is better. We do this for two architectures: AlexNet, which was originally used in the jigsaw paper, and ResNet-50. What you observe is that for AlexNet, increasing the number of permutations is useful up to a certain point, but the gain is limited overall. Whereas for ResNet-50, as you increase the number of permutations, the representation quality gets better and better. Our hypothesis was that the ResNet model has enough capacity to solve a very difficult permutation problem, and by solving it, it learns much better representations that generalize to different downstream tasks.

The next thing we did was evaluate our method on object detection: identifying which objects are present in an image and drawing a box around each. You're measured on how good the box around each object is and on whether you identified all the objects in the image. Again, we used the same VOC dataset. This was the setting where we fine-tuned all the layers of the network, because that's what's standard in detection. What we observed, on two different splits of VOC, was that the jigsaw method was fairly comparable — within the margin of error — to an ImageNet-supervised network: you take the ImageNet-supervised network, fine-tune it on detection, and get a mean average precision of 70.5 or 76.2, and the jigsaw method is within the margin of error of that. This in itself shows that it had some nice semantic properties and was able to localize objects really well.
To put this in context: in computer vision, especially for semantic feature learning, object detection is considered a benchmark that is really hard to do well on. And when we published this result, it was the closest anyone had ever come to supervised pre-training in terms of detection.

Question: are pretext tasks similar to what we could achieve with transfer learning? Is it like a subset of that?

Yes — the way you evaluate these pretext tasks is by transfer learning. You perform your original pretext task and then you fine-tune on a dataset for a particular task, like detection. So the evaluation is always transfer learning.

The next task we looked at was surface normal estimation. Here, given an input image, you try to estimate the 3D properties at each pixel location — specifically, the surface orientation: the normal vector, with its x, y, and z components, at each surface point. It's a dense prediction problem, where you assign a normal vector to each location in the input. For this, we used a nice dataset created at NYU, and we measured the prediction quality of our method against an ImageNet-supervised method. We measured the median error (lower is better) and the percentage of correct predictions (higher is better). It turned out that the jigsaw pre-training task was really good in this case, and provided significant improvements over ImageNet pre-training: across multiple different splits, it easily outperformed the ImageNet-supervised pre-trained method. Again, this goes to show that evaluating a pretext task on multiple different tasks and datasets is really important to understand what is going on: somehow, jigsaw incorporates something about geometry and pixel-level information much better than ImageNet-supervised methods.

Finally, we found the Achilles' heel of this jigsaw pre-training task. To do this, we evaluated in the few-shot learning setting, where you have a very limited number of training examples and you train your classifier on just those. On the x-axis, I have the number of training examples used, going from one to 96. I'm showing curves for the jigsaw method trained on two different datasets, alongside supervised ImageNet at the top and a random ResNet-50. What you can observe is that there is a significant gap in performance between the self-supervised method and the supervised method, and that gap just does not seem to shrink as you increase the number of labeled examples. This shows that self-supervised representations, although they may be good at particular tasks like pose estimation or surface normal estimation,
still differ a lot in terms of the semantic aspects of the data they capture — because in the few-shot setting, if I give you one image and you need to say something about it, your feature representation has to be really good to solve that task.

The other way we evaluated this method was to look at what it learns at each layer. We trained linear classifiers on the representations at different layers of a ResNet-50: from conv1, the layer closest to the input, up through the outputs of the res2, res3, res4, and res5 blocks. res5 is the highest-level representation you get out of a ResNet-50, and it's on top of that representation that the jigsaw permutation-prediction task is performed. In this plot, the x-axis represents where the feature comes from — conv1 through res5 — and the y-axis again shows the mean average precision for image classification on VOC. Funnily enough, what you see is that the representation quality improves as you go from conv1 to res4 — the mean average precision steadily increases — but towards the end there's a sharp drop, from res4 to res5, which...

Is that because it specializes to the specific task?

Yes, exactly. And this was very worrying, because if you plot the same thing for a supervised network, you observe that from conv1 to res5 the representation quality always improves — and this is true for pretty much any good supervised network. Whereas for a lot of self-supervised networks — we repeated this experiment for the rotation network, for colorization, for relative position — we would always observe this sharp drop from res4 to res5. This says that the pretext task we are solving is probably not very nice, because it's not well aligned with the downstream semantic tasks we actually want to solve.

Which brings me to the next part: understanding what is missing from these pretext or proxy tasks. To recap, pretext tasks are things like predicting rotations or solving jigsaw puzzles, and if you look at the bigger picture, they are very surprising — the fact that they even work is super surprising. With pretext tasks, we have a pre-training step, which is self-supervised, and then transfer tasks like image classification or detection. And there is really a lot of wishful thinking involved — a lot of hoping that the pre-training task and the transfer task are well aligned. There is no real evidence; it's a lot of wishing really, really hard that whatever pretext task we've come up with is well aligned with our transfer task, and that solving the pretext task will do really well on the transfer task. A lot of research goes into designing these pretext tasks and implementing them really well, but it's not clear why solving something like jigsaw puzzles should teach us anything about semantics. The same holds even for the weakly supervised setting, where you predict hashtags from an image: it's not clear why predicting the hashtags of an image should help you learn a good classifier for transfer tasks.
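Going back to the layer-wise analysis for a moment: here is a sketch of how one might extract features at several depths of a torchvision ResNet-50 (whose layer1 through layer4 correspond roughly to the res2 through res5 blocks mentioned above) using forward hooks. Probe training itself is omitted; all names are assumptions.

```python
# Sketch of layer-wise feature extraction for linear probing.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)  # load SSL weights here
taps = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Global-average-pool each feature map into one vector per image.
        taps[name] = output.mean(dim=(2, 3)).detach()
    return hook

for name in ("layer1", "layer2", "layer3", "layer4"):
    getattr(backbone, name).register_forward_hook(make_hook(name))

# After a forward pass, taps["layer1"] ... taps["layer4"] hold fixed features;
# train one linear classifier per entry and compare downstream accuracy.
```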
So the question remains: how do you design good pre-training tasks that are well aligned with your transfer tasks? One way to evaluate this hope of generalization is to look at the representation at each layer; if in the last layers we do not see representations that are well aligned with the transfer task, that is a red flag telling us that maybe this pre-training task is not the right task to solve. Like I mentioned earlier, this is the pattern we get for jigsaw, and it shows that the last layers are very much specialized towards the jigsaw problem. In general, what we really want from pre-trained features is two properties. First, they should represent how images are related to one another - this goes back to the nearest-neighbor visualizations I showed earlier: they should group together images that are semantically related in some way. The second property has been the backbone of designing vision features: even before deep learning features were popular, the handcrafted features were all about invariance - being invariant to things like lighting, exact color, or exact location. So these are the two properties we really want in our pre-trained features, and there are two ways of achieving them: one is clustering and the other is contrastive learning. Both of these methods hold promise because they directly target these properties when learning representations, and I believe that's why they have now started performing so much better than the pretext tasks that were hand-designed so far. So now I'll focus on two recent works of ours that fall into this bucket of clustering and invariances. One is called ClusterFit, the other is called PIRL, and both of them will be presented at CVPR this year. The first work is ClusterFit, a method which we think is very good at improving the generalization of visual representations. Clustering is a good way to understand which images group together and which do not: by performing clustering in the feature space, you get these nice buckets of images that are related and images that are not. The main idea of the paper is extremely simple; there are just two steps, a cluster step and a fit step, as in the sketch below. You take a pre-trained network - and it does not have to be a self-supervised network; it can be an ImageNet-pretrained network, a network pre-trained using hashtags, or a self-supervised network like one trained to predict jigsaw permutations - and you extract features from it on a set of images, and of course these images have no labels. You perform k-means clustering on these features, and now, for each image, the cluster it belongs to becomes its label. In the second, fit step, you train a network from scratch - from random initialization - to predict just these pseudo-labels.
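A minimal sketch of those two steps, with assumed names (`loader`, `pretrained_net`, `make_resnet50`) and an illustrative cluster count rather than the paper's exact configuration:

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

K = 16000  # k-means cluster count, i.e. the pseudo-label vocabulary (illustrative)

# --- Cluster step: run the frozen pre-trained network over unlabelled images
# and cluster its penultimate-layer features with k-means.
features = []
with torch.no_grad():
    for images in loader:                    # `loader` yields unlabelled batches
        features.append(pretrained_net(images).cpu().numpy())
features = np.concatenate(features)
pseudo_labels = KMeans(n_clusters=K).fit_predict(features)

# --- Fit step: train a NEW network from random initialization to predict the
# cluster assignment of each image with ordinary cross-entropy. No human
# labels are involved anywhere.
new_net = make_resnet50(num_classes=K)       # hypothetical constructor, fresh weights
criterion = torch.nn.CrossEntropyLoss()
# ...then standard supervised training of new_net on (image, pseudo_label) pairs.
```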
They're "pseudo" labels because they were obtained via clustering rather than given by a human annotator. This second network is trained only to predict these cluster assignments: it takes an image and tries to predict which of the K clusters from your k-means the image belongs to. A standard pre-train-and-transfer setup - the top row - is to perform your pre-training on an objective like predicting hashtags or predicting GPS locations, and then to evaluate the feature by learning a linear probe. In the ClusterFit world, we do not touch the pre-training at all; you perform your pre-training as you were, and just insert a step in between - the ClusterFit step - where you take a dataset D and your pre-trained network and learn a new network from scratch on this data. Finally, you use this new (green) network for all your downstream tasks. The reason we believe this method works is that the clustering step captures only the essential information - which images go together and which do not - and throws away all the other information present in the original network. You keep just the inter-image relationships that were modeled by the initial network. To understand this, we performed a fairly simple experiment. We added synthetic label noise to ImageNet - just flip a bunch of image labels - and trained a network on this noisy ImageNet. Then you evaluate the feature representation of this network on a downstream task, which is again ImageNet, but a much larger version: 9,000-way classification. On the x-axis we have the amount of label noise added to the images, going from 0% to 75%, and on the y-axis we are looking at the transfer performance on the larger ImageNet-9K dataset. The pink line shows the pre-trained network: as the amount of label noise increases, its performance on the downstream task decreases. This is not surprising - as your labels become less and less reliable, your representation quality suffers, so it goes down very quickly. In the blue line, we experimented with the technique called model distillation, where you take your initial network, look at the confidences in its outputs, and use them to generate labels for a second network. Model distillation generally performs better than the pre-trained network, and you can see that all across: as the amount of label noise increases, the distilled model is much better than the original model. And finally we have ClusterFit, the green line. The ClusterFit model is consistently better than either distillation or plain pre-training, including at zero label noise, which is just the standard pre-trained ImageNet network.
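For reference, here is a minimal sketch of the soft-target loss behind that distillation baseline. The temperature value is an assumption, and the T-squared scaling is the customary correction so gradient magnitudes match the hard-label loss:

```python
import torch
import torch.nn.functional as F

T = 2.0  # softening temperature (assumed value); T > 1 spreads probability mass

def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's softened outputs."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)    # teacher's soft labels
    log_probs = F.log_softmax(student_logits / T, dim=1)   # student's predictions
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T * T)
```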
Can you elaborate on the difference between distillation and ClusterFit once more? Yes. In distillation - I'll go back to this picture - you take the pre-trained network and use the labels this network is predicting. Say the network predicts 1000 classes: you use those labels, in a soft fashion, to generate labels for your images. So if the network was trained to predict 100 different types of dogs, you take your images, get a distribution over the 100 different types of dogs, and use that distribution to train your second network. Whereas in ClusterFit, you don't care about the label space or the output space of the pre-trained network at all. You only look at the features - you don't even look at the last fully connected layer, just the features before it. Got it. So why would the softer distribution help with training? What's the intuition behind distillation? Distillation's main intuition is that a lot of images don't really fit neatly into single classes. Suppose your dataset actually had 200 different types of dogs, but only 100 of them were labeled, so for many images you had to pick just one of the labeled dogs. A softer distribution - say 0.5 this type of dog and 0.5 that type of dog - helps you discover hidden categories. Having these softer labels enriches the initial class distribution that you have. Okay, thank you. So we applied this method to self-supervised learning - the jigsaw task that I talked about earlier - and we saw surprising gains across a bunch of datasets. The jigsaw method is in the top row, and in each column you're looking at its transfer performance on a different dataset. If you apply ClusterFit to this jigsaw method, you see gains across all of these datasets, and they're fairly consistent. We performed this test on a bunch of different pre-training methods, like RotNet (predicting rotations), and again we could see fairly nice gains across the four different datasets. And surprisingly enough, ClusterFit really works on any pre-trained network: a fully supervised network, a weakly supervised network - say one trained to predict hashtags, or a weakly supervised video network - or any self-supervised network. In each of these cases we observe fairly consistent and large gains when you apply ClusterFit; it is actually able to improve the generalization of most of these methods. (I think you're dragging your microphone around, it's very noisy. Yeah, it's stuck on my laptop. Okay, all right.) The second point is that these gains came without extra data, labels, or changes to the architecture. So in some way, you can think of ClusterFit as a self-supervised fine-tuning step: you have your pre-trained network, you perform this ClusterFit step, which is completely self-supervised, and the representation quality improves. I had a question. In the slide where you showed the improvement with jigsaw, am I using ClusterFit?
So this ClusterFit is a separate thing, right? It is not using jigsaw at all; it is applied on top of the jigsaw method? Right - there is a pre-trained network from which you extract features, and in this case that pre-trained network is the jigsaw pre-trained network. You take the jigsaw pre-trained network and then you perform ClusterFit on top of it. Oh, okay, thank you. And why is ClusterFit a good idea? I think the main intuition is that when you perform the jigsaw task, the last layer becomes very much fine-tuned for that particular jigsaw task - we saw that accuracy go down. Now, when you take those features and perform clustering on them, you can think of it as reducing the amount of information: if I trained the second network to directly regress the features of the first network, I would get back the same exact network. But if I train the second network only to predict which images are grouped together by the first one, I'm predicting less information. My thinking is that clustering acts as a kind of noise-removal technique: it removes the artifacts that are specific to jigsaw from that feature space, so the second network learns something slightly more generic. All right, thanks. And that's also the reason for the label-noise experiment: there we empirically validate this hypothesis by injecting label noise, so the last layer gets noisier and noisier, and when you do ClusterFit on top of this, you again see an improvement. That is our validation of the hypothesis. I had another question. Did you measure the performance of ClusterFit on object detection? Did it perform as well, or was it just good at classification? It performs well on detection as well - there were initial experiments on detection where it does perform well. We did not push a lot on the detection side in this particular paper; we were more interested in the retrieval and linear-classification kinds of experiments. Okay, because I was thinking: if we're making these pseudo-labels, we're basically making the features amenable to a classification task instead of a detection task, so maybe we lose some of the features that jigsaw got. Right, that is possible. At least the initial experiments I ran did not seem to suggest this; there was an improvement in detection. It was minor, but then the gaps in detection performance are already so small that improvements there are generally very small in general. Okay, thank you. I had a doubt about the same ClusterFit algorithm. Will the final layer of the ClusterFit network not again become covariant with the pseudo-labels that were used to train it? It becomes less covariant. The paper has this plot - I don't have it in the slides, unfortunately - where we look at conv1 through res5, and ClusterFit is much better: the res4-to-res5 gap for ClusterFit is much smaller than it is for jigsaw or RotNet. But was res5 better than res4? It was slightly worse. On VOC classification it was better, but for other tasks, like ImageNet, it was slightly worse.
So it did not completely fix the problem. Okay, thank you. And that was part of the motivation for PIRL. PIRL was born from the hypothesis, again, that you need to be invariant to these pretext transforms. Before I get into the details of PIRL, let me talk a little bit about contrastive learning in general. (How many minutes do I have, by the way? 15 minutes, more or less. Cool, great.) Contrastive learning is a general framework that tries to learn a feature space that pulls together points that are related and pushes apart points that are not. In this case, imagine the blue boxes are related points, the greens are related to each other, and the purples are related to each other. You extract features for each of these data points through a shared network - a so-called Siamese network - and then you apply a contrastive loss function, which tries to make the distance between the blue points smaller than the distance between a blue point and a green point, or between a blue point and a purple point. In short, embeddings of related samples should be much closer than embeddings of unrelated samples. That's the general idea of contrastive learning, and of course Yann was one of the first people to propose this method, in his earlier paper with Raia Hadsell called DrLIM. Contrastive learning has now made a resurgence in self-supervised learning: pretty much all of the state-of-the-art self-supervised methods are based on it. The main question is how you define what is related and what is unrelated. In supervised learning, that's fairly clear: all dog images are related, and any image that is not a dog image is unrelated. But it's not so clear how to define relatedness in the self-supervised case. The other main difference from a pretext task is that contrastive learning reasons about a lot of data at once. Going back to my previous slide: the loss function always involves multiple images - in the first row the blue and green images, in the second row the blue and purple images. Whereas for a task like jigsaw or rotation, you always reason about a single image independently. So that's another difference: contrastive learning always reasons about multiple data points at once. Now, coming to the question of how you define related and unrelated images, you can use techniques similar to what I was talking about earlier. You can use frames of a video - exploit the sequential nature of the data, so that frames that are nearby in a video are related, and frames from a different video, or frames far apart in time, are unrelated. That has formed the basis of a lot of self-supervised learning methods in this area.
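To make the "pull related pairs together, push unrelated pairs apart" idea concrete, here is a minimal sketch of a pairwise contrastive loss in the spirit of DrLIM; the margin value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, y, margin=1.0):
    """z1, z2: (B, D) embeddings from the shared (Siamese) network.
    y: (B,) with 1 for related pairs, 0 for unrelated pairs."""
    d = F.pairwise_distance(z1, z2)              # Euclidean distance per pair
    pos = y * d.pow(2)                           # related: shrink the distance
    neg = (1 - y) * F.relu(margin - d).pow(2)    # unrelated: push beyond the margin
    return 0.5 * (pos + neg).mean()
```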
If you know of the popular method called CPC - contrastive predictive coding - it relies precisely on the sequential nature of a signal: samples that are close in time are related, and samples that are far apart in time are unrelated. There is a fairly large body of work exploiting this, in speech, in video, in text, and in regular images. Recently we've also been working on video and audio, where you say that a video and its corresponding audio are related samples, while a video and the audio from a different video are unrelated samples. Some of the early work in self-supervised learning also used this contrastive approach, and the way they defined related samples was fairly interesting: you run an object tracker over a video, which gives you a moving patch, and you say that any patch tracked from your original patch is related to it, whereas any patch from a different video is unrelated. That gives you your sets of related and unrelated samples. If you look at figure C, with the distance notation, the network tries to learn that patches coming from the same video are related and patches coming from different videos are not. In some way, it automatically learns about different poses of an object - a bicycle viewed from different angles, or different poses of a dog - and tries to group them together. For still images, a lot of work looks at nearby image patches versus distant patches. The CPC version 1 and CPC version 2 methods essentially exploit this property of images: image patches that are close by are positives, patches that are farther apart in the image are negatives, and you minimize a contrastive loss with that definition of positives and negatives. The more popular - and more performant - way of doing this is to take patches coming from one image and contrast them with patches coming from a different image. This forms the basis of a lot of popular methods: instance discrimination, MoCo, PIRL, SimCLR. The idea is what's shown in the image: these methods extract two completely random patches from an image - the patches can overlap, be contained within one another, or be completely far apart - and then apply some data augmentation, say color jittering or removing the color. These two patches are your positive pair. You extract another random patch from a different image, and that becomes your negative. A lot of these methods extract many negative patches and then perform contrastive learning: you relate the two positive samples while contrasting them against a large number of negatives.
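A minimal sketch of that construction with a softmax-based (InfoNCE-style) loss; the temperature and shapes are assumptions, where q and pos are embeddings of two augmented crops of the same image and negs holds embeddings of crops from other images:

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, tau=0.07):
    """q, pos: (D,) embeddings of the two crops of one image.
    negs: (N, D) embeddings of crops from other images."""
    q = F.normalize(q, dim=0)
    pos = F.normalize(pos, dim=0)
    negs = F.normalize(negs, dim=1)
    # Index 0 is the positive pair's similarity; the rest are the N negatives.
    logits = torch.cat([(q * pos).sum().view(1), negs @ q]) / tau
    # InfoNCE is just cross-entropy with the positive as the correct "class".
    return F.cross_entropy(logits.view(1, -1), torch.zeros(1, dtype=torch.long))
```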
So now, moving to PIRL a little: let's try to understand the main difference between pretext tasks and contrastive learning. One thing I already mentioned is that pretext tasks always reason about a single image at a time. The idea is that given an image, you apply a transform to it - say a jigsaw transform - then you feed the transformed image into a convnet and try to predict a property of the transform you applied: the permutation, the rotation, the kind of color you removed, and so on. The second thing is that the task you're solving really has to capture some property of the transform - the exact permutation or the exact rotation - which means the last-layer representations are going to covary a lot as the transform changes. That is by design, because you're really trying to solve that pretext task, but unfortunately it means the last-layer representations capture a very low-level property of the signal, things like rotation. Whereas what is expected of these representations is that they are invariant to such things: you should be able to recognize a cat whether the cat is upright or rotated by 90 degrees. Solving that pretext task imposes the exact opposite: it requires you to recognize whether the picture is upright or turned on its side. There are some exceptions in which you really do want these low-level representations to be covariant, and a lot of it depends on the task at hand - quite a few tasks in 3D want to be predictive, for instance predicting the camera transform between two views of the same object. But unless you have that kind of specific application, for most semantic tasks you really want to be invariant to the transforms applied at the input. Invariance has been the workhorse of feature learning. Take SIFT, a fairly popular handcrafted feature: the "I" in SIFT literally stands for "invariant". And supervised networks - supervised AlexNets or supervised ResNets, for example - are trained to be invariant through data augmentation: you want the network to classify different crops or rotations of an image as a tree, rather than asking it to predict exactly what transformation was applied to the input. This is what inspired PIRL. PIRL stands for pretext-invariant representation learning, and the idea is that you want the representation to be invariant to - or to capture as little information as possible about - the input transform. You have the image and the transformed version of the image, you feed both through a convnet, you get representations, and you encourage these representations to be similar. In the notation I was using earlier: the image I and any pretext-transformed version of I are related samples, and any other image is an unrelated sample.
Trained this way, the representation hopefully contains very little information about the transform t, and yes, you train it using contrastive learning. The contrastive part is: you have a feature v_I coming from the original image I and a feature v_{I^t} coming from the transformed version, and you want these two representations to be similar. In the paper we looked at two state-of-the-art pretext transforms - the jigsaw and rotation methods I talked about earlier - and we also explored combinations of these transforms, applying both jigsaw and rotation at the same time. In some way this is like multitask learning, except you're not trying to predict both jigsaw and rotation; you're trying to be invariant to both. Now, the key thing that has made contrastive learning work well in past successful attempts is using a large number of negatives. One of the good papers that introduced this was the instance discrimination paper from 2018, which introduced the concept of a memory bank. This has powered, I would say, most of the recent state-of-the-art methods, including MoCo and PIRL; they are all built on, and hinge on, this idea of a memory bank. (Can I ask you to unplug your headphones from the computer? It's very noisy, because the microphone picks up from the headphones. Is that better now? Maybe, I don't know, let's try. Okay, let's try.) The memory bank is a nice way to get a large number of negatives without really increasing the compute requirement. You store one feature vector per image in memory, and then you use those feature vectors in your contrastive learning. Let's first talk about how you would do the whole PIRL setup without a memory bank. You have an image I and the transformed image I^t. You feed both forward: you get a feature f(v_I) from the original image I and a feature g(v_{I^t}) from the transformed version - the patches, in this case. What you want is for the features f and g to be similar, and for features from any other, unrelated image to be dissimilar. Now, if you want a lot of negatives, you would want a lot of these negative images to be fed forward at the same time, which really means you need a very large batch size - and a large batch size is simply not possible on a limited amount of GPU memory. The way around that is to use a memory bank. The memory bank stores a feature vector for each image in your dataset, and when you're doing contrastive learning, rather than using feature vectors from other images in the batch, you just retrieve the features of unrelated images from memory and substitute them into the contrastive loss. So in PIRL, we divided the objective into two parts. There is one contrastive term that brings the feature of the transformed image, g(v_{I^t}), close to the representation we have in memory, m_I.
And similarly, there is a second contrastive term that brings the feature f(v_I) close to the memory representation m_I. So g is pulled close to m_I and f is pulled close to m_I, and by transitivity, f and g are pulled close to one another. The reason for separating it this way is that it stabilized training: without doing this, training would simply not converge. By splitting the objective into two terms rather than doing direct contrastive learning between f and g, we were able to stabilize training and actually get it working. The way to evaluate this is the standard pre-training evaluation setup: transfer learning, where we pre-train on images without labels. The standard way of doing this is to take ImageNet, throw away the labels, pretend it is unsupervised, and then evaluate with full fine-tuning or by training a linear classifier. The second thing we did was test PIRL's robustness to the image distribution by training it on in-the-wild images: we took one million images at random from Flickr - the YFCC dataset - performed pre-training on them, and then performed transfer learning on different datasets. I had a question about the PIRL method, about the memory bank: wouldn't the feature representations stored in the memory bank be out of date? Yes, they do go a little bit out of date, but in practice it really does not make much of a difference. There is a particular way of updating them: m_I is a moving average of the representation f, and although it's stale, that does not matter much in practice - you can keep using it. I recently read the SimCLR paper, which used a huge batch size, like 8,000 or something. With the memory-bank approach, is getting that many examples into one loss function possible? The SimCLR way of doing it really requires a large batch size, because the negatives come from the other images in the same batch. Whereas if you use something like the memory bank, you do not need a large batch size at all: you can train this with 32 images in a batch, because all the negatives come from the memory bank, which does not require multiple feed-forwards. Okay, thank you. If you're using a memory bank, then you can't backpropagate through the negative examples. Is that not a problem? It does not create much of a problem, really. That was one thing I was worried about as well, so in the initial versions we did try using a larger batch size, but when we switched to the memory bank it did not really reduce performance - a very, very marginal reduction. Okay, any intuition why that's the case? I think overall, contrastive learning is fairly slow to converge: all of these methods - SimCLR, the latest versions of MoCo, and so on - train for a very large number of epochs anyway. So the number of backward passes, the number of parameter updates you're doing, is very large in general, and missing one of them in this particular case probably does not have that much of an effect. Okay, thanks.
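Here is a minimal sketch of that two-term objective with a memory bank, under assumed shapes, negative sampling, and update rate; the real implementation differs in details such as how negatives are drawn and the exact moving-average coefficient:

```python
import torch
import torch.nn.functional as F

num_images, dim = 1_281_167, 128   # e.g. an ImageNet-sized bank of 128-d vectors
memory = F.normalize(torch.randn(num_images, dim), dim=1)  # one entry per image

def nce(anchor, positive, negatives, tau=0.07):
    """Softmax contrastive loss; the positive sits at logit index 0."""
    logits = torch.cat([(anchor * positive).sum(1, keepdim=True),
                        anchor @ negatives.T], dim=1) / tau
    return F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))

def pirl_loss(f_v, g_vt, idx, lam=0.5):
    """f_v: features of the original images; g_vt: features of the transformed
    versions; idx: dataset indices of the images in the current batch."""
    m_i = memory[idx]                                     # stale but usable
    negs = memory[torch.randint(len(memory), (4096,))]    # negatives from the bank
    loss = lam * nce(g_vt, m_i, negs) + (1 - lam) * nce(f_v, m_i, negs)
    with torch.no_grad():                                 # moving-average update
        memory[idx] = F.normalize(0.5 * m_i + 0.5 * f_v.detach(), dim=1)
    return loss
```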
Last five minutes. Cool, almost there. So, we evaluated PIRL on a bunch of different tasks. The first was object detection, again a standard task in vision, and here PIRL was able to outperform ImageNet-supervised pre-training on detection for both the VOC 07 and the VOC 07+12 datasets. It outperforms it on the strictest evaluation criterion - the AP-all metric introduced by COCO - which was already a positive sign. The second thing we looked at was evaluating PIRL on semi-supervised learning, and once again PIRL performed fairly well - better than the jigsaw pretext task. The only difference between the top row and the bottom row is that PIRL is the invariant version while jigsaw is the covariant version. In terms of linear classification, when PIRL came out it was basically on par with the latest version of CPC, and it performed well across a bunch of parameter settings and architectures. Of course, by now you can get even better performance with methods like SimCLR: the corresponding SimCLR number would be about 69 or 70, compared to PIRL's number of around 63. The other thing we looked at was how PIRL generalizes across data distributions; for this we used just the Flickr images from the YFCC dataset. PIRL was able to outperform methods trained with 100 times more data: the jigsaw row - the second row - was trained on 100 million images, whereas PIRL was trained on just 1 million, and despite that it outperforms the jigsaw method fairly easily. This again shows you the power of baking invariance into your representation rather than predicting pretext tasks. And finally, the thing I started out with: is the representation actually semantic? If you look at the representations at different layers, from conv1 to res5, jigsaw shows a drop in performance from res4 to res5, whereas for PIRL you see a nicely increasing graph, where res4 and res5 become increasingly more semantic. In terms of problem complexity, PIRL handles it very well, because it never predicts the permutation: the transforms are only used at the input, as a kind of data augmentation. So PIRL can scale to all 362,880 possible permutations of the nine patches, whereas jigsaw, because it predicts the permutation, is very limited by the size of its output space. The paper also shows that PIRL is not limited to jigsaw: you can do it with rotation, and in fact with a combination of jigsaw and rotation, and you get more and more gains as you do. So if you look at these methods - going from pretext tasks to clustering to PIRL - as you move from left to right you get more and more invariance, and you also see an increase in performance, which suggests that baking more invariance into your methods is going to be more helpful in the long term. There are some shortcomings, the main one being that we really do not understand which data transforms matter: jigsaw works really well, but it's not very clear why.
So some future work - something to spend your spare cycles thinking about - is understanding which invariances really matter when you're trying to solve a supervised task; which invariances really matter for something like ImageNet. And that's it: in short, make your tasks capture more and more information while staying as invariant as possible to the input transforms. Thank you. Hey Ishan, I had a question. These contrastive networks can't use the batch norm layer, right? Because then information would pass from one sample to the other, and the network might learn a very trivial way of separating the negatives from the positives. For PIRL, for example, we really did not observe that phenomenon at all; we did not have to do any special tricks with batch norm and were able to use it as-is. Okay. So it's not necessary for all contrastive networks to avoid batch norm - it's okay to have the batch norm layer? Yeah; for example, SimCLR and similar methods move to synchronized batch norm because they want to emulate a large batch size, so you might have to do some tweaks to batch norm. But you basically cannot avoid it, because if you completely remove batch norm, training these very deep networks is generally very hard anyway. Okay. Do you think the PIRL paper works with batch norm layers because it uses a memory bank, so the representations are not all taken at the same time? Whereas in MoCo, I think, they specifically mention not to use a plain batch norm layer, or to shuffle it across multiple GPUs. That is one difference for sure: the negatives you're contrasting against and the positive come from different time steps, which makes it harder for batch norm to cheat. Whereas for other methods, like MoCo and SimCLR, the negatives are very correlated with the particular batch you're evaluating right now. Okay, so do you have any suggestion if we are using an N-pair loss rather than a memory bank? Should we just stick to architectures like AlexNet and VGG, which don't use a batch norm layer, or is there a way to turn it off? Can you describe the setting a little bit more? Basically, I'm training on frames of videos, using an N-pair setting where I'm trying to contrast between N samples rather than two or three, and I'm worried about whether I should be using batch norm - and if not, which pre-built architectures I can use. That's tricky. The one problem with video frames is that they're fairly correlated, and in general the performance of batch norm degrades when you have fairly correlated samples, so with video that becomes more and more of a problem. The unfortunate news is that even a typical implementation of AlexNet these days will include batch norm, simply because it's much more stable to train with: you can use a higher learning rate and use it for a bunch of different downstream tasks. So I think you may still have to use batch norm. If not, you can try variants like GroupNorm, which do not depend on the batch size. Okay, that makes sense. Thank you.
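As a concrete illustration of that last suggestion - a sketch, assuming torchvision's ResNet builder, which accepts a norm_layer hook - swapping batch norm for batch-size-independent GroupNorm is a small change:

```python
from functools import partial
import torch.nn as nn
from torchvision.models import resnet50

# GroupNorm normalizes over channel groups within each sample, so its
# statistics do not mix information across a (possibly correlated) batch.
# 32 groups is a common default; every ResNet-50 stage width divides by 32.
norm = partial(nn.GroupNorm, 32)      # called as norm(num_channels) per layer
model = resnet50(norm_layer=norm)
```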
Okay, thank you so much, Ishan - lots of interesting details. I think we still have about eight minutes, and I think many people are still left in class. Any questions? Yep, I had one question, which I had also put forward in a lecture when we were discussing PIRL. It's about the loss function. Can I answer it right now? Yeah, go for it. Okay: when I read the paper, there was a probability term computed from the v_I and v_{I^t} representations of the image and its transformed version, and after getting those probabilities you used a noise contrastive estimation loss. I was a bit confused - wouldn't it have been better to just minimize the negative log of that probability? You can use both, really. The reason to use NCE had more to do with how the memory bank paper was set up. With NCE, if you have K negatives, you are solving K+1 different binary classification problems. That's one way of doing it. The other way is what is now called InfoNCE, which is really just a softmax: you apply a softmax and minimize the negative log-likelihood. I asked because the probability function looked like a softmax. Right. At the time, when I tried it out, InfoNCE actually gave me slightly worse results, and that's basically why I used NCE - those were just initial experiments. Now, when I try it, it gives me similar results, so I guess in the end it does not make that much of a difference. This is more related to the course: we are going to have a project on self-supervised learning, so I was wondering if you could give us some information on how to get a self-supervised model working - the implementation details, since this has been a lecture on the high-level ideas. How do we get it working quickly? So, there are certain classes of techniques that are much easier to get working from the get-go. For example, among pretext tasks you would look at something like rotation, because it's a very easy task to implement - you really cannot go wrong with it; there are just very few things to implement. The number of moving pieces is a good indicator. The other thing to remember is that if you're implementing an existing method, there are going to be lots of tiny details the authors mention - the exact learning rate they used, the way they used batch norm, and so on. The more of these there are, the harder it will be to reproduce, and the more things you can get wrong. The second thing to remember is data augmentation: data augmentations are really critical, so once you get anything working, try adding more data augmentations to it. Okay. And would you recommend us trying PIRL, or do you think that would be too difficult to do in one month? I'm not sure what the setting is, really, so I'm not sure I can comment on that. Okay, thanks. One more thing: did you try using momentum contrast on PIRL instead of the memory bank? I haven't. We basically moved to the end-to-end version, which is similar to what SimCLR does. The thing is, you can gather a bunch of negatives from different GPUs to increase your batch size, and that actually generally helps a lot.
I would suspect MoCo would help a lot as well - I think MoCo got improved performance over SimCLR by replacing end-to-end training with the momentum encoder. Right, though I think the numbers are still fairly similar, and there are small differences in the evaluation protocols you see across these papers. So yeah, we're planning to release a more standardized evaluation benchmark. We did that last year, but unfortunately that was in Caffe2, so we're now trying to release something in PyTorch, which will provide a lot of standardized implementations - PIRL and a bunch of these methods - and a standardized evaluation protocol for everything. All right, thanks a lot. Ishan, I had a question about self-supervised learning in general: what do you think is the state of generative methods, and have you thought about combining contrastive methods with generative methods? SimCLR, for instance, actually uses a different space - it has a projection head on top of the feature representation, and it is on that projected representation that the contrastive NCE loss is computed. So do you think having another head like that could help - one where, given a crop of the image, you also try to predict something about that crop, since you have that information because you cropped the image yourself? It is definitely a good idea; the tricky part is just that getting these things to train is non-trivial. I haven't really tried generative approaches much; in my experience they're slightly more finicky and harder to get to work. But I do agree that in the longer term, they are the thing to focus on. Thank you. Last question? No, that's it, I guess. Oh, I can actually ask a question. Yeah. This is regarding distillation, actually. You were telling me how predicting soft distributions gives a richer target. Can you elaborate on that? Because it sort of increases the uncertainty of the model, right? We go from predicting a one-hot distribution to a softer one, so there is more uncertainty. And why is it even called distillation? I sort of feel you would need more parameters to account for this richer target. Right. So the first thing is that if you train on one-hot labels, your models tend to be very overconfident in general. If you have heard of the trick called label smoothing, which is now used by a bunch of methods: label smoothing is like the simplest version of distillation. You have a one-hot vector that you were trying to predict, but rather than predicting that entire one-hot vector, you take some probability mass out of it. So instead of predicting a one and a bunch of zeros, you predict, say, 0.97 and add 0.01 to each of the remaining three labels - you add a uniform distribution over the remainder. Distillation is a more informed way of doing this: rather than uniformly increasing the probability of random unrelated classes, you use a network that was pre-trained, and is pretty good, to do it. In general, soft distributions are very useful targets because models tend to be overconfident, and training on a softer distribution is actually a slightly easier optimization problem.
So you converge slightly faster as well. Both of these benefits are present in distillation and also in something like label smoothing. Also, smooth labels allow you to represent, say, a dog-looking cat or a cat-looking dog, right? If you have a very big network that has been trained on very many samples, it will actually have a proper idea of which images are ambiguous, and so if you learn from its soft outputs, you're going to be learning more than if you were just given the one-hot label. I think we are running out of time - we were out of time half an hour ago, but this was the question-and-answer session. If there are no really urgent questions still pending, I will call it the end of the lesson. So thank you for tuning in; I'll see you tomorrow at the practical session, don't forget to come. And that was it - thank you so much, Ishan, and see you around. Take care, Ishan. Take care, everyone. Bye-bye.