All right, so welcome to class. Wednesday, the 7th of April, 9:30, New York City, live. Today we're going to have Ishan Misra talking about self-supervised learning. Ishan is a research scientist at Facebook. He's currently working on computer vision and machine learning. His research interest is in reducing the need for supervision in visual learning. He holds a PhD from the Robotics Institute at Carnegie Mellon University, graduated in 2018, and then joined FAIR. His work includes high-performance self-supervised image classification with contrastive learning and many more things, like a new way to assess AI bias in object recognition systems. So here we go. And I guess I disappear from the screen and I let Ishan talk for the rest of the lesson. Thank you for giving the lecture, Ishan. Thanks. Thank you. Thanks for having me. Thanks, Alfredo. All right, so good morning, everyone. Today I'll be talking about self-supervised learning in computer vision. So without really getting into a lot of motivation of why this is necessary (this used to be the slide I would spend a lot of time on, but I think now it has become more and more clear what the limits of supervised learning are), let's just recap it for a little bit. Getting real labels is really difficult and expensive. If you take the ImageNet dataset, which is considered one of the largest supervised datasets, it has about 14 million images in total, about 22,000 categories. And to label this dataset, it took about 22 human years. And if you think about it, the 22,000 concepts that ImageNet has is not a whole lot of concepts, because there are far more concepts in the visual world. ImageNet is just an image-based dataset, so it does not have videos; there are no temporal concepts, there are no actions really annotated in this dataset. So overall, the 22,000 concepts really capture a very small portion of the visual concepts that we're interested in. And even that took 22 human years. So clearly, labeling does not scale very well. So in the past few years, a lot of research has really gone into obtaining labels in a more semi-automatic or automatic manner. For example, there is weakly supervised learning, which takes images or videos and their associated metadata, for example hashtags, captions, or GPS locations associated with those images. And then there is another paradigm which says that the data itself has enough structure to really be of importance, and we can basically just use the data itself to learn powerful feature representations. So that brings us to self-supervised learning. A sort of one-slide definition of self-supervised learning can be that you have observed data and you basically split it into two groups: you observe one part of the data and then you try to predict some property about a hidden part of the data from this observation. And basically by setting up this sort of a prediction problem, you're able to learn fairly meaningful features. So let's look at self-supervised learning in the context of computer vision. In computer vision, self-supervised learning really started picking up a few years ago, and there has been this concept of pretext tasks. A pretext task is basically a task that you solve just to learn a feature representation. It's often not a real task. So a pretext task is just where you're taking this observed data and then you're predicting properties about this hidden data.
It's not the task you really care about, and the only reason you're solving it is because you want to learn representations. So let's take a few examples of what this means. In general, in vision, there have been a lot of pretext tasks defined for images, for video, and for video and sound, but we won't have time to cover all of them today, so I'll just talk about a few examples using images. One of the fairly popular pretext tasks for images is predicting the relative position of patches. In this task, you take two image patches, in this case a blue patch and a red patch, and you sample them randomly on an image. Now the task is: you need to predict the relative position of the red patch given the blue patch. So you take a ConvNet, a Siamese network (thank you, Yann), and we basically feed both image patches into the Siamese network, we concatenate the features, and then we solve an eight-way classification problem, predicting the relative position of the red patch with respect to the blue patch. So this is one kind of pretext task. Again, the interest over here is not to solve this relative position task; the interest is to basically use this task as a way to learn features. Another popular task was jigsaw puzzles. Here the idea is that you take nine image patches and you permute them randomly, and now the task is to classify which permutation was applied. Now, because the set of permutations that you have for nine patches is very large, nine factorial, you basically restrict the number of permutations that you apply. You just say that you have a fixed set of 1000 permutations and you're only going to sample a permutation from within that set. Another fairly popular task was predicting rotations. You take an image and you apply one of four rotations, basically zero degrees, 90, 180, or 270, and then you solve a four-way classification problem to predict which rotation was applied to the image. So if these tasks were so popular, what was missing from them, and why did we have to really change the way we look at self-supervised learning? If you think about what's happening in the pre-training stage for self-supervised learning, there is a big mismatch between what we are doing at pre-training and what we really want to transfer to. At pre-training, we are solving tasks like jigsaw or rotation, which have very little to do with the transfer tasks that we care about, which are, say, classifying images or detecting objects. And with a fairly big mismatch, this pre-training is probably not going to be very suitable; it was mostly a hope that this stuff would generalize. So one way to check this is that once we get this pre-trained network, so we take a bunch of training data, we apply the jigsaw problem, and we take the ConvNet that was trained using this, we can apply linear classifiers to the intermediate layers to figure out what the features look like at each of these intermediate layers. So what we'll do is we'll take the features from these layers, we'll learn a linear classifier to solve a particular image classification task, and then we can measure the performance at each of these layers to see what this network is doing.
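To make this probing setup concrete, here is a rough sketch of the idea in PyTorch. The layer choice, the pooling, and the single-label cross-entropy probe are illustrative simplifications, not the exact protocol used in these papers.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Backbone pre-trained with some pretext task (weights assumed to be loaded already).
backbone = models.resnet50(weights=None)
for p in backbone.parameters():
    p.requires_grad = False          # freeze: we only probe the fixed features
backbone.eval()

# Capture activations of an intermediate stage (torchvision's "layer3" is roughly res4).
feats = {}
backbone.layer3.register_forward_hook(lambda module, inp, out: feats.update(res4=out))

pool = nn.AdaptiveAvgPool2d(1)       # spatially pool the feature map
probe = nn.Linear(1024, 20)          # linear classifier; e.g. 20 classes for VOC
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():
        backbone(images)                       # fills feats["res4"]
    x = pool(feats["res4"]).flatten(1)         # (B, 1024) frozen features
    loss = criterion(probe(x), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```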
So on this plot, what I basically have is a ResNet where I took five layers and plotted the performance of each one of them, from conv1, the layer closest to the input, to res5, the layer closest to the output. And we look at the image classification performance on the VOC dataset, measured using the mAP metric, so higher is better. What we observe is that when you go from conv1 to res2, there is an improvement in performance, and the performance keeps improving as you go deeper into the network, which is to be expected, because as you move deeper into the network, the image features are probably going to get far more semantic. But there is a sharp drop in performance when you go from res4 to res5. And res5 is the layer that is closest to the jigsaw task. What this means is that at the last layer, the features that are being learned are very specific to this jigsaw task, and these specific features are not really transferring well to the image classification task. And so this brings empirical evidence to our suspicion, right? In the pre-training stage, we were solving something like jigsaw, which has little to do with classification. So the features that we learned at the last layer became very specific to this jigsaw task and would not transfer very well to image classification. So if this is a problem, what is the solution? Well, let's take a step back and try to figure out what pre-trained features should do. Pre-trained features should satisfy two sort of fundamental properties. One is that they're useful to represent how images relate to one another. If I have a picture of a tree and a picture of another tree, I should be able to figure out that these two pictures are related, that basically there is the same concept. And if I had another picture of a cat, then I should be able to figure out that this cat picture is not as related to the tree picture. And the second property is being robust to nuisance factors. If I have a tree, I should be able to recognize that tree in different lighting conditions, in different weather conditions across the year, and with a different number of leaves. So there are a lot of nuisance factors, and the features should be robust to all of these kinds of nuisance factors. So in the past few years, a popular and common principle for most of the high-performing self-supervised methods has been to learn features that are robust to data augmentation. The idea over here is that you are learning a function f, parameterized by theta; this can be parameterized by a ConvNet. And what you want is that the features produced by this network for an image I should be stable under different types of image augmentations that you apply. So if I augment the image I, I should still get back basically the same feature. And again, the reason this is useful is that when we satisfy this property, the features are going to be invariant to nuisance factors or data augmentations. And this basically means that I can recognize the same deer no matter its color, or the time of the day, and so on. So let's try to see if such an approach can even work. Let's go back to the Siamese kind of networks and try to see if we can implement this.
So we take an image and we apply two different data augmentations to it and we feed it to an encoder. The encoder is basically a Siamese network. We get features for both of these data augmentations and we try to maximize similarity. So again, we're trying to say that f_theta of I should be very similar to f_theta of the augmented I. And the similarity can be your favorite loss function: you can try to maximize the cosine similarity, or you can try to minimize the L2 distance, whatever you want. You get gradients by doing this and you can back-propagate all the way through and update your encoder. So if this is so simple, then it should just work, right? Why is there any sort of research needed for this? The problem is that if you just go by this naive approach, you fall into this trap of trivial solutions. What this network will learn to do is essentially ignore the image input and produce a constant representation for the image. This constant representation will satisfy the property that we want: f_theta of I equals f_theta of the augmented I. And why is that? Essentially, it's going to produce the same feature no matter what image you're feeding into it. So yes, the property is satisfied, but this feature is not really useful for any downstream recognition task, because it does not capture the property of how images are related to one another. It will produce the same feature for a deer image, the same feature for a tiger image, for a tree image, and so on. So essentially it's not really helping to satisfy this property of how images are related to one another. And so what we can do is categorize most of the recent self-supervised methods by the different ways in which they're trying to avoid this trivial solution.

So why do we not clip the model at res4? I mean, you can. But essentially you keep seeing the same pattern: if you clip the model at res4, and you had a ResNet that was only four stages deep, you would get the same pattern where you get poor performance at res3. So at each stage it's not really dependent on whether it's res5 or res4; the point is that the feature at the end ends up becoming very specific to the task. Yes, so a VAE is going to be helpful, but most of my presentation is not going to be about these kinds of generative models; it's going to be about fairly discriminative models. So moving on.

So most of the recent self-supervised methods can really be categorized by the ways in which they're avoiding trivial solutions. Within those methods, we can actually draw two different kinds of methods. The first is a class of methods that's going to maximize the similarity between features from image I and augmented versions of image I, and they're going to avoid trivial solutions in three main ways: either by using contrastive learning, or by using clustering, or by using distillation. And there's another class of methods that's basically going to use redundancy reduction to prevent trivial solutions. So let's look at the first class of methods, which is going to be contrastive learning. Before we start talking about details, let me just give you a sense of what it means when you're evaluating these methods.
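Before getting into those methods, here is a rough sketch of the naive similarity-maximization setup just described, and of why nothing in it rules out a collapsed, constant embedding. The encoder, projection size, and cosine-similarity loss are only illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Shared (Siamese) encoder f_theta: a ResNet trunk followed by a small projection head.
class Encoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        trunk = models.resnet50(weights=None)
        trunk.fc = nn.Identity()           # keep the 2048-d pooled features
        self.trunk = trunk
        self.proj = nn.Linear(2048, dim)

    def forward(self, x):
        return F.normalize(self.proj(self.trunk(x)), dim=1)

f_theta = Encoder()
optimizer = torch.optim.SGD(f_theta.parameters(), lr=0.05, momentum=0.9)

def naive_step(view1, view2):
    """view1, view2: two random augmentations of the same batch of images."""
    z1, z2 = f_theta(view1), f_theta(view2)
    # Maximize cosine similarity between the two views.
    loss = -(z1 * z2).sum(dim=1).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Nothing here penalizes z1 == z2 == constant, so without extra machinery
    # (negatives, clustering constraints, asymmetry, ...) this can collapse.
    return loss.item()
```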
So for most of these methods, when we're pre-training them using a self-supervised objective, we really rely on the ImageNet dataset for fair comparisons. We consider the ImageNet dataset, the subset of it with 1,000 classes, and we remove the labels. So we get about 1.3 million images without labels, and we pre-train a ResNet-50 model initialized randomly. And when we want to evaluate this representation, we take the pre-trained ResNet-50 model and we can evaluate it in two ways. We can either just train a linear classifier on top of frozen features, so this is really evaluating the feature quality, or we can fully fine-tune the network for a downstream task, so in this case we're basically seeing how good of an initialization the network provides. One of the first works we did with contrastive learning was PIRL, which stands for pretext-invariant representation learning, and I'll show you how this relates contrastive learning to pretext tasks. Alfredo told me that you guys have already seen contrastive learning in a lot of detail, so I'll try to be fairly quick about this part, but if anyone has any questions, please stop me. So in contrastive learning, you have groups of related and unrelated images. You have the light blue and blue images which are related, all the green images are related, and the purple images are related. What we first do is we take a shared network, a Siamese network, and we compute embeddings for each of these items, so we get all the embeddings on the right-hand side. And then the loss function is essentially trying to say that all the related embeddings should be close in feature space compared to the unrelated images. So I can essentially form pairs to satisfy this constraint: I can say that the distance between the blue embeddings should be less than the distance between the blue and the green embeddings, or the blue and the purple embeddings. For the case of PIRL, it's fairly straightforward how we compute these related and unrelated images. What we do is we feed image I and another transformed version of image I through the same ConvNet, and we then add a contrastive loss on top to encourage similarity. This image transform is basically going to be any pretext task that we have, for example a jigsaw task or a rotation task. And by applying this kind of data augmentation, we are learning a network that is going to be invariant to the pretext task. So the loss function, again to put it back into this slide: you have the image features and the patch features which are being compared, and you want both of them to be similar, and you want them to be far away from any random image or any other image in the dataset. So the idea is that the pretext tasks can actually be considered as another way of doing data augmentation: rather than thinking of them as something you want to predict, you think of them as something that you want to be invariant to. So let's now try to see if, by enforcing this kind of property, we are actually even learning something meaningful. In this graph, again, we are training linear classifiers to probe and figure out what the accuracy for each of the layers in a ConvNet is. So we have two networks, one trained using jigsaw and one trained using PIRL, and the only difference between them is essentially in the way that they're treating the pretext tasks.
Jigsaw is really trying to predict the permutation, whereas PIRL is trying to be invariant to the permutation. And what we observe is that the performance for the PIRL model actually keeps increasing as you go deeper into the network. It suggests that the feature is becoming more and more aligned to the downstream classification task, compared to something like jigsaw, where the performance kind of plateaus at res4 and then drops down sharply. And the reason is again that you're really satisfying a fundamental property of what you want the features to be: you want the features to be invariant to these kinds of data augmentations. And something like jigsaw, which is really trying to retain all of that information, ends up becoming not as good for transfer learning. So this is just one way of doing contrastive learning. In fact, in the past there have been multiple works which show different ways of creating these related and unrelated images, also called positives and negatives in contrastive learning. CPC-style models basically say that patches from an image which are close by should be related, and patches from the same image which are far away should be unrelated; this is how you form your positives and negatives. Another style of doing it is saying that patches from the same image are positives and patches from any other image are unrelated, or negatives. And this forms the backbone for a lot of popular methods like MoCo and SimCLR, and all of them rely on this kind of framing for contrastive learning. But why stop there? Why stop just at images? People have really come up with all sorts of creative ways to use videos, and video and audio, to define positives and negatives. For example, given a sequence of frames, you can say that frames that are close by in time are related and frames that are far away are unrelated. The same thing goes for video and audio: if you have a video and its corresponding audio, you can say that these two modalities are related, and if you get an audio from some other video, then that is unrelated. And essentially by doing this, you form your related and unrelated pairs, perform contrastive learning, and you learn feature representations. This has also been used for something like tracking. You can essentially take an object in a video frame and track it across multiple frames, and now the patches that you get from this tracking are the related patches, and patches that are coming from a different video are unrelated, so they become the unrelated patches. And so by setting up these related and unrelated patches, you can again solve a contrastive learning problem and learn a feature representation. So this is all great about contrastive learning; what is the fundamental property about it that's really preventing trivial solutions? Well, it's basically coming from this kind of objective function that we have: if you were to come up with a trivial solution, you would not satisfy this property of the contrastive loss function. So let's look at how this is happening. If you have the embeddings for the related groups, the positives, the distance between these embeddings should be smaller than the distance between unrelated embeddings, so the distance between a blue embedding and a green embedding.
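To make this kind of contrastive objective concrete, here is a rough sketch of an InfoNCE-style loss in which the two augmented views of the same image are the positives and every other image in the batch serves as a negative. The temperature and the use of in-batch negatives are common choices, not the exact PIRL formulation (which uses a memory bank).

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same B images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    # Diagonal entries are the positives (the two views of the same image);
    # off-diagonal entries act as negatives coming from other images in the batch.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```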
If you were to set all of these embeddings to be constant, you would not minimize this loss function. So by having this attraction force between the positives and a repelling force between the negatives, you're preventing a trivial solution. And good negatives are really, really important in contrastive learning. In fact, a lot of research has gone into figuring out how you can get good negatives and how that improves performance. So I'll talk about three of the standard ways of doing it. Of course, there are a lot more, but these three are also informative because they cover three fairly related self-supervised methods. The first is SimCLR. In SimCLR, the way that you are actually able to get negative examples is by creating a very large batch size. What you have is your f_theta network spread across multiple GPUs, in this case three GPUs. You feed forward your images across all these three GPUs independently, and now you get embeddings on each of these GPUs. To use negatives, you can just collect the embeddings coming from a different GPU and use them as negatives. So if you have a batch size of 1000, then you get a lot of negatives coming from different GPUs. Essentially it's a fairly straightforward way of getting negatives, because you just scale your batch size to a very large number and you collect the embeddings by spreading this batch size across multiple GPUs. So it's fairly simple to implement. Of course, the bigger drawback is that you need a large batch size to do this, so essentially you might need a large number of GPU accelerators to fit such a large batch size. The other way of doing this is to use something called a memory bank. In a memory bank, what you do is maintain a momentum-updated bank of feature activations for all the samples in your dataset. So if I had 1000 examples in my dataset, I'll have a memory bank of 1000 features. And every time I compute a forward pass, I'll update this memory bank with the embedding that I'm getting right now from the forward pass, and I'll use the memory bank's features as negatives when I'm computing my contrastive loss. So this is fairly compute-efficient, because in the way that it's implemented, you really need just one forward pass. But the bigger drawback of this method is that it's not online: you're storing these features in memory and you're only updating each of them once per epoch, which means that they get stale very quickly. It also requires a large amount of memory, because if your dataset increases from 1000 samples to 1 million samples, then you need to store features for 1 million images, and after a point it becomes harder and harder to store features for a very large dataset in memory. The third way of doing this was proposed in MoCo, which tries to use the memory bank idea but remove its constraints, namely that it's not online and that it requires storing activations for the entire dataset. To do this, you have two separate encoders: the f_theta encoder, which is the original encoder that you really want to learn, and you also maintain a moving average of its parameters, an exponential moving average, represented by f_theta_EMA.
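A rough sketch of what that momentum encoder and a small queue of negatives might look like; the momentum value and queue size are illustrative, and this is a simplification of the actual MoCo implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Query encoder f_theta; the key encoder f_theta_ema is a copy that is only updated by EMA.
f_theta = models.resnet50(weights=None)
f_theta.fc = nn.Linear(2048, 128)              # projection to the embedding space
f_theta_ema = copy.deepcopy(f_theta)
for p in f_theta_ema.parameters():
    p.requires_grad = False                    # no gradients flow into the momentum encoder

# A small FIFO store of past key embeddings used as negatives
# (much smaller than storing a feature for every image in the dataset).
queue = F.normalize(torch.randn(4096, 128), dim=1)

@torch.no_grad()
def update_ema(m=0.999):
    # theta_ema <- m * theta_ema + (1 - m) * theta, called after every optimizer step.
    for p, p_ema in zip(f_theta.parameters(), f_theta_ema.parameters()):
        p_ema.mul_(m).add_(p, alpha=1 - m)

@torch.no_grad()
def enqueue(queue, new_keys):
    # Drop the oldest embeddings and append the freshly computed keys.
    return torch.cat([queue[new_keys.size(0):], new_keys.detach()], dim=0)
```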
And so now at each forward pass, you forward the sample through your f_theta encoder and your f_theta_EMA encoder. That gives you a set of positive embeddings, and you keep a set of negative embeddings, which is going to be fairly small, much smaller than what you would have for something like a memory bank. And that basically helps you solve this contrastive learning problem: the original embedding is forwarded through f_theta, the positive embedding comes from f_theta_EMA, and the negative embeddings come from a small set of stored features that you have. So now this helps you scale the memory, because unlike the memory bank you don't need the full dataset stored; you need a very small number of features. The only additional thing that you need is an extra forward pass every time, to go through the f_theta_EMA encoder. So this kind of concludes the explanation of the contrastive methods. Now let's move on to the second way of avoiding trivial solutions, which is through clustering-based methods. To do that, let's first try to relate how contrastive learning and clustering are even related to one another, before we see how clustering is actually avoiding trivial solutions. In contrastive learning, we have these positive samples, we have these negative samples, and we are basically trying to bring together the embeddings from the positives, and we repeat this for each pair of positives. So essentially what we are doing is we are kind of creating groups in feature space. All the blue embeddings are the embeddings that I'm getting from a single sample but different data augmentations of it, and all the green embeddings are the embeddings that I'm getting from another single sample and its data augmentations. I really want all of these embeddings to be close to one another, but far away from different samples. So essentially I'm creating these little groups in the feature space when I'm doing contrastive learning. So another, more direct way of doing this is essentially to just do clustering, because clustering naturally creates groups in the feature space. So in 2020, we proposed this method called SwAV, which can be viewed as an online clustering-based method. The key idea here is, again, that we want to maximize similarity between the image I and the augmented versions of it. And to do that, all we are going to say is that the feature from image I and the feature from the augmented image I should belong to the same group. As long as they belong to the same group, I have maximized similarity. And of course, if I keep doing just this, I can actually fall into a trivial solution where everything gets assigned to the same group or the same cluster. So I prevent these trivial solutions by controlling my clustering process. Let's take a concrete example for this. You have a bunch of embeddings coming from your dataset; the blue embeddings are related, the gray embeddings are related. And you have a set of prototypes, which can be thought of as cluster centers; in this case, three cluster centers. What we want to do is compute a similarity between each of the dataset embeddings and the prototypes, and this similarity helps us figure out which cluster each single sample belongs to. So in the ideal case, what we would want is some sort of similarity that looks like this: all the blue samples get mapped to one nice cluster, the gray samples get mapped to one cluster.
And the purple samples get mapped to one cluster. So in this case, what I've done is that all the blue samples are coming together into a single group, and that group is separate from the group to which the purple samples belong. So I've basically satisfied my invariance property and my relation property. The problem is, of course, that there are lots of trivial solutions. If I'm not careful about how we're doing this, then we can get these kinds of trivial solutions where everything gets assigned to a single prototype, which means that now I've got a trivial solution or a collapsed representation. So one of the simple ways that we enforce this constraint is by imposing an equipartition constraint. The idea is that given n samples and k prototypes, we say that each prototype is going to be similar to at most n/k samples. So these prototypes, sorry, the embeddings, are going to be equally partitioned amongst my prototypes. And this really prevents the trivial solution where everything could go to just one prototype, or where one prototype ends up dominating everything else. And to do this, rather than using something like k-means (k-means does not really have this equipartition constraint), we use a clustering algorithm called Sinkhorn-Knopp, which is related to optimal transport. I won't go into the details of what this algorithm is, but you can think of it as a clustering algorithm which automatically has this equipartition constraint. So every time I perform clustering, I'm automatically guaranteed that none of the prototypes are going to be dominant and that I'm creating clusters of uniform size. So if I had n samples and I want to create k clusters, all of my clusters are going to be of size n/k, whereas something like k-means does not really guarantee such a property. So we now have this good clustering constraint; we have a way to take each of the embeddings and assign them to the prototypes. So what next? The first change we make is that rather than computing a hard assignment, a hard clustering, we make the clustering soft. So rather than saying that one sample can only belong to one prototype, we say that there is a distribution that it can satisfy over each of these prototypes. The blue sample has a soft assignment to each prototype, so you can think of this as a soft similarity: it has scores like 0.8 to the first prototype, 0.1 to the second prototype, 0.1 to the third prototype. These similarities will sum up to one, but it's a soft assignment that is telling you how each embedding is related to each prototype. And you can think of these assignment scores as codes: they're telling you how each embedding can be encoded if you were to just think in the prototype space. So to train this network, what we do is we take two crops from the image, we feed them forward through the network f_theta, and we compute two embeddings. And now we can solve the Sinkhorn, or optimal transport, problem to compute the codes given the prototypes. In the next stage, what we do is we basically just solve a soft prediction problem.
So we try to predict code number two from embedding number one, and similarly code number one from embedding number two. The idea is that if these two crops are related, and if I am invariant to data augmentation, I should be able to predict code number one from feature number two, because both of these should fall in the same group or in the same cluster. So once I'm able to solve this kind of prediction problem, I can just compute the gradients and back-propagate. And in this case, I back-propagate through the encoder, and I can actually back-propagate through the prototypes as well. So the prototypes, or the cluster centers, are actually updated online through back-propagation. And you don't really require an explicit set of negatives, so there is no contrastive learning. The way we avoided trivial solutions is just by this nice optimal transport, or Sinkhorn, way of creating the codes, which ensures that there are no trivial solutions.

Can you tell us a bit more about the prototypes?

Okay, so the prototypes are just initialized randomly at first, so you can think of them as just a bag of embeddings. And at each forward pass, what you're doing is you're taking the embedding that you get from the network f_theta and computing a similarity to each of the prototypes. So if you had, say, B embeddings in your batch and you have K prototypes, you're computing a B times K matrix. And then I perform this optimal transport algorithm, which ensures that my codes are nicely and evenly distributed across these K prototypes, and then I just solve this cross-prediction problem. Because the prototypes were used in computing the codes, I can actually back-propagate and update them using standard SGD. So it is not k-means, and if you try to do something like k-means, you can very quickly get into trivial solutions. This is the reason for using Sinkhorn, the optimal transport method: so that you have this kind of equipartition constraint.

So what is the name of the algorithm? Can you show it once again?

It's Sinkhorn-Knopp. Okay. It falls into this category of optimal transport algorithms; Sinkhorn-Knopp is just an efficient way of doing optimal transport. But generally all the optimal transport algorithms give this guarantee of equal partitioning. Okay. So now we are able to learn this feature embedding, and let's try to see what this SwAV method ends up doing. We evaluate this method by looking at transfer learning performance, and like I mentioned earlier, transfer learning can be evaluated in two different ways: you can train a linear classifier on top of fixed features, or you can fine-tune the full network. In this case, we're going to fine-tune the full network for the detection task, and we are going to train linear classifiers for image classification. So in the top row you have a supervised network. The supervised network was trained using labels on ImageNet, and then we are transferring it to different downstream tasks. You basically see that the supervised network performs really well on ImageNet, and that's kind of to be expected, because it was pre-trained on ImageNet and you're transferring it to ImageNet's validation set.
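Before going on with the transfer results, here is a rough sketch that puts the SwAV pieces together: prototype scores, a schematic Sinkhorn-Knopp step enforcing the equipartition constraint, and the swapped prediction loss. The iteration count, epsilon, and temperature values are illustrative, and the real implementation has more details.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Schematic Sinkhorn-Knopp: turn (B, K) prototype scores into soft codes
    whose prototypes are used roughly equally (the equipartition constraint)."""
    Q = torch.exp(scores / eps).t()              # (K, B)
    Q = Q / Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / K   # normalize over samples for each prototype
        Q = Q / Q.sum(dim=0, keepdim=True) / B   # normalize over prototypes for each sample
    return (Q * B).t()                           # (B, K) soft assignments

def swav_loss(z1, z2, prototypes, temperature=0.1):
    """z1, z2: (B, D) normalized embeddings of two crops; prototypes: (K, D), learned."""
    p = F.normalize(prototypes, dim=1)
    scores1, scores2 = z1 @ p.t(), z2 @ p.t()        # similarities to the prototypes
    q1, q2 = sinkhorn(scores1), sinkhorn(scores2)    # codes (targets), computed without gradient
    # Swapped prediction: predict each crop's code from the *other* crop's scores.
    loss1 = -(q2 * F.log_softmax(scores1 / temperature, dim=1)).sum(dim=1).mean()
    loss2 = -(q1 * F.log_softmax(scores2 / temperature, dim=1)).sum(dim=1).mean()
    return loss1 + loss2   # back-propagates into both the encoder and the prototypes
```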
And so, for the supervised network, there is a very large, nice overlap in the image distribution and the class distributions: the features that you learned during pre-training are really well aligned to the downstream task. Then in the second row you have other self-supervised methods, and you see that they're also transferring fairly well. The one thing to note is that on tasks like object detection, self-supervised learning is already performing far better than supervised learning. And on the third row, we have SwAV, which is really closing the gap to supervised learning on ImageNet, so it's basically coming within about a percentage point, and then on other downstream tasks it's outperforming ImageNet supervised pre-training. What this shows is that if at pre-training you're learning a generic feature representation, you can transfer to different downstream datasets which are not very well aligned with ImageNet. What I mean by that is, for something like the Places dataset, the main classification task is to identify scenes: whether this is a shopping mall, or a beach, or a church, and so on. And in ImageNet, there are very few classes which really overlap with Places. So if you do a supervised pre-training, the features end up really becoming specific to the ImageNet classes, and so they do not transfer very well to datasets like Places. When you do a self-supervised pre-training, you're learning features without really knowing the labels of ImageNet, so it makes it easier for you to learn generic feature representations. So when you transfer to datasets like Places, where there is limited overlap with the ImageNet concepts, you're actually able to perform much better compared to ImageNet supervised pre-training.

So there is a question here from Camila. How do we analyze the cost function when we have multiple clusters for embeddings which are not explicitly positive or negative in nature? I mean, how does the model figure out which is the most suitable cluster for its embedding, if you could explain this a bit?

So think about what happens at initialization. At initialization, you have these random prototypes. So when you feed forward the embeddings from two crops, they're naturally, because of the way the images are, going to be more related to each other than to the embedding that you get from another crop. If you think of this as just a random projection, the prototypes are just random feature vectors; what you're doing is taking this blue embedding and computing a random projection onto this prototype set, and if you get a green embedding, you're again computing a random projection onto the prototype set. So at initialization itself, because of how images are, the blue embedding is actually going to have a different signature, a different code, compared to a green embedding. And all you're doing is bootstrapping that signal: you're taking that signal and making it stronger and stronger. So you're enforcing that, okay, at initialization you're kind of random, but your signatures are different, and I want you to keep having very, very different signatures. So as training progresses, I want your signatures to become more and more different. Okay. Then we have another question from Raul.
My question was: if we have a high class imbalance, and then we have K partitions, so it divides into N divided by K, won't that create a problem in learning the features, because a lot of negative examples would be in one cluster?

Yes. So that's exactly why we use the soft codes rather than the hard codes. With hard codes, this is the kind of problem you get into; we actually observed that with hard codes we were getting lower performance. With soft codes, you kind of say that it's not really creating K hard classes; it's actually creating a far larger number of classes, because rather than saying that you're only similar to prototype number one and not similar to prototype number two, I can actually create a class which is 0.8 similar to prototype number one, 0.2 similar to prototype number two, and nothing to prototype number three. So I can actually create far more classes than just K. These soft codes end up giving you a far richer representation, so you become less and less sensitive to the number K. If you were doing a hard assignment, where the codes were binary, where you have to be similar to one prototype and not another, then the K value matters a lot more.

All right, both students are satisfied with your answer. And so finally, you minimize the KL divergence, right, between the two? Yes. That was it. All right.

So there were also a few advantages that we found of SwAV, one being that it actually had faster convergence than contrastive methods. The reason for this is that in SwAV, all the computation of similarities happens in the code space; you're never really comparing the embeddings directly. And because the code space imposes its own constraints, you're actually able to converge faster. For something like contrastive learning, everything happens in the embedding space, which means that you need a lot of samples to be able to converge, and the convergence in itself is slow. And it was also easy to train this model on a smaller number of GPUs, so that was another practical advantage. So before we move to the next part, let's take a look at what we have done so far for both the contrastive methods and the clustering-based methods. I made this claim initially that we are going to evaluate all of these methods on ImageNet without labels, right? So this is not real self-supervised learning. One can in fact argue that this is pretend self-supervised learning, because you're closing your eyes and pretending that there are no labels, but really there are lots of labels on ImageNet. So what does this assumption do for us? When we take ImageNet without labels, we are basically taking all of these images, and sure, we don't have any labels for them. But if you look at these images, they're very nicely curated. On the top left, you basically have all these side mirrors of a car; in the bottom right, you have this horse example; in the top right, you have the yards and the planes. So there are lots of these nicely curated images in this dataset, which means that even though we are not directly using the labels, we are really kind of using this curation process, this hand-selection process of the images. So to concretely tell you what I mean: ImageNet data is curated because the images naturally belong to these 1000 classes.
All the images contain a prominent object; that's actually how it was curated. And there is a very limited amount of clutter, because with a single prominent object there are going to be very few background concepts. And this really affects self-supervised learning; this is actually one of its hidden assumptions. There was this really nice paper on demystifying contrastive learning which brought this assumption to the fore. When you pre-train on non-ImageNet data, it really hurts performance. So take this image of a scene: such images are not really typical in ImageNet, because this is a full scene; ImageNet would just have something like a chair, a zoomed-in version of this particular chair. Now what happens when I take multiple crops from this scene? I get these four crops. And in contrastive learning or in clustering, I'm going to say that the embeddings from all four of these crops should basically be the same, which means that really I'm saying that a refrigerator embedding should be very close to a chair embedding, which is not really what we want. We wanted to recognize a refrigerator and data augmentations of a refrigerator; we didn't want refrigerator embeddings to be similar to chair embeddings to be similar to table embeddings. And compared to curated datasets, real-world images really have different distributions, right? Images are not necessarily just real photographs; they can be cartoon images, and nowadays there are lots of memes as well. It's likely that there is no single prominent object; sometimes there is no prominent object, or no object at all. So real-world data really has very, very different properties. Now, to verify whether we had fallen into this trap, that ImageNet was the dataset that everything worked on and outside of ImageNet nothing works, we decided to take the SwAV method and really try it out on large-scale data. This brings us to SEER, which is basically taking SwAV and testing it on billions of images, which are random, and these images are not filtered in any way. So over here I'm showing you three or four different models, namely the fine-tuned performance of models when transferred to ImageNet. On the top we have SEER, which is a RegNet model trained on 1.3 billion random internet images. These images are completely random, not filtered in any way; so yes, this can include things like memes, it can be scenes, it can be completely text data, it can be cartoons. We basically don't filter them. In the next row we have SwAV, which I just presented, which is a ResNet model trained on ImageNet. Next we have SimCLR, which is another modified ResNet trained on ImageNet, and then the last row is a vision transformer, which is a supervised algorithm trained on ImageNet. And when you transfer all of these, what you observe is that the SEER models are working fairly well, and they're working well across different model capacities. On the x-axis we are looking at the number of parameters, and each of the points represents a different model. So we can train models with more than a billion parameters which are going to transfer really well to ImageNet.
And all of this is happening on completely random internet images, without really looking at any labels or even curating those images in any way. The next thing we wanted to see is how much of a difference there is between using the curation or metadata and ignoring it. So all of this is self-supervised learning, but the images that we had from the internet had some kind of metadata associated with them. So what happens if we try to use that metadata? In the top row, we have a hashtag prediction model, which was pre-trained on 1 billion images, and these 1 billion images were really selected such that the hashtags align with ImageNet classes. So if I had a hashtag, for example, of a concept that is not in ImageNet, that image was filtered out. The idea is that just by doing this simple filtering process, you are creating a nice alignment between the pre-training dataset and your transfer dataset, which is ImageNet. And we trained a ResNeXt-101 model on this, which has about 90 million parameters, and you get a nice transfer accuracy of 82%, which is really nice given that you're not looking at any ImageNet image at pre-training. In the second row, we have SEER, which is also 1 billion images, but these images were not curated in any way; we don't have any filtering process associated with them, and of course we're not doing any hashtag prediction. So it's a self-supervised method. And we see that we are within one percentage point of this hashtag prediction method, which shows you the nice generalization property of self-supervised learning: you can really scale it to lots of images and you can learn fairly powerful representations. So before I move on to the next part, are there any questions for this?

Okay, seems everyone is typing here. Yeah, all right, that's great. Either they understand everything or they don't understand anything. I know. Anyhow, all right.

So coming to the next part: when I talked about contrastive learning and clustering, I presented them as two separate things, but actually there is a very simple way to combine these methods. At this year's CVPR, we have this paper which shows that you can really combine nice properties from both contrastive learning and clustering. In this case, rather than images, we were studying videos, because videos provide, as you'll see next, a very nice avenue to combine clustering and contrastive learning. So we study this audio-video discrimination task where, like I mentioned earlier, the positives are basically coming from the audio and video of the same sample. You have two encoders, a video encoder and an audio encoder. You feed the video through the video encoder and you get an embedding; you feed the audio through the audio encoder and you get an embedding. And now what you say is that both of these embeddings, which are coming from the same sample, should be close in feature space compared to any other embedding that is coming from any other sample. So it's really saying that across these two modalities of video and audio, the embeddings should be the same, or should be related. So this is straight old vanilla contrastive learning; there is no clustering happening yet. Now, to introduce clustering, we expanded the notion of positives. When I say expanded, we basically take a reference point.
So that is the point on the top right, and we compute its similarity, in the video embeddings and in the audio embeddings, to all the other samples in the dataset. So this is trying to show you that there are going to be lots of different samples when you're computing the similarity. And for samples where both the video similarity and the audio similarity are high, we just call them positives as well. You can see this as a weak way of doing clustering, because in contrastive learning you just had a very limited notion of a positive: it has to be the audio from the same sample, it has to be the video from the same sample, or in the case of images, it is basically the same image and just different perturbations of it. In this case, we are actually looking at positives which are different samples altogether, and the way we've computed these samples is just by looking at similarity in both the video space and the audio space. We call this looking at audio-visual agreement, because we are looking at samples that agree with the reference sample in both the visual similarity and the audio similarity. So what do these samples look like? On the top, we have three different references, and I'm showing you what a positive looks like, which agrees in both the visual similarity and the audio similarity; what a visual negative is; and what an audio negative is. If you take the first column, you have a person dancing, and you get a positive which is similar in both video and audio, which is also a person dancing. If you were to completely ignore the audio and just look at visual similarity, you could get someone exercising, because visually both of these concepts really look fairly similar. But if you look at the audio, it's going to be very different: someone who's dancing will dance to a particular kind of music, whereas someone who's exercising will exercise to a different kind of music, or have a different audio altogether. And if you were to just look at the audio part of it, well, that is confusing too, because someone could be fishing with just the same background music. So if you were to just use the audio to expand your set of positives, then you don't get a very good signal there either. And similarly we have a few more cases here: a moving train sounds very similar to a moving boat, but of course visually it's very different. And a moving train also maybe looks similar, because of the textures and so on, to something like a truck station, but both of them are going to sound very different. So by doing all of these things, you're actually able to expand the set of positives, and now you can combine the advantages of contrastive learning with clustering, by having this nice relation to different images and creating these groups in feature space. So this brings us to the end of clustering-based methods, and we can now either move on to distillation or take questions, but I don't see any. Okay, cool. So these distillation-based methods are again going to fall under this category of similarity maximization: we have f_theta of I, which we want to be similar to f_theta of the augmented I. It's just a different way of doing it. So you can view this as a student-teacher distillation process.
So we are going to compute an embedding from the student for the image I, and we are going to compute an embedding from the teacher for the augmented version of I, and we are going to enforce similarity between these two. And of course, if the student and the teacher were exactly identical and everything about them was exactly identical, then we would get the trivial solution. So we are going to prevent a trivial solution through asymmetry. And this asymmetry can come in different ways, and they can actually be used jointly. There can be an asymmetric learning rule between the student and the teacher, so the student weights and the teacher weights may not be updated in exactly the same way when you're doing back-propagation. And there is another asymmetry in the architecture: the student architecture and the teacher architecture are going to be different in subtle ways, just so that, again, there is kind of an asymmetry, and that helps you prevent a trivial solution. So the first method we look at is BYOL, which explicitly constructs a student-teacher framework. You have a student encoder through which you feed the image, you add another separate prediction head called a predictor, and you get an embedding from the student branch. From the teacher encoder you feed forward the image and you get an embedding directly; a predictor is not being applied. So you can see over here that there's already a difference, an asymmetry in the architecture, between the student and the teacher. Then you back-propagate, and the gradients only flow through the student encoder and not through the teacher encoder, so there is an asymmetry in the learning itself. Now, there's a third additional source of asymmetry over here, which is in the weights of the student encoder and the teacher encoder: the teacher encoder is actually being created as a moving average of the student encoder. So it's the same MoCo-style momentum encoder that is being used as a teacher. So now basically what we've done is we've created three sources of asymmetry. We have an asymmetry in the architecture between the student and the teacher; we have an asymmetry in the learning rule, which is that the gradient only updates the student and not the teacher; and then there's a third source of asymmetry, which is that the student and teacher weights are different. And by introducing these three kinds of asymmetry, we can actually prevent trivial solutions, so this will actually learn meaningful representations that won't collapse. So do you need all three sources of asymmetry? Well, it turns out that in 2020 another set of authors introduced SimSiam, which shows that you don't really need all three sources of asymmetry. In particular, they show that you don't really need a separate set of weights for the teacher network. So in this case, the student and the teacher network have the exact same weights, and all you have are two sources of asymmetry. One, the student uses this special predictor head on top, so yes, there is an asymmetry in the architecture. And second, when you're back-propagating, you only flow gradients through the student and not through the teacher. So yes, there's an asymmetry in the learning update, but you don't really need a separate set of weights for the teacher encoder.
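Here is a rough sketch of that SimSiam-style recipe: one shared encoder, a predictor applied only on the student branch, and a stop-gradient on the teacher branch. The head sizes and the cosine loss follow the general recipe, but the exact projector and predictor architectures are simplified.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

encoder = models.resnet50(weights=None)
encoder.fc = nn.Linear(2048, 256)                 # shared student/teacher encoder
predictor = nn.Sequential(                        # extra head used only on the student branch
    nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Linear(64, 256)
)

def simsiam_loss(x1, x2):
    """x1, x2: two augmented views of the same batch of images."""
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)

    # Asymmetry 1: the predictor is applied only on the "student" side.
    # Asymmetry 2: stop-gradient (detach) on the "teacher" side, so gradients
    # only flow back through the predicted branch.
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()

    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```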
So now basically you can see that with only two sources of asymmetry, you're still able to learn fairly powerful feature representations. So this covers the distillation part, which brings us to the final part of this lecture, which is going to be about redundancy reduction. Yes, so does anyone have any questions?

So we don't update the teacher? In which one? I don't know, there is a question. Vidit, can you clarify your question, please? I think it's about SimSiam. So the question is about SimSiam: yeah, we don't update the teacher, but it gets automatically updated because it shares the weights with the student. Yes. So in the forward pass you compute these embeddings through the student encoder and the teacher, and in the forward pass both of them have identical weights; in the backward pass you will not update the teacher, but then before the next forward pass you'll copy over the weights of the student anyway. So that's how it's going to keep getting updated. Makes sense. And the student is satisfied.

So this brings us to the last part of the lecture. I think we're going to have a lot of time, so I encourage folks to ask questions in general, because I don't have that much material left; I think we have about an hour left. All right, so this last objective function is not really about similarity maximization; it's about redundancy reduction. It's called Barlow Twins, by these awesome folks. The key hypothesis here is actually inspired by neuroscience, the idea being that neurons in the brain communicate via spiking codes. And because your brain has a limited amount of real estate, you can't pack an arbitrarily large number of neurons into it. And because it also has an energy constraint, you can't power the brain with an infinite amount of energy. So you have two real physical constraints on the brain, which means that naturally you expect the communication that happens between these neurons to follow an efficient communication protocol; it won't be completely inefficient. And so Horace Barlow was really inspired by information theory, which came roughly a decade before he proposed this efficient coding hypothesis. And the idea he put forward is that these spiking codes should really try to reduce the redundancy between the neurons. And if you think about it, it kind of makes sense: if you have, say, 10 neurons, you don't want all 10 neurons to encode the same exact information. If you're doing that, you're being wasteful; if all 10 neurons encode the same information about the input, then you're not maximizing what you can represent with the 10 neurons. What you would ideally want is that a subset of them focuses on one concept and another subset focuses on a different concept. So how do you take this insight and try to apply it to representation learning? At a very high level, and this is very, very roughly speaking, what you have are, say, n neurons which produce a representation that is going to be n-dimensional. This can be the channels in your ConvNet; say, for a ResNet, that could be a 2048-dimensional feature. And for each of these neurons we want two properties to be satisfied. We want the representation produced by a neuron to be invariant.
So no matter which data augmentation is being applied, the spike, or the representation produced at that neuron, should be invariant to the data augmentation applied to the input stimulus. And the second property is that it should be independent of the other neurons. Because you don't want all the neurons to capture the exact same thing, there should be some kind of decorrelation between them. So very, very roughly speaking, if you had f_theta(I), which produces this N-dimensional representation, with the square brackets indexing into this representation, then what we want is that f_theta(I)[i], the value at the same neuron i, should be the same under different data augmentations. That's the first property, invariance. And the second property is that you want the neurons to be independent, so you really don't want them producing the same output. This is not exactly mathematically precise, but roughly speaking, these are the kind of properties that we want to enforce. So to illustrate the idea: you take an image and compute two distorted, or data-augmented, versions of it. You feed each through an encoder and you get a representation. In this case, ZA and ZB are representations of the same image under different data augmentations. And let's suppose for a minute that we had three neurons, a red, a green, and a blue neuron. The first property, invariance, says that the blue neuron should produce the same value for both ZA and ZB, and the same should happen for the green neuron; each neuron should produce the same value across these different inputs. Can you go to the previous slide? There's a question about the i variable. Is i here representing a different neuron? So yes, the small i in the bracket, you can think of it as indexing into a vector, and i and j are representing different neurons. So you want the same i, the same neuron, to behave the same way for the normal image and the augmented image. But the other neurons should behave differently, have a different value, right? Yes, that's right. The other way of thinking about this is that it's also preventing a trivial solution: a trivial solution would be that all the neurons produce the same output. In that case, the second property is not satisfied; only the first property is satisfied. Makes sense. Okay. So, coming back to this, you have the invariance property, which says that each neuron should produce the same output across the two augmented views. And the second is redundancy reduction, which says that different neurons should produce different outputs; you don't want all of them representing the same thing. That's what tries to prevent the collapse. So in the implementation, the way to do this is to first compute a cross-correlation between these feature matrices. If you have a feature matrix of size B times D, where B is the batch size and D is the dimension of the feature, you compute a D times D matrix, a feature-dimension by feature-dimension matrix, which is the cross-correlation between the two views; it's essentially just the outer product of them. Once you have this cross-correlation matrix, to satisfy the properties we mentioned earlier, we want this matrix to be as close as possible to an identity matrix.
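Written a little more formally (a paraphrase, not the exact notation from the slides), with T and T' two random augmentations and C the cross-correlation matrix of the two batches of embeddings:

```latex
% Invariance: the same neuron i responds identically to two augmented views
f_\theta\big(T(I)\big)[i] \;\approx\; f_\theta\big(T'(I)\big)[i]

% Redundancy reduction: different neurons i \neq j are decorrelated, i.e.
% the D x D cross-correlation matrix of Z^A and Z^B is close to the identity
\mathcal{C}_{ij} \;\approx\; \delta_{ij}, \qquad
\mathcal{C} = \operatorname{cross\text{-}corr}\!\left(Z^A, Z^B\right) \in \mathbb{R}^{D\times D}
```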
And so the identity matrix enforces both properties: on the diagonal, you're enforcing that each neuron produces the same output across the different data augmentations, and the off-diagonal terms are telling you that different neurons should produce different outputs. And so L_BT is the loss, and it says that the cross-correlation you compute from the features should be very similar to the identity matrix. Once you have this loss, you can backpropagate, get a gradient, and update the network. The one thing to note here is that in this entire process, we have really not added any asymmetric operation. If you think of this in terms of the student-teacher model that we talked about earlier, the student and the teacher have the exact same weights, as they would in SimSiam, but both the student and the teacher are actually being updated. So there is no asymmetry in the learning update. And there is no asymmetry in the architecture: there are no extra parameters in the student which are not present in the teacher. So in this case, we've basically removed that asymmetry. In terms of math, here is what the objective function looks like. The C matrix is computed as the cross-correlation between ZA and ZB, and the loss says that this cross-correlation matrix should be very similar to the identity matrix. You can separate it into two terms. The first term looks at the diagonal values, so that's going to be (1 - C_ii) squared. That's the invariance term, saying that each neuron should produce the same output under different augmentations. And the second term is the off-diagonal term, saying that different neurons should be different, should produce different outputs. Now, why do you have that lambda parameter there? Well, the lambda parameter is a trade-off. If you think about a matrix which is N by N, there are only N diagonal entries in it. So the first term, the invariance term, just has N values inside it, while the second term has N squared minus N values, all the off-diagonal terms. So lambda is just trying to balance the contribution of these two terms, because we have a lot more redundancy reduction terms than invariance terms. Lambda is basically saying: don't try to minimize the loss by focusing only on redundancy reduction, balance both of these terms, because both properties are really important. Anthony is asking a question here. Is it correct to assume the distortions of the images are random every time? Yes, yes. Every time you compute a forward pass, basically before the forward pass, you're computing a random distortion of the images. Cool. So, so far, this loss function prevents trivial solutions in one particular way: if you had a constant representation where all the neurons produce the same output, which is a trivial solution, you would not be able to minimize this loss function. So you're preventing trivial solutions of that kind.
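Here is a minimal PyTorch sketch of that objective, paraphrasing the pseudocode shown on the slide rather than reproducing it exactly; the lambda value, the epsilon, and the exact normalization details are assumptions. Note that the embeddings are normalized along the batch before the cross-correlation is computed; the role of that step is discussed next.

```python
# A minimal sketch of the Barlow Twins loss (a paraphrase, not the official
# code; the lambda value and epsilon are assumptions).
import torch

def barlow_twins_loss(z_a, z_b, lambda_coeff=5e-3, eps=1e-5):
    """z_a, z_b: (B, D) embeddings of two augmented views of the same batch."""
    B, D = z_a.shape

    # Normalize each feature dimension along the batch (mean 0, std 1).
    # This centering step is what rules out the second kind of trivial
    # solution discussed below.
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + eps)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + eps)

    # D x D cross-correlation matrix, averaged over the batch.
    c = (z_a.T @ z_b) / B

    # Invariance term: diagonal entries should be 1.
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # Redundancy-reduction term: off-diagonal entries should be 0. Lambda
    # balances the D^2 - D off-diagonal terms against the D diagonal ones.
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()

    return on_diag + lambda_coeff * off_diag
```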
There's actually another set of trivial solutions which this won't prevent all that well. In that trivial solution, the neurons produce different outputs, so they're completely decorrelated, but each neuron's output is constant across the entire input set. So each neuron produces a different kind of output, but that output is very, very similar across a whole bunch of images. To prevent that, we simply center the ZA and ZB vectors before computing the cross-correlation. When I say center, what we're doing is kind of like a batch norm operation: you take ZA, subtract its mean, and divide by its standard deviation. This is a fairly standard centering operation. And the reason it prevents this kind of trivial solution is that if you're producing the same output feature across all the images, then when you center, you will basically get a zero matrix. If I had a matrix of N times D and everything was roughly the same, then when I subtract the mean, I'll get an entire matrix of zeros. So just by doing this kind of centering before computing the cross-correlation, we can guide the network away from this kind of trivial solution. And centering is super standard when you're computing cross-correlations in general. So there are two ways in which we prevented collapse. First, the invariance and redundancy reduction terms prevent the completely constant output across all neurons. And second, you can still have this weird kind of constant output where the neurons are decorrelated but each still produces the same feature across images, and that is prevented by the centering operation. So in this entire process, we've prevented trivial solutions without ever looking at negative samples, because at each point, when you're considering these embeddings, you're only considering the positive pairs. And we are able to prevent these trivial solutions without any kind of asymmetric learning. In SwAV, we had the Sinkhorn operator, which was non-differentiable and which prevented trivial solutions through the equipartition constraint. In BYOL and SimSiam, the distillation-based methods, we had either an asymmetry in the learning update between the student and teacher or this special predictor head. In Barlow Twins, the learning update is the same on both branches, and the encoders, the student and the teacher if you want to think about it that way, are also the same. The trivial solutions are prevented entirely by the loss function and the way the cross-correlation is computed. This makes Barlow Twins fairly easy to implement. This is basically the PyTorch pseudocode for the entire Barlow Twins method, including the data loader and the optimization step. You can implement this method fairly straightforwardly, because no asymmetric tricks are required to make it work. So the first thing that we want to do, once we have this method, is to measure its transfer performance on downstream tasks. The first thing we did was fine-tune this Barlow Twins model on the ImageNet dataset, and when we're doing this fine-tuning, we are fine-tuning using a very limited set of labels.
So we take just 1% of the ImageNet labels and fine-tune using just that 1%, or we take 10% and fine-tune using 10%. And the representation learned by the Barlow Twins method is fairly competitive with state-of-the-art methods. If you look at the top-1 accuracy when using just 1% of the data, it's fairly competitive and performs at par with state-of-the-art methods like SwAV. On the right-hand side, we are evaluating the representation in a different way: we take the representation, freeze it, and learn a linear classifier on top of it. In this case we're transferring to Places, VOC, and iNaturalist. And again, you can see that it's performing at par with state-of-the-art methods, slightly worse on certain datasets, slightly better on others. So what this shows is that it is possible to develop a simpler method that performs as well as the state of the art, and it really advances our understanding of what it means to learn representations and avoid trivial solutions, and of how many tricks are actually needed to prevent trivial solutions. So the other thing that we wanted to observe was- There are questions here. First of all, why does it perform better than other methods in low-data settings? I don't have any particular insight into why that's happening. It's really just an empirical observation; I don't think we have any reasoning for this. Most of these things are really just empirical observations. Okay, okay. On the previous slide, someone is asking: do ZA and ZB always belong to the same image? Yeah, so ZA and ZB can be considered as matrices. ZA is an N by D matrix, with N the batch size, and ZB is also an N by D matrix, and each row corresponds to the same image: the zeroth row in ZA is the same image as the zeroth row in ZB. Okay? Yeah. Okay. So the next thing to inspect was how sensitive the Barlow Twins method is to the batch size that is being used, because I did mention that we have the centering operation that we need to do before computing the cross-correlation. So let's forget about Barlow Twins for a minute and look at SimCLR to understand how batch size can actually be an important factor. When I talked about SimCLR and its contrastive learning, I said that the way you get negatives is by taking a very large batch and spreading it across GPUs. And in contrastive learning, I also said that good negatives are super important, because they really end up preventing a trivial solution and also create these nice feature clusters. So for SimCLR, if you reduce the number of samples in a batch, you're effectively reducing the number of negatives that you're using for computing your contrastive loss. So when you go from a batch size of 4,000 to a batch size of 256, you see a degradation in performance which is directly correlated to the number of negatives being used for contrastive learning. The second method to look at is BYOL, which has the distillation-based flavor, student and teacher. And in this case, it turns out that you don't really have that kind of dependence; it's fairly stable with respect to the batch size.
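As an aside, a rough sketch of a SimCLR-style (NT-Xent) loss makes this batch-size dependence concrete, since the negatives for each sample are simply the other images in the same batch. This is illustrative only, not the official implementation, and the temperature is an assumed value.

```python
# A rough sketch of a SimCLR-style (NT-Xent) loss, just to make the batch-size
# dependence explicit; illustrative only, the temperature is an assumed value.
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: (B, D) embeddings of two augmented views of the same B images."""
    B = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2B, D)

    sim = z @ z.T / temperature                             # (2B, 2B) similarities
    sim.fill_diagonal_(float('-inf'))                       # drop self-similarity

    # The positive for row i is its other view; every other row is a negative,
    # so a batch of B images yields only 2B - 2 negatives per sample.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)
```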
For BYOL, it doesn't matter whether you're using a very large batch size or a very small one. In the case of Barlow Twins, the stability is somewhere in between: it's not as sensitive to the batch size as SimCLR, and it's fairly robust, but not as robust as, say, BYOL. And in the past few weeks, we've actually pushed this to the limit and trained with an even smaller batch size of 128, and we observe that it's still robust even there. So the other thing that we wanted to study was the importance of data augmentation when you're training these methods. There's a question for the previous slide, though. Is the drop in accuracy for BT due to the use of batch norm? So, that's the reason to actually verify it at a batch size of 128. At 128, the variance of the batch norm statistics is going to be higher, but it turns out that even at 128, the performance gap is actually fairly similar to that of BYOL, so I don't suspect it's because of that. And at the right end, where you're seeing a big drop in performance at 4,000, I suspect it's more because of the optimization rather than Barlow Twins itself. The thing is, when you change batch sizes, you need to adapt a lot of the learning hyperparameters: the learning rate, the weight decay, how you're decaying the learning rate, and so on. So I suspect the bigger drop in performance at the right-hand side is not because of the algorithm, it's because of the optimization hyperparameters that we used. Also, Raul is asking, can you tell us in detail why different batch sizes create so much difference? I think you just answered this question right now. Right. And apart from optimization, certain loss functions are simply more sensitive to the batch size, like contrastive loss functions, because, especially the way it's implemented in SimCLR, the loss really relies on the batch to get its negatives. So depending on the batch size, you can see a very dramatic difference in performance, because you actually get fewer negatives. Yeah, yeah, makes sense. All right. So the next thing to study was how important the data augmentations are when you're creating these different distorted or perturbed versions. In this case again, we are studying BYOL, Barlow Twins, and SimCLR. The baseline in this graph means you're using all the data augmentations, and then at each step, as we move towards the right, we remove a particular kind of data augmentation. The first thing to observe is that SimCLR and Barlow Twins show roughly similar trends across the different data augmentations; both of them seem to be fairly sensitive to which data augmentations are being used. Whereas BYOL seems to be far more robust to data augmentation: you can remove a lot more data augmentation and the drop in performance is going to be much smaller. If you think about it, these models are working in fairly different ways. In Barlow Twins, or in SimCLR, you take the image and you feed it through exactly the same encoder.
So the Siamese network that you have has the exact same weights for both encoders, and you get a feature output from each. And what you're saying is that both of these features should be similar, via the redundancy reduction loss in the case of Barlow Twins, or via contrastive learning in the case of SimCLR. So the burden in this case is on the data augmentation: it really needs to produce very different views, so that at every step you're actually doing something different. Whereas in the case of BYOL, when you're feeding the image through the encoders, the weights are actually different for the student and the teacher. So even if you were to feed the exact same image through both encoders, you would get a very different output, because the student weights and the teacher weights differ. So naturally, you can think of it as if there is an extra amount of data augmentation coming from this moving-average encoder. Essentially, this is one of the reasons why BYOL seems to be far more robust to data augmentation compared to other methods like Barlow Twins or SimCLR. Now, the next thing that we wanted to check was whether any sort of asymmetry is actually beneficial for Barlow Twins. So far we've seen that without asymmetry we can prevent trivial solutions, but does adding asymmetry actually help us? So we tried the same kind of asymmetric ideas that are present in SimSiam: using a stop-gradient, so stopping the gradient to one branch, or adding a predictor head. And it turns out that in Barlow Twins, adding either of these does not seem to improve performance by a whole lot. In fact, adding both of them together actually seems to hurt performance. So it suggests that, given the way we have prevented the trivial solution, asymmetry is not needed at all, and adding it will probably not give you much benefit anyway. The next property to verify is the number of non-redundant neurons. This entire discussion started by saying that you have N neurons and you want all these N neurons to be different. To do this, I'll talk about a little detail in Barlow Twins which is important to understand. When you feed the image through the encoder, you get a particular feature dimension, say 2048 for a ResNet. That feature is then projected through an MLP before applying the redundancy reduction loss. This is also standard in contrastive learning: you take the 2048-dimensional feature, apply an MLP to compute a smaller embedding, and then perform contrastive learning on that. And similarly for something like BYOL, where you have a predictor. So in Barlow Twins, all the redundancy reduction is happening on the feature that is computed after the MLP. What we now want to study is whether that feature dimension matters, and how much. To do this, we vary the MLP dimensions: the MLP takes as input a 2048-dimensional feature and can produce, say, either a 256-dimensional output or a 4,000-dimensional output. So we just vary the output dimension on the right (a rough sketch of this setup follows).
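This is a minimal sketch of such a backbone-plus-projector arrangement; the layer sizes, depth, and normalization are assumptions, and the actual methods differ in these details.

```python
# A minimal sketch (assumed structure, not the official code) of a ResNet
# backbone followed by a projection MLP whose output width can be varied,
# as in the projector-dimension ablation discussed here.
import torch.nn as nn
from torchvision.models import resnet50

class BackboneWithProjector(nn.Module):
    def __init__(self, proj_dim=8192, hidden_dim=8192):
        super().__init__()
        backbone = resnet50()
        backbone.fc = nn.Identity()        # keep the 2048-d pooled feature
        self.backbone = backbone
        # Projector: 2048 -> hidden -> proj_dim; the SSL loss (redundancy
        # reduction or contrastive) is applied on the projector output.
        self.projector = nn.Sequential(
            nn.Linear(2048, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, proj_dim),
        )

    def forward(self, x):
        h = self.backbone(x)               # (B, 2048): backbone feature
        z = self.projector(h)              # (B, proj_dim): fed to the SSL loss
        return h, z
```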
And for all of the methods, we still evaluate the 2048-dimensional representation from the encoder. That shows that we are just measuring the encoder's performance and decoupling it from the MLP that is being used. On the x-axis, we have different dimensions for the projector, going from, say, 16 up to 16,000 in this case, and we again have the three methods: Barlow Twins, BYOL, and SimCLR. For Barlow Twins, the performance really improves as you increase the dimensionality of the projector. If you go from 32 to 16,000, the performance keeps improving as you increase the dimension; in fact, it has not even plateaued at 16,000, you could keep increasing it. Something like BYOL or SimCLR seems to be slightly more robust to this. For BYOL, you do see a drop in performance if you keep reducing the dimension: once you go below 128, it really starts to hurt performance a little bit. SimCLR seems to be far more robust; you can increase or decrease the number and it performs roughly the same. So why is there a different trend for these three methods? In Barlow Twins, when we're applying this redundancy reduction loss, we are really encouraging a sparser representation. We are saying that all the neurons should encode something very different. So suppose you had, say, 10 concepts to represent in an image; we are essentially saying that the neurons should capture different aspects of them. So we may need a slightly larger number of neurons, because we are enforcing this sparser representation, we want the neurons to be completely decorrelated. Whereas for something like SimCLR and BYOL, we're taking these very dense vectors, and we are not enforcing any sparsity on top of them. That's one of the reasons we suspect that Barlow Twins benefits a lot more from a higher projector dimensionality compared to the other methods. And that actually brings me to the end of my lecture. I finished a bit early, but if you have questions I'm happy to take them. What is the depth of the MLP used in the last network? Depth of the MLP in which one? Waiting for clarification. The projector. In Barlow Twins or in BYOL? So in Barlow Twins, it's two hidden layers. And in BYOL, by default, it's one hidden layer; we also tried two hidden layers and it does not make that much of a difference. So Raul is asking the following. There has been so much development in the field and you have just presented different methods. What should be the blueprint to follow if we have to run a model for an SSL task? Since the students are doing a project, I think the question is asked with that in mind. I see, so is the question that you want to pick an SSL method? What do you mean by blueprint to follow? Yeah, which models are better for which kind of tasks? Okay, so without knowing much about the dataset and what you're trying to do, it's harder to give general advice for this, but there are different pros and cons for each one of them. We now have four methods on this particular slide itself. If you talk about how these models converge, clustering-based models converge faster.
So if you have limited compute and you want things to converge very fast, then I would go with something like a clustering-based model, because it will just end up converging faster. If you care about simplicity of implementation, I would go with something like Barlow Twins, because, as I showed you, the PyTorch code is fairly simple and there are very few parameters to tweak. If you have very different modalities that you're trying to compare, like the video and audio we talked about, or if you're trying to do something like, I don't know, RGB pixels to depth, then in those cases it turns out that using something like contrastive learning is generally better, because you have two very, very different types of encoders, two very different architectures, so the optimization problem is really hard. In those cases we've generally found that using a contrastive loss is better. Yeah, one thing I want to point out. So Ishan, as you may or may not be aware, there's going to be a project for the class and it basically has to do with self-supervised learning. So students are quite interested in questions around what works best for learning features. I made this point in the class when I explained some of those techniques at a very high level, and I think your lecture was useful in terms of giving more details about exactly how this works. And I may be biased, but I like the non-contrastive methods. There is a question as to whether Barlow Twins is not actually a secret contrastive method, because it does contrastive training across features as opposed to across samples. And then BYOL, it's also very mysterious why it works, but there is sort of an implicit contrastive term due to the funny effects of batch normalization. There have been a number of publications from DeepMind and others about why BYOL works, basically removing parts and seeing when it fails. Do you have a good intuition for what makes BYOL work? I really think it's just the fact that, well, there is of course this contribution of batch norm which people say prevents collapse and so on. But I think interpreting that as a hidden contrastive method is taking it a little too far, in my opinion. Yes, you're relying on a batch statistic to whiten the data, which I would think of as preconditioning, or improving your optimization, rather than as full contrastive learning. Because in contrastive learning, you really need lots of negatives; everything really relies on thousands and thousands of negatives. Whereas if something works with batches of 128, I don't think it's fair to call it a contrastive method anymore. Now, coming to whether it can actually work with different types of normalization: yes, it can work with different types of normalization. It depends on the architecture that you're using and on careful optimization. We recently found that you can use Transformers and train something like BYOL, and in that case you're not really using batch normalization. So you can actually train these methods without batch normalization, because the architecture itself does not have it.
So there is this paper by DeepMind, right, that tries to analyze removing batch normalization at various levels in BYOL. And if you remove it in the last layer, it just completely collapses. But I think they have a trick for it, where you can use layer norm and initialize the parameters slightly differently, and in that case it tends to work. But I think more of it is probably just due to the difficulty of optimization rather than something tied to the method. So I do believe that if you were to move to different types of architectures for the student encoder and teacher encoder, this probably would go away; move to something like a Transformer, which does not have any batch norm. There's a new question coming from the students. What do you see in the future of SSL, and what is the North Star for it? I don't know what the North Star is. I think, in the future, if you look at all the image-based SSL work, it's really bootstrapping the same signal. Everyone is taking augmentations of the same image and saying that they should be the same, and that all the other images, or their augmentations, should be different. So everything on this slide, and basically everything in the past two years that's been pushing towards the state of the art, is using this signal in some way or another. And that's good, because we're understanding more and more what is necessary to make it work. But I think there should also be some effort to understand whether there are any other signals we should focus on. All of this is about augmentation invariance, and, because there is a trivial solution there, coming up with different ways to prevent the trivial solution. But is there any other signal? Is invariance the only thing we care about, or are there other interesting properties that should really be modeled in self-supervised learning? I think that's the bigger open question. Yeah, it makes sense. So I have a question actually. What do you think of the possibility of applying SSL methods, perhaps somewhat different from the ones we heard about, to video, learning features from video as opposed to still images? Because to some extent, the idea of distorting an image is just a way of building a kind of nearest neighbor graph. You have a large collection of images and you generate images that you know are similar in terms of content. And what you have in the end, what you use, is a similarity graph, a bunch of groups of images that you know are similar. Now, if you have video, you could imagine a similar thing where successive frames in a video are deemed to have similar content. Or you could imagine that one of the branches in a joint embedding architecture would take a bunch of frames and the other branch would take maybe just one frame, and you predict the next frame or some frame in the middle. Can you comment on SSL for video? So SSL for video so far has really followed, I would say, most of the time, what happens with images: video architectures and video classification tasks really mirror image tasks and image architectures. And in SSL it has really been, I would say, the same mirror effect. There are the same kinds of contrastive methods, like you talked about, where you consider frames from the same neighborhood to be close by.
You apply similar kinds of data augmentations, cropping and color distortions and so on, so all of it is basically the same thing and you're doing contrastive learning again. The only newer thing that I have seen is using audio as supervision, and that's really because when you're computing these data augmentations on a video, there is a lot of redundant information, so the contrastive task becomes fairly easy. You need to be even more aggressive when applying data augmentations, because the task here is to recognize two different clips from the same video, and if there are, say, 10 frames, all 10 frames are going to have a similar background and things are barely moving between them. So the task becomes easier. I think predictive learning is definitely one of the more interesting and open problems there, but it is really, I would say, one of the hardest problems, because video prediction, and we've talked about this multiple times, is one of the hardest things to do. But I do think it is the right way forward; I just don't think we know how to do it right now. A question coming in: for SSL, has anyone ever tried to stick a GAN in to generate noise for data augmentation? Like trying to learn some kind of noise that tweaks the encoder in the worst way possible. I see, so kind of an adversarial setting where you're learning an adversarial network to create data augmentations such that the loss is maximized rather than minimized? Yes, exactly. Yeah, so there has been some work on it that I have seen, but it hasn't become super popular. I have tried a few experiments myself and I know other folks who have tried this as well, basically trying to inject some kind of learned noise and train it along with the network. And it turns out it's very similar to the standard adversarial attacks that you get: you can't really tell what's going on, there is a super high-frequency signal being added, and it makes the loss completely go nuts. All right, so it seems like we have satisfied the curiosity of everyone in the room. Oh, okay. What's the platform to reach you, Ishan? How do they reach you? Email. Email, okay. Do we know your email? Yeah, it's imisra, my first initial and my last name, so imisra at fb dot com. Okay, very good. We'll provide the students the email in case they ask. All right. So thank you again for being with us today. It was very enlightening. The slides are very good. I saw them yesterday already, so I more or less knew what was coming, but the last one is very pretty; I hadn't seen that one. Anyway, so again, thanks for explaining and answering every question we threw at you. Looking forward to seeing you around, perhaps after the pandemic, and to hanging out a little. Yes, absolutely. Thanks for having me. Thank you. Of course. All right, bye-bye everyone. Thank you, Ishan. Thank you. Thank you again. Thanks, everyone.