So welcome to class. We are starting the lesson at the end of the lesson, but whatever, because I forgot to press record. So I'm going to redo the introduction of today's lesson, which is going to be taught by Justin, our student working with me and Yann. And what is he going to be talking about? We've been talking about EBMs a lot in this class, because of course we like EBMs, right? Yann likes EBMs, so we like EBMs as well. Let me give you a small recap of why and how these things come together.

There are two major types of architectures. The one you have seen with me in the last few lessons is the first one: the latent-variable generative energy-based models, or LVGEBMs. The second one is the joint embedding methods. Let me show you the differences and similarities. We start with the latent-variable generative EBM. We have an x, and there is a dot. You asked me before, why is it a dot? A dot is a point, so I have one point. Then I have some encoder or predictor, whatever, and I get my hidden representation in green, also a dot. The h goes inside a decoder, and the decoder is also fed with a source of variability, represented there with a straight line: z, a latent input, a missing input, can vary along that line. h is going to be just one specific value, a vector, a point. Therefore we also need to add a regularization term. Why do we need a regularization term on the latent? Because otherwise the latent becomes too powerful, and then the overall model assigns zero energy to all possible values, so it's going to be a collapsed model.

So the decoder now generates my y tilde. The y tilde varies around a manifold. How do we introduce this variation? By having that z move along a line, so this y tilde can move around as well, whereas h basically tells you, for example, the size of that ellipse. Then, as I asked you before, since we have a y tilde... I don't know how many people are with us; you can still type in the chat if you feel like joining the lesson. So we have this y tilde, my prediction. What are we missing? We need to add a spring. The spring keeps my prediction from flying away from my target. I have a target at the bottom, which also varies around this elliptical manifold, and then I have a spring in between: the C term, okay?

Then I asked you before, what is the energy of the system? The energy is going to be the summation of all these red boxes. In this case I have this E, a function of x, y, and z, which is the sum of the C term, the distance between my prediction and my target, and the regularization term R, okay?
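To make the pieces concrete, here is a minimal sketch of that energy, assuming a toy linear encoder and decoder, a squared-distance C term, and an L1 regularizer on the latent; all names and dimensions here are illustrative, not the exact model from class.

```python
import torch
import torch.nn as nn

enc = nn.Linear(10, 4)       # encoder: x -> h (a single point)
dec = nn.Linear(4 + 2, 10)   # decoder: (h, z) -> y tilde

def energy(x, y, z, lam=1.0):
    h = enc(x)                                 # hidden representation h
    y_tilde = dec(torch.cat([h, z], dim=-1))   # prediction; z moves it along the manifold
    C = (y_tilde - y).pow(2).sum(-1)           # spring: squared distance C(y, y tilde)
    R = z.abs().sum(-1)                        # regularizer R(z): keeps the latent from
                                               # becoming too powerful (collapse)
    return C + lam * R                         # E(x, y, z) = C + R
```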
So these are the generative models. Why are they called generative? Because there is a y tilde: I generate an estimate for my target y. And I showed you three types. Yesterday I showed you the architectural type, which was the undercomplete hidden layer of the autoencoder. Then I showed you the contrastive type, which was the denoising autoencoder, where we take a sample, displace it, and enforce a high energy equal to the squared distance of the displacement. Then there was the third technique, the regularized technique, which was the variational autoencoder, where we automatically assign high energy to things that are not observed.

On the right-hand side, I'm going to introduce today's topic, the joint embedding method, where we start with a point, the x. I have the encoder, then the projector, whatever, and I get this e (I use e for embedding; it's the same as h, maybe I should have called it h), the embedding for the x, so it's still a point. On the right-hand side I have my y, which moves around a manifold. The encoder perhaps just unwarps this manifold, so y moves in a linear fashion, and then I may have the projector, which collapses that variability once again into one point. So this right-hand column basically has a built-in invariance to variations on the manifold. The y has some degrees of freedom, constrained to the manifold, and those degrees of freedom are eaten away by the right column: in two stages we unwarp the elliptical shape into a line, and then the line is condensed down to one point. Finally, I have my cost term, my energy term. The free energy is free because there are no latents, so this F is the big box comprising just the C term. So F, the free energy of the system given x and y, is this cost between the two embeddings, e_x and e_y.
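In code, the joint embedding side is even simpler than the generative one, because there is no latent to regularize. Here is a minimal sketch under the same toy assumptions (a shared linear backbone and a squared-distance cost):

```python
import torch
import torch.nn as nn

backbone = nn.Linear(10, 4)   # one encoder, shared weights for both branches

def free_energy(x, y):
    e_x = backbone(x)                   # embedding of x (a point)
    e_y = backbone(y)                   # embedding of y (manifold collapsed to a point)
    return (e_x - e_y).pow(2).sum(-1)   # F(x, y) = C(e_x, e_y); no latent, hence "free"
```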
Then I told you just before starting the lesson that there are different types of training procedures. We mentioned contrastive methods, like the denoising autoencoder, where you choose points that need to have high energy. On the other side you have the architectural and the regularized methods: with the architectural ones, by making certain choices in the architecture, you implicitly constrain the amount of low energy you can give; with the regularized ones, you add a penalty for giving low energy to too many things. Then I showed you very quickly, though it was not a major thing, what these options look like. With contrastive methods, we either push up everywhere, push up at specific locations, or push from off-manifold towards the manifold, like yesterday's denoising autoencoder. On the other side we have the architectural and regularized techniques: for example, we put an upper bound on the low-energy volume, or we use a regularization term, or we minimize the gradient and maximize the curvature, okay?

That was all I wanted to say before starting today's lesson, which is about joint embedding methods, JEMs. You're going to be hearing about contrastive, clustering, distillation, and information-maximization methods (well, if you have watched the recording you've already heard about them, but you can always listen to this part beforehand). We usually call this visual representation learning: "visual" means we only care about images or videos, not natural language or speech. In general, there are two types of visual representation learning: supervised and self-supervised.

In the case of supervised visual representation learning, a lot of people call it transfer learning: you train your model on a supervised dataset, then you transfer to some out-of-distribution dataset and check the performance. But today our focus will be on the self-supervised side. There, we roughly have three categories of methods: generative models, pretext tasks, and joint embedding methods.

For self-supervised visual representation learning, there are really two steps. The first step is pre-training: the idea is to use a really large amount of unlabeled data to train a backbone network. Different methods produce the backbone network differently, but you end up with this thing, an encoder or backbone network, that takes an image and generates a representation of it. The second step is evaluation: here you use a small amount of labeled data to train a downstream task head network. There are two ways to do it. One is called feature extraction: you take the image, pass it through the encoder to generate a representation, and then use the representation to train a downstream task head. For example, once you have the encoder, you can turn every image into a vector; you could even attach a reinforcement-learning task head on top of it and do reinforcement learning in the representation space instead of the image space. The only difference between feature extraction and fine-tuning is whether you cut the gradient: in fine-tuning you actually change the encoder, whereas in feature extraction you don't. So across the different methods, only the pre-training step differs; the evaluation step has standard procedures, which makes for a fair comparison between methods.
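Here is that evaluation-step distinction in code: a minimal sketch, assuming a stand-in linear backbone and classifier head (names and sizes are illustrative).

```python
import torch
import torch.nn as nn

backbone = nn.Linear(512, 128)   # pre-trained encoder (stand-in)
head = nn.Linear(128, 10)        # downstream task head, e.g. a 10-way classifier

def predict_feature_extraction(x):
    with torch.no_grad():        # feature extraction: the gradient is cut,
        h = backbone(x)          # so the encoder never changes
    return head(h)

def predict_fine_tuning(x):
    h = backbone(x)              # fine-tuning: gradients flow into the encoder,
    return head(h)               # so the encoder changes along with the head
```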
Okay, so let's talk about the other two categories first. For generative models, one of the really famous ones is the autoencoder. In the case of the denoising autoencoder, you take the original image, add some noise, and then use the encoder-decoder to reconstruct the original image. After pre-training the network, you discard the decoder and keep only the encoder as the backbone, which you can then use for other tasks. Then there's the other category, pretext tasks. It's almost the same: you take the image and pass it through the encoder, but here you figure out some smart way to generate pseudo-labels. To give you an example, say this is an image of a tiger. You pick nine patches from the image and shuffle them: that's your input x. The output y is the correct way to rearrange them. You train your network to rearrange those patches so that they make sense. If the network can successfully rearrange the patches, it must understand the image, which means the representation means something.

Those two kinds of methods were quite popular, roughly between 2014 and 2018, or even 2019. But then the joint embedding methods came out, and they now basically dominate the self-supervised representation learning field.

What is the major issue with the previous type of architecture, the pretext task? Okay, with pretext tasks the issue is usually how you design the task. If you design your pretext task to be too easy, the network won't learn a really good representation; but if you design it to be really hard, maybe even harder than your downstream task, the network doesn't train very well, and your downstream performance suffers. So it's really hard to come up with a good design for a pretext task. The representation will also basically be tailored to the specific task we're training on, right? Yeah, that's another issue. And the loss here is just classification in this case? Yes, here it's just classification.

Also, for the previous category, generative models, what is the major issue there? Why don't we use generative models? Okay, the major issue for the autoencoder is this: you need the decoder. First of all, training a decoder is already really hard; if you have a bad decoder, you will get a bad encoder. Also, the problem it's trying to solve is sometimes too hard. Why is that? Why is it hard to decode? Because for a lot of downstream tasks, the representation doesn't actually have to support reconstruction. Say you have two images of two dogs. If you just want to do classification, the two images could project to the same representation: they're both just dogs. But if you want reconstruction, those two dogs cannot have the same representation. So reconstruction is actually harder; for classification you can just squeeze the representation space. The last thing is that for the autoencoder the loss function is maybe not very good, because it's a reconstruction loss, often Euclidean distance, and in image space that's not a really good loss. You can imagine two images of dogs, and I can find an image of a cat that is closer in Euclidean distance to one of the dog images than the other dog image is. So Euclidean distance is not a good measurement in a lot of cases. I think that's in general why the autoencoder is not a really good representation learning method.

So is this why the variational part is important for making a good generative autoencoder method? Variational? No, I don't think so. The only reason you make it variational is that sometimes you want to sample from h: for a really basic variational autoencoder, you want h to be Gaussian or something, because you want to sample from it. But if I'm just doing downstream classification, I don't really care whether the representation is Gaussian or not.
So using a variational autoencoder to learn the representation basically just adds extra constraints to the representation, constraints you don't really care about.

A follow-up question on the classification over the nine patches: are we having a soft argmax over nine categories here? I don't remember exactly how they do it, but I think it's more than nine categories. Yeah, maybe you're right: maybe each patch has nine possible categories, but you have to control it to make sure no two patches get assigned the same category. I see, okay. I think that's all the questions so far.

Okay, so back to joint embedding. The whole idea is to make your backbone network robust to certain distortions. Imagine you're training a classifier and you distort the image a little bit: the image is still a dog, right? You wouldn't classify it as a cat or anything, so the representation should be robust to the distortion. Here's what you do: you take an image of a dog and make two different distorted versions of it. Then you encode them with your backbone network into two representations, and you want them to be close to each other, meaning the two images share some semantic information.

But then a really bad thing can happen: the trivial solution. Why? Because the network can cheat. It doesn't have to be invariant just to the data augmentation; it can be invariant to the input. No matter what input you give it, it generates the same output, and then the distance is exactly zero. In that case you get this trivial solution. So the whole question, and what differentiates the various joint embedding methods, is how you prevent this trivial solution. The general idea is that instead of just caring about the local energy between one pair of distorted images, you take a batch of images: you get N pairs, and then you look at the collection of representations. For each image you get a lowercase h_x; then you collect N of them and stack them into a matrix, the uppercase H_X. The trivial solution means all the lowercase h_x are the same, so you push this uppercase H_X to have each row (or column) be different.

What is this plate notation with the capital N? The capital N means you have N copies of the same thing, but with different x and different y. Okay, so inside the plate we have the energy, and outside, those green boxes are...? The loss function. Yeah, the N just means you sample a bunch of images: you sample N images, generate N x's and N y's, and then you have N of the h_x and N of the h_y. Then you concatenate them, or rather stack them, and you get the uppercase H_X. You try to make sure this H_X has a certain property, for example that the rows cannot all be the same. So the loss acts on the batch, basically? Yeah, it acts on the batch. Whereas the energy acts on the sample. On the sample, yeah.

Can you please explain once again what the small h_x and the big H_X are? What is the difference between the two? The small h_x is just a vector; it is the embedding of one image. The uppercase H_X is a matrix; it is N times the dimension of the lowercase h_x.
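Since this distinction comes up a few times, here it is in code: a minimal sketch, assuming a squared-distance energy; the batch-level penalty shown is just one illustrative way to force the rows of H_X apart, not any specific published method.

```python
import torch

def per_sample_energy(h_x, h_y):
    # h_x, h_y: lowercase embeddings, one D-dimensional vector per image
    return (h_x - h_y).pow(2).sum(-1)

def batch_loss(H_X, H_Y, lam=1.0):
    # H_X, H_Y: uppercase matrices, N stacked embeddings (N x D)
    attract = per_sample_energy(H_X, H_Y).mean()     # pull positive pairs together
    # anti-collapse term on the whole batch: if every row of H_X were identical
    # (the trivial solution), the variance across rows would be zero
    repel = torch.relu(1.0 - H_X.var(dim=0)).mean()
    return attract + lam * repel
```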
So N by D is the dimension of H_X: you just stack all the batch embeddings together, so this uppercase H_X is a matrix, N by D. You can think of it that way. And what are A and B? A and B are the loss terms; I will explain what they are for each method, because for different joint embedding methods, A and B mean different things. Okay? Should I keep going? Yeah, I will just read the equations as they come up. Okay.

So if you want to come up with your own joint embedding method, there are four things you have to choose. The first is the data augmentation: how you generate the two distorted versions. The second is the backbone network: what kind of backbone you use. The third is the energy function: how you define the distance between the two representations. And the last one is the loss function: what A and B are. For today, we just assume we have some reasonable data augmentation and some reasonable backbone network, either a ResNet or a ViT or some fancy new thing. But those two are actually super important: a lot of papers show that if you change them you get better results, and the state-of-the-art results put a lot of engineering effort into the data augmentation and the backbone. There's actually not much theoretical understanding of why some backbones work and others don't; it's still largely empirical knowledge, not really theory. The only things we have a sort of good understanding of are the loss function and the energy function, so for today I'll mainly talk about those two, and we'll just assume a good data augmentation and a good backbone network. Okay.

The two backbone networks actually share the same weights, so why not just put x and y through the same backbone network? Yeah, you can do that. But if you read the joint embedding papers, people all draw it like this. For this particular case the two networks are the same, so you could; but sometimes they're not the same (I will actually introduce a case where they're not), and sometimes, even when they are the same, they may cut the gradient on one side. So I draw it this way so that all the drawings are consistent with each other later on. Sure, sure. Also, I think we have other diagrams where we use this concept of parameter sharing, right? We have two encoders whose weights are the same, and for the sake of the representation, to understand where the x goes and where the y goes, we just replicate the symbol.

So there's another question, hold on. What does the dotted line between h_x and capital H_X represent? What function do we apply to h_x to get the capital H_X? Someone here is suggesting that it is the stacking operator. That's precisely it, right? Yeah, probably, just stacking, yeah. And then there is another question: besides being developed by Yann LeCun's group, why is JEM the method of choice? What is the evidence that proves this method to be better than the other two, the other two being the generative methods and the pretext tasks? Why do we prefer these JEMs over the others introduced before? Okay, the simple answer is that they perform much better.
No matter what your downstream classification task is, no matter what your dataset is, use the joint embedding method. There's a standard way to measure this on ImageNet, and the joint embedding methods get really close to the supervised methods. Say supervised learning with a ResNet-50 on ImageNet can get 76% accuracy; a joint embedding method can get maybe 75%. But if you use a pretext task or an autoencoder, you probably only get 40 or 50%. The performance gap is really large, and I think that's the main reason so many people prefer this method over the other two.

Let's remind ourselves what the major issues are again. The pretext task's major issue is the fact that... Yeah, it's really hard to design a good pretext task. Whereas the generative approach's major problem is that... I would think it's the loss function, the reconstruction, or... The training of the generator, right? Yes, the training of the decoder, that's also a big problem. Yeah, I was just recalling the major flaws. That's it.

So, okay. There are usually four categories of joint embedding methods: contrastive, non-contrastive (or regularized, as Yann likes to call it), clustering methods, and a last category of other methods. We call them "other methods" because we do not quite understand why they work, so we just group them together. Magic. Sorcery. Yeah.

Okay, so let's start by talking a little bit about the shared traits of all the joint embedding loss functions. A joint embedding loss function must have two components. The first is a term to pull the positive pair closer; usually that's just the energy function: you push the energy lower, so essentially you make the two representations closer. The second is a term to prevent the trivial solution, which is constant output; that's basically A and B in this graph. How you do that is what differentiates the various joint embedding methods. And I put "implicit" here because a lot of those other methods do not really have an explicit loss term to prevent the trivial solution; many of them could actually converge to a trivial solution, but during training they don't. That's why I put "implicit" there. In general, though, most of these loss functions have these two terms.

Another thing: the joint embedding method is different from supervised or generative learning in one respect. The only inputs to the loss function are the generated representations; they are all products of your network. That's different from supervised learning, where you predict a label and compare it to the actual label, and those actual labels are fixed, so you cannot change their scale. Likewise for the generative case: you try to reconstruct the image, and the original image is also fixed. But here, the loss function only takes the two representations, so you can change the scale of the representations: you can multiply all of them by 10. Say your network produces all the representations h_x and h_y, but 10 times larger.
If producing representations 10 times larger decreases the loss, then your training will be super unstable. So in general these joint embedding methods all have a way to prevent this instability from happening; I will show you what I mean later, okay?

Are the final embeddings spatially insensitive? I mean, after different distortions, the x and the y we feed in may just be different parts of a dog, so the features can be spatially dissimilar while the high-level semantic information remains the same. Yeah, actually a lot of people are currently working on that. It depends on the data augmentation, but for one of the really popular data augmentations, people found out that it actually only works well for classification. In the dog example, you see, the representation ignores the spatial information, because there isn't really a difference between the two dog crops. If your downstream task is object detection, that's actually really bad. So now a lot of people are researching what kind of data augmentation is good for joint embedding methods when the downstream task is object detection. That's actually exactly what you're going to do for your final competition: try different ways to use SSL for object detection, keeping as much of the spatial information in the images as possible. It's still an open challenge, but for now our understanding is that if you train with those kinds of data augmentations, the network can throw away a lot of the spatial information.

So there is a similar connection between the pretext tasks and this data augmentation, right? Because it seems like the pretext task was giving us parameters tailored to solve a specific made-up task, and here the joint embedding methods seem to be building invariance to data augmentations that are also somehow tailored to the downstream task. So I think there are still some limitations to be worked around.

So then, let's talk about the contrastive methods, okay? Contrastive methods have become really popular; in fact, since most of the early joint embedding methods were contrastive, a lot of people refer to joint embedding methods in general as contrastive learning. Do we like contrastive methods or not? No, we don't like contrastive methods. Why don't we like contrastive methods? Because with a contrastive method you have to do sampling, or do something, to push the energy surface up at specific locations. If your embedding space is really large, you cannot possibly push up at every possible location. So it's much better to use a regularized method, which keeps the volume of the low-energy region small, instead of a contrastive learning method. Sorry, it's push up: you try to push up on the negative samples. So let's see here in the chat if the students are following. What do contrastive methods do? Type in the chat, people. Let's see if we are online. I mean, if we are: they push up the energy on incorrect y's, right?
But then the major issue is... what's the major issue with contrastive sampling? We have to figure out where to find these y's, these specific samples. And this is the major issue, whereas with the regularized techniques we handle many of them at once. It's the same problem we saw at the end of yesterday's class, when I tried to see the energy of a linear interpolation of two inputs: the regularized technique, the variational autoencoder, was giving me a high energy for that linear interpolation of two input digits, whereas the denoising autoencoder, which is a contrastive technique, was giving a low energy on that linear interpolation, right? Anyway, let's get back to the slide and figure out what these i's and j's are.

Yeah. So again, as I mentioned, all joint embedding methods should have two components in the loss function. The first pulls the positive pair closer: you take one distorted dog image and another distorted dog image, and you push their representations closer to each other, okay? Then, to prevent the trivial solution, you push the negative pairs away. The h_x,i and h_x,j come from different images: one could be the representation of a dog image and the other the representation of a cat image, or a person, or whatever other image. You push those representations away from each other, and that prevents the collapse, because collapse means you output a constant vector, and if you output a constant vector, not only the positive pairs but also the negative pairs would be close to each other. That's the basic idea of contrastive learning, okay?

What if i and j are both dog images? Even if they're both dog images, as long as they're from different dogs, you may still want to push them apart, because even if they're both dogs, they still have some differences. You just want all the positive pairs close to each other, meaning the representations of the same dog are close to each other. The representations of different dogs should still be farther apart than images of the same dog, but closer than the cat image, or a person, a car, a truck, a bus, right? So, okay.

Contrastive learning methods were actually introduced in Yann's group around 2005 or 2006. But at that time they didn't work very well, or if they worked, they only worked on really simple datasets. The issue for all contrastive learning methods is really how to find good negative pairs. Say all my negative pairs are always a dog and, I don't know, a classroom. Then it will be super easy for the network to tell the two things apart; it doesn't have to learn a full representation of dogs. It just needs to know that, say, the dog has a certain skin texture,
while the classroom has somewhat different lighting, so it can cheat during training. So the real issue is how to find good negative pairs. You actually want to find a dog and another dog that look similar but are different dogs, because then you push the network to learn a good representation. Most early attempts did some hard-negative mining: you have some prior knowledge about the data, and every time you sample a negative that is super close to the original image. This is actually used a lot in facial recognition: people use prior knowledge to find different people with similar faces, use those images as the negative samples, and use the same person's images as the positive pairs. But in general it didn't work very well. So that was 2005, 2006. Then came 2020, when these two papers came out, SimCLR and MoCo. How did they solve the issue of good negative pairs? By just using a really large batch size: if you sample a lot and a lot of negative images, you will get some good negative samples. That's how they solved the problem of finding good negative pairs.

So many questions here, hold on. Can we adjust how much we push based on the negative sample, say if it's a different dog versus a different cat? So, can we push with different intensity based on the content of the image? I think the question is: can we push differently based on the label associated with the image? But this is without labels, right? You don't even know the labels of the images; that's the issue here, because you don't know whether it's an image of a dog or an image of a cat. Unless we are the ones generating the dataset: if I take pictures of my own dog and you take pictures of your own dogs, then we can actually have that. But usually we just have a huge collection of unlabeled images, and we have to build a representation that somehow doesn't collapse down into a single point and is descriptive enough to include all those different things, right? Okay, yeah. Any other questions? Yeah, people were answering the question in the chat, but I think it's nice to read the question out loud so that we have it in the recording, right? Okay, yeah.

So both SimCLR and MoCo use a loss function called InfoNCE. The loss function was actually proposed pretty early; there's a 2014 paper, I think, sorry, a 2004 paper, that already proposed this loss function. But it never really worked very well until we had enough compute to be able to use the larger batch sizes. Then in early 2020 we got these two amazing papers, and they use this InfoNCE loss to do contrastive learning. So next I will explain what the InfoNCE loss function is, and you will see how they use it. It's also related to the earlier question of how we can weigh different negative samples differently: InfoNCE actually does that, and does it really smartly, okay?

So this is the loss function, okay? You have the positive pair x and y. You take the negative log of the exponential of beta (a hyperparameter) times the similarity between your positive pair.
Then you divide by the sum of the similarities over all the negative pairs: h_x and h_x,j, where j indexes the representations coming from the other images. Why does this make sense? Let's reformulate it: push the log inside, and the log and the exponential cancel. The first term is minus beta times the similarity between the positive pair, h_x and h_y, and then you add the log of the whole sum. And, not magically but on purpose, you get a log-sum-exp, which is what we call in this class the softmax, or what some people call the real softmax. So you get minus beta times the similarity of the positive pair, plus the softmax over all the negative pairs.

Because this is a loss function, you want to minimize it, and beta is positive, so you push the first term's similarity high: you push the similarity of the positive pair up. Then you have the softmax over all the negative pairs, and you push the similarities of the negative pairs down, but with different force: the negative pairs with high similarity get pushed down much harder than the negative pairs with low similarity, because of the softmax, okay? So you do the two things: pull the positive pair together and push the negative pairs apart. And as I said, you always need something to prevent the gradient from exploding. The particular similarity measure people chose is this one: the inner product between the two representations, normalized by their norms. Because you normalize by the norm, even if your vectors grow really long, you make sure they stay unit vectors. So that's the InfoNCE loss function; I think it intuitively makes a lot of sense.
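Here is the loss in code: a minimal sketch where beta and the batch layout are assumptions (row i of h_y is the positive for row i of h_x), and, as in the common cross-entropy form, the positive pair is kept inside the log-sum-exp denominator.

```python
import torch
import torch.nn.functional as F

def info_nce(h_x, h_y, beta=10.0):
    # cosine similarity: inner product of unit-normalized vectors, so the
    # scale of the representations cannot blow up during training
    h_x = F.normalize(h_x, dim=-1)
    h_y = F.normalize(h_y, dim=-1)
    sim = beta * h_x @ h_y.T                     # (N, N) similarity matrix
    pos = sim.diagonal()                         # positive-pair similarities
    # -pos + logsumexp(...): the soft(arg)max pushes hard negatives
    # (high similarity) down much harder than easy ones
    return (-pos + sim.logsumexp(dim=-1)).mean()
```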
Any questions, Alfredo? Yes, there is one long question; I'm not sure if I should just read it. Let's see what's going on. Oh, the second one, okay, sure. "These architectures should be really useful in imitation learning, because there you have control over what samples you have." Imitation learning, you mean? I actually don't know, I'm not sure. I think the RL people really like self-supervised learning methods, but I'm not really sure about imitation learning. Yeah, okay. Maybe Vikkator can tell us more. You mean you have demonstration pairs, so x and y would be demonstration pairs, is that what you're saying? Yeah, so if x and y are demonstration pairs, you'd basically bring their representations close together... I'm not entirely sure. Okay, Vikkator, try writing the question in more detail so that we understand it better, and then I will ask it, okay? I want to add this: for imitation learning, it's kind of a question of how we evaluate. When you do the pre-training step, you do care about which downstream task you use, but we usually don't directly do imitation learning in the pre-training step. Yeah, okay. So should I go on? Yes, yes, I will read the questions as they come up.

The difference between SimCLR and MoCo is how they actually get this large batch size. What SimCLR does is just brute-force: they increase the batch size. The batch size they use in the paper is about 8,000, around 8,192. That's huge: at that time, for supervised learning, the common batch size was 256 or 128. So a batch size of 8,000 is really large. Even in 2020, when the paper came out, people were really surprised; they talked about how it would cost something like $100,000 on Google Cloud to train it. So it was really surprising for a lot of people. And that's how SimCLR utilizes a large batch size.

MoCo, however, uses what I think is a more clever way, also based on a really old idea: a memory bank, okay? The idea is that you use a smaller batch size, say 128 or 256. For the positives you keep everything the same, but for the negatives you want a large effective batch size. How do you get that? You don't just use the current batch's negative samples: you also keep the samples from the recent previous steps and aggregate them together into a large pool of negative samples. Say your N is 256 and you aggregate the previous 32 batches of negative samples: you essentially have about 8,000 negative samples. That's really clever, because it saves a lot of memory, and you don't have to generate them again; you just save them into the memory bank.

However, there's an issue with that. Because the backbone is updated at every step, after a while the old negative samples are no longer valid: the backbone back then and the backbone now may already be really different, so their representations are really different. If you do contrastive learning on that, you will see a clear decrease in your performance. So MoCo, which actually stands for momentum contrast, uses something called a momentum backbone. The idea is that you slow down the updates of that right-hand backbone, so that the difference between the older momentum backbone and the newer momentum backbone stays small. That means the old negative samples are still valid even after you've trained for a while, okay?

Is there a stop-gradient missing from the top? Oh, right, sorry, yeah, there's a stop gradient here: you do not backprop through the top-right branch. Actually, you do not update it through gradients at all. So how do you update it? For the regular backbone parameters, which I call theta (I actually should call them w), say the optimizer is really simple, SGD without momentum or weight decay: you just update the weights by subtracting the learning rate times the gradient at theta_t. That's how you update the regular backbone network at every step. Does slower mean a very small learning rate? What does slower mean? Okay, let me explain that; let me first finish this part. For the momentum backbone parameters,
what you actually do is, every time theta_{t+1} changes, you take an exponential moving average: theta-bar_{t+1} = m * theta-bar_t + (1 - m) * theta_{t+1}. So theta-bar, the momentum backbone's parameters, is an exponential moving average of theta_t. And you set m really large: usually m is 0.99 or 0.996. If you have time, you can try it yourself: expand the recursion, and you'll find that the effective learning rate of theta-bar is the learning rate times (1 - m). So when 1 - m is 0.01, theta-bar's learning rate is a hundred times smaller than the learning rate of theta_t, so it essentially has a smaller learning rate. The other way to see it is simply that theta-bar is an exponential moving average of theta_t, a moving average, so at every step it changes only a little.

It sounds very counter-intuitive: why should the momentum be set very high? Because you want theta-bar to be stable, right? If you set the momentum high, the updates will be really slow. Consider the extreme cases. If m is zero, you cancel the first term, so theta-bar_{t+1} equals theta_{t+1}: the two networks just share the weights, basically like SimCLR, where they share weights. If m is one, the other extreme, theta-bar_{t+1} is just theta-bar_t: you never change the weights at all. Which is untrained, basically. Yes, basically untrained, just the random initialization. So by changing m between zero and one, you control the rate at which theta-bar changes: m equal to zero means the backbone and the momentum backbone have the same weights; m equal to one means the momentum backbone is not trained at all. The higher you make m, the more you slow down the changes of theta-bar. I find that pretty intuitive, but I may be wrong. Okay.
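Putting both update rules together, here is a compact MoCo-style sketch; the sizes, bank depth, beta, and m are all illustrative assumptions, not the paper's exact values, and the encoders are stand-in linear layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

backbone = nn.Linear(512, 128)            # query encoder theta, trained by SGD
momentum_backbone = nn.Linear(512, 128)   # key encoder theta-bar, updated by EMA only
momentum_backbone.load_state_dict(backbone.state_dict())
for p in momentum_backbone.parameters():
    p.requires_grad = False               # the stop-gradient branch

bank = deque(maxlen=32)                   # memory bank: the last 32 batches of keys

def training_step(x, y, beta=10.0, m=0.99):
    h_x = F.normalize(backbone(x), dim=-1)               # queries
    with torch.no_grad():
        h_y = F.normalize(momentum_backbone(y), dim=-1)  # keys, no gradient
    keys = torch.cat([h_y] + list(bank))                 # positives first, then banked negatives
    sim = beta * h_x @ keys.T
    loss = (-sim.diagonal() + sim.logsumexp(dim=-1)).mean()  # InfoNCE over the bank
    bank.append(h_y)                      # keys stay valid because theta-bar drifts slowly
    # EMA update: theta_bar <- m * theta_bar + (1 - m) * theta
    with torch.no_grad():
        for p_bar, p in zip(momentum_backbone.parameters(), backbone.parameters()):
            p_bar.mul_(m).add_(p, alpha=1 - m)
    return loss
```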
Any other questions? Yeah, so, Mike, I have a question: how many more slides do we have on contrastive techniques? That's all of them. Okay, so I believe that should be pretty much it for today. Well, hold on, hold on. Because I forgot to record the introduction today, we can say we are done with the lesson. You can say a few more things if you have to, but I would say the lesson has ended, and I will restart, I mean, I will repeat the introduction of today's lesson, so that whoever couldn't come to class will still get the first 10 minutes, okay? So students don't have to stick around; if they want to ask the same questions they asked before, I will just do my best to remember what I said at the beginning, okay? Okay. I don't think I have anything else to say, but if you have a question, I can answer it in the chat, okay? So, Alfredo, you can do your stuff now. Okay, okay, okay. All right, so class dismissed; you're free to go. We are going to be seeing Justin again, of course, because this lesson was, I think, very good. I mean, I liked it a lot, especially the color theme, right? I think that was very good.

But okay, jokes aside: we are going to be seeing Justin again very soon, and then we can keep going with the non-contrastive techniques. Now I will just restart from the beginning with whatever I forgot to record; that was my fault. And that was it. So again, thank you for being with us today, and sorry for forgetting to record the lesson; nevertheless, we still have this post-hoc introduction of the whole lesson. Bye-bye to whoever actually stuck around. Oh, okay, someone actually did stick around. All right. See, I fixed it: even if I messed up, I fixed the thing. All right, bye-bye.