Welcome back to class. More things for you to learn and enjoy in this deep learning course. I'd like to start the lesson with a small story. A few years back, four I think, there was an undergraduate at NYU, Aditya Ramesh, a friend of mine who now works at OpenAI, and he just put out another amazing paper of his: DALL·E, DALL·E 2 actually. Perhaps I'll ask him if he wants to come here and tell us how it works; let's see. But first, let me show you what he has done. If you go on my profile and scroll down, you'll see this one. It was generated from the prompt "a teapot teaching chemistry to a group of teacups in elementary school while wearing a fancy three-piece suit", and these are two images generated by the network. This is a generative model, of course: the image is the y, well, the ỹ, the violet ỹ, whereas the x is the prompt, the observation, the one that is given to the model. If you scroll a little further down you have a cute dog playing piano, and this one, which I think is ridiculously awesome, though all of them are just amazing: "a painting of a sad octopus playing the guitar with seaweed in the background"; so much fun, that was the prompt for this one. And this one is hilariously amazing: bears taking over the planet, by DALL·E. I also have a bear; pun intended, well, it's not quite a pun, but I like bears. Okay, that was a bad joke; if you don't get it, even better. Anyway, you can scroll and look, I think it's amazing, and I tagged the post from Aditya, so check it out. All right, enough advertisement; back to our energy stuff. A small recap: yesterday we talked about optimization, and then I showed you the energy profile you observe when you move along a linear interpolation in the ambient space. You saw the two low energies at the endpoints and a bumpy, high energy in between when you perform linear interpolation in the input space, the ambient space. So the model basically tells you: that thing is garbage, it doesn't belong to the training manifold, therefore I assign it a high energy. And last week we were talking about joint embedding methods, so we're going to restart from there and connect the dots of all the things we've been discussing. I know we have been going back and forth, but I think it should be clear enough. We said there are two major types of architectures. The first one is the latent variable generative energy-based model, and the other is the joint embedding method. And we said the first one is the one that produces the ỹ and the variation in the ỹ.
So we compare that manifold of ỹ's with the blue target manifold. And how do we get the variation? By introducing an additional parameter, the latent variable. Then we have the spring, and that's the energy: the energy is the sum of the squares inside the dashed box. On the right-hand side, instead, we had the joint embedding methods, where we have two branches, the left branch and the right branch. The right branch, we assume, basically eats away that kind of variability; there is some sort of, what's the word, invariance over the variations across the manifold. So again, you have two points and you compare points: you either compare points or you compare manifolds, you cannot mix the two, otherwise it's a mess and it doesn't work. And this is the energy: again the big box, which in this case contains only one box, and that's my F. Then I showed you several of these generative models, and with Aditya, if he comes, we're going to see one more generative model. We said there are two types of training procedures, two ways to train so as not to end up with a collapsed model: either contrastive methods, or architectural and regularized methods. So what are the differences? We have seen this so many times, but repetita iuvant, it's Latin, it means repetition helps; meaning I just keep repeating myself, it's all good. So, training: how do we train this stuff? We would like to find an energy such that the energy for the observed pairs, or just the targets if we don't have the x, is lower than the energy for the non-observed ones. If we have those targets over there, we'd like an energy that is low for the observed dots and higher otherwise. How to do that? Two options. Either you have the contrastive technique: you push down the energy on the things you observe, and you pull up the energy at specifically chosen locations that are not the blue locations. Or the other option, in this case the architectural one, where we confine the region of low energy to be, let's say here, a one-dimensional manifold: a curve, still a one-dimensional manifold embedded in this two-dimensional ambient space. Or you have the regularization term, which is like a soft constraint. You can usually think of the architectural one as a hard type of constraint, and the regularizer as a softer, springy kind of thing. There are different techniques; I don't want to repeat all of them. I'll try to give you one example of these joint embedding methods, just my view of it, and then Jiachen will make the whole thing very clear. So, joint embedding methods: contrastive, clustering, distillation, you have so many options, and we will see more. There are many weird things we might or might not talk about, and more stuff on the other side. As you can tell, there are several options available for you to use and try, and each of them has pros and cons.
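To make the two training strategies a bit more concrete, here is a minimal sketch of a contrastive hinge loss on energies; this is a generic illustration (the margin form and its value are my own choices, not something from the slides): it pushes the energy of observed pairs down and the energy at chosen negative locations up, until the two are separated by a margin.

```python
import torch

def contrastive_energy_loss(energy_pos, energy_neg, margin=1.0):
    """One common hinge form of a contrastive loss on energies.

    energy_pos: energies F(x, y) of observed pairs, shape (B,)
    energy_neg: energies F(x, y') at chosen negative locations, shape (B,)
    """
    # The loss is zero once each negative energy exceeds the corresponding
    # positive energy by `margin`; otherwise the gradient pushes the positive
    # energy down and the negative energy up.
    return torch.relu(energy_pos - energy_neg + margin).mean()
```

The architectural and regularized alternatives instead restrict or penalize where low energy can occur, which is what the VICReg discussion below illustrates.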
The simplest one to explain is the one that doesn't have these aqua boxes; it's the one on the left-hand side, the most simple type of architecture. So the one I'd like to introduce today is VICReg. VICReg has these two branches, a left branch and a right branch, and then it has several costs on top of the two branches. We actually realized that this drawing is not quite right, because some of these boxes should be a different colour, since they are not part of the energy but part of the loss; I will have to update these drawings. So how does VICReg work? First of all, let's talk about this E here. I'm going to call it E for embedding; we may prefer to use a different letter so that we don't get confused with the energy. So let's fix these letters and terms for the moment. E is the representation of an input batch X: X is a batch in this case, and E is the representation of the whole batch of x's; same for the Y side, E_Y is the representation of the batch of y's. So my E is going to be batch size times D dimensions, whatever this internal dimension is. Each column of this matrix has B items, and each row has D items. So what are these two things doing? The first one is this similarity cost: we are basically trying to get these two representations to be close together. But if we just do that, we know the model can simply come up with the constant solution, so we have to introduce two additional terms, which I'm going to introduce briefly and then leave to Jiachen to explain more concretely; I just want to offer a different perspective on the same thing. Again, if we just minimize the distance between these two items, we get a constant representation, because that's the easiest way to make two things the same. So we need two more terms down here. The first one, this V term, keeps these vectors from having a constant representation across the batch: this variance term tries to bump up the variation of these representations throughout the batch. Then we have another term, the C term, which tries to make each dimension of the representation independent, so that it is maximally informative. Why is that necessary? I would say the C term is perhaps not strictly necessary; the V term is the one you really need, since it asks for different values across the batch. But then, by also decorrelating these dimensions, you get representations that are basically aligned with the axes. Again, the main point is that the S term lets me have similar representations for the two branches; the easiest way to cheat would be to output a constant representation all the time, and the V term instead enforces the variance across the batch to stay above some specific value. How? With this hinge loss: this component here computes the variance of each of these embedding dimensions.
And then, as you can see here, when we try to minimize this term, since there is a minus sign we end up pushing this variance up. Until when? Until we reach this threshold. Then there is the positive part, so we don't push beyond the threshold. So this variance term makes sure the variance stays above this gamma: if it is below gamma, then when you train the system and try to minimize the loss, this term makes the variance grow until it becomes equal to gamma; if it is larger than gamma, the whole thing is negative and the positive part kills it. Okay, that was all I wanted to say, and I took exactly 15 minutes. So what is E∘? I didn't say it, I was going to introduce it later in the lesson, but I'll tell you now: E∘ is my centered embedding. What does centered mean? It simply means the E matrix from which I subtract the mean row. The right-hand side of this expression computes the mean row: E_b is a given row, I sum over all B rows and divide by B, so this is the average row. Then I take my matrix and subtract the average row, so that the columns are now zero-mean. Therefore, if I compute the squared length of a column, I get its variance. Again, I didn't want to go into too much detail; I just wanted to show you that this architecture doesn't require much machinery, because you're going to hear from Jiachen that all the other architectures I showed you before, the eight I showed you, introduce so many fancy things. This is the simplest one to understand, the one we kind of understand. Questions? For the C term: what I'm showing here is that I compute the sum of all the squares of the covariance matrix and subtract the diagonal, so that by minimizing this term I minimize all the cross terms of the covariance matrix. So this C term basically decorrelates the embedding dimensions. Again, it's not the most important part; the V is the most important, I would say. So V tries to boost the variance of each individual dimension, and C tries to make those dimensions independent. I've taken enough time; the rest is Jiachen, who will bring some order and give you a deeper perspective on all of this. I'm done for the day; I'll just be asking Jiachen questions about the content he's going to present, okay?
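For reference, before the handoff, this is roughly what the formulas on those slides say, written in the notation used above (E is the B by D batch of embeddings, E∘ its centered version). The exact normalizations and the small epsilon follow the VICReg paper's definitions, so treat them as an assumption rather than as a transcription of the slide:

```latex
% Centered embedding: subtract the mean row so that every column (dimension) is zero-mean
E^{\circ} = E - \frac{1}{B}\sum_{b=1}^{B} E_{b,:}

% Variance term: a hinge that pushes each dimension's standard deviation up to the threshold gamma
V(E) = \frac{1}{D}\sum_{d=1}^{D} \max\!\left(0,\; \gamma - \sqrt{\tfrac{1}{B}\left\lVert E^{\circ}_{:,d}\right\rVert^{2} + \varepsilon}\right)

% Covariance term: sum of the squared off-diagonal entries of the covariance matrix
C(E) = \frac{1}{D}\sum_{i \neq j} \left[\frac{1}{B-1}\,(E^{\circ})^{\top} E^{\circ}\right]_{i,j}^{2}
```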
First of all, let's have a really quick recap of what we talked about last Thursday. We talked about visual representation learning, and how it is a two-step process: you do pretraining first, then evaluation as the second step. Then we talked about how you can do visual representation learning: with a generative method, with a pretext task, or with a joint embedding method. We're mainly going to talk about the joint embedding methods, which rest on two intuitions. The first intuition is that the representation should be invariant to the data augmentation, but by itself that leads to the trivial solution. So we then introduced ways to prevent the trivial solution, and one way is the contrastive method: you basically push the positive pairs closer and the negative pairs apart. But then, how do you find the negative pairs? You have two strategies. The old papers mostly use a hard negative mining strategy; we didn't talk about it much, it just uses some prior knowledge to find negative samples. Then I introduced the more modern approach that most papers use: keep a large pool of negative samples. I don't want to say "batch", because Yann actually talked about the difference between a batch and a pool. Anyway, you want a large negative sample pool, so that by chance you will have some negative samples that are really hard. Then I mentioned how SimCLR did it and how MoCo did it. So that's the recap. This time we're going to talk about non-contrastive methods: how to prevent the trivial solution without using any negative samples. Why do we want that? Apart from the disadvantages of contrastive methods that Yann and Alfredo talked about at length, in practice people found that contrastive methods need a lot of setup to make them work: there is a whole bag of tricks, and you need at least some of them for a contrastive method to work. So in practice it involves a lot of engineering tricks, which makes it really hard to analyze in theory how it works, and it can be quite unstable if you don't use them. So then a bunch of non-contrastive methods were introduced based on information theory. One says the representation shouldn't have much redundancy; that's called redundancy reduction, and that's Barlow Twins. Another says the information content of the representation should be maximal, you should maximize the information content of the representation; that's VICReg. The advantage of those two methods, and some others such as TCR that I won't go into, is that they don't require much in the way of special architectures: you can basically train with a basic setup, without too many engineering tricks. So today we'll mostly talk about VICReg. VICReg is based on this information maximization principle: you want your representation to carry maximal information about your image, because a trivial solution means that no matter which image you feed in, you get the same representation, so the representation carries no information about the image. So you want to maximize the information content. How do you do that? You try to produce embedding variables that are decorrelated from each other, because if all the variables are correlated with each other, if they covary together, the information content is reduced: two independent variables carry more information than two variables that covary with each other. This is what the C term was doing in the slide I showed before, right? Yeah. So eventually you prevent this informational collapse, which is when the variables just carry redundant information. Okay.
So you see, you have two kinds of collapse. One collapse is that no matter what image you feed in, you always generate the same representation; that's the trivial solution we talked about last Thursday. Now, for VICReg, there is a second type of collapse: all the representations carry only a really limited amount of information. Different images have different representations, but the information content of each representation is really low. How do they deal with it? This is basically the loss function for VICReg; it has three terms. The first term is the same as in all the other methods: you just push the positive pair closer together. Then the second term acts on the covariance matrix, which you can calculate like this. The diagonal terms of the covariance matrix are just the variances of the individual dimensions, and you want those variances to be high, because in the trivial solution the variance of each dimension is just zero. So this term prevents the first type of trivial solution. But then you can still have the issue that all the dimensions covary with each other, which is the second type of trivial solution. So you use the third term of the loss function to make the covariance of the embedding small: you take all the off-diagonal terms of the covariance matrix and push them towards zero; since an off-diagonal term can be negative or positive, you square it, and you push all the squared off-diagonal terms to be small. So instead of two steps you have three: this term gives you invariance to the data augmentation, which by itself has the first type of trivial solution; then you push the variance up, which prevents the first type of trivial solution but introduces the second type; then you push the covariance down, which takes care of the second type. Okay, that's basically the intuition behind VICReg. I think Daniel asked whether this is maybe just a clever way to use negative samples: instead of directly repelling them, you compute the covariance matrix or something like that. That's actually a legitimate argument, but in practice people observe that the requirement on the batch size is much smaller. You still need a batch of samples. Why? Because you need to estimate the covariance matrix, and you cannot estimate a covariance matrix from one sample. But the covariance matrix is easy to estimate; the estimation doesn't require that many samples. So, just from the empirical results, it doesn't require nearly as many samples as the contrastive methods need for negatives. In some sense you can see it as a smarter contrastive method, but in general people just think of it as a non-contrastive method. Any questions? I saw... No, everyone is just really happy because your explanation was perfect; we like it. Okay, that's good then. So that's one category of non-contrastive methods, based on information theory.
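As a minimal sketch of this three-term loss (invariance, variance, covariance), here is roughly how it can be written in PyTorch. The structure follows the VICReg paper; the coefficient values, the gamma of 1 and the small epsilon are commonly used defaults, not numbers taken from the lecture:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_x, z_y, sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0,
                gamma=1.0, eps=1e-4):
    """z_x, z_y: (B, D) embeddings of the two augmented views of the same batch."""
    B, D = z_x.shape

    # 1) Invariance: pull the two views of each sample together.
    sim_loss = F.mse_loss(z_x, z_y)

    # 2) Variance: hinge that pushes the std of every dimension up to gamma.
    def variance_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(gamma - std).mean()
    std_loss = variance_term(z_x) + variance_term(z_y)

    # 3) Covariance: decorrelate dimensions by penalizing off-diagonal covariances.
    def covariance_term(z):
        z = z - z.mean(dim=0)                      # center across the batch
        cov = (z.T @ z) / (B - 1)                  # (D, D) covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / D
    cov_loss = covariance_term(z_x) + covariance_term(z_y)

    return sim_coeff * sim_loss + std_coeff * std_loss + cov_coeff * cov_loss
```

Note that the variance and covariance terms only make sense over a batch (B > 1), which is exactly the point above about needing enough samples to estimate the covariance matrix.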
Okay, so there is another category of non-contrastive methods which is really interesting, called clustering methods. Basically, they try to prevent the trivial solution by quantizing the embedding space. There aren't too many methods here; as far as I know, only one group works on this, although they have published four or five different papers about clustering methods. It's a bit old-fashioned, but it's really interesting, so I'd like to introduce it. You have this rather complex graph; let's go through it. You take one image and another image, sorry, the same image but two distorted versions, x and y, and you generate the representations. Then you stack them up, as we did for all the other methods: you collect all of them and stack them, and you get these N by D matrices, the uppercase H_X and H_Y. Then you do a clustering; in fact, two clusterings. For the first one you use the Sinkhorn algorithm. I'll say a bit more about Sinkhorn later, but it gives you one cluster assignment: K is the number of clusters, so you get an N by K matrix, let's call it the Q matrix. In reality it is continuous, but in the extreme case let's just think of it as one-hot: in each row, only one element is one and all the others are zero. So you get N of them, one per image. Then you do another clustering. This one I call soft K-means; it isn't really soft K-means, but you'll see why I call it that. Here you cluster again with the same centroids and you generate the prediction: sorry, it's actually here, you use y, do the soft K-means, and generate the prediction Q̃_X. That's a prediction for this Q_X, the uppercase Q_X. So you basically use H_Y to predict the clustering of H_X, and you use H_X to predict the clustering of H_Y. That's why it's called swapped: it's a swapped prediction. There is a question here: how do you use a vector to predict a clustering? Okay, let me explain it again; maybe it's easier if we look at the loss function together. This W is our set of centroids, a K by D matrix, and this H_X is N by D. About the Sinkhorn algorithm: if you do plain K-means, sometimes a weird thing happens where basically all the samples get clustered to one centroid, lots of samples close to one centroid and the other centroids with no samples at all. Sinkhorn is an algorithm that prevents that from happening: it distributes the samples almost equally, not to just one cluster but to every cluster, making sure every cluster gets at least some number of samples. There is actually a hyperparameter you can tune in Sinkhorn, and depending on it you can go from essentially plain K-means all the way to a completely equally distributed clustering. So after you run that, you get this Q_X, the assignment: for each input, say x1, you get a K-dimensional vector that tells you which centroid it is close to, and in the extreme case it is one-hot; those are all one-hot.
They tell you which centroid Sinkhorn assigned each sample to; that's the top branch. For the bottom branch, you do the thing I call soft K-means. What happens is: you take the centroids and H_Y, and I forgot to mention that H_Y is actually normalized, and you multiply them, so W times h_y gives you the similarities between h_y and all the centroids. Then you apply a softargmax to that. If instead of a softargmax you used an argmax, that would be exactly K-means: it tells you which centroid the sample is closest to. With a softargmax you get a softer version of which centroid the sample is close to. So you get this. But notice this comes from H_Y, yet it is a prediction for Q_X; that's why it's a swap. If you put H_X here instead, you get the prediction for Q_Y. Then the energy function is simple: it's just the cross entropy between this one-hot vector and the prediction you made. Okay, so let's look at two things about this joint embedding method; first, why it pushes for invariance to the data augmentation. The reason is that Q_X is generated from H_X and Q̃_X is generated from H_Y, so you are trying to assign both H_X and H_Y to the same cluster. Instead of directly pushing the two representations to be close to each other, you push them to land in the same cluster. That's a different way of making them invariant to the data augmentation. Hold on, how can we interpret this? What are these clusters? We can think about it like yesterday, when we applied a variational autoencoder to the digits of the MNIST dataset: the latent space gets partitioned into, hopefully, ten different buckets. We are not providing labels to our algorithm, but if the network needs to come up with a specific number of clusters, it can happen that these clusters end up somehow connected to the actual classes of the individual items. So although we don't have label information, we can expect the overall algorithm to come up with a subdivision of the data based on classes extracted from the data. Later on we can train supervised with very few examples, very few annotations: if we come up with ten buckets, ten different clusters for, say, the MNIST dataset, then later we just need, for example, ten data points to assign each cluster to the corresponding target. That's how this clustering connects to the actual downstream task later on. Yeah. Okay, I saw two questions in the chat, so let me answer. How is the swapping going to help? If you don't swap, you basically use H_X to predict Q_X, but Q_X was generated from H_X itself, so the solution is trivial. You want to use H_Y to predict Q_X precisely because Q_X is generated from H_X. As I explained before, the swapping enforces that H_X and H_Y get clustered into the same class; if you don't swap, you only enforce that H_X is clustered consistently with itself, which doesn't help: it doesn't help with the invariance to the data augmentation.
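Here is a compressed sketch of that swapped prediction, with a small Sinkhorn-style normalization standing in for the full algorithm. The epsilon of 0.05, the temperature, the three iterations and the way the centroids are packed into a single matrix W are illustrative assumptions, not values quoted in the lecture:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn similarity scores (N, K) into soft assignments Q whose clusters are
    used (approximately) equally, by alternating row and column normalizations."""
    Q = torch.exp(scores / eps).T          # (K, N)
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)    # normalize each centroid row...
        Q /= K                             # ...so every cluster gets ~N/K samples
        Q /= Q.sum(dim=0, keepdim=True)    # normalize each sample column
        Q /= N
    return (Q * N).T                       # (N, K), each row sums to 1

def swapped_prediction_loss(h_x, h_y, W, temperature=0.1):
    """h_x, h_y: (N, D) L2-normalized embeddings of two views; W: (K, D) normalized centroids."""
    scores_x = h_x @ W.T                   # cosine similarity to each centroid
    scores_y = h_y @ W.T
    q_x = sinkhorn(scores_x)               # Sinkhorn assignments (treated as constants)
    q_y = sinkhorn(scores_y)
    p_x = F.log_softmax(scores_x / temperature, dim=1)   # "soft K-means" prediction
    p_y = F.log_softmax(scores_y / temperature, dim=1)
    # Swapped prediction: the code of one view is predicted from the other view.
    return -0.5 * ((q_x * p_y).sum(dim=1).mean() + (q_y * p_x).sum(dim=1).mean())
```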
Okay, there is another question: why is softargmax(W h) equal to Q̃? Let me give a simple example. Say K is just two, so W is a 2 by D matrix; you only have two centroids. If you compute W times h_y, with W being 2 by D and h_y a D-dimensional vector, you get a vector of size two, where each element is the cosine similarity between h_y and the corresponding centroid. Then you apply a softargmax to turn those cosine similarities, which can be negative or positive, into a categorical probability. That's why it gives you Q̃_X. Okay, another question: so you would want the number of clusters to be as close as possible to the number of classes? Not necessarily, actually. In the SwAV paper I think they use 8,000 clusters even for ImageNet, and ImageNet only has a thousand classes; for the clustering they use 8,000. You can think of it this way: even within all the images of dogs, you can still subdivide them into different types of dogs, based on the texture of their coat, their colour, or whether they are big dogs or small dogs. Subdividing each class even further actually helps; it provides extra information about the representation space. So you actually want to divide more, and even if you may not strictly need it, dividing more may help your training. Okay, I think that's all the questions for now. So I've talked about how it pushes for invariance to the data augmentation, but I haven't talked about how it prevents the trivial solution. The trivial solution is exactly what the Sinkhorn algorithm prevents. When Sinkhorn does the clustering, it tries to give all the different clusters an equal number of samples, so you cannot assign all the samples to a single centroid. In that sense, you keep the representations from being too close to each other, because in the trivial solution you would make all the representations the same. Let me rephrase: you essentially keep all the centroids far away from each other, so you cannot make all the representations the same; if all the representations were the same, this cross entropy would become impossible to predict, because the Sinkhorn clustering would be essentially random, since it cannot see any difference between the images. That's how it prevents the collapse. Hopefully that makes some sense. Okay, is the Sinkhorn assignment soft? Yes, let me answer that: Sinkhorn is soft. Actually, in the SwAV paper they mention that they tried both the hard version and the soft version.
Their conclusion is that the soft version works better, but the hard version also works; and because the hard version is easier to build intuition about, I chose to present the hard version. Is this why we're using a batch of variables for H_X and H_Y? Yes. For Sinkhorn, unlike K-means, you need the other samples: with K-means you don't have to know the other samples, you just need the centroids and the sample you are at, you compute the distance to all the centroids and you get the clustering. But with Sinkhorn, because you want an equal number of samples in every cluster, you have to know the clustering of the other samples; that's why we use the uppercase H_X here and not the lowercase h_x. For SwAV, K is a hyperparameter you can tune, but I think in the paper they just use 8,000. Also, h cannot be zero, because h is normalized; it is fixed to be a unit vector, I think I mentioned that. And yes, the Q_X and Q̃_X can also be swapped the other way: there is a symmetric part of the loss, of the energy function, where you have H_X and Q_Y, giving Q̃_Y; I only wrote half of the loss function. So on the last line we should have written the F as the sum of the two cross entropies, C(Q_X, Q̃_X) plus C(Q_Y, Q̃_Y), right? Yeah, exactly; in the graph you have the two C's. My bad, I forgot. It's okay. So, that's the clustering method. The last category is what I call the other methods. Why do we call them other methods? Because we do not really understand why they work. There are some theoretical studies of these methods, but we still do not know why they do not collapse to trivial solutions. The early example is BYOL. What happens in BYOL: okay, let me explain it from the start. You have the input x and the input y as before, two distorted versions, and you get the two representations. But then you add a predictor. In the original paper they call it a predictor because they try to use h_x to predict h_y. And this energy function is just the Euclidean distance; sorry, it's actually the cosine similarity between the predicted h_y and the actual h_y. And you cut the gradient on the target branch. But there is no term to prevent collapse. If you remember our graph for the contrastive methods, we had this box N, and we said: you sample a batch, and within this batch we at least enforce a certain thing to happen. Here you only push the positive pair closer; you do not push any negative pairs apart, you do not enforce anything, and yet it still works. Why does it work? There is a bunch of theory: maybe it's related to batch norm, and some say it is not related to batch norm; there is a lot of theory about that. Apparently it is this particular architecture, with the extra predictor layers, that makes it work in this case. SimSiam is the follow-up paper to BYOL, and the only difference is that BYOL uses a momentum backbone while SimSiam just uses a regular backbone.
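A minimal sketch of that predictor plus stop-gradient setup follows. The layer sizes and the use of a plain two-layer feed-forward predictor are assumptions for illustration; BYOL additionally keeps the target branch as a momentum-averaged copy of the online branch, which is only hinted at here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictorHead(nn.Module):
    """BYOL/SimSiam-style head: a small feed-forward predictor on the online branch."""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, h_x, h_y_target):
        # h_x: online-branch embedding; h_y_target: target-branch embedding
        # (a momentum copy of the online branch in BYOL, the same backbone in SimSiam).
        p = F.normalize(self.predictor(h_x), dim=1)
        z = F.normalize(h_y_target.detach(), dim=1)   # stop-gradient on the target
        # Negative cosine similarity: only the positive pair is pulled together;
        # there is no explicit term that prevents collapse.
        return -(p * z).sum(dim=1).mean()
```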
Then we have the DINO method, which is even weirder. Again it is just a backbone and a momentum backbone. You get the two representations, and each goes through a softargmax; the only difference is that the two softargmaxes have different temperatures, different coldness as Alfredo would say. Then you compute a cross entropy between the two, which basically pushes them together. Still no negative samples, nothing, and it still works. The last one is a really recent paper called data2vec, and it basically just adds a layer norm at the end of the representation, and it works. For all of these, why they work we are not quite sure, but it is interesting, because it means there is probably some implicit regularization happening in these networks that prevents them from converging to the trivial solution. And all these methods are really nice because the loss function is really local: you only need h_x and h_y to compute it. For all the previous methods, whether VICReg or any contrastive method, you actually need a batch, a pool of negative samples or a pool of samples. That causes a problem in distributed training, because you have to do a collective operation to gather all the vectors from the different devices. That's why a lot of people focus on trying to figure out why these other methods do not collapse to the trivial solution, and under what conditions they would converge to it. Okay, I see a question. I think it would still work, but maybe converge slower. Actually, no: first of all, for all the joint embedding methods convergence is kind of slow, and these other methods are not particularly slower than the contrastive methods or VICReg. That's another reason I say there seems to be some implicit regularization that keeps them from converging to the trivial solution. What is the connection with contrastive learning? A lot of people think, especially for BYOL, that if this predictor is just linear, you can actually write out the update rule for the predictor and find that it implicitly does some negative-sample contrasting. But when it is more complicated, say a three- or four-layer feed-forward network, it is almost impossible to analyze; and those three- or four-layer feed-forward predictors actually work much better than a linear predictor. What if you initialize the network to give a trivial solution? If you initialize a neural network at a trivial solution, these networks will never work. Why? Because at the trivial solution these loss functions produce zero gradient, so once you are in the trivial solution you can never escape from it. However, for whatever reason, the training dynamics just never really converge to the trivial solution, and that is the thing we are not sure about. In data2vec, is it necessary for the layer norm to go right after that branch, or could it be added somewhere else? Honestly, I have never tried data2vec with another layer norm elsewhere, but maybe you can try it for your project and see whether it works or not. Yeah.
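Coming back to DINO for a moment, here is a minimal sketch of that objective: two softargmaxes with different temperatures and a cross entropy between them, with a stop-gradient on the momentum (teacher) branch and no negatives anywhere. The temperature values are assumptions, and details such as the centering of the teacher outputs and the momentum update itself are left out:

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_out, teacher_out, t_student=0.1, t_teacher=0.04):
    """student_out, teacher_out: (B, K) outputs of the backbone and the momentum backbone."""
    # Colder (sharper) softargmax on the momentum branch, softer on the student branch.
    teacher_probs = F.softmax(teacher_out.detach() / t_teacher, dim=1)  # stop-gradient
    student_logp = F.log_softmax(student_out / t_student, dim=1)
    # Cross entropy between the two distributions pushes them together.
    return -(teacher_probs * student_logp).sum(dim=1).mean()
```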
Are there any good papers that try to figure out what the implicit regularization is? I have seen some papers, but none of them gives a really satisfying explanation. It is really active research; data2vec came out maybe three months ago, so all of these things are still super new, and we are not sure what is happening. Okay, there is a question about SwAV. SwAV is very similar to a contrastive method: instead of using everything in the batch as negative pairs, it uses the other clusters as negatives, so why is it not contrastive? It really just depends on how you view it. Let me actually stress this point. You can absolutely think of it as contrasting against negative centroids instead of negative samples. But we think that view does not necessarily help you understand what is happening, because the contrastive-learning explanation is not theoretically satisfying: it just tells you to push negative pairs apart, but it does not tell you what a good negative pair is or how you should use them. If you instead think of it as a regularization, where you quantize the embedding space into K subspaces, it is actually easier to understand, and maybe that explanation helps you develop more algorithms. That is why people tend to think of these as non-contrastive methods. So those other methods don't have any regularization at all? Yes, that's true: the other methods have no explicit regularization, and yet they do not produce the trivial solution. But they are quite sensitive to the setup, to the hyperparameters: if you set the hyperparameters wrong, they may converge to the trivial solution, and they converge to it really fast; if you set them correctly they will not, and no one understands why. Okay, so that's all about the other methods. The next thing I'd like to talk about is data augmentation and network architecture. I will have to be a bit brave here, because we do not have a really good understanding of these, but they are actually super important: if you can find a good augmentation, it may boost performance, boost accuracy, more than changing the loss function. Sometimes the loss function does not really matter that much; it mostly provides an understanding of what you are measuring and how it works. Data augmentation and network architecture are sometimes more important, at least if you only care about accuracy. For the data augmentation, from roughly 2020 to 2021 the dominant recipe is the one proposed by SimCLR and improved a little by BYOL. Given an image, you do a random crop, then you flip it horizontally, then you apply some color jitter, changing the colour, the contrast or the brightness, or turning it to grayscale, and then a Gaussian blur to make the image blurry. That is the standard data augmentation for SimCLR, BYOL and the rest.
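In torchvision, that standard pipeline looks roughly like the sketch below; the crop scale, jitter strengths, probabilities and blur kernel are typical values chosen for illustration, not the exact settings of SimCLR or BYOL:

```python
from torchvision import transforms

# SimCLR/BYOL-style augmentation: random crop + flip + color jitter
# + grayscale + Gaussian blur (parameter values are illustrative).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

# Each training image is augmented twice to produce the two views:
# x, y = augment(img), augment(img)
```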
If you know something about data augmentation in supervised learning, these augmentations are actually super strong, a crazily large amount of augmentation. In regular supervised learning people nowadays do add some extra augmentations, but traditionally they mostly just use random crop and color jitter, and the color jitter is not even that aggressive. Okay, so that's the first thing. The second thing: if you try removing each of them and look at the effect, you almost always find that the random crop is the most critical one. If you do not have the random crop, then even with the flip, the color jitter and the Gaussian blur, the representation will be horrible. But even if you only have the random crop and do some smart engineering tricks, it still works. So why is the random crop the most critical one? We are not quite sure, but there is at least some understanding of it: maybe the random crop is the only augmentation that really changes the spatial information in the image. The flip can do a little of that, but the flip is really weak; the color jitter and the Gaussian blur are more about changing the channels, the appearance. Then, since maybe the middle of last year, people have been moving from these traditional augmentations to masking augmentation. You take the image and you mask out a lot of patches; here you mask out about 75% of the patches, and that is your data augmentation. You do not use color jitter, Gaussian blur or flips. This actually works pretty well, but there are two things to say. The first is that it only works with transformer-type architectures; it does not quite work with ConvNets. A lot of people, including me, are trying to work out how to make masking work with ConvNets. The second thing is that the success of this approach comes from replacing the random crop: it is another way to remove the spatial information, or rather the redundancy in the spatial information. So how to do the masking is still very much active research. And that's it for data augmentation. Let me check the questions. Why doesn't masking work with ConvNets? In short, with a ConvNet the masking reintroduces too many artificial edges; you can see it creates a lot of artificial edges. With a transformer you do not need to care about those edges, because for any transformer the first layer is a conv layer in which the kernel size equals the stride, and not only the stride, it also equals the patch size. In that case you never run across these artificial edges, because the kernel boundaries coincide with the patch boundaries. In a ConvNet you cannot do that, because you have sliding windows: even if you are smart in the first layer and keep the sliding window from seeing those artificial edges, in the next layer you have no choice, you will see them. I think that is the main reason why masking does not work for ConvNets but works for transformers.
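To illustrate that last point, a ViT-style patch embedding is just a convolution whose kernel size and stride both equal the patch size, so no kernel application ever straddles a boundary between a masked and a visible patch. The sizes below (16-pixel patches, 768 channels, a 224 by 224 input) are typical ViT values, assumed for illustration:

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
# ViT-style patch embedding: kernel_size == stride == patch_size, so each kernel
# application sees exactly one patch and never crosses a patch boundary.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 768): one token per patch
print(tokens.shape)

# A regular ConvNet instead uses small kernels with overlapping sliding windows,
# so deeper layers inevitably mix masked and visible regions and respond to the
# artificial edges that masking creates.
```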
Then, for the network architecture: again we do not know much, but there are things we definitely know from the empirical results. First of all, it is always better to add a projector after the backbone. The projector is only used during pretraining; when you do the evaluation on the downstream task, you remove the projector, you just throw it away and use only the backbone. Sometimes people call it a projector and sometimes an expander: the projector projects to a lower dimension and the expander to a higher dimension. For contrastive learning you have no choice, you can only use the projector, because of how the negative samples work, as Yann explained; but for VICReg and those methods you can use either the expander or the projector. The second thing: MoCo uses both a momentum encoder and a memory bank, and people found that even if you do not use the memory bank and just keep the momentum encoder, it still improves performance on the downstream task, especially when you only have weak augmentation. That is the case I mentioned where you only have a random crop, no color jitter and no Gaussian blur: if you do not use the momentum encoder, the network learns a horrible representation, but if you use the momentum encoder with this simple augmentation, random crop only, it still works. It is definitely not state of the art, but it is close to state of the art; the performance does not drop by much. Why the momentum encoder helps, we are still not sure; some people think it adds some extra augmentation, and you can think of it that way. And what is the projector? The projector is just a small neural network, usually a two- or three-layer feed-forward network. Instead of using the representation from the backbone directly, you take the backbone's representation, a vector, pass it through this two- or three-layer feed-forward network, and then use the output of the projector to compute your loss function or energy function.
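As a minimal sketch, such a projector might look like the code below; the dimensions (2,048 in, 256 out) are typical choices for a ResNet-50 backbone and are assumptions, not values from the lecture:

```python
import torch.nn as nn

def make_projector(in_dim=2048, hidden_dim=2048, out_dim=256):
    """Small feed-forward head appended to the backbone during pretraining only.
    The loss or energy is computed on its output; for downstream evaluation it is
    thrown away and only the backbone representation (in_dim) is kept."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
    )

# An "expander" is the same construction with out_dim larger than in_dim
# (for example 8192, as in VICReg).
```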
So why do we remove it? Okay, the question is why we remove the projector during evaluation. For contrastive learning there is a good reason: the output of the backbone is usually 2,048-dimensional, or 4,000-something, while the output of the projector is usually 128 or 256. That is really small, so if your downstream task is based only on this 256-dimensional output, it is not too good: the projector removes a lot of information, so you want to use the larger vectors. For the other types of architectures, such as VICReg, there is actually a paper from Pascal Vincent's group, which I don't have here, where they found that even if you make the projector output the same size as the backbone output, the output of the backbone still contains more information about the image and the output of the projector contains less. So the projector kind of removes a lot of information from the backbone representation. There is no concrete explanation yet for why this happens; people have basically just shown empirically that it works this way. Okay, I think it is time, so if you have more questions you can ask, otherwise we can wrap up the class. I think we can take the last two questions now. I think it was just great; I loved this lesson. In this slide I saw the reference paper at the bottom; what is the title? Okay, this particular reference is actually a book about optimal transport, which is where Sinkhorn comes from. I highly recommend reading it; actually, just read chapter five or six of the book. The Sinkhorn algorithm is amazing: mathematically it is amazing, and the algorithm is even more amazing than the math, it is so simple and so effective. The original paper that introduced this use of Sinkhorn is called something like lightspeed computation of optimal transport; it is super fast and makes total sense, a really amazing algorithm. I highly recommend you read it; I think the NYU library has the book for free, you can find it. Okay, I think that's all the questions. Oh, I see this one: if the projector removes a lot of information, why use it during training? Again, it is just an empirical result: we found that if you use it during training, it works much better than not using it. Actually, the original MoCo did not use a projector; SimCLR used a projector and SimCLR outperformed MoCo; then the MoCo authors went back and published a short, roughly two-page paper called MoCo v2, where the only difference is that they added the projector, and it boosted the performance. Does the expander serve the opposite purpose? Actually, even though the expander projects to a higher dimension, in some sense it still removes information, so it basically still behaves like a projector. I do not know why: even when projecting to a higher dimension it still loses a lot of information, and I have no explanation for that. Okay, I guess that's all the questions. Thank you so much, Jiachen, for teaching; I didn't have to say anything, you are a completely self-teaching machine.
I think I have gotten used to this chat, although it gave me a hard time to keep checking it today. Today you were just perfect; I have no other feedback whatsoever. So I hope we are going to see Jiachen again in the future, depending on whether he has more things to say; he always has things to say, so I believe we will be seeing him in the near future. Otherwise, everyone enjoy the end of the week. One last comment about the project: we are going to release the details about the project, how you can access the Google Cloud and everything, tonight, so just wait for the emails; you are going to get an email from Jiachen tonight. Otherwise, enjoy the evening and the end of the week, and I'll see you next week on Wednesday for the next lesson. Okay, bye-bye, have a good night. Bye-bye.