All right, all right. Good morning to the last class of this semester. We are very sad, of course, that we've reached the end; we would have liked to keep providing more and more content. I, at least, like to talk and explain things. So today the plan is the following. We're going to have Yann talking about "random stuff," as he put it; then he's going to talk about the future of AI and deep learning, what he thinks the road in front of us looks like. Then we'll have the question-and-answer section: I'm going to read the most upvoted questions you wrote on Campuswire, so if you haven't upvoted yet, have a look at whatever questions came up overnight, and if you see something interesting, vote for it so it rises to the top and you actually get to hear the answer. That will take roughly the first three quarters of the class, and in more or less the last half hour we're going to go over the video presentations of the top five entries on the leaderboard, so we can announce the winner of the semester competition. I don't want to give spoilers, so I won't say more about the competition right now.

I have no idea where Yann is at the moment. He seems to be online, and he should have the link. We are in the right class, I believe, because I see 50 people here already. Hello. Maybe someone has internet connection issues. Otherwise, what do I talk about? How can I entertain you? While we are waiting, I will just entertain you with some additional knowledge; at least this time I thought in advance about what I should do. I'm going to share my screen so we don't waste our time. There is another interesting notebook that we didn't cover. It's going to be: go to the pDL folder, conda activate pDL, and jupyter notebook, and there we go. I'm just improvising; this was not planned, it's okay. In the repository you have this additional folder called extra, and inside it you have three other notebooks. Okay, Yann has actually appeared, so we can start the lesson. But otherwise, I would also recommend you have a look at the projection notebook, which shows you that everything in a high-dimensional space is basically orthogonal to everything else. And then there is the custom grads module notebook, which tells you how to create custom modules whose functions may be written in plain Python or NumPy, but for which you also have to specify two functions: the forward pass and the backward pass.

Okay, since Yann is here, we are not going to have this additional section, so let me stop my screen sharing. Maybe we can talk about this next time. Hi, Yann. Morning. Good morning, everyone. I already introduced today's lesson, so we're going to start with the random stuff, as you pointed out; then we're going to move on to the future of AI; then we're going to have the question-and-answer session; and finally the projects' top five entries. Okay, so let's start with tying up some odds and ends. A couple of short topics I meant to talk about and didn't get a chance to until now. There's basically only one topic, actually; the rest will be mostly answering questions.
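To give a flavor of what that custom grads notebook covers, here is a minimal sketch (my own illustration, not the notebook's actual code) of a custom PyTorch module where both the forward pass and the backward pass are specified explicitly; the function and names are made up for illustration.

```python
import torch

# A custom autograd module: the function can be written with plain Python /
# NumPy-style operations, but you must provide both forward and backward.
class MySquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save whatever the backward pass will need.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: d(x^2)/dx = 2x, times the incoming gradient.
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.randn(3, requires_grad=True)
y = MySquare.apply(x).sum()
y.backward()
print(x.grad)  # should equal 2 * x
```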
So let's start with something I meant to talk about, and this is going to be a combination of a couple of slides plus a little bit of algebra on the screen. Okay, so here's a reformulation of deep learning as constrained optimization. The reason I'm talking about this is because it opens the door to other ways of doing deep learning, or backprop generally speaking, that we have alluded to in the past without really delving into the theory.

So when we're building a deep learning system, let's say it's just a layered system, a sequence of modules; we don't have any complex graph. We can easily extend the formalism to a general connection graph, it's just that the notation becomes a little more hairy, so I didn't want to make it too heavy. We have a loss function that depends on the training sample X and Y; this is for a single sample. And it depends, of course, on the parameters of our system. We want to minimize some cost function that measures the discrepancy between, for example, the last layer and the desired output Y; this is for supervised learning, or a generative model of some kind. But we have to satisfy a bunch of constraints. The constraints are that the internal state of the system, the activation tensor if you want, at layer K plus one, should be equal to a function GK applied to ZK and WK. So module number K takes as input ZK, with parameter WK, and produces the output ZK plus one. What we're saying here is that the output of module K should be equal to the input of module K plus one, but we view this as a constraint, essentially. Then there's an additional constraint, which is just notation: we denote X as Z zero, so we make sure Z zero is equal to X. A small note here: Z is not a latent variable; Z is our internal representation, so we should really be writing H, for hidden. Except that in the next line it will become a latent variable.

Okay, so the next line is here. Here I made the loss function depend explicitly not just on X and Y, but also on two extra variables, Z and lambda, and basically I've written it in the form of a Lagrangian. I'm sure pretty much all of you have seen how you can express optimization under constraints with a Lagrangian, and I know it can be intuitively mysterious why this works; I'm going to attempt to explain it more graphically, but just take it as a given for now. The way you express a minimization-under-constraint problem is that you build a function which is the sum of the original function you want to minimize, plus the sum of the constraints, each multiplied by what's called a Lagrange multiplier, usually denoted lambda. Here our constraints are vector constraints, so each Lagrange multiplier is itself a vector, and this is the dot product between the Lagrange multiplier vector, which has the same dimension as ZK plus one in this case. In fact, I really should have called this ZK plus one; that's a typo. And this just expresses the fact that ZK plus one needs to be equal to GK of ZK, WK. Okay.
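In symbols, the setup just described might be written as follows (notation assumed here: an N-layer network with states z_k, parameters w_k, and per-layer functions g_k):

```latex
% Deep learning as constrained optimization:
\min_{w}\; C(z_N, y)
\quad \text{subject to} \quad
z_{k+1} = g_k(z_k, w_k), \;\; k = 0,\dots,N-1, \qquad z_0 = x

% The corresponding Lagrangian, with one multiplier vector per constraint:
\mathcal{L}(x, y, z, \lambda, w) \;=\;
C(z_N, y) \;+\; \sum_{k=0}^{N-1} \lambda_{k+1}^{\top}\,\bigl(z_{k+1} - g_k(z_k, w_k)\bigr)
```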
So now the optimality conditions for Lagrangian optimization are that we want to find a set of values for Z, lambda, and W such that the gradients, the partial derivatives of this function with respect to Z, lambda, and W, are zero. So we have three conditions. The first one is that the derivative with respect to ZK must be equal to zero, ZK being a vector, so this is a row vector. The second one is with respect to lambda K, and the third one with respect to WK. This kind of optimization is actually a saddle-point optimization: what we're looking for is a minimum with respect to Z and with respect to W, but a maximum with respect to lambda. And this is where the intuition of Lagrangian optimization comes in. You don't need to have the intuition, you can just do the math blindly, but it's kind of cool to understand what's going on.

So here's a very quick reminder of what Lagrangian optimization really is. Let's say you have a function; I'm just going to make it depend on a single variable Z. And you have a constraint: Z must belong to a curve of some kind. So in the space of Zs, say this is Z1, Z2, I have a cost function, let's say a quadratic cost function, something like this, where these are the lines of equal cost. And the constraint is a curve like that. So where is the value of Z that minimizes the cost but is also on the curve? That value is right here. What's characteristic about this point? At this point, the gradient of the cost is pointing this way, because it's orthogonal to the line of equal cost. Hopefully I've drawn this properly; it's not exactly orthogonal here, but imagine it is.

Now, the way I can express the curve is as an equation: G of Z equals zero. So my Lagrangian would be something like L of Z and lambda equals C of Z, the cost function I want to minimize, plus lambda times G of Z, because my constraint is that G of Z equals zero. Now G of Z itself has a gradient. In fact, I'm going to make it red, because I made the curve red initially. What is the gradient of G of Z? G of Z is a function that, let's say, is negative to the bottom left of the red curve and positive to the top right, because it crosses zero on the red curve; the red curve is G of Z equals zero. So I can compute the gradient of G of Z, and it's going to be an arrow, and at the optimal point the gradient of G of Z has the same direction as the gradient of the cost. So the red arrow is the gradient of G with respect to Z, and the blue one is the gradient of C with respect to Z. At the optimal point, those two are aligned. And we can look at another point: if I take another point, say this one here, the gradient of the cost goes this way, whereas the gradient of G goes that way, orthogonal to the curve G, right?
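In symbols, for this scalar-constraint example, the condition being illustrated is that at the constrained optimum the two gradients are collinear, with the Lagrange multiplier as the proportionality factor (a sketch under the sign conventions assumed above):

```latex
\nabla_z \mathcal{L}(z^*, \lambda^*)
 \;=\; \nabla_z C(z^*) + \lambda^{*}\,\nabla_z G(z^*) \;=\; 0,
\qquad G(z^*) = 0
\quad\Longleftrightarrow\quad
\nabla_z C(z^*) \;=\; -\,\lambda^{*}\,\nabla_z G(z^*)
```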
If I take another point here, the gradient of the cost is kind of like this, and the gradient of G is that way; those two arrows are not aligned. When you reach the minimum, you are on the curve, you are at the minimum of the function, and those two gradients are aligned. So basically what you can say is that there exists some value lambda for which the gradient of C of Z at the minimum, let's call it Z star, is proportional to the gradient of G of Z, and the proportionality constant we call lambda. So then the problem becomes: how do we find this lambda, and how do we find the point where those two gradients are aligned?

And here's a trick; I mean, it's not really a trick. If I make a cut through the function at the solution point and look at the values of the two functions along that cut: in this purple frame of reference, the C function is going to look like a parabola of some kind; it doesn't necessarily go to zero. And then the constraint is here, and I'm going to plot G too. To keep it simple, I can say that G is linear, for example, and G is equal to zero at that point. So we have a slope here for this function, and of course we have a slope for G, and we have to find a lambda such that when we multiply the second slope by lambda, those two slopes cancel each other.

So why does that matter? It matters because we basically want to control the effect of the constraint. Here's the thing: if we have a small lambda, the system will want to find a point that really minimizes C. So this is C and this is G. If lambda is small, the G term in the loss doesn't matter very much, so the system will find a minimum of this compound function that is close to the minimum it wants, not taking G very much into account. As you increase lambda, the importance of the second term rises, and the minimum of this function with respect to Z gets closer and closer to the constraint, until G of Z equals zero and this term doesn't matter anymore. So you increase lambda, you crank it up, until the second term disappears because G of Z becomes zero, and that's when you satisfy the constraint. But at the same time, you've found a point where those two gradients are aligned. And that's the intuition behind Lagrangian minimization under constraints. I tell you, a lot of people don't have it; it's a difficult intuition to have, but it's kind of useful. A lot of people ask themselves: why can't I just make lambda infinite? But the point is, you don't need to.

So here's another formulation of minimization under constraint, which is through a penalty function. I can write another function of Z, and it would be C of Z plus, I'm not going to call it lambda, I'm going to call it alpha, and it's just a scalar now, times G of Z squared. So now this is a penalty function.
I'm making the system pay for making G of Z non-zero, and I just choose alpha, which picks the tradeoff between minimizing C and minimizing the squared norm of G of Z. So this will not find an exact solution where the constraint is satisfied; it will find a tradeoff. If you want to exactly satisfy the constraint, you basically have to crank alpha up to infinity. And that's an easy intuition to have: if you have a penalty function and you want it to be zero, you crank up the coefficient in front of it to infinity, which of course may cause all kinds of numerical problems, but it will approach the solution that both satisfies the constraint G of Z equals zero and minimizes the cost function C. The beauty of Lagrangian optimization is that you don't put the square; it's not a penalty. The lambda only needs to be just large enough so that lambda times the gradient of G is equal in length and collinear, but opposite, to the gradient of the cost. As soon as lambda has this critical value, you satisfy the constraint.

Now, what's interesting is that the process we've been doing here, cranking up lambda, is a maximization of this L function with respect to lambda. So the optimal point for our optimization problem is a minimum of L of Z, lambda with respect to Z, and a maximum with respect to lambda. So that's what we're going to do here: compute the gradients of this function with respect to Z, lambda, and W, the variables we're interested in, and see how it goes.

We start with this one because it's easy: the gradient with respect to lambda K plus one, which is just ZK plus one minus GK of ZK, WK, and we set it to zero. And what is that going to give us? Forward prop. If I compute the gradient of the Lagrangian with respect to lambda and set it to zero, what I get is ZK plus one equals GK of ZK, WK. Not surprising, really; that's the constraint we wanted. So satisfying the constraint tells us we have to run forward prop. Nothing new there; why did we need Lagrangian optimization for that? But the second one is where it becomes interesting.

So now let's look at the first condition: we have to differentiate with respect to ZK. There is a special case when ZK is the last layer, where we also get the gradient of the cost with respect to ZK, but I'm going to ignore that for a minute. Here we're going to have two terms, because ZK appears in two constraints: once as the output of module K minus one, and once as the input of module K. Differentiating the term where ZK is the output just gives lambda K; differentiating the term where ZK is the input to GK gives minus lambda K plus one times the Jacobian matrix of GK with respect to ZK. And that's what you see here. Let's see, I actually screwed up the indices on this slide, sorry; this should be a minus one and this should be ZK, and I'm not sure why I got this wrong here.
I'm going to redo this, but basically, if this were correct, you'd get lambda K minus one equals the gradient of GK minus one with respect to ZK minus one, transposed, times lambda K. And that's backprop. I'm not sure why I flipped my indices here; in fact, I can probably fix it: this is minus one, and this is minus one. Okay. This is the backprop equation, where lambda is the gradient vector: you take the gradient vector at layer K, you multiply it by the Jacobian matrix, transposed in this case, of the function that computes ZK from ZK minus one, and that gives you the gradient with respect to ZK minus one. That's just backprop. So there's a bit of magic here: you don't have to think about anything, you can do blind algebra, not too blind like me, essentially blind calculus, and you get the backprop equation naturally.

Then for the last condition, the one with respect to the weights, you don't get anything you can satisfy directly. You get a way to compute the gradient of the loss with respect to the weights, but it doesn't give you a condition you can satisfy just by computing something, so that one you're going to have to optimize by gradient descent; but now you know how to compute the gradient. It's similar to what we've been talking about: you take the state at layer K plus one, ZK plus one, and you multiply it, in fact this is backwards, by the transpose of the gradient at that layer, and that essentially gives you the gradient with respect to the weights. The dimensions here are a little funny, but if you instantiate GK, for example, as a linear layer, WK times ZK, and substitute it in, you'll find the usual backprop equation for how you compute the gradient with respect to the weights. Same if GK is a nonlinear function and there is no WK.

So why am I telling you all this? First of all, you've heard Alfredo talk about optimal control, model predictive control, the Kelley-Bryson algorithm, and things like that, and you've heard me mention them in the context of planning. The people who invented this, for the purpose of planning, were optimal control theorists in the 50s and 60s. In the West, this principle of setting all those derivatives to zero is known as the Pontryagin principle, actually Pontryagin's extremum principle. Pontryagin was a Russian mathematician working on optimal control who came up with the mathematics that goes around this; there's a lot more to Pontryagin's principle than this, but that's the basic idea. Originally, though, this came from classical mechanics, Lagrangian mechanics; it was pretty much invented by Lagrange, Euler, and a few others. So when you want to compute, for example, the time trajectory of a ball rolling down a track, you say the trajectory is constrained to a particular curve, and the kinetic and potential energy are conserved. You write the Lagrangian, which is in fact the difference between the kinetic and potential energy, include the constraints of the trajectory, apply this mathematical machinery, and you can compute the trajectory directly. Okay, so one reason to think about this is this idea of replacing a constraint by a penalty.
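Collecting the three conditions just discussed in one place (a sketch under the notation assumed earlier, row-vector convention, up to sign conventions; lambda_k plays the role of the loss gradient with respect to z_k):

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_{k+1}} = 0
  \;\Rightarrow\; z_{k+1} = g_k(z_k, w_k)
  \qquad \text{(forward propagation)}

\frac{\partial \mathcal{L}}{\partial z_k} = 0
  \;\Rightarrow\; \lambda_k^{\top} = \lambda_{k+1}^{\top}\,
      \frac{\partial g_k(z_k, w_k)}{\partial z_k}
  \qquad \text{(backpropagation)}

\frac{\partial \mathcal{L}}{\partial w_k}
  \;=\; -\,\lambda_{k+1}^{\top}\,\frac{\partial g_k(z_k, w_k)}{\partial w_k}
  \qquad \text{(gradient used for the weight update)}
```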
So if I rewrite this Lagrangian and do the same transformation I did before, I'm going to turn the cost function into a constraint and turn the constraints into cost functions. If you do it in a particular way, that's called the Lagrange dual. So I'm going to have a lambda prime transpose here times C, and I have only one of those. And the other terms are going to be alpha K, which in fact I don't even need, times ZK plus one minus GK of ZK, WK, squared. So I basically turn my constraints into penalties and turn my cost function into a constraint. Now I have a new constrained optimization problem.

If you think about it, it's like I have a neural net, let's say with three layers, with a cost function at the end, but then I have a penalty for a particular Z; let's call it ZK, and this guy is ZK plus one, and this is GK minus one, this is GK, this is GK plus one. This cost is the squared distance between the two things that enter it, and it's basically one of those terms in the sum. And I have another one here, of course. But this one at the output now becomes an equality constraint; I could also keep it as a cost.

So how do I use something like this? What's the form of backprop that works here? I plug in an X and a Y, and then I have to minimize this new Lagrangian with respect to the ZKs. I also have to maximize it with respect to lambda, but I don't care about that; I can just make sure that whatever comes out at the output is equal to Y, that's basically the constraint. And now what I'm going to do is find all the ZKs that guarantee the output is equal to the one I want, while all of those additional terms are minimized, which means I want every ZK plus one to be as close as possible to whatever output comes out of the previous module. And I can do all of this by gradient descent: if I maintain the constraint, I can find what value I should give to ZK plus one so that the constraint is satisfied. That assumes G actually has the right properties; it may be impossible to find a ZK that satisfies the constraint, and in that case I'll just minimize C. But I'm going to find the ZK plus one that makes the output equal to Y, and once I have that, I can find the value of ZK that makes this term and this term as close to zero as possible. So I can jointly minimize with respect to all the ZKs, which are free variables, latent variables now, so that my Lagrangian is minimized.

And this is target prop. Basically, once I've done this optimization with respect to the ZKs, the ZKs are targets for the previous layers. So now my learning procedure is super simple, because to figure out the weights of that GK, the WK, I just need to minimize that local cost, since I know the target value it should take for this particular sample. If GK is, for example, a linear layer, I can actually compute the weights analytically by solving a linear system, or I can do gradient descent. Now, I can't really do this per sample; I would need to do this over a fairly large batch if I wanted to do this, right?
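To make the procedure concrete, here is a minimal, illustrative sketch (my own toy construction, not the exact algorithm from the slides): the per-layer states are treated as free variables and optimized so that the output matches Y while each state stays close to the output of the previous module; the optimized states then serve as local targets for each layer.

```python
import torch

# Toy target-prop-style sketch. The output constraint is kept as a cost,
# as mentioned above; sizes and step counts are arbitrary.
layers = torch.nn.ModuleList([torch.nn.Linear(10, 10) for _ in range(3)])
x, y = torch.randn(1, 10), torch.randn(1, 10)

# Free variables: one state per layer output, initialized by a forward pass.
zs, h = [], x
for g in layers:
    h = g(h).detach().requires_grad_(True)
    zs.append(h)

alpha = 1.0                              # weight of the per-layer mismatch penalties
opt_z = torch.optim.SGD(zs, lr=0.1)
for _ in range(50):                      # optimize the states z_k
    opt_z.zero_grad()
    cost = ((zs[-1] - y) ** 2).sum()     # output should equal y
    prev = x
    for g, z in zip(layers, zs):
        cost = cost + alpha * ((z - g(prev)) ** 2).sum()  # z_{k+1} close to g_k(z_k)
        prev = z
    cost.backward()
    opt_z.step()

# Each layer now has an input (previous target) and an output target:
# fit it locally (for a linear layer this could be solved by linear regression;
# here, a few gradient steps).
opt_w = torch.optim.SGD(layers.parameters(), lr=0.01)
prev = x
for g, z in zip(layers, zs):
    for _ in range(10):
        opt_w.zero_grad()
        ((g(prev) - z.detach()) ** 2).sum().backward()
        opt_w.step()
    prev = z.detach()
```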
So I would take a batch and, for each sample in the batch, compute the optimal ZKs, the entire internal state of the neural net, that minimize this expression. Then, once I have all those ZKs, I can locally apply a learning procedure to every layer that modifies, or just sets, the parameters of the layer to whatever value minimizes the quadratic cost attached to it. And if the layer is a linear layer, that's linear regression, super easy to solve. So that's target prop.

We've seen an example of this in the LISTA method. In LISTA, you have a Y and you train an autoencoder: there is a decoder, there's a latent variable Z, and the latent variable may have another cost, some sort of regularizer like sparsity. And what you're doing simultaneously is training an encoder to predict the optimal value of Z, but you compute the optimal value of Z by gradient descent, with an additional term in the cost, which is how different Z is from the output of the encoder. So you have three costs: the reconstruction error C, the regularizer on Z, and the distance to the prediction by the encoder. There, you can view the constraint as just another cost. You find a Z that minimizes the sum of those three terms, and once you have that, Z acts as a target value for the encoder, and you can train the encoder to do a better job at predicting it next time. And if the encoder does a good job, then once the system is properly trained, you can just run through the encoder and get a pretty good prediction of the optimal Z. As you remember, people now call this amortized inference: the idea of training a neural net to predict the optimal solution of an optimization problem so that you don't have to run the optimization algorithm every time; there's a small sketch of this setup below.

Okay, I think I'm going to switch now to talking about the future of AI, and I'm still going to use the board, actually. So, there were several questions concerning the future of AI, and this is my personal idea about where I think things are going. You may or may not believe it; I imagine a lot of people may disagree with what I'm going to say, but I'm going to tell you where I think things are going, what is important, and what to watch for over the next few years, or what to work on if you want to do a PhD or get into research.

Clearly, as I said when we talked about self-supervised learning, the future is really self-supervised learning. Increasingly, I think most of our neural nets are going to get bigger and bigger, and they're going to be trained with some form of self-supervised learning before being fine-tuned for the tasks we are interested in. That's the future. The future is also unified architectures, or unified models. You see this trend already in industry, and you've seen it in some of the guest lectures: in the past, you would train a separate model for recognizing a particular type of object or localizing objects in images, a separate model to translate from one language to another, and separate models for speech recognition in each language. Now the trend is that you basically get one giant vision system that does everything.
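Returning to the amortized-inference idea mentioned above, here is the promised sketch: a minimal, illustrative toy version (module names and sizes are my assumptions, not the lecture's code) of the three-term inference cost and the amortized training step.

```python
import torch

# LISTA-style amortized inference sketch: infer z by minimizing reconstruction
# + sparsity + distance to the encoder's prediction, then train encoder/decoder.
enc = torch.nn.Linear(16, 8)   # predicts z from y
dec = torch.nn.Linear(8, 16)   # reconstructs y from z
opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.01)

def infer_z(y, n_steps=30, lr=0.1, lam=0.1):
    """Gradient-descent inference of the latent z for a given sample y."""
    z = enc(y).detach().requires_grad_(True)          # warm-start from the encoder
    for _ in range(n_steps):
        recon = ((dec(z) - y) ** 2).sum()             # reconstruction error
        sparsity = lam * z.abs().sum()                # L1 regularizer on z
        pred = ((z - enc(y).detach()) ** 2).sum()     # stay close to encoder output
        grad, = torch.autograd.grad(recon + sparsity + pred, z)
        z = (z - lr * grad).detach().requires_grad_(True)
    return z.detach()

y = torch.randn(1, 16)
z_star = infer_z(y)
# z* now serves as a target for the encoder; the decoder is trained to reconstruct from z*.
opt.zero_grad()
(((enc(y) - z_star) ** 2).sum() + ((dec(z_star) - y) ** 2).sum()).backward()
opt.step()
```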
Okay, it's trained on multiple tasks, but usually it's actually pre-trained with self-supervised learning or with weak labels and then fine-tuned for all the tasks you want it to be good at. It ends up being a bigger network, but it does a lot of things simultaneously. NLP systems, say translation systems, are now multilingual: they're trained on multiple languages simultaneously and they build an internal representation of text that is independent of the language, so you can use them for a lot of different tasks. Similarly for speech recognition: you basically have a single system that is trained for all languages.

So why is this a good idea? Because all of those tasks have a lot in common, and you want to share resources between them so you can train bigger networks on bigger datasets and exploit the redundancy between those various tasks. You want to map every language into a kind of common root representation, so that the languages end up looking very similar to each other, and really exploit the redundancy between them. Same for speech and images. Images are images, at least natural images; there are unnatural images collected from various types of sensors, but even then we see good transfer. People have done things like pre-train a convolutional net on ImageNet and then fine-tune it to detect tumors and such in X-rays, and the ImageNet pre-training actually improves performance, strangely enough, because images are images. So that's the trend towards unified architectures. The combination of those two trends is going to make neural nets bigger and bigger, but they will do more and more things for us.

I think the future of SSL, and you've heard a lot about this, is non-contrastive methods, because contrastive methods don't scale. So things like BYOL and Barlow Twins, which I've talked about; there's a whole bunch of them appearing, not every day, but almost every week nowadays. There's a new one coming from my group at Facebook called VICReg, with uppercase V-I-C. It's not out on arXiv yet, but it will probably come out over the next week or so; it's a variation on Barlow Twins. I'm very hopeful that this kind of method will be developed not just for learning visual features, but for learning features for all kinds of applications. And I think the big question is: when are we going to be able to use some version of those to train a system to learn about the world from video? As I said in the lectures, there are really two applications of SSL: the first one is learning representations, and the second one is learning forward models.

About SSL, there's a question here from a student: in this competition we found that SSL takes a large amount of compute; for example, many papers like MoCo use 128 GPUs, and for most of us it took a very long time to train. Does this force this kind of research to be tied to the very few companies that can build huge GPU clusters?

So, for the large-scale applications of it, probably yes. But two things. First of all, hardware is evolving really fast. Hardware for training neural nets is evolving really quickly. We use GPUs.
They're getting cheaper and more powerful. Right now they're not that cheap, because NVIDIA is basically a monopoly, but more and more vendors are coming onto the market, so this kind of compute is going to get a lot cheaper over the next few years. That's the good news. The bad news is that regardless of how cheap it gets, the best systems deployed for large-scale applications are still going to use tons of it, so the high-end, large-scale applications are going to be in the hands of organizations that have the resources.

Now, this is not necessarily just companies. Some European countries, for example, have policies where they set up large data centers for academic research, and those systems are similar in size to what you get at Facebook or Google, and you can more or less use them for free; there are some limits on the resources, obviously. So I think it's really in the hands of governments to provide the means for researchers to do this kind of exploration. It's not a new problem: there have been similar situations, for example, in climate simulation, weather prediction, or fluid dynamics. Back in the old days you had to use supercomputers, you didn't have much choice, and the US has a bunch of supercomputers you can request to use for that. They're not designed to run neural nets yet, but they're getting there. And various countries, like France and others, have these kinds of capacities. So that's the relatively good news.

The other thing is that even if there is still a gap between what you have access to and the state of the art, you can still come up with good ideas. In other words, industry doesn't have a monopoly on good ideas; a lot of good ideas come out of academia. And if you have a good idea, it may not beat a record on some big-name benchmark, but it will show the way. Let me give you two examples. Some of the most interesting ideas of the last half decade or so were things like GANs. GANs came out of the University of Montreal; only then were they picked up by industry. The whole idea of using attention mechanisms, or multiplicative interactions, for things like translation, that was Kyunghyun Cho when he was a post-doc in Montreal. That was quickly picked up by industry and scaled up to a huge degree. So those are conceptual ideas: it's not like the University of Montreal and Kyunghyun Cho beat a record on translation; that was done later, actually, by a group at Stanford, and then it was picked up by industry. And now academia basically can't match that scale. However, industry makes those trained models available, so that's interesting. There's actually a coordinated effort in France, for example, to build very large, multilingual language models that would be public; it's initiated by Hugging Face, I think, but it's a public effort. So I think there's going to be a pretty bright future for this.

Okay, I hope I answered all the questions and didn't forget anything. Let's see. Yeah, so, say again? The prediction you were talking about? Yes.
So, we can use self-supervised learning to learn representations, and so far, in vision as well as in NLP, it's been done by augmenting the input artificially, by substituting words in the case of NLP or transforming the image in the case of vision, and then using joint-embedding techniques in the case of images, or denoising autoencoders in the case of NLP. And non-contrastive joint embedding is the thing I would bet on. Right now, with things like VICReg, Barlow Twins, BYOL, SwAV, et cetera, you have two branches and some objective function between them that maintains the amount of information coming out of the two networks while at the same time making the outputs of the two networks identical when X and Y are related, or nearby.

There are going to be extensions of this. In BYOL and similar models, the architecture is actually a little different: here you have the same weights in the two branches, but you've also seen architectures where one of the two networks has an extra, so-called predictor on top. And you can imagine that this predictor could take as input the parameters of the transformation from X to Y: if you produce Y by transforming X, you might actually feed the parameters of that transformation in here. There have been a few attempts at this, but so far they've not really been published or successful. Almost none of them have latent inputs here, although there are a few papers, mostly focused on neuroscience, that attempt to do this.

But that's pretty much what you need. Ultimately, what you want is a system where you show it a video clip, you feed it into an encoder, and then into some sort of predictor, or transformation module, whatever you call it. And simultaneously you show another piece of video, and those two encoders might be different, because this one might take into account maybe three or four frames, or some larger number of frames, and the number of frames here may be different, maybe it's only one, maybe two. And then what you want is some way, with a cost function somewhere, for the system to tell that these two frames are a good continuation of those four frames. So this would be prediction combined with joint embedding, but then you can't rely on sharing weights between the encoder for X and the encoder for Y, because they may have different inputs. It could even be that what you're doing is not predicting video but predicting audio from video, or text from images, or something like that; it could be cross-modal prediction. So if we find a good way of training systems like this, in ways that prevent them from collapsing, like Barlow Twins, for example, or VICReg, which you might see within a week or so, I think that would be the ticket: something that does not rely on the two encoders having shared weights, that would be the ticket for large-scale self-supervised learning from things like video.
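As one concrete example of the kind of non-contrastive criterion mentioned above, here is a minimal sketch (my own simplified version, not the official implementation) of a Barlow Twins-style objective: invariance on the diagonal of the cross-correlation matrix between the two branches, redundancy reduction off the diagonal.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Simplified Barlow Twins-style criterion: push the cross-correlation
    matrix of the two embeddings toward the identity."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)   # standardize per dimension
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                             # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()    # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lambd * off_diag

# Embeddings of two related views (e.g. two augmentations, or two video crops):
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
loss = barlow_twins_loss(z1, z2)
```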
We could plug a system into YouTube or whatever video source we have, let it watch videos all day, and then hope that, by training itself to predict, it might, in a hierarchical fashion, come up with concepts like: the world is three-dimensional; there are objects that are in front of others; some objects can move independently; some objects are animate and some are inanimate; objects hidden behind other ones continue to exist; objects fall, because of gravity, when they are not supported; all the concepts we learned when we were babies. And perhaps it's the collection of all of those concepts that constitutes the basis for common sense. So if I see a path towards more human-like intelligence, or even more animal-like intelligence, it would be through something that does prediction of everything from everything else, which is the principle of self-supervised learning from video, and that would learn not just representations but also learn to predict, because prediction is really the essence of intelligence.

And you could see this predictor here, which looks at a video and predicts the next segment of video, or a representation of the next segment of video, as a forward model: a forward model you can use for planning, for deciding how to act, and things like that. In fact, you can imagine that one input to this is the action you're taking. I take an action, it's going to affect the world, so what I'm going to observe is going to change because of the action I take, and my predictor might predict how the world will be affected by my action. If I have that, that's a forward model.

So why is it good to have forward models? You've seen examples of this in the truck backer-upper and things of that type that Alfredo talked about, but there is a general architecture for an autonomous intelligent system that relies on having a predictive model of the world. You want a predictive world model in your intelligent agent. And by the way, you have this in your brain: your entire prefrontal cortex, the front half of your brain, essentially does this, it predicts. At the bottom of your brain you have an objective function: a piece of your brain at the base of the brain, it's actually called the basal ganglia, or at least it's located in the basal ganglia, which is an anatomical name, that computes whether you are happy or not, comfortable or uncomfortable, hungry or not, thirsty or not; the things that are instantaneous, that tell you, right now I'm not in a good state. It takes the internal state of your brain and measures how happy you are instantaneously. So if someone pinches you, this thing lights up, because it hurts.

But simultaneously you also have another module, which we call a critic; I'm not drawing it with the usual boxes here. What the critic does is try to predict the long-term value of the objective. So it's basically a predictor, a temporal predictor of what the objective will tell me. If the first time we meet I approach you and I pinch you, let's say, I'm not going to punch you in the face, I'm not that nasty, but let's say I pinch you, which would be completely inappropriate, then the second time you see me you're probably going to stay away from me, because your critic is going to predict that it's likely that
bad things will happen, because the last time you got hurt, or at least it was unpleasant. So you're going to back out, and it's your critic that predicts that you're not going to be happy in that situation. It could rely on your predictive model, but it can also rely on a direct prediction of how happy your state is going to be.

So you have a predictive model and an objective, or a critic, all of which are computed by your brain; they are not things given to you from the outside. The objective gets external inputs, like your pain detectors, and also internal inputs, like whether you're tired or not, whether you're hungry or thirsty; internal sensors, if you want. They compute the instantaneous cost. This is really just a function, and the output of that cost can be seen as a target for the critic: the critic tries to predict future values of the objective, and it uses the output of the predictive model to do this. That's where reinforcement learning occurs, if you want.

Now, what you also need is an actor. The actor is the thing that proposes actions, action sequences, that you feed to your predictive world model, which allows the world model to predict what's going to happen and whether those outcomes are going to be good or bad, using the critic. So the actor proposes a sequence of actions, you run that sequence of actions through your predictive world model, you imagine what's going to happen as a consequence of your actions, the world model predicts what the sequence of states of the world will be, and now the critic can tell you whether this is good or bad. And so the actor, by gradient descent or some other optimization, tries to find a sequence of actions that minimizes the overall cost computed by the critic, possibly together with the cost computed by the objective. Everything here is differentiable, so this is an example of model predictive control, which you've heard about from Alfredo.

But there are a lot of tasks that are very repetitive. You learn a particular task, building a chair, or sailing, or driving, a task that requires a bit of knowledge of how the world works around you and of what the consequences of your actions would be, and as you practice more and more, you don't need to be as attentive to it as you were initially, because this whole deliberate process gets automated, essentially.

So that suggests, first of all, that there is another module that is super important, which is perception. Perception looks at the external world, let's call it X, and gives you an estimate of the current state of the world; that goes into initializing your world model. But here is the other thing I was just mentioning: we call it a policy. In psychology this is called system one, whereas the entire rest is called system two. The system-one policy takes the estimate of the state of the world given by perception and directly sends an action to the motor system. So this, if you want, is the motor system, whereas in system two it's the actor that produces the sequence of actions that goes to the motor system. Okay.
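To make the system-two loop concrete, here is a toy sketch (entirely illustrative: the world model, critic, dimensions, and optimizer settings are placeholders I made up) of planning by gradient descent through a differentiable world model and critic.

```python
import torch

# Toy planning loop: the actor's proposed action sequence is refined by
# backpropagating through an imagined rollout of the world model, scored by the critic.
state_dim, action_dim, horizon = 8, 2, 10
world_model = torch.nn.Linear(state_dim + action_dim, state_dim)  # s_{t+1} = f(s_t, a_t)
critic = torch.nn.Linear(state_dim, 1)                            # predicted cost of a state

s0 = torch.zeros(1, state_dim)                                    # state estimate from perception
actions = torch.zeros(horizon, 1, action_dim, requires_grad=True) # proposed action sequence
opt = torch.optim.SGD([actions], lr=0.1)

for _ in range(100):                                   # system-2 deliberation
    opt.zero_grad()
    s, total_cost = s0, 0.0
    for t in range(horizon):
        s = world_model(torch.cat([s, actions[t]], dim=1))  # imagine the next state
        total_cost = total_cost + critic(s).sum()           # critic scores the rollout
    total_cost.backward()                              # everything is differentiable
    opt.step()                                         # refine the action sequence

# A system-1 policy could then be trained to imitate `actions` directly from the
# perceived state, without running this optimization every time.
```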
I'm going to run out of colors here, but you could think of all this yellow stuff as what's called system two: planning, conscious, deliberate behavior, et cetera. System two is a term from the economics Nobel Prize winner Daniel Kahneman, who wrote the little book called Thinking, Fast and Slow. System two is the one you use for deliberate planning, and system one is basically just this part. The way you train this system-one policy is by measuring the distance between the actions that have been planned and the actions produced by the policy; the policy tries to match the actions that result from planning, so that it can directly compute an action from the estimate of the state of the world given by perception, without having to go through a whole phase of predicting and planning. You can think of it as amortized inference for the entire autonomous intelligence system: your system two arrives at a sequence of actions through optimization, by planning, and then you train your system one, a part of your cortex close to the motor-control areas, to react directly, and now you can do the task without having to plan.

This actually works in the human brain even for tasks that are very high level, like playing chess. If you play a chess game against a grandmaster and you are not yourself a grandmaster, the grandmaster doesn't have to think about how to play, because you are easy prey: they just look at the board and instinctively make the move. It's not challenging to them; that's basically just system one, just pattern recognition, which they've compiled because they've played so much. But you are going to have to spend twenty minutes per turn, because for you it's challenging, so you have to do the tree exploration and all that.

So that would be the architecture of an autonomous intelligence system. There are a few things in there that we don't know how to do, and the main one is really the predictive world model; that's what we don't know how to do, particularly prediction under uncertainty. That's what we need to solve. Okay, I'm going to stop here, and I think now it's time to hear about the best projects, the projects that got the best results. I want to watch those videos; it's a surprise to me, I've only watched one.

I didn't know you were finishing just now; one sec, slides. I intended to have a Q&A, but I took too long. We may have some time, actually; we have like 50 minutes, 45 minutes, so we may still have time for a Q&A at the end. We start with a slide; I'm going to crop the bottom part so you don't see who won at the beginning. So: virtual poster session, Deep Learning, Spring 2021. Thanks for putting together these slides and the whole competition; he's been taking care of everything for this challenge, so credits to him. We start with this diagram. We can clearly see there are five major groups on the left-hand side with quite a large score, above 40%. We can see several groups, like this group over here; at the end of the slide you're going to see the numbers too, but for now we don't. We see that someone really benefited from the additional labels but perhaps didn't use the unlabeled data too well, comparable, I would say,
to the team down here, whereas other teams, like this one, perhaps managed to do quite well with the unsupervised part but also got a major contribution from the additional labels. And of course the winning team got both: high accuracy from the unsupervised part plus a decent contribution from the additional labels. So let's start from fifth position and work our way up to the top; we're going to watch the presentations of these top five teams, and we'll have a few questions for the creators if they are around. In fifth position we have team 18, Jerry, which has a test accuracy of 41.66, so it's over here with the unlabeled part, and then 44.22 with the extra labels, so we go up to here. So we're going to watch the Jerry team now, team 18.

Hello everyone, today we will be presenting our approach to self-supervised learning on the image dataset provided in this competition. Often we have only a very limited training dataset available in the real world, but by using self-supervised algorithms we can learn useful features from the unlabeled dataset. In general SSL training, the model is trained on an unlabeled dataset using a pretext task and then later fine-tuned on the labeled dataset. Another task was to select a subset of images to be labeled, for which we used an active learning approach based on coresets, by which we choose representative examples from our dataset.

For our SSL task we went with Barlow Twins, but the question is: why Barlow Twins? The factors we need to consider before choosing a model are the amount of compute resources available and the objective function. Contrastive methods like MoCo and SimCLR need a large number of negative samples to learn good features, thereby requiring a large batch size and hence more compute. Barlow Twins comes up as the solution: with its objective of redundancy reduction and invariant features, it doesn't require large numbers of negative samples, so it doesn't lose performance at lower batch sizes. In the basic Barlow Twins pipeline, we have two augmented versions of the same image, which go through ResNet-50 architectures, and we create a cross-correlation matrix of the representations. The output matrix should be as close to an identity matrix as possible. In the loss function we have an invariance term, which increases the correlation of the on-diagonal terms, and a redundancy reduction term, which punishes cross-correlation between the off-diagonal terms; the lambda parameter acts as the weight of the off-diagonal terms to balance the two losses.

On this slide we visualize the features learned with the Barlow Twins SSL method. On the top right is the feature map from the last layer of the encoder; here it displays features of the legs and the body of the monkey. On the bottom right is the saliency map; white pixels represent the most important pixels of the input for the output. Despite the encoder only being trained with SSL, it has learned exactly where the subject of the image is.

For our labeling request we adopted the coreset approach. The objective of the coreset approach is to minimize the distance between any image and its nearest neighbor among the images we request labels for; by minimizing this value, we also minimize the difference between the full unlabeled dataset and the subset of images we request labels for. This is achieved by taking the full unlabeled dataset, passing it through the encoder to get a set of embeddings, using PCA for compression, breaking the embeddings up into 12,800 clusters with the k-means algorithm, and then taking the image closest to the center of each of these clusters; this becomes the subset of images you request labels for.
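A minimal sketch of that label-request step (settings and function names here are my own assumptions, using scikit-learn, not the team's actual code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Embed all unlabeled images with the SSL encoder, compress with PCA, cluster
# with k-means, and request the label of the image closest to each centroid.
def select_images_to_label(embeddings: np.ndarray, budget: int = 12800, pca_dim: int = 64):
    compressed = PCA(n_components=pca_dim).fit_transform(embeddings)
    km = KMeans(n_clusters=budget, n_init=1).fit(compressed)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(compressed[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])   # sample nearest to the centroid
    return np.array(chosen)

# embeddings = encoder(unlabeled_images)   # e.g. ResNet-50 features of the unlabeled pool
```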
For supervised fine-tuning, we train a linear classifier on the labeled dataset on top of the fine-tuned representations of a ResNet-50 model pre-trained with the Barlow Twins method. For data augmentation we used flips and random crops; for better convergence we used a cosine annealing learning rate scheduler, and to increase speed we used distributed data parallel training. As we can see, there is good convergence in classification, and the saliency maps attend well to the subject in most cases. We show good examples and bad examples here; in the bad examples we can see that the model fails to distinguish background objects from the object in focus. We ran several contrastive methods on the dataset and concluded that Barlow Twins gives better results at lower batch size compared to MoCo and SimCLR. Overall, Barlow Twins gave 41.75% accuracy on the original dataset and 44.45% accuracy on the dataset with extra labels.

Okay, that was great. Do we have questions for team 18, anyone in class? Or Yann, do we have feedback for team 18, if Yann is still around? Great, let's see, the camera. This is great; you are probably, I would say, the fourth or fifth team in the world to actually use Barlow Twins, so very few people have really played with it, because it's so new. Do we have questions from anyone in class for this team? I'm checking the chat; otherwise we're going to move to the fourth position. "How many epochs did you train?" has been asked here. But I guess I cannot... yes, you can speak, of course; it's easier than me reading everything, since it takes time to write out every answer. So, Barlow Twins we trained for around 200 epochs, and the particular augmentation strategy we used was the basic default from the paper itself, so there was color jitter, random noise, Gaussian blurring, random rotation, and the flips. And did you train the other SSL techniques for a similar number of epochs as Barlow Twins? So, we first went with SimCLR and tried to train it for 200 epochs, but we couldn't increase accuracy that much because our batch size was small; then we went with Barlow Twins and saw better convergence at 200 epochs than with the other methods. Cool, thank you, Rowan.

All right, so moving on: in fourth position we have team number 4, the Super Serial Learners; these are serious people. We have 44.8 at the beginning, so a rather large improvement from the unsupervised part, actually superior to the third-position team, and then a further improvement with the few additional labels. So this is team number 4, and we are going to watch the video now.

Good afternoon everyone, and welcome to our project presentation for deep learning. I'm Amber Thang, and I'm joined by my teammates Batman and Duc and Pete, and today we will talk about our MoCo model for this project. Our team used momentum contrast for unsupervised learning; in particular, MoCo has been shown to be effective in unsupervised visual representation learning tasks because of its self-supervised nature and its use of a contrastive loss to build a discrete dictionary over high-dimensional continuous outputs. Contrastive loss methods can be thought of as ways to build dynamic dictionaries over a variety of sample pairs in representation space. This dictionary is said to be dynamic in the sense that the keys are randomly sampled and the key encoder evolves during training.
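For reference, a rough sketch of that evolving key encoder, the momentum update, together with the InfoNCE loss the team describes next (a simplified toy version with assumed names and sizes, not the team's code):

```python
import torch
import torch.nn.functional as F

# A query encoder trained by backprop and a key encoder updated as an
# exponential moving average, plus the InfoNCE loss over a queue of negative keys.
encoder_q = torch.nn.Linear(32, 16)   # stand-in for the query encoder
encoder_k = torch.nn.Linear(32, 16)   # stand-in for the momentum (key) encoder
encoder_k.load_state_dict(encoder_q.state_dict())
m, tau = 0.999, 0.07                  # momentum coefficient and temperature

@torch.no_grad()
def momentum_update():
    # theta_k <- m * theta_k + (1 - m) * theta_q ; only theta_q gets gradients.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def info_nce(q, k_pos, queue):
    # q: (N, D) queries, k_pos: (N, D) positive keys, queue: (K, D) negatives.
    q, k_pos, queue = map(lambda t: F.normalize(t, dim=1), (q, k_pos, queue))
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # similarity to the positive key
    l_neg = q @ queue.T                                  # similarities to all negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive is at index 0
    return F.cross_entropy(logits, labels)

# One toy step: x_q and x_k are two augmented views of the same batch.
x_q, x_k = torch.randn(8, 32), torch.randn(8, 32)
queue = torch.randn(64, 16)                              # keys kept from past batches
loss = info_nce(encoder_q(x_q), encoder_k(x_k).detach(), queue)
loss.backward()
momentum_update()
```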
To look deeper into contrastive learning and the loss function, we'll share two core parts of MoCo; I'll hand it off to my teammate Duc. As we heard from Angela, MoCo has two encoders, one for the query and another one for the keys in the dictionary. The InfoNCE loss function captures this: it is a contrastive loss function whose value is low when the query q is similar to its positive key and dissimilar to all the other keys. To keep the key encoder consistent over time, MoCo implements it with a momentum-based update, where m is kept quite close to 1 and only the query parameters theta q are updated by backpropagation; this ensures that keys encoded at different times come from similar encoders. Specifically, we are using the improved version, MoCo version 2, which adopts some design improvements from SimCLR: adding a multilayer perceptron head, using additional augmentations, and a cosine learning rate schedule. For the extra labeling, we trained our model on the training dataset, then determined the most likely class for each unlabeled image, sorted them by those probabilities, and chose the instances with the lowest probability, so at the end we have the instances the model is least certain of; this is active learning with pool-based sampling and the least-confident query strategy.

Hi, my name is Fat Muin, and I want to talk about our training procedure. To utilize all the images available to us, we combine the images from both the unlabeled and the training datasets for pre-training. The image augmentations we used for pre-training are similar to the ones used by MoCo version 2 for ImageNet, such as random crop, color distortion, and Gaussian blur. We also calculated the mean and standard deviation of this image dataset, which are slightly different from those of ImageNet, to use for normalization. Next, I will talk about our training settings. The model we use for the MoCo encoders is ResNet-50; we found that the ResNet-50 model achieves significantly better performance than smaller networks while having a reasonable training time. We trained the model with a batch size of 512 on a system with 2 GPUs. We chose the SGD optimizer with an initial learning rate of 0.06 and a cosine annealing scheduler; although the 2-GPU system allows a much larger batch size, we found larger values to be less stable than the chosen one. The model is trained for 1000 epochs, and each epoch took about 13 minutes to train on a GCP system. For the classifier training, we use the SGD optimizer with an initial learning rate of 10, again with the cosine scheduler; when training with extra labels, we increase the learning rate to 20. We use a mini-batch size of 256 and train the model on 1 GPU for 20 epochs. Finally, we achieved a validation set accuracy of 45%, and 46.4% when using extra labels; we also show the previous leaderboard test accuracy.

What's the reason for using a learning rate equal to 10 for the classifier? I'm just reading your question. So, we are only training one layer for the classifier, and it is overfitting very fast, so we train for only a small number of epochs; and second, after hyperparameter tuning, 10 is the value we found to be most successful. I didn't actually quite get exactly how you selected the samples. Sorry, do you mean the dataset for pre-training? How you selected the samples for which you asked for the labels. Yeah, we just train our model, calculate the most probable class for each unlabeled image, and from that we choose the least certain ones, the ones with the lowest probability.
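A minimal sketch of that least-confidence selection (illustrative only; names and the budget are assumptions):

```python
import torch

# Least-confidence sampling: score each unlabeled image by the probability of
# its most likely class and request labels for the least confident ones.
def least_confident_indices(logits: torch.Tensor, budget: int) -> torch.Tensor:
    """logits: (N, num_classes) model outputs on the unlabeled pool."""
    probs = torch.softmax(logits, dim=1)
    max_prob, _ = probs.max(dim=1)            # confidence of the predicted class
    return torch.argsort(max_prob)[:budget]   # least confident first

# Hypothetical usage:
# to_label = least_confident_indices(model(unlabeled_images), budget=12800)
```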
Okay, so is it based on the maximum score — you select the ones with the lowest maximum score — or is it based on the entropy of the distribution? We take the maximum probability over classes for each unlabeled sample and then sort by that probability. Okay, thanks. Okay — sorry, I was muted; there are too many buttons here. Then after number four we have this other team, which is worse in terms of unsupervised performance but catches up with an actually larger improvement from the additional annotated images, which is team number — drum roll — team number 15, Lossless, with 43.34 initial accuracy, so lower here, and then 47.63. Okay, let's watch team 15's video. Hi, this is the video submission for team 15. To start with the task, we looked at the data set and tried to find similarities between images, for example checking whether the subject is centered, which turned out to be the case for a couple of classes — for what we assume to be water tank — but not so much for classes like room. This was done to see what sort of augmentations we could use to extend the training set. Secondly, we wanted to pick the right architecture: initially we used smaller architectures because the training set was small, but as we moved on to SSL techniques we realized the smaller models weren't able to capture all the information, so we moved to a deeper architecture; we tried even deeper architectures, but they wouldn't fit on our single GPU. Moving on to the pre-training methods we tried: we divided the techniques listed here amongst the three of us; however, with every technique apart from Barlow Twins we ran into an upper limit of what the model could learn. For SimCLR we reached a point where training the model further in fact gave reduced accuracy. As for Barlow Twins, we played around with the augmentations; for example, Gaussian blur improved performance despite the longer training time. A key observation we found interesting was that the standard ResNet finished training in 5 days for 1000 epochs, whereas the custom ResNet took around 9 days despite having fewer parameters. To build the classifier itself, we took the backbone we trained with Barlow Twins and experimented with different numbers of fully connected layers on top of it; the sweet spot was two layers with a ReLU in the middle, so as to project the embeddings into a higher dimension before producing class scores. To squeeze a little extra performance out of the classifier, we used pseudo-labeling. We evaluated this approach by checking how many confident images are correctly labeled on a validation set; this was done to find the percentage of data we should skim off our unlabeled set. In the most confident 3% of the data, 90% of the images were predicted correctly, so we had the model predict on the unlabeled set and took the images with the highest confidence scores, along with their predicted labels, to train the model further. We did this 3 times, each time taking 3% of the unlabeled set while reducing the learning rate by a factor of 10. We did something similar for our labeling request, too: we sorted based on the difference in the scores of the top 2 predicted classes of each image, in order to prevent outliers that you get by just taking raw confidence scores. Taking the top 100,000 images in this manner, we got their image embeddings using our backbone, clustered them using k-means into 800 clusters, and picked the top 16 images from each cluster to ensure an even distribution.
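(A hedged sketch of the label-request scoring just described: rank unlabeled images by the gap between the top-2 class probabilities — a small gap means an ambiguous sample — keep the most ambiguous 100,000, and then cluster them for diversity as in the earlier k-means sketch. All names here are illustrative, not the team's code.)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top2_margin_scores(model, images, device="cuda"):
    """Return the difference between the two highest class probabilities per image."""
    model.eval().to(device)
    probs = F.softmax(model(images.to(device)), dim=1)
    top2 = probs.topk(2, dim=1).values                 # (N, 2) best and second-best probability
    return (top2[:, 0] - top2[:, 1]).cpu()             # small margin = ambiguous sample

def most_ambiguous(margins, n=100_000):
    """Indices of the n lowest-margin images, to be clustered before the final pick."""
    n = min(n, len(margins))
    return margins.topk(n, largest=False).indices
```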
This was done using all the models we generated up to this point, with each model having a vote based on how well it performs. The main points where we saw big jumps in accuracy were: moving to a deeper architecture, training Barlow Twins for a higher number of epochs, using dropout between the two fully connected layers, and using weight decay in the optimizer. Great, so are there questions for team number 15? If team 15 members are with us, they can unmute themselves to reply. Yeah, this is cute — both the kind of self-training technique of using pseudo-labels from the unlabeled data (it's interesting that it works, and it's a good idea to use it in this context; this would qualify as semi-supervised learning, I suppose) and the way to select the active samples is also quite nice. Jeffrey is asking: did you do any specific compression for the k-means algorithm on the unlabeled data? We found half a million samples is hard to handle on a single computer. Half a million, yeah — no, actually, the k-means took a lot of time. We tried a couple of different k-means implementations and wanted something standard; I had worked on k-means earlier, and scikit-learn does a pretty good job of optimizing its k-means, and it ran in around half an hour, which was enough, so we didn't have to do any compression. As for the pseudo-labeling, it was very finicky in the beginning: when we were using ResNet-18 the accuracy actually dropped pretty fast and we weren't seeing any improvement, but when we switched the architecture to ResNet-34 we saw an increase of around 2-3% from pseudo-labeling. We also had to make sure that, each time we skimmed data off the unlabeled set, we applied the same augmentations we used on the training set; otherwise it tended to overfit. I see — awesome, thank you. Moving on — so this was team 15, right, so we now have the top two, I guess. So — drumroll — we have team number 20. Team number 20, ABC 1, 2, 3 — okay, a super ingenious name, I guess; just kidding. We have a test set accuracy of 50.51 — I think the first one over 50 — and then slightly better with the annotated data, so basically almost no improvement from the additional part. Anyway, let's have a look at team 20's video. Hello everyone, we are team ABC 1, 2, 3, and our team members are Song Yun, myself, and Colin; here is today's content. Here are the self-supervised learning methodologies we tried in order to leverage our unlabeled data set. The first two methodologies share a similar structure: the data is pushed through a feature extraction network before contrasting the positive and negative samples; however, they require either a large batch size or a large memory bank to be effective. The autoencoder learns features in its encoder; one can train a classifier on those features afterwards. In our experiments, even though the model produced good reconstructions, the features learned were not useful for classification. The last one works as follows: for the unlabeled data set, the network minimizes the difference between the prediction on a strongly augmented version and on a weakly augmented version of an image, and for the labeled data set the network minimizes the classification loss. In order to select the data that would best benefit our model, we computed the entropy of the predictions and selected the ones with the highest entropy; in other words, the extra labels come from the data that our model is least confident about.
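(A hedged sketch of the entropy-based label request team 20 just described: score every unlabeled sample by the entropy of its predicted class distribution and request labels for the highest-entropy ones. The function and loader names are illustrative assumptions.)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def highest_entropy_indices(model, loader, budget, device="cuda"):
    """Return the indices of the `budget` unlabeled samples with highest predictive entropy."""
    model.eval().to(device)
    entropies = []
    for batch in loader:                                   # loader yields image batches (no labels)
        x = batch[0] if isinstance(batch, (list, tuple)) else batch
        p = F.softmax(model(x.to(device)), dim=1)
        h = -(p * p.clamp_min(1e-12).log()).sum(dim=1)     # per-sample entropy of the class distribution
        entropies.append(h.cpu())
    entropies = torch.cat(entropies)
    return entropies.topk(budget).indices                  # the samples the model is least sure about
```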
Now I'm going to illustrate a framework called CoMatch. Different from most existing semi-supervised learning methods, CoMatch jointly learns the encoder F, the classification head H, and the projection head G, and jointly optimizes three losses: a supervised classification loss on labeled data, an unsupervised classification loss on unlabeled data, and a contrastive loss. In CoMatch, the high-dimensional feature of each sample is transformed into a class probability P and a normalized low-dimensional embedding Z. Given unlabeled samples, we first perform memory-smoothed pseudo-labeling on the weak augmentations, which reduces confirmation bias by leveraging the structure of the embeddings; then it constructs a pseudo-label graph W, which defines the similarity between unlabeled samples. Now let's look at the results. We trained using several architectures; notice that CoMatch achieved the best accuracy on both the 5%-labeled data set and the 7.5%-labeled data set within 400 epochs. It is also efficient to train CoMatch compared to other models such as SimCLR. Looking at the confusion matrix of our predictions, the accuracy and recall are high on the training set but relatively low on the validation set. The third plot shows that the predictions on the training set are very balanced; however, for the validation set we can see some under-predicted classes at the bottom left and some over-predicted classes at the upper right. Let's first look at what our model learned: given the butterfly image, our model successfully detected the shape and texture of the wings and also the background flowers. Then let's look at what our model failed to learn. The first type is the under-predicted classes we mentioned before; they have challenging characteristics like scale variation and intra-class variation. The second type is the over-predicted classes, like computer and comic book cover: the features of computers are very generic, like rectangles, and the covers of comic books are very varied — there can be any object on them. In the future we will pay more attention to those challenging classes and improve our model. Thanks for listening. Great, team 20 did very well. Do we have questions for team 20? If team 20 members are around, they can unmute themselves and take the questions.
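(As an aside, here is a purely schematic sketch of the three-term objective the presentation mentions — a supervised loss on labeled data, a consistency loss between weak and strong views of unlabeled data, and a contrastive term. This is not the CoMatch implementation; the weights lambda_u, lambda_c and the confidence threshold are assumptions.)

```python
import torch.nn.functional as F

def three_term_loss(logits_lab, y_lab, logits_weak, logits_strong, contrastive_loss,
                    lambda_u=1.0, lambda_c=1.0, threshold=0.95):
    # 1) supervised cross-entropy on the labeled batch
    l_sup = F.cross_entropy(logits_lab, y_lab)
    # 2) consistency: the confident pseudo-label from the weak view supervises the strong view
    probs_weak = F.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = probs_weak.max(dim=1)
    mask = (conf >= threshold).float()
    l_unsup = (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()
    # 3) contrastive term, assumed to be computed elsewhere on the projection-head embeddings
    return l_sup + lambda_u * l_unsup + lambda_c * contrastive_loss
```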
So semi-supervised learning is still in the race — that's interesting to hear. Do you think CoMatch is well trained, Jan is asking. We set the training horizon to about 400 epochs because of limited compute resources, and we used a cosine-decaying learning rate, so in the end the learning curve did converge, but we're not sure whether the accuracy would keep improving if we extended the training horizon. Okay. Do you have any sense of whether other teams got similar results to yours with Barlow Twins? You had a table with some results for Barlow Twins that were pretty bad, 25%. Did anybody else get similar numbers? I'm trying to figure out whether those numbers are typical for Barlow Twins, what one would expect for Barlow Twins, or whether other people got better results. Hi Jan, we actually used Barlow Twins, and even with a fairly long training schedule in our case it still could not compete with the CoMatch score they reported. So yeah. Is there any hint that some combination could work — I mean, they're kind of complementary, right, because one is semi-supervised and the other is self-supervised — do you think they could be combined somehow? I think the CoMatch structure itself contains some contrastive-learning structure: for example, they're essentially trying to push up the probability for views of the same image and push down the ones that are different, so even though it's not as explicit as in contrastive-learning models, they do have that element within the model itself. Right, okay. Then, moving to the last team — and after that we're going to have a few questions for Jan, so we will actually get to ask something. So, here we go; that was team 20, if I'm not mistaken. On top of the leaderboard: team number 2, MioNet, with a test accuracy of 55.8 with the unsupervised part, and then a test accuracy of 57, so basically one and a half points above when also using the extra labels. So let's see the video of team number 2 — congratulations, team number 2, you won. Hi, my name is Jan from team 2, MioNet. In this video Wen Jie, Jin Fu, and I will introduce our method, which achieved 56.02% top-1 accuracy on the validation set with only 0.5% of the data labeled. In our method we use Barlow Twins as the unsupervised learning method: it constructs a cross-correlation matrix from two representations of different views of the same image and encourages this cross-correlation matrix to be a diagonal matrix. We use FixMatch as the semi-supervised learning method: it predicts a pseudo-label from a weakly augmented image and applies consistency regularization between this weakly augmented image and a strongly augmented one. In this slide we introduce our method in detail. First, we obtain a feature embedding from unsupervised learning with Barlow Twins, and then we do a balanced pseudo-label iteration: in each iteration we use the current best model to predict the class probabilities of each sample, then we pick the top k classes for each sample to generate a pool, we pick the top p samples for each class from the pool, and we generate the pseudo-labels. We fine-tune our model with the pseudo-labels plus the training data set, and finally we fine-tune it only on the training data set. We run this for four iterations, keeping k equal to 10 and increasing p from 200 to 500 along the way. In the third stage, we fine-tune the model coming out of the semi-supervised stage with FixMatch, using it to further improve the model by training on the training data set plus the additional data set.
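(A hypothetical sketch of the balanced pseudo-label selection just described: the top-k classes of each sample define per-class pools, and the top-p most probable samples in each pool receive that class as a pseudo-label. k=10 and p growing from 200 to 500 follow the talk; everything else is an assumption, not the team's code.)

```python
import torch

@torch.no_grad()
def balanced_pseudo_labels(probs, k=10, p=200):
    """probs: (N, C) softmax outputs on the unlabeled set.
    Returns {class index: tensor of selected sample indices}."""
    n, c = probs.shape
    topk_classes = probs.topk(k, dim=1).indices            # (N, k) candidate classes per sample
    selected = {}
    for cls in range(c):
        in_pool = (topk_classes == cls).any(dim=1)         # samples listing cls among their top-k
        pool_idx = in_pool.nonzero(as_tuple=True)[0]
        if len(pool_idx) == 0:
            continue
        scores = probs[pool_idx, cls]                      # probability of cls for each pool member
        take = min(p, len(pool_idx))
        best = scores.topk(take).indices
        selected[cls] = pool_idx[best]                     # pseudo-label these samples as cls
    return selected
```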
Last but not least, we use test-time augmentation with multi-scale inference: we average the predictions from resized versions of the image, and that gives our final score. For the label-request task we use a simple active-learning method: k-means with cosine similarity as the diversity sampling method, to form 100 clusters from the whole unlabeled data set, and margin of confidence as the uncertainty sampling method, to choose 64 samples from each cluster. Here we show a visualization for different model weights: the final feature maps of the third residual block produced by ResNet-50. What's happening is that, given the input dog image, the feature maps in the right column concentrate on key points of the dog, like the head and the legs, while in the left column the activations are spread over the entire image, meaning the features learned by the baseline Barlow Twins are less discriminative and less clear. Here are some interesting things we found after analyzing the top 625 images of each class: some classes have lower accuracy because of class imbalance; we noticed a drastic increase in errors after the first 300 images of the sewing machine class, indicating a lack of data; and dogs have very high accuracy, and we believe the high number of different dog classes in the training set teaches the model more refined features of all dogs. Here I would like to conclude what we learned from this competition. For the unsupervised learning, a large model, a large batch size, and a sufficient number of training steps are crucial. For the pseudo-label iteration, we always searched for a good learning rate at each step, and a balanced pseudo-label iteration improves model accuracy. For the semi-supervised learning, FixMatch prefers a large ratio between the unlabeled data and the labeled data, and you can apply EMA at the end of training to smooth the model. In general, observing and analyzing the data after each stage helps a lot, and don't forget that test-time augmentation is a free lunch. Thank you for watching the video. It's definitely awesome — the video, the technique, and the final cut was super cute as well. Do we have questions for the winning team? It's very nice work. Yeah, it is really good. There is a question here: what is EMA? Exponential moving average. The EMA we use assumes the model has been pretty well trained by the end of training, so basically we average the weights along the way: we take the previous weights, weight them with a decay of about 0.99, and add in the new weights from the current stage. It helps the model converge well if the initial model is good, but this kind of technique might have issues if your initial model weights are not good enough. How did you manage such a large batch size for Barlow Twins? Oh, so there are different techniques: one you could technically apply would be gradient accumulation, but we were actually training on more GPUs than GCP was providing, because we found the GCP cluster too slow for our approach, so we trained the ResNet-50 elsewhere, and the batch size was 1024, if you look back at the slide. Also, another point is that we first tried ResNet-18 compared to ResNet-50: at the second benchmark we released, we were actually still using ResNet-18 with BYOL, and the accuracy was around 20%.
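(A minimal sketch of the exponential-moving-average weight smoothing the team just described: keep a shadow copy of the weights and blend in the current weights with a decay of about 0.99. The class and method names are illustrative; in practice one would usually also copy buffers such as batch-norm statistics.)

```python
import copy
import torch

class WeightEMA:
    def __init__(self, model, decay=0.99):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()          # averaged model, used for evaluation
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        """shadow <- decay * shadow + (1 - decay) * current weights."""
        for p_avg, p in zip(self.shadow.parameters(), model.parameters()):
            p_avg.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```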
So we said, okay, we cannot go with that, switched to ResNet-50, and it gave us a large boost. Yeah, the reason is that we got a pretty bad result on the first leaderboard, maybe around 10%; on the second leaderboard we actually had an error, and our score there was still around 20%. Sorry, we didn't mean to skip those two leaderboards. So basically, for the unsupervised pre-training we used 1000 epochs; for FixMatch we used 40 epochs; for the balanced pseudo-label iteration, each iteration used 40 epochs; so all the fine-tuning procedures used 40 epochs. Cool. So, I'm no longer on mute, I think. Yes, there are no more questions, so we have questions for Jan instead — we have many questions for Jan. We're going to ask a few, if Jan is still with us. Okay, Jan, are you here? Another Jan, sorry. Yes, it's not the first time I've been confused with someone called Jan. All right, so here we go — my bad, I should have specified the big one. All right, so, a question: what is the recipe to follow for any deep learning problem? During the competition we also found that models are still a black box and we couldn't find a concrete reason why one model works better than another, apart from visualizing the filters. So how can we go about this? Yeah, so the bad news is that there is no method — well, there is a generic name for the method: it's called engineering, and it requires an ability called an engineering knack. What is that? It's the ability to foresee in advance whether a solution you're proposing is going to work or not. And again, it's related to something I just talked about before, which is: what is your model of the world? Is your internal mental model of how things work accurate enough to predict whether architecture A is going to work better than architecture B, whether this trick is going to help or not? So that's part of the story: intuition for what is going to work, and then empirical validation. Sometimes your intuitions are proved wrong — it happens to everybody, including me; it happens a lot to me, actually, where things I thought were true ended up not being true. You have to accept that your intuitions are wrong, and in the face of overwhelming evidence you should change your mental model. That's the most difficult part: getting your mental model to fit reality. That's how we get the gradients, right — one observation at a time, we let the evidence tell our understanding which direction to change. How important is a PhD in applied AI research? Same person asking. It depends what your ambitions are, what type of work you are interested in. If you want to do research — research meaning you want to be able to talk about what you do, publish it in various places, and things like that — it's not that without a PhD you cannot get a research job, but it's much easier to get a research job if you do have a PhD, and it puts you in a better position to work on your own things and invent your own techniques, as opposed to reproducing other people's results. So if you are interested in blazing new trails, creating new algorithms, etc., you need to find an environment where people will let you do this, and that means either a university, a public research lab, or an industry research lab of the type that lets you work on your problems as opposed to their problems.
And that pretty much, more or less, requires a PhD. There are ways to do this without a PhD, which is basically to get hired as an engineer and work with other people who have PhDs, but it's a different kind of job. Aditya comes to mind: he just had an undergraduate degree, here at NYU, and he recently published the DALL-E paper; I'm not sure if his title is research engineer — it happens, and it doesn't really matter what your title is. There are a few research scientists at FAIR who don't have PhDs, and a lot of research engineers at FAIR who do have PhDs, and some who don't have a PhD but basically do a research job. So it's somewhat flexible, but if you want to put the chances on your side, you really want to do a PhD. I have met a lot of people who had the opportunity to do a PhD and chose not to, or started one and never finished, or never really thought about it, and many of them told me that they regretted not doing it. I have never encountered anyone with a PhD who told me "I really regret doing a PhD, I should not have done it." Okay — maybe we are just shy; kidding. Now, the thing is, it's difficult to get into a PhD program, so it requires a bit of effort, abnegation, motivation, and a little bit of self-confidence — which you will probably lose during the first year of your PhD and then slowly recover as you go. But you need to realize that you're probably not less smart than a lot of the people whose papers you read, so don't have any inferiority complex. That's the thing. A few more questions, because people actually wrote them and upvoted them, so we should ask a few. When reading a deep learning paper, we get to know the successful hyperparameters for a given model; usually, how much do the authors try and test to see whether the model works — that is, how many trials on average are needed to finally get the desired result? Well, most of the time they don't tell you. The authors don't tell you how much effort they spent tuning their hyperparameters or how much computation they used, although now some conferences actually require authors to report how much computation they used, because that sort of thing has to be taken into account, right? If you didn't get record-breaking results, but you did it with two GPUs in your dorm room and got close because of a good idea, I think that idea should be disseminated. Compare that to a team of 10 people at Google who used 2,000 GPUs for two weeks and did a systematic hyperparameter search — of course they're going to get good results, but which method, in the end, is more interesting? So it's a trade-off; both strategies are perfectly legitimate. There is some value to a completely empirical search based on computation, but not everybody can do that, and that's one reason some conferences are now asking how much computation it took. All right, someone is asking: what is the difference in deep learning research between a university and industry, like Facebook AI Research? So, those two things are complementary, really. I don't see one as superior to the other; they're really complementary, two different things, and it depends a lot on the university and on the industry research lab. First of all, what you have to realize is that the industry research labs
that you hear about, whose results you've learned about over the last 50 years, if not the last century — there's only a handful of them. The recent ones that you know about are Google, Facebook — Facebook AI Research — Microsoft Research, and there are smaller pockets of interesting research at NVIDIA and Intel, and then bigger companies like IBM and NEC, and of course Huawei, Baidu, etc. — although most of what they do is more applied, they have some really good contributions as well — and Yandex. There are quite a few companies, but it's a relatively small number. Then you go back in the past: there was a period in the late 90s and early 2000s when basically the only company with a really good research lab, where people were publishing and had influence in the field, was Microsoft, essentially. Go back further and there was AT&T Labs, and before that Bell Labs, and IBM Research, which was very productive, Xerox PARC back in the 1980s, and so on; General Electric even had a very interesting lab. Now, think about where all those labs come from: they are all labs of very large companies that are very well established in their markets and do not need to fight for survival from one quarter to the next. You cannot have research in a company that is fighting for survival, because there are just not enough resources; a company that still needs to find its place cannot afford to invest in long-term research and probably cannot afford to talk about the research it does. That's why the good research labs exist only in very large companies that are profitable and well established in their market and make over, say, 20 billion in revenue or something like that — it's not a fixed number, but that's the idea. Now, about the kind of research that takes place there: some labs are very bottom-up. FAIR actually has two sub-organizations, one called FAIR Labs and the other called FAIR Accel. FAIR Accel runs slightly more organized projects, with teams that are managed and everything, whereas FAIR Labs is very bottom-up: it's curiosity-driven research, there's a lot of collaboration, but it's sort of very organized chaos, if you want. Bell Labs used to be like that as well — at least the research part. So this is somewhat similar to the kind of research you would do in academia, except the incentives are slightly different. In academia, most projects involve students and postdocs, and their main motivation is to get papers out so they can get a job at the end, when they graduate or when their postdoc ends. So you will work on opportunistic projects that may not have a very bright future — they might be a dead end, but you'll get a paper out of them, so you can work on them. You can work on theory that may be irrelevant but will get you a paper at some theoretical machine learning conference like COLT, because it's brilliant mathematics, even though it may be completely irrelevant in practice. So the motivation there is different. In industry, you tend to work on things that eventually — maybe in the very long term, but
eventually — are going to be useful, so you tend to be less opportunistic; you tend to have more of a long-term goal and work towards it. In an organization like FAIR Labs, though, it's more self-motivated, if you want. Microsoft Research is very similar. Google has small pockets, like parts of Google Brain, that are bottom-up and self-organized; most other parts of Google Research and Google AI are much more organized and top-down. DeepMind also has some pockets that are very bottom-up, and a lot of it is more organized, like FAIR Accel. So it's a different style; you won't find those organized groups in academia very much. Last two questions, then we say goodbye. This one is from Colin: do theoretical proofs still matter in industry? It's very connected to what you were just talking about — for example, proofs that under some conditions ensure some property of a network; it seems most recent models are formulated from intuition by sticking different Lego blocks together. Well, okay, so I actually have a whole talk about this; the title is The Epistemology of Deep Learning. I gave it at the Institute for Advanced Study in Princeton a few years ago and it's on YouTube; you can watch it. It talks about the relative importance of, and the relationship between, empirical discoveries and theoretical discoveries, and the relationship between science and technology. Deep learning is part of computer science, and it's an engineering science: in an engineering science you're not studying nature, you are inventing new artifacts, and then you use the scientific method to analyze, explain, and predict what you observe in those artifacts. And it is very often the case in the history of science and technology that the invention of the artifact precedes the development of the science that explains it. A good example of this — maybe the best example — is the steam engine. The steam engine was invented in the late 1600s and was developed empirically, essentially by engineers and inventive people, for about 100 years before people developed thermodynamics, with the idea of the Carnot cycle and the concept of entropy and all that: the fact that there is a limit on the efficiency of a Carnot cycle set by the difference in temperature between the two heat sources, the second law of thermodynamics saying you can't have perpetual motion, and the fact that to build a thermal engine you need two bodies at different temperatures — and since those don't occur naturally, you basically have to heat one of them — things like that. Thermodynamics became one of the most important intellectual constructs and theoretical foundations of all of science; all of science uses thermodynamics one way or another, and in fact some of the methods we talked about, like variational inference, use mathematics that comes from thermodynamics. So it is very often the case that an artifact gets invented before the theory for it gets developed, and we are in a bit of that situation with deep learning. There is a lot of theory that pertains to deep learning — for example, theory that gives the limits of machine learning. I mean,
statistical learning theory, the Vapnik-Chervonenkis type of theory, tells you the limits of what you can do with machine learning. It tells you that a completely general machine, one that is infinitely powerful and could in principle learn anything, cannot actually learn anything without a huge amount of data. So necessarily you need either regularization or some other way of restricting the set of functions that your learning machine can realize or implement. And deep learning breaks some of the theoretical results you read in the statistics literature, because in deep learning we train gigantic networks, gigantic models with tons of parameters, on only a relatively small amount of data. There's a lot of work on the theory of that; it's a bit of a theoretical mystery, though less and less so now — it's more and more understood why this works — but it pretty much invalidates a lot of what you read in statistics textbooks. So I wouldn't say it's completely empirical. Probably the best work in deep learning — some of it is completely empirical, people come up with new architectures out of intuition and things like that — but a lot of it is also intuition driven by a theoretical argument. And I think the role of theory is mostly to tell you what is impossible. It's not that theory will tell you how to design a deep learning system, but it will tell you that there are limits to how well it can work regardless of which architecture you use, so there's no point searching for an architecture that's universal, for example — things like that. There's a lot of work on graph neural nets, on equivariance and invariance on groups and manifolds; there's a lot of theory to be done, and I find it absolutely fascinating. Last question — next question, then we say goodbye; actually I'm picking one, because we already answered one here. How do we encode physical constraints, prior knowledge, or expert systems into a learning-based method? Can we do it with supervised learning, for robotic systems? Well, okay, so for robotic systems: if you have a robot arm, for example, you can write down the dynamical equations of that arm — they're not that simple, but you can write them down — and that gives you a predictive model that tells you, if you apply this action to this arm, it's going to move in this direction at that speed, and so on. You can use this as a predictive model; if you write it properly, it's differentiable, so you can use it for planning, and in fact that's a standard way of doing planning in robotics, using model predictive control. We did this two weeks ago. Yeah, that's right. A lot of those impressive videos from Boston Dynamics and others use a lot of that, together with much more sophisticated techniques, of course, to take changing conditions and so on into account. But here is the problem: if you want to control a robot that grasps an object, or pushes objects to arrange them in a particular way, or puts a screw in a hole, or a rod inside a hole that's just the right size, there are heuristics you can use, but the simulators don't work well for this: a simulator does not simulate friction very well, usually; it's not very accurate.
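(Going back to the planning point for a moment: a hedged sketch of gradient-based planning through a differentiable predictive model, in the spirit of the model-predictive-control idea mentioned above. Here `dynamics(state, action)` and `cost(state)` are assumed to be differentiable PyTorch functions; none of this corresponds to a specific robot's model.)

```python
import torch

def plan_actions(dynamics, cost, state0, horizon=20, action_dim=2, steps=100, lr=0.05):
    """Optimize a sequence of actions by backpropagating through a model rollout."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s, total = state0, 0.0
        for t in range(horizon):                 # roll the differentiable model forward
            s = dynamics(s, actions[t])
            total = total + cost(s)              # accumulate the differentiable cost
        total.backward()                         # backprop through the whole rollout
        opt.step()
    return actions.detach()                      # execute (the first few of) these actions
```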
So if you plan purely in simulation, when you transfer to your real system it's not going to be accurate, because the simulation is not perfect. What you probably do when you are faced with that situation is build your own internal model of the robot: you start with a physics-based model, then you add some free parameters to it, and maybe you add a little neural net on the side for some parts of that model, so the system can adapt the parameters and make the model more accurate in situations where the pure simulator would not be. That's one way. Now, for the more general question: if you have knowledge in the form of rules and facts and things like that, how do you incorporate it into a neural net? I think the best bet would be approaches based on large-scale associative memories or information retrieval. You might have a large collection of knowledge, in the form of text, or of text reduced to statements; you can embed these in a vector space and then search in that vector space, so that whenever your system needs to answer a question or refer to a particular piece of knowledge, it can do associative retrieval. We've seen how to do this with those soft associative memories; there's quite a lot of work on this at the moment — you could think of transformers as exploiting this to some extent — and people are working on systems that use large-scale memories basically as knowledge bases. Okay, I think that was it for this semester. Thank you, Jan, for being with us and telling us all these nice things; you are so knowledgeable, and it's so pleasant to learn so much. On my side, I also enjoyed teaching so much. Although we were all completely remote, I think we managed to have proper illumination, proper lights, proper bears moving during class — I don't know, I had fun, although I do miss people in person, reacting and smiling. I don't see you smiling; I hope you are, and that you also had some fun with us. I definitely did; I hope you did too. Yeah, so thank you everyone for attending this class and for your attention. It's been a pleasure on my side as well. I hope this was useful to you, even if not everything was completely understandable; we keep fine-tuning this course over the years, so it's getting better — you're not getting the first version of it. Yeah, I changed everything halfway through this semester — I already apologized: given your energy-based-model framing, I now see things from a different side and cannot teach them the previous way, so I thought, okay, let's change everything, why not. But I think it actually came out quite nicely: everything is consistent, the colors are consistent, notation and symbols are aligned, so things are getting better. This is good; I think it's an improvement. One thing I should say is that you might think the energy-based-model way of presenting things is my own view — and it is; there's a relatively small number of people who really embrace that vision. I find it simpler and more understandable at the intuitive level than the purely probabilistic framework. There are a lot of models that can be explained through probabilities, but I don't think looking at the equations gives you any intuition for what is really going on; it might allow you to use those things blindly, but not really understand how they work. So the EBM framework is
really the simplest way I have found, relying on the minimum amount of prior knowledge, to understand a large collection of different models. So it might seem a little unusual to you right now, but I think the whole approach is going to take off — there's a whole workshop at ICLR on Friday, actually, on energy-based models, in which I have a keynote — so it's an approach that I think is gaining ground. You are maybe a little bit in the same situation as the students who took my machine learning and pattern recognition class in the early 2000s, the first few years I was at NYU, when I was teaching about multilayer neural nets, and some students wondered why I was even talking about neural nets when nobody was using them. I'm sure those people are now quite happy they heard about neural nets in my class; some of them actually did a PhD with me and are now highly placed executives at various tech companies, paid seven figures. So, I teach for the long term; the stuff you've learned has, I think, a relatively long shelf life, hopefully — so enjoy. Also, thanks so much to Vlad Subol, Josh, and Joon: they created the homework — which I think was amazing, really high quality — the homework and the grading of the homework, and they took care of basically everything around this competition. And then of course we had the graders, Tina and Iraj, who went over all your submissions and made sure everything was on point. So thanks to everyone behind the scenes who didn't appear in these videos. Oh, we have Vlad here — we don't know whether the camera is working, but yes, it works. Okay, so now we see everyone on screen. And with this last meeting here, we'll see you tomorrow for the last class, where we're going to learn how to do prediction and planning in a stochastic environment with minimization of uncertainty — all those nice things we saw so far, put together in the final lab of this course. Thank you again, see you tomorrow, bye bye, take care.