So, have you watched the two required videos? There are two videos you should have watched before coming here, in order to be able to digest today's content. Okay, I think I should have planned this better. Since you're here, I don't want to waste your time, so let me give at least part of it. Someone asked: "can you please give a light intro, or put it in the description, for newcomers?" There is a full description on today's video, but I guess it wasn't clear. I think we're going to redo this as an actual full class, so that we can also have back-and-forth questions and answers. My expectation was that we'd start the live stream at nine and everyone would show up at the same time, nobody delayed; but of course it's Saturday morning, wherever you are in the world, so that didn't work out as expected. Anyway, the table of contents was the following, and my computer's fan finally slowed down, which is maybe a good sign that we can actually get something out of this. There is a gentle introduction to energy-based models (EBMs), which we can do now; the other parts we'll skip for today and cover the second time around. Let's call this a streaming test: since this is completely new material, I'm open to practicing it with you. The first part is the energy perspective for the classification task we covered in practicum number three from last semester. That's what we cover today, plus the animations, the drawings: how to draw these things and how to get to the final spiral energy surface. So I'll give you a smaller fraction now, and we go for the full-blown version next week, or whatever the next iteration of this thing is. I'm figuring things out as this goes on, so bear with me; be patient, as usual. At least I'm having fun, and I hope you're having fun listening to me blathering on. All right. We start with some slides about these things here. In the previous videos we've already seen these two; this is the new slide, plus the visualization. Let me open PowerPoint; we should be able to do it. Hi Marco. Things are working; the only thing that doesn't work is the chat on this screen, but we knew that already. You should be able to see something now; I also have a preview screen here.
Very good. All right, let's give it a try. The objective of this first lesson is, once more, classification with a neural net, but then I'm going to introduce a new perspective: a gentle introduction to the energy perspective for classification. Why? Because later in the course there is an energy-based section of the videos, so think of this as video 2.5 or 3.5: you're supposed to watch this before those videos, so that you're more easily introduced to this energy view of things. Why? Because I think it's confusing. It took me four years to understand how it works. Maybe I'm a little slow, but it took effort, and now that I'm understanding it, I'm changing the way I explain things; hence this lesson. And it's going to be short: the original plan was a full lesson, broadcast worldwide, but that didn't quite work out; people didn't show up. So I'm going to do these short items instead: contained, small, clear, and hopefully okay.

So: artificial neural networks, ANNs; supervised learning; classification. And my fan just stopped spinning, which is a positive sign; the computer finally stopped complaining about broadcasting. Are you still there? Type something. Okay, everything works. Very good. Let's start; the computer is quiet.

The task at hand was to classify these points. We said we can write down the equations for this parametric curve like this: t goes from 0 to 1 and controls the radius of the spiral, and the x and y coordinates follow a sine and a cosine. Lowercase k is the current class, going from 1 to capital K, where capital K is the number of possible classes; in this case K = 3, since we have three spirals. So there are three parametric curves, and capital X can be X₁, X₂, or X₃, the points associated with each of the three spirals. Moving on, we add some perturbation, some noise, so that things look a bit more real than a sequence of perfectly generated points; we like reality. (The fan starts spinning again, but okay, whatever.)

So we have these points. What do we do when we classify? We'd like to separate the different spirals into different regions, but they are obviously not linearly separable. We covered this in the previous lesson, so I won't go over it in too much detail; I'm just going through the things we've already seen.
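Since we're recapping the data, here's a minimal sketch of how such spirals could be generated. The exact angular offsets, constants, and noise scale here are my assumptions, not necessarily those of the course notebook:

```python
import math
import torch

def make_spirals(K=3, m_per_class=100, noise=0.02):
    """Generate K noisy spiral arms; returns X of shape (m, 2) and labels c of shape (m,).

    Assumed parameterization: for class k, radius t in (0, 1] and angle
    2*pi*t + 2*pi*k/K, i.e. each arm is a rotated copy of the same spiral.
    """
    X, c = [], []
    for k in range(K):
        t = torch.linspace(0.05, 1, m_per_class)           # t controls the radius
        angle = 2 * math.pi * t + 2 * math.pi * k / K      # class-dependent offset
        x1 = t * torch.sin(angle) + noise * torch.randn(m_per_class)
        x2 = t * torch.cos(angle) + noise * torch.randn(m_per_class)
        X.append(torch.stack((x1, x2), dim=1))
        c.append(torch.full((m_per_class,), k, dtype=torch.long))
    return torch.cat(X), torch.cat(c)

X, c = make_spirals()    # X: (300, 2) points, c: (300,) labels in {0, 1, 2}
```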
This is a recap from the videos you should have watched: videos one, two, three from the practica of this Spring 21 Deep Learning edition. So, we said the data is not linearly separable: there are intersections between the straight lines and the spiral branches, and those are the issue. How do we fix that? Most people would say: "let's bend the lines." No; how do you bend lines? I don't know how to bend lines. We have neural networks, and a neural network is a sandwich of linear and nonlinear modules, where every linear module is basically a linear classifier, or a linear transformation. So we only have linear things followed by nonlinearities. The way I prefer to think about it is that I can somehow twist my data so that it fits into linearly separable regions at the last layer. (Someone says this looks like the example in the TensorFlow playground; no idea, I've never used TensorFlow.)

Here is what I tried to suggest: we basically unwarp the spirals. How did I unwarp them here? I took the parametric function I generated the spiral with and undid the angle: I multiplied the angle by a scalar, so there is no winding anymore; only the per-class offset remains, which is why you still see three separate branches. That's what I drew, intuitively, on the left-hand side. On the right-hand side I'm showing how the final decision boundaries of the network change during training. These are two different ways of seeing the same thing: on the left you see the unwarping of the data; on the right you see how the decision boundaries, seen from the input space, get warped by the network. The network is like a warping funnel: if you look from the bottom, from the input, you see the warped decision boundaries, which were straight at the top; or, the other way around, if you look from the output of the network, you see the input spiral de-spiralized by the network. So with a de-spiralizing network you either have a spiralized decision boundary or a de-spiralized input spiral. Makes sense, I think; I assume you're following along.

All right: classification data. We covered this in the previous lesson. We have points, 2D points in this case; I have m of them, and n is the number of dimensions we play with, so here n = 2. On the right-hand side we have the classes associated with each point: c₁ is the class of point number one, c₂ the class of point number two, and so on, up to the last one. There are m points, therefore m different c's; and these c's are the labels associated with the points.
Then we decided to use a one-hot encoding for these classes, giving us this capital Y, the collection of all these blue, bold y's. Each bold y has as many elements as there are classes; in this case we have three spirals, so three elements per y. They are all zeros except for a single one indicating which class is active. So these vectors are capital-K long, and we have m of them; a quick sketch follows below. Again, this is just recap, no new concept; there are video recordings covering it, so you should be okay.

Let me check the chat once again. I only see two people; did something happen, is everything working? Oh, you see 31 viewers, okay. I have no idea why this doesn't work for me: I see the messages, I just can't see whether people are online. All right, keep going; let me know if there are issues.

So, finally, we introduce the main concept of today's lesson (today I had planned something else, but we go with this, because that's how it goes): the energy perspective on this classification network. We start with this pink, bold, lowercase x at the bottom. Why at the bottom? Because a network is a hierarchical structure: if you climb a hierarchy, you go up. If you draw networks the other way around, they're upside down, which doesn't make sense. So the input of a network is at the bottom. What does this pink bold x represent? Tell me in the chat. (There's some lag.) "An input": yes. And more precisely, what is its size, its dimensionality, in our case? I showed you some spirals, so from these diagrams you should be able to tell me what x is, what the size of x is, and what the size of y is. "A vector in two dimensions": correct. (The YouTube chat is somewhere between five seconds and half a minute behind; I tried low-latency mode, and I should clap and see when the clapping arrives.) Anyway, that's correct: x has two dimensions. Then this x goes inside something called a predictor. What is a predictor? It simply predicts the hidden representation h for my final target.
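Here is that quick sketch of the one-hot encoding, assuming PyTorch:

```python
import torch
import torch.nn.functional as F

c = torch.tensor([0, 2, 1])        # integer labels for m = 3 points, K = 3 classes
Y = F.one_hot(c, num_classes=3)    # shape (m, K); each row is all zeros except one 1
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```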
This h, the hidden representation of the final prediction, then gets decoded into what? Into the prediction, which we'll call ỹ, y-tilde. This is a first difference from last semester. Why the tilde? Because the tilde on top of the y resembles the "approximately equal" symbol, so it's my way of expressing that ỹ is an approximation of what the blue bold lowercase y should be. I have my x on one side and my y on the other; from the x, I try to predict the y. Since it's an approximation, a prediction of y, I call it ỹ: "y, more or less."

So how do these two interact? Both ỹ and y go inside a box labeled C, where C stands for cost. The cost tells me how far my prediction ỹ is from my blue bold y. Now, how many inputs does this overall system have? I don't know if it's a tricky question; I haven't defined exactly what an input is. "Two inputs, ỹ and x"? "Input x and y": yes, exactly, Jonathan. x and y are the inputs to the overall system. Whenever we consider a model from this energy perspective, or I guess any perspective, there are two different inputs: one is the conditional observation variable (no, not to each layer; to the actual overall system). x and y are the inputs we provide our machine in order to do something, and the machine will try to predict one variable given the other: we try to predict the y, given that we observe both of them. Both x and y are observed; we provide both. And the network comes up with an estimate for this y, which is called ỹ. Again, the tilde means "circa", Latin for "approximately": if you see "circa 2000" or "circa the first century", it means roughly that year. So ỹ is the estimate, the prediction the model makes for the y associated with the observed x.

In this case we observe x, so we talk about supervised learning. Sometimes we will not observe x; x will be missing, and that's when we talk about unsupervised learning: there will only be y's. And note: most, I'd say all, of the literature you'll find online describes unsupervised learning as learning without y's, which from this point of view is backwards. The x is what you may observe; you may or may not have it. The y is what we try to learn; the y stays with us. y is our target, what we want to learn; we always want to learn something, like today we want to learn a lesson. x may or may not be there. Today we're talking about supervised learning, so x is with us. Anyway, the answer is correct: both x and y. Now, what is the output of this system?
Someone could argue that ỹ is the output of the system, but I will tell you that we consider as the only output of the system the sum of all the red boxes present. In this case there is just one box, so the box with the C inside is the overall output of this diagram; if there were multiple red boxes, you would sum them all together. These red boxes are simply a scalar value, basically telling you how far apart the two incoming inputs are; or, more generally, a scalar associated with whatever goes in. It doesn't necessarily have to be a distance; in this case it is one, and I'll call it a cost.

Summarizing: we have vectors, tensors, whatever, as input, and as output we have a scalar value. If ỹ and y are very similar to each other, this cost is low; and ỹ will be close to y only if the model processes my x correctly, turning it into a good ỹ. So, automatically, a low C means that the x and y look like they come from our training set; more on this in a second. Anyway, this C tells you the level of compatibility between ỹ and y, and indirectly the level of compatibility between the x and y we provide.

These are the equations for the computation of the hidden layer: a squashing function, a nonlinearity, applied to a rotation (a linear transformation) of my input; and then ỹ, my final prediction, is a rotation of the hidden layer passed through another nonlinearity. I like to think of the first stage as a predictor: we predict the hidden representation of my output, given the input x. And I like to think of the second stage, the g, as a decoder: it takes the internal hidden representation of my final stage and decodes it out. The f and g are arbitrary nonlinearities; we don't care which. This was also covered in the last video, but I'll repeat it anyway: ỹ is a function of x, so having ỹ close to y means the network was able to produce a good ỹ given the provided x, which is called the observation. This ỹ function maps points from ℝⁿ to ℝᴷ: real numbers, with as many entries as there are classes, whereas the blue bold y's were 0's and 1's, remember: one-hot encoded. The last symbol says the network maps x's to ỹ's. But if we expand and look more carefully, the ỹ function takes an input from ℝⁿ, goes through an intermediate d-dimensional representation, and then spits out K dimensions, the number of classes. And usually we want the internal representation to be much larger than the input; a sketch of the whole model follows below.
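Here is that sketch: a minimal PyTorch version of this predictor-decoder pair, assuming ReLU for f (the "positive part" mentioned later in the talk) and d = 100 (the hidden width quoted later); the soft argmax g is deferred to the cost:

```python
import torch.nn as nn

n, d, K = 2, 100, 3   # input dim, hidden dim (d >> n), number of classes

model = nn.Sequential(
    nn.Linear(n, d),   # rotation of the input
    nn.ReLU(),         # f: the "positive part", a squashing nonlinearity
    nn.Linear(d, K),   # rotation of the hidden layer -> the logits l
)
# g (the soft argmax) is applied to the logits when computing the cost,
# so model(x) returns the logits, not y-tilde itself.
```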
Why a larger intermediate representation? Well, that's connected to the part of the lesson I'm not covering today; we'll try it in the second iteration and see whether I can make it clearer. But briefly: we want a larger intermediate representation so that things are easy to move around. What does that mean? Next lesson I'll tell you more.

Anyway, finally we introduce this free energy and this cost. Here is the definition: the free energy, capital F, a function of x and y, expresses the level of compatibility between x and y. I give the model an x and a y, and the model tells me how well these two go together.

(Answering a chat question: no, the point of having d larger than n is not to encode more information; it's that the optimization becomes easier to run. If things stay in low dimensions, everything is compact: if I pull one part in some direction, the rest gets dragged along. I'm in 3D right here — well, in 2D on the screen — but if I pull one side of myself, the other side follows, because I'm compact. If you shoot me into a high-dimensional space, you can pull pieces of me around without the rest feeling the drag, because in high-dimensional space things are very sparse, very far apart; everything is roughly equidistant from everything else. This was part of today's planned lesson, but again: things in high-dimensional space are very easy to move around because they are roughly equidistant and there isn't much coupling between points. So the answer to "why do we go to a high-dimensional space?" is: because the optimization process there is much easier; things are very smooth. You already saw this in the video with the animated interpolation of those five spirals: one animation was very smooth compared to the other, which got all folded up. The choppy one used a network with only two neurons per layer — maybe even a deep network, but always two, two, two — while the smooth one used a hundred neurons in the hidden representation, which allowed a very smooth transformation.)

So we introduce this capital F, the level of compatibility between x and y. What is it equal to? As I said before, it's the sum of all the boxes we have. In this case we only have one box, so F(x, y) = C(y, ỹ), the distance between y and ỹ. You should now be able to tell that this free energy will be low if the model has been trained with these kinds of pairs. I haven't told you how this is trained yet, but you can already guess: if you provide (x, y) tuples — pairs, couples, whatever they're called — then you try to predict ỹ, given x, and you try to get this ỹ as close as possible to y.
So, if we train with these x's and y's and the network starts to output ỹ's close to the y's, then the cost will be low: the distance between ỹ and y will be small. And since F equals C, the free energy for (x, y) pairs coming from the training set has to be low: if the difference between ỹ and y is small, F takes that same small value. And how is this ỹ computed? This ỹ here is ỹ(x); so I could write C(y, ỹ) as C(y, ỹ(x)): C is actually a function of x, and F simply states the independent variables to be x and y. I hope that's clear, and that the chat works and people are still here.

Moving on: how do we train this system? (Hold on, the animation didn't work; let me fix it, we can cut things afterwards. Okay, fixed.) How do we perform this classification? We have that ỹ is the soft argmax of the logits, l, where the logits are the output of the final linear layer of the network. What is the soft argmax? It simply converts a real vector into a softer version of a one-hot encoding. Why don't we use the plain argmax and convert to an actual one-hot? Because then there are no gradients: with a binary one-hot output I don't know in which direction to change things a little bit in order to change the final output. It's the same reason the original perceptron didn't work: it was not differentiable.

Then we pick the cost C, the distance between y and ỹ, to be the negative log of the inner product between y and ỹ. y is one-hot; ỹ is the soft argmax of the output; multiplying the two together basically extracts the value of ỹ at the position of the one in the blue y. This is also called the cross entropy, or the negative log probability, but we don't care about names. Then we define the loss of the overall network to be the average, one over m (remember, we have m samples), of the lowercase curly ℓ's, the per-sample loss functions. What is the per-sample loss? By definition, in this example, I pick it equal to the free energy; this choice is called the energy loss. So the lowercase curly ℓ, in this simplest classification case, is exactly the capital F, which is simply, as I told you before, the minus log of ỹ at the location corresponding to the one. The lowercase c tells you the class of a given point, which is the location of the one in the one-hot vector, so this is ỹ at location c. So far: doubts? Is it clear? Are you following? "Yes, it's clear." All right, cool.
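Putting the pieces together, here's a minimal training sketch under the choices just made (energy loss = cross entropy), reusing the hypothetical `model`, `X`, and `c` from the sketches above; the learning rate and epoch count are placeholders of mine:

```python
import torch

def free_energy(logits, c):
    """Per-sample energy: minus log of y-tilde at the true class c (cross entropy)."""
    y_tilde = logits.softmax(dim=1)                       # soft argmax over the K classes
    return -torch.log(y_tilde[torch.arange(len(c)), c])   # pick the component at the one

optimiser = torch.optim.SGD(model.parameters(), lr=0.1)   # lr is a placeholder
for epoch in range(1_000):
    optimiser.zero_grad()
    loss = free_energy(model(X), c).mean()   # J(theta) = (1/m) * sum of per-sample F
    loss.backward()                          # gradients via the chain rule (backprop)
    optimiser.step()                         # theta <- theta - lr * dJ/dtheta
```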
Also, in the previous lessons we covered how this works; let's just quickly go through it again. (Oh, hi Joao.) Quickly, how this minus log works, since we already covered it in the previous video and I don't want to waste your time: c is the label. Given an x with class c = 1, the corresponding y is the one-hot with the first element equal to one. If my prediction is roughly (1, 0, 0), then the loss gives me roughly zero. If instead I make an erroneous prediction, the loss goes to plus infinity. Again, we've seen this in the previous video, so I won't go into much detail.

How do we train? Very quickly: we collect all the parameters into capital theta, my collection of parameters. Once more a change of name for the same thing: I call J my objective function, which I want to minimize with respect to the parameters θ of the model; not everyone does this, but some do, so I include it for completeness. Given J(θ) with a scalar θ, we start with an initial guess θ₀, giving an initial value J(θ₀). Then we check the derivative, the slope: we differentiate J with respect to θ at θ₀. The green curve is itself a function, the derivative of my objective function; evaluating it at θ₀ gives the slope at that point. Since I want to move down the slope, and the derivative is positive there, I move to the left: I use the negative of the derivative to go down the slope. This is called gradient descent: using the gradient to go down the hill. How do you compute the gradient? Using the chain rule; and since we like fancy terms, we call this backpropagation.

The second part of this lesson came from a question after class this Monday (I taught this lesson on Monday). Since we're minimizing the free energy here — you see I'm deciding that the loss of the system is the free energy, F — the question I got in class was: can the free energy still be high after we train the network? So, you people in the chat: if I choose my loss to be my free energy, and by gradient descent I minimize this scalar value — the loss, the objective function, different parameterizations of the same thing — and we completely train the network, can the free energy still be high? Someone suggests underfitting: no, no. The network is perfectly trained, perfectly overfit even; it learned everything learnable; this free energy, this loss, is completely minimized. Jonathan says: "I guess it will be high for wrong inputs." Jonathan maybe knows the answer already.
"If the network has enough capacity, the free energy should be low": not necessarily. "No, the energy is low": okay, then no. All right, there is some confusion; that's actually good: the next part answers exactly this question. If there were no confusion, I'd be wasting your time, so I'm not wasting your time. Very good.

So let's check it out; and yes, there are notebooks to do this, I'm getting there. Here are my points: I show you five spirals because it's pretty; more colors, rainbow. These are my x and y points. I train my network, and the network does very well. Okay, not perfectly: there are some edges here that are not correct, some points that are not correctly classified, but they're just a few; the majority are correct, so we can consider this a well-trained network. The free energy has x and y as inputs. "So points outside the training set can still have high energy." Yes, that's the correct answer, and I'm going to show you right now.

Do you remember how we defined the free energy today? The free energy equals C: F(x, y) = C(y, ỹ). And how did we define C today? Are you following? I'm reading the chat, waiting for you to answer; you can't even go check the slides, because the slides are not online. "The distance between y and ỹ": yes. And how did we define this distance? "The log of ŷ, almost the cross entropy": almost, yes. Can you write down the equation? "y times log ŷ": well, it's ỹ, not ŷ, but okay. "Minus...": there you go, Nishan answered correctly. The answer is minus log of ỹ-transpose y, or y-transpose ỹ; either one, it doesn't matter.

Let's go back and double-check. Here is our definition: my C is either this expression or this one, but this second one uses the label c directly, which I like less: here we need access to the c, whereas there we don't. So here I'm simply saying the two are the same: the minus log of the correct component, the value of ỹ corresponding to the one in the blue y. If the blue y is one-hot and you multiply it with this vector, you extract the single item aligned with the non-zero entry of y. This thing is a scalar product: you do all the products, multiply almost everything by zero, and only the ỹ element matching the non-zero item of y survives. (Oh wow, a retracted message; you can remove messages, I see. Cool.)

So this is my choice: the negative log of the inner product of the two. And we also figured out that each component of ỹ lies strictly between zero and one, both excluded. So what happens if you compute the minus log of such a number? Why the minus log? That's the choice of cost.
My cost is the cross entropy, and the cross entropy equation is this one. Today we don't go into the why, because there are many options; the question from the chat is legitimate: there are other options for the cost of a classification model. In this case we pick this one, which is very popular, because I don't want to introduce too many variants from what everyone does these days, at least not today. So, good question, but I'm not listing the alternatives now.

So we take the minus log of this item, which ranges in the open interval between zero and one, both excluded. Therefore, what is the range of the minus log over the open interval (0, 1)? Answer? I know there's some delay, so I'm waiting. What is the range of this free energy? Either the chat doesn't work or people aren't typing... okay, someone is typing. "Zero to plus infinity": yes, that's correct. Why? Because the minus log of zero-plus is plus infinity, and the minus log of one-minus is zero-plus. So if you get the prediction almost exactly right, you get roughly zero; if you get it completely wrong, you get basically plus infinity. A small clarification on my notation: I distinguish between "infinity" and "plus infinity". Since I don't have a symbol for plus-or-minus infinity, for me "infinity" means both ways, plus and minus; "plus infinity" is the positive infinity, "negative infinity" the negative one.

Anyway, let's have a look at this energy. The question was: can we have high energy even after we train the system? Let's check it out. Here I'm showing you the cross-entropy free energy when I pick y = (1, 0, 0, 0, 0). In this case we have the five spirals, so capital K = 5, and I can pick y to be any of (1,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0), (0,0,0,1,0), or (0,0,0,0,1): all five possible one-hot picks for y, i.e. the red, orange, yellow, green, or blue class. Given one of these picks, you select one of the components of ỹ — one of the soft argmax outputs — of which I compute the minus log. Here is what you get: in this case I pick y to be the first class, the red class. The whole region on the bottom right is roughly zero, as you can see from the color bar, and over here you have the first level curve, of height one: everywhere along this curve the energy equals one. And it keeps growing: here you have the level five, and every four more you get nine, thirteen, seventeen, and so on. Inside this region where I'm moving my mouse, the values stay between zero and one.
Outside this region, things go up rather fast. Here the gaps between the level curves are quite large, but as you move towards the left-hand side (maybe I should use the pointer), the level curves get closer and closer, which means the surface gets steeper and steeper. So this energy is bounded below by zero, as you can tell, and everywhere outside the red region it goes up very quickly.

How about picking a different y, say (0, 1, 0, 0, 0)? These are all the energies I have: the minus log of the soft argmax, for each of the components. You have a vector of five elements, and I'm plotting the minus log of each item. As you can tell, each of them has a flat region in correspondence with its own category, and then a kind of wall going up very quickly. Since you can't really tell from the level curves how quick that wall is, let me show you in 3D for the red case: here you see the wall. These are exactly the same lines I showed you in the previous slide, but now with the height... oh my, what happened? I think I missed a chat question: "how are you generating ỹ again?" Here, ỹ is the soft argmax of the linear output of the model. My model has two layers; it's very simple: the first hidden layer is the positive part of a rotation of the input, and ỹ is the soft argmax of a rotation of the hidden layer. The hidden layer has 100 units. My F, or C, is the minus log of ỹ at the specific location c. And if I cycle through all possible y's — (1,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0), and so on — I get all the options. So what I'm showing you is the minus log soft argmax. It's actually a function in PyTorch: if you look it up, there is a log softmax, torch log_softmax; they forget the "arg" in the name, but it's really a soft argmax. So this is the negative log soft argmax of the logits, this internal thing. Let me circle it; pen... I don't like the red color, let's pick a different one; such bad colors. The thing inside here is my logits (maybe I should bring the tablet so I can draw this properly), and this g here is the soft argmax; what I'm showing you is the negative log soft argmax for one of the five options.

So, back here: I chose y to be the class of the red dots, the first class, and I'm showing you the free energy for all possible values of x. These are the level curves; as you can tell, the surface goes up rather quickly, like a wall, and the bottom here is flat. All through here, you can see this curve; I don't know if you can see my mouse cursor, but I'm highlighting one level curve.
That's the level curve for the value zero: this value here is actually zero; I'm moving my mouse along the zero level. So this energy is completely flat at the bottom, and as you go outside the red region it goes up, I'd say almost linearly; you can sort of see it. If I show you this one again: all this region here is flat, and then there's basically a kink and it keeps going up. I'd have expected an exponential curve here, but it looks like it climbs roughly linearly; I don't know, I'd have to check.

Anyway, this can be sufficient for today; if you got it up to here, very good, and kudos to you for sticking around. How are we on time? It's been an hour and a half, but we spent half an hour waiting for people to show up. This was the thing I wanted to show you. There are notebooks as well, which we can go over; I'll be online afterwards, so you can also look at those yourself.

Alternatively, we can consider the logits themselves to give my free energy; depending on what you take as the output, you get different losses. In this case I said we consider F = C, and hold on: F equals C in every case, since F is the sum of the red boxes. But we made an assumption, namely that C is the minus log of that inner product. Instead, I could say that the output of the network is just the logits l, and then my cost would be comparing the logits with the y's, which is a bit weird; that's why I didn't go that way. Before, the C told you the distance between ỹ and y; but if I consider the logits to be the actual output of the system and the source of my free energy, then C no longer tells me the distance between y and ỹ. Still, there is this other, alternative interpretation, and I mention it because it can be helpful: forget about the C, and take F, the level of compatibility between x and y, to be the negative logits. The higher the logits, the lower the free energy, the more compatible: we can consider minus F to be the logits. Then the loss has to include the minus log of the soft argmax of (minus) the free energy: either you put the soft argmax inside the loss, or you put it inside the F. So, potentially, we can consider F to be the negative logits; that's the alternative perspective. And I wanted to show you how these energy maps, these levels, look for the logits. Here I'm showing the negative logits: picking the free energy equal to the negative logits, you get these level curves. This surface is no longer as steep as the other one, and it's not flat on the bottom.
So: with the cross-entropy free energy you get the flat region; in this other case you get basically evenly spaced, almost straight level lines. Things look different. These, again, are the negative logits: low energy means low value of the negative logits. And I can pick different targets y; for all of them, the curves look more or less uniform — certainly more uniform than in the previous case. Here I'm showing how they look for the red class again: maybe less intuitive than what we got when we picked the free energy to be the cross entropy. So this is an alternative choice of scalar value you can extract from the network, and it might be more convenient in some cases; that's why I'm showing it. Also, this one is no longer bounded: for the sake of illustration, this surface goes from about minus 30 to plus 20, whereas the other case, the cross entropy, was bounded below by zero, only positive. Considering the negative logits, the energy is an unbounded real number. And things look pretty smooth.

So you can see that the flat base in the other plot comes from the use of... let's see if you can follow up to here; I know I'm talking about things I never explained, so I'm a bit confused as well. This plot is the output of the model: the logits, or the negative logits, with the sign swapped. In the other case, you have the minus log on top. The minus log alone wouldn't necessarily give you that flat bottom; the minus log definitely gives you the plus-infinity part, the wall that goes up very quickly, associated with its argument going close to zero. So where are we probing the network? You have your network: input, predictor, decoder, logits. Here I'm showing you the logits with inverted sign. Now, where do the logits go next? Answer. "The decoder": nope. "The minus log": nope. "The C": nope. How do we compute the final loss — where do we send the logits? ... "The soft argmax": yes. The logits go inside the soft argmax, where you take the exponential and divide by the sum of all the other exponentials. So if a logit is really negative, the numerator, the exponential of a very negative number, gives a value that is very close to zero; we saw that before. So here you can see the negative logits going very, very far down in this region; putting such a very negative value inside the exponential gives you a value very close to zero.
But wait, there is also the negative log; many things are happening at once. We don't put the negative logits into the soft argmax, we put the positive logits: positive logits inside the exponential, divide by the sum of all the exponentials, take the log. Maybe I should write the formula down, so we can reason through it together next time. What happens with this formula is: if the correct logit grows a lot, the ratio becomes this large number divided by something dominated by that same large number, so the whole expression tends to one, and the negative log of one gives me zero. Okay, I'll take a note to expand that a little.

The reasoning I wanted to walk you through is why we have a flat region over here, in correspondence with the negative logits that keep going down. If the negative logits keep going down, you should be using the soft argmin, which is the version with the minus inside, because that's what we should use when we work with energies. I'm mixing several things, so let me untangle: these plots are the negative logits, so the logit itself goes up; what you see going down as negative logits means going up as positive logits, because the equation I'm showing is in terms of positive logits — the soft argmax, which has this expression. (There's also a beta term, but it doesn't matter here.)

I think what I want to show you — and I'll edit this in for the next iteration, in class on Monday — is the soft argmin version, which makes this easier to reason about. The soft argmin has a minus inside the numerator's exponential and a minus inside the denominator's. So if you take the soft argmin of the negative logits, the two minuses cancel, and you get exactly the soft argmax of the logits. All right. So: the soft argmax is very close to one for very large values of the logits, and the soft argmin is very close to one for very negative values; the one is the mirror of the other. Why? The soft argmax tells you where the argmax is; the soft argmin tells you where the argmin is. If the largest value is very large, the soft argmax becomes an argmax: you get basically the one-hot. Similarly, the soft argmin, given a very negative value, concentrates on the minimum. This should make sense: the soft argmax gives you a distribution of probabilities, a softer version of the argmax, as long as the values are all comparable. If two values have the same height, each gets 0.5; if one value becomes very large, the soft argmax converges to the argmax.
Similarly, the soft argmin tells you where the lowest values are: if two values share the same most-negative value, you get 0.5 and 0.5 at those two locations; if the most negative value is very negative, the soft argmin becomes an argmin: a one in correspondence with the lowest value. This should make sense. (So I've effectively created a new slide with the soft argmin, which is what we use when we talk about energies; we'll see it in future lessons, but I hadn't explained it, so yes, I'll add one extra slide. Let me finish the explanation, then I'll take both questions.)

So: if you take the negative logits as the free energy, then we use the log soft argmin. With very negative values, the soft argmin becomes an argmin, and its output will be exactly one for these regions of points over here, where the values are around minus 25 to minus 30. Given such a very negative value, the soft argmin says: look, all the other logits, the other planes, have higher values, so almost the entire probability mass goes to the minimum. Therefore, as we've seen, all the points in these regions receive a one from our soft argmin; and since we then compute the negative log of that, it goes to zero. So this whole area gets most of the soft argmin's mass: this is the output of the soft argmin, and this is the output of the negative log of that item. And that's exactly how it looks: all this area here is set to zero, because we compute the negative log of one. The energy here is completely flat, and outside this region the surface climbs steeply in this direction and that one. I think it makes sense, despite my amazing drawings; I really should get the tablet next time.

Okay, I'm taking questions now; at least for someone it makes sense. I don't think I'll leave this video online, because again, it was a test — a very good test we ran this morning, waking up at seven on a Saturday. It didn't work out as I expected; I'll try to plan something better. Let me take the questions, starting with Marco's. Yes, there is a notebook; let me actually show you where it is. Should I save the annotations? Yes, let me save them, so I can edit the slides later. (Sure, I'll answer that in a second.) Okay, at least I know how to click around and get full screen. Terminal... actually, it doesn't make sense to go full screen while I'm showing you the terminal. I should buy the keypad to switch between scenes in OBS; again, first time streaming with OBS.
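To make the two choices of free energy and the soft argmin identity concrete, here's a sketch of my own (not the course notebook) computing both energy maps over a grid, reusing the hypothetical `model` from above:

```python
import torch

def softargmin(l, beta=1.0):
    # softargmin(l) = exp(-beta*l) / sum(exp(-beta*l)) = softargmax(-beta*l)
    return torch.softmax(-beta * l, dim=-1)

# Grid of input points covering the plane
xs = torch.linspace(-1.5, 1.5, 200)
grid = torch.cartesian_prod(xs, xs)               # shape (200*200, 2)

with torch.no_grad():
    logits = model(grid)                          # shape (200*200, K)

k = 0                                             # pick y = one-hot for class k (red)
F_ce  = -torch.log_softmax(logits, dim=1)[:, k]   # cross-entropy F: flat floor + wall
F_neg = -logits[:, k]                             # negative-logit F: unbounded planes
# Note: -log(softargmin(-logits)) reproduces F_ce, since softargmin(-l) = softargmax(l).
```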
This has been a very interesting experience; I learned something, and I also plan to use the iPad for many things. I'll answer the questions, give me a second. Where to get this stuff: if you go on GitHub and clone the NYU deep learning spring 21 repo (git pull... oh, it's already pulled, okay), then, as we know, `conda activate pDL`, and run `jupyter notebook`. Okay, it works. Then we have this new notebook, "spiral classification": same title as last semester, but new content; I made this yesterday. So if you run everything — Kernel, Cell, Run All — let's see if it actually runs. We have bugs. First of all, the data generation is a little nicer than what's online, which was really crappy. And the model actually worked. We're picking a linear model in this case; you have to uncomment this line to pick a nonlinear model. With a linear model you get linear decision boundaries, which is what you'd expect. Let me go full screen so you can see better. So: linear decision boundaries, and it even ran through and computed everything. Interesting: these are the energy levels. Wow, okay; by mistake I actually went through this: this is the cross-entropy free energy when the model is a linear model. As you can tell, you only have this kind of shape at your disposal. Interesting; I'll answer the questions in a second. I didn't know this; cool. And then this one didn't work; what happened? This is how the free energy looks for the linear model; it's pretty choppy, let me make it a bit smoother. You see? These are the energy levels for a linear network: it cannot shape the free energy around the spiral; it only has this kind of well available. This is something I hadn't even seen before, so it's new for me as well. (I will answer those questions, just bear with me.)

So that's one option for the free energy. The other option, we said, would be to pick the negative logits; let's see how that looks instead. I'm picking the negative logits here and visualizing them (oh, can we have inline plots here? Okay). So, this is interesting: these are the negative logits, going from minus five to plus five, for the red class, and they are aligned with this branch here. And you can see — wow, okay, this is so cool — the one aligned with class number one, the orange class, is parallel to its branch; this one is parallel to the yellow one; this one is parallel to the green one. Okay, okay: I'm changing my Monday lesson. Thanks for trying this out with me. I've decided: I'm going to use these live streams to try out my new lessons before teaching them.
Because then I learn something while I'm talking to someone; otherwise it's me talking to no one, and then I feel lonely, right? So if it's okay for you people online, you're going to be my guinea pigs and I will try my new lessons on you before my students. Because the content I put online is tailored for my NYU students, who have specific prior knowledge, so the pace and — yeah, these are dry runs. I think I will do this, because with the students I have here I know what they know, since I taught them, and they also took the prerequisite classes with my colleagues, so I can have a sort of conversation with them on these topics. Whereas I can't really have that with random people from online. I mean, you're more than welcome here, but there is much higher variability in who joins, when they join, whether they join at all. And also the tools — we kind of figured them out together today. Anyway, so these are linear — I actually still have to think through why these are lines; I don't know right now. But these are lines, they are the output of my linear layer, and the thing we notice is that these lines are oriented in correspondence with these branches of the spiral. All right, cool. So let me show you the 3D version for the negative logits, and these are the planes. I guess I should change the limits for the untrained network versus the trained network — or maybe it's fine to keep these limits, so you can see how things change. This was too slow; let's go with 4. So this is how the untrained logits look for the last class, the purple one. And it's a plane — or it looks like a plane. It should be a plane; let me think about this. It is a plane, because it's a matrix: for every (x, y) input location I get one value of the logit, and to go from the (x, y) location to the logit I only have matrices. Then of course it's a plane: you have z = a·x₁ + b·x₂. And that's why you get those level lines that are equally spaced and parallel to each other — because it's a plane. Of course it's a plane. The thing that is no longer a plane is when you introduce a ReLU. So if you uncomment this line, the model should now be training, yeah? So those were the decision boundaries, which are just planes given the input. Now we train — oh, it doesn't work. What happened? Bugs? Oh, okay, of course: 2D, 2D. I think I have to set some axis option. All right, so this is the network being trained with a nonlinearity, an activation function. And then let me show you these things: these are the logits — as you saw before in the presentation, the negative logits — for my five classes. And there's even a spinning one, so you can see how the plane we just looked at — okay, maybe this actually makes sense, right? So before — interesting — this is like a slight variation. Okay, interesting.
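To see why stacking linear layers still gives a plane per logit, here is a quick numerical check (a sketch with illustrative sizes): two stacked Linear layers with no activation in between collapse to a single affine map, so each logit is an affine function of (x₁, x₂) — a plane.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(2, 100), nn.Linear(100, 5))  # no activation

# Collapse the composition: f(x) = W2 (W1 x + b1) + b2 = W x + b
W = f[1].weight @ f[0].weight
b = f[1].weight @ f[0].bias + f[1].bias

x = torch.randn(3, 2)
assert torch.allclose(f(x), x @ W.T + b, atol=1e-5)  # identical: each logit is a plane
```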
I think if you use a linear model, this is a plane. We just found that out — and well, we should have known, but we might not have thought about it. Anyway, so this started as a plane, and after the model has trained it turned into this kind of deformed plane, which for each of the directions — so this one, I picked the direction that goes this way, the one for the purple class. Let me show you another one — which class do you want? Let's do the green class, so it's going to be zero, one, two, three: let me pick class number three. Okay, so this is for the green class. Let's see how this one looks — it's very similar. Let me run it again. You can see they are very similar: slightly deformed planes. So this final transformation is not really that far — at least it doesn't seem to be that far — from the original linear network. Okay. And then let me show you the cross-entropy: I'm going to show you the cross-entropy free energy for the trained model. Maybe I can show you all of them. These are the ones you saw in class — here you can see how I made the animation — and I can show you the spinning version. And see here, you can see the one I was telling you about before, with these vertical edges and then the flat region down at zero in this part over here. Okay, all right. So I'll include the linear thing, the soft argmin — what else did we say? I think these are the basics. So let's get some questions. There were a few. Hi. Okay. "What happens if we choose as cost 1 − y⊤ỹ?" You can try it, Marco — you have the code, you have this notebook, so you can choose the cost yourself. Here in this case I chose the cost C to be this cross-entropy, and then the loss is just the mean of this cost per sample. You can see that the notation is actually the same as in the slides. So we chose the free energy to be this cost, and the cost is the cross-entropy computed from the logits. And so in this case C actually gets the logits as input, not the ỹ: the ỹ is computed inside C. So the output of the model is the logits, not ỹ, and C contains the negative log of the soft argmax inside. Alternatively, you can combine the soft argmax with the negative log, like the NLL criterion. Anyway, you can decide how to choose F: you can choose the C, you can create your own loss equal to whatever you want to try, and then see how it looks. You can try this out yourself, okay? I'm hungry — I didn't have breakfast. Hello then. "So are we using the soft argmax to make the boundaries of the logit results more interpretable?" No, the soft argmax is part of the training procedure — this is also called maximum likelihood estimation. The soft argmax is necessary in order to compute a ỹ that I can compare to my y. So I have my x, a 2-D point, and a y, the one-hot.
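Here is a sketch of that cost, as just described (the names are illustrative, not the exact notebook code): C takes the logits directly, computes ỹ = softargmax(logits) internally via the log-softargmax, and the per-sample free energies are averaged into the loss.

```python
import torch
import torch.nn.functional as F

def cost(logits, y):
    # cross-entropy = -log softargmax(logits)[y], one value per sample;
    # y_tilde = softargmax(logits) is computed inside, in a numerically stable way
    return F.cross_entropy(logits, y, reduction='none')

logits = torch.randn(8, 5)        # batch of 8 points, 5 classes
y = torch.randint(0, 5, (8,))     # integer class targets
loss = cost(logits, y).mean()     # mean of the per-sample free energies
```

Swapping in your own `cost` — such as the 1 − y⊤ỹ suggested in the chat — is exactly the kind of experiment the notebook invites: change the cost and watch how the energy landscape changes.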
And then, given the x, I come up with this ỹ, which is my approximation of the one-hot. I need to make it smooth so that I can run gradients backward, right? So the soft argmax is required for me to get predictions that are close to my targets, the one-hots, and they need to be smoothed out — otherwise, again, no gradients. I hope that answers your question. "Can I ask more general questions?" Yes, you can ask whatever you want. "Why is it called free energy?" That's in Yann's lessons, in the next episodes. "I know BERT and SimCLR can be interpreted as energy-based models, but what are examples of models that were created as energy-based models, not reinterpreted as one later?" Let me think. Well, for example, CLIP: CLIP is an energy-based model. It provides a level of compatibility between its inputs. Again, this energy perspective is just a way of thinking about things: rather than thinking of your model as outputting a single prediction — which doesn't work whenever multiple predictions can be associated with the same x; this is covered by Yann in other lessons — whenever you have one-to-many relationships, your model cannot produce all possible outputs; it would have to commit to one. Instead, the model produces an energy level, and you use that energy to tell whether a pair of inputs is compatible or not. So again, CLIP is an energy-based model. Maybe people are not aware of what these things are, or why they are cool, and that's why they don't use this terminology to start with: they might not know or understand why this framing is valid or helpful. "Stream Deck" — okay, no idea. "I guess the free energy was for having the incorrect answer as the correct answer in some cases... don't you want the model to be flat for incorrect answers?" I guess — I don't know. There is a delay, yes, I know. Yes, dry runs — that's what we said. That's what I'm going to use these for, so if you don't mind, I will do it this way, and then I guess I will record the final version with my NYU students: I'll just call the ones from the previous semester and tell them, hey, we want one more lesson. Let's see what we have here. "What happens to the lines if they go through a nonlinearity?" That's what I showed you. "Linear with equal increments or decrements should be a plane." "Compositions of linear transformations." "Could you please share your notebook?" It's online — I told you already. Let me show you where: you go to my GitHub page, you click on NYU Deep Learning Spring 21, and then you have the spiral classification notebook, which is the one we just covered. So this is the code I just showed you — it's on GitHub, under my username, notebook number four. "Thank you very much for doing this, I look forward to the next one." Very good. "Are restricted Boltzmann machines energy-based models?" I think so — I never studied them, so I don't know. "Can latent variables be used to control the model output, say a latent variable which controls for style, or translation?" Sure, of course. The latent variables are those extra inputs. We didn't talk about latent variables today, right?
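The point about smoothness is easy to verify: a hard argmax has zero gradient almost everywhere, while the soft argmax gives a ỹ we can compare against the one-hot y and backpropagate through. A minimal sketch with illustrative values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
y = torch.tensor([1.0, 0.0, 0.0])           # one-hot target

y_tilde = F.softmax(logits, dim=0)          # smooth approximation of a one-hot
loss = -(y * torch.log(y_tilde)).sum()      # cross-entropy against the one-hot
loss.backward()
print(logits.grad)                          # well-defined gradients: y_tilde - y
```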
But latent variables — so we have three inputs to our system. Today we only talked about x and y, and we have this F, the free energy, which tells you the level of compatibility between x and y. If we have x, y, and z, then we have a capital E, which is the energy, okay? So the free energy is the energy free of latents: we have only x and y. So here we have F(x, y), and then there is E(x, y, z). Three things: F of x and y, E of x, y, and z. And the z is an additional input which we never observe. So x we may or may not have access to; y we only have access to during training; and z you never have access to — you have to find out what the missing input is. You can find it in two ways; one is to minimize the cost: you assume you can run gradient descent with respect to this latent variable. Again, this is covered in — these are future lessons, where the future is in the past. Let me show you. YouTube — aha, okay. So cool, there's a live stream going on; what happens if I click here? It's going to be a recursive live stream. Oh my god, go back. Okay, where is it? How do we use YouTube? Here. This one here. Number eight? No — number five. Number five is latent variable energy-based model inference, and number six is latent variable energy-based model training. So inference and training: there we talk about E(x, y, z); today we talked about F(x, y), where we don't yet deal with the E and the z. So today's intention, which didn't go as expected, was to create an additional video prepping students for those next videos. It didn't go as expected, so I'll just use these as dry runs; but still, we introduced this F(x, y) before talking about E(x, y, z). Okay, wrapping up. So this was something, I guess. I think I will take it down; I don't really think I'm going to leave these videos up. It was fun, but it's definitely not the next video in the series — it's an experiment, and that's okay. All right. Have a nice Saturday; I'm going to have breakfast, I'm hungry. Just a few questions first: where are you all joining from? How many of you are there? Because again, I cannot tell — I am incapable of using this system; here I can see only one person, and I don't think there is just one person. Where are you from? Oh, I can run a poll. Create a poll... no? How? Add option. Where are you from? Demi, where are you? 22 watching — okay, thanks. Where are you joining from? Argentina, India — I cannot read that one — Singapore, Los Angeles, Vietnam, Iran, Italy. Singapore — oh, two people from Singapore. Brazil, Greece — oh, wow. New Jersey, okay. All right. I think no one on the East Coast decided to wake up at this time for this — reasonable, right? So maybe I should change the time. China, Shanghai — wait, how are you joining? Are you on a VPN? Germany — yay, okay. This is cool, right? I hope you enjoyed it, at least somewhat. Maybe I'll leave it up for the weekend — okay, I'll leave this video up over the weekend for whoever wants to watch it. That was something.
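As a small teaser for those inference lessons, here is a minimal sketch (all names, shapes, and the toy energy are assumed, not from the course code) of the first option mentioned: recovering the unobserved z by gradient descent on the energy.

```python
import torch

def infer_z(E, x, y, z_dim=2, steps=100, lr=0.1):
    """Find a latent z minimising the energy E(x, y, z) by gradient descent."""
    z = torch.zeros(z_dim, requires_grad=True)
    optimiser = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        E(x, y, z).backward()   # E returns a scalar energy
        optimiser.step()
    return z.detach()

# Toy energy (assumed for illustration): low when z explains the offset y - x
E = lambda x, y, z: ((y - x - z) ** 2).sum()
z_star = infer_z(E, torch.tensor([0.0, 0.0]), torch.tensor([1.0, 2.0]))
print(z_star)  # ≈ [1, 2]; E(x, y, z_star) then scores the (x, y) pair
```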
I think I like the idea. India — okay, we have a few from India. I think this can work as dry runs; I enjoy trying things out before going to class. Definitely, I will update the slides based on the flaws we identified together today — the things I tried to explain that didn't quite work out the way I wanted. Okay, I'm just entertaining you now; the lecture is finished, you can leave, feel free to say goodbye, but I'm hanging around a little bit more. So, what I was saying: the lessons you've been watching, the Deep Learning 2021 spring edition — it's been basically nine years now that I've been teaching these things. Not everything for nine years, but my first videos on YouTube are from 2016, and those weren't even the first generation. Those were written by hand on a tablet, a Samsung tablet, sort of Khan Academy style; but when I write by hand I may forget things, and it doesn't come out too nicely — maybe you cannot read what I write. So then I decided to convert everything into LaTeX and use a consistent color scheme throughout, and things got a bit nicer as they progressed. And then finally all of this converged into this Deep Learning Spring 21 edition, where we were all remote because of the global pandemic, and I had proper cameras, proper lighting, a proper microphone and so on. So I think the quality is nice, and I'd like to keep it up — keep a similar level of content. And actually, a large part of the reason the content is not that bad is that the students I'm addressing, the students here at NYU, are really bright, so I have to keep up with their expectations and their knowledge, and provide content that is adequate for who's listening. So again, the things that are online are rather polished — "furbished", is that a word? no, it's not — they are the outcome of a lot of iterations. Today, by contrast, was the first time trying this material, and it went as it went. Actually, I somehow got something out of it: I got practice, and we identified points of improvement. That's good. So I guess this episode won't be joining the collection, but I'm ready to record a new one in a more condensed manner, similar to what we've had so far, okay? All right. So I guess everyone has replied with where they're joining from. It's very nice and flattering to be followed from everywhere across the globe. Still, the person from China — I'm not sure how they joined, because there is no YouTube in China unless you are on a VPN or on a university network; I guess that's how. Okay — someone from Mongolia. Oh, wow. Cool. All right. Have a nice day and enjoy your weekend. I'll be going dancing very soon — but lunch first, I guess; no breakfast, so lunch. I'm hungry. All right, bye, have a nice day. Oh, September fifth is — oh, that's when Teacher's Day is celebrated. Yes, I actually got an email about it.
Yeah, someone sent me an email — it was so cute: "Happy Teacher's Day. Although it's an Indian thing, I wanted to write this to you." I'm like, oh, that's so cute. All right. Bye. Now, how do I stop this sharing thing? How do you quit? Stop streaming. Bye. Click. Yes. Yes, I want to click. I want