Actually, first point: we have a website now. You can find it if you go to the repository on my GitHub. You can click there on the link, and you get redirected to the website where you have the summaries of the previous class and the previous lab. So it's your duty to go over those summaries before class. Otherwise, if I take the first 15 minutes reviewing what we saw last time, then we have 15 minutes less to see new stuff. And it's very hard when we lose 15 minutes on Tuesday. Nevertheless, we start with a question. Let's say I'd like to do classification between images of dogs and cats. If this is my cat image, where will my dog image be? Near this point, right? So how can we tell them apart? First of all, I have to move it. If this is the origin, what am I doing here? Translation. And how do you translate stuff? Matrix multiplication — okay, what does matrix multiplication do? Rotation, reflection, scaling, and shearing, right? Shear, that's how it's pronounced. Actually, scaling — how can you do scaling as well? Why are scalars called scalars? Because they scale, yeah. Good. So you can always think about a matrix this way: you can just normalize it such that you have a unit determinant, and then you also have a scalar, which is changing the size. So I usually think about matrices as rotations, and I just say we rotate stuff in whatever-dimensional space. And then we are going to be doing another operation with neural nets, which is going to be squashing. So you're going to hear me repeat this very many times: neural nets are simply rotation and squashing. Rotation and squashing. What's coming after rotation? And then? Squashing, fantastic. All right. So let's get started. Then, just because I like advertising, I just put there my handle. And again, if you need to say anything, just call out: hey, I have no idea what's going on, please repeat.
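The "rotation and squashing" idea can be sketched in a few lines of NumPy — a minimal sketch, where the rotation angle and the choice of tanh as the squashing function are mine, not from the lecture:

```python
import numpy as np

# A 2x2 rotation matrix: unit determinant, so it rotates without scaling.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([2.0, 0.0])       # a point in the plane
rotated = R @ x                # rotation: matrix multiplication
squashed = np.tanh(rotated)    # squashing: element-wise non-linearity

print(np.linalg.norm(rotated))   # ~2.0 — the rotation preserves length
print(squashed)                  # every entry now lives in (-1, 1)
```

A neural network layer is exactly this pattern, just in higher dimensions and with learned matrices.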
And if you don't know who that guy is, you may have a look at TV series from the 90s. Anyhow. Oh, A double N — what is it? Artificial neural networks, I guess. Yeah, supervised learning, classification. So this is going to be kind of a review of stuff you have already seen before, but in a very much prettier way, because I spent the whole day making this stuff for you. All right, so let's go on. We have this guy, right? We have seen this last time. Let's say those are simply three branches of a spiral. So in this case, where is my data living? Where is the data here, if I show you this stuff? Those branches are made of points, and these points live in what space? R^2, right? It's the plane. So all those points are moving around that plane. Why do they have colors? Those are the labels, right? Three different classes, three different labels. You can make this drawing with Matplotlib and Python and NumPy, okay? On the other side — oh yeah — we have a t that goes from 0 to 1, and then c is going to be the class, from 1 to capital C in this case. Let's make things a little bit more spicy: let's add some crap there, such that we have more crappy-looking data — which is actually more realistic data, right? Okay. So what does classification mean? If you'd like to do classification, let's use whatever thing you want — what's it called — logistic regression. So what does logistic regression do here, in this case? It's going to do something like this, right? Some linear planes for separating the data. What's the main issue here? A second? It's not linearly separable. It's not linearly separable — but what is the main issue here? How would you define the issue? Yes, they are not linearly separable, so it does its best. Yeah, but what don't you like about that drawing? Right: in one region, you have points from multiple classes.
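Something like those noisy spiral branches can be generated with NumPy; this is a sketch — the radii, noise level, and parameterization are my guesses, not the notebook's exact recipe:

```python
import numpy as np

def spiral_data(m=100, K=3, noise=0.2, seed=0):
    """m points per branch, K branches; returns X of shape (m*K, 2) and labels c."""
    rng = np.random.default_rng(seed)
    X, c = [], []
    for k in range(K):
        t = np.linspace(0, 1, m)                 # t goes from 0 to 1 along the branch
        angle = 2 * np.pi * (k / K + t)          # each class is a rotated copy
        r = t                                    # radius grows along the branch
        branch = np.stack([r * np.sin(angle), r * np.cos(angle)], axis=1)
        branch += noise * r[:, None] * rng.standard_normal((m, 2))  # the crap
        X.append(branch)
        c.append(np.full(m, k + 1))              # classes 1..K, math-style
    return np.concatenate(X), np.concatenate(c)

X, c = spiral_data()
print(X.shape, c.shape)   # (300, 2) (300,)
```

Scatter-plotting X colored by c (with Matplotlib) gives the three-branch picture from the slide.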
This means that those branches are crossing my decision boundaries, which are linear. So how can we fix that? Okay — and you'll be seeing the video from last week — usually I have people saying, oh, just make those decision boundaries non-linear, right? That's what other professors usually do. I do things that are more cool: I just get the data to be linearly separable, right? That's just a different perspective on the same stuff. Anyhow, the main issue here is that we have intersections between these decision boundaries and the data, and therefore I'm going to try to do this, right? You've already seen the video from last time, so okay, this is no news. Or the thing that you're maybe more used to: see how the decision boundaries, over training, try to adapt to the distribution of this data. Here we are watching things from the bottom up. So you have your network where the first layer is on the bottom and the last layer is on the top — and if you draw networks in the other direction, you get one grade less. So the input is going to be on the bottom. Why is that? Why do we have the input at the bottom? Can anyone guess? Yeah — low-level features, that's correct, right? So we have low-level features in the lower part of the network, and as you climb up in the hierarchy — you basically want to draw the network the way it is made — you have the high-level stuff in the upper part. So if you add a classifier, you will put the classifier on top. The first time I was reading a paper from Geoffrey, he's like, blah, blah, blah, I put a classifier on top. On top of what? Of the network. I'm thinking, what's the top of a network? I had no idea, right? So: networks are drawn from bottom to top, with the first layer on the bottom, where you have the input coming in — lower-level features — and as you climb up, you reach the top. And therefore, if you have multiple outputs, it's called a multi-headed network, right?
Like a hydra — hydra, whatever it's called. Anyhow, we're going to figure out in this lesson how we can do these things. Do you know how to do this stuff? No, yes? Okay, you should, because you should have taken machine learning before. But perhaps the data is not linearly separable, right? So we just add one more layer and things start working. Anyhow, training data. Okay, so yesterday — well, last week — we have seen that a neural network, when you just initialize it, makes some kind of transformation. We were feeding it that kind of cloud, which was sampled from a Gaussian distribution with the identity matrix as covariance matrix and mean zero. So what was, roughly, the average radius of that cloud of points? Three — oh, you actually remember, very good. The radius was three, right? So things were within a radius of three. And that kind of circular shape was fed into the network, and the network was giving you some kind of arbitrary transformation, which was super pretty, right? Yes, it was very pretty, yeah, good. But that transformation wasn't instrumental to doing anything, right? So today we're gonna see how, by using data, we can enforce some kind of meaning on that transformation that the network does by itself. And so data is gonna be the most important part. So here we're gonna have — that should be in pink; yeah, it's too bright in here, so whatever — the x is gonna be my input data. It's bold because it represents a vector, and this guy lives in R^n, okay? How much is n in our case? Two, because? Because the points live in that space, right? Of the spirals. Okay, fantastic. And then this is gonna be my i-th sample. I have several samples, right? And this takes quite long to draw: you have several samples, they are row vectors, and I stack them one on top of each other. And I have m of them, right? So what is the size of this matrix? Shout louder. m by n, fantastic. So I have n columns and m rows.
If I were to use this matrix for doing some operations, what is the dimension I'm shooting to? Try again. m, okay. Because the height of the matrix is gonna be the dimension you shoot to, and the width of the matrix is gonna be the dimension you shoot from, right? Because you multiply row times column, right? All right, cool. Then we have c_i, which is gonna be the class for each of those points in the plane. And here those c_i's are gonna go from one up to capital K — before it was capital C; I still have to fix that, I know. So how much is capital K here? Three, because we have seen three colors, yeah, fantastic. All right, so if I stack all those c_i's, how many c_i's do I have? Say again? m. So if you stack m of these guys, you get a column vector here, c — right, with a height of m. But the problem with this kind of notation — one, two, three, whatever — is that it introduces some kind of ordering, right? So class one comes before class two, which comes before class three. And that doesn't make any sense, right? They are colors; it's a categorical distribution. I don't want something that also has an order. Therefore, I'm gonna be using this alternative representation, where I basically convert those c_i's into vectors of the size of my capital K — the number of classes — and then I'm gonna have a one in correspondence of the class which is indexed by the specific c_i, okay? So let's say c_i is equal to one: then you have basically the first guy here, okay? And since we are talking about mathematics, I count from one, right? One, two, three. If you are talking about Python or C++ or whatever, you switch gears, you wear a different hat, you can count from zero. Math: you count from one. So if you stack all those c_i's converted into that representation, what do you get? We don't have numbers, just use letters.
You have an m by K matrix, right? So this is gonna be my capital Y matrix: capital K number of columns and m number of rows. And each of these guys here is gonna be a vector in {0, 1}^K, where only one item is set to one. So you can say that its zero-norm is equal to one. Moreover, you can also think about this notation as having some probability mass which is completely concentrated in one specific spot, right? You have three possible spots, three possible classes, and you put all your 100% bet on that specific category. The network will try to approximate this; it won't be able to, but that's how we train a network with these kinds of hard labels. Questions so far? No — sorry, that was the exercise. Questions so far? Am I too slow? Yes, a little bit? No? Okay. Do you like the font, the colors? Okay, thank you. It takes forever. All right — LaTeX. That's why we moved to Markdown, right? Anyhow, this would basically be the first exercise. You have something similar in your first homework, so we skip this because it's gonna be due in two weeks. And basically, if this had been a tutorial, we would be typing stuff now. All right, so let's see how a fully connected network works and what it looks like. So at the bottom — why at the bottom? Say again? The input is at the bottom. Why? Low-level features, fantastic. What's the color of the x? Pink. I mean, yes, that's correct, but yeah. Then we get an affine transformation — it's shown there by the arrow — and then we get into that green f, where f is gonna be a non-linearity. The output of the f is gonna be called h, which represents my hidden layer. So h is something that is inside the network; you can't see it from outside, and so it's called hidden. It's bold because it's a vector. Moreover, then I have another affine transformation — you only see the matrix there — which maps into g, which is another non-linear transformation.
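The one-hot representation just described can be sketched like this — a hypothetical little helper, not the homework's code:

```python
import numpy as np

def one_hot(c, K):
    """Convert 1-indexed class labels c (length m) into an m-by-K matrix Y."""
    m = len(c)
    Y = np.zeros((m, K))
    Y[np.arange(m), np.asarray(c) - 1] = 1.0   # math counts from 1, NumPy from 0
    return Y

c = [1, 3, 2]            # cardinal labels for three samples
Y = one_hot(c, K=3)
print(Y)                 # each row: all the probability mass on one spot
```

Each row has exactly one entry set to one, so its zero-"norm" is one, as in the lecture.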
And then you have the final output, which is ŷ. And the output bubble color is — you don't say white, but — blue, right? And the hidden is gonna be green; that's gonna be consistent throughout. All right, so these are basically the only equations you're gonna see in this course. You have the hidden layer h, a vector, which is gonna be an element-wise non-linear function of an affine transformation of the input: h = f(W_h x + b_h). So W_h x is a linear operator applied to the input, plus the bias — that's an affine transformation — and then the f is gonna be your non-linear mapping. Then you have your ŷ, which is gonna be the output of the network, my hypothesis: a non-linear function applied element-wise to an affine transformation of the hidden layer, ŷ = g(W_y h + b_y), okay? That's all you get in a neural network: affine transformations, which I usually call rotations, and non-linear functions, which I call squashing. And so you just repeat: rotation, squashing, rotation, squashing, rotation, squashing. Fantastic, thank you. All right, so that's it, right? Easy, right? So far. No — you have a question mark on your face. Ask. What's up? Yeah, both f and g are arbitrary non-linear functions; you can use anything you like. This is only one hidden layer. My output layer is gonna be my blue guy — let's add the output; you can see ŷ on top. So that's the output, and the x is gonna be your input on the bottom. So I call this one a three-layer neural network. Yann, it looks like, calls this a two-layer neural network. I call it three-layer because there are input neurons at the bottom, hidden neurons in the center, and output neurons on top: one, two, three. But he counts from zero, like programmers. So two-layer — but no, three.
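Those two equations — rotation, squashing, rotation, squashing — can be written out directly. A sketch: the dimensions, the random weights, and the specific choices of f (positive part) and g (soft argmax) are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 2, 100, 3                         # input, hidden, output dimensions

W_h, b_h = rng.standard_normal((d, n)), np.zeros(d)
W_y, b_y = rng.standard_normal((K, d)), np.zeros(K)

f = lambda z: np.maximum(z, 0)              # positive part (ReLU), element-wise
def g(z):                                   # soft (arg)max
    e = np.exp(z - z.max())                 # subtract max for numerical stability
    return e / e.sum()

x = rng.standard_normal(n)                  # the pink input vector
h = f(W_h @ x + b_h)                        # hidden layer: rotation, then squashing
y_hat = g(W_y @ h + b_y)                    # output: rotation, then squashing

print(y_hat.sum())   # 1.0: a categorical distribution over the K classes
```

Two affine transformations, two squashings: a "three-layer" network in the lecture's counting.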
How many linear — how many affine transformations does a three-layer neural network have? Two, fantastic. How many layers of neurons do you have? Three. Okay, cool. Yeah, question? Uh-huh, yeah — well, affine transformations, right? So there is also translation. And because I usually like to extract the scaling — the scalar — from the matrix: then I have my unit-determinant matrix, which is basically rotating stuff, and the other factor, which is scaling. Then you also have flipping, right, if the determinant is negative. Usually, matrices are just rotating stuff. It's a bit hard to think about in high dimensions; I just say matrices rotate stuff, because they apply the same kind of movement to everything. So it's kind of a global operation. Other questions? No. All right, so here are a few examples of non-linear functions. The first one is the positive part: you keep the positive values as they are, and if a value is negative, you set it to zero. Other people call it ReLU, rectified linear unit, or other stuff, I don't know — yeah, whatever. I like positive part, which is math-ish. Then there is the sigmoid, which is one over one plus exp of minus the argument. Hyperbolic tangent, which is just a rescaled version of the sigmoid; we saw that last time. Then there is the soft argmax. I'm gonna call it this way because it's just a softer version of an argmax. An argmax is gonna give you all zeros, with a one in correspondence of the highest value. A soft argmax is gonna give you something like that: almost one on the highest value, kind of zero-ish everywhere else, right? But if you have two guys at the same height, you're gonna get half and half, and the rest is gonna be kind of zero, okay? Yeah, I guess this has a nice derivative; this guy is easy to use for training, I think. I think you could use that kind of normalization as well — so the question was, why don't we use rescaling?
Why don't you automatically rescale the output to be within the zero-to-one range? I guess because it's gonna be dependent on the output, right? You change the output, you have to change the scaling all the time; with this one, the scaling is the same all the way. I guess, yeah, that could be the answer. All right, so — oh, okay, and this took five hours to draw. Okay, we have our x on the left-hand side, with five elements. Then here you have, for example, your first hidden layer; I may have a second hidden layer, a third hidden layer, and then finally my output layer. So how many layers does this network have? How many columns can you count here? Five, okay, fantastic. How many gaps between columns can you count? Four — those are the rotations you have, okay, cool. All right, so we go from the first layer, which is also called A1 — the activations at layer one — to the activations at layer two, and so on: A3, A4, until A capital L, the last one. So we go from the activations at layer one to the activations at layer two with the W1 matrix; similarly you go with W2 there, W3, and so on, right? So how do you get that first neuron? Can you see anything? Okay, let me see if I can make notes — too dark; is it any better? Kind of, okay, whatever, I go like this. Boom, okay. Anyone taking notes on paper? Yes? Sorry, okay, yeah, whatever — after, I turn on the light. Okay, so how do you get the value for this guy here? This guy is gonna be the j-th neuron on my — what layer? Second layer, right? So A2. This is gonna be my element-wise — sorry, my non-linear function f, applied to w_j — the j-th row of the W1 matrix — multiplied by x. So I have a row times a vector: you get a scalar, thank you. And then plus b_j. What is b_j? It's a scalar, yes, that's correct — also called bias, right? Okay, what a choice of words, right. And this is basically gonna be the sum of the scalar multiplications, right? I mean the sum of the products.
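That single-neuron computation — the j-th row of W1 times x, plus a scalar bias, through f — looks like this; a sketch with made-up sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 5))      # 4 neurons in layer 2, 5 inputs
b1 = rng.standard_normal(4)
x = rng.standard_normal(5)

f = lambda z: np.maximum(z, 0)        # the element-wise non-linearity

j = 2
a2_j = f(W1[j] @ x + b1[j])           # row times column: a scalar, plus the scalar bias
a2 = f(W1 @ x + b1)                   # all four neurons of layer 2 at once

print(float(a2_j), a2[j])             # the same number, computed two ways
```

Doing the whole layer as one matrix-vector product is exactly why the weights are stored in W1.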
And so how do you get this one? You take these guys, you multiply them by the weights, and then you get the first guy, right? Then for the second one, you try to do copy and paste — it doesn't work. So you have to draw all those lines again, and then you start realizing that you made a very bad decision when drawing everything else, okay. So where are these weights stored? In W1, okay, fantastic. Now I'm gonna fast-forward — you know, fast forward. Yeah, how pretty, no? These are PowerPoint ninja skills. Okay, I should be doing advertisement for Microsoft, and they should pay me. All right, so this is a neural network, right? How pretty is it? Thank you. Okay, I give you back the light here; I should just turn everything on, right, I guess. Do we know if we can turn off the first line of lights? Okay, we'll have to figure that out next time, I guess. All right, so we just use this representation where each of those layers you were observing before is condensed into one ball here. So in this case, h is a vector, right? And so a vector of whatever number of elements is just represented there on the screen — and the same for the y. So here you have those matrices, which are shooting to the right dimensionalities, and then here you had the non-linear functions. So you can think about my ŷ here, this guy, as some kind of function ŷ of my current input x, right? So the x, the pink guy, is fed into the network, and then the network will give me the output prediction I expect to have. So this can be thought of as a function that maps R^n to R^C, where C is the number of classes — that was capital K; I guess I have to fix the slide. So you can see this as mapping inputs to final predictions. Usually, though, it's better to think about it in a different way: what's happening is actually that you map this R^n to some intermediate representation R^d, and then finally to the final classification dimension.
Here d, the dimension of the internal layer, is much, much larger than the input and output dimensions. Why is that? Because whenever you go into a very high-dimensional space, everything is far — like really, really, really, really, really far apart. And if things are very far apart, it's very easy to rotate stuff and get things to move a little bit, right? If you keep everything in a very tiny little cramped space and you try to move things, everything moves together, right? But if you go into this intermediate space where everything is so far apart, you can just kick things around, okay? It's much easier. So going to a higher-dimensional intermediate representation is really, really, really helpful. So potentially you can have a very fat network, right? You have an input, a very fat hidden layer, then an output. The cool part is that if I have 100 neurons in my hidden layer, I can simply use two hidden layers of 10 neurons each, and it's gonna perform roughly the same. So instead of having this very, very fat intermediate layer, I can decide to stack a few hidden layers, and the number of combinations of those neurons will grow exponentially. So if you want a 1000-neuron hidden layer, you can just have three hidden layers of 10, right? Now, in the second case, where you have a cascade of things, you will have data dependencies: you have to wait for the layers below to finish before starting the next operation. So by definition, the more layers you stack, the slower you get. In the other case, you need so many, many, many more neurons in order to be as good as you are by stacking just a few of those layers — so in that case you have to do many computations as well. If you have parallel software — I guess, like parallel hardware — then maybe you can prefer those large versions, but I wouldn't go for those. Yeah, that's correct as well.
So it takes up so much more space in memory. All right, so how much time? Oh, make noise. Okay, I can't watch my watch. Okay, never mind. All right, so yeah, okay. So in this case, my input lives in a two-dimensional space, my hidden layer is gonna be living in a 100-dimensional space, and my final output is gonna be living in a three-dimensional space. And this would be the second part, where you start doing stuff with your computer — but we are in class, we have no time. So, neural network training: huh, how does it work? We use this guy here, the soft argmax, which is a softer version of the argmax, right? Which is simply the fraction of the exponential divided by the sum of the exponentials of all the items, right? You've already seen this yesterday. Why do I write that this stuff lives between zero and one, open interval? Why didn't I use square brackets? So the answer was: it's very unlikely. I would say it's impossible. Why is it impossible? Because the exponential function at the numerator is strictly positive. And why doesn't it reach one? Because the exponential is strictly positive — that was the answer, correct: the denominator is always gonna be slightly larger than the numerator, right? All right, so the inputs to the soft argmax layer are called logits, and the logits are the linear outputs of the network. Here we're gonna have our total loss — the loss for the whole dataset — which is a function of capital Ŷ, the predictions of the network for the whole set of inputs, and the vector c, which is this vector of labels. And it is simply the average of this lowercase curly ℓ, which is the per-sample loss. Yann uses capital L for that thing. So in this case, if I want to do classification, my per-sample loss is gonna be basically minus log of the output of the soft argmax at the correct class c, okay? So c is gonna be my correct class, which is that index, one-hot.
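The soft argmax and the open-interval claim can be checked numerically — a sketch, with my own helper function and made-up logits:

```python
import numpy as np

def soft_argmax(z):
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()

p = soft_argmax(np.array([5.0, 1.0, 1.0]))
print(p)   # almost one on the highest logit, zero-plus everywhere else

q = soft_argmax(np.array([3.0, 3.0, -10.0]))
print(q)   # two guys at the same height: half and half, the rest kind of zero

# Strictly inside (0, 1): every exponential is strictly positive,
# and the denominator is always slightly larger than any single numerator.
assert np.all(p > 0) and np.all(p < 1)
```

So square brackets would be wrong: the outputs can approach 0 and 1 but never reach them.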
Actually, c was the cardinal number, right? The blue y was the one-hot encoding, and my ŷ is gonna be the output of the network, which is the output of the soft argmax. So my per-sample loss is gonna be the minus log of that soft argmax at the correct class, right? So, do we understand what I said? Yes — so just to test your understanding, you're gonna tell me what to write next. This loss, by the way, is also called cross entropy, or negative log likelihood. So let's say my pink x — which looks white — is there, and then I have my class, the orange class, equal to one. Therefore, what's gonna be my blue y? The one-hot encoding of the class c: one-zero-zero, okay? Fantastic. All right, so let's say I feed this x to my neural net, so I'm gonna compute this ŷ — okay, it's missing a hat, sorry, my bad. So here there's a hat, right? ŷ of this item here. So why do I write almost — oh no, sorry, sorry, my bad. So I put this x here inside this ŷ, and the output of the network is gonna be this: almost one, almost zero, almost zero. What is almost one — is it one-plus or one-minus? One-minus, okay, good. What is almost zero? Zero-plus. Zero-plus, very good. And what about the last one? Zero-plus, good. All right, so if I have this guy here, what's gonna be my per-sample loss? My per-sample loss is gonna be this output — almost one, almost zero, almost zero — and then class number one, okay? So what do I get here? Here you're gonna compute: what is ŷ of c? ŷ of c is gonna be — hold on, yeah — almost one, which is one-minus, okay? Then you have log of one-minus, so it's gonna be zero-minus; then there is a minus in front, and you get zero-plus. That's correct, very good. So basically, if you input this x into my network, the class I expect to have for that input is one, and my network says: ah, one, zero, zero.
Well, the loss — which is basically the penalty for saying bullshit — is gonna be: oh, no penalty, you're doing well, okay? So let's see what happens instead if my network says: oh no, no, this sample is zero, one, zero, huh? So my per-sample loss of (zero, one, zero) and class one is going to be what? What is ŷ of c? Almost zero. What kind of almost zero? Zero-plus. What is log of zero-plus? Negative infinity. And with a minus in front? So this guy approaches positive infinity, right? That's why there is a plus there. If you just write infinity, that's wrong for me, okay? Plus infinity, minus infinity, and infinity are three different items, different animals. Okay, makes sense? So if your network says bullshit, you're gonna say: plus infinity, very bad network. If the network gives kind of the right answer, you just say: oh, all right, you're doing well. Yeah, question — second, sorry? So ŷ — the question; you maybe cannot see, here, if you squint a little bit, in gray — ŷ is gonna be a function of your input x through two rotations and two squashings, one squashing being the positive part, and the other squashing being the soft argmax, right? And if you use this per-sample loss on top of your soft argmax, then you get this kind of cross entropy, and if you compute what's written here, you get basically this: if you input an x and your network is correct, you say, good network, good boy. If instead your network says bullshit, then you're gonna say: bad dog — no, bad network, whatever. Makes sense? So ŷ is a vector, and I choose the c-th item. So you can think about this one as ŷ subscript c. And c is gonna be one, two, or three, right? So it's gonna be basically ŷ-one, or ŷ-two, or ŷ-three — one of the three items, elements, of ŷ. Makes sense? No? Yes? You can say no. Did I answer your question? No. So: ŷ, it's a vector of three items, three elements.
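The two cases just worked out — "right answer, zero-plus loss" and "confident wrong answer, loss blowing up towards plus infinity" — can be checked in code. A sketch; the 0.98/0.01 outputs stand in for "almost one" and "almost zero":

```python
import numpy as np

def per_sample_loss(y_hat, c):
    """Cross entropy: minus log of the prediction at the correct class c (1-indexed)."""
    return -np.log(y_hat[c - 1])

good = np.array([0.98, 0.01, 0.01])   # almost one, almost zero, almost zero
bad  = np.array([0.01, 0.98, 0.01])   # confident about the wrong class

print(per_sample_loss(good, 1))   # zero-plus: basically no penalty
print(per_sample_loss(bad, 1))    # large, and it approaches +infinity as ŷ_c -> 0+
```

Pushing the wrong-class prediction even closer to zero makes the second number grow without bound.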
Yeah, here, right? ŷ — if you squint a little bit — is gonna be the output of the network: K classes, yeah. And this guy here inside is gonna be the logit, the linear output, and then you apply this g, which is gonna be the soft argmax, okay? Which is kind of normalizing the outputs to be within zero and one, such that all the items of the output are gonna sum up to one. So, why don't we use mean squared error? I think Yann addressed this question yesterday — he has written about it, right? The point is that this is the loss function we use for classification, the one that gives you something that actually works. Other losses may not be optimal. So when you'd like to do classification, you use this cross entropy, which also has other statistical properties I'm not gonna be talking about here. All right, so since I'd also like to go over the notebooks and stuff, I will try to speed up a little bit. All right, so training: how does training work? I just take all the parameters — all the matrices, like weight matrices, biases, and whatever — I get the set of those weights, and I call it capital theta: the collection, the set of all my trainable parameters. Such that I can now write my J of theta, which is defined as the loss. So it's the same stuff — why did I change notation? What can you notice here? What's inside this function? This function is a function of the parameters, whereas this function here, the loss, is usually a function of the output of the network, right? This Ŷ is the output of the network — in this case, since it's capital, it's for the whole batch. Right, so the first is the way we usually write the loss, the type of loss; the second simply says: oh, J is gonna be my objective function for an optimization problem, which we're gonna see right now. So how does it work? We can think about J being this purple guy here, which looks like that.
There I use a lowercase theta, which is basically a scalar — just think about having one parameter. And so I have J on the y-axis and theta on the x-axis, right? So how do you train these networks? Usually you start with a randomly initialized network, which means you pick just an initial theta-zero value. At that specific point, you can see that J — which is gonna be called the training loss — has a specific value, which is gonna be my J at the point theta-zero. There, you can compute the derivative — which you can't even see; it's a green line there; yeah, if you can't see, I know, sorry, again, right? How pretty. So you have the green slope there, which is showing you the derivative at that point: the derivative of my J function with respect to the parameter theta, computed at the point theta-zero. Now, the only thing you have to do is take a step towards the left. So, is that derivative positive or negative? Do you agree that the derivative is positive? Positive, fantastic. But I'm taking a step towards the left — so how do you do that? How do you do that? Exactly, you just put a minus. Okay, fantastic. So what is this stuff here called? This is gradient descent, right? How do you train a neural network? Gradient — whoa, whoa, whoa. Okay, gradient — whoa, whoa. I heard another word here. Who said back-propagation here today? No, I mean — did I? Okay, I didn't, yes, I didn't mention back-propagation yet, right? So how do we train a network? Gradient methods, right? Okay, cool. I have to compute these gradients — how do you compute these gradients? So what is this, ∂J/∂W_y? This guy is gonna be ∂J/∂ŷ — the partial with respect to the output — times ∂ŷ/∂W_y, right? And similarly, what is the Jacobian of my objective function with respect to W_h?
It's gonna be the partial derivative of the cost with respect to the output of the network, times the Jacobian of the output of the network with respect to the hidden layer, and then, finally, the partial of the hidden layer with respect to W_h — so ∂J/∂W_h = ∂J/∂ŷ · ∂ŷ/∂h · ∂h/∂W_h, right? What is this called? Back-propagation, okay? All right, so what is back-propagation? The computation of the derivatives, right? And how do we train a network? Gradient methods. Okay — if you get this wrong on the midterm, I fail you. All right. Yes, this one is left as an exercise: why is there a plus here? Okay, so in the last — how many minutes do I have? Five minutes, really? Nine minutes — oh, that's a lot of time. So we're gonna go through two notebooks. Maybe — terminal, cd, work, GitHub, PyTorch, conda activate minicourse — yeah, works — jupyter notebook, okay? How do I share my screen? System Preferences — yeah, but I cannot see — Arrangement, Mirror, okay. Okay, can you see? No? So we're gonna be going through the spiral classification right now. View, don't show, okay. So since it's gonna be — oh, you can see stuff, right? I don't have to turn off the light, okay. So here I basically do the imports — random stuff, torch, and then the plotting stuff, such that you can see something. I use my awesome default configuration. We have a device — what's the device used for? For whatever device you want to run this stuff on; so that's taken care of. Here I just put the same numbers as we have seen before. You should be able to understand this stuff and do it yourself; that's part of, I think, kind of the homework. Next year we have to make this a homework, remember? All right, so I just visualize the data — you've already seen this stuff, okay. So, oh, there's no surprise. This is the starting point: the points I have, in two coordinates — each point has x and y — and then you have a color for the different class. I'm not running it too fast, right? You're not supposed to read the code now.
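The gradient-descent picture from before — compute dJ/dθ at θ₀, then step against it with a minus sign — can be sketched on a toy one-parameter loss. Everything here (the loss, the step size, the starting point) is made up for illustration; the derivative plays the role of what back-propagation would compute for a real network:

```python
import numpy as np

J = lambda theta: (theta - 1) ** 2 + 0.5   # a toy training loss, one scalar parameter
dJ = lambda theta: 2 * (theta - 1)         # its derivative (what backprop would give us)

theta = 3.0        # theta_0: the random initialization
eta = 0.1          # step size
losses = [J(theta)]
for _ in range(50):
    theta = theta - eta * dJ(theta)        # step against the gradient: note the minus
    losses.append(J(theta))

print(losses[0], losses[-1])               # the training loss goes down every step
```

At θ₀ = 3 the derivative is positive, so the minus sign moves θ towards the left, exactly as in the slide.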
You read the code later and play with the notebook, at least one hour. Right now we just go through the notebook to see that it actually runs, because you never know. It's an open source project. Right, so, linear model. I'm gonna be training this guy over here. So I create a Sequential, which is a container. I don't have to use it, but it's easier. And I put there two Linear. What is a Linear? It's an affine transformation. What's an affine transformation? It's gonna be on the midterm, so, you know, just revise it. Going from D to H, where D is the input space, H is gonna be the hidden space, and then from hidden to the output. Just linear layers, right? So how are the decision boundaries here? Crappy. Yes, correct. Linear. So I start this guy, it trains in a blast, and then I show you the output. Nothing, right? Bad. Bad network. It did its best, right? So why are these decision boundaries put in this configuration? Why are they not rotated? Why do you have the yellow area on the left-hand side and not on the other sides? It tries to do its best, right? So my accuracy is 0.5. Can you see? So what is 0.5? Okay, who said random? Okay, very good. Otherwise you want to refresh your probability: random would be one divided by three, right, for three classes. Okay, we have no time, but I wanted to add one more thing. Yeah, we don't have training losses here. What is the first value of the training loss? This is the latest value I get, right? So this is my final loss, 0.86. What is the first value of my loss? Any idea? Oh, but you see my screen here. Damn, I cannot even, okay. Figure out what the first number you get there is, no? Why? You should figure it out; if you don't figure it out by next week, I tell you, but you should figure it out, right? So you should try to use your brain too, sometimes. All right, so let's do something here. I'm gonna be adding this positive part in the center, okay?
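The linear model just described, two affine layers in a Sequential container, can be sketched like this. The dimensions D, H, C are assumed here (2-D input points, a hidden size of 100, three spiral classes); the notebook's actual values may differ.

```python
import torch
from torch import nn

D, H, C = 2, 100, 3             # assumed: input dim, hidden dim, number of classes

model = nn.Sequential(          # a container chaining the layers
    nn.Linear(D, H),            # affine map from input space to hidden space
    nn.Linear(H, C),            # affine map from hidden space to class scores
)

x = torch.randn(1000, D)        # 1000 fake 2-D points
scores = model(x)
print(scores.shape)             # → torch.Size([1000, 3])
```

Since the composition of two affine maps is itself affine, this model can only ever draw linear decision boundaries — which is exactly why it fails on the spiral.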
So I just add one more little tiny thing that zeroes the negative numbers, right? Everything else is the same. I don't change anything, but I delete the negative numbers, like with the hook, the thingy there. I don't know the English name. A ReLU? No? Okay. You wanna see? Are you excited? Are you waiting for it? Kind of; people are sleeping, I guess. All right. Tada. Oh, okay, cool. So, yeah, two minutes left. Try changing the number of hidden layers, right? Sorry, try changing the dimension of the hidden layer. What happens if you use two neurons in the hidden layer, or five neurons? So try playing with this, right, and figure out what's happening. You can comment on Piazza, you can start talking, you can have conversations too, right? We really highly encourage you to play with these notebooks and figure things out. In the last 30 seconds we go through another notebook, which is gonna be exactly the same, okay? But in this case my points look like this. Can you see anything? Kind of, right? Can you see this kind of banana? Which is not a banana, it's like a double Nike swoosh. All right, so I'm gonna be training here my linear network. I train everything. Okay, I explain now, don't worry. So I train my linear network. What is my linear network? The same stuff as before. I have a Sequential, a container, where I put my two... What are those? Affine transformations, yes, correct. I send it to the device, okay. How do we train a network, right? To train a network, you have your input X, these are all my points, and I feed them to the network, my model. My model will give me my y_pred, right? And the definition of the model is this one, this container here, okay? So we have the model, we feed the whole X, you get y_pred. Then I compute the loss, which is gonna be my criterion, computed over my predictions and the Ys. The Ys are the classes, right?
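"Deleting the negative numbers" is one extra line in the container: a ReLU between the two affine layers. Same assumed dimensions as before (D=2, H=100, C=3, which may differ from the notebook).

```python
import torch
from torch import nn

D, H, C = 2, 100, 3             # assumed dimensions, as before

model = nn.Sequential(
    nn.Linear(D, H),            # affine: input → hidden
    nn.ReLU(),                  # the positive part: zeroes all negative activations
    nn.Linear(H, C),            # affine: hidden → class scores
)

# The ReLU is the only change, but it makes the composition nonlinear:
y = model(torch.randn(5, D))
print(y.shape)                  # → torch.Size([5, 3])
```

With this single nonlinearity the decision boundaries are no longer forced to be linear, which is why the accuracy on the spiral jumps.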
And the y_pred values are the outputs of the network. In this case the criterion, it's written here, is gonna be an MSE loss. So we changed: before we were actually using a cross entropy, the minus log of the softmax, the soft argmax. Now we are gonna use an MSE because we are in a regression problem. So we compute the loss, which is the quadratic distance between your output and the target, right? So if we talk about regression, we talk about targets. If we talk about classification, we are gonna be talking about classes and labels. So: labels for classification, targets for regression, okay? All right, so we compute the loss here, and then I have to clean up whatever happened before. So I say to my optimizer: clean all the leftovers from the previous operation, which is this zero_grad pass. Then I perform backward. What is backward? What is back propagation? The computation of the partial derivatives of the, say again, partial derivatives of the loss with respect to the parameters. So loss.backward() is simply the chain rule: it computes all the partial derivatives of your final loss, which in this case is the mean square error, with respect to each and every parameter of your model. Finally you do step, which is gonna be stepping against the gradient: you step in the direction opposite to the gradient, okay? All right, so we train this network, which didn't have a nonlinearity, and it did something. I don't know how to interpret this loss. And this is gonna be my output of the network. What is it? Linear regression, yay. Okay, boring. So now I'm gonna do a deep network, super exciting things. What do I change? I just delete the negative values, okay? How cool is that? So I remove the negative values sometimes, really, or sometimes I just use a hyperbolic tangent. I just split it in two so you can see the comparison. I train this stuff, and let's see the predictions before actually training.
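The five steps narrated above (forward, loss, zero_grad, backward, step) form the standard loop. Here is a minimal sketch on assumed fake regression data; the real notebook uses its own X, Y, learning rate, and epoch count.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.linspace(-1, 1, 100).unsqueeze(1)   # inputs, shape (100, 1) — assumed toy data
Y = X.pow(3)                                  # assumed toy regression targets

model = nn.Sequential(nn.Linear(1, 50), nn.ReLU(), nn.Linear(50, 1))
criterion = nn.MSELoss()                      # regression → MSE, not cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = criterion(model(X), Y).item()

for epoch in range(200):
    y_pred = model(X)                # feed the whole X, get y_pred
    loss = criterion(y_pred, Y)      # quadratic distance to the targets
    optimizer.zero_grad()            # clean up leftover gradients
    loss.backward()                  # chain rule: ∂loss/∂(every parameter)
    optimizer.step()                 # step against the gradient

final_loss = loss.item()
print(final_loss < initial_loss)
```

The order matters: forgetting `zero_grad()` makes gradients accumulate across iterations, which is one of the most common bugs in hand-written training loops.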
So before training, you're gonna get these kinds of predictions, right? It's kind of shooting towards zero, kind of a horizontal line. And the green line is gonna be representing the variance. Here, where my cursor is, is zero. So the variance here is like 0.2, roughly, between all those predictions. Let's go down and let's watch how the network changed its final output, given that we have used the loss, right, and we moved against the gradient. So we are on the mountain, there is fog, you cannot see where the valley is; you just walk down the slippery way, the direction that goes down, right, down to the valley. So after doing this procedure several times, are you excited to see how this network performs? I don't hear you. Okay, thank you. All right, so, here, boom. Guess which one is using the positive part and which one is using the hyperbolic tangent? Okay, because you saw the typos, damn. Okay, so on the left-hand side you can see how my network approximates my input as a, what is that thing called? Can you see anything? Maybe you can. How does this stuff look? It looks like straight line segments, right? So my network's output is simply a piecewise linear approximation of your input. This is because I just deleted the negative things. Instead, if you use the, what is it? The hyperbolic tangent, yes. You get this guy here. How smooth, right? Okay, why did I do this stuff? This is nicer, right? I think. Okay, first of all, you can see the yellow thing, which is the standard deviation, which looks really messy here and less messy on the left-hand side, right? See, this standard deviation is really spiky. So if you train an ensemble of networks, these networks will not agree as consistently as those other dudes on the left-hand side. Moreover, let me change one line. Let's put here the number four. So I'm gonna be now looking at data outside my training region, okay?
I'm gonna be looking at what's happening on the left and the right. What do you expect to see? What do you expect these networks to do when you test them outside the training region? Okay, good intuition, something similar. So this network will not work, right? Because the network will only be able to generalize over data that is in a similar range. If you ask your network, which has been trained on this data, to interpret things that are over here, it will say: I don't know. And unfortunately, the main issue is that these are regression networks, so they will not tell you how confident they are, right? They are absolutely... yeah, I shouldn't say bad words. Whatever; they won't tell you how confident they are. They just give you that number. But let's see what's happening now. So what's going on here? Now I just show you a little bit more, right, on the sides. The yellow one is gonna be the standard deviation and the green one is the variance. You can see on the left-hand side the ReLU, the positive part function: the final branches keep the same slope. Whereas the one trained with the hyperbolic tangent will saturate eventually, okay? So you have to know these networks will have side effects. The choice of nonlinear function will have side effects, especially if you go outside the training region. Luckily, you can use this technique, the ensemble variance prediction or estimation, in order to estimate somehow the uncertainty with which a prediction is made. This is really, really, really important in terms of research, right? If you have a network that does regression, you have no clue whatsoever about its own confidence. But if you train a bunch of networks that have different initial values, and you train them all with the same procedure, you can compute the variance in order to estimate the uncertainty with which a given prediction is made. With this, that was it for today.
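The ensemble-variance trick described above can be sketched as follows: train several copies of the same network that differ only in their random initialization, then use the spread of their predictions as an uncertainty estimate. The data, architecture, and training settings here are all assumed toy values, not the notebook's.

```python
import torch
from torch import nn

def make_net():
    # tanh network, as in the right-hand panel of the comparison
    return nn.Sequential(nn.Linear(1, 50), nn.Tanh(), nn.Linear(50, 1))

X = torch.linspace(-1, 1, 100).unsqueeze(1)   # assumed training region: [-1, 1]
Y = X.pow(3)                                  # assumed toy targets

nets = []
for seed in range(5):
    torch.manual_seed(seed)                   # different initial values per member
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(200):                      # identical training procedure for all
        loss = nn.functional.mse_loss(net(X), Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    nets.append(net)

# Compare the ensemble's disagreement inside vs far outside the training region:
with torch.no_grad():
    x_test = torch.tensor([[0.0], [4.0]])               # in-domain vs out-of-domain
    preds = torch.stack([net(x_test) for net in nets])  # shape (5, 2, 1)
    std = preds.std(dim=0).squeeze()                    # per-point disagreement

print(std[1] > std[0])   # members disagree more outside the training region
```

A single regression network gives you a number with no confidence attached; the standard deviation across ensemble members is a cheap proxy for that missing confidence, and it typically blows up exactly where the networks are extrapolating.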
Thank you for listening and I'll see you next week. Bye-bye. Questions? Okay, are there questions? Hold on, hold on, wait a sec. First of all, the scribers and whoever else have to pay attention to what gets written. We always have the notes coming up on the website by Sunday, such that you can revise the content before class. If you come to class without revising the content, you may not be fully receptive to whatever we talk about. That was it, I think. All right, thank you.