I don't assume you are going to be able to understand absolutely everything. I'd just like you to try to follow, at a high level, whatever happens. Then you actually need to spend some extra time on your own and try to understand what's going on. All right, so let's go to the second part of our talk, which is going to be artificial neural nets, and we start with supervised learning, classification in this case. So we are going to try to do some fun stuff and see how a neural network tries to solve this kind of problem. Our starting point is going to be generating nice colored spirals, because everyone likes spirals, nice colored spirals. So the equation of this guy is simply a parametric formula of this form. It's nothing fancy. It's just a sine and a cosine term whose argument goes with 2t divided by C, where t goes from 0 to 1 and capital C is the number of categories. In this case, capital C is 3, so we have three different categories, three different branches of that spiral, and each one sweeps up to 2 divided by C, which is two thirds of a turn. So this first guy starts in this direction here, and it goes one third, two thirds, so it ends up here. The second one starts in this direction, this is the first third, and this is the second third, so it ends up there. And the last guy here starts in this direction; this is going to be my first third, and then the second third. So as you can see, with this parametric formula we can express the drawing there. So now we know how to draw some lines. But we need to get some data, because any time we'd like to perform supervised learning, what did I say before? What is supervised learning? We have samples. With labels. And we have labels. So what are my samples here? I'm going to start generating these samples now. So I do the same thing here. It's the same formula, I think. Yeah. I add some disturbance here, some noise. And this is what it ends up looking like. These are going to be my data points. So one data point here is going to be a point in the plane. Each of these points is represented by an x and y coordinate in this space. And the label is going to specify whether this point is purple, red, or yellow. What we'd like to do is train our algorithm to assign the correct color, given that I provide a position on the screen. So this whole collection of points is going to be my training set, my samples. And the samples are expressed by 2D coordinates: there is the x coordinate and the y coordinate. The goal in the end is to have a machine learning algorithm, in this case a neural network, which, given two coordinates, an x and a y coordinate, gives me a categorical value telling me: oh, this point here, I think it should belong to the purple spiral, or it should belong to the red one, or it should belong to the yellow one. So we are just trying to categorize these points as belonging to a specific spiral. You can play with the code later on and add more spirals, add more points, fewer points. You can play a lot and understand better the difficulties of this problem. But so far, it should be clear: given a point in the plane, we'd like to learn the color of the point. And the color is expressed by c: lowercase c is the specific class, and capital C is the total number of classes we have.
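To make the data generation concrete, here is a minimal sketch in PyTorch of the spiral construction described above: each arm is traced by a parameter t in [0, 1], the angle sweeps 2t/C of a full turn shifted by the class index, and some noise is added. The number of points and the noise level are illustrative assumptions, not the notebook's actual values.

```python
import math
import torch

N = 1000   # points per class (illustrative)
C = 3      # number of classes, i.e. spiral arms

X = torch.zeros(N * C, 2)                   # one 2D point per row
y = torch.zeros(N * C, dtype=torch.long)    # class label for each point

for c in range(C):
    t = torch.linspace(0, 1, N)             # parameter t in [0, 1]
    # the angle sweeps 2t/C of a full turn, shifted by the class index, plus noise
    angle = 2 * math.pi * (2 * t + c) / C + 0.2 * torch.randn(N)
    X[N * c : N * (c + 1)] = t.unsqueeze(1) * torch.stack(
        (torch.sin(angle), torch.cos(angle)), dim=1)
    y[N * c : N * (c + 1)] = c
```

The radius grows with t, so the three arms start at the center and spiral outwards in three different directions.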
All right, so what we are going to try to do, what we hope the network is going to perform, is some sort of division of the space. In this case, I'd like to say: if I just look at this small region, everything that is here is red, everything that is here is yellow, everything that is here is purple. The issue is that those spirals are not linearly separable. So if I just draw some lines and I say everything here is red, well, not really: red stuff is happening here, here we have some purple, we also have some yellow. So whenever you use, let's say, a linear classifier, it would basically just be drawing some lines like this. And we're going to see soon that neural networks are simply, somehow, stacking multiple linear classifiers together with some nonlinearities in between. But basically what we are trying to do is draw some lines and get the data to stay within some specific region. So how does a neural network perform this? A neural network will do this, basically: it takes the space and it starts rotating the space in order to disentangle those representations. So any time we use a neural net, we have some kind of disentangling operation. And at the end, I can simply use a linear classifier, like the one shown in the last animation. Over here I can say: everything that is here is yellow, everything that was here was purple, and everything that is above here is red, okay? So my last layer in a neural net is a linear classifier, again, which is drawing those three lines. And the other layers in between are just performing this kind of warping. So someone can say: ha, okay, never mind, I know how to rotate things, right? How do you rotate things? You're a physicist, you know how to rotate stuff. With matrices, right? Okay, very well. So if you come here and you rotate this spiral, what happens? It's like those pinwheel things. Same thing. They still rotate, but they don't disentangle. There's a question there. You can ask the question. That last slide, was that a real deconstruction of a network that solved this problem, or was that just your animation? It is actually going from the first layer to the last layer and back to the first one. So the network that you found to solve this literally does exactly that? Absolutely, yeah, cool. And that's what I was going to say. But how you draw that is not the only way. I mean, you can. That's an interesting question too, yeah. To make things look pretty, I draw them my way, but that's how the network is going to behave. I think so; it is a cartoon. It's a cartoon, but I'm showing you the non-cartoon version in one minute. So, yeah, yes and no: it was advertisement now, and the real thing in a moment. That's correct, and they look exactly the same. So this is what we hope it does, right? The problem is that if you use a matrix, you can rotate this guy here, but it's going to be like those little pinwheel toys in the air, right? They still rotate, but the arms are still intertwined. And what we are actually aiming for here is to rotate and warp the space with a different amount of rotation depending on the distance from the center of the space, right? So we have no rotation here and a stronger rotation as we move away from the center, okay? So basically, the neural network is going to perform some kind of space warping through the layers.
This is basically seeing things from the input space as you move across to the output; this is the output-space view back to the input, and back to the output space again. Usually, this is not the way you see these things. The way you usually see them is from the output of these networks, which gives these kinds of graphs. So this is my network training. At the beginning, it doesn't know anything, and as it trains, it tries to shape its own understanding, following this drawing. And this is not a cartoon; this is actually the training part. So the point is that this shape it has learned is basically looking at the output from the input, whereas the other one was looking through the network. But they are basically the same operation. So we are trying to have a network which is going to learn that this region here belongs to the yellow class, this region belongs to the purple class, and this region belongs to the red class. Is it clear so far? Are you excited? Shall we move on? Very well. And then... training data, huh? No, let's do a demo first. I'm going to show you something cool here. Possibly, yes. So I'm going to go back to my notebooks, and I'm going to quickly run the space-stretching one. Here I really don't want you to read anything; I'm just showing you very pretty pictures. Okay, so this is my initial distribution. Here I have just some Gaussian points, okay? Just a spread of points in the two directions, with a diagonal covariance matrix with unit variance. Just points around the center. Just forget what I said before. If you multiply those guys by a matrix, what happens? Someone said: oh, use a matrix, right? For rotating. Well, yes. So here I multiply this stuff by a matrix. This is my input; here it is multiplied by this guy. This guy has one eigenvalue around 2.6 and a much smaller one, so you get the stretch in one direction and the reduction in the other direction, right? This is what happens with that. Whenever you multiply by a matrix, you get some squashing in one direction, you get some stretching in the other direction, and you also get some rotation, right? But everything is rotated the same way across the space, because it's a linear transformation. Here you have a different matrix: a very, very tiny eigenvalue in this direction, and a very large eigenvalue in the other one. So this is nothing new, right? This is simply matrix multiplication with very pretty colors. You like the colors, right? Good. All right, more matrices, more matrices. Here both of those eigenvalues are very large, so everything gets expanded. So you understand, right? Small eigenvalues, things get collapsed; large eigenvalues, things get expanded. And so on, blah, blah, blah, very nice drawings. All right, so how can we do that in PyTorch? In PyTorch, we do exactly the same, but we use something that is called nn.Linear. Can you see here? So nn.Linear is just taking a matrix and multiplying it by your input. It's actually the same. So although it's called nn, just forget about that: it's just a linear transformation. And here, it's actually the same; this is basically one of the examples we have seen before. All right, so let's do something more fun. Let's introduce a nonlinearity. So here, I just multiply my x by the identity times s, a scale factor. So this is my scaling matrix, which is just scaling. It doesn't rotate any more, because it has the same value in both directions. And then we have a tanh. You know the hyperbolic tangent, right? OK, good.
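As a small sketch of the point just made, nn.Linear is nothing more than an affine map, and the eigenvalues of the weight matrix decide which directions get stretched or squashed. The specific matrix here is illustrative, not the one from the notebook.

```python
import torch
import torch.nn as nn

points = torch.randn(1000, 2)        # a Gaussian cloud of 2D points, as above

# nn.Linear is just an affine map: out = points @ W.T + b
linear = nn.Linear(2, 2, bias=False)

with torch.no_grad():
    # plug in an illustrative matrix: one large and one small eigenvalue,
    # so one direction is stretched and the other is squashed
    linear.weight.copy_(torch.tensor([[2.0, 0.0],
                                      [0.0, 0.3]]))

out = linear(points)                 # same as points @ linear.weight.T
```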
All right, so let's try to run some stuff. OK, so this is my hyperbolic tangent first. Everything that is below minus 5 is basically pushed down to minus 1; everything that is above plus 5 is basically 1. As you approach 0, you get a roughly linear part in the center. And you get this kind of s shape. So this is also a sigmoid-like function, sort of. All right, so let's put here a sequence of a linear transformation, can you see here, and a tanh. Let's see what happens. So here I have my input, which is my scatter in the two directions. And here I have my scaling factor equal to 1. So I scale things a little bit. Actually, I don't scale; I keep the same scale. But then I apply this tanh. So things that are out here get pushed towards 1. So everything that is outside the plus 5 we have set goes to 1. Everything here goes up to 1. And things that are below... OK, let me show you the other graph here. So everything that is outside this region gets pushed towards 1, everything that is in this region gets pushed towards minus 1, and the same above and below, right? So we get this nice cloud shaped as a square here. Does it make sense so far? Here we are only applying some kind of zooming factor, right? So what happens now if we increase the zooming factor? It will go further. Yeah, there you go, correct. So everything goes further out. And if I push more, everything goes even further. And even more, right? And even more. And again, you understand, right? So the whole point here is that we have converted that initial cloud, which was very dense, with many, many things close together, and we pushed everything to the sides. And we have a few samples here that are very, very disentangled. So here you can already see how things get pushed away. And we have different stretching in different regions, right? Here these guys are very stretched, and here things are very collapsed. Remember before what we were trying to do? We were trying to warp different regions by different amounts. The same thing we see here. Here things are very expanded, and things here are very collapsed, right? If I go back up to the other one, here you can see a smoother transition, right? Here things are very, very, very pressed together, and here much less pressed together. You can also apply some rotation if you don't have just the identity matrix there, OK? So this is just showing you how I map a cloud of points into a square. And the nice part is that now, if I just run a neural net, basically a sequence of a linear mapping, a nonlinearity, and a linear mapping, just with random weights, without training anything, and I apply this stuff to my little cloud here, you're going to see some very pretty pictures. So this is simply a random neural net applied to that initial cloud, right? These are my initial points, and here I feed these 2D coordinates into my network, and this is the output of my network. And you can see here that just an untrained, randomly initialized neural net can perform some very nice squashing and pushing of the data around, OK? So we are going to be using this stuff to push space around and unwarp that spiral in a bit. Questions so far? No? Are you with me? Are you sleeping? Are you OK? Are you interested? Yes. OK, good. All right. So, do we have an intuitive explanation of why there seem to be different numbers of folds in these? That one seems to have very clearly two edges, maybe. Yes, yes. So if you check here, there is a number, which is the number of hidden units.
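Here is a minimal sketch of the random, untrained network applied to the point cloud, as just described: a linear map, a tanh, and another linear map, all with random weights. The hidden width of 5 matches the 2-to-5-to-2 shape mentioned next; the seed is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # every seed gives a different random warping

points = torch.randn(1000, 2)        # the initial Gaussian cloud

H = 5                                # hidden width; the demo goes 2 -> 5 -> 2
net = nn.Sequential(
    nn.Linear(2, H),                 # linear mapping into the hidden space
    nn.Tanh(),                       # point-wise nonlinearity
    nn.Linear(H, 2),                 # linear mapping back to the plane
)

with torch.no_grad():
    warped = net(points)             # untrained weights already fold and squash the cloud
```

Plotting `warped` colored by the original coordinates reproduces the kind of folded, squashed clouds shown in the demo.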
So in here we go from a dimension of 2, which is the input space, the coordinates on the plane, up to five dimensions, a five-dimensional space, and then from the five-dimensional space we go back down to 2. When you have this kind of expansion, you can start pushing and moving things around, and you get a lot of degrees of freedom there. So was each of these images the output of one of the layers? The images you just saw there are simply randomly initialized matrices with a nonlinearity in between; I feed the points through them, and I just show you what a neural network can do to those points on a plane. OK. And does the color stay attached to the point? The color stays with the x coordinate of the initial position here. So the color here expresses the initial position. Yes, yes. All right, so all we have to do now is basically get some kind of control over how these networks produce these funky shapes. If only we knew how to change those parameters in order to enforce some specific de-warping, right? So all we have to do now is take those spiral things and perform some de-warping, which allows me to untangle that intricate spiral. And that can be done very easily with Torch, with just a few lines of code. All right, so how much time do we have left? Half an hour. OK, that's a lot. All right. So we have understood that these neural networks, even if they are not trained, can perform some space warping. This slide was completely empty. Yes, I'm aware. I can see. Thank you. All right, so we have seen now that these neural networks can perform some kind of space warping. They take some points, as I showed you before, and they map them to different regions. And we now have to control them. We have to find some way to move those points in specific ways, right? Those warped things, we would like to move them around. So let's just do that. How? In the next few slides. So, I have just one question. Yes. When you're making this movement, should it always be... should it be connected or not? Do you understand what I'm saying? So when you have a movement, if you have two points which are connected, they will stay together. But if you have, for example, with these dots, one area which is red, and around it is blue, and again around that is red, then you don't have a movement which will put all the red together. OK, very well. I understood. Yes, I understood. So I'm trying to repeat the question. The guy in the first row asked: if we have some points in the center, let's say red points here, then we have a ring of green points, and then we have other external points in red, how would we manage to move these guys? So can we put together things that are not connected? Not connected. So can we do something like that? Yes, we can, by adding additional dimensions. So we were on a 2D plane. But if you move to a 3D space, you can take the red stuff which was the ring, you pull it in this direction, and the other guys you put over here, and this way you can just cut this way. All right. So, training data. We need to now put the data in a form that we can use for training. First of all, we are going to define a matrix X, which is our... it's called the design matrix. In my rows, I have these x's with a bar on top, which simply represent vectors. And here I have my first vector, which is the coordinates of my first point in my spiral plot. So I take the first point there.
And I put here the first number, then the second number. Then I take the second point, number two: first coordinate, second coordinate. And I go down until the last point I have in my spiral: first coordinate and second coordinate. Are you sure? You understand? OK, very well. All right. So this one is simply the collection of all my x points. How many points do I have? m. So m is going to represent the number of training samples I'm using for training my system. What is n? n instead is the size of my features. In this case, since we have points in a plane, n is going to be just two: we have two coordinates, right? So in this case, given two coordinates, x and y, and given m examples, I'm going to try to learn those three different labels. OK? Right, good. Second point: we have to see what the labels of the classes are. So for every example here, from the first, second, up to the m-th example in my dataset, I have my c1, the class of the first example, then c2, the class of the second example, and cm, the class of the last example. So this one could be red, this one could be purple, this one could be yellow, OK? So these are categorical bits of information which are connected directly to my design matrix. Therefore, this is a matrix of size m times n, whereas this guy is simply a vector of dimension m. Good? Yes? All right. One more. So what is this? These are actually my real labels, because this guy here represents whether it is red, it is yellow, it is whatever. But I have to represent this kind of information in a more computer-friendly way. I cannot say: oh, this is red, this is yellow. So we are going to use these y vectors. There is a line underneath: if I draw them with a computer, I just make them bold. This is just to show you they would be bold if they were drawn by a computer. So this is my label, which is somehow built from this guy, associated with the first example. This is the computer label for the second example, and so on and so on. So we have m labels. And in this case, we have capital C classes. So can you guess what the representation of the label for the machine is going to be? Does anyone have an idea? It could be RGB. Or, let's say, it could be 1, 2, 3. But again, 1, 2, 3 implies some order: 1 comes before 2, right? And here I already show you that the size of these guys is capital C. It's going to be a one-hot encoding. What is this stuff? It means: if you belong to the first class, you're going to be the vector 1, 0, 0. If you belong to the second class, you're going to be 0, 1, 0, and so on, right? So I just put a 1 where you belong to a specific class. And in this way, I don't enforce any kind of categorical ordering, OK? And this is called? Sorry, yes. Are we doing this so they can be interpreted as probabilities? You don't have to, but we do this in order to define, later on, a cost function which we are going to use to train the system. But yeah, it can be interpreted as a probability later on. You can see this one as: what is the probability of this point belonging to the first class? So this one: 100% first class, 0%, 0%, OK? So it can also be expressed and thought about in that way. OK, so far, good? Yes, no? Yes. People are not moving their heads. OK, someone is moving their head, so someone is still alive. All right, so here, xi, which is my i-th sample, is simply a point in R^n. And in our case, n is just 2, because it's a point in the plane.
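As a small sketch of the one-hot labels just described, here is how they can be built in PyTorch; the class indices are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative class indices c_i, one per training sample (m = 4 here)
c = torch.tensor([0, 2, 1, 0])
C = 3

# One-hot encoding: a 1 in the column of the correct class, 0 elsewhere
Y = F.one_hot(c, num_classes=C)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0],
#         [1, 0, 0]])
```

In practice, PyTorch's cross-entropy loss accepts the class indices directly, so the one-hot form is mostly the conceptual picture behind it.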
m instead is the number of samples in the data set. My y here belongs to... well, it's actually not R^C; it's going to be the set {0, 1} to the power C, where capital C is the number of classes. And C, in this case, is 3, because we have three spirals. And the last ones are going to be my classes, ci, which is basically a number from 1 to capital C. All right, so let's move on. So in here, I would show you how to draw the spirals with torch. I think we can move on, because it's simply just drawing those spiral things, and I think it's cooler if we actually get some results. So I would recommend you just go through the data generation later on, after we are done. So let's finally see how and what these neural networks are. We have spoken about those cute and fancy things, but let's see how they work. So here I show you a three-layer neural network. This guy here is my input x, which is fed into this first guy here. So we have one, two, and three layers. There is one guy in the middle, which is called h, because it's hidden in the center of the network. And then I have my output up here on top. Networks go from the bottom to the top. If you see them drawn from top to bottom, it's wrong. Just stick with this notation, please. All right, so what is the equation governing this neural net? My hidden representation h is simply a nonlinear function f, applied point-wise to this guy here. And this guy here is simply an affine transformation of the vector x. Do you understand this nomenclature? I mean, you're all physicists, and you've already seen matrices, so you know, right? I mean, who doesn't know what an affine transformation is? You know, right? You're not undergraduates. All right, so I'll just keep going; just stop me if... All right, thank you, because I'm not sure. So here we have an affine transformation of my input x, right? I have a bias there, which is just shifting, and I have a linear transformation here, which I call Wh, mapping my input to the hidden space. And f here is a nonlinear function applied point-wise to every element of this vector. In the same way, I just keep doing the same operation, and I have my y hat, which is my predicted output, which is, again, a nonlinear function applied to this affine transformation. And this affine transformation is applied to the hidden space, the hidden layer, okay? So you have an input, affine transformation, nonlinear function, affine transformation, nonlinear function, output. Good? All right, yes, question. Why the hidden layer? Why not just have a nonlinear function from input to output? Say again? Why do we need a hidden layer? Why can't we just have... Because then you would have a shallow network, and shallow networks don't... so if you have one layer only, you can only do the squashing thing. With just one affine transformation, as you have seen before, you can stretch things, you can push them, you can rotate, but you cannot do warping; you cannot get those fancy diagrams we have seen. As soon as you stack two of these guys, those two guys here, you start seeing all that fancy behavior that the neural networks I generated before, without any kind of meaningful weights, were showing, okay? So that's basically it. So this is a neural network. Oh, really, there's not much more than this.
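For reference, here is a minimal sketch of those two equations as a PyTorch module. The dimensions n = 2, d, C = 3 follow the slide; the choice of tanh for f and softmax for g is just one of the options listed next, not the only one.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    # h = f(W_h x + b_h);  y_hat = g(W_y h + b_y)
    def __init__(self, n=2, d=100, C=3):
        super().__init__()
        self.hidden = nn.Linear(n, d)    # W_h, b_h: input space -> hidden space
        self.output = nn.Linear(d, C)    # W_y, b_y: hidden space -> class scores

    def forward(self, x):
        h = torch.tanh(self.hidden(x))               # f, applied point-wise
        return torch.softmax(self.output(h), dim=1)  # g, here a softmax over the classes
```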
We're gonna see some more stuff tomorrow, where we'll introduce other cool things, but that's pretty much it. Everyone is okay with this thing, no issues, right? We are not afraid of this guy. All right, good. So what are these guys? Simply, I guess you know. So this guy here maps from the input space, R^n, to this d space, which is our internal dimensionality. And therefore my bias is also going to be of size d, right? Because we are going to sum it with this linear transformation going from n to d, so you also want to sum something of size d. And the same for this guy here: it is going to output C classes, because we are expecting that kind of one-hot representation, so C classes coming from d. And the other guy here is also going to be of size capital C, because it has to be the same size, given that we are summing. All right, so what are these f's and g's? These f's and g's are, again, nonlinear functions applied point-wise, so element-wise. They can simply be the positive part of a function, they can be a logistic sigmoid, they can be a tanh, the one I showed you before, they can be a kind of Boltzmann-distribution thing. You can put any kind of nonlinearity there. Some are more used than others; I won't go into much detail. This one is very nice. You know, you're physicists, right? All of you? Not all. Who is not a physicist? But you know Boltzmann, right? Okay, so. Personally, yes. All right, so you know this guy. So this is simply a Boltzmann distribution. All right, and so the overall summary of this slide is that my y hat, which is my prediction here, is basically a function of my input, which is fed in down there. And y hat goes from the input space to this kind of categorical space where we are representing the different classes. So here we go again, mapping the input to its own class, okay? But this can also be seen as mapping the n space to a d space, which is potentially larger, and then going out to the final capital C classes, where the dimension of the internal space is larger than the input. So we go from two dimensions to a larger space, maybe five dimensions, and then we go down to... actually, we go to three, because we have three classes, right? Questions so far. Is it necessary for the intermediate layer to have a higher dimension than the input? So yes, for the reason he asked before: how would you separate things that are not connected? You need to push things into a higher dimension. This is very true also when you deal with even larger inputs, like images or audio and all those things; you really need to go to larger dimensions in order to move things. If things are in a small number of dimensions, you cannot push things around, because everything is just packed. So first you push everything apart, into a larger space, and then you can take things and move them around much more easily. As soon as you increase the number of dimensions, you have much more freedom to push things in different directions without being constrained by moving other stuff, okay? Other questions? Yeah. I mean, how should you pick... how do you know five? You cross-validate. So you're going to try empirically and see what works better. Other things, other questions, yes. So how do you determine the final parameters? Um, I haven't told you anything about that yet, right? So far, these guys are coming from the sky; no one gave them to you, no one told you anything, I haven't said that yet. Other questions? Are you ready to see how we train this stuff?
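Since the slide lists the Boltzmann-distribution-like nonlinearity among the options for g, here is a minimal sketch of that softmax, which comes up again in a moment; the logit values are purely illustrative.

```python
import torch

# Illustrative logits, i.e. the inputs to the softmax, for C = 3 classes
logits = torch.tensor([2.0, -1.0, 0.5])

# softmax(z)_i = exp(z_i) / sum_j exp(z_j)
probs = torch.exp(logits) / torch.exp(logits).sum()

# the built-in version does the same, in a numerically safer way
probs_builtin = torch.softmax(logits, dim=0)

# every entry lies in the open interval (0, 1) and the entries sum to 1
```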
Are you excited? I don't even know what he's saying. Okay. He's not hired; I don't know this guy. We didn't prepare this before. All right. All right. So, okay. So again, you're going to be bored, right? So this guy is two-dimensional, because it's a point in the plane. This guy is 100-dimensional, which means you have a lot of freedom. And the last one there is three-dimensional, because there are three classes. All right. What's next? I guess it was the tutorial. Let's skip this. So you are putting a two-dimensional space into a three-dimensional space? Yes, but going through the hundred, right? I go from two, to a hundred, to three. Are you okay so far? But you're going from two to three. So, I mean... okay. So you're just going from a two-dimensional space to a three-dimensional space. So I go from my point in a plane to a probability defined over the number of classes. I see. This is what I do. So I start from two dimensions and I end up with capital C dimensions, one per class, okay? So n dimensions to start, capital C to end up with. Are there questions before I go on? There you go. All right, so I'm skipping all the demos. And so this is the Boltzmann equation I showed you before. So what is the softmax? The softmax over a vector is simply the exponential of that specific element divided by the summation of all the other exponentials. So, basically, given a vector of whatever energies, you get a probability distribution over those energies, okay? And actually, yes, there should be a minus; we don't care, because we just learn the negative energy. So yes, absolutely, there should be a minus if you're a physicist; I'm not. But yes, you can put a minus, nothing changes. These are called, just by convention, logits. So the definition of a logit is: the input of the softmax. I'm confused. Could you say that it's either zero or one? No, see, this is an open interval. Oh, the interval is zero to one. I'm sorry, with the endpoints excluded. Right, right, right. All right, so what is this stuff? I now finally define something. We are going to define a cost, or loss, function, so that I can run an optimization procedure, an optimization program, which is going to try to minimize a specific cost. By performing this optimization, we are basically going to straighten out that final warped space, that kind of spiral, right? So the whole point of defining this loss here is basically to tell the network how we'd like the final result to look, right? We have to ask the network: okay, network, please perform a specific task. The task is going to be to lower this cost as much as it can. Lowering this cost is going to translate into warping that space, with different strengths in different places, okay? So this is it without mathematics. Now let's try to see what this is. I have here my loss function, capital L, defined on the whole capital Y, which is the whole set of labels. It's just the average of the per-sample losses, okay? So this is an average of the per-sample losses. And my per-sample loss is simply the minus log of my prediction for the correct class. What does that mean? Let's see an example. This is also called cross entropy, or negative log likelihood. So let's see: for example, we have my input here on x, and the correct class is the first one, so the correct label should be [1, 0, 0], okay? All right, so let's say my y hat produces something that is very, very similar to [1, 0, 0], okay? Very similar.
So how does this loss behave? We are going to take the minus log of the prediction for the correct class. The correct class is the first one, right? So if I take minus log of this guy, the log of something slightly below one, how much is it? Log of one is zero; log of almost one is a little bit below zero, minus something tiny, right? And given that we have the minus in front, we get zero plus. Okay, so given that our network produces an output like this one, with this kind of function, then the loss is basically zero, a tiny bit more than zero, but essentially zero. So if we predict the correct label, the loss associated with predicting the correct label is zero, right? Good so far? Yes, no? Yes. Yeah, okay. What happens if we predict a wrong label? We are going to take the log of something that is almost zero. What is the log of almost zero? Very negative; it tends to negative infinity. Then we have the minus in front, so this guy tends to plus infinity, okay? So predicting the wrong label gives us a very, very, very high loss, and predicting the correct label gives us a zero loss, okay? So now we have this guy here, this average loss, which is the average of all these per-example losses. And these per-example losses are going to be lowest whenever we have correct predictions, and very, very large whenever we have wrong predictions, okay? Good, all right. I know; stay with me one more slide. Just lend me the few neurons that are still working in your brain and let me finish up. All right, last slide, seriously. So here I define my set of parameters as the collection of all these guys here that I drew, that I showed you before, which are just initialized with random values; they don't have specific values, like in the neural networks you have seen before. So the networks I showed you before were just randomly initialized neural nets, with random values, and you were seeing those very fancy diagrams, right? What we are going to try to do now is define a loss function here, similar to the one we have seen before, which is parametrized by this guy here, and then we are going to change this guy here in order to minimize this loss, okay? Let me be a bit more specific. So here I'm defining this loss, J of theta, as the previous L evaluated on Y and on my predictions, which are a function of theta, right? So these are my predictions, which are a function of my parameters. And that's basically saying: this is my loss expressed in terms of true labels and predictions, and this is my loss defined in terms of the parameters I use. So it's simply... bless you... it's simply a change of variables, right? Here the loss is defined in terms of predictions and targets; here the loss is defined as a function of my parameters, okay? Yes, no? All right, are you sure? Okay. All right, so more drawings. We can think about this loss, for example, if we have one direction only, my lowercase theta, as something like a quadratic function. So if I'm here, how do I reach the minimum? How do I get here? I look at where the slope is going, right? I just go down the hill, right? How do you find the slope? You check: what is the slope at this point here? The slope here is going to be the derivative of my loss with respect to this guy here, evaluated at theta zero, right? So this guy here gives me the slope here. And if I'm here, I'd like to go in that direction, right? So here the slope is positive, and I'd like to go in that direction.
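To make the two cases concrete, here is a tiny numeric sketch of the per-sample loss, minus log of the predicted probability for the correct class; the probability values are illustrative.

```python
import torch

c = 0                                          # correct class, i.e. one-hot target [1, 0, 0]

# A confident, correct prediction: y_hat[c] is almost 1
y_hat_good = torch.tensor([0.98, 0.01, 0.01])
loss_good = -torch.log(y_hat_good[c])          # ~0.02, essentially zero

# A confident, wrong prediction: y_hat[c] is almost 0
y_hat_bad = torch.tensor([0.001, 0.990, 0.009])
loss_bad = -torch.log(y_hat_bad[c])            # ~6.9, heading to +infinity as y_hat[c] -> 0
```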
So if I have positive numbers, I'd like a minus here, so I go that way. If I'm here, my slope is negative, so I also put a minus here, so I know that I have to move towards positive numbers, right? So here we observe that, in order to find the minimum, we'd like to step in the direction of the negative gradient of the loss function with respect to the parameters. I know it's a mouthful; I hope you understand. Shall I repeat? No, is it good? So it's just an optimization problem, right? All right, finally, this is called gradient descent. You step in the direction opposite to where the gradient points, huh? All right, so what are these guys here? This guy is the partial derivative of my loss function with respect to this first parameter here, which is basically, you know, the chain rule: it's just the derivative of the loss with respect to y, and then the derivative of y with respect to W. And this guy, given that it goes through the previous h, is still going to be the chain rule, no? And that's why we use PyTorch: because this guy here, this is called backpropagation, which is just the chain rule, and it is performed automatically by PyTorch. So PyTorch, whenever you perform operations on those tensors, remembers what operations you have performed. And therefore, if you have a sequence of operations, it's basically like having a computational graph. Whenever you have the last guy, you can go back, and you get this expression here and that expression there automatically computed for you. So we use PyTorch because there is an automatic differentiation engine behind those tensors, and it can memorize all the operations you have performed on those vectors, so that we can perform backpropagation with simply one line of code. And that was my last slide of theory. I'm just going to show you that this stuff actually works, if you'd like. So shall we go, and I'll show you? Okay. I don't care if you have questions; you're going to see me later, because otherwise I won't get to show you anything. All right, so back to the notebook here. I'm just going to run the spiral classification. It's the fourth one, actually. I skipped the automatic differentiation tutorial, where I was going to show you how those computational graphs are generated and how the chain rule is computed. But I guess you can go through that on your own. So here I'm just executing every line of this code, like that. And there we go. So the first part, which we also skipped, was the generation of this data here, which is defined in a previous cell. And it draws exactly the thing I showed you before in the presentation. Here we are training our network in order to minimize that J function, J of theta, with respect to those parameters. So we are basically updating those parameters in order to step in the direction the gradient is not pointing, right? In the direction opposite to where the gradient is pointing. And I show you here that this network is simply a two-layer network: it has two different linear layers with no nonlinearity in between. And then, here, if I show you how it worked... well, sort of, not really, right? It tried to, but it's still linear, right? So what's wrong here? If we check the model here, you can see my input x goes through this guy here, a linear layer, and then goes through another linear layer, and then the output. So I have two linear transformations. Well, nothing works, right? Because it's just linear. I want to warp things. So what do I have to add? A nonlinearity, right?
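Here is a minimal sketch of the training loop being run in the notebook, assuming X and y are the design matrix and label vector built as in the earlier data sketch. The model is the purely linear two-layer one discussed above; the optimizer, learning rate, and number of epochs are illustrative assumptions, not the notebook's exact settings.

```python
import torch
import torch.nn as nn

# X: (m, 2) design matrix of spiral points, y: (m,) class indices, built earlier
model = nn.Sequential(
    nn.Linear(2, 100),
    nn.Linear(100, 3),        # purely linear so far: no nonlinearity in between
)
criterion = nn.CrossEntropyLoss()                          # takes raw scores and class indices
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)   # illustrative optimizer / learning rate

for epoch in range(1000):
    y_pred = model(X)             # forward pass
    loss = criterion(y_pred, y)   # J(theta)

    optimizer.zero_grad()         # clear the old gradients
    loss.backward()               # backpropagation: the chain rule, done automatically
    optimizer.step()              # step against the direction of the gradient
```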
So I put a nonlinearity in between, and that's it. So we go down: same two-layer network, right? I go down here; here you can see my output, the output of this first affine transformation, goes through a positive-part function, and then the output of this positive part goes into the other linear layer. So we have an affine transformation, a positive part, an affine transformation, and then the output, okay? So we just put a positive part between those two linear layers. That's all I've done. We are doing exactly the same thing, but, can you see here? The accuracy goes up. Huh? Accuracy: should it be higher or lower? Which one is better? Higher accuracy is better, okay? And, oh, it should be, right? So this is exactly the thing we saw a few pages above, with the three lines, where there was no nonlinearity. Here I just added a positive-part nonlinearity, and you automatically get that the network learns how to curve its own decision boundary, in order to achieve the highest... well, actually, to achieve the lowest loss possible on the training set. Is it clear? So what's the difference? This guy here basically couldn't classify: we got an accuracy of about 50%, which is not that bad, because we have three classes, so 0.33 would be the random guess, and that is roughly the best performance you can get with linear models, or shallow networks. And just using two layers, the same two layers we used before, by adding the positive-part, point-wise nonlinear function, we manage to curve the space and perform this very nice separation. Questions, yes? It looks like the model is slightly overfitted if you extend past the region the data was generated from. Do you have any idea why that would be? You mean this one, right? Yeah, it curves right where the data ended. Right, right, right, right, that's a very good point. So the model capacity we are using is actually very, very tiny. There are just two layers, and the dimensionality, okay, a hundred dimensions, maybe it's fine. So the hundred dimensions give you this kind of curving here. But then all the power, the modeling power, has gone into shaping this curve here in order to maximize the separation of those points into these classes here. And you're asking: ha, why didn't this part out here do anything? Because here, anything it does, this little curvature, doesn't improve the training loss, right? So whether this guy goes in this direction, or this direction, or this direction, it doesn't change the training loss much, but it does change a lot how well you manage to shape this curve here. So the more you curve here... hold on. The more you curve here, the less you can curve there. So you always have a kind of trade-off between how well you can characterize parts here versus how well you can characterize them in other places. And that's correct: this is just my training set. I didn't show you where the actual points belonging to a validation set are. They may be here, and then we have perfect generalization, or they may actually happen to be here, and then we are going to have very poor generalization. Can we decrease the 100 down to 50, and do the same? The dimensionality, right? You mean the hidden dimensions. Hidden dimensions. So first I want to decrease it to 50, and second, increase it to 150, but I don't know how much time it will take; but for 50 it's possible to decrease, right? So that's why you have the notebook, right? You have the notebook on your computer.
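The only change relative to the linear model sketched above is the positive-part (ReLU) nonlinearity between the two linear layers; the rest of the training loop stays exactly the same. A minimal sketch:

```python
import torch.nn as nn

# Same two linear layers as before, with a positive-part (ReLU) nonlinearity in between
model = nn.Sequential(
    nn.Linear(2, 100),
    nn.ReLU(),            # the point-wise positive-part function
    nn.Linear(100, 3),
)
```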
Yeah, but first I need to manage to run this. So, if it's easy, let's try it here, and then the others will also see it. Of course. Here, so here we have defined the dimensions. Two. No, no, no, hold on, that's not it; that's the dimension of the plane. Yeah, I don't know where it's defined, hold on. H. Is it here, with H? Oh, capital H, okay, I see. All right. Capital H, here, oh, number of hidden units, yes. So you want it at 50. So I go here, and I regenerate this guy with 50. 50, same, almost. Oh, okay, you can see here there is much less turning, right? Also here, there is a much larger region where the space is not bent, here and here, and the same here. And if you actually enlarge this view, you're going to see that this stuff is going to be linear in this direction, this stuff is going to be linear in this direction, and this other guy here is also going to be linear, which is the part that doesn't cost me computational modeling power, okay? Now can we put 160? Yes, we can do 150. Don't we need this first exercise for the next one? All right. We'll just check and then we're going to go. All right, all right. Really? Yes. 150. You're not sure? There you go. No, it's okay. Yeah, but... I'm done. Voila. And now you can see a much, much, much smoother curve. Other questions? I think we have two minutes, yes. The way we chose y was like [1, 0, 0]. And then, correspondingly, we chose our cost function to be minus log of the c-th component of y hat. So this can only do classification, right? So what if I want to do something that's like a prediction problem? Yes, if you are going to do a prediction, you're basically going to use an MSE, mean squared error, where you measure the quadratic distance between your target and your prediction, okay? So that's a different cost function. That is going to be a different cost function, that's correct. The other question? Could you add a fourth class that just says it doesn't belong to any of them? Would that improve your accuracy without adding more computation, or with just a little bit more computation? Which kind of samples would I put there? Just everything else? I mean, could you generate them? But where would you put them? Right now we have three different classes. You can have... okay, this applies to other cases too; I understand your question. So maybe if you do, say, face recognition, maybe you have five different identities and then you would like to have an "others" identity. It's something like that, right? Yeah, yeah. So what I've seen is that usually having the "other" category is bad, because the network tries to group everything else together. The network is going to try very hard to put together things that are different, rather than just having more classes, which is actually easier, because then the network only has to group small regions, and these ones here are relatively small. If you have different things that are spread all around and you try to group them together, the network will struggle to somehow push those things together. So the "other" category is usually not a very good idea in this case. One more question? No? If not, thank you so much for listening, and if you have any other questions, I will be outside. Thank you.
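For the question about prediction problems above, here is a minimal sketch of swapping the cross-entropy for a mean squared error loss, as suggested in the answer; the prediction and target values are purely illustrative.

```python
import torch
import torch.nn as nn

# For a regression ("prediction") problem the network outputs real values,
# and the loss is the mean of the squared distances between prediction and target.
criterion = nn.MSELoss()

y_pred = torch.tensor([0.9, 2.1, 2.8])     # illustrative predictions
target = torch.tensor([1.0, 2.0, 3.0])     # illustrative targets
loss = criterion(y_pred, target)           # mean of (y_pred - target)^2
```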