Good morning, afternoon, or evening, and happy Lunar New Year! All right, so let's start. Today's lesson should be the third practicum, but Yann took the first one, so I guess it's the second practicum. We're a little bit staggered, but we'll figure out how to catch up.

I start this lesson by sharing a few tools. You don't have to use them; this is what I use, I like them a lot, and if I like them, perhaps you may like them too, and that's nice, right? Nice, sharing tools. One thing I've been sharing (I also shared it last year) is a piece of software for taking notes called Typora, T-Y-P-O-R-A. This software lets me write... oh, okay, I'm messing something up; how do you change the zoom? Zoom in, zoom out... oh, with Shift, okay, there we go. So here I can write Markdown (all of this is Markdown), but I can also write LaTeX, so it's a very convenient way to take notes on papers. It's free and it works on every platform. I like it.

Another tool I started using recently, and again it's free for students, is this thing called Notion. This is my control room: here I have things for work, like to-dos and many other things, and these are from my private life (I don't know, pictures from Christmas). You can also have pages with things nested inside: I have a page for media, and inside there is every type of media I'm consuming from the internet, like blog posts, with the blog's website, a to-do priority, the subject, the author, and so on. Another one is, I don't know, books that I just started reading (this is all personal development, maybe I'll get to improve myself), or cooking: I have another database with some recipes. I think it's very nice to have things in one place, and I'm going to start using this for research as well: here, for example, you can have a database with all the articles you've been reading and write notes on each of them; you can open one and actually write LaTeX inside. It supports LaTeX, which I think is super cool. All right, promotion done. I'm not getting paid, so it's not a promotional message, but I think it's nice. Someone mentions Obsidian; I haven't tried it, perhaps it's good. If you didn't know about these tools, now you know; if you knew already, okay, sorry to waste your time.

All right, so before starting the practicum today we're going to be talking about training, but before talking about training I need to make a very big distinction and be very clear about one thing. Okay, question for you (I'm reading the chat): how does someone train a network? What is the procedure, the algorithm, that one uses to train a network? Answer in the chat. Gradient descent. Yes, okay, that is the correct answer. So gradient descent is actually the correct answer (and don't just rely on PyTorch). Every time I ask this question, and now I don't know why you know the correct answer, but every time I ask this question in class, everyone answers "back-propagation", and then I get a little bit, I wouldn't say annoyed, but maybe yes.
I'm joking, right? So what is back-propagation (maybe you've watched my videos already), and what is back-prop used for? Job interview question: if I ask you what back-prop is used for, it's a way to compute the gradients. Cool, and that's it. So you don't train with back-prop; back-prop just gives you the gradients. The training happens by stepping, perhaps using the gradient, which is gradient descent. Okay, fantastic, you already know everything.

All right, so if I'm showing you this chart over here, I'm going to be talking about inference. Oh, and this is another piece of software, the third tool: it's called draw.io (they've since changed the name, to diagrams.net). Here I use all these shapes to draw the diagrams that Yann pointed out. So this is an observation, because it has a grey background; I feed it into this item here, which is an encoder (it's the "delay" shape in the advanced palette), and so I get my y bar. Why y bar? Because I put the x inside this function. So what is this procedure called? It has a name; I mean, it's a silly name, but before we talk about back-prop: if it's not "backward", type something down in the chat... Forward prop, yes, all right, cool. This is forward propagation: you get an output y bar given that you provide an input x to this module here. And this is the mathematical formula, which is written in the "wrong" direction: I usually read things from left to right, so x goes through the encoder to give me a y, but here it's written as y = f(x), that is, y is given by f to which you input an x. Again, I think the graphical version is more intuitive. Cool.

So how about this thing over here? What do you see? Can someone describe to me in words, in the chat, what the difference is in the second row? "Inference with a new example", okay, but both of them are inference: whenever I compute something given something else, it's called inference. So what's the main difference between these two drawings? Can you spot it? Right: I can see the bar over the x, that's the first point, and the second thing is that the observation is now the output, so there is no bar over the y. The y is the observation, with the greyscale background, and the bar is on top of the x. So how do we find this x bar, given that I have my y? Anyone can guess. Back-prop? Well... decoding would be using a different module: decoding would mean actually having one of these shapes flipped in the other direction; you feed that decoder with this y and you get an x. That is something called... how is it called... hold on, if you can type it down in the chat; no, it's not "target"... it's called something "amortised". Yes, thank you, Vlad: it's called amortised inference. So you could learn that inverse, but we don't do the inverse here. What we actually do is the following: the x bar is going to be the outcome of performing a minimisation over how far the network's output is from my target. Someone in the chat says "target prop"; right, so we have a target, which is our actual observed y (this y without the bar), and then we have the thing the network outputs (this thing over here), and you just do gradient descent such that you get the network's output to shoot as close as possible to this y, by changing x. And so finally, by using gradient descent, you end up with the x for which the model provides the closest answer to our target.
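To make that concrete, here is a minimal sketch of inference by gradient descent, under my own assumptions (the network f, the target y, the squared-error distance, the learning rate and the iteration count are all illustrative, not taken from the course material): the parameters of f stay fixed, and we optimise only the input x.

```python
import torch
from torch import nn

f = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 3))  # a given, already-trained network
for p in f.parameters():
    p.requires_grad_(False)                  # parameters are fixed: we are not training

y = torch.tensor([0.0, 1.0, 0.0])            # the observed target
x = torch.zeros(2, requires_grad=True)       # the input we are looking for
optimizer = torch.optim.SGD([x], lr=0.1)     # gradient descent over x, not over the weights

for _ in range(100):
    loss = ((f(x) - y) ** 2).mean()          # how far the network's output is from the target
    optimizer.zero_grad()
    loss.backward()                          # back-prop gives the gradient with respect to x
    optimizer.step()                         # the gradient descent step does the actual inference
```

The structure is exactly the one we will reuse for training later in the lesson; the only difference is which tensor we take the step on.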
Someone asks whether this is tractable; well, this is just gradient descent with neural networks, right? The parameters are fixed; the network is given to you already. Before, no one asked me about the parameters of this network: in the first case (who was asking, Colin?) no one asked "oh, are the parameters fixed?"; yes, of course the parameters are fixed. Here it's the same: this network is already given to you. How do you find the input that gives you a specific output? By minimising the difference between what the network actually outputs and the actual observation. So this is still inference, but the outcome, the output of inference, is the solution of an optimisation problem.

So again, if someone asks you "is back-propagation used only for training?", this is an important question; let's say it's an exam question (we don't have exams, but okay, whatever), a job interview question. Mike, do you see this question? Is back-propagation used only for training? What would you answer? No. Okay, good. All right, that's it.

Okay, fantastic. Do try out draw.io: I use these diagrams, the "delay" shape and the circle, and you can also turn on the LaTeX extra. Dark theme, of course. Do you know why I use a dark background, why I use the dark theme? Because bugs are attracted by light. Okay, all right, now you know.

All right, sweet. So we're starting today's lesson, ten minutes of delay, now fifteen, but okay, I gave you something already: we learned that back-prop is not only used for training (every year, and this year too, everyone answers that back-prop is how we train networks, which is not correct), and you figured out that gradient descent can also be used for inference. And actually we'll see very soon that this is the standard, the more generic, way of doing inference.

Cool, so what do we talk about today? Training, right? Yes. So today we're going to be talking about training, after I talked about this stuff that is not training, and we start from here. All right, so again, sponsorship that is not sponsorship: just come over to Twitter and say hi. And that's me; I don't fucking know, this is just for making people smile during talks at conferences, so people don't get scared.

All right, so we're going to be talking about classification. Why am I talking about classification? Because I can show you that everything that is done in machine learning we can do in deep learning, and it's a very easy way to explain things, because in theory you should already know all this stuff, so perhaps I don't have to go that slowly. Although, if I'm too fast, slow me down: there is a button in the reactions, "go slower" or "go faster", up to you. Okay, sweet.

Okay, so what is the outcome here? What are we trying to do? Here I'm just showing you a few branches of a spiral, which I can simply draw by using a parametric formula (I've written it down there): t goes from zero to one, so at zero you're in the centre, at one you're at the tail, and then you have capital K spiral arms; in this case we have three of them. Let's make things a little bit more interesting and add some noise there, and so you end up with this kind of more realistic data.
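The actual parametric formula lives on the slide and in the notebook; purely as a hypothetical sketch of that kind of data generation (the constants, the noise level, and the function name here are made up):

```python
import math
import torch

def make_spirals(n_per_class=100, K=3, sigma=0.02):
    # One arm per class: the radius grows with t in [0, 1], the angle gets a per-class
    # offset of 2*pi*k/K, and Gaussian noise makes the data look more realistic.
    X, c = [], []
    for k in range(K):
        t = torch.linspace(0, 1, n_per_class)
        theta = 2 * math.pi * (t + k / K)
        arm = torch.stack((t * torch.sin(theta), t * torch.cos(theta)), dim=1)
        X.append(arm + sigma * torch.randn_like(arm))
        c.append(torch.full((n_per_class,), k, dtype=torch.long))
    return torch.cat(X), torch.cat(c)

X, c = make_spirals()
print(X.shape, c.shape)  # torch.Size([300, 2]) torch.Size([300])
```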
So what is the objective of classification? What does classification do? I am asking questions here to figure out what your knowledge is. Given this data set (okay, I deliberately didn't specify things; I'm trying to understand your understanding), given these drawings: what is the best possible way of doing classification here? What is the input, what is the output, what am I trying to do? Type it in the chat. No one writes anything... this is not okay. "Multi-class classification", yeah, but what am I trying to do here, what is my objective? "Define decision boundaries", okay, too many people at once; define decision boundaries, okay, perhaps. "The input should be a data point": what is a data point, what is the size of the input here? Okay, the inputs are points in a 2-D space, so a 2-D coordinate, one of these points here; the x and y location is going to be my input. And what is my target? Given this location here, what should I do? The colour, right: I should say "red" if I take this dot over here; I should say "yellow" if I take this dot over here; and for this dot over here my classifier should say "purple". Okay, sweet.

So what is the easiest way to do classification? Just linear classification: I'm just drawing straight lines, like this, and so this is my classifier. (We're going to be training this one in a second.) Awesome, right? So what is the problem now? "Error", okay, but what is the error, what is the issue here? Okay, sweet, that was the key one: these regions here, these branches, these arms, are not linearly separable, so I can't possibly use a linear separation, a linear boundary, to tell apart the different spiral arms. And so how do we deal with this? Well, the issue here is that there are all these intersections that are not going to lead us to a good result.

All right, so the left-hand side is what, when I was doing my PhD, I was naively thinking: "oh, my network is unwarping the space", that basically here I undid my parametric function. When I train a network to fit this data, from the input perspective, where I still see all these things in the input space, I can look at the final decision boundaries, which are linear (each module is linear, and then we have a non-linearity, so if I look at the last layer it's still linear), and then, looking at the output from the input side, I can see how those linear output boundaries get morphed around. So the right-hand side is what you usually see in most of the tutorials online, on blog posts and so on, but I don't like it.
I like the left one more, seeing how things get warped around, and that's why last time, in the first lab, I showed you this animation, which is going to be coming up as soon as I share my screen again. All right, so at the beginning I show you these five branches; again, each point is a 2-D location, and I have capital K equal to five, so I have five different possibilities, and this is what really, really happens. What I'm doing here is just a linear interpolation between my input and an embedding layer (I'm going to be drawing this in a second). What I'm trying to show you is that before getting to the output of the network, at this embedding layer, which is shown to you when the animation finishes, these classes are now linearly separable. Okay, so all the previous chunks of the network do is basically warp this space (I like to call it the space fabric, because it looks like a piece of napkin). And so the point is that the network warps the data such that it becomes linearly separable, and then I use different planes. This green one over here is a plane that goes up in this direction; then you have another plane going up in the other direction, which is the yellow one; the intersection of the two planes is a line, so you have this line over here. Then, if I have a third plane going in yet another direction, you have an intersection between a plane and a line, which is a point, this point over here. This one here, instead, is the intersection between the yellow and the blue, but then I have another plane, and so you have another line here. So here you have basically five planes, and if you don't know why there are five planes, you can spend more time with the math and play around, but that's not the important part. Here you can see how the final layer can be separated by these intersections, which are intersections between planes (we could even call them hyperplanes). So I can chop up this space that was all warped before: it was impossible to separate it linearly, and now the network has basically unwarped the space such that it can be linearly separated.

Cool, so what am I showing over here? Now it's going to be drawing time. What I'm showing you right now is the following: a network which has two inputs, which are my x and y coordinates; then it goes up to 100 neurons. So, my first matrix: what's the size of this first matrix, my W1? Is it 2 by 100? Are we sure? Okay, we are always using column vectors... okay, thank you: 100 by 2. So the height of the matrix is going to be the dimension I'm shooting towards, so this is going to be 100, and then the width of the matrix is going to be the dimension I'm shooting from: I look from here, and this is where I'm going to. So the height is what tells you the outcome, if I use column vectors, that is; I know I draw them horizontally, and I'll tell you why in a second. So then here I have my non-linearity, which is going to be a ReLU, and then I go down to two neurons again, and then from two to five.
This is the input, and this is the output, and networks always go in this direction, upward. Why is that? Because as you go higher in the network, you go higher in the hierarchy. So what you have here on top is a classifier; usually people say "oh, you just put a classifier on top", right? So this is my classifier, a linear classifier on top. Cool, cool, cool.

So why do I have this intermediate layer over here? Well, this way I can do a linear interpolation between this guy and this one: I can do 100% of this one, then 99% and 1%, 98% and 2%, 97% and 3%, and so on, and that's how I produced the animation I showed you before. And how do we train this network? Well, we're going to be covering that right now, okay? The important part here is that I use 100 neurons in the intermediate representation, and this is why things behave so nicely and smoothly and work so well. Later on, before the end of the class, I'll show you a network which only uses two neurons per layer, and it's actually a very deep network, but each layer has only two neurons. A network with two neurons per layer can only move things around within a plane; the nice part of using 100 neurons is that you can move things in a hundred dimensions, so you have much more freedom to move things around, and then you can project them back down to two dimensions. Going to high dimensions, things are much more spread out and much easier to move, so what is called the optimisation process is much easier in a high-dimensional space: things are not so constrained. If you try to do the optimisation with only two neurons per layer, like the very tall network I'll show you in a moment, training the thing is a nightmare; I managed to make it work, but it's more of a hack, and I'll show you that outcome as well. Okay, so this was just to give you a small explanation of what I showed you in the first lab.

Clear so far, right? Let me see. So far everything is fine, right? This is half the class so far. Yes, no? Thumbs up, thumbs down, green circle two, green circle three... okay, nine, eight; okay, a good portion of the class is following, and someone's clapping. Okay, yay. All right, sweet.
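As a rough sketch of the architecture just described, under my own reconstruction (the exact module order and options in the animation code are not shown here), the 2-dimensional layer in the middle is the embedding the animation interpolates towards:

```python
from torch import nn

model = nn.Sequential(
    nn.Linear(2, 100),   # W1 is 100 x 2: from the (x, y) input up to 100 neurons
    nn.ReLU(),           # the non-linearity
    nn.Linear(100, 2),   # back down to a 2-d embedding, the layer shown in the animation
    nn.Linear(2, 5),     # the linear classifier "on top": five planes, one per class
)
```

The animation then blends linearly between each input point and its image in that 2-d embedding.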
Let me move this bar away again (every time I present something, these bars appear in front of my face; there we go, bar away), so we'll go back to the slides, and let's figure out how we train these networks, okay?

Okay, okay, so: training data. The network we saw last time, in the second practicum, performs an arbitrary transformation of these points; now we have to enforce that this transformation is, you know, instrumental to our task. So now we would like to do classification of points, so I have to introduce the training points. My input, which is always in pink and is a bold symbol, which means it's a vector: for me, my x, the example i, belongs to R^n, so there are n components in x. And then here I have my capital X, which is called my design matrix, and it has all my column vectors lying on their side: in this case, each x with the parenthesised index appears as its horizontal, transposed version, so I have all of them stacked one after the other. This is also what happens in PyTorch, okay? So, again, this is the design matrix: I have m rows and n columns.

On the right-hand side, instead, I have my c_i, which is going to be my class. It can be one, two, up to, for example, five in the last drawing, the last animation, I showed you. So here I have my collection of m items, one for each sample: x1 will have c1, x2 will have c2 (the class number two), and the m-th input (x means input for me) will be associated with the class c_m. So I have m labels; these are my labels, my classes. In this case there are three (okay, three or five) classes, but these are the labels associated with each data point.

A way to encode these labels is to use the one-hot encoding. So what's one-hot? One-hot simply says: since one, two, three carry an ordering, like any numbers, and I don't want to use that ordering, I just use a vector of all zeros with only a single one, in the position representing my given class. So if I had just three classes, I would have these vectors here with three components; in general this thing will have capital K components. So this one represents my first class, then the second class, the third class. And m is the number of data points: I have one, two, up to m data points, and therefore each data point will have its label. So c1 is the label for the example x1, c2 is the label for x2, and x_m belongs to class c_m, so c_m will be one, two, or whatever.

From the chat: "c is the output label?" c is the label; no, c is not y, I didn't talk about y yet; who's talking about y? "K classes distributed as labels over m points": that's it, yeah. So far everything's okay, right? I think so; otherwise, please answer in the chat the questions I cannot catch.

All right, cool. So what do we need now? Now I'm going to be introducing my Y. Y is simply going to be, in this case, the one-hot encoding of these numbers. So there will be capital K columns and m rows, one y per given x, and the given y will have a one in correspondence with the index represented here. So if this one is two, then you're going to have a one in the second item; if c_m is three, then this y is going to be equal to this one. From the chat: "capital C is the class index?" There is no capital C; there is a lowercase c. "Why is it one-hot?" Yes, okay, Justin is explaining. Sweet.

All right, cool. So here we have that my y_i, the i-th sample's label, is basically zero or one, it's binary, with capital K elements: you're going to have as many elements as capital K, and the c-th element is going to be set to one, where c is told to you by this number. Meaning, if this one is going to be red, you're going to have the one in the red column. So the columns are going to be red, purple, yellow: if this coordinate here is supposed to be yellow, you're going to have the one in correspondence with the yellow column, and if this point here is supposed to be purple, you're going to have the one in correspondence with the purple column. So each column is going to be red, yellow, or purple, and each of these rows is going to correspond to one of the coordinates I showed you in the drawing before.
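In code the same encoding is one call; a tiny sketch with made-up labels (note that PyTorch counts classes from 0, whereas in the maths here we count from 1):

```python
import torch
import torch.nn.functional as F

c = torch.tensor([0, 1, 2, 1])     # class labels for m = 4 points, 0-based
Y = F.one_hot(c, num_classes=3)    # m x K matrix: one row per sample, a single 1 per row
print(Y)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```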
Cool. So, fully connected layer. Again, we are revising this one, so let me know if I'm going too fast. I have my input at the bottom; then I have a matrix W_h (it's bold, so it's a matrix, you know this already by now); then we go to h, in green, which is my hidden layer (this one is also bold, which means it's a vector); and then we're going to have another matrix, W_y, which is going to be mapping, rotating, h to go to the final y hat. So these are the equations, basically the only equations we're going to be seeing: h is going to be my squashing function f applied to the affine transformation of x, and y hat is going to be the squashing function g applied to the affine transformation of this part here, of h. And we already watched and saw this very extensively last time, in the previous practicum. And this squashing function can be the positive part (ReLU), the sigmoid, the hyperbolic tangent, the soft argmax, whatever you want, okay?

So let me draw this in an expanded manner. The previous one is a more condensed neural diagram, where these nodes are supposed to be clusters of neurons; here I'm going to be drawing each separate neuron. So these are my input neurons, also called x; then I have, let's say, my first hidden layer (that's why it's called h, for hidden); then I have the second hidden layer, then the third hidden layer, and so on, until I have my final output layer. And so here I can see: the first layer is also going to be called a1, the activations of layer one; then you have layers two, three, four, until the last one; here we have five layers in this case. For me, layers are basically the collections of neurons, and I do count the first layer, the input layer, as a layer. And to move from a1 to a2 you're going to be using the matrix W1; then you have the matrix W2, W3, and so on up to W_L, okay?

So how do we compute, let's say, the first hidden neuron? Let's focus on the first element of the first hidden layer. This item here will simply be my a_j, the j-th neuron at the second layer, which is going to be my squashing function applied to the scalar product between w_j and x, plus this bias term over here. So first of all, what is w_j? Well, w_j is simply the j-th row of the W1 matrix. (Again, you should write this down on a piece of paper if it's not clear.) And what is the scalar product? Well, it's simply the summation of all the products, the element-wise products, and then you sum them all; then there's going to be one more offset term here, and then we apply this squashing function f. And so basically you have that all these inputs x on the left-hand side are multiplied by some coefficients, which are those w's; they are all summed together and displaced by the b_j term, to give you the first value, which then has to go through this squashing function. The second one is the same story: you have all the inputs multiplied by a second bunch of weights, these coefficients here; they are summed together, then displaced by the bias term, and then we apply the squashing function f.
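A tiny sketch of exactly that computation, with made-up sizes and the ReLU standing in for the squashing function f, just to check that the per-neuron view and the whole-layer view agree:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2)           # one input point, n = 2
W = torch.randn(100, 2)      # W1 is d x n: one row of weights per hidden neuron
b = torch.randn(100)         # one bias per hidden neuron

j = 0
a_j = torch.relu(W[j] @ x + b[j])   # one neuron: scalar product with the j-th row, plus the bias, then f
h = torch.relu(W @ x + b)           # the whole layer at once: h = f(Wx + b)

print(torch.allclose(a_j, h[j]))    # True: the same computation, just vectorised
```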
From the chat: "Is the function f a kind of sigmoid?" No, f can be anything; it's written down here: f can be the positive part (ReLU), the sigmoid, the hyperbolic tangent, the soft argmax, whatever you want. It just has to be a non-linear function, because if you only have linear functions, then what is a linear function of a linear function of a linear function of a linear function? A linear function, okay? (Actually, I mean, yes on paper, not on a computer: there is a paper from last year where they train a network which doesn't have any non-linearity and it still works; they use the floating-point approximation as the non-linear function, but okay, you know, whatever.) All right, so that's why we need to put in something that is non-linear. "Piecewise linear?" Well, that was the question: no, if you have a linear function of a linear function of a linear function, it's still linear; there's no piecewise. Piecewise is what you get if you use a ReLU, which is basically selecting regions.

All right, so how do we get the whole layer? Well, you just do the same for all of them, and boom. And this one was done in PowerPoint, so I drew every line by hand; don't do it, like, don't try to replicate it. Okay, painful. Then, guess what: I have all these weights collected in that matrix down there, okay? Same for the second hidden layer: you try to copy and paste, and it doesn't work, so I had to draw it again by hand, because the neurons are not aligned (see, there's an offset), so again, painful. So every neuron here has one weight: this one has this weight, this one has this weight, this one has this weight; all of them are multiplied by a weight coefficient, then they are all summed together here, and then you apply a squashing function right here, and then you get this value, and so on, repeat. And then all these weights are stored in this matrix. Now, yes, we can use copy and paste to make the following, and then finally you end up with the last one, whose weights are stored here, and the final outcome. Again, it's still the same thing: just multiplying values by coefficients, summing, applying some non-linearities, and then storing this stuff in a weight matrix for later usage. From the chat: "What are you using to make this visualisation?" A mouse and PowerPoint.

All right, so back here. Summary: x maps to h, which maps to y hat, and these are the mapping functions, affine transformations to which you apply a squashing function. These are the dimensions (again, just check later, if you like, whether you're comfortable with the sizing), and these are the squashing functions we may use. So what I want to say here is that my y hat, in this case, is going to be a function of my input: you can think about the network as a function mapping my input space, R^n, towards R^K, so mapping this x towards this y hat. But again, what we really are doing here is mapping this input in R^n to a usually larger, intermediate space R^d, and then back to this R^K. So usually this d needs to be way larger than the input dimension if you hope to be able to disentangle the input: usually the input is really crammed into a little tiny space, and it's really, really hard to tell things apart. And that's usually what deep learning has been very successful at doing.
It's been able to unwarp this thing. So the first thing you need to do is explode the data into a large-dimensional space; then, when it's in a large-dimensional space, you can move things around much more easily, and when you're done with that, you come back down to the final dimensions, okay? But there is always this kind of up-sampling, or up-spacing, whatever you want to call it. Even if you're using images of one megapixel, you still have to go and explode the dimension, just in a very controlled manner, because if you have a megapixel you cannot use hidden layers with a few million more units (we'll cover that in a future lesson); but we still need to go to a higher dimension in order to be able to move things around. So this is the answer to the first question from last lesson: why do we go into a high-dimensional space? Because in a high-dimensional space things are easy to move, and so the optimisation algorithm can actually work, it can move things around, and then, when you're done moving, you go back to the dimensionality you actually work with. Okay, cool.

So, two more slides and then we go to the notebook, so that we finish on time, perhaps. In this case my inputs are two-dimensional, we go to a hundred-dimensional intermediate representation, and then back to three. Okay, and that was the example I showed you, I think, for the spiral; I then used five in the animation just to make the outcome prettier, right?

All right, so: neural network training, first slide of two. How do we train these networks? You have a reminder of the equations at the top right. We're going to be using this soft argmax, okay, which is simply the ratio between the exponential of a given component and the sum of the exponentials of all the components: you have one component, you take the exponential, and you divide by the sum of the exponentials of all the components. Of course, this quantity will never be lower than zero, or even equal to zero (the exponential is a positive function, so you can't get to zero), nor can it exceed or be equal to one, because the denominator will always be larger than the numerator, and all of these quantities are positive. So it will never reach one: it lies strictly within zero and one, extrema excluded, right? And the l there stands for logits: the logits are the final output of the network, usually the linear output of a network, okay? That's the definition.

So I'm going to define here my loss, my curly L, over the entire data set: it takes my entire Y hat, which is the output of the network given the whole set of inputs, and the correct classes c (c is the correct class). And this is going to be simply my average: one over m times the summation that goes from one to m of this lowercase curly l, and these are called my per-sample loss functions. So here I have one outcome, my y hat for the i-th input, and then the correct class for the i-th item. And so what do I use here? Well, right now my per-sample loss is going to be defined, in this case, to be the minus log of y hat at the location told to me by the correct class, okay? And again, this is also called cross-entropy, or negative log-likelihood; we don't care right now, I'm just showing you what I'm doing, right?
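A tiny numerical sketch of those two definitions, with made-up logits (note that the library's cross_entropy applies the soft argmax to the logits internally, so the two numbers printed below should match):

```python
import torch
import torch.nn.functional as F

l = torch.tensor([2.0, -1.0, 0.5])       # logits: the linear output of the network
y_hat = l.exp() / l.exp().sum()          # soft argmax: every entry strictly in (0, 1), summing to 1

c = 0                                    # the correct class (0-based here)
loss = -torch.log(y_hat[c])              # per-sample loss: minus log of y hat at the correct class

loss_ce = F.cross_entropy(l.unsqueeze(0), torch.tensor([c]))
print(loss.item(), loss_ce.item())       # the same value, twice
```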
From the chat: "May I ask why there is a dot on top of the equals sign?" That means it's a definition: I'm defining it right now.

So what happens if I have an x and my correct class c? c can be from zero to capital K minus one... no, one to capital K, actually; I count from one in math. How about I have an input x and c equal to one? Okay, so let's say the one-hot version of my y has the first item equal to one (I count from one: one, two, three, these are three items), so c equal to one means I have the one in position one here. And then my network basically says: "okay, given this x, I say roughly one, roughly zero, roughly zero." First of all, what is "roughly one" here? Who can say more precisely, in the chat, what "roughly one" is? It tends to one; from which side? Yeah, so this is one-minus, right, and those others here, what are these? Zero, yeah, zero-plus, these two. All right: so if you apply this to the loss, given that the label is one, you're going to get the minus log of one-minus, and how much is that? The log of one-minus is zero-minus, and then there is the minus in front, so this one goes to zero-plus. Similarly, if I have the network telling me "oh no, no, it's the other one", then you're going to have that the loss is the minus log of zero-plus: the log of zero-plus is minus infinity, then we have the minus in front, and we get plus infinity, right? And so if the network makes a mistake, a very big mistake, you're going to get a loss that goes to plus infinity; otherwise the loss is going to go to zero.

I'm going to go very quickly through this part, because you're already familiar with it and, again, we're going to be going to the notebook. So, my capital theta is going to be the collection of all the parameters: all my w's and biases are in here; w stands for weight, b stands for bias, and capital theta stands for the collection of parameters. I just change variable: instead of having this curly L, which is a function of the output of the network, I have a capital J, which is a function of theta, but they are the same function, and this is a positive function. So let's assume it looks like this: this is my J of a single scalar parameter. I have my initial value, theta-zero; I compute the value over here, my J at theta-zero; and I see what the derivative is. The derivative is positive, which means I have to go... so this is my positive derivative at the location theta equals theta-zero, and it is a number, right: all this green thing with the purple and the orange is just one number, a positive number. And so, given that I have a positive number, I know I need to go to the left-hand side, and so I will change theta-zero by a negative amount: I have to put a minus in front of this positive number to go to the left-hand side. And this is my step size, right, my coefficient: I step towards the negative direction of this thing here, because I want to go to the left, given that this value is positive, so I need to put a minus in front. And this is my gradient descent, right? So how do we compute these partial derivatives? Well, I just use my chain rule here, and the same for the other parameter: this one was with respect to W_y, this one is with respect to W_h, and so here I'm going to have, you know, just the chain rule again. The chain rule, right: back-propagation. But you already answered correctly to that question at the beginning of the class, so I'm positive you've understood so far.
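A minimal sketch of that update rule, on a made-up one-parameter cost J(theta) = (theta - 3)^2, just to see the two pieces separately: backward() fills in the derivative (that's back-prop), and the subtraction is the gradient descent step.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)   # a single scalar parameter
eta = 0.1                                       # step size

for _ in range(50):
    J = (theta - 3) ** 2            # a positive cost, minimised at theta = 3
    J.backward()                    # back-prop: computes dJ/dtheta and stores it in theta.grad
    with torch.no_grad():
        theta -= eta * theta.grad   # gradient descent: step against the derivative
    theta.grad.zero_()              # clear the accumulated gradient before the next iteration

print(theta.item())                 # roughly 3
```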
In the last five minutes we're going to be looking at PyTorch, okay? If there are questions, we'll take them offline on Campuswire, or after we're done with the PyTorch part.

Okay, so, boom. Again, this one goes away from my face; go here. All right. So we go to the GitHub repository, pytorch-Deep-Learning, then you do conda activate pdl, right, PyTorch Deep Learning, and then, I show you, I have this one: this is my Jupyter notebook, okay? And it's opening here, one moment. There we go. So I'm going to be opening this spiral classification notebook. Let's go full screen; let me hide the bar (I don't know what happened there); I will choose the pytorch-deep-learning kernel. And so we're going through the code. I still see you here, okay, awesome. Let me zoom in a little bit; maybe it's fine now.

All right. So I import random stuff, I import some drawing libraries (you need to clone the whole thing; there are instructions on the repository). Here I set some default settings for plotting, and then here I choose a device; we already covered the device in the other lab. Here I just draw my spirals, okay? There is not much we should spend time on there: this is my spiral, okay.

So in this first lab you're going to be learning how to... no, we just apply the things I just showed you in the slides. So we start first by training something that is made of two linear layers. In this case here, my model is going to be two linear layers: the first one maps from, it should be, n to d (so maybe someone should fix this thing, because it's bothering me), that is, from my input x, which has n dimensions, to d, the dimension of the hidden layer; then we go from d to capital K, right? So this is my neural net; well, it's a linear model, it's not too nice, but again, that's what we have. So here I define my criterion, and again, if you want to know how this cross-entropy works, it is simply the minus log of the output at the correct location, which I showed you before. You can click inside here and do Shift-Tab: if you do Shift-Tab, you're going to be opening the documentation, and you can pin it in order to see the full description here, right? Okay. All right, so we pick this loss, and we pick an optimiser, which is simply doing stochastic gradient descent, which means applying, you know, the negative of the gradient.
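For reference, a rough sketch of what those cells amount to (the dimensions follow the spiral example; the learning rate is a placeholder of mine, not necessarily the notebook's value):

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
n, d, K = 2, 100, 3                            # input dim, hidden dim, number of classes

model = nn.Sequential(                         # the purely linear model we try first
    nn.Linear(n, d),
    nn.Linear(d, K),
).to(device)

criterion = nn.CrossEntropyLoss()              # the minus log of the soft argmax output at the correct class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # plain stochastic gradient descent
```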
And so we're going to be looking at how to train this stuff, right? So how does the training work? First step: we predict the outcome, so we feed the input to the model; this is the feed-forward pass, step number one: you have to feed forward the input. Step number two: compute the loss. The loss is going to be simply that minus-log business, which is my criterion over here, and so here is step number two: I provide my prediction and the label to my criterion to get the loss. (Here I also just compute some accuracy; we don't care.) Step number three is this zero_grad. So what does zero_grad do? In PyTorch we never compute the gradients from scratch; we always accumulate onto whatever we had before. This is very convenient whenever we use more complicated architectures and whatnot: so instead of just, you know, erasing everything there was before, if there was something before, I just accumulate something on top of it. And since we just want to compute the gradients in this case, I have to zero out whatever was stored before. This was step number three. Step number four: we compute the back-prop, which is computing the partial derivatives of the loss with respect to the parameters; so this is my fourth step. And then the fifth step (so there are five steps) is going to be stepping: this is going to be jumping in the opposite direction of the gradient. The gradient says "oh, this is the direction of maximum increase of my function"; well, step the other way, okay?

So what are the five steps? Feed forward: you have to feed the network with the input. The second point is going to be: compute the loss, with this given specific criterion, like cross-entropy in this case. The third point is going to be: clear up the gradients, because PyTorch keeps a buffer of the previous gradients (the gradients are the partial derivatives of the loss with respect to the weights, okay?). The fourth point is going to be the computation of these partial derivatives; backward actually does the accumulation, so if you want to just compute them, you need to clear up this stuff before. And the fifth step is going to be stepping in the opposite direction.

From the chat: "Is backward just back-propagation?" Backward is back-propagation plus accumulation, okay? So if you just want backward, the fresh gradients, you need to clear up first: you need these two things together in order to compute the partial derivatives. If you don't have this line over here, then you will accumulate onto whatever there was before, and in this case you just want to get new gradients, regardless of whatever was stored in advance. "So when we call model(x), it actually calls model.forward?" It does also call model.forward, okay, but it does other things as well, and we didn't talk about forward so far: this is simply how you feed a model with the input, given that you define the model in this way; there is no forward method defined here yet.

Cool. So, all right, we have the losses. Okay: what is the original, the first, value of the loss here? Now it's 0.86 at the end of training; question for you, let me know next time, what was the first value? Second question: what is 0.5? How is this model doing? Is it going well, is it doing badly, is it random? Why is it bad? Not great. Is it at chance level? Okay, why would it be chance? Okay... oh wait, hold on, two classes? How many classes are we training on? Okay, there are three classes, so 0.5 is not too bad, right? Ah, yeah, all right, cool. Anyway, it's not too good either. And here I'm just showing you the model: this is the first thing I showed you before, in the slides, right? So this is my linear model; okay, not too fancy. Let me un-zoom a little, okay. So this is my linear model, and a linear model just has linear decision boundaries, right? And that's what I showed you at the beginning; but then there are these issues, right? There are these intersections, so one line won't cut it.
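Putting the five steps together, the training cell amounts to roughly the following sketch (using the spiral data X, c and the model, criterion and optimizer sketched above; the epoch count is illustrative):

```python
X, c = X.to(device), c.to(device)   # same device as the model

for epoch in range(1000):
    c_pred = model(X)               # 1. feed-forward pass: one row of logits per point
    loss = criterion(c_pred, c)     # 2. compute the loss (cross-entropy on the logits)
    optimizer.zero_grad()           # 3. clear up the accumulated gradients
    loss.backward()                 # 4. back-prop: compute the partial derivatives and accumulate them
    optimizer.step()                # 5. step in the opposite direction of the gradient
```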
Enter deep neural networks: we fix every problem, okay? So let me zoom back in; I can't see anything otherwise. All right, cool. So what did I change here? Let me just execute this. First, what I changed here is one line: I just added this positive part after the first layer, right? So again, this stuff goes from n dimensions to d dimensions of the hidden layer, then I zero out everything that is negative, and then we go from d to capital K. So the only difference between these two cells is the positive part added, with the ReLU. So this is a three-layer network: I have an input, I have a hidden, and then I have an output layer, okay? And there are two nn modules here: the first module maps the input, the first layer, to the hidden layer of dimension d, and then there is this other one, which maps the hidden layer to the output layer. So I have input, hidden, output: count three layers, and there are two modules, right? So the only difference was that I added one line. And are you ready to look at the performance? Yes, no? Okay, I'm already late, so I should not waste your time. Yes, okay, there you go: accuracy 0.949, okay. And let's print the model: okay, you have a tada too. There you go. Okay.

From the chat: "Does PyTorch do the one-hot encoding by itself?" Yeah, you can check the code, right? You can also change from SGD to Adam... oh yes, I think that changed things, so someone can send a pull request, please; someone fix this stuff. Nevertheless, let's use Adam for both networks, just to make sure. Okay, so this one is still at 0.5... oh, because I think we had to specify it. Okay, so now both of them use the same optimiser. I believe I should have picked a better learning rate; I just picked the default learning rate here for SGD, okay? Now that both of them use the same optimiser, you still see that one works and the other one does straight lines. Okay. Good point, Randy.
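The one-line change being described could be sketched like this, mirroring the linear model from before (and a fresh optimiser over the new parameters would be needed as well):

```python
model = nn.Sequential(
    nn.Linear(n, d),   # from the 2-d input up to the hidden layer
    nn.ReLU(),         # the added line: the positive part, zeroing out everything negative
    nn.Linear(d, K),   # from the hidden layer to the K classes
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # re-created for the new parameters
```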
In the last thirty seconds, then I'll let you go, I'll show you instead this regression notebook, just to, you know, show everything: PyTorch Deep Learning. So here I do exactly the same thing: I apply the five steps. Once again, what are the five steps? Can you remind me in the chat? Five steps: forward; loss computation; clear up the gradients; the fourth is going to be back-prop; the fifth is stepping. Okay.

Okay, so in this case I'm going to be trying to do regression, and I try to learn this function over here. I start by using just a linear model, and of course a linear model is just going to draw a line through these points. But then let's now use a few networks: one network with the tanh, the hyperbolic tangent, and one with the ReLU. So these are my initial networks: before training, the networks are giving me basically horizontal results, and they don't agree, of course, because they haven't seen the data, right? Then we fit these curves to our own curve and let's see how they look. So, there we go: on the left-hand side I use the ReLU, and you can see this is basically a piecewise-linear interpolation, whereas here, with the hyperbolic tangent, things are much smoother, okay?

Something interesting is going to be trying a different zooming factor here: let's go four times further away. And so what happens is the following: in this case the ReLU network just goes straight, linear, because, with what we've been using, the easiest thing to do is just keep going straight ahead, and you can see how the variance here increases, though not too much. Instead, in this other case I use a hyperbolic tangent, and you can see that as you go away from my training data you're going to be re-observing this kind of, you know, sigmoid shape; and here the variance increases much more, in a faster way. So the main point here is that if you train several networks, then by observing the standard deviation, the variance, between their multiple predictions, you can tell how far you are from the training manifold, okay? So this is actually a very, very important part: usually, whenever you train these regressors, you don't know exactly how well or how badly you're doing, but if you have multiple networks trained on that data, you can always measure the level of agreement, or disagreement, in order to estimate how far you are from that region, okay? And again, different non-linear activation functions will show different kinds of behaviour at the fringes, outside the training region.

I took ten minutes too long. Well, all right, I think that was it, right? If there are more questions, please let us know on Campuswire; we'll take any question there. Thank you for being patient with me. See you next week. Unless there are imminent questions... there are no questions, right? It's too late anyway. All right, thank you again for sticking around, and again, there are recordings for the ones who left. Bye-bye.