Welcome to the sixth class of this deep learning fall 2022 edition, 5 pm, almost live, New York City. Thank you for tuning in today as well for a very dense two-hour lesson, and tomorrow we have the third one, about new content that no one has seen before, so it should be correct. I mean, I showed it to a few students of mine today and they said it was okay, a little bit heavy; that's why I wrote that announcement. Don't get too scared, we'll get through it together. You are basically the first to see it — well, apart from my followers, yes: two students have seen this before, just for practice. Anyway, the point is that it might be a little harder than usual, so I will try to go as slow as I can, but you have to help me figure out which are the hardest parts. Still, we should be able to do it. Just trust yourself, be hungry, intellectually speaking, and let's give it a shot.

Before that, some small announcements. No slides shared, no: slides come afterwards, because I want you to pay attention to how I do my slide striptease, where elements appear one after the other. If you already have the slides, then I cannot withhold information from you, temporally speaking, and I can't play language modelling with you: if you don't see the future, then I can query your mind for a prediction before you see the ground truth, and that's how I teach. So I can't give you the content before the class, because otherwise I cannot teach.

Two things I think are nice. This one is from a colleague of mine from Sapienza, the university in Rome. He has made an amazing piece of work, I really have to say, where he explains how automatic differentiation works: "Gather around, Twitter folks, it's time for our beloved Alice's adventures in a differentiable wonderland, our magical tour of autodiff and backpropagation." I think it's just amazing, and you can go through this thread; I believe crazy people are here to make our life more interesting. Simone is just amazing. The other announcement is about this book. It's a paid book, so you don't have to buy it — you can always, you know, find it, let's say it this way, but I shouldn't be saying these things. The overall point is that the book from Daniel Godoy has very nice diagrams, which is something I really support, explaining things in a similar way to how I do, with these flow diagrams, these schematics, these circuits, whatever you want to call them. At least for me these are much more intuitive than the mathematics; mathematics is not always the best abstraction to explain connectivity. You have wires to explain connectivity, that's it.

All right, so we start — let's even start with a bonus. I'll recap what we saw last time, such that we can refresh the ideas, then try to progress from where we left off, and then let's see how to stay on top of it. This is the same thing we saw last week. The major problem here was the overlapping of the linear decision boundary with the warped, or tangled, manifold — that's the correct word. How do we avoid this overlapping? Either you bend those decision boundaries — I
don't like that perspective; I prefer to unwrap the data, and that's what I showed you. That was a silly animation, but then I showed you all those nice drawings and fancy things and tried to give you some intuition about what we are trying to do. The other perspective is when you watch the decision boundaries, which are still linear, from the input of the model, at the bottom: your linear decision boundary, seen from the input, gets warped by the network. It's exactly the same thing, just two perspectives.

Then we talked about the data: the pink, the blue, and the orange. The data is split in two parts; one side is viewable, the other is not. The viewable ones are shown with shaded colour: they are observed. You have the x, the y, and the z. The z and the x are optional, and then you always have the y. The y is what we care about, the thing we want to learn: the y is the target, the objective. You always have a target; if there's no target, there's nothing to learn. So we always think about the blue-ball y as the target, the thing we are interested in, regardless of whether there are conditions like x or unforeseen events like z. I have capital P distinct items; here I say that y^p belongs to the columns of the identity matrix, so if y is one of the columns of the identity matrix, it's going to be a one-hot — exactly this. So we have P of these targets, each of capital K elements.

Then we saw neural-network inference. We had a pink bold x at the bottom; it goes through the predictor in order to predict the hidden representation of my target. That's why it's called a predictor: it predicts the hidden of the target. If I take the hidden and I decode it, I get my prediction — that's just semantics: the predictor predicts the hidden of the target, and I get the target, well, an approximation of the target, from the decoder. The fact that it is an approximation is shown by the tilde, which means "more or less". Then we said we would like that approximation to be close, in distance, in value, to my target, and therefore I put a spring there, represented by the cost; for the spring, remember, we talked about Hooke's law: you compute the integral of the force, i.e. the work, and you get exactly the MSE.

Then we had the hidden layer and the output prediction; we said f and g are arbitrary non-linear functions, and the prediction is a function of the input, of course, but we'd like to think of it as going through this intermediate, high-dimensional hidden representation. h stands for hidden. We don't talk about "latent" until later in the semester; so far h stands for hidden, or internal, representation — these two terms mean the same thing, while latent means something else and we don't care about it for the moment.

And then there is the first difference from the other lessons: we introduce this F. What is F? F is the level of incompatibility between the inputs of my energy system. My energy-based model has two inputs right now, x and y, and the scalar value F tells me how incompatible this specific pair is — that's just a definition. How do I compute this level of incompatibility? The level of incompatibility of the two inputs is going to be equal to the cost I pay to produce a prediction that is far from my target.
So C is the cost you pay for making a prediction far from the target: you measure the distance, or divergence, or whatever you want to call it — that's the cost C. F is the incompatibility level of the inputs. This is just a definition, but they are two distinct things. So far we're just recapping, no new content.

Then we defined this ỹ as the soft argmax. People call it other things, but that's incorrect, and I didn't explain yet what it is; I just defined it as the exponential of the vector — this s is the linear sum, the output of the linear module — so I take this linear thing, I exponentiate it, and I divide it by the sum of all the exponentials. That's what's written here, just a definition.

Then we said that the loss functional, the curly L, tells me how bad a set of weights w is for that specific curly S, which represents, for example, the training set. So the loss tells me how bad a parameterization is for a specific set, and it is given by this average: one over P times the sum over all P samples of this capital L, the per-sample loss. The per-sample loss tells you how bad a specific parameterization is for a specific pair of inputs. Keep the definitions distinct: one talks about the whole set — you average all of them — and the other one is per sample, how bad the parameters are for that specific pair.

Now, the thing that basically simplifies everything in this classification setting: we set the per-sample loss — which tells me how bad a specific parameterization is for my given pair — we set it (that arrow with the equals means we set it; it's a choice, not an equality) to be the energy F, the level of incompatibility of that pair, which in turn is equal to the cost I pay for having a prediction that is far from my target. There are many steps here, but they are all the same thing, more or less: the value is the same, the meaning of each is completely different. If you didn't catch it, just listen again — it's recorded — and make sure you understand the meaning of each symbol, because they are completely different things.

Moving on, we choose the C to be the cross entropy, or negative log probability, which is the negative log of the inner product between my soft argmax and the one-hot, and then we said, blah blah blah, this goes to zero if you get it right and to plus infinity if you get it wrong. "Does the equality with the arrow on top mean assignment here?" Yes, assignment — not as in your homework assignment, but as in "I choose". I may choose many losses; in this case I choose something called the energy loss. The energy loss basically says that the loss is chosen to be equal — set equal — to F. It's like having that arrow when you write the LaTeX; you understand: you choose, it's not an equality.

So, how do we train this stuff? We have w, the set of all the parameters, and we have this loss, which is how bad a parameterization w is for my set — the training set, perhaps — and we saw this descent picture in the centre.
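To make the recap concrete, here is a minimal sketch of these two definitions in PyTorch — the soft argmax and the cross-entropy cost as an inner product with the one-hot. The numbers are made up, not from the slides:

```python
import torch

s = torch.tensor([1.5, -0.3, 0.2, -2.1, 0.7])   # made-up linear output s, K = 5 classes
y = torch.tensor([1., 0., 0., 0., 0.])           # one-hot target: class 1 is the correct one

y_tilde = torch.exp(s) / torch.exp(s).sum()      # the soft argmax (same as torch.softmax(s, dim=0))
C = -torch.log(y @ y_tilde)                      # cross-entropy: negative log of the inner product

# With the energy loss, per-sample loss = F = C; the loss over a set S is just the average of these.
print(y_tilde, C)
```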
We start from an initial location and compute the derivative: where do we go? If it's positive, you want to move to the left-hand side; you move down until you hit the minimum. And then I just wrote down how to compute these partial derivatives: you just multiply this bunch of things together, and this is called the chain rule in mathematics, or backpropagation in deep learning. But I didn't explain how it works; today I will. That's why I went through this stuff again.

Before that, let's have a look, since it's not very straightforward, at what this free energy is. Remember the diagram I showed you a few minutes ago: my energy F, the level of incompatibility of x and y, we now choose to be equal to the cost function C, and C was defined as the cross entropy, which is equal to this negative log of the inner product. Good.

So let's now see how this free energy, which is a function of the inputs x and y, looks. I trained a model — I didn't show you the training; maybe we do training tomorrow. These are my original points; in this case I have five branches because it looks prettier — you get the whole rainbow instead of just a few colours. I trained a model, these are the decision boundaries — the linear decision boundaries at the top, seen from the bottom, i.e. from the input of the model — and this is going to be one energy, one of these cross-entropy energies. What do I show you here? For every location x of the plane, I show you its energy level given that I chose class number one, the class associated with the red branch, the red spiral arm, and I show you how the free energy changes across the whole x domain. x moves in a plane, whereas the y's are countable: I have five possible y's, and for each of these possible choices of y I have one full R² plane. That makes sense: x moves in a plane and y is discrete, so you basically have a function which gives you five planes, one per class, and here I show you all five different choices — first, second, third, fourth, and fifth class — and then it keeps repeating.

As you can tell, all points in the purple region have zero cost — the colour map on the side tells you they have zero height. What does that mean? What does the energy tell us? The compatibility. So all those points are compatible with that specific choice of y. Given that I pick a y, I can now see which x's are compatible with that specific choice of y. All these points here will also have zero energy — they are very compatible — and things get less and less compatible as you move away from this area, and you get incompatibility levels of 21, 25, and whatever, 233, and so on. That's how this stuff looks. How come some classes are more or less incompatible with a given region? Because of how close they are: this is a smooth function, so the network will try to learn some sort of smooth energy, which of course depends on the distance.
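The five energy planes in that animation are nothing more than this cross-entropy evaluated on a grid of x values for each fixed class. A rough sketch of how one such plane could be computed, assuming some trained `model` that maps 2-D inputs to K linear outputs (the name `model`, the grid limits, and the resolution are all made up here):

```python
import torch

def free_energy_plane(model, k, lim=3.0, steps=100):
    """Cross-entropy energy F(x, y = k) for every x on a 2-D grid: one 'plane' per class k."""
    xs = torch.linspace(-lim, lim, steps)
    grid = torch.cartesian_prod(xs, xs)              # all (steps * steps) points x of the plane
    with torch.no_grad():
        s = model(grid)                              # linear outputs, shape (steps * steps, K)
        F = -torch.log_softmax(s, dim=1)[:, k]       # energy of class k at every location x
    return F.reshape(steps, steps)                   # height map, one value per grid point
```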
Here I show you the first energy, the energy associated with the first class, as level curves, and I make it spin so that you can see it clearly despite the artifact of lines disappearing. Again, this is the energy associated with the first class for all the x points, and the height shows you the level of incompatibility of a specific x given one choice of y.

Finally, how do we compute this backpropagation? Let's go through it once more, just to make sure we all understand — and that I am able to explain — how exactly backpropagation works. I mean, I think we know, but why not have another try? This is almost the same diagram as before, with a difference in one letter: there is an o on the top left-hand side instead of the ỹ, and there is no capital C — there is a capital D here. So there are two differences from before; I'll tell you in a second what they are. h is exactly the same: the input goes through the non-linear function f. Then let's split the decoder in two pieces. I have this orange bold s, the linear sum, the output of this affine transformation of the hidden layer — so I take the hidden value, apply an affine transformation, which is written down there, and call the result my orange bold s, for linear sum. Then I get an output, a violet bold o, as this g non-linear function applied to this linear sum s. g is now slightly different from before: it's this log soft argmax. That's why we have to use a different term: instead of the C we use this D, and D is just the negative inner product of the two. Remember, C was the negative log of the inner product; what happened here is simply that I moved the logarithm back into the o. So o is simply the logarithm of ỹ, and the two things are exactly the same: D(y, o) is the same as C(y, ỹ).

Question: "shouldn't it be W_x h plus b_x instead of…?" The way I see it, this W is the matrix for the hidden layer and this is the bias for the hidden layer; that's why I put an h there. Is this a question about s, Krishna? Yeah, about s: s is the affine transformation of the hidden layer, so I have my W_y — sized for the y — times h, plus b_y. b_y is going to be five-dimensional, and W_y is going to be five by however many dimensions h has. "Why is there an s in the picture?" I'll tell you in a second — hold your horses. I'll show you where the s gets used; there is no s yet, it will come in one second.

So, what we are interested in, just to start with, is the partial derivative of my loss with respect to, for example, this first weight matrix: the variation of the loss with respect to this parameter. What is the loss? The loss tells me the badness of a specific configuration of parameters for that specific sample. So, the variation of the loss with respect to variations of this parameter; and this — I just copy and paste what I wrote before — is just the chain rule. We're going to be going through
this in a second. What was the question? "Why is the…" — so, this is my identity: I wrote here that D is simply this thing; I just put the logarithm inside. I compute the log of ỹ and take the y outside: the log of ỹ is just o. I put the two things together, so I take the log of the vector and then multiply the log of the vector by the one-hot. Those are exactly equivalent; nothing changes. Why did I do that? Because in the code later on we are going to use this log soft argmax. Why do we use the log soft argmax? We saw that the soft argmax has that exponential, and you never take an exponential and then a log one after the other, because you're going to have numerical issues; the log and the exponential simplify, cancel out, and you don't have exponentials blowing up into numerical problems. That's how things actually work in practice, and that's why later on in the notebook I show you how it works implementation-wise: in the implementation we use the log soft argmax, while when we talk about mathematics we prefer to talk about the soft argmax, which I haven't yet fully explained. Does my explanation make sense so far?

"I'm still missing how these two things go together." What happens if you multiply a one-hot with a vector, Jack? You get one value; then you apply the log, you get that single value, then you apply a negative sign. And here, what happened? I compute the log of all of them and then I simply extract that single value. No, it's not generic; I'm just showing you that in this case I can either pick the scalar element before and then compute the log, or compute the log of the vector and pick the scalar value afterwards — I can do the indexing before or after; in this context both work. Let's move on, otherwise we don't get anywhere.

So what is this W? W is the collection of all the weights. What comes next? We are interested in this — a question that maybe didn't have a meaning to me before: even when Yann explains it, I see the equation and I don't understand. So today I'll try to make you understand what this line means. Let's clean up the f thing on the left-hand side, remove the whole network there, and also remove that other thing, and now we replace that box with the partial of D with respect to o. So we are now computing the degree of variation of D with respect to o, which — since D equals the per-sample loss — is exactly the same as the partial of the loss with respect to o; those two are identical. First question for the people at home: how much is the partial of D with respect to o? This is D, D equals this thing here; if you take the derivative with respect to o, what do you get? Correct. So out of this long expression we already know the first term: the first term that comes back from the loss function, in order to tune the parameters of the model, is simply the target with a flipped sign. First concept — and this works in the case of classification.
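Here is a small sketch of why the notebook uses the fused log soft argmax instead of an exponential followed by a log; the linear outputs below are made deliberately extreme to show the breakage:

```python
import torch

s = torch.tensor([100., 0., -100.])        # deliberately extreme linear outputs
y = torch.tensor([0., 1., 0.])             # one-hot target

# Naive route: soft argmax first, then the log.
y_tilde = torch.exp(s) / torch.exp(s).sum()   # exp(100) overflows float32 to inf -> division gives nan
C_naive = -torch.log(y @ y_tilde)             # nan: numerically broken

# Fused route: o = log soft argmax(s), then D(y, o) = -y.o; the log and exp cancel analytically.
o = torch.log_softmax(s, dim=0)
D = -(y @ o)                                  # finite (= 100 here), the true cross-entropy value
print(C_naive, D)
```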
Let's move on to the second item: it is the Jacobian of this log soft argmax with respect to its input. Do we know what it is? We will learn about that towards the end of the class; for now we just know we have to compute this big thing, but let's ignore it for a moment. Moreover, we said we split the decoder in two parts; people asked why I did that: because we're going to go step by step and figure out what each item in this long equation actually means. So the decoder is now split into sub-modules. The first part is the affine transformation — I call it a — and the output of this affine transformation, which is a function, is this s, the linear sum. Then the linear sum goes through — what does it go through? — the log soft argmax, which is just called g, so we don't have to bother with the long word. We split the decoder into two sub-modules with an intermediate value in the middle, such that I can write it down there for bookkeeping. So far no magic; it's all clear, no one is complaining, so I keep going.

What next? Let's clean up the screen, make some room, move that up. We said we would like to understand what this long piece of equation means. We already know the first item in this series of multiplications: the negative target (the negative transposed target, but we don't care). The second one is the Jacobian of the log soft argmax, which gives me — what's the word — shivers, thank you; chills down my back. It's weird, and we don't want to deal with it, so let's pretend we don't care for the moment.

So we start with the forward pass. The forward pass connects the input on the left-hand side, which goes through this g non-linear function until I get an output. These are written in a monospace font because this is computation, how these things actually run. On the other side, next to the o, what do I have in the diagram? I have the partial of the loss with respect to this o, which has the same number of elements as o; when we work with computers, they also have the same shape — we don't take transpositions in code. In mathematics, yes, the Jacobians and the gradients are transposed, they are different objects; in code they are the same size, and we don't get to transpose anything. So we start with this partial of the loss with respect to the output, and we call that the grad output. Then there is this partial of the output with respect to the input — the variation of the output over the variation of the input — which is the Jacobian of this g function. So dg/ds is the same as writing ∂o/∂s, where o is the output value of the g function and g is the actual function itself: two different ways of writing the same thing. Over here it means the Jacobian of a specific function; over here it's this output-over-input sort of notation. The cute thing is that if you multiply these two items, you can see that the ∂o cancels out and you end up with ∂L/∂s — that's the cute part of this notation — and this is called the grad input. So in order to get the grad input, you take the grad output and multiply it by the Jacobian of the specific module, and
then you get this grad input. This grad input is the multiplication of these two terms together. We completely ignore, for the moment, the fact that we don't know how to compute that Jacobian; we just have to multiply these two items, and now we are almost one step away from computing the thing we care about. Let me put a box around it so we know that everything is, let's say, inside this module, and let me also write these two items on the left-hand side for bookkeeping, so we don't forget what we have computed so far.

Let's clean up the screen, and now I replace this multiplication of two items simply with the partial of the loss with respect to the linear sum — I just replace the two. We start anew: we would like to perform this additional multiplication. What is this? In this case, as you can tell, we are looking at this a, the affine transformation, once again. We start from an input, which in this case is the green guy, which goes through the affine transformation — using some weights, in this case W_y, which is what we are interested in — and it produces an output: that was the forward pass. Next to the output, what do we have? How do we call this item, this partial of the loss with respect to s? We call it — type, type, type, no one types anything — the grad output, yes. So we start on the other side with the grad output. What is it going to interact with? With this ∂s/∂W_y. Why? So that the two ∂s's cancel out, and in the multiplication I get the partial of the loss with respect to the parameter, which is exactly what is called the grad weight — and this is actually what we were looking for. We are done: now we basically know how to compute these items numerically, procedurally — we will actually do this in the notebook in a few minutes — because everything is quite straightforward, I believe, by looking at these diagrams. Tell me if it's not straightforward, otherwise I'm going to go cry for the next hour, because I spent a lot of time on this, but I hope everything is clear so far. Let me also draw one more box, and put these two items there for bookkeeping. "How about the log soft argmax part?" That's going to come much later today — totally good point, I just put it under the carpet for the moment. Other questions? I hope everything else is clear; I know I just haven't told you yet how to compute that Jacobian. I believe we are good, because no one is complaining. Very good, happy face, moving on.
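A minimal sketch of this bookkeeping for the affine module, with made-up shapes and values: the grad weight is just the grad output of that module combined (outer product) with the forward input, and autograd agrees:

```python
import torch

torch.manual_seed(0)
h = torch.randn(7)                               # hidden activation, the forward input (made-up size)
W_y = torch.randn(5, 7, requires_grad=True)      # weights of the affine module a
b_y = torch.randn(5, requires_grad=True)

s = W_y @ h + b_y                                # forward pass: the linear sum
loss = s.pow(2).sum()                            # any scalar loss, just to have something to backprop
loss.backward()

grad_s = 2 * s.detach()                          # dloss/ds: the "grad output" reaching this module
grad_W_manual = torch.outer(grad_s, h)           # grad weight = grad output (outer) forward input
print(torch.allclose(grad_W_manual, W_y.grad))   # True: autograd agrees with the bookkeeping
```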
What next? Question: what happened with the bias here? What is this module we are talking about? Type it: the affine transformation, okay. How do we compute the affine transformation? People said W h plus b, and I asked: what is missing? The bias is missing. So what is the diagram missing right now, as a drawing — what shape is missing? An additional input to a. Which shape? A circle, perfect. We are missing a circle containing this additional parameter, which feeds the a. That's the missing part that would make this diagram correct; the diagram is lacking information.

Second question: how about the bias? And I'm reading the chat — if you're not typing anything, I cannot read anything. "That seems like a reasonable thing to ask" — yeah, of course, biases are learnable, right: ∂L/∂s times ∂s/∂b. And what is ∂s/∂b? Okay, one. So, can someone tell me what the grad bias is? Shan is correct: it's going to be — how do we call it, in coding terms — the grad output. That's very good: the grad bias, we just concluded, is equal to the grad output. Therefore, when you're debugging your network and you want to check the health of your training, I go check the grad bias. Why is that? You got it: it's the same as the grad output. But why do I care about the grad output? What is the grad output necessary for? "Finding a proper bias?" No, I don't care about the bias right now. What is the title of this section of the slides? Backpropagation. Backpropagation of what? What is the subject? "Gradient"? Almost — use the specific names we have on the slide. What is backpropagated? The grad output, there you go. So the overall backpropagation tells you how to backpropagate the grad output to earlier layers, and when I want to debug my network training, I will check the bias grad value in order to figure out what signal is coming back from the loss throughout the network, without needing to create hooks and other weird tricks in PyTorch. Without adding any crazy grad-checking code, every network you already have can be probed, even right now, to check what the output gradient is by checking the bias gradient. This is very important: if you need to debug a training run, now you know you can simply check the grad bias to figure out what signal is coming up the net — or coming down; well, the network goes up, right? Are we all good? This is a very powerful thing. "Is this because the bias doesn't have an interesting gradient?" No, no — I can explain it again. The first part, when I say bias, is saying that this module should have another circle with the bias going inside. The second part, the grad bias: we said the grad bias is exactly the grad output, because this item over here, ∂s/∂b, is equal to one, and if you multiply the grad output by one you simply get the grad output. So the grad bias is the grad output. Since I want to monitor the health of my model's training, I would like to check the grad output throughout my model, and to do that I can simply check the grad bias of all my linear layers, to figure out what signal is going to be updating things and generating my grad weight. The grad weight, you can think of it as being computed using the grad bias multiplied by this other thing over here — which we haven't explained yet — but the point is that the grad weight is a function of, it depends on, the grad bias.
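And this is what the debugging trick looks like in practice: after any backward pass you can probe the signal reaching each linear layer by reading its bias gradient, no hooks needed. A minimal sketch with a made-up two-layer network and a random batch:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 5))   # made-up toy net
x, y = torch.randn(8, 2), torch.randint(5, (8,))                        # made-up batch

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# The bias gradient of each linear layer *is* the grad output arriving at that layer
# (accumulated over the batch), so it probes the signal flowing back from the loss.
for name, p in model.named_parameters():
    if name.endswith("bias"):
        print(name, p.grad.norm().item())    # ~0, inf or NaN here means trouble upstream
```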
So checking the grad bias will tell you, one step ahead, if something got corrupted before that level; that's why it's really important — it's a trick you can use for debugging your model, and usually people don't know about it. What is a bad grad output? A zero grad output is bad, because no learning is going to happen; or a NaN grad output, or a plus-infinity grad output. The point is that you want to catch a NaN as early as possible, because after you get a NaN, everything downstream is going to be NaN, NaN, NaN. Okay, I'm crazy, moving on. Just to clarify: the reason we care about the grad output is because we want to propagate it — because we do propagate it. Backpropagation, if you want the full sentence, is the backpropagation of the grad output.

Moving on, what next? Now we care about the partial with respect to W_h. What happened here? Despite things disappearing there, I would like to compute this partial with respect to h — the W_h is down here — so I need to compute this intermediate value. How? I clean up the centre, I get this output-over-input, which is the Jacobian of my affine transformation, I multiply these two, and I get ∂L/∂h, because the two ∂s's cancel. So the multiplication of these two gives the partial of the loss with respect to the hidden, and I can write them down; then you just keep repeating this throughout the whole model. And that was my explanation of backpropagation. Later on we will check the correctness of these things in the notebook, but someone complained that we haven't talked about the log soft argmax, and before that someone was complaining: what on earth is this soft argmax? So, answering this question, let's figure out what this actual softmax, softmin, and broken nomenclature business is all about. So far we are all good, everyone is awake, no one fell asleep. We are starting a new chapter of the lesson — are you ready, do you want to take a sip of water? Just a five-minute break? No, no, there is no break, otherwise we don't finish on time, there is plenty of material. (A few words in Italian.) Okay, let's go on; we took a thirty-second break with my Batman jingle interpretation.

Actual softmax and softmin: what is this stuff? Let's assume I have a vector, red bold e, of capital N elements: e₁, e₂, …, e_N — a column vector. Now I have the following notation: "soft" in square brackets — the square brackets mean it's optional, so it may be soft or not soft — and ⋆⋆ means either max or min. So you have four options: max, min, softmax, and softmin. "Soft" is optional, ⋆⋆ means either max or min, and the coldness β is also optional; forget about that last detail for now. Let's think about the max: I give you capital N items; if I ask you what the max of N distinct things is, what are you going to tell me? Just one value. The max of a vector, of a list, of whatever, is one value. So if you take the max, or the min, or
the softmax, or the softmin of a vector, you should get one single value. That seems to make sense to me; I know other people say something else, I don't care, let's go with this sensible definition. We now define the softmax with coldness β as one over β times the log of the sum of the exponentials of the components of the vector, each multiplied by β. So you take all the components, multiply them by β, take the exponential, sum them, take the log, divide by β, and you get a single value — you summed a bunch of scalars — and we call this value the softmax. The interesting thing is that if you crank up the coldness, you make it super cold, the softmax — which is like a soft-serve ice cream — hardens: what happens to soft ice cream if you put it in the freezer? It hardens. The softmax hardens to a max, and that's why it's called softmax: it's fluffy, it's warm, and if you increase the coldness it hardens to a max, and the max of a vector is just one value, so the softmax of a vector should also be one value.

On the other side we have the softmin, defined as negative one over β times the log of the sum of the exponentials of the negated components of the vector, again multiplied by β. Similarly, if you crank up the coldness coefficient and make it super cold, the softmin hardens to a minimum. It's interesting to notice that the soft minimum can be expressed as the negative soft maximum of the negated values. That seems pretty logical: if you only have the max and I ask you for the minimum of a vector, you take the function, flip it, take the max, and flip it back again, and you get the min. So of course the softmin can be expressed in terms of the softmax.

I also have an alternative version where I use angle brackets around the vector and put an average instead of a sum; these are the averaged versions of the softmax and softmin, and the only difference is the angle brackets and the one over N. The interesting thing is that if you make it super hot, where does this ice cream melt to? The averaged softmin and the averaged softmax both melt to the mean, or average — let's call it the average, because "min" and "mean" sound almost the same. The averaged soft minimum and soft maximum, if you increase the temperature a lot, melt to the average. Let me see the questions: "why is this softmax function different from the one we learned in class?" No — this is the softmax; every other softmax you saw before is wrong. We'll get to that other softmax on the next slide. If you're with me, when I talk about softmax or softmin, these are my softmax and softmin. I know the outside community is doing the wrong thing; I don't care what other people do wrong. In class, for educational purposes, we do things right: if you know what is right, then you can tell right from wrong; if you never learned what's right, you can't tell the difference. Then of course, when you read a paper, you will know that people are just wrong. I'll explain this with the next slide and make sure you understand it.
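A small sketch of these two definitions — the actual softmax and softmin are just coldness-scaled log-sum-exp's — showing the hardening and melting behaviour on an arbitrary test vector:

```python
import torch

e = torch.tensor([1.54, 0.2, -0.6, -1.1, -2.18])    # arbitrary vector of N = 5 values

def softmax(e, beta):   # (1/beta) * log sum exp(beta * e): a soft *maximum*, one scalar
    return torch.logsumexp(beta * e, dim=0) / beta

def softmin(e, beta):   # -softmax of the negated vector: a soft *minimum*
    return -softmax(-e, beta)

for beta in (0.1, 1.0, 100.0):                       # hot -> cold
    print(beta, softmax(e, beta).item(), softmin(e, beta).item())
# As beta -> inf (cold) they harden to max(e) = 1.54 and min(e) = -2.18;
# the *averaged* versions (with a 1/N inside the log) melt to the mean as beta -> 0.
```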
Again, the maximum of a set of numbers is just one number, so the soft maximum is also one number; it cannot be anything else. Anyway, moving on, this is how both these functions look. I have this vector over here, it has five items — they could be, say, my linear sum, the output of the affine — so the maximum value is something like 1.54, shown by this dotted line, and this other dotted line shows the minimum value of the vector, which is −2.18. What type of diagram is this? Describe it by paying attention to the axes. Everyone is giving the wrong answer, so pay more attention to the diagram. I would say the horizontal axis is logarithmic — this is a very funky type of diagram, a symmetric log (symlog): it is logarithmic from one to one thousand, it's linear from one to negative one, and from negative one to negative one thousand it's again logarithmic. You cannot have negative numbers on a plain logarithmic scale, so this is different and important; if you change the scale, nothing shows up well. It's linear around zero, logarithmic upwards, logarithmic downwards; you never have negative numbers on a log scale — you could have 10⁻¹, 10⁻², 10⁻³, but I don't show those — so everything near zero is squashed into a linear region, because otherwise I wouldn't be able to show you negative values on a logarithmic scale: it needs to be linear to be able to cross zero, but if it's not logarithmic further out, I cannot show you the asymptotic trend. So here I show you the asymptotic trend for large positive values, the full range of variation in a linear region, and the asymptotic trend for large negative values. Think about this a bit more later.

Anyway, what we have here: this is the maximum value, and the softmax is always lower-bounded by this max. On the horizontal axis I have the coldness: you increase the coldness, you get frozen; you decrease the coldness, you get warmer and warmer until you sweat. We said that both the averaged softmax and the actual softmax harden to the maximum as I increase the coldness; similarly, the soft minimum and the averaged soft minimum converge to the actual minimum when I freeze them and make them super cold. In the other direction, when I make it super warm, the averaged softmax and the averaged softmin both converge to the average, which is this dotted value here, roughly negative 0.3. Again, I will give you the slides; you can pay more attention, think more about these things, plot it yourself and figure out how it works. That's how these two things behave: both softmax and softmin are approximations of the max and the min respectively if you make them super cold; otherwise they either drift up and down or they converge to the average. "Is there any conventional choice of coldness?" It's a hyperparameter, you can decide to tune it for different reasons; I'll tell you more about that in future lessons.

So then — we haven't yet talked about the soft argmax and soft argmin, finally. Too many people drop the "arg" and call these softmax and softmin; they
cannot possibly be called that, because it would be stealing the name of the other thing. So let's check this: I have a vector e, still N items, and I have this optionally-soft — "soft" in square brackets, meaning it's optional — arg max or min, again possibly with a coldness coefficient. This thing goes from R^N, my N-dimensional vector, to the simplex. What is the simplex? The simplex is just the shape connecting (1, 0, 0), (0, 1, 0), and (0, 0, 1): every point on this simplex is a probability vector — the coordinates of every point on this plane sum to one. It's also called the probability simplex.

So we define the soft argmax as the exponentiation of each component of the vector, divided by the sum of all the exponentiated components. The interesting thing is that if you increase the coldness and make it super cold, this thing converges to an argmax. What is an argmax? It's a one-hot vector where the one occurs in correspondence with the max of the vector: if you take a vector and ask for the argmax, the argmax is a one-hot vector where the one is an indicator showing you where the max is. Now, say you have a vector where the maximum is shared by two values. What is the argmax going to do? Maybe it tells you the earlier index. What is the soft argmax going to do instead? It's going to give one half, one half, exactly: it splits the mass equally across the equal values. So it's a better kind of argmax. Similarly, we can define the soft argmin, which is equal to the soft argmax where I flip the sign of the vector inside; there is no sign flip outside, as there was before. And guess what: if you crank up the coldness of the soft argmin, you get the argmin. "Is the output of the soft argmax still in the same dimensional space as the input?" Well, kind of, yes: it's written here — the input is in R^N and the output is in the box [0, 1]^N — but then this plane, the simplex, cuts the box; the notation is for this chopped version, because you chop off one slice. If you have a cube and you cut the cube with one slice, you get a plane: the simplex always has one dimension less than the dimension of the space in which it lives. It's just notation, we don't care. One more thing: if you drastically increase the temperature and it's super hot, both the soft argmax and the soft argmin converge to the one-over-N uniform vector, which is like a uniform probability distribution across all the items. I hope I have now somehow convinced you that this is a better way of naming these things. Let me show you how it works. Something interesting is also the fact that the soft argmax is the derivative of the softmax, which is also the derivative of the averaged softmax.
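A small sketch of the soft argmax and soft argmin with a coldness coefficient, including the tie-breaking behaviour and a quick autograd check of the claim just made — that the soft argmax is the derivative of the softmax. The test vector is arbitrary:

```python
import torch

e = torch.tensor([1.54, 0.2, -0.6, -1.1, -2.18])       # same arbitrary vector as before

def softargmax(e, beta=1.0):      # exp(beta*e) / sum exp(beta*e): a point on the simplex
    return torch.softmax(beta * e, dim=0)

def softargmin(e, beta=1.0):      # just flip the sign of the argument, nothing else
    return softargmax(-e, beta)

print(softargmax(e, beta=100.0))  # very cold: hardens to the argmax one-hot [1, 0, 0, 0, 0]
print(softargmax(e, beta=0.01))   # very hot:  melts to the uniform vector  [0.2, 0.2, ...]
print(softargmax(torch.tensor([3., 3., 0.]), beta=100.0))   # tied maxima: mass splits 1/2, 1/2

# The soft argmax is also the gradient of the (actual) softmax, i.e. of the scaled log-sum-exp:
beta = 2.0
e_ = e.clone().requires_grad_()
(torch.logsumexp(beta * e_, dim=0) / beta).backward()
print(torch.allclose(e_.grad, softargmax(e, beta)))          # True
```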
Can you tell me why these two things are the same? Maybe you didn't catch it, maybe it's not important. What is the difference between the averaged softmax and the plain softmax? Remember — yes, the one over N. And where was the one over N? Inside the log, so you can split the product into the sum of two logarithms, and when you take the derivative the constant term disappears. Very good. One more thing, just in case you didn't know — now you know: the argmax is also the derivative of the max. So if the max gives you one item, this thing here is the one-hot indicator of where the max came from.

Let's have a look at how this behaves. This is the same vector I showed you before, and this is the soft argmax. If it's super cold — super dark blue — we said it hardens to an argmax, and in fact we can tell we have the one-hot: one, and then zero, zero, zero, zero. If you go to the other extreme, super hot — shown by this sweaty emoji — the red bars are all one over five. Why one fifth? Because there are five items, so you get one fifth, one fifth, one fifth, one fifth, one fifth: the average. If you go in the middle you get something nice, neither one-hot nor uniform, something in between. Very good.

So then we have the next chapter, and then we have the notebook. This is one more step building up the tension, and I'm going to give you something more now; I really hope you can digest it and understand what I'm trying to tell you, because at the end everything will telescope together. It's like dominoes: right now I'm placing all the pieces, and at the end I just go "tick" and they all resolve themselves. It's like a suspended chord in music: you have that feeling of tension, and then you get the final relaxation of reaching the fundamental of your chord.

Moving on: we saw a few alternatives before. We saw the classical model — x going into the predictor, we get the hidden value, we go through the decoder, we have the ỹ that goes into the C — then we saw the other one, which went through the predictor and then some sort of weird decoder where we were getting that o, which already includes the logarithm; and now one more. So we have three distinct perspectives on the same classifier, and they will become very convenient and helpful at the end.

With that said, let's move on: cross entropy. This is a reminder of what we have seen so far. We start with the pink ball x at the bottom left-hand side; this goes through the predictor and then the first affine transformation a, which gives me my s, the linear sum. Then, how were we getting the ỹ — where do I have to send s? I send s through the soft argmax, and I get this ỹ, which is some sort of approximation of my y. The y, as we now understand, is like an argmax, this one-hot thing. So we have this one-hot there, and then we add — what? — a cost term C, in order to get the prediction close to the target. We said
that the C was going to be the negative log of the inner product, and then we have this big box F, which is our energy; the energy, which expresses the level of incompatibility of the given input pair, is equal to the cost I pay for making a prediction that is far from my target. We know all of this. Finally, we choose our loss functional — it acts on a function, so it's a function of a function, therefore it's called a functional — and remember, we chose it to be equal to… type, type, type… yes, the energy, that's correct; we made this choice before.

On the right-hand side, instead, we have this alternative perspective, the negative-linear-output perspective. What is it about? I have my pink ball x at the bottom, I send it through my predictor and the affine transformation, so I get exactly the same linear sum, and then I put the sum inside a box with a capital D. This capital D is my negative linear output, which is the negative inner product between the blue ball y and the vector s, the linear sum — just because it's going to be helpful in a bit.

Moving on, take this for granted: when we have a big box F, what is F? F is the level of incompatibility of my inputs, and we set it equal to — what? — to D, because D is the only term inside the dashed box; there is only one red term inside the dashed box. So F, the level of incompatibility of the inputs, is equal to this D, whatever divergence, blah. What is going to be the loss? Well, we don't know — that's why it's interesting — so let's try to work this out together.

Let's rewind what we have so far. We start with the fact that F is this D of y and x. Then D, for a specific pair, is this negative y-transpose s: it extracts the specific component of s. And I can also write something interesting here: this is the full cost function, a function of all possible values of y, and it is simply the full vector of negative linear outputs. I hope you can tell the difference: this one selects one specific value out of the vector, like we were doing before with that D; here instead I take the whole set of values, the full vector of negative linear outputs.

Now remember how we were computing the loss before: in the previous view, the loss was C, and C was the negative log of y-transpose ỹ. How did we compute ỹ? Log soft arg… no, everything wrong… soft argmax, okay, there we go, thanks. So ỹ was the soft argmax of the linear output. Now I have a negative sign, so if I have a negative sign, what should I use instead? Instead of computing the soft argmax of s, now that I have this negative thing, I will have to use the soft argmin. Very good. So, typing out what we just said: my loss functional — because it's a function of a function — is going to be this: the negative one over β (let's not worry about the β) times the log of y multiplied by the soft argmin of this F; and the soft argmin of F is exactly the soft argmax of s, if
you just replace it inside, you automatically get it, where the soft argmin is the exponential of the negative argument divided by the sum of those exponentials — we know that. So how about we compute this logarithm: we take the logarithm of an exponential, so the exponent jumps out, and then this negative β cancels with that one. What do we get? When I compute the negative one over β of that, I get F. And then what do I get? There's a division, so I have the log of a division, so I get a subtraction; but there is a negative sign in front, so I get an addition. Very good. So I get this: my loss is the energy at the correct (x, y) pair plus one over β times the log of the sum of the exponentials of the negated energies.

Let's write this on the previous slide. Before, the loss on the left-hand side was the energy; on the right-hand side, what should I write instead? The loss is the energy at the correct site plus that summation: I have this F which came out, plus this term over here. Now it's very, very interesting: how do we train a system? By minimization of the loss — forget about β for now, β equals one. If I minimize this thing, I try to minimize this first term, so I push down here; but since there is a minus inside that other term, minimizing it means pulling up on these other energies. I push down on the correct one and pull up on the others. This is so interesting. "Does ỹ — sorry, y′ — run over all categories?" Yes, y′ is all possible categories. So this is the first time you see something that, I'll tell you in advance, is called a contrastive loss: you push down on the correct guy, but then you pull up everywhere else, on every other possible value. We don't know yet how it works, but we can already smell something funny.

Let's go back here — yeah, we still have to get to the notebook. So what is this right-hand-side term, what is it called? Almost… there you go, Calculator is correct: that is the negative softmin, because you have negative one over β times the log of the sum of the exponentials of negative β times the energies. So, if I clean up the screen and start here: the loss is equal to this scalar value, the energy at the correct class, minus the minimum value the energy takes — well, the soft minimum value the energy takes. This first part is a scalar; this other part is, let's say, a vector, but then I compute a (soft) minimum of it, so I get a scalar. If it were just a hard minimum, this would actually be called the perceptron loss, but we don't care. Anyway, my loss is the difference between my specific correct value and the minimum — or soft minimum — value the energy takes. We don't quite know what that means yet, but we might understand it later.

Now I'm interested in computing the partial derivative of this equation with respect to the correct case. How much is it, can you tell me? One, for the first term, okay; and what is this other thing here? The softmin, right, that was the negative β… yeah, you're correct: one minus the soft argmin. You get the exponential divided by the sum of the exponentials — the one-over comes from the log, so everything goes underneath — and then you multiply by the correct one, so you have the correct one divided by all of them, as you said before, and that is the soft argmin.
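With β = 1, the decomposition we just derived — the energy at the correct pair plus the log-sum-exp of the negated energies — is exactly the usual cross-entropy. A quick numeric check with made-up values:

```python
import torch

s = torch.tensor([1.5, -0.3, 0.2, -2.1, 0.7])         # made-up linear output, K = 5
y = 0                                                  # index of the correct class
F = -s                                                 # free energies: one value per class

loss_ce = -torch.log_softmax(s, dim=0)[y]              # the cross-entropy loss (beta = 1)
loss_decomposed = F[y] + torch.logsumexp(-F, dim=0)    # push-down term + pull-up term

print(torch.allclose(loss_ce, loss_decomposed))        # True
```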
So I just use this symbol over here to represent the probability that the model assigns to the correct class, given the specific input x, parameterized by the weights w, and also given this temperature — coldness — coefficient. Final question: how much is this — the soft argmin of the negative linear output, or equivalently the soft argmax of the linear output? What did we call it at the beginning of the lesson, an hour and a half ago, when we did the forward pass of the model? Remember, g is the function — actually, I called two distinct things g, I should change that — no, o was the log soft argmax, right. So what was the soft argmax? "y hat"? Almost — we don't have ŷ, we haven't defined it. What do we have? In this class it's called… fix that notation… ỹ, yeah, you got it, there we go. So this is going to be one minus blue-ball-y transpose ỹ. You see? In a second your brain is going to go — no, not yet, don't go crazy yet.

Let's define ŷ now: ŷ is the contrastive example, which means all the y's except the correct case, all the classes minus the correct one. So what is the gradient of this expression with respect to an incorrect class? We can actually do it, I can show you. "ŷ transpose ỹ" — that's correct, yes, and "ŷᵀ ỹ", Martin is also correct too. Young is not correct, because the first term is a constant with respect to the incorrect class: this one is the correct class because it's blue — the blue is the correct guy, the red is the hat guy, the one with the hat; I haven't told you about the colours yet, the colours will come in a bit. Anyway, let's move forward: you have zero minus this thing over here, and this thing over here is simply the extraction of the probability — this is my one-hot extracting the incorrect probability we care about, so I have one of these guys. How many correct classes do I have for a point in this classification? One. How many incorrect classes do I have in this five-way classification? Four. And how do you call a vector that is one for one class and zero for the others? A one-hot, okay — and argmax is also correct. And what symbol should I use to represent the correct one? y, right, the blue-ball y. So if I put these two things together and ask for the gradient with respect to the full energy, it is going to be the difference between the target and the prediction, which is the same gradient you get when you differentiate the MSE. You see? Yes, I can repeat. When you have the MSE, one half of the squared norm of y minus ỹ, and you take the derivative of that with respect to ỹ, you get just the difference between the target and the prediction. Now I'm showing you that, in the classification case, the grad output with respect to the negative linear output is exactly the difference between my target and the prediction ỹ — exactly as with MSE in regression.
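A quick autograd check of this claim, with made-up numbers: the gradient of the per-sample cross-entropy with respect to the linear output s is ỹ − y, so with respect to the negative linear output it is y − ỹ — the same target-versus-prediction difference as in the MSE case:

```python
import torch

s = torch.tensor([1.5, -0.3, 0.2, -2.1, 0.7], requires_grad=True)   # made-up linear output
y = torch.tensor([1., 0., 0., 0., 0.])                               # one-hot target

loss = -(y @ torch.log_softmax(s, dim=0))     # per-sample cross-entropy loss
loss.backward()

y_tilde = torch.softmax(s.detach(), dim=0)
print(torch.allclose(s.grad, y_tilde - y))    # True: grad w.r.t. s is y_tilde - y,
                                              # so w.r.t. the *negative* linear output it is y - y_tilde
```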
Now I show you that, in the classification case, the grad output with respect to the negative linear output is going to be exactly the difference between my target and the prediction, the y tilde I make, exactly as in the MSE regression. So I'm showing you right now that the grad output, the signal we use to train the model, is the same signal for classification as the one we use for regression. Ah, okay.

All right, let me speed up a bit. What is the loss here, how do we call this loss, if you just say it in one word, skipping a few symbols? Just put together the English and skip the mathematics. It is the same thing I showed you an hour and a half, two hours ago: the log soft argmax, the g function I showed you before and swept under the carpet. Someone asked earlier, how do I compute the derivative of the log soft argmax? I just showed you. And it was the log soft argmax of s, which is the same as the log soft argmin of minus s, because you just flip the sign inside. I don't remember who asked the question, but this is the answer to it. I hope you're happy; I am. Anyway, moving forward, otherwise we won't finish.

Can you explain how you put the two derivatives together in the last step? Yeah, I think I really did explain that, but let me go over it quickly. We have only one correct, blue guy, right? So there is a single one for the correct case, which is the blue guy, and then you're going to have as many zeros as there are red guys. How many red guys do you have? The full set of possible classes, of categories, minus the correct one: one correct and capital K minus one incorrect. So this is going to be a one-hot: a one, and all the others are zero, zero, zero. Which entry is the one? Whatever is the correct class: the blue-ball y is the one-hot whose index corresponds to the correct class. And so, when I put together the partial derivative with respect to the function (this is a bit funky, a gradient with respect to a function, but it's okay, because it's a discrete function, which is just a vector, so don't worry), this is just the partial derivative with respect to the negative linear sum, and it's equal to y minus y tilde, because I don't have to select entries anymore: I put the one-hot in front and I subtract the whole thing. If it's not yet clear, I'll tell you more once we're done with the class, okay? I hope you're good. Wait, sorry, I don't understand that symbol. Ah, this is a subtraction in set theory, and this capital curly Y is the set of all categories. Okay? Very good.
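Stacking those per-class derivatives into one vector, and putting the result next to the MSE case, gives the comparison being made here (same shorthand as above, with s the linear output, so the per-class energies are minus s; the side-by-side framing is mine):

```latex
\frac{\partial \mathcal{L}}{\partial (-s)} = y - \tilde{y}
\quad\Longleftrightarrow\quad
\frac{\partial \mathcal{L}}{\partial s} = \tilde{y} - y,
\qquad
C(y, \tilde{y}) = \tfrac{1}{2}\,\lVert y - \tilde{y} \rVert^{2}
\;\Rightarrow\;
\frac{\partial C}{\partial \tilde{y}} = \tilde{y} - y .
```

Up to the sign flip between s and minus s, both losses hand back-propagation the same grad output: the residual between the prediction and the target.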
All right, so now, some physical intuition. We have this result: the gradient of the loss with respect to the energies, same thing, is going to be the distance between the target and the prediction. That's very easy to say, right? All right. So, when I just initialise the network, what do I get as an output? I already told you the solution: you get basically all zeros. Why is that the case? Quick, quick: why does an initialised network give me all zeros? How do we initialise these models? We don't initialise them with zeros, we initialise them with near-zero random weights. Correct; you kind of assembled the answer among you. The point is that the largest singular value of each layer is going to be very tiny, so basically, if you stack a few layers, the output collapses to zero. So before you train the network, the weights are small and the output is zero, more or less.

Let's draw this. My energies at the output of the model, the negative linear output, are going to be all zeros. What is the y tilde going to be, the soft argmax? What is the soft argmax of all zeros? Quick, quick, come on, answer. The average, yes: the uniform distribution, perfect. So y tilde is going to be the uniform distribution, but then you have the y tilde minus the one-hot, and so you get this picture: if you check the negative gradient, all these guys are going to be pointing up, pulling up every energy level by one over capital K, where capital K is the number of classes, and the correct class gets pulled down by a vector of height roughly one (one minus one over K, to be exact). You see, I drew the y tilde minus the one-hot because I'm showing you the negative gradient; we follow the negative gradient to update the parameters. Are we good? So I take a step along this gradient. What happens? The blue guy gets pulled down, and so, step by step, you push down the energy of the correct class while pulling up the energy of everything else.
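In numbers, for the initialisation case just described (assuming, as above, K classes and an output close enough to zero that y tilde is uniform; this is just the negative of the gradient derived earlier):

```latex
s \approx 0
\;\Rightarrow\;
\tilde{y} = \Big(\tfrac{1}{K}, \dots, \tfrac{1}{K}\Big),
\qquad
-\frac{\partial \mathcal{L}}{\partial (-s)}
  = \tilde{y} - y
  = \Big(\tfrac{1}{K}, \;\dots, \;\underbrace{\tfrac{1}{K} - 1}_{\text{correct class}}, \;\dots, \;\tfrac{1}{K}\Big)
```

so every incorrect energy level gets an arrow of height one over K pointing up, and the correct one gets an arrow of height one minus one over K pointing down.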
Good, okay, moving on. Now, what happens if I crank up the AC, so it's super, super cold? The y tilde converges to... yeah, the minimum of the energy, right, the maximum of the score, but we like to think in energies, and therefore the minimum. So the y tilde converges to the arg min, and the blue-ball y is the correct class. So, if we crank up the coldness, the blue guy will get the arrow pointing down, pushing down the correct case, but then what happens to the others? Which one is going to move? Tell me which one. The other guy... which one, which one? Yeah, tell me which one. The red one... which red guy? I have a super cold environment, so you only pick one. Which one do you pick? The lowest red one, yes, Martin; the fourth one, correct, Calculator. And so here you're going to get an arrow down for the blue guy and an arrow up for that fourth item, such that after stepping in this direction, the negative gradient, the blue one is going to be lower and the one that was the lowest will get a bit less low. This is also called, as I said before, the perceptron loss. Yeah, just for the sake of knowing things.

Finally, what happens in the generic case? I have an arbitrary distribution of energy values, so I will have arbitrary pulling-up forces. Why do I call them forces? Because I take the gradient of an energy, and remember from physics 101, the gradient of the energy is the force. Anyway, so I pull up, yes, everyone, proportionally to the probability the model associates with each value, and I push down with strength one on the correct case, such that when I follow this, I will get a lower value for the blue guy. Okay, so I keep pushing down and I keep pulling up. When does this process stop? Maybe this is easier to think about in the regime of super cold temperatures: when does this process stop? Yeah, I'm asking what the convergence condition is. When the blue one is the lowest: that's correct. Whenever you pull down the lowest class and you push down and pull up on the same, lowest one, the correct one; in other words, whenever the correct one has the lowest energy. And it's actually, in the super cold, high coldness regime, so we don't even have to worry about the other contributions, that whenever the correct answer is the lowest one, it will stay at the equilibrium, because you have two equal-strength vectors pulling in two opposite directions, which no longer move your energy levels. Unfortunately, the issue with this maximum likelihood setting is that, if you don't have a super cold environment, what happens? You let me know; okay, think about it for the future. Yeah: you never stop pushing, because you will always use every contribution.

Anyway, we are doing very well, because now, as I was telling you, everything will simplify like dominoes in the notebook. We are going to put our nose inside this code and figure out all the things we have just computed with the math and the slides, now with numerical values, so that we get some relief from this very intense (at least it was for me) set of questions. Okay, are we good, are we moving forward, are you happy? Yeah, happy, okay. But when the system is not that cold, wouldn't we say... yeah, yeah, exactly, that was the issue there.

So we go: cd into the book repo, git status... what the fuck... okay, cd again, git status, okay, git pull, okay, conda activate book, jupyter lab. All right. So let's import torch and nn. This notebook is not yet available; maybe it's going to be available, maybe not; it's like five lines of code, so maybe it's not even that important. I import torch and torch.nn, and from the libraries I fix the nomenclature, so that I call things by the correct name and I don't get confused. Then I generate my x, which is going to be a random vector of two items, a row with two elements, just two random values, and here I generate my one-hot target, represented by the index three (zero, one, two, three). And these are the same equations we saw before: the hidden h is going to be the nonlinear function of the affine transformation of the input; s is the output of the affine transformation of the hidden; o is going to be the log soft argmax of s, and this is why I introduced the log soft argmax before, because this is how it works in code; f is going to be just the ReLU, why not; and D was the negative inner product. So here I have my model: the predictor goes from two to seven, just small numbers so we can print things; two is the point in the plane and seven is my hidden representation. Then I have the nonlinearity, then my affine transformation, then g, which is this log soft argmax, and D, which is this negative log-likelihood loss, the negative inner product between o and y. I execute this cell; tell me in the chat, please, if there are things that are not clear. Okay, I think so far everything is clear.
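Since the notebook isn't published, here is a minimal sketch of the cell just described, assuming standard PyTorch and the shapes mentioned on screen (a 2-D input, a 7-D hidden layer, 5 classes, target class 3); the variable names are my guesses, not necessarily the ones in the actual notebook:

```python
import torch
from torch import nn

torch.manual_seed(0)         # any seed; the printed values will differ from the ones in the lecture

x = torch.randn(1, 2)        # one random point in the plane (a row with two elements)
y = torch.tensor([3])        # index of the correct class, i.e. the one-hot with a 1 at position 3

predictor = nn.Linear(2, 7)  # predicts the 7-D hidden representation from the 2-D input
f = nn.ReLU()                # non-linearity applied to the hidden layer
affine = nn.Linear(7, 5)     # affine transformation producing the linear output s
g = nn.LogSoftmax(dim=1)     # the log soft argmax
D = nn.NLLLoss()             # negative log-likelihood: minus the inner product of o with the one-hot y
```

As mentioned above, the library wants the class index rather than the one-hot, which is why y is stored as the integer 3.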
So here I generate things. The first line: h is going to be the output of my predictor. Second line: the linear sum s is the output of applying the affine transformation to the hidden value, the hidden layer. Then I compute o, which is the g, the log soft argmax over my linear sum. Here I tell PyTorch that I would like it to keep the gradients of both the linear output and this log soft argmax output, such that when I run backprop I can inspect those gradients; otherwise backprop deletes the gradients of values that are not parameters and are not leaves. So here I show you the value of s (well, we just executed it), the linear sum: whether it will keep the gradients around, yes, and whether it currently has a gradient, no, it does not, I have not computed any gradient yet. The o, the log soft argmax output, is going to be whatever value, and you can see here who generated that thing, the function that generated it; it will keep the gradients around, yes; does it have a gradient right now, no.

Now I compute the loss, which is going to be, remember, the F, the energy, because we set this equality, and it equals the D, and the D was the negative inner product. Unfortunately, here I had to use the index rather than the one-hot, because that's how the library defines it. Then I print the loss, and the loss is going to be 1.41. First question, I don't know if I have time: what is 1.41? Can you tell me how to compute this number for any network that has five classes, if you use your calculator right now? Okay, take it as an exercise for home: you should tell me whether this is a reasonable number, how to tell whether it is reasonable, and how to compute it for any network that has five classes. Think about this and let me know; I need to go forward right now, but if you don't figure it out, let's talk about it tomorrow.

Here I just compute back-propagation, which is the whole sequence of things I showed you with the diagrams, and I show you the output gradient and the linear-output gradient. What is this first thing? Remember from the slides: if I compute the partial of this D expression with respect to o, what do I get? Type it in the chat, because we are running out of time. Negative y, right: so this is the first time you can see the numerical counterpart of the things we just described. The other thing: what is the gradient with respect to s, the linear output? Remember, we said it was the one-hot... the y minus y tilde. But that was when we were considering the negative s; remember, the energy F was the negative of s. Here, since we consider s itself, the gradient is flipped, so this is simply the soft argmax of the linear output minus the one-hot. I hope you can see this: the sign is flipped because we check the gradient with respect to the linear output and not the negative output, so there is one difference from the thing we saw on the slides, and otherwise you can see it's exactly the same. So what is this? This is the grad output, the gradient with respect to the output of the affine transformation of the hidden layer. What did we say before, in the slides: how can we check the grad output if I have a linear layer? The grad bias, very good.
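Continuing the sketch above (same hypothetical variable names), this is the forward pass, the loss, and the two gradients just inspected; the allclose checks spell out the identities from the slides:

```python
h = f(predictor(x))               # hidden representation (the ReLU zeroes out negative entries)
s = affine(h)                     # linear output, one score per class
o = g(s)                          # log soft argmax of the linear output
s.retain_grad(); o.retain_grad()  # keep gradients of these non-leaf tensors so we can inspect them

free_energy = D(o, y)             # the loss / energy for this (x, y) pair
print(free_energy)                # some positive scalar; for reference, a uniform prediction over
                                  # 5 classes would give -log(1/5), about 1.609
free_energy.backward()

one_hot_y = nn.functional.one_hot(y, num_classes=5).float()
print(o.grad)                     # equals -one_hot(y): a -1 at the correct index, zeros elsewhere
print(s.grad)                     # equals softargmax(s) - one_hot(y): y tilde minus the target
print(torch.allclose(o.grad, -one_hot_y))
print(torch.allclose(s.grad, torch.softmax(s, dim=1) - one_hot_y))
```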
And so here I just show you the affine transformation's grad bias, and you can see it's exactly the same, right? You should say, oh, okay; you can type "oh, okay" in the chat. Finally... and then we are done. Oh, no, actually we have some time, we have five minutes; oh, we have plenty of time, I thought I was running late.

So this is one thing I struggle a lot with: every time I see this vector-matrix multiplication, I never remember what it really means, and I don't know how to compute the partial derivative, it's confusing. So what I usually do is just expand its meaning. What is this matrix-vector multiplication? It is simply the linear combination of the columns of this table, scaled by the coefficients in this vector over here. That means the first column multiplied by the first coefficient, plus the second column times the second coefficient, plus blah blah blah, this column times this coefficient, and then plus the bias. So, as you can tell, the partial with respect to s is going to be the same as the partial with respect to the bias, because everything just adds up in the summation, so each of these terms shares the same gradient; that's why we had the answer before, that the two are the same. The other case: what is the gradient with respect to this first column vector, the partial of the loss with respect to w1? You have to multiply the grad output by h1; I believe that's what Jack meant. So I have to multiply the grad bias, well, the grad output, by one scalar for the first column, one scalar for the second, one scalar for the last. I hope you're still with me, because this might be blowing your mind, maybe.

So let's check a bit of sizes, such that we know what we are doing. This is my grad bias; the hidden representation is a row vector of seven elements; the grad output is a row vector of five elements; the weight matrix is five by seven, because it shoots towards five dimensions in the output, coming from the seven of the hidden, and the same size is shared by its gradient. So how do I get a five-by-seven matrix from a vector that has seven elements and a vector that has five? This grad output, this grad bias, has to be scaled by each of those coefficients of h. And so you can do that by taking s grad transposed... well, actually, let's not even use s grad, we can type the grad output directly, which is even more cool, I think. There we go: this is the partial I compute by hand, and if I check here, the partial that torch computed is exactly the same. So these entries are the grad output, or grad bias, multiplied by each component of h. And if I check h... do I have h somewhere? No, I never printed h; let's print it, just for the sake of it. You can see h has some zeros, because it is the output of the ReLU, so there were negative values there, and then three other values survive.
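Still with the same hypothetical names, here is a sketch of the check that closes the session: the weight gradient of the affine layer is the grad output scaled by each component of h, i.e. a rank-one outer product, and the bias gradient is the grad output itself:

```python
# dL/dW has shape 5x7: each column of W gets the grad output (5 values) scaled by one component of h (7 values)
manual_weight_grad = s.grad.t() @ h                              # (5, 1) @ (1, 7) -> (5, 7)
print(torch.allclose(manual_weight_grad, affine.weight.grad))    # expected: True
print(torch.allclose(s.grad.squeeze(), affine.bias.grad))        # the grad bias equals the grad output
print(h)                                                         # the ReLU zeroes some entries, so the
                                                                 # corresponding columns of the gradient vanish
```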
And so you can see that the grad bias gets multiplied by each of those components of h: the handful of entries that survive the ReLU scale it, and the zeros wipe out the corresponding columns. And that was the masterpiece: two hours of dense gradients, back-propagation, energies, and everything. How did it go? Are you still there? Most of you are still there, but no one is typing anything anymore. Are you okay? We are done with the lesson. Are you excited, are you happy? It's a lot, I know. "I don't know if I could repeat it, but I feel like I learned something": very good. Okay, so everything was... yeah, dense; still alive; definitely. You should go over it again: I am publishing the slides right now on the Google Doc, and they go through all the steps we went through. The notebook, I don't know if I'll publish it, but I guess you can still watch the last part of the recording; otherwise the most convenient thing is to try to redo it yourself, or you can just copy it from the screen, or I can put it online; I'll decide. Anyway, the whole thing was new material; I hadn't taught it yet, so it was a challenge for me as well, but I think I delivered everything I wanted to talk about today in the two hours. Tomorrow we are going to compute these energies, the shapes I showed you at the beginning of the lesson, and then finally we'll train a neural network to perform classification, which was the original goal; we just went through many, many different ways of looking at this energy perspective, our soft argmin, soft argmax, soft min, soft max, the grad bias, and all these things, and we got to so many places. But I hope you liked it; I enjoyed it a lot. Have a nice evening, I'll see you tomorrow, bye bye. Questions? I mean, I hope you liked it; I really did, I mean, I loved it. Okay, no questions, right? I think you've had enough of me for today. Just take the time to go through the different lines; I recommend doing this before tomorrow's session, since tomorrow we'll just try to finish up this section and train the model. I can address questions if you have any about today's lesson, which would be very nice, so that we can conclude this chapter on back-propagation, classification, and energy-based-model basics 101 for this part of the class. Otherwise, thank you so much for your attention. I'll see you tomorrow, bye bye, have a good night.