All right, so we're going to talk about two or three topics today, and the first one is going to be kind of a review of some of the functions that exist in PyTorch, and when and how to use them. So the first set of topics is about activation functions. There's a whole bunch of them defined in PyTorch, and they basically come from various papers that people have written, where they claim that this or that particular activation function works better for their problem. Of course everybody knows the ReLU, that's a very standard one, but there are lots of variations of ReLUs, where the bottom part is not constant and set to zero but is allowed to change: either only with a positive slope, or forced to have a negative slope, or sometimes random, in the case of the randomized leaky ReLU. So they have nice names like leaky ReLU, PReLU, RReLU, etc. Leaky ReLU is one where you allow the bottom part to have a slight negative slope, and that prevents the issue that sometimes pops up that when a ReLU is off it doesn't get any gradient; here the function gets a chance to actually propagate gradient and perhaps do something useful. This can go all the way to complete, full rectification of the signal, kind of like an absolute value if you want. PReLU is one where you — yes, go ahead.
The previous activation is usually used in the discriminator in a GAN, so that we always have gradients going backwards for the generator. This activation was also necessary in order to train the very skinny network I showed at the beginning of the class: with a very, very skinny network it was basically impossible to get gradients flowing back, because we were ending up in one of the regions where everything was zeroed out, and then nothing would have been trained if we hadn't used this activation function, which gives you some gradient even in the regions where we are trying to suppress the output. — Right. So PReLU is fairly similar, except that now the slope on the negative side can be just about anything. Okay, what's interesting about all the functions we just saw is that they are scale-equivariant, in the sense that you can multiply the signal by two and the output will be multiplied by two but otherwise unchanged. There's no intrinsic scale in those functions, because there's only one nonlinearity and it's a sharp one.
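The ReLU variants just described can be written out directly; a plain-Python sketch (PyTorch's `nn.ReLU`, `nn.LeakyReLU`, and `nn.PReLU` are the module versions — the slope values here are just illustrative):

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Small fixed negative slope: the gradient never dies completely.
    return x if x >= 0 else negative_slope * x

def prelu(x, a):
    # Same shape, but the negative slope `a` is a learned parameter.
    return x if x >= 0 else a * x

# Scale equivariance: multiplying the input by 2 just multiplies the
# output by 2 -- these functions have no intrinsic scale.
for f in (relu, lambda x: leaky_relu(x), lambda x: prelu(x, 0.25)):
    assert f(2 * -3.0) == 2 * f(-3.0)
    assert f(2 * 3.0) == 2 * f(3.0)

# With a = -1, PReLU becomes full rectification (an absolute value).
assert prelu(-5.0, -1.0) == abs(-5.0)
```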
So now we're getting into functions where the scale matters: the amplitude of the incoming signal will affect the type of nonlinearity that you're going to get. One of those is the softplus. Softplus is sort of a differentiable version of ReLU, a soft version of the positive part, and it's usually parametrized as you can see at the top here: (1/β) · log(1 + exp(βx)). It's kind of like the log-sum-exp that we've been using for various purposes, except here one of the terms in the sum is equal to one, which is exp(0) if you want. So it's a function that is asymptotically the identity for large positive values and asymptotically zero for negative values, so it approximates the ReLU. It has a scale parameter, though, this β parameter: the larger β, the more the function will look like a ReLU — the kink, the corner, gets sharper, and the function tends to the ReLU as β goes to infinity. But the function has a scale. Now, you can parametrize these functions in various ways, and here is another example of a soft version of ReLU, where you use ReLU as a basis and then add a small constant to it that makes it smooth. I can't tell you that any of those has any particular advantage over the others — it really depends on the problem — but they all have similar properties. This one you can also make continuously closer to ReLU. One difference in this case is that this one actually goes negative: unlike the ReLU, which has its minimum at zero, its horizontal asymptote at zero, this one goes below zero, and that may or may not be advantageous depending on the application. Sometimes it's advantageous because it allows the system to make the average of the output zero, which is advantageous for gradient-descent convergence: the weights connected to units like this see both positive and negative values, which then converge faster than if they only saw positive values. It's a bit the same here — just a different parameterization of the same kind of thing, with different properties. Of course there are tons of variations on this, with various parameters and different properties, and some of them have particular properties that relate them to Gaussian distributions — this one, for example, though it's not the cumulative distribution of a Gaussian. Okay, so those were functions that have one kink in them: if the kink is sharp there's no scale; if the kink has some smoothness in it there is some scale, but it's still a single-kink nonlinearity. Now we're getting into nonlinearities that have two kinks. This one is basically a saturating ReLU — I'm not sure why it saturates at six, but why not; you could parameterize this a little better. And here's a smooth function that you're familiar with, because it's used in recurrent nets, in gated recurrent nets, in LSTMs, and in the softmax: the sigmoid. Basically this is a two-way softmax, you can think of it this way, and it's just a function that goes smoothly between 0 and 1. It's sometimes called the Fermi function as well, because it derives from some work in statistical physics. And then there is the hyperbolic tangent, which we also talked about. It's basically identical to the sigmoid except it's centered, so it goes between minus one and plus one; it's twice the amplitude, and the gain is a little different, but it plays the same role. The advantage of the hyperbolic tangent is that you can expect the output to be close to having zero mean, and again that's advantageous for the weights that follow, because they see positive and negative values and converge faster.
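The Softplus parametrization above can be written out directly; a plain-Python sketch (PyTorch's `nn.Softplus` is the module version):

```python
import math

# softplus(x) = (1/beta) * log(1 + exp(beta * x))
def softplus(x, beta=1.0):
    return (1.0 / beta) * math.log1p(math.exp(beta * x))

# As beta grows, the corner sharpens and the function approaches ReLU:
# softplus(1.0) tends to relu(1.0) = 1.0.
for beta in (1.0, 10.0, 100.0):
    print(beta, softplus(1.0, beta))

# Unlike ReLU, softplus is strictly positive everywhere.
assert softplus(-5.0) > 0.0
```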
I used to be a big fan of those, but unfortunately, if you stack a lot of sigmoids in many layers in a neural net, it tends not to learn very efficiently; you have to be very careful about normalization if you want the system to converge when you have many layers. In that sense the single-kink functions are better for deeper networks. Here is softsign, which is a bit like the sigmoid except that it doesn't get to its asymptote as fast, so it doesn't get stuck near the asymptote as quickly. One problem with the hyperbolic tangent and the sigmoid is that when you get close to the asymptote, the gradient goes to zero fairly quickly, so if the weights of a unit become too large, they saturate that unit, the gradients get very small, and the unit doesn't learn very quickly anymore. That problem exists both in sigmoids and in hyperbolic tangents. Softsign is a function, proposed by Yoshua Bengio and his collaborators, that saturates slower, so it doesn't have that same problem — I mean, it has the problem too, but not to the same extent. And this is kind of the opposite: hardtanh. I don't know if it deserves that name, but it's basically just a ramp, and it works surprisingly well, particularly if your weights are somehow kept small, so the units don't saturate too much. It's surprising how well it works, and people have used it in various contexts, but it's non-standard. The hard threshold is very rarely used, because you can't really propagate gradient through it — and this is really what kept people from inventing backprop in the '60s and '70s: they were using binary neurons, and so they didn't think of the whole idea of gradients because of that.
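The relationship between the sigmoid and the hyperbolic tangent described above, and the hardtanh ramp, can be checked numerically; a plain-Python sketch (PyTorch has `nn.Sigmoid`, `nn.Tanh`, and `nn.Hardtanh`):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is a rescaled, centered sigmoid: tanh(x) = 2*sigmoid(2x) - 1,
# so its output lives in (-1, 1) and is roughly zero-mean.
x = 0.7
assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12

# Hardtanh is just the ramp: clamp the input to [-1, 1].
def hardtanh(x, lo=-1.0, hi=1.0):
    return min(hi, max(lo, x))

assert hardtanh(0.3) == 0.3      # linear in the middle
assert hardtanh(5.0) == 1.0      # saturates above
assert hardtanh(-5.0) == -1.0    # and below
```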
Those other functions are rarely used as activation functions in a traditional neural net; they're used mostly for things like sparse coding. One step in sparse coding — to compute the value of the latent variable — consists of shrinking all the values in the latent vector by some amount, and you do this with a shrinkage function. This is kind of a soft version of a shrinkage function. It's called softshrink, even though it actually has corners in it; the reason it's called soft is that there is a hard shrink that looks different, which I'll show you in a minute. It basically just moves a value toward zero by a constant, and if that would take it past zero, it's clamped at zero. And this one is basically the identity function from which you subtract a hyperbolic tangent, to make it look like a shrink. — So basically, with this one, values close to zero are actually forced to zero? — Right: small values are forced to zero, and the others are shrunk toward zero, but since the large ones only move by a bounded amount, they're not going to get to zero. You can think of it as a step of gradient descent for an L1 criterion: if you have a variable with an L1 cost function on it, and you take a step in the negative gradient of the L1 — the L1 cost is an absolute value — that will cause the variable to go toward zero by a constant, which is the slope of the L1 criterion, and to stay at zero, coming from either side, without overshooting. That's the nonlinear function you would use there, and it's one of the steps in the ISTA algorithm that is used for inference in sparse coding. But again, it's rarely used in regular neural nets, unless your encoder is meant to approximate sparse coding.
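The shrinkage step described above can be sketched in plain Python (PyTorch's `nn.Softshrink`; λ = 0.5 is just the default threshold here):

```python
# ISTA-style step for an L1 penalty: move each value toward zero by
# lambda, and clamp to zero once it would cross.
def softshrink(x, lambd=0.5):
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

# Small values are forced to zero; larger ones are shrunk toward zero.
assert softshrink(0.3) == 0.0
assert softshrink(2.0) == 1.5
assert softshrink(-2.0) == -1.5
```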
This is the hard shrink: it clamps every value smaller in magnitude than λ to zero. So if a value is between −λ and λ, where λ is some constant, you just set it to zero. Again, it's used for certain types of sparse coding, but rarely as an activation function in a neural net. LogSigmoid is mostly used in cost functions, not really as an activation function either, but it's a useful function to have if you want to plug it into a loss function — we'll see that in a minute. Then there are the multi-dimensional nonlinearities: you put a vector in and you get a vector out of the same size. We know about softmax — it's exp(xᵢ) divided by the sum over j of exp(xⱼ) — and this is softmin, where you put a minus sign in front of the x's, so you view the x's as energies instead of scores, as penalties instead of scores. It's a good way of turning a bunch of numbers into something that looks a bit like a probability distribution, meaning numbers between zero and one that sum to one. And LogSoftmax, again, is not very much used as a nonlinearity within the neural net, but it's used a lot at the output as one piece of a loss function; we'll see this in a minute. Okay, questions? — Yes, we have a question. For PReLU, I'm not sure I understand, number one, why we want the same value of a for all channels, and number two, how learning a would actually be advantageous. — You could have a different a for different channels; different units can have a different a. You could use this as a parameter of every unit, or it could be
shared — that's up to you. It could be shared at the level of a feature map in a convolutional net, or shared across all feature maps, or individual to every unit. If you really want to preserve the convolutional nature of a convolutional net, you probably want the same a for every unit in a feature map, but you can have different a's for different feature maps. What was the second question? — Why learning a specific value of a would be advantageous — like, why are we learning a? — You can learn it or not; you can fix it. The reason for fixing it would be not necessarily to have a more powerful nonlinearity, but to ensure that the nonlinearity gives you a non-zero gradient even in the negative region. Making it learnable allows the system to turn the nonlinearity into either a linear mapping — which of course is not particularly interesting — or a ReLU, or something like a full rectification, where a would be minus one in the negative part, which can be interesting for certain types of applications. For example, suppose you have a convolutional net that has an edge detector. An edge detector has a polarity, right? It's got plus coefficients on one side and minus coefficients on the other. So if you have an edge in an image that goes from, say, dark to bright, the convolution will react positively, but if you have an edge in the opposite direction, the filter will react negatively. Now, if you want your filter to react to an edge regardless of its polarity, you rectify it — that would just be an absolute value. You could of course bake this in; you don't have to use a PReLU, you can just use the absolute value. Probably a better idea is
to use a square, actually. The square nonlinearity is not implemented as a neural-net module, but in the functional form of PyTorch you can just write the square, and that's it. I hope I answered the question. Any other question on this topic? — I have a question. It seems to me like these nonlinearities are trying to make a linear function nonlinear, and the kinks in the lines denote the change in that function. So if we want to model a curve, should we have learnable parameters on both sides — before and after zero on the x-axis? — Well, there are diminishing returns. The question is how complex you want your nonlinearity to be. You could imagine, of course, parametrizing an entire nonlinear function, with spline parameters or Bézier curves or Chebyshev polynomials — you can parametrize any mapping you want, and those parameters could be part of the learning process. However, what is the advantage of doing this versus just having more units in your system, and relying on the fact that multiple units will be added in the end to approximate the function you want? It really depends. If you want to do regression in a fairly low-dimensional space, then parametrized nonlinearities might help — you might want a collection of different nonlinearities, maybe things like Chebyshev polynomials, if you want to do good approximations. But for high-dimensional tasks, like image recognition, you just want a nonlinearity, and it works better if the nonlinearity is monotonic; otherwise it creates all kinds of issues, because two different inputs can produce the same output, which makes it ambiguous for the system to learn the right function.
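The edge-polarity argument from a moment ago can be checked with a toy 1-D example; the signals and the two-tap kernel here are hypothetical, purely for illustration:

```python
# A 1-D "edge detector" [-1, +1] responds with opposite signs to the
# two edge directions; rectifying with an absolute value (or a square)
# makes the response polarity-invariant.

def correlate(signal, kernel):
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

kernel = [-1.0, 1.0]
dark_to_bright = [0.0, 0.0, 1.0, 1.0]
bright_to_dark = [1.0, 1.0, 0.0, 0.0]

r1 = correlate(dark_to_bright, kernel)   # peaks at +1
r2 = correlate(bright_to_dark, kernel)   # peaks at -1
assert max(r1) == 1.0 and min(r2) == -1.0

# After rectification, both edge directions give the same response.
assert [abs(v) for v in r1] == [abs(v) for v in r2]
```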
It's much better if the function is monotonic, and almost all the functions here are monotonic, except the PReLU if you have a negative a. There's a big advantage to having monotonic functions. But in principle, you could parametrize any function you want, and people have played with this. Such functions are not very popular, mostly because they don't seem to bring a huge advantage in the kind of applications people use large neural nets for. Other questions? The next question is going to be kink versus smooth. — Yeah, so the question I had is: can you think of any application where the choice of nonlinearity made a big impact? The only thing I'm aware of is that using a single-kink function instead of a double-kink one helps deep neural networks train better. — Well, a double kink has a built-in scale, which means that if the weights of the incoming layer are multiplied by two, or the signal amplitude is multiplied by two, the result on the output will be completely different, because the signal will sit in a different part of the nonlinearity; you'll get completely different behavior of your layer. Whereas if you have a function with only one kink, if you multiply the input by two, the output also gets multiplied by two — modulo the bias, but the bias is fine. — But what I meant to ask is, I'm thinking of a situation where the choice of activation function made a big difference in the performance of the model, other than deep networks using ReLU instead of sigmoid. — There is no general answer to this. If you're going to use attention, you have to use softmax — I mean, it's not that you have to use softmax, but you want to have something where you get
coefficients that are normalized, to focus the attention of the system — or to spread it — and not allow it to cheat, which is to pay attention to multiple things at one time. You have to have some sort of normalization of the coefficients that come out of the attention system. So normally, in most attention systems, like in transformers, the coefficients are passed through a softmax, so you get a bunch of coefficients that are between zero and one and sum to one, and that forces the system to pay attention to a small number of things: it can only concentrate the coefficients on a small number of items, and otherwise it has to spread them. There are other ways to do normalization, and in fact, there is something wrong with softmax normalization for transformers, for attention, which is this: if you want a coefficient coming out of a softmax to be close to zero, you need the corresponding input to be close to minus infinity, or at least considerably smaller than the largest one. When you go into a softmax, the largest input causes the corresponding output to be large, but if you want that output to be close to one and all the other ones close to zero, you basically need that input to be extremely large and all the other ones to be large and negative. That can be a problem when what you're computing at the input are dot products, because the easiest way for a system to produce a small dot product is to have two vectors that are orthogonal to each other, which gives a dot product of zero. If you insist that the dot product should be very, very negative, then you have to make the two vectors point in opposite directions and make them very long, and that's not so great. And so
using softmax for attention basically limits the contrast that you can get between the coefficients, which is not a good thing. Then, for LSTMs, gated recurrent nets, etc., you need sigmoids there, because you need coefficients that are between zero and one, that either reset the memory cell, or make it a pass-through so that it keeps its previous memory, or write the new input into it. There, it's nice to have an output that varies continuously between zero and one — you have no choice. So I don't think you can say, in generic terms, that this nonlinearity is better than that other one. There are certain cases where one learns better, certain cases where one relieves you from having to initialize properly, certain cases where one works better if you have lots of layers — single-kink functions work better than sigmoid-like functions when you have lots of layers. There's no simple answer, basically. — And I had a question just regarding the general differences between a nonlinear activation that has kinks versus a smooth nonlinear activation. Is there any general reason or rule for why we would prefer to have kinks in the function or not? — It's a matter of scale equivariance. If the kink is hard, again, you multiply the input by two and the output is multiplied by two but otherwise unchanged. If you have a smooth transition and you multiply the input by, say, a hundred, the output will now look like you had a hard kink, because the smooth part has been shrunk by a factor of a hundred; if you divide the input by a hundred, the kink becomes a very, very smooth convex function. So changing the scale of the input changes the behavior of the unit, and that might be a problem sometimes.
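The scale argument can be verified numerically; a plain-Python sketch comparing a hard kink (ReLU) with a smooth one (softplus):

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    return math.log1p(math.exp(x))

x = 0.5
# Hard kink: scaling the input only scales the output.
assert relu(100 * x) == 100 * relu(x)

# Smooth kink: scaling the input changes the *shape* of the response --
# f(100x) is nothing like 100*f(x); at large scale softplus behaves
# like a hard ReLU instead.
print(softplus(100 * x), 100 * softplus(x))  # ~50.0 vs ~97.4
```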
When you train a multilayer neural net and you have two layers one after the other, you don't have any good control over how big the weights of this layer are relative to that other layer's. Imagine you have a two-layer network with no nonlinearity in the middle, so the system is completely linear: if the network has arrived at a solution, you can multiply the first layer's weight matrix by two and divide the second weight matrix by two, and overall the network will have exactly the same output — you won't have changed anything. What that means is that when you do training, there is nothing that forces the system to have a particular scale for the weight matrices. Now, if you put a nonlinearity in the middle, and you still don't have anything that constrains the scale of the first-layer weights versus the second-layer weights, you'd better have a nonlinearity that doesn't care about scale. If you have a nonlinearity that does care about scale, then your network doesn't have the choice of what size weight matrix to use in the first layer, because that will completely change the behavior — and it may want to have large weights for some other reason, which will saturate the nonlinearity and then create vanishing-gradient issues. So it's not entirely clear why deep networks work better with single-kink functions, but it's probably due to that scale-equivariance property. Now, there would be other ways of fixing this problem, which would be to set a hard scale on the weights of every layer — you could normalize the weights of the layers so that the variance of what goes into a unit is always constant. In fact, that's a little bit what batch normalization does.
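The weight-rescaling argument for the linear two-layer net can be checked directly; a plain-Python sketch with tiny, hypothetical weight matrices:

```python
# A two-layer linear net computes W2 @ (W1 @ x). Multiplying W1 by 2
# and dividing W2 by 2 leaves the output unchanged, so nothing pins
# down the scale of each individual layer during training.

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [3.0, -1.0]]
W2 = [[0.5, 1.0]]
x = [1.0, -2.0]

out = matvec(W2, matvec(W1, x))

W1_scaled = [[2 * w for w in row] for row in W1]
W2_scaled = [[w / 2 for w in row] for row in W2]
out_scaled = matvec(W2_scaled, matvec(W1_scaled, x))

assert out == out_scaled  # exactly the same output
```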
The various normalization schemes do that to some extent: they put the mean at zero and keep the variance constant, so the variance of the amplitude of the output doesn't depend on the size of the weights, because it's normalized. That is partially why things like batch norm and group norm help: they fix the scale a little bit. But then, if you fix the scale with something like batch norm, the system no longer has any way of choosing which part of the nonlinearity it's going to use in a two-kink function. So things like group normalization or batch normalization are somewhat incompatible with sigmoids, if you like — with a sigmoid, you don't want normalization just before it. — I see; that provides some really good intuition, thank you. — Okay, any other questions? — I have one more question. I noticed that in the softmax function, some people use a temperature coefficient. In what cases would we want to use the temperature, and why would we use it? — Well, to some extent the temperature is redundant with the incoming weights: if you have weighted sums coming into your softmax, having a β parameter in your softmax equal to two instead of one has exactly the same effect as making your weights twice as big. So that β parameter is redundant with the size of the weights — or with the variance of the weighted sums, if you want. But if you have batch normalization in there, then the temperature parameter matters, because now the input variances are fixed. The temperature basically controls how hard the output distribution will be: with a very, very large β, you will have one of the outputs very close to one and all the other ones very close to
zero; when β is small, it's softer, and in the limit of β equal to zero, it's more like an average — softmax behaves a little bit like an average. So as β goes to infinity, it behaves a bit like argmax, and as β goes to zero, it behaves a bit like an average. So if you have some sort of normalization before the softmax, tuning this parameter allows you to control that hardness. What people do sometimes, in certain scenarios, is start with a relatively low β, so that the numbers produced are soft and you get gradients everywhere — it's well behaved in terms of gradient descent — and then, as learning proceeds, if you want harder decisions in your attention mechanism or whatever, you increase β, and that makes the system make harder decisions. It doesn't learn as well anymore, but presumably, after a few iterations, it's in the right ballpark, so you can sharpen the decisions by increasing β. That's useful, for example, in a mixture of experts — and self-attention systems can be thought of as a weird form of mixture of experts. In a mixture of experts, you have multiple subnetworks, and their outputs are linearly combined with coefficients that are the output of a softmax, itself controlled by another neural net. If you want a soft mixture, you have a low β, and as you increase β to infinity, you're basically going to select one of the experts and ignore all the other ones. That might be useful if you want to train a mixture of experts or an attention mechanism, but in the end you want to save computation by just determining which expert you need to compute, and not computing the other ones. In that case, you want those coefficients to be basically either one or zero.
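The effect of the inverse temperature β can be sketched in plain Python (in PyTorch you would typically scale the logits before `nn.Softmax`):

```python
import math

# Softmax with an inverse temperature beta: beta -> 0 gives a
# near-uniform weighting (an "average"); large beta approaches
# argmax (a hard, near-one-hot selection).
def softmax(xs, beta=1.0):
    exps = [math.exp(beta * x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

scores = [1.0, 2.0, 3.0]
print(softmax(scores, beta=0.01))  # close to [1/3, 1/3, 1/3]
print(softmax(scores, beta=10.0))  # close to [0, 0, 1]

assert abs(sum(softmax(scores)) - 1.0) < 1e-12   # always sums to one
assert max(softmax(scores, beta=10.0)) > 0.999   # nearly one-hot
```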
You can train the system progressively to do this by increasing β. The physicists have a name for this, because they use the same trick for various other things: it's called annealing. Annealing comes from metalwork: you're making steel, a sword or something, and you heat it up and then you cool it, and depending on whether you cool it quickly or slowly, you change the crystalline structure of the metal. This idea of annealing — of progressively lowering the temperature — corresponds to increasing this β; β is akin to an inverse temperature. Any other question? — I think we're good. — All right, okay, so the next topic is loss functions. PyTorch has a whole bunch of loss functions, as you might have seen. Of course there are simple ones, like mean squared error — I don't need to explain to you what it is: you compute the square of the error between the desired output y and the actual output x, and if it's over a mini-batch with n samples, then you have n losses, one for each of the samples in the batch, and you can tell this loss function to either keep that vector or reduce it by computing a mean or a sum. Very simple. Here's a different loss, the L1 loss. This is basically the absolute value of the difference between the desired output and the actual output, and you want to use this to do what's called robust regression. You want small errors to count a lot, and large errors to count, but not as much as if you used the square — perhaps because you have noise in your data. You have a bunch of data points, you're trying to train a neural net or something to fit a curve, to do regression, but you know that you have a few outliers.
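The two losses just described can be written out directly; a plain-Python sketch of what `nn.MSELoss` and `nn.L1Loss` compute, with the reduction options mentioned above:

```python
def mse_loss(x, y, reduction="mean"):
    losses = [(xi - yi) ** 2 for xi, yi in zip(x, y)]
    if reduction == "none":
        return losses                      # keep the per-sample vector
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total

def l1_loss(x, y, reduction="mean"):
    losses = [abs(xi - yi) for xi, yi in zip(x, y)]
    if reduction == "none":
        return losses
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total

# A single outlier (error 3) dominates the MSE but not the L1 loss.
x, y = [0.0, 1.0, 2.0], [0.0, 2.0, 5.0]
assert mse_loss(x, y) == (0 + 1 + 9) / 3
assert l1_loss(x, y) == (0 + 1 + 3) / 3
```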
So you have a few points that are very far away from where they should be, just because the system has noise, or the data was collected with some noise, and you want the system to be robust to that noise: you don't want the cost function to increase too quickly as points get far away from the general curve. The L1 loss is more robust in that sense. Now, the problem with the L1 loss is that it's not differentiable at the bottom, so you have to be a bit careful about how you handle the gradient when you get to the bottom — the gradient step is essentially done with that soft shrink we saw; that's the gradient of the L1 loss. To correct for that, people have come up with various ways of keeping the L1 loss robust for large errors but making it smooth at the bottom, behaving like a squared error near zero. An example is this particular function, the smooth L1 loss: it's basically L1 far away and L2 nearby. That's sometimes called the Huber loss; some people also call it something like the elastic net, after an old paper from the 1980s or 1990s that proposed this kind of objective function for a different purpose. It was advertised by Ross Girshick in the Fast R-CNN paper, and it's used quite a bit in computer vision for various purposes — again, it's for protecting against outliers. — It also gives us sharper results when we do image prediction, sharper than using the MSE, right? — Not particularly — it's just like the MSE for small errors, so that doesn't make any difference. Or maybe I misunderstood what your point was? — Sorry, I was trying to compare L1 versus L2: L2 usually gives blurry predictions whenever we try to do a prediction by minimizing it, whereas people minimize the L1 in order to have sharper overall predictions.
people are minimizing the L1 in order to have like sharper overall predictions okay so if you take um if you take a bunch of points okay if you take a bunch of y values okay and you ask the question uh what value so you take a bunch of points on on y okay and you ask the question what value of y minimizes the square loss the answer is the is the average of all the y's okay okay so if you so if for a single x you have a whole bunch of y's which means you have noise in your data uh your system will want to produce the average of all the y's that you're observing okay and if the y you're observing is not a single value but is i don't know an image the average of a bunch of images is a blurry image okay that's why you get those blurry effects now with L1 the value of y that minimizes the L1 norm the L1 distance so basically the sum of the absolute values or the differences between the value you're considering and all the points all the y points that's the median okay so it's it's a given point all right uh-huh I see the median of course is not blurry the media is not blurry it's it's just an image although it's kind of difficult to define in multiple dimensions but um so one problem with this loss is that it has a scale right uh so here the transition here is is at point five but why should it be at point five it depends what the scale of your of your errors uh are okay uh negative exactly who lost this is really not the negative exactly who lost i'm not sure why it's called this way by torch but basically here imagine that you have an x vector coming out okay and your your loss function is there is one correct x okay so imagine each x correspond to a score uh for like say multiclass classification right so you have a desired class which is one particular index in that vector okay and what you want is you want to make that score as large as possible okay if those scores are likelihoods then this is uh minimum negative log likelihood if if those scores are log 
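A small pure-Python sketch of the two points just made, assuming the standard piecewise definition of smooth L1 with a transition-scale parameter (here called `beta`, an assumption on my part), and a brute-force check that the mean minimizes squared error while the median minimizes absolute error:

```python
import math

def smooth_l1(error, beta=1.0):
    """Smooth L1 / Huber-style loss: quadratic below `beta`, linear above.
    `beta` is the (assumed) scale where the transition happens."""
    a = abs(error)
    if a < beta:
        return 0.5 * a * a / beta   # behaves like L2 near zero
    return a - 0.5 * beta           # behaves like L1 far away: robust to outliers

# The two pieces meet continuously at the transition point.
assert abs(smooth_l1(1.0, beta=1.0) - 0.5) < 1e-12

# Mean vs median: scan candidate predictions for noisy targets with an outlier.
ys = [1.0, 2.0, 2.5, 3.0, 10.0]
candidates = [i / 100 for i in range(1100)]
best_l2 = min(candidates, key=lambda c: sum((y - c) ** 2 for y in ys))
best_l1 = min(candidates, key=lambda c: sum(abs(y - c) for y in ys))
print(best_l2, best_l1)  # the mean (3.7) and the median (2.5)
```

The outlier at 10 drags the L2-optimal prediction toward 3.7, while the L1-optimal prediction stays at the median, 2.5; that's the "blurry average versus sharp median" effect in one dimension.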
But there is nothing in this module that actually specifies that the x's have to be log likelihoods. It's just: make my desired component as large as possible, that's it. If you put negative signs in front, you can interpret the x's as energies rather than scores; they're not positive scores, they're like penalties if you want, but it's the same thing. The formula here says: pick the x that happens to be the correct one for one sample in the batch, and make that score as large as possible.

Now, this particular loss allows you to give a different weight to different categories. Those w's form a weight vector that gives a weight to each category. That's useful in a lot of cases, particularly if you have widely different frequencies for the categories: you might want to increase the weight of samples from categories for which you have a small number of examples. However, I'm actually not a big fan of this. I think it's a much better idea to just increase the frequency of the samples from the classes that appear rarely, so that you equalize the frequencies of the classes when you train. It's much better because it exploits stochastic gradient in a better way.

Let me actually draw a picture of this. Say you have a problem with tons of samples for category one, a smaller number for category two, and a tiny number for category three: say 1,000 samples here, 500 samples here, and 200 samples here.
Using this kind of weight argument, you could give the first class a weight of one, the second a weight of two, and the third a weight of five, and equalize things that way; it would probably be better to make sure the weights normalize to one. But what I recommend is not that. What I recommend is: when you pick your samples, you pick one sample from class one, then one from class two, then one from class three, and you keep doing this during your training session. When you get to the end of class three, you go back to its beginning: you keep going through class one, but class three wraps around to its first sample, then its second, and so on, and when you get to the end of class two, it goes back to its start as well. So you get equal frequencies for all the categories just by going through these circular buffers, cycling more often through the categories for which you have fewer samples. One thing you should absolutely never do is equalize the frequencies by not using all the samples in the frequent categories. That's horrible; there is never any reason to leave data on the floor.

Now, here's the problem with this: after you've trained your neural net this way, it does not know about the relative frequencies of the classes. Say this is a system that does medical diagnosis: it doesn't know that the common cold is way, way more frequent than lung cancer. So what you need to do at the end is a few passes where you fine-tune the system with the actual frequencies of the categories.
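The circular-buffer idea above can be sketched in a few lines; this is a minimal illustration with hypothetical class lists (not a real PyTorch sampler), where `itertools.cycle` plays the role of each per-class circular buffer:

```python
from itertools import cycle, islice

# Hypothetical per-class sample lists of very different sizes.
class_samples = {
    "c1": [f"c1_{i}" for i in range(1000)],
    "c2": [f"c2_{i}" for i in range(500)],
    "c3": [f"c3_{i}" for i in range(200)],
}

# One circular buffer per class: when a small class runs out,
# it wraps around to its first sample while the big class keeps going.
buffers = {name: cycle(samples) for name, samples in class_samples.items()}

def balanced_stream():
    while True:                          # c1, c2, c3, c1, c2, c3, ...
        for buf in buffers.values():
            yield next(buf)

batch = list(islice(balanced_stream(), 3000))
counts = {name: sum(s.startswith(name) for s in batch) for name in class_samples}
print(counts)  # {'c1': 1000, 'c2': 1000, 'c3': 1000}
```

Every class appears with equal frequency, no sample is ever discarded, and the rare class simply gets revisited more often, which is exactly the behavior described above.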
The effect of this is that the system adapts the biases at the output layer so that the likelihood of a diagnosis corresponds to its actual frequency: it will favor things that are more frequent. The reason you don't want to do this during the entire training is that, if you train a multi-layer net that way, the system basically never develops the right features for rare cases. I may have spoken about this already in past weeks. To recycle the medical school example: when you go to medical school, you don't spend time studying the flu in proportion to the frequency of the flu relative to very rare diseases. You spend basically the same time studying all the diseases; in fact, you spend more time studying complicated ones, which usually tend to be rare, because you need to develop the features for them. And then you correct for the fact that those rare diseases are rare: you don't suspect a rare diagnosis very often, because you know it's rare. So that's it for weighted losses.

Cross-entropy loss. You've been using this a lot, of course. The cross-entropy loss is a merging of two things: the log softmax function and the negative log likelihood loss. The reason you want this merged is numerical. The log softmax is basically a softmax followed by a log: you first compute the softmax, then you take the log. If you do softmax and then log, and you back-propagate through this, you might get gradients in the middle, between the log and the softmax, that end up being infinite. For example, if the maximum value of the softmax is close to one and some of the other values are close to zero, you take the log and get something close to minus infinity.
When you back-propagate through the log, you get something close to infinity, because the slope of the log near zero is essentially infinite. But then you multiply this by a softmax gradient that is saturated, so very close to zero. In the end you get a reasonable number, but because the intermediate values are close to infinity or zero — you're multiplying something close to infinity by something close to zero — you get numerical issues. So you don't want to separate log and softmax: you want to compute log softmax in one go. It simplifies the formula and makes the whole thing much more stable numerically. For similar reasons, you also want to merge log softmax with the negative log likelihood loss. Log softmax followed by negative log likelihood loss says: I have a bunch of weighted sums, I pass them through a softmax, I take the log, and I want to make the output of the log softmax for the correct class as large as possible. That's what the negative log likelihood loss does: it makes the score of the correct class as large as possible, as we saw a minute ago. And when you back-propagate through the log softmax, as a consequence, it makes the scores of all the other classes as small as possible, because of the normalization. That's why there is sometimes an advantage, when building a network out of modules, in merging modules into a single one by hand.

So the cross-entropy loss incorporates those numerical simplifications. The loss takes an x vector and a desired category, a class, and computes the negative log of the softmax applied to the vector of scores, where the term in the numerator is the x at the index of the correct class.
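The stability argument above comes down to the log-sum-exp trick; here is a minimal pure-Python sketch (not the actual PyTorch implementation) contrasting it with the naive softmax-then-log, which blows up on large scores:

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax: subtract the max inside the
    log-sum-exp, so no intermediate quantity overflows or hits log(0)."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def naive_log_softmax(xs):
    z = sum(math.exp(x) for x in xs)          # exp(1000) overflows
    return [math.log(math.exp(x) / z) for x in xs]

scores = [1000.0, 0.0, -1000.0]
print(log_softmax(scores))        # finite: [0.0, -1000.0, -2000.0]
# naive_log_softmax(scores) raises OverflowError on math.exp(1000.0)
```

Computing the log and the softmax "in one go" keeps every intermediate value finite, which is exactly why the fused module is preferred over stacking a softmax and a log.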
So your loss is the negative log of the exponential of the score of the correct class, divided by the sum of the exponentials of all the scores. You can think of the x's as negative energies; it's completely equivalent. Now, when you do the math, the log and the exponential simplify, and you get the negative score of the correct class — so to make the loss small you make that score large — plus the log of the sum of the exponentials of all the scores — and to make that small you make all the other xj's as negative as possible. So this makes the score of the correct class large and the scores of everything else small. And like the NLL loss, it can take a weight per category.

There is also a probabilistic interpretation of this. Why is it called cross-entropy? Because it is the cross-entropy between two distributions — really, the KL divergence between two distributions. It doesn't appear clearly in this formula, but think of the softmax applied to the x vector as a distribution: take the scores, run them through a softmax, and you get a bunch of numbers between zero and one that sum to one. Now take a target distribution in which all the wrong categories have probability zero and the correct category has probability one, and compute the KL divergence between those two distributions. It's a sum over categories of the target probability, which is zero except for one term, times the log of the ratio between the target probability, which is one, and the probability the system produces. All of those terms reduce to a single term, the one for which the target probability is one, so you end up with just the negative log of the softmax output for the correct class.
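Both simplifications just described can be checked numerically; this sketch (pure Python, illustrative values) shows that the cross-entropy, written as the negative score of the correct class plus a log-sum-exp, equals the KL divergence to a one-hot target:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cross_entropy(xs, c):
    """-log softmax(x)[c], simplified to -x_c + log(sum_j exp(x_j))."""
    return -xs[c] + logsumexp(xs)

def kl_to_one_hot(xs, c):
    """KL(one_hot(c) || softmax(x)): only the term where the target
    probability is 1 survives, leaving -log softmax(x)[c]."""
    log_q = [x - logsumexp(xs) for x in xs]
    return -log_q[c]

scores, target = [2.0, -1.0, 0.5], 0
print(cross_entropy(scores, target))
print(kl_to_one_hot(scores, target))  # same number
```

Making the correct score `x_c` larger lowers the first term, while making every other score more negative lowers the log-sum-exp term, exactly as described above.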
So we can view this as a cross-entropy between the distribution produced by the system and the one-hot vector corresponding to the desired distribution, if you want. Now, there would be a more sophisticated version of this, which is the actual KL divergence between the distribution produced by the system and a target distribution that you propose, whatever it is, which now is not binary — not a one-hot vector anymore, just a vector of numbers. That's called the KL divergence loss; we'll see it in a minute. The KL divergence is not a distance, because it's not symmetric, but it's a sort of divergence between discrete distributions.

The next one is a kind of extension of log softmax, a version applicable to very large categorization. If you have many, many categories, you may want to cut some corners: you don't want to compute a giant softmax over, say, a million categories, or maybe even more. So you can basically ignore the terms that are small and use tricks to improve the speed of the computation. That's what this does. I'm not going to go into the details of exactly what it does, because I actually don't know them, but it's basically an efficient approximation of the softmax for a very large number of categories.

Binary cross-entropy is a special case of cross-entropy for when you only have two categories, and in that case it reduces to something simple. This one does not include the softmax; it's just the cross-entropy for two categories. And as I said before, the cross-entropy loss is the sum over categories of the target probability for each category times the log of the ratio between the target probability and the probability produced by the system.
If you work it out for two exclusive categories, one probability is necessarily one minus the other, and it comes down to this formula. Now, this supposes that the x's and y's are probabilities: they have to be between zero and one — more or less strictly, because otherwise the logs can blow up.

Here is the KL divergence loss I was telling you about earlier. It's written here in a funny form — actually, as written it's also the simplified version, for a one-hot target distribution. It has the disadvantage of not being merged with something like a softmax or log softmax, so it can run into numerical issues; again, it assumes the x's and y's are distributions. The Poisson loss: this one is barely used.

This next version of the binary cross-entropy takes scores that haven't gone through a sigmoid. It does not assume the x's are between zero and one; it takes whatever values they are and passes them through a sigmoid internally to make sure they are strictly between zero and one. That is more likely to be numerically stable — the same idea as merging log softmax and negative log likelihood.

Okay, margin losses. This is an important category of losses. These losses basically say: if I have, in this case, two inputs, I want one input to be larger than the other by at least a margin. Imagine the two inputs are scores for two categories: you want the score for the correct category to be larger than the score for the incorrect category by at least some margin.
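The "sigmoid folded into the loss" idea just mentioned can be sketched in pure Python; this assumes the standard stable formulation (max(x, 0) − x·y + log(1 + e^(−|x|))), which is algebraically the binary cross-entropy applied to sigmoid(x) but never materializes a probability that could round to zero:

```python
import math

def bce_with_logits(x, y):
    """Binary cross-entropy on a raw score x (no sigmoid applied yet),
    target y in {0, 1}. Stable form: no intermediate log(sigmoid(x))
    can underflow to log(0), even for very large |x|."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

print(bce_with_logits(100.0, 1.0))   # ~0: confident and correct
print(bce_with_logits(-100.0, 1.0))  # ~100: confident and wrong
# A naive -log(sigmoid(-100)) would try to take the log of a number
# indistinguishable from zero in floating point.
```

For moderate scores this agrees exactly with computing the sigmoid first and then the cross-entropy; the merged form only differs in that its intermediates stay finite.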
You pass that margin to the system, and that's the formula you see down there: it's basically a hinge. It takes the difference between the two scores, and y is a binary variable, plus one or minus one, that controls whether you want x1 to be larger than x2 or x2 to be larger than x1. You basically give it two scores and tell it which one you want to be the larger one. The cost function says: if that one is larger than the other by at least the margin, the cost is zero; if the gap is smaller than the margin, or goes in the wrong direction, the cost increases linearly. That's called a hinge loss, and it's useful for a number of different things; we've seen an example of it before.

So this is a margin ranking loss over two values. There's also a simpler version of it, which I don't have here for some reason, with only one x: the loss is max of zero and the margin minus x, and it just pushes x past the margin. The two-value version is the case where you have a ranking between the scores of two categories.

Here is how you would use this for classification. You run your classifier and get scores — weighted sums, before any non-linearity — and you know the correct category, so you say: I want the correct category to have a high score. Then you take another category that has the most offending score: either an incorrect category whose score is higher than the correct one, or one whose score is lower but too close. So you take the category whose score is closest to, or higher than, the score of the correct one.
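The hinge just described can be sketched directly; this follows the usual two-input margin ranking formulation (max(0, −y·(x1 − x2) + margin)), with illustrative scores:

```python
def margin_ranking_loss(x1, x2, y, margin=1.0):
    """Hinge: if y = +1 we want x1 > x2 by at least `margin`;
    if y = -1 we want x2 > x1 by at least `margin`.
    Zero loss once the gap exceeds the margin, linear otherwise."""
    return max(0.0, -y * (x1 - x2) + margin)

# Correct-class score 3.0 vs most-offending incorrect score 2.5:
# the gap is only 0.5, so with margin 1.0 there is still loss to push on.
print(margin_ranking_loss(3.0, 2.5, +1))  # 0.5
print(margin_ranking_loss(4.0, 2.5, +1))  # 0.0: gap already exceeds the margin
```

Gradient descent on the first case pushes x1 up and x2 down until the difference reaches the margin, then stops — the "push up the correct score, push down the most offending incorrect score" recipe described above.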
You feed those two scores to a loss function like this, and it pushes up the score of the correct category and pushes down the score of the incorrect category until the difference is at least the margin. That's a perfectly good way of training something in the context of an energy-based model, for example. You might say minus x1 is the energy of the correct answer and minus x2 is the energy of an incorrect answer — a contrastive term — and you want to push down the energy of the correct answer and push up the energy of the incorrect answer so that the difference is at least some margin. You can use this kind of loss for that.

The triplet loss is a refinement of this. It's used a lot for metric learning, for the kind of Siamese nets that Ishan Misra was talking about last week. The idea is: say I have three samples — one sample, and another sample that's very similar to it. I run them through two convolutional nets, I get two vectors, and I compute the distance between those two vectors, d(a_i, p_i). I want to make this distance small, because that's the matching pair. Then I take two samples that I know are semantically different — the image of a cat and one of a table — and I want their vectors to be far from each other, so I compute that distance and want to make it large. Now, I could insist that the first distance be zero and that the second distance be larger than a margin; that would be a margin-loss type of thing. But what I can do instead is one of those triplet margin losses, where I say the only thing I care about is that the distance I get for the good pair is smaller than the distance I get for the bad pair.
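That ranking-only condition can be sketched as follows, using the standard triplet formulation (max(0, d(a, p) − d(a, n) + margin)) with Euclidean distance and made-up 2-D vectors:

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Only the ranking matters: push d(a, p) below d(a, n) by `margin`.
    We never ask d(a, p) itself to be zero."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

a, p = [0.0, 0.0], [0.5, 0.0]
print(triplet_margin_loss(a, p, [3.0, 4.0]))  # 0.0: negative is far enough
print(triplet_margin_loss(a, p, [1.0, 0.0]))  # 0.5: negative too close, push
```

Note that in the first case the loss is zero even though d(a, p) = 0.5 is not zero: once the good pair beats the bad pair by the margin, there is nothing left to optimize, exactly as described.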
I don't care whether the distance itself is small; I just want it to be smaller than the distance for the bad pair. That's what those ranking losses do. One of the first of these, I think, was proposed by Jason Weston and Samy Bengio, back when Jason Weston was at Google, and they used it to train an image search system for Google. Back then — I'm not sure it's still true — you would type a query, Google would encode it into a vector, compare it to a whole bunch of vectors describing previously indexed images, and retrieve the images whose vectors were close to yours. The networks that computed those vectors — actually linear networks back then — were trained with those triplet losses: a good hit for my search should have a distance between the vectors that is smaller than any bad hit, and I don't care if the distance is small, I just want it to be smaller than for bad hits. Any questions?

That's a graphical explanation of this, where a is the sample under consideration, p is a positive sample, similar to a, and n is a negative, or contrastive, sample. You want to push n away and bring p closer, and as soon as p is closer than n by some margin, you stop pushing and pulling. There are soft versions of this, and in fact you can think of NCE — the kind of loss function Ishan was talking about — as a soft version of it: you have one positive (or a bunch of positives) and a bunch of negatives, you run them through a softmax, and you say I want e to the minus distance for the correct one to dominate e to the minus distance for the others.
So it pushes the positive closer to you and the negatives further away, but now through a softmax, with a sort of exponential decay rather than a hard margin.

In PyTorch you also have losses that allow multiple labels. These let you declare multiple correct outputs: instead of a ranking loss insisting there is only one correct category, with a high score for it and low scores for everything else, here you can have a set of categories for which you want high scores, and all the other ones get pushed down. It's still a hinge loss, but you sum this hinge loss over all categories: for each category, if it's a desired one you push its score up, and if not you push it down, which is what the formula says. And of course there is a soft version of this, which I'm not going to go into, and a multi-margin version of it.

This pushing and pulling for metric learning, for embeddings, for the Siamese nets I was telling you about, is essentially all implemented in the hinge embedding loss. The hinge embedding loss is a loss for Siamese nets that pulls toward you things that are semantically similar and pushes away things that are not. The y variable indicates whether the score you are giving the system is one that should be pushed up or pushed down, and it uses a hinge that makes the score positive if y is plus one and makes it negative by some margin delta if y is minus one.

Very often, when you're doing Siamese nets, you compute the similarity between two vectors not through a Euclidean distance but through a cosine distance.
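Before moving on to the cosine distance, the hinge embedding behavior can be sketched like this; the formulation assumed here (x itself for similar pairs, max(0, margin − x) for dissimilar ones, with x a distance-like score) is the common one:

```python
def hinge_embedding_loss(x, y, margin=1.0):
    """y = +1 (similar pair): the distance x itself is the loss, so
    minimizing it pulls the pair together.
    y = -1 (dissimilar pair): hinge that pushes only until x exceeds
    the margin, then stops."""
    if y == 1:
        return x
    return max(0.0, margin - x)

print(hinge_embedding_loss(0.3, 1))    # 0.3: keep pulling the pair closer
print(hinge_embedding_loss(0.3, -1))   # 0.7: push apart until distance >= 1
print(hinge_embedding_loss(2.0, -1))   # 0.0: already far enough, stop pushing
```

The asymmetry is the point: similar pairs are pulled together indefinitely, while dissimilar pairs are only pushed until they clear the margin delta.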
So: one minus the cosine of the angle between the two vectors. You can think of this as basically a normalized Euclidean distance. The advantage is this: whenever you have two vectors and you want to make the distance between them as large as possible, there's a very easy way for the system to cheat — make the two vectors very long while not pointing in the same direction, and the distance becomes large. Of course that's not what you want; you don't want the system to just make the vectors bigger, you want it to actually rotate them in the right direction. So you normalize the vectors and compute a normalized Euclidean distance, and that's essentially what this does. For positive pairs it tries to make the vectors as aligned as possible, and for negative pairs it tries to make the cosine smaller than a particular margin. That margin should probably be something close to zero, because in a high-dimensional space there is a lot of area near the equator of the sphere. All your points are now normalized onto the sphere, and what you want is for semantically similar samples to be close to you and dissimilar samples to be orthogonal. You don't want them at the opposite pole, because there is only one point at the south pole, whereas the equator is a very large space — the entire sphere minus one dimension, basically. So you can make the margin some small positive value, and you get essentially the entire equator of the sphere, which contains almost the entire area of the sphere.

And I mentioned the CTC loss. This is a little more complicated, because it's a loss that basically does what's called structured prediction.
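To make the scale-invariance point concrete before turning to CTC, here is a sketch of the cosine-based loss just described (1 − cos for positive pairs; a hinge on the cosine, with a margin near zero, for negative pairs), on made-up 2-D vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_embedding_loss(u, v, y, margin=0.0):
    """y = +1: align the vectors (loss = 1 - cos).
    y = -1: push toward orthogonality, stopping once cos <= margin.
    Because the cosine normalizes, growing a vector's length cannot
    fake a large distance."""
    c = cosine(u, v)
    return (1.0 - c) if y == 1 else max(0.0, c - margin)

u = [1.0, 0.0]
print(cosine_embedding_loss(u, [10.0, 0.0], 1))   # 0.0: length can't cheat
print(cosine_embedding_loss(u, [0.0, 5.0], -1))   # 0.0: already orthogonal
print(cosine_embedding_loss(u, [10.0, 0.0], -1))  # 1.0: aligned negative pair
```

With the margin at zero, negative pairs only need to reach the "equator" (orthogonality), not the opposite pole, matching the high-dimensional-sphere argument above.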
I briefly talked about something very similar a few weeks ago. It's applicable when your output is a sequence of vectors of scores, where the vectors correspond to scores of categories. Imagine, for example, a speech recognition system: every 10 milliseconds it gives you a vector of probabilities for what sound is being pronounced right now, and the number of categories is usually quite large, on the order of a few thousand. So it gives you basically a softmax vector of size, say, 3,000 — one every 10 milliseconds. Now you have a desired output, which is what word was being pronounced, and that word corresponds to a particular sequence of sounds that you may know. So what you need is a cost that is low if the output sequence looks like the target sequence, but that allows the input sequence to repeat some of the sounds. For example, my target might be the word "seven", pronounced really quickly — "seven" — so you have a very small number of frames for each sound in the sequence; but the person pronouncing the word in the training sample might say it very slowly — "sseevven". Now the first sound occupies several 10-millisecond frames that should all be mapped to the same instance in the output.

I drew this picture before, but I'll draw it again. You have a sequence of scores coming out of softmaxes — it's actually better if they are energies, but for CTC they need to be probabilities.
Then you have the target sequence, and you can think of this as a sort of matrix, where each entry measures the distance between the two vectors it indexes: an entry indicates how much this vector looks like that vector, using, for example, a cross-entropy, or a squared error — the particular loss doesn't matter. Now, if this is the word "seven" pronounced slowly, and the target has only one instance of each sound, you want all of the vectors corresponding to the "e" to be mapped to that one "e" vector, and you want to compute the cost of matching all of those "e"s to that "e". Here the system produced the correct answer, so you don't have much of a problem; but if the target is "seven" and the output produced by the system does not correspond to "seven", that's when you run into trouble. So what you do is find the best mapping from the input sequence to the output sequence: the "s" gets mapped to the "s", the "e" to the "e", the "v" to the "v", the "e"s to the "e", and the "n" to the "n". You get this path — think of it as a path in a graph — and the way you determine it is with a dynamic programming algorithm, a shortest-path algorithm, which figures out how to get from here to here along a path that minimizes the sum of the distances between the vectors at all the points you go through. So there's an optimization with respect to a latent variable, if you want. And CTC basically does that for you: you give it two sequences, it computes the distance between them and the best mapping between the two, by allowing multiple input vectors to map to a single one on the output.
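The shortest-path computation just described can be sketched with a small dynamic program. This is not the actual CTC recursion (which works on log-probabilities and blank symbols); it is a simplified monotonic many-to-one alignment, in the spirit of the path-through-a-cost-matrix picture above, with an assumed per-frame cost function `d`:

```python
def best_alignment_cost(frames, targets, d):
    """Minimum total cost of a monotonic many-to-one alignment: each input
    frame maps to one target symbol, targets must appear in order, and a
    target may absorb several consecutive frames (frames repeat, never skip).
    Requires len(targets) <= len(frames)."""
    T, S = len(frames), len(targets)
    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    cost[0][0] = d(frames[0], targets[0])
    for i in range(1, T):
        for j in range(S):
            stay = cost[i - 1][j]                        # keep same target
            advance = cost[i - 1][j - 1] if j > 0 else INF  # next target
            cost[i][j] = d(frames[i], targets[j]) + min(stay, advance)
    return cost[T - 1][S - 1]

# Toy per-frame cost: 0 for the right symbol, 1 otherwise.
d = lambda f, t: 0.0 if f == t else 1.0
print(best_alignment_cost(list("sseevven"), list("seven"), d))  # 0.0
print(best_alignment_cost(list("sseavven"), list("seven"), d))  # 1.0
```

The slowly pronounced "sseevven" aligns to "seven" at zero cost because repeated frames collapse onto one target symbol, while a corrupted frame costs exactly its per-frame mismatch; in CTC the same kind of recursion is carried out over log-probabilities in a way that remains differentiable.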
It cannot expand; it can only reduce, if you want. And it's done in a way that you can back-propagate through. We'll come back to more things like this at the end if we can. So the alignment of the input to the target is assumed to be many-to-one, which means the length of the target sequence must be smaller than the length of the input, for the reason I just explained. It's basically differentiable time warping, you could think of it that way — a module that does dynamic time warping, dynamic programming, and is still differentiable. The idea goes back to the early 90s, to Léon Bottou's PhD thesis; it's actually very old.

Is there a good paper or resource to learn more about that dynamic programming algorithm?

Actually, that's kind of what I'm going to talk about next. I may not have time to go through it, but I'll try. Basically, the last part of the energy-based model tutorial — the 2006 paper, "A Tutorial on Energy-Based Learning", that we gave you a link to — the second part is all about this kind of stuff, essentially. It's more energy-based models, but now in more of a supervised context.

As a preliminary, before I get to that, I want to come back to the more general formulation of energy-based models; these are the conditional versions. You have a training set, a bunch of pairs (x^i, y^i) for i = 1 to P. You have a loss functional, L(E, S): it takes the energy function computed by the system and the training set, and it gives you a scalar value.
You can think of this as a functional — a functional is a function of a function — but because the energy function itself is parameterized by a parameter W, you can turn this loss functional into a loss function, which is now just a function of W rather than of the energy function. The set of energy functions is called E here; it's parameterized by W, which is taken within some set. Training consists, of course, in minimizing the loss functional with respect to W, finding the W that minimizes it.

One question you might ask yourself is: I went through a whole bunch of objective functions, loss functions, here — if you are in an energy-based framework, which loss functions are good ones and which are bad ones? How do you characterize a loss function that will actually do something useful for you?

Here is a general formulation of a loss function. It's an average over training samples — I'm assuming invariance under permutation of the samples, so an average is as good as any other aggregating function. It's the average over training samples of a per-sample loss function, capital L, which takes the desired answer y^i — which could be just a category, or a whole image, or whatever — and the energy function with the x variable set to x^i, the i-th training sample, but the y variable left free. So E(W, y, x^i) is basically the entire shape of the energy function over all values of y, for x equal to x^i. And you can have a regularizer if you want. This is a loss functional again, and of course we have to design it so that it makes the energy of correct answers small and the energy of incorrect answers large in some way.

Okay, now we're going to go through a bunch of different types of loss functions.
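In symbols, the general form just described can be written as follows (the exact notation is my paraphrase of the formulation above, with S = {(x^i, y^i)} the training set and R an optional regularizer):

```latex
\mathcal{L}(E, S) \;=\; \frac{1}{P} \sum_{i=1}^{P} L\!\left(y^i,\; E(W, \cdot, x^i)\right) \;+\; R(W)
```

Here E(W, ·, x^i) denotes the whole energy surface over y with x clamped to the i-th training sample, which is exactly why the per-sample loss L can look at the energies of incorrect answers, not just the correct one.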
loss function is just going to be the energy of the correct answer. So I'm going to place myself in the context of an energy-based model: my system produces scores, I interpret those scores as energies, so high is bad and low is good, as opposed to positive scores. And what I'm going to do is define my per-sample loss, as a function of the energy function over Y, as simply the energy that my model gives to the correct answer. So basically I give it an X and I give it the correct answer Y, I ask the system what energy it gives to that pair, and then I try to make that energy as small as possible. So you have this landscape of energies here. Now, I showed you this slide in the context of unsupervised, self-supervised learning; here I'm showing it to you in the context of supervised learning. So imagine that one of the variables is X and the other variable is Y, and the blue beads are training samples. You want to make the energy of the blue beads as small as possible, so you're pulling down on the blue beads, but you're not doing anything else. And so as a result, depending on the architecture of your network — if your network is not designed properly, or designed in no particular way — it could very well be that the energy function becomes flat everywhere. You're just trying to make the energy of the correct answer small, and you're not telling the system that the energy of everything else should be higher, so the system might just collapse. All right, so the energy loss is not good in that sense, but there are certain situations where it's applicable, because if the shape of the energy function is such that it can only make the energy of a single answer small, all the other ones being larger, then you don't need a contrastive term. And we've seen this in the context of self-supervised learning. It seems people are completely lost about the loss functional. Right, okay, so there's
a function L, and it's a function of another function E. It's called a functional because it's a function of a function — it's not a function of a point, it's a function of a function. Now, if that second function is parameterized by a parameter W, then you can say that the loss functional is actually a function of that parameter W, and it becomes a regular function. That's what I had earlier. Can you write it down? It's basically written here: you can either write the functional as L(E, S) — that's a functional because it's a function of E, which itself is a function — but E itself is a function of W, and so if I write the loss directly as a function of W, then it's just a regular function. Yeah, I just asked the question that was asked in the chat. Okay. We've seen the negative log-likelihood loss before; I talked about this. This is a loss function that tries to make the energy of the correct answer — look at the rectangle in red — as low as possible, and then you have the second term, 1/β times the log of the sum over all Y's of e^(−β E(W, Y, X)). This term is trying to make the energies of all Y's for this given X as large as possible, because the best way to make this term small is to make those energies large, since they enter as negative exponentials. So this has the behavior of pushing down on the correct answer and pushing up on incorrect answers. And we've seen this before; we just talked about margin losses and other types of losses. Here is something called the perceptron loss, because it's exactly the same as the loss that was used for the perceptron over 60 years ago. This one says: I want to make the energy of the correct answer small, and at the same time I want to make the
energy of the smallest answer — the smallest energy over all answers — as large as possible. So pick the Y that has the smallest energy in your system and make that energy as large as you can, and at the same time take the energy of the correct answer and make that as small as you can. Now, there is a point at which the minimum-energy answer is going to be the correct answer, and so that difference can never be negative, because the first term is necessarily one of the terms in that minimum. So the difference is at best zero, and in every other case it's positive; it's only zero when the system gives you the correct answer. But this objective function does not prevent the system from giving the very same energy to every answer, so in that sense it's a bad loss function: it says I want the energy of the correct answer to be small and the energy of all the other answers to be large, but it doesn't insist that there be any difference between them. So the system can choose to give every answer the same energy, and that's a collapse. So the perceptron loss is not good; it's actually only good for linear systems, not as an objective function for nonlinear systems. So here's a way to design an objective function that will always be good: you take the energy of the correct answer, and you take the energy of the most offending incorrect answer — the value of Y that is incorrect but has the lowest energy of all the incorrect answers. Your system will work if that difference is negative, in other words if the energy of the correct answer is smaller than the energy of the most offending incorrect answer, by at least some quantity, some margin. So as long as your objective function, when you design it, ensures that the energy of the correct answer is smaller than the energy of the most offending incorrect answer by at least a non-zero margin, then you're fine: your loss function is good.
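To make the contrast concrete, here is a minimal sketch in plain Python; the energies and category names are made up for illustration. It shows that the perceptron loss is blind to a collapsed, flat energy surface, while a hinge with a non-zero margin still penalizes it.

```python
# Toy illustration (illustrative numbers, not from the lecture slides) of why
# the perceptron loss allows collapse: for a discrete Y it is
#   L = E(correct) - min over all y of E(y),
# which is exactly zero when every answer gets the same energy, even though
# the model is then useless. A margin-based hinge catches this.

def perceptron_loss(energies, correct):
    """Energy of the correct answer minus the minimum energy over all answers."""
    return energies[correct] - min(energies.values())

def margin_violation(energies, correct, m=1.0):
    """Hinge loss: positive when the most offending incorrect answer is not
    at least a margin m above the energy of the correct answer."""
    e_bar = min(e for y, e in energies.items() if y != correct)
    return max(0.0, energies[correct] - e_bar + m)

collapsed = {"cat": 3.0, "dog": 3.0, "car": 3.0}   # flat energy surface
good      = {"cat": 0.0, "dog": 2.0, "car": 5.0}   # "cat" is the correct answer

assert perceptron_loss(collapsed, "cat") == 0.0    # collapse goes unnoticed
assert margin_violation(collapsed, "cat") == 1.0   # the hinge still penalizes it
assert margin_violation(good, "cat") == 0.0        # margin of 1 is satisfied
```

The hinge is zero only when the correct answer beats every incorrect one by the margin, which is exactly the condition described above.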
So things like the hinge loss are good. The hinge loss basically says — and we talked about this just before — I want the energy of the correct answer to be smaller than the energy of the most offending incorrect answer, denoted Ȳ_i here, by at least m. That's what this loss function does: it's a hinge loss, and it wants to push the energy of this one below the energy of that one by at least this margin, m. If you train a system with this loss, it can learn the task, it will learn the task, and probably produce good answers. The soft hinge loss, which in the context of energy-based models is expressed this way, basically feeds the difference between the energies of the correct answer and the most offending incorrect one into a soft hinge instead of a hinge — which we talked about just 30 minutes ago. A question would be: how do you pick m? It's arbitrary — you can set m to one, you can set m to one tenth. It's arbitrary because it will just determine the scale of the weights of your last layer; that's all it does, so it's basically up to you. Now, the soft hinge loss has an infinite margin: it wants the difference between those two energies to be infinite, but the force decreases exponentially, so it's never going to get there — the gradients get very small as the difference increases. Here's another example of a margin loss: the square-square loss. This is a loss that tries to make the square of the energy of the correct answer as small as possible, and then it has a squared hinge to push up the energy of the most offending incorrect answer. And again, that works, and it's very similar to the kind of loss that people use in Siamese nets and things like that, which you've heard
about. There's a whole menagerie of such losses which I'm not going to go through; there's actually a whole table here, which is also in that paper, "A Tutorial on Energy-Based Learning", and what's indicated on the right side is whether they have a margin or not. So the energy loss does not have a margin — it doesn't push up anything — so it doesn't always work: you have to design the machine so that this loss can work for that system. The perceptron loss does not work in general; it works if you have a linear parameterization of your energy as a function of the parameters, but that's a special case, and that's the case for the perceptron. And then some of them have a finite margin, like the hinge loss, and some of them have an infinite margin, like the log loss — the soft hinge, if you want. And there's a whole bunch of those losses; some of them were invented in the context of discriminative training for speech recognition systems, before people in machine learning actually got interested in this. A question: how do you find the Ȳ? If Y is discrete, we can simply find the minimum, but otherwise are we running gradient descent? Right — so if Y is continuous, then there is no clear definition of the most offending incorrect answer. You would have to define some sort of distance around the correct answer above which you consider an answer to be incorrect. So for example, you're in a continuous energy landscape, and there's one training sample here. You want to make the energy of that training sample small — easy enough: compute the energy through your neural net, push it down, back-propagate, update the weights so that the energy goes down. Now, for the incorrect answer, if you take an answer that's just epsilon outside of that and you push up, you
know, your energy surface might be a little stiff, because it's computed by a parameterized neural net, so that may not be feasible. So you probably want an incorrect answer that's quite a bit outside, which you're going to push up. So the whole question is how you define the contrastive sample that you're going to push up, and a lot of those objective functions, those loss functions, use a single Ȳ, a negative sample — but there is no simple, single correct way of picking this Ȳ. You can imagine, particularly in the continuous case, or in the case where Y is either very large or continuous and high-dimensional, that there's no simple way to pick Ȳ. A lot of the discussions we've had about contrastive methods — which Ishan talked about for some of his nets, and which we talked about before — were basically about how you pick a Ȳ in the self-supervised case; in the self-supervised case you may not have an X, right. And there are many ways you can pick it; it's only obvious how to pick it in simple cases. I just want to point out the formula here at the bottom. You can think of this as a general form of hinge-type contrastive losses, where you have a function H — think of it as a hinge of some type — and inside that hinge you have the energy of the correct answer, E(W, Y_i, X_i): this is your training sample, and that's the energy your system gives to it. The second term is the energy of some other answer Y for the same training sample X_i. And then there is a margin, but that margin C is actually a function of Y_i and Y — and you might imagine the margin is also a function of X and X_i. So basically you determine a margin as a function of the distance between the Y's, and you
feed that to, let's say, a hinge. Now, the thing is, this loss function is summed over all Y's — here it's a discrete sum because Y is discrete, but you could imagine an integral. So this kind of loss says: I have an energy for my correct answer, I have energies for every other answer in my space, and I want to push up the energies of all the other answers, but the amount by which I want to make them higher — the margin — depends on the distance between Y_i, which is the correct one, and Y, the other one. So you can imagine that this margin becomes smaller and smaller as the two Y's get closer to each other; in that case you don't push up too much on things that are too close, and you push up in proportion to the distance of the Y — whatever distance you think is appropriate. This is of course a more difficult loss function to optimize. And we're out of time, so I might talk about the structured prediction issue that I said I was going to cover at a later time. Any more questions? The contrastive methods used in the self-supervised learning papers usually take random images as the negative examples — do you have any idea if they use these functions, whether anyone has experimented with them? They use what kind of function? These loss functions that you explained to us just now. So most of them use basically the negative log-likelihood loss, which in this table is called NLL/MMI. So NCE, which you heard about from Ishan — that's what they use, right: they're trying to make the distance between the samples as large as possible, and then the contrastive term is basically a log-softmax of the distances. When you compute the log-softmax, you think of a distance as an energy, and then you compute the log-softmax of those energies; you get this formula here, on the second-to-last line, called NLL/MMI.
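As a concrete sketch of the NLL/MMI loss just described — the energy of the correct answer plus (1/β)·log of the sum over all Y's of e^(−βE) — here it is in plain Python over a discrete set of answers. The energies are toy numbers for illustration only, with β = 1:

```python
# Minimal sketch of the negative log-likelihood (NLL/MMI) loss from the table:
#   L = E(W, Y_i, X_i) + (1/beta) * log(sum over y of exp(-beta * E(W, y, X_i)))
# The second term (the log-partition term) pushes up the energies of all
# answers; the first term pushes down on the correct one.
import math

def nll_loss(energies, correct, beta=1.0):
    log_partition = (1.0 / beta) * math.log(
        sum(math.exp(-beta * e) for e in energies.values()))
    return energies[correct] + log_partition

energies = {"cat": 0.5, "dog": 2.0, "car": 3.0}   # "cat" is the correct answer
loss_before = nll_loss(energies, "cat")

# Pushing down the energy of the correct answer lowers the loss...
energies["cat"] = 0.1
loss_lower_correct = nll_loss(energies, "cat")
assert loss_lower_correct < loss_before

# ...and pushing up the energy of an incorrect answer also lowers it.
energies["dog"] = 4.0
assert nll_loss(energies, "cat") < loss_lower_correct
```

This is exactly the push-down/push-up behavior described above: the gradient of the loss with respect to the correct energy is positive, and with respect to every incorrect energy it is negative, weighted by its softmax probability.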
Okay, so the integral is approximated by taking random images? A random what? Taking random images as negative examples — is that used to approximate the integral? Well, basically, you can't compute this integral over all Y — or this sum, if Y is discrete — and so you approximate the sum by a few terms that you pick randomly, right. Yeah, that's called Monte Carlo. Basically, if you want to do this properly, you have to pick those samples according to the rules of Monte Carlo sampling, but it doesn't matter — I mean, that's why hard negative mining is hard. That's what makes the difference between MoCo, PIRL, SimCLR, etc.: how you pick those negative samples. That's why I said that in cases where the Y space is high-dimensional, there is no predefined way of picking negative samples, essentially; it's only in classification that it's easy. But have other people experimented with the other losses? Yeah, I mean, a lot of people are using the square-square loss, or the sort of hinge with the difference of energies. Some of the systems that were used, at least at some point — DeepFace, which is the face recognition system used by Facebook to tag people, used a convolutional net trained in supervised mode with a certain number of categories, basically images from, I don't know, a million people or something. But then there was a fine-tuning phase that used metric learning, basically Siamese nets, where you show two photos of the same person and say those are the same person, and then two photos of different people and you push them apart. They tried different objective functions, but I think they were using the square-square loss at some point, maybe the square-exponential; I'm not entirely sure what they're using now. Professor, what topics will you cover in the next lecture?
Okay, so we're going to have two guest lectures. Next week is Mike Lewis; Mike Lewis is a research scientist at Facebook AI Research in Seattle, and he is a specialist in natural language processing and translation. So he's going to tell you all the interesting tidbits about sequence-to-sequence models, about transformers, about NLP, and about translation. He knows the details much better than I do, so he's the right person to talk about this. Then we're going to have another guest lecture by Xavier Bresson; he's one of the world specialists in graph neural nets. So this is the whole idea of how you apply neural nets to graphs. You can think of an image as a function on a regular grid: every pixel is a location on a regular grid, and the image is just a function on that grid. A grid is a graph of a particular type, and the image is just a function on the graph. You can think of, I don't know, a video as a function on a regular 3D grid, where you have space and time, and most natural signals can be thought of as functions on regular graphs of this kind. But what about the case where the function you're interested in does not live on a Euclidean graph, if you want? So let's imagine, for example, that you take a photo with a panoramic camera, a 360 camera — a camera that basically takes a spherical image. So now your pixels live on the sphere. How do you compute a convolution on a sphere? You want to run your convolutional net on this image that now lives on the sphere, but you can't use the standard ways of computing convolutions, so you have to figure out how to compute convolutions on the sphere. So that's one example. Now here's something a little more complicated: imagine that you have a 3D scanner and you're capturing, I don't know, a dancer — someone in front of a 3D scanner — and that person has a
particular pose, let's say like this. And then you take another 3D capture, 3D data from another person, and that other person is in another pose: she has a different body shape, she's in a different body pose. Now what you want is to be able to map one onto the other; you want to be able to say, here is the hand on the first person — where is the hand on the second person? So what you have to do is basically have a neural net that takes a 3D mesh representing the geometry of a hand, and train it so that when you apply it to the hand it tells you it's a hand, and when you apply it to the other parts of the body it tells you it's something else. But the data you have is not an image — it's a 3D mesh. The mesh may have different resolutions, the triangles may occur at different places, so how do you define your convolutions on a domain like this in a way that is independent of the resolution of the mesh and only depends on the shape, so that you can classify a hand regardless of the orientation, the size, the conformation, and the body shape of the person? Things like that. So here's another example that's perhaps more interesting: you want to train something like a Siamese net, but you want to train the Siamese net to tell you whether one molecule is going to stick to another molecule. So you give two molecules to your neural net, and your neural net produces two vectors: if those two molecules stick together, it gives you two vectors whose distance is small, and if they don't stick together, then the distance is large. So you can think of the distance as kind of the negative free energy of the binding — the binding energy of the two molecules, or the free energy minus a constant, if you want. So you would train this as a Siamese net, but then the problem is how you represent a molecule to a
network, knowing that it's the same network you're going to apply to this molecule and that molecule, and the two molecules don't have the same shape, they don't have the same length, they don't have the same number of atoms. So how do you represent a molecule? The best way to represent a molecule is as a graph — basically a graph whose structure changes with the molecule, annotated with the identity of the atoms at each site, maybe the locations in 3D space or the relative locations, maybe the angles of the bonds between successive atoms, or the binding energy of each particular bond, things like this. So the best way to represent a molecule is by representing it as a graph, basically. And here's another example, perhaps more relevant to something like Facebook — or let's say Amazon. Say I'm Amazon, and I have a customer, and that customer has bought a whole bunch of different things and has commented on a whole bunch of different things. I could think of encoding this as a vector, but it would be a vector of variable size, because people buy different numbers of things and so on, so I would need to find a way to aggregate that data so that everybody can be represented by the same fixed-size vector. But what if, instead, I represent the person, and all the things that person has bought, and all the reviews that person made, and so on, as a graph, and what I feed to the neural net is the graph, with values on the nodes and perhaps on the arcs? If I have a way of representing a graph such that I can connect a neural net to it independently of the shape of the graph, then I can do this kind of application. And so this is what graph neural nets are about. It's a very, very hot topic at the moment, and it's extremely promising for a lot of
applications, particularly in biomedicine, in chemistry, in materials science, but also in social science for social network analysis, and in all kinds of other applications — computer graphics, all kinds of stuff. So it's really cool, and Xavier is really one of the experts on this topic, so I'm really happy that he accepted to give us the talk. It's not going to be easy for him, because he's in Singapore. He's going to be fine — it's the morning for him. That's right. Well, I'm giving a lecture in a couple of days in Hong Kong, so it's the same thing for me, I think. So actually he's from Nanyang Technological University — NTU, right, NTU. Yeah, I was confused; correct. All right, so that was it — oh sorry, there was one more question. That was really interesting, professor; I had one more question. I was reading about this term called normalizing flows, and I don't understand what they are. Could you just give some intuition into why people are excited about them? Right, so normalizing flows. It's not a technique I have a lot of experience with, but I've read the papers. It was proposed by Danilo Rezende and Shakir Mohamed at DeepMind a while back, maybe five years ago or so, and it's basically a density estimation method. So it's a little bit like GANs — it has a bit of the same spirit as GANs — and it gets inspiration from ICA, independent component analysis, although that's not explicit in the original paper. But here's the basic idea: you want to train a neural net to transform a known distribution, from which you can sample, into a distribution that happens to be the distribution of your data. So let's imagine that you have a latent variable z that you sample from a Gaussian distribution, or a uniform distribution over a domain. You
run it through a function implemented by a neural net, and you want to train this neural net so that the distribution you get at the output is the one you want — the one that corresponds to your data. So let me give you a very simple example. Let's say I have a variable z and an observed variable y, and I sample my variable z from a uniform distribution between, say, zero and one. And what I want at the output is, I don't know, say a Gaussian — it's kind of stupid to want a Gaussian, because I could sample from a Gaussian easily, but let's say I want a Gaussian. So what I need to do is transform this uniform distribution into a Gaussian by a mapping, and the mapping is going to be a function — zero is here, it looks kind of like this, if you want — and this function is the inverse of the integral of the Gaussian distribution, the inverse CDF. So now, if I take the derivative of this function — let me draw this, it's a little difficult to see — the derivative of this function here indicates how much I stretch a little piece here into a piece there. The larger the derivative, the more I stretch. If the slope here is one, then this piece of the distribution is not going to be stretched; it's going to be passed through unchanged. And the larger the slope, the more I stretch the distribution: I stretch a little piece here, and therefore all the samples that fall into this little location get distributed over a larger region. And so what I need to do is design this function in such a way that it stretches my input distribution so that it gets transformed into the output distribution I want. All right, so there's a formula that says — in multiple dimensions it's a little more complicated than this — that the distribution you're going to get on y is going to be equal to the distribution
that you started with on z, multiplied by the inverse of the determinant of the Jacobian of this function f — so this is f⁻¹; actually the original formula is this one, but those two things are equal. So this is for a multi-dimensional vector function, so it has a Jacobian mapping z to y, and if you take the determinant of the inverse Jacobian of that function — a scalar value — it indicates by how much the distribution q gets stretched, or compressed in this case. So here the compression ratio is the inverse of the derivative, and the more you compress, the larger the probability will be: the density p(y) at this y will be large, for a given q. And this is for y = f(z). So the big question of normalizing flow methods is how you do this: given a number of samples from p(y), and given that you can sample from your distribution q, how do you minimize an objective function, knowing that the p you get at the output is equal to the q you put at the input multiplied by this inverse determinant of the Jacobian of the f function? What you have to find is the f function. So basically you compute a distance — a divergence, the KL divergence for example — between p(y) and the thing on the right-hand side of the equals sign, and you differentiate this with respect to the parameters of f. So you have to basically back-propagate through the inverse determinant of the Jacobian of f, and that's not easy. Very often what people do is write f as a succession of very simple f's that each only modify the distribution just a little bit — so f is very often something like the identity plus some deviation, a bit like a ResNet, if
you want, and then you stack lots and lots of layers of that, and the problem becomes simpler, because when those functions each do only a little bit of modification, a lot of those issues become simple — the determinant here simplifies. Okay, that's a very abstract, high-level description of normalizing flows. There have been interesting papers about this in recent years, even in recent months, on using this for things like particle physics; Kyle Cranmer at NYU is actually kind of a specialist in that. Thank you so much, professor. All right, any other questions? Okay, that was it. Great, thank you very much everyone. See you tomorrow, guys. All right, bye bye, take care.
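The one-dimensional uniform-to-Gaussian example from the normalizing-flows answer can be checked numerically. This is a minimal sketch using only the Python standard library; the finite-difference derivative and the tolerance are arbitrary illustrative choices, and `NormalDist.inv_cdf` plays the role of the inverse of the integral of the Gaussian described above.

```python
# 1-D change-of-variables check behind normalizing flows:
# map z ~ Uniform(0, 1) through f = inverse Gaussian CDF, so the output
# should be distributed as a standard Gaussian. The formula
#   p(y) = q(z) / |f'(z)|   (with q(z) = 1 on the unit interval)
# is verified against the Gaussian density, using a finite-difference slope.
import math
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

def f(z):
    """The flow: inverse CDF of the standard Gaussian."""
    return std_normal.inv_cdf(z)

def density_of_y(z, eps=1e-5):
    """p(y) = q(z) * |df/dz|^(-1); the slope says how much f stretches z."""
    dfdz = (f(z + eps) - f(z - eps)) / (2 * eps)  # finite-difference derivative
    return 1.0 / abs(dfdz)

def gaussian_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

# Where f stretches the input a lot (large slope), the output density is low.
for z in (0.1, 0.3, 0.5, 0.7, 0.9):
    y = f(z)
    assert abs(density_of_y(z) - gaussian_pdf(y)) < 1e-4
```

In higher dimensions, the scalar derivative becomes the determinant of the Jacobian of f, which is exactly the term the lecture says is expensive and motivates stacking many near-identity layers.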