All right, good morning and good afternoon, everyone. This is Drew LaHart from IBM Research. Today we have Anirban Laha from IBM Research. Anirban is the next speaker in our AI Horizons seminar series, and he has a NeurIPS paper that he'll be presenting to us today on controllable sparse alternatives to softmax. That's quite a mouthful, so I'm looking forward to hearing the seminar. We can ask questions as we go, but if possible, I'd prefer to hold them to the end to make sure Anirban has enough time to get through his deck. I will also remind you that this call is being recorded and will be made available on the IBM Research YouTube channel once it's completed. With that, let me turn it over to Anirban and we'll get started.

Thank you, Drew. Hello, everyone. Today I'm going to present on controllable sparse alternatives to softmax. As we go through the presentation, all of these terms will get clearer. This is joint work with my collaborators from IBM Research India as well as professors from IIT Madras. So let me get started.

This talk is about probability mapping functions. What are probability mapping functions? Basically, these are functions which take in any real vector and convert it to a probability distribution. As with usual probability distributions, the elements of the resultant vector p conform to the properties of probabilities: all of the values are greater than or equal to 0 and they sum to 1. So in this discussion we are going to look at functions which take in any real score vector and produce a probability distribution out of it. Essentially, given any point z in the real space, a function of this kind, which we denote rho, transforms z to a point on the probability simplex. In the illustrative diagram on the right, we take a vector z in 3D space and transform it onto the probability simplex, which in that case is exactly a triangle. We are going to discuss all such functions rho which can perform this kind of operation.

Are there any known probability mapping functions? There is one very familiar one, softmax, which we use in almost every other setting in machine learning. There are also other, not so well known, probability mapping functions like sum normalization and spherical softmax. What applications are they used in? Softmax is ubiquitous in almost all kinds of machine learning applications: probabilistic classification, be it multiclass or multilabel, neural attention models (typically wherever a soft attention mechanism is used), and also memory networks, reinforcement learning, knowledge distillation, and so on. Softmax is the most prevalent, and the others I mentioned are not so widely used, because softmax has very nice properties which satisfy almost every need.

Now let's look at one application of interest that I'll be discussing today. Look at the image on the right: it is the cockpit of an airplane. There is a plethora of instruments and sensors visible. A pilot who is handling the controls in the cockpit should not be overwhelmed by so many sensors and instruments; he needs to focus on only the ones which are important for a particular situation.
Maybe only a few, say two or three sensors, might be relevant during takeoff or landing. Similarly, neural attention models try to attend to only the parts of the input which are relevant in a given situation. The way they usually do this is by applying a softmax function over the parts of the input to get probabilities over those parts, and then using those probability weights to form a weighted combination of the inputs, which is passed to the next stage of the neural model. So typical soft attention mechanisms use the softmax probability mapping function.

Now, the known probability mapping functions I have spoken about have certain limitations. Take, for example, the softmax function we see here. It always produces nonzero probabilities, because the numerator is an exponential of z_i and can never be zero. So we cannot get any sparse outputs: all of the values, no matter how low, will always be greater than zero. There are other issues with the other functions. Sum normalization is the function z_i divided by the sum of the z_j's; the issue there is that it cannot take negative values of z, because then the resultant output violates the constraints of a probability distribution, since some values end up being negative. Similarly, spherical softmax, because of the square term, has an issue of non-monotonicity. So basically, this talk is about the search for the right probability mapping functions with all the nice properties.

Now, why do we need sparsity? In the last slide I mentioned that softmax cannot produce sparse outputs. If we look at a multilabel classification setting, given an image of this car, there are only a handful of labels which would be turned on; maybe in this case just five labels are true, while the label space could be thousands or even millions. Only a few labels are true. So we probably need a sparse probability mapping function which predicts nonzero probability only for the relevant labels and exact zeros for the irrelevant ones.

Similarly, for attention models I have shown a textual entailment task as an illustration. When trying to decide whether a pair of sentences, a premise and a hypothesis, entail or contradict each other and so on, given the hypothesis only a few words in the premise might be relevant for predicting the output; most of the words are not needed. Typically, if you apply a softmax function, almost all the words get nonzero probability. Here the darker shade shows a higher probability and the lighter shade a lower one. Ideally, the lighter shades should be exactly white, meaning zero probability, because those words are irrelevant, but because of the limitations of softmax they still crop up with low but nonzero values. If we had exact zeros for the irrelevant words, we might actually be able to make the computation faster: whenever we use the probabilities as weights for a linear combination of the words in the input, the number of computations decreases if the irrelevant ones are zero. So that's why sparsity is also desired in this setting.
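As a quick numerical illustration of these limitations (the scores below are made up for illustration), softmax assigns strictly positive weight to every item however irrelevant, and sum normalization breaks down on negative scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def sum_normalize(z):
    return z / z.sum()                 # only meaningful for nonnegative scores

scores = np.array([5.0, 4.5, -3.0, -8.0])   # two relevant items, two clearly irrelevant
print(softmax(scores))                 # every entry is strictly positive; the irrelevant
                                       # items still receive tiny but nonzero weight
print(sum_normalize(np.array([2.0, -1.0, 1.0])))  # yields a negative "probability"
```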
Now, where do we need sparsity? There is already a well-known paradigm of sparsity of model parameters. Take a simple model where the input is x and the output prediction is y: a simple linear model could be y = rho(Wx + b), with rho some non-linearity. The typical sparsity literature for model parameters deals with sparsity in W. In our setting, however, we are looking at sparsity in the output, the probability vector that is produced, which is different from the sparsity literature we usually see.

One intuitive idea is to put an L1 norm regularizer on the output y in the loss function: can we achieve sparsity in the output that way? The answer is no. If the output is a probability distribution, as softmax or any probability mapping function would produce, then its L1 norm is always exactly one, so there is nothing to optimize on top of that. We need a better mechanism for obtaining sparsity in the output.

There is one prior work, called sparsemax, which came out in ICML 2016 and tried to address this concern: obtaining from a real vector z a probability distribution p that is sparse enough. It takes the input vector z, finds the nearest point on the probability simplex, and returns that point as the output p. It can be proven mathematically that the nearest point on the probability simplex corresponding to z is actually a sparse probability distribution: many of the dimensions of p can be exactly 0, unlike with softmax. This sparsemax transformation has very interesting properties, in particular an easily computed closed-form solution. The way it works is that a threshold, denoted tau, is found such that each output dimension is max(z_i - tau, 0), with tau fixed at exactly the point where these values sum to 1, which is the required property of a probability distribution. So this is the prior work from the ICML 2016 paper.
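To make that closed-form solution concrete, here is a minimal NumPy sketch of the sparsemax projection described above (a rough sketch; the helper and variable names are ours, not from the paper):

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of a score vector z onto the probability simplex
    # (Martins & Astudillo, ICML 2016).
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in decreasing order
    cumsum = np.cumsum(z_sorted)             # running sums of the sorted scores
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # dimensions that remain nonzero
    k_z = k[support][-1]                     # size of the support
    tau = (cumsum[support][-1] - 1) / k_z    # threshold tau
    return np.maximum(z - tau, 0.0)          # everything below tau becomes exactly 0

print(sparsemax([0.5, 0.3, -1.0]))   # -> [0.6, 0.4, 0.0]: sums to 1, with an exact zero
```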
Now let's visualize how this sparsemax function works. Consider a two-dimensional input vector (z1, z2); as per our requirement, we want to transform (z1, z2) into a probability vector (p1, p2) with the desired properties of probabilities. Take the two-dimensional real space representing (z1, z2), with z1 on the x-axis and z2 on the y-axis. The red regions are the points in this space which, if passed through the sparsemax transformation, produce a sparse vector. What do I mean by a sparse vector? In two dimensions, a sparse probability vector is either (1, 0) or (0, 1), so all the points in the red regions map to either (1, 0) or (0, 1). The region in between, shown by the contours, is where both p1 and p2 are nonzero; the value on each contour is the value of the first dimension of the probability, p1. So this is one way to visualize this kind of transformation, and we can look at the other known probability mapping functions in the same way. For sparsemax, as we have just seen, the sparse regions are the red ones and the non-sparse region lies in between.

For softmax, as we stated before, it cannot produce zero values, and indeed it has no sparse region anywhere in the real space: the contour lines spread throughout and there is no red region at all. For sum normalization, which works only in the first quadrant since it cannot handle negative values, there is still no red region even in the first quadrant, which means it cannot produce sparse output even where it is valid. The other point to note is that these transformation functions give us no control over sparsity; this is something that will get clearer in the next few slides.

So how do we actually incorporate control into a sparse probability mapping mechanism? We propose a framework called sparsegen, which is a unified framework, in fact a whole family of sparse probability mapping functions. It is defined in terms of an input z, a predefined transformation function g, and a parameter lambda. Given an input z, it is first passed through the transformation function g, which takes a K-dimensional vector and produces a K-dimensional vector. The output g(z) is projected onto the simplex, just as in the sparsemax formulation, so the first part of the objective looks similar; this is then followed by a negative L2 norm regularizer, weighted by lambda, which helps in controlling the sparsity. The formulation can be written equivalently in a simpler form: sparsegen of z, for a predefined function g and parameter lambda, equals sparsemax of g(z) / (1 - lambda). So if we have a closed-form expression for g(z), we immediately have a closed-form expression for sparsegen as well, because sparsemax has a closed-form solution, as we saw before.

As I mentioned, there are two ways to control sparsity in this sparsegen framework: one is through the choice of the function g, and the other is through the parameter lambda. We propose two such formulations: one obtained by changing only the lambda parameter, which we call sparsegen-lin, and one obtained by controlling only the g(z) function, which we call sparsehourglass. These are just two variants; since sparsegen is a family of probability mapping functions, many other variants could be produced from it, and in fact we have shown that this framework is a generalization of all the other known probability mapping functions we have seen so far.

Let's look at the first way of controlling sparsity, the formulation we call sparsegen-lin. It is called "lin" because the g(z) function is simply z itself; the only change from the earlier formulation is the added negative L2 norm regularizer. The benefit of this regularizer can be seen from the diagrams below. When lambda equals 0, the formulation is exactly the same as the sparsemax we saw before, with exactly the same sparse regions and contours. If we increase lambda to 0.5, on the right, the non-sparse region, the contour region, has shrunk, while the sparse region has grown.
If instead we decrease lambda to -2 or even below, the sparse region shrinks while the contour region grows. So this lambda can directly control the width of the non-sparse region.

Similarly, we propose another variant called sparsehourglass. The name comes from the shape seen in the middle figure below, which is the shape of an hourglass. That shape is controlled by a parameter called q, which appears in a somewhat complicated formulation; I won't go into the details here, the derivation and everything can be found in our paper. The only thing to note is that the parameter q in the denominator controls the shape of the non-sparse region. In the middle figure below, the non-sparse region has the form of an hourglass, which happens for a value of q equal to 0.5. If we decrease q to 0, the hourglass shape changes so that the contours spread out to cover entire quadrants, whereas if we increase q to infinity, we get exactly sparsemax. So q is another way to control the shape of the non-sparse region and hence control the sparsity.

Now I will discuss two important properties that are desired of probability mapping functions. One is called translation invariance. This property states that if, starting with a real vector z, we get a probability distribution p as output, then adding the same constant c to every dimension of z should still give back exactly the same probability distribution. If that holds, the function is called translation invariant; essentially, adding the same constant to all dimensions of z does not change the output distribution. This translation invariance is seen in some of the known probability mapping functions, like sparsemax and softmax, and also in the sparsegen-lin we just defined. There is another property, called scale invariance, which is satisfied by a couple of the other known functions, sum normalization and spherical softmax. The idea here is that instead of adding a constant to every dimension of z, we multiply every dimension of z by a constant; if we still end up with the same output probability distribution, the function is scale invariant. So these are two different kinds of properties exhibited by the different probability mapping functions we have seen so far.

Sparsehourglass, which we defined a couple of slides back, has an interesting connection between these two properties. The parameter q that I was talking about can be used to control the shape of the hourglass, but it does more: when q equals 0, it can be shown that the formulation satisfies the scale invariance property, on the right, whereas as q tends to infinity the formulation becomes exactly sparsemax and the translation invariance property is satisfied, as in sparsemax. So this q parameter can essentially bridge the gap between scale and translation invariance, which is an interesting property. So far I have discussed a couple of formulations which can be used as probability mapping functions.
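Before moving to the applications, here is a small sketch of the sparsity control discussed above, using the reduction sparsegen-lin(z; lambda) = sparsemax(z / (1 - lambda)) for lambda < 1, together with the sparsemax helper from the earlier snippet (toy scores, for illustration only):

```python
import numpy as np

def sparsemax(z):
    # sparsemax projection, as sketched earlier
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    tau = (cumsum[support][-1] - 1) / k[support][-1]
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam):
    # sparsegen-lin(z; lambda) = sparsemax(z / (1 - lambda)), valid for lambda < 1
    return sparsemax(np.asarray(z, dtype=float) / (1.0 - lam))

z = [0.8, 0.5, 0.3, -0.2]
for lam in (-2.0, 0.0, 0.5, 0.75):
    p = sparsegen_lin(z, lam)
    print(lam, p, "nonzeros:", int(np.count_nonzero(p)))
# Larger lambda scales the scores up before the projection, so the output gets sparser;
# very negative lambda squashes the scores together and the output gets denser.
```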
Now, as you might recall, in the slide with the airplane cockpit I pointed out that softmax is used for computing a probability distribution over the instruments in the cockpit; that's how the neural attention mechanism works. If instead of softmax we plug in any other probability mapping function, we still have an attention mechanism. If we look at this figure, it is an encoder-decoder kind of setup: for every state of the decoder there is an attention mechanism defined over the inputs, which are the states of the encoder. Instead of using softmax to compute probabilities over the encoder states, we could use any other probability mapping function as we have defined; if we just replace softmax there with sparsemax, sparsegen-lin, or sparsehourglass, we get a sparse attention mechanism.

We have tried this out in a couple of sequence-to-sequence settings: neural machine translation and abstractive summarization. The results we have seen are promising. For the translation setting we report BLEU scores, and for the summarization setting we report ROUGE scores, specifically ROUGE-1, ROUGE-2, and ROUGE-L. The main highlight of this table is that one of our formulations, sparsegen-lin, does quite well across multiple tasks on most of these metrics, while sparsehourglass is comparable. But the key finding is that even though this is a highly non-convex setting, an encoder-decoder setup with many layers and an attention mechanism, by controlling the lambda parameter of sparsegen-lin we are still able to get sparse attention vectors. If you look at the figure at the bottom, the x-axis shows increasing lambda, starting from -1000 and going up to 0.75, and the y-axis shows the number of nonzeros in the attention heat map, averaged over the test set. We see that increasing lambda indeed decreases the number of nonzeros, i.e., makes the attention vector sparser, which is not surprising but is good to see holding true in a non-convex setting. And along with the lambda control we are also able to get better BLEU and ROUGE scores in certain settings.
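Here is a minimal sketch of what that replacement looks like for simple dot-product attention, with toy encoder states and the same sparsemax helper as above; this is only an illustration of the idea, not the exact attention model used in the experiments:

```python
import numpy as np

def sparsemax(z):
    # sparsemax projection, as sketched earlier
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    tau = (cumsum[support][-1] - 1) / k[support][-1]
    return np.maximum(z - tau, 0.0)

# toy encoder states (5 positions, hidden size 3) and one decoder query
encoder_states = np.array([[0.2, 0.1, 0.7],
                           [0.9, 0.3, 0.1],
                           [0.1, 0.8, 0.2],
                           [0.4, 0.4, 0.4],
                           [0.0, 0.2, 0.1]])
query = np.array([1.0, 0.0, 0.5])

scores = encoder_states @ query        # one alignment score per encoder position
weights = sparsemax(scores)            # sparse attention weights instead of softmax(scores)
context = weights @ encoder_states     # weighted combination passed on to the decoder

print(weights)   # some positions get exactly zero weight and drop out of the combination
```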
Now, the next problem I am going to discuss is multilabel classification. What is multilabel classification? In usual classification, for a particular instance there is a single class predicted as true, whereas in multilabel classification more than one label can be true for a particular instance. For instance, in this image of a landscape, many of the labels I have marked on the right are actually true for the image: trees, hills, grass, and so on. This is a typical multilabel classification setting. Another point to note is that the label space could be anything, thousands or even millions of labels, while only a few of them are true for any particular instance.

The usual approach to multilabel classification is to train one classifier for every label in the label space: for example, a tree classifier, a greenery classifier, a grass classifier. If there are millions of labels, a classifier needs to be trained for each of them, and whichever classifiers fire, the corresponding labels are considered as predicted true. This approach has multiple shortcomings: the number of classifiers is exactly proportional to the number of labels, and each classifier is trained independently of the other label classifiers.

So we consider another approach that is also known in the literature; it was used in the ICML 2016 paper I mentioned earlier. Here, given the image, there is a single classifier which scores every label in the label space, and once we have the score vector over the label space we apply a probability mapping function. If we apply a sparse probability mapping function, like sparsemax, sparsegen-lin, or sparsehourglass, then the labels which should be predicted true get nonzero output probabilities and the others get exactly zero. This is a much simpler way to approach multilabel prediction, where the probability mapping function selects only the labels which need to be predicted true. And here a single classifier is used, so the parameters are shared across the label space.

Even though we follow that approach, there is a difficulty in training this kind of model. For training we have the input and the label vector over the label space. The model would learn the score vector z, followed by the probability mapping function producing a sparse probability distribution, and that sparse distribution needs to match the label vector, with only the true labels marked as nonzero and the rest as zero. But it can be shown theoretically that with this setup it is not possible to define a convex model objective or a convex loss function. There is a workaround to train in this setup: instead of applying the probability mapping function to the score vector during training, we define a loss function which itself captures the essence of the mapping function we would otherwise have used. The loss function takes the real score vector and the label vector and encodes the properties of the desired probability mapping function, and it can be shown that with this formulation it is still possible to obtain a convex model objective. That is what is done at training time, whereas at inference time, since no convex objective is needed, we can directly apply the probability mapping function without any loss function: the probability mapping function itself produces the sparse probability distribution which is used for the model prediction.
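A rough sketch of the inference side of this setup, assuming a single shared linear scorer over the label space and the sparsemax helper from before (the weights, dimensions, and names below are toy placeholders, not from the paper):

```python
import numpy as np

def sparsemax(z):
    # sparsemax projection, as sketched earlier
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    tau = (cumsum[support][-1] - 1) / k[support][-1]
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(0)
num_labels, feat_dim = 8, 5
W = rng.normal(size=(num_labels, feat_dim))   # single shared scorer over all labels
b = rng.normal(size=num_labels)
x = rng.normal(size=feat_dim)                 # one input instance

z = W @ x + b                                 # one score per label
p = sparsemax(z)                              # sparse probability vector over the label space
predicted = np.flatnonzero(p)                 # labels with nonzero probability are predicted true
print(predicted, p[predicted])
```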
For the two formulations we have defined, sparsegen-lin and sparsehourglass, we were able to propose convex loss functions based on the hinge loss idea, which, when fitted into this framework, make it possible to form a convex model objective. I won't go into the details of how these loss functions are derived; they can be found in our paper. One thing to note again is that each loss function takes as input only the real score vector z and the label vector eta, and it encodes the properties of the corresponding probability mapping function, in this case sparsegen-lin or sparsehourglass, so that at inference time we can directly apply the probability mapping function.

We applied these losses to a synthetic dataset for multilabel classification, comparing setups based on sparsegen-lin, sparsehourglass, and the existing sparsemax and softmax to see how they fare in this multilabel experimental setup. One thing to note, as I mentioned, is that multilabel datasets can have more than one true label per instance. So in building this synthetic dataset we varied the mean number of true labels per instance over the whole dataset, just to observe the behavior of all the competing models. Similarly, we varied the range of the number of labels over the dataset, and we also varied the document length: this synthetic dataset is made out of combinations of words, so we varied the number of words, which is the document length. These are the different kinds of analysis we did, varying the mean number of labels, the variance of the number of labels, and the document length, and observing how the competing models fare.

A note on how the competing models are denoted: as I mentioned on the earlier slide, a loss function is applied directly to the score vector during training, whereas during inference there is no loss function, only a probability mapping function. That's why each competing model is denoted as a pair, with the probability mapping function before the plus sign and the loss function after it. During training you train with the corresponding loss, such as hinge, Huber, or log loss, and at inference time the corresponding probability mapping function is used to get the sparse output.

Now let's look at the results in the multilabel classification setting. As I mentioned, there are multiple variants of the dataset, where we varied the mean number of labels, the range of the number of labels, or the document length; these are the corresponding plots at the top of the slide. Each plot shows the F-score against the mean number of labels in the first figure, against the range in the second figure, and so on. Looking at the first figure, the competing setups shown are pairs like softmax + log or sparsehourglass + hinge.
We see that the black and the violet plots consistently do better across a varying number of labels, whereas the other two baselines catch up only when the number of labels is higher. This shows that even with very sparse labels, i.e., a low mean number of true labels, we are still able to get sparse and hence accurate outputs, reflected in a higher F-score. Similarly, in the second figure at the top, varying the range, we consistently see higher F1 scores, and we see the same when varying the document length. One conclusion I want to draw from here is that even when there is high variability in the number of labels in the dataset, or when very few labels are true, the mechanisms we proposed are able to adapt to that variability and produce sparser outputs when required, which also ends up boosting accuracy. That is seen in the figure at the bottom right: the top figures show accurate predictions in terms of F-scores, whereas the lower figure shows that the black and the violet lines are lower, meaning a lower number of nonzeros, i.e., sparser outputs being produced whenever needed. Those are the two conclusions we can draw from here.

Coming to the end of my presentation, I would like to summarize the key contributions of this work. Firstly, we have proposed a unified framework for sparse probability mapping functions, which we call sparsegen. We then derived a couple of formulations, sparsegen-lin and sparsehourglass, from this unified framework. These two formulations provide control over sparsity through different parameters, and in the multilabel classification setting we were able to propose convex hinge-based loss functions for both of them. Through experiments in multilabel classification we have shown sparser and more accurate predictions on synthetic datasets; we have also tried some available real datasets and seen comparable results. And in the attention setting, for neural machine translation and abstractive summarization, we have shown the effectiveness of sparsity control over attention heat maps. That is the main summary of this presentation.

These are the references: mainly the ICML 2016 paper which proposed sparsemax, and an ICLR 2016 paper which debated whether scale invariance or translation invariance is the more necessary property; we have provided a mechanism in this work, through sparsehourglass, which bridges the gap between scale and translation invariance. There was also a NIPS 2017 paper which proposed a regularized framework for sparse and structured neural attention.

As I come towards the end of my talk, I want to notify all the listeners about an event next month: we are organizing a natural language generation tutorial at ACL 2019. If any of you are attending ACL and are interested in natural language generation, please do drop by at this tutorial. With that, I'll be taking questions.

Thank you, Anirban. For anyone interested in asking questions, just a reminder that we are recording and the session will be available on YouTube afterwards.
All of the participants at this point are on mute, so if you'd like to ask a question, please come off mute and ask your question now. Okay, I'm not seeing any, and I don't see any in the chat either. One more call for questions: any questions from any of the participants? Okay. Thank you so much. Just a reminder, our next AI Horizons seminar is on June 11th at 3pm Eastern time. It is on a large-scale corpus for conversation disentanglement, and it will be Jonathan Kummerfeld from the University of Michigan presenting. Thank you for your time, everyone. Thank you for listening. Thank you for hosting me. Thanks a lot. Bye bye.