Hello, my name is Brent Hailpern, and I'm the Scientific Director here. This is our weekly seminar series. Today we have Siddhartha Brahma from IBM Research at the Almaden Research Center, talking about improving language models by decoding the past. Siddhartha is a staff member here in Almaden in AI; his PhD is from EPFL and his master's is from Princeton. And without further ado, Siddhartha, take it away.

Thank you very much, Brent. All right, so today I'll be talking about — the title of my talk is Improved Language Modeling by Decoding the Past. This will be presented as a long paper at ACL in a couple of weeks' time. The paper is about a simple idea that tries to improve language modeling using neural networks.

First, to define language modeling: essentially, by language modeling we mean an algorithm that assigns a probability to a piece of text. So for example, for something like "Megan likes playing soccer", we want to assign a probability to this piece of text. The idea is that a model that in some way understands language well should assign a much higher probability to something like "Megan likes playing soccer" than to something like "Megan likes playing sushi", which doesn't make sense. That's the key task of language modeling.

In practical terms, we are given a corpus of text — and this text could be anything, free text from Wikipedia or news or whatever; it really depends on what you want to train your language model on. We want to train a model using this free text so that it generalizes well to unseen text. When the language model is then evaluated on a new piece of text, it should not, in some sense, be very surprised by it, because if it has understood the language well, the element of surprise on a new piece of text should be small. That's the whole idea. It's a very fundamental problem in NLP and NLU. Most importantly, language models have been shown to be the key ingredient in most modern general-purpose models that work well across natural language tasks — for example GPT-1, ELMo, or BERT. Language models are the core ingredient of very general models that have been shown to work very well on many different natural language processing tasks. The exact type of language modeling may differ a bit, but the key problem is still the same.

Abstractly speaking, given a sequence of tokens w1 to wN, we want the model to compute, or assign, a probability to that sequence. By the chain rule of probability, this can be broken down as a product of N terms, each of which is a conditional probability: P(w1) multiplied by P(w2 | w1), multiplied by P(w3 | w1, w2), and so on. Essentially, most modern language models that use neural networks try to model this chain rule: the model constructs a representation of w1 to wi, and then, using this representation, computes a probability distribution over the next token, wi+1, using a linear layer followed by a softmax. That's how the model is set up, and then it's trained on whatever text you want to train it on. People have shown that fairly simple LSTMs or transformers, with different modes of regularization, can achieve pretty impressive performance in language modeling.
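As a rough illustration of this setup — a minimal sketch, not the exact model from the paper, with all names and dimensions chosen for illustration — here is how a next-token language model factorizes the chain rule with an LSTM in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLSTMLM(nn.Module):
    """Minimal LSTM language model: p(w_1..w_N) = prod_i p(w_{i+1} | w_1..w_i)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)    # embedding matrix E: V x D
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)         # maps h_i -> logits over the next token

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        h, _ = self.lstm(self.embed(tokens))          # h_i summarizes w_1..w_i
        return self.out(h)                            # logits for w_{i+1} at every position

# Training objective: cross entropy between the predicted next-token distribution and the
# actual next token; perplexity is the exponential of the average of this loss.
model = SimpleLSTMLM(vocab_size=10000, dim=400)
tokens = torch.randint(0, 10000, (8, 35))             # toy batch
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 10000), tokens[:, 1:].reshape(-1))
perplexity = loss.exp()
```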
Just as a side note, the goodness of a language model is measured by a quantity called perplexity, which is essentially a normalized inverse probability; it can also be interpreted as the entropy of a new string in exponentiated form.

All right, next to the model. I'll spend some time on this figure because it has the crux of what a typical language model does. In the center is an LSTM block — it could also be a transformer block. At a particular moment in time, when it sees the i-th token, represented by wi, it also has a representation hi-1 of the tokens it has already seen, that is w1 to wi-1. So essentially we're modeling one term in the chain rule equation: when the model sees wi, it also has a representation of the past tokens, and we try to compute the probability distribution over the next token.

The token wi would typically be represented as a one-hot vector over a space of dimension V, where V is the size of the vocabulary. This wi is then multiplied by an embedding matrix E, which is a set of trainable parameters, and this produces a vector of dimension D. The LSTM hidden dimension is also chosen to be D, for one reason: after the LSTM module operates on the embedding of wi and on hi-1, it outputs a new state hi, and people have shown that by multiplying hi by the transpose of E, which is of dimension D x V, and then taking a softmax, we can compute a distribution over the next token. In general, this second matrix multiplied with hi could be any matrix, but it has been shown that by keeping it identical to the embedding matrix we get better performance than using an entirely new matrix. That's called weight tying.

So this is how an LSTM language model works. At every step it gets a new token wi and produces a probability distribution over the next token, which we denote wi+1 tilde. At training time we know the actual wi+1, so we can compute a loss term, the cross entropy between wi+1 and wi+1 tilde.

There is one thing to notice here, which leads to the main intuition of this work. This block, which represents one step — one term of the chain rule — is highly symmetric. If we ignore hi-1 for a moment, what comes in is a vector which is a trivial probability distribution (it's one-hot, but it still adds up to 1), namely the one-hot vector representing wi, and what comes out is another probability vector over the same space: the output vector is also of size V, where V is the size of the vocabulary. So there is an input-output symmetry at each step of language modeling. Furthermore, what goes into the LSTM and what comes out of the LSTM are also of the same dimension, and this is by choice: we make sure that the hidden dimension of the LSTM is the same as the dimension of the embedding matrix.

So one of our ideas is to exploit this symmetry in order to bias the language model. To exploit the symmetry, we ask the following question: is it possible to decode the past — the last token that the LSTM sees — from the next-token probability distribution? The intuition can be understood from the following two examples. Suppose the language model knew what the next token was. For example, if in a sentence the (i+1)-th token was "soccer", then there is a higher chance of the previous token being "playing" rather than "eating".
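For concreteness, here is a minimal sketch of the weight-tying trick described above, again in PyTorch with illustrative names, not the authors' code: the output projection reuses the embedding matrix E transposed, instead of a separate V x D matrix.

```python
import torch
import torch.nn as nn

class TiedLSTMLM(nn.Module):
    """LSTM LM with weight tying: the softmax layer reuses the embedding matrix E."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # E: V x D
        self.lstm = nn.LSTM(dim, dim, batch_first=True)  # hidden dim kept equal to D
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))             # h_i: (batch, seq, D)
        # Logits = h_i @ E^T + b, i.e. the embedding matrix is reused (tied) as the
        # output projection instead of learning a separate D x V matrix.
        logits = h @ self.embed.weight.t() + self.bias
        return torch.softmax(logits, dim=-1)             # \tilde{w}_{i+1}: next-token distribution
```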
There is a higher chance of "soccer" being preceded by "playing" than by "eating". At the same time, there's a higher chance of "sushi" being preceded by "eating" than by "playing". The intuition is to somehow incorporate this knowledge. Essentially, this is nothing but the bigram statistics of the language we are looking at: the bigrams of a language are obviously not distributed in an independent fashion. Conditioned on the next token, we can say with fair certainty that the distribution of the previous token is highly skewed towards certain tokens. We are going to exploit this intuition.

The thing is, at a particular point of language modeling — that is, when the model is computing a particular term of the chain rule — it doesn't have access to the next token, because that is exactly what the loss for that term is computed against. So we really can't use the next token itself. But what we can do is use the next-token probability distribution, wi+1 tilde, which is produced by the model itself, as a proxy for the next token. If the language model is good, then wi+1 tilde will be highly skewed. So since we don't know wi+1, we use wi+1 tilde as a proxy for it.

So what we do is exploit the symmetry of the model. As we mentioned, the forward part of going from wi to wi+1 tilde has this symmetry: in goes a vector of dimension V, and out comes a vector of dimension V. So we retrace the path in the reverse direction. We can think of wi+1 tilde as a proxy for the actual next token; in that sense, if we multiply it with the embedding matrix E, it gives us a weighted combination of embeddings over the vocabulary. This is followed by a nonlinear transformation, which I'll talk about shortly, and out comes hi prime, which is the counterpart of hi in the forward direction. Then we mirror the softmax computation by multiplying it with E transpose. So the past decoding operation is a mirror image of the forward flow of the LSTM, the only difference being the nonlinear transformation in the middle. The LSTM also has information coming from the context through hi-1, but in the backward direction we don't have context — nor do we want context, because we are really interested in decoding just the last token from the next-token probability. So this nonlinear transformation is stateless, in contrast to the forward direction, which is of course stateful.

The equation shows the exact form of the probability of the last token: wi+1 tilde multiplied by E, followed by the nonlinear transformation, then multiplied by essentially a linear transformation, E transpose, plus a bias term. That gives a vector of scores, followed by a softmax that gives a distribution over the vocabulary, from which, for a particular word, we can read off the probability of the last token.

Then the rest is straightforward. We incorporate this past decoding operation as an added term in the loss function. The usual loss function is the purple ellipse on the top.
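To make the mirror-image computation concrete, here is a minimal sketch of the past decoding step, in the same illustrative PyTorch setup as the earlier snippets; the module name and exact nonlinearity here are assumptions based on the description in the talk, not the authors' released code.

```python
import torch
import torch.nn as nn

class PastDecoder(nn.Module):
    """Decode a distribution over the *previous* token from the predicted
    next-token distribution, mirroring the forward flow of the LM."""
    def __init__(self, embed: nn.Embedding):
        super().__init__()
        self.embed = embed                                       # shared (tied) embedding E: V x D
        dim = embed.embedding_dim
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())   # stateless D -> D nonlinearity
        self.bias = nn.Parameter(torch.zeros(embed.num_embeddings))

    def forward(self, next_token_probs):
        # next_token_probs: \tilde{w}_{i+1}, shape (batch, seq, V)
        ctx = next_token_probs @ self.embed.weight               # weighted mix of embeddings, (.., D)
        h_prime = self.f(ctx)                                    # counterpart of h_i, but with no state
        logits = h_prime @ self.embed.weight.t() + self.bias     # mirror of the tied softmax layer
        return torch.log_softmax(logits, dim=-1)                 # log p(w_i | \tilde{w}_{i+1})
```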
For the next-token probability, which comes out after applying the softmax function, we compute a cross entropy with the actual next token, wi+1. And for the past decoding operation, we know the actual past token, because it is the input at this stage of language modeling, and we have an estimate of that token from the past decoding operation. We can then compute a normal cross-entropy term, which we call L_PDR — PDR stands for past decoding regularization. This term is used as a regularization term added to the normal loss, L_LM.

In other words, this is what the final objective looks like. The past decoding loss is the lower ellipse: the cross entropy between the current token and the past-decoded probability distribution. The normal loss is between the next-token probability and the actual next token. The combined loss is then the normal language modeling loss plus the past decoding regularization loss, with a weight lambda, which is determined through experimentation. For the nonlinear part of the past decoder, we want to keep it very simple, because we don't want to bias the model too much by introducing too many parameters. The only requirement is that it maps a dimension-D vector to a dimension-D vector, so we just use a single linear layer with a tanh nonlinearity.

That is essentially the main contribution of this work. One important thing to note is that the past decoding operation is only used during training. The only extra parameters we introduce are the ones in the linear layer in the nonlinear box, plus an extra bias term, which is not very important but does seem to help a little. That's the only set of parameters we introduce, and they contribute very few extra parameters to the overall language model. This is just used for biasing the model; at test or inference time, the past decoding operation is of course not required.

Okay, now to the results. We tested our method on four different datasets, for both word-level and character-level language modeling. For word-level language modeling, we used two standard datasets, Penn Treebank (PTB) and WikiText-2; for character-level language modeling, Penn Treebank character-level and enwik8. So four datasets in total. In terms of models, since this is an extra regularization term, we pretty much follow the modeling parameters of the well-known AWD-LSTM from Merity et al. 2018; essentially, we don't make any changes to the layer dimensions and so on.

There is one more thing to note: there are two different types of language models that matter here. One uses a single softmax at the output of the language model. But as Yang et al. showed in a 2017 paper, a mixture of softmaxes often helps to improve the perplexity of language models. So we try our regularization on both kinds of models and will show results for both. For training, we pretty much use what was done by Merity et al.; we do very light hyperparameter tuning in the vicinity of the best hyperparameters they reported. One important hyperparameter is the lambda term. This lambda was fixed by conducting a few experiments; essentially, we saw that a lambda of around 0.001 works pretty well for our models.
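Putting the pieces together, a minimal sketch of the combined training loss might look like the following; the function, module names, and the lambda value follow the description in the talk rather than the authors' released code, so treat them as illustrative.

```python
import torch.nn.functional as F

lam = 0.001  # weight on the past decoding regularization term (rough value from the talk)

def training_loss(model, past_decoder, tokens):
    # tokens: (batch, seq_len); inputs are tokens[:, :-1], targets are tokens[:, 1:]
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    next_probs = model(inputs)                       # \tilde{w}_{i+1}, (batch, seq, V), as in TiedLSTMLM

    # Standard language modeling loss: cross entropy against the actual next token.
    V = next_probs.size(-1)
    loss_lm = F.nll_loss(next_probs.reshape(-1, V).log(), targets.reshape(-1))

    # Past decoding regularization: decode the *current* token back from \tilde{w}_{i+1}
    # and penalize with cross entropy against the token actually seen at this step.
    past_logp = past_decoder(next_probs)             # log p(w_i | \tilde{w}_{i+1})
    loss_pdr = F.nll_loss(past_logp.reshape(-1, V), inputs.reshape(-1))

    return loss_lm + lam * loss_pdr                  # only loss_lm matters at test time
```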
Okay, so the first thing we did was to check whether PDR can indeed act as a good regularizer. What we did — this is just for the word-level language models — was to take AWD-LSTM, which is a three-layer LSTM, but remove all of its regularizations (the paper uses about seven different types of regularization), train it on the PTB and WikiText-2 training sets, and compute its performance on the validation sets. Then we just turned on the past decoding regularization. This is a sanity check of whether PDR can act as a regularizer, and indeed it does help: the validation perplexity drops by around 2.5 points for PTB and by about 5 points for WikiText-2. So that's one good thing — PDR can indeed act as a regularizer, sort of confirming our hypothesis.

But then, of course, for the full model we use the other regularizations that are used in AWD-LSTM, and I just list them here. There are almost seven types of regularization, which essentially target each part of the LSTM model: there are dropouts on the input, dropouts at the output, dropouts on the recurrent matrices of the LSTM, then there are temporal regularizations applied to the LSTM states, and so on. The authors showed that all of them actually complement each other and help in reducing perplexity.

So now the results. We tried it on the full AWD-LSTM model with all the regularizations turned on, and what we see is that PDR leads to a drop of around 1.7 perplexity points on the test set. And when AWD-LSTM is used in conjunction with dynamic evaluation — this is the method by Krause et al. where, at inference time, some of the model parameters are adjusted slightly, which really helps in driving perplexity down even further — even in this case PDR helps, again by around 1.8 perplexity points. Note that the two models have exactly the same number of parameters at test time, because the past decoding operation is not required at test time. So with the same number of parameters, using PDR we get gains in perplexity. With the mixture of softmaxes of Yang et al., the reductions are smaller; there could be many reasons for this, but one is that we really did not do much hyperparameter tuning for this setting, so the gains might be slightly better with more careful tuning. That's for Penn Treebank. At some point in time these were the best results for these datasets, but of course they have since been superseded.

For WikiText-2, the drops are similar. For single-softmax models we get an improvement of around 2.3 perplexity points, and with dynamic evaluation about 1.7 perplexity points; for the mixture-of-softmax models the gains are slightly lower. Essentially, for the single-softmax models at least, we see gains in the range of around 1.7 to 2.3 perplexity points. For character-level models, things are much more difficult to improve, I guess. But again, we do see marginal improvements from past decoding regularization, at the level of 1 or 2 percent, which is not very much.

So we did some analysis of how exactly PDR is working. In this plot, we essentially plot the negative log likelihood of the past decoding operation.
So once the last token of the context is decoded — that's the past-decoded vector — we can compute the negative log likelihood corresponding to the actual last token. If we compute that and plot a histogram for the validation sets of PTB and WikiText-2, what we see is that the past decoding operation can indeed recover quite a lot of information about the last token.

How do you compute it — is it unbiased, or is it... I don't think questions are allowed. Oh, okay.

Okay. All right. So no, this is very simple. The past decoding operation gives us a probability distribution over the last token; then we simply look at the entry corresponding to the actual last token and compute its NLL, and we plot the histogram for the validation sets. What this shows is that if the model is able to decode the last token with a fairly good degree of accuracy, then the negative log likelihoods should be skewed towards the left — and indeed they are skewed towards the left. So this operation does make sense.

Now, about how exactly PDR biases the language models. Here is another representation, where I'm trying to understand exactly what it's doing. What we do here is look at the next-token probability distributions, compute their entropy, and take a histogram over the PTB validation dataset. What we see is that PDR shifts the entropies of the next-token probability distributions slightly to the left: the green histogram is without PDR and the red one is with PDR. So it's shifting probability mass towards lower-entropy, more skewed distributions. In other words, the model is betting more aggressively on skewed probability distributions, which helps if the language model is in general accurate: if it is accurate, then giving a higher probability to the correct token is always good for reducing your entropy, or your perplexity, but of course it can backfire if you are wrong. In general, what's happening is that PDR pushes the model towards producing more skewed distributions, and it does so in a way that overall improves the perplexity of the models. That's essentially what's happening internally.

All right, we also did some ablation studies — perhaps too many numbers, but essentially what they show is that removing PDR hurts. Not to the extent that removing the other dropouts used in the model hurts, but that's sort of expected, because the other regularization terms act directly on the LSTM and its inputs and outputs, while the PDR term is more geared towards biasing the language model; from a different perspective, it's not directly affecting the parameters of the LSTM.

That's pretty much it. So essentially, what we propose is a new regularization term for language modeling which tries to decode the last token in the context from the next-token probability distribution, and which introduces very few extra parameters. And we show through extensive experiments that, across the board, it does help in biasing the language model in directions which lead to lower perplexity. As part of future work: we just used off-the-shelf LSTMs, and of course it would be interesting to see what happens if you use transformers or other architectures.
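For reference, the two quantities behind these analysis plots can be computed along these lines — a rough sketch under the same illustrative setup as the earlier snippets, not the authors' analysis code.

```python
import torch

def analysis_histograms(model, past_decoder, tokens):
    """Quantities behind the two plots: NLL of the actual last token under the
    past-decoded distribution, and entropy of the next-token distribution."""
    inputs = tokens[:, :-1]
    next_probs = model(inputs)                                   # \tilde{w}_{i+1}, (batch, seq, V)

    # NLL of the token actually seen at each step, under the past-decoded distribution.
    past_logp = past_decoder(next_probs)
    past_nll = -past_logp.gather(-1, inputs.unsqueeze(-1)).squeeze(-1)

    # Entropy of the next-token distribution at each step (lower entropy = more skewed).
    entropy = -(next_probs * (next_probs + 1e-12).log()).sum(-1)

    return past_nll.flatten(), entropy.flatten()   # histogram these over a validation set
```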
The other, more direct question would be: in this work we only decode the last token in the context, but of course one could argue that we could decode tokens even further back. I am skeptical that it would help very much, but that's something that needs to be checked — whether it helps to decode more than one token of the context. And of course the world has now moved on to using even larger models for language modeling, like GPT-2 and so on. It would be interesting to see whether, for such large models, a regularization like PDR can help in any substantial manner. Yeah, that's all.

Okay, thank you very much, Siddhartha, that's appreciated. If anybody has questions, please unmute yourself — it's the little red microphone icon at the bottom of your screen. And IBMers, this is an open talk and will be recorded on YouTube, so please don't ask anything that involves something confidential. So do we have any questions?

Yeah, I was wondering, do you have any sense whether this works better in constrained language, like domain-specific situations? I could imagine that somebody asking finance questions or making airline reservations has a more limited vocabulary, and the ability to predict backwards and forwards might be easier. I'm just wondering if this would help in some situations more than just general language.

That's probable. If the language is constrained enough, I guess it makes more sense not to use too large a model, because it would simply overfit too much. In that case, regularization terms would typically help more than they would on free text from any domain. So in general, the answer should be yes, it should help in such situations.

Okay, any other questions? Please unmute yourself. Does anybody else have a question?

So my question — maybe I just don't know the field well — but the perplexity measure: you need to use the likelihood function to measure the perplexity, right? But the likelihood function, when you estimate it with regularization, seems to be a biased likelihood.

No, no. The perplexity is measured on the basis of the actual next-token probability. You just compute it using the output of the softmax — that's one term of the chain rule — and you simply multiply those terms.

But the parameters are estimated through the biased process.

No, but there are two terms, right? There is one loss term that comes from the normal language modeling objective, and there is an additional loss term, and these two are added.

I understand, but you estimate the parameters of the likelihood function through regularized maximum likelihood.

Yes.

So then the likelihood that you are computing is not the maximum likelihood, it's the regularized maximum likelihood.

Yes.

So is it okay to compute the perplexity using that softmax?

No, but the perplexity is only computed with the actual language modeling term, not the regularized one.

Okay, then we should take this offline.

Okay. Yeah, yeah, yeah. Sounds like you need a chalkboard. Sounds like you need a chalkboard for that one. All right, any other questions? Okay, if not — Siddhartha, thank you very much for the presentation. Thank you very much. I want to remind everybody that we have a seminar again next week. It's called Dialogue-Based Interactive Image Retrieval, presented by Hui Wu of IBM Research, from the NeurIPS 2018 paper.
That should be the last seminar for July, and we'll probably take a break through August. So we'll see you all next week. Thank you very much. Thank you very much. There's a fire alarm.