Most people using recurrent neural nets don't use simple recurrent neural nets. They use ones with some sort of gating: gated recurrent units, or LSTMs, long short-term memory networks. These allow for somewhat better memory of things that happened much earlier in the sequence. I'll look in some detail at how these work. There are a bunch of different variations, but they all share the same idea.

Now, the fundamental problem with simple recurrent neural nets is that they forget things. If you have an input x at time 0, it gets multiplied by a weight matrix, put through a hidden node, then passed through more multiplications and layers again and again and again, and by the time we try to produce an output that should depend on it, it has been multiplied by a whole bunch of matrices along the way. Every time you multiply something by a matrix, it tends to forget. People who have seen hidden Markov models will remember that each time you multiply by the Markov transition matrix, what you knew before shrinks by a factor of the second-largest eigenvalue (the largest one being 1). Basically, these networks tend to forget exponentially quickly: each multiplication makes them remember less about what they knew before. Multiply, multiply, multiply; if the factor is 0.9, after 10 steps it's 0.9 to the 10th and you've forgotten it.

So how can we fix that problem without building a real memory unit, like a neural Turing machine, where we store things somewhere that they stay forever and retrieve them? We're not going to build SQL queries in this course, where we store things in a database and retrieve them; we want something that's differentiable but has that style of storing things.

So let's think of a simple task to illustrate where this comes into play. Imagine you have a relevant sequence of length L, then a bunch of irrelevant stuff of length T, and then the answer. What you want the neural net to figure out automatically is that all of the junk in the irrelevant part is irrelevant: remember the stuff in the initial part of length L, tuck it away in the state, and then much later retrieve it. You want to store stuff and pull it back again, all within something that's 100% solvable by gradient descent.

Okay, so let's take that again. We want the neural net to say: we've got some hidden states, take these hidden states and store them away. It needs to learn when to store them, when to retrieve them, and when to forget them and replace them with something else. Note that this feels like discrete logic: store things, retrieve things. We can approximate yes-no logic with something like a hyperbolic tangent, pushing things toward minus one or plus one, and estimate the whole thing with gradient descent. It's a little tricky for the descent, because you're approximating something that, if it really were discrete logic, could in some weird sense be NP-hard. But it does allow you to store things and remember them in a differentiable way, so you can use gradient descent. And that's really the magic.

Okay, so let's look through all of the gory details of the components in an LSTM network. Think of one copy of my unrolled network here: there are three copies of network A, and each of them takes in the current x and the hidden state from the previous one.
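To make that exponential forgetting concrete, here's a minimal sketch of a purely linear recurrence: a signal injected at time 0 and then just multiplied by the same matrix at every step. The matrix, its size, and the 0.9 scaling are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random recurrent matrix, rescaled so its largest eigenvalue magnitude is 0.9.
W = rng.normal(size=(32, 32))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

h = rng.normal(size=32)          # the "signal" injected at time 0
for t in range(1, 31):
    h = W @ h                    # one linear recurrent step: no new input, no nonlinearity
    if t % 10 == 0:
        print(f"step {t:2d}: |h| = {np.linalg.norm(h):.4f}")
```

The norm shrinks roughly geometrically, like 0.9 to the t. With a tanh nonlinearity and real inputs the picture is messier, but this same shrink-per-step effect is what makes long-range dependencies hard for a simple recurrent net.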
The simplest version of a standard recurrent net takes in x, multiplies it by a matrix, takes in h of t minus one, multiplies it by a matrix, adds the two together plus a bias term, and passes it through a hyperbolic tangent. That gives a new hidden state to pass on to the next time step. Cool, that's what we've seen before. But now we want to put a whole bunch of other stuff in the middle, which I'm going to walk you through over a number of slides. Inside here will be a whole bunch of little functions, some of which are point-wise operations: you take two vectors and multiply them element by element. Some of them are just vector copies, transfers. Some of them split things and send them to different places, like the hidden output that goes both here and out to the next step. And some things are concatenated, like concatenating x with the hidden output from the preceding step. We'll see each of these coming together.

So let's step through it. Imagine we want to store a one if we've just seen something — say, if the last word was "puppy" — and store a zero otherwise. So we need to decide: should we store something? We take our inputs, the current x and the hidden state from last time, concatenate the previous hidden state h and the x, multiply by a weight matrix, add a bias term, and pass it through a sigmoidal unit. We get something we're going to call the forgetting term f: should we forget the previous value, or should we keep remembering it? So "should I forget the value in the cell" is some function of the input x and the preceding state.

Along with "should we forget the last thing," there's "should we remember something new." So alongside the forgetting gate f, we have an i for input. This i looks exactly the same: it's also a sigmoidal function, with a different weight matrix, again multiplied by the concatenation of the current x and the preceding hidden state, plus a new bias term. That's "should I remember this new value?" And then the question is, what should I remember? Well — this looks familiar — it's again the concatenation of the current x and the preceding hidden state, multiplied by a weight matrix, plus a bias term, passed through a hyperbolic tangent. It could have been a sigmoid instead of a hyperbolic tangent; people like these, and there are lots of variations. But this is all sort of binarizing things: pushing toward store or don't store, forget or don't forget — something toward a one, or something away from a one.

And so now what are we going to do? Our new value of the cell is going to be the forgetting term f — the output of that sigmoidal function — point-wise multiplied, for each element of the state, by the preceding C we had (the hidden-value-style thing from before), plus i, the input gate, which is also computed by the neural net, times the new C-tilde we just computed from the inputs. So we've got something that says: the new value of C is the old value of C times the forgetting, plus the new value times the input. Again, note that f, i, and C-tilde are all computed as functions of x and h of t minus one, and that the star here is point-wise multiplication for each of the elements of this thing, which is sort of like a hidden state, but it's not.
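Here's a minimal NumPy sketch of the cell-state update just described (the output gate comes next). The function and weight names — lstm_cell_update, W_f, W_i, W_c and the biases — are placeholders of my own, not from any particular library.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell_update(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c):
    """One step of the LSTM cell-state update: C_t = f_t * C_{t-1} + i_t * C~_t."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]: the shared input to every gate

    f_t = sigmoid(W_f @ z + b_f)             # forgetting term: how much of the old memory to keep
    i_t = sigmoid(W_i @ z + b_i)             # input term: how much of the new candidate to write
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate values that might go into memory

    return f_t * C_prev + i_t * C_tilde      # point-wise: keep some old memory, add some new
```

Each weight matrix maps the concatenated vector of length hidden_size + input_size down to a vector of length hidden_size, so f_t, i_t, C_tilde, and C_prev all have the same shape and can be multiplied element by element.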
We have the hidden state, which we kept, and we have C, which is our memory. So: old stuff in memory, new stuff to be added to memory, new memory going out to the next time step. And of course we're going to unroll this in time, so this same network shows up over and over again.

So what do we have then? We now take this piece here: our hidden state for this time step, h, will be some o — once again a sigmoidal function of the same concatenated x and h, multiplied by yet another weight matrix, plus a bias — times the hyperbolic tangent of C, which is our memory. So what do we do? Our hidden state is something we've computed based on the input, point-wise multiplied, for each element of the hidden state, by something based on our memory. We've computed, for each of these elements, how much to let through. Now we have a hidden state, which we can pass to some output to predict maybe a label and pass to the next time step, and we have a C which we've updated.

Put the whole thing together and there's one mongo set of equations, all of which look pretty much the same but with different weights in them. We have a forgetting factor based on the inputs: how much should I forget the memory? We have a remembering factor: how much should I remember? And we have this output factor: how much should I output? Oh, and just to be fancy, we've added in one more term: the memory itself. So how much to forget depends not just on the input but also on the remembered part, how much new stuff to store depends not just on the input but also on what we've remembered from before, and how much to output depends on the memory as well. So we've got ourselves something cool that uses both the inputs and the memory, a super flexible functional form that we solve entirely with gradient descent. And it works really well.

Does one need all of this fancy forgetting and remembering and storing, or could one do some other arbitrary, similar sort of thing with a bunch of sigmoidal functions and hyperbolic tangents in it? And the answer is: yep, it's deep learning, you can do lots of different things. One simple version says that instead of a separate forgetting and storage gate, you can use the forgetting and one minus the forgetting as the storage. So now we don't have a separate piece there; it still allows deleting and writing, but it doesn't have quite the same flexibility. Lots of people, instead of using LSTMs, use gated recurrent units, GRUs, which look a lot like LSTMs but have a couple fewer moving parts. They look more like this last variation: instead of separate forgetting and remembering gates there's a z and a one minus z, and they work directly on the hidden state rather than on a separate memory C. So now we're taking the hidden state, remembering a certain amount of it, and forgetting and updating the rest. It all looks similar and yet a little bit different.

Personally, I'm not too attached to any one of these architectures. The LSTM is as good as any other, but people optimize these and try different forms. There are also variations we'll talk about later, like bidirectional LSTMs, but for now, give it a try: run the LSTM, see how it works. In general, they work much better than the simpler recurrent neural nets.
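For comparison, here's an equally minimal sketch of the GRU update just described, where the z / one-minus-z trade-off acts directly on the hidden state. The names gru_cell, W_z, W_r, W_h are placeholders, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: no separate memory C, the gates act on the hidden state itself."""
    zc = np.concatenate([h_prev, x_t])

    z_t = sigmoid(W_z @ zc)                  # update gate: the z / (1 - z) split
    r_t = sigmoid(W_r @ zc)                  # reset gate: how much past state feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate hidden state

    return (1.0 - z_t) * h_prev + z_t * h_tilde   # blend: keep some old state, take some new
```

Fewer gates and no separate cell state means fewer parameters per unit; in practice both this and the LSTM tend to train well, which is why the advice here is simply to try one and see how it does.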