Applications of machine learning have gotten a lot of traction in the last few years. There are a couple of big categories that have had wins. One is identifying pictures, the equivalent of finding cats on the internet, and any problem that can be made to look like that. The other is sequence-to-sequence translation, which can be speech to text or one language to another. Most of the former are done with convolutional neural networks. Most of the latter are done with recurrent neural networks, particularly long short-term memory (LSTM).

To give an example of how long short-term memory works, we will consider the question of what's for dinner. Let's say for a minute that you are a very lucky apartment dweller, and you have a flatmate who loves to cook dinner. Every night he cooks one of three things: sushi, waffles, or pizza. You would like to be able to predict what you're going to have on a given night, so you can plan the rest of your day's eating accordingly. In order to predict what you're going to have for dinner, you set up a neural network. The inputs to this neural network are items like the day of the week, the month of the year, and whether or not your flatmate was in a late meeting: variables that might reasonably affect what you're going to have for dinner.

Now, if you're new to neural networks, I highly recommend you take a minute and stop to watch the How Neural Networks Work tutorial. There's a link down in the comments section. If you'd rather not do that right now and you're still not familiar with neural networks, you can think of them as a voting process. In the neural network you set up, there's a complicated voting process: all of the inputs, like day of the week and month of the year, go into it, you train it on your history of what you've had for dinner, and it learns to predict what's going to be for dinner tonight.

The trouble is that your network doesn't work very well. Despite carefully choosing your inputs and training it thoroughly, you still can't get much better than chance predictions on dinner. As is often the case with complicated machine learning problems, it's useful to take a step back and just look at the data. When you do that, you notice a pattern: your flatmate makes pizza, then sushi, then waffles, then pizza again, in a regular cycle. It doesn't depend on the day of the week or anything else.

Knowing this, we can make a new neural network. In our new one, the only input that matters is what we had for dinner yesterday. If we had pizza for dinner yesterday, it'll be sushi tonight; sushi yesterday, waffles tonight; waffles yesterday, pizza tonight. It becomes a very simple voting process, and it's right all the time, because your flatmate is incredibly consistent.

Now, if you happened to be gone on a given night, say yesterday you were out, you don't know what was for dinner yesterday. You can still predict what's going to be for dinner tonight by thinking back two days: what was for dinner then, and therefore what would have been predicted for last night. You can then use that prediction in turn to make a prediction for tonight. So we make use of not only our actual information from yesterday, but also our prediction from yesterday.

At this point it's helpful to take a little detour and talk about vectors. A vector is just a fancy word for a list of numbers.
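To make the recycled-prediction idea concrete, here's a minimal sketch in Python. It encodes exactly the rule described above: yesterday's dinner determines tonight's, and if we don't know yesterday's actual dinner, we fall back on yesterday's prediction. The dictionary and function names are just illustrative.

```python
# A minimal sketch of the dinner predictor described above.
# The cycle pizza -> sushi -> waffles -> pizza means yesterday's dinner
# fully determines tonight's dinner.
NEXT_DINNER = {"pizza": "sushi", "sushi": "waffles", "waffles": "pizza"}

def predict_tonight(yesterday_actual=None, yesterday_prediction=None):
    """Use what we actually saw yesterday if we have it;
    otherwise fall back on what we *predicted* for yesterday."""
    yesterday = yesterday_actual if yesterday_actual is not None else yesterday_prediction
    return NEXT_DINNER[yesterday]

# We were home two days ago (pizza), but out last night.
prediction_for_last_night = predict_tonight(yesterday_actual="pizza")                      # "sushi"
prediction_for_tonight = predict_tonight(yesterday_prediction=prediction_for_last_night)   # "waffles"
print(prediction_for_tonight)
```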
If I wanted to describe the weather to you for a given day, I could say the high is 76 degrees Fahrenheit, the low is 43, the wind is 13 miles an hour, there's going to be a quarter inch of rain, and the relative humidity is 83%. That's all a vector is. The reason it's useful is that vectors, lists of numbers, are a computer's native language. If you want to get something into a format that is natural for a computer to compute with, to do operations on, to do statistical machine learning, lists of numbers are the way to go. Everything gets reduced to a list of numbers before it goes through an algorithm.

We can also have a vector for a statement like "it's Tuesday." To encode this kind of information, we make a list of all the possible values it could have, in this case all the days of the week, and we assign a number to each. Then we go through and set them all equal to zero, except for the one that is true right now. This format is called one-hot encoding, and it's very common: a long vector of zeros with just one element set to one. It seems inefficient, but for a computer this is a much easier way to ingest the information.

We can make a one-hot vector for our prediction for dinner tonight in the same way. We set everything equal to zero except for the dinner item that we predict; in this case we'll be predicting sushi. Now we can group our inputs and outputs into vectors, separate lists of numbers, and it becomes a useful shorthand for describing this neural network. We have our dinner-yesterday vector, our predictions-for-yesterday vector, and our prediction-for-today vector, and the neural network is just connections from every element in each of those input vectors to every element in the output vector.

To complete our picture, we can show how the prediction for today gets recycled. The dotted line there means hold on to it for a day and then reuse it tomorrow, where it becomes yesterday's prediction. Now we can see how, if we were lacking some information, say we were out of town for two weeks, we could still make a good guess about what's going to be for dinner tonight. We just ignore the new-information part, and we can unwrap or unwind this network in time until we do have some information to base it on, and then play it forward. When it's unwrapped, it looks like this. We can go back as far as we need to, see what was for dinner then, trace it forward, and play out our menu over the last two weeks until we arrive at what's for dinner tonight.

So that was a nice simple example that showed recurrent neural networks. Now, to show how they don't meet all of our needs, we're going to write a children's book. It'll have sentences of the format "Doug saw Jane, period. Jane saw Spot, period. Spot saw Doug, period," and so on. Our dictionary is small: just the words Doug, Jane, Spot, saw, and a period. The task of the neural network is to put these together in the right order to make a good children's book. To do this, we replace our food vectors with our dictionary vectors. Here again, it's just a list of numbers representing each of the words. For instance, if Doug was the most recent word that I saw, my new-information vector would be all zeros except for a one in the Doug position. We can similarly represent our predictions and our predictions from the previous time step. Now, after training this neural network and teaching it what to do, we would expect to see certain patterns.
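As a concrete illustration of one-hot encoding, here is a small sketch using the vocabularies from the example; the helper function is just hypothetical scaffolding, not part of any particular library.

```python
# A minimal sketch of one-hot encoding, using the vocabularies from the example.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
DINNERS = ["sushi", "waffles", "pizza"]

def one_hot(item, vocabulary):
    """Return a vector of zeros with a single one in the position of `item`."""
    return [1.0 if entry == item else 0.0 for entry in vocabulary]

print(one_hot("Tuesday", DAYS))   # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot("sushi", DINNERS))  # [1.0, 0.0, 0.0]
```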
For instance, anytime a name comes up, Jane, Doug, or Spot, we would expect it to vote heavily for the word saw or for a period, because those are the two words in our dictionary that can follow a name. Similarly, if we had predicted a name on the previous time step, we would expect that prediction to vote for the word saw or for a period as well. And by the same logic, anytime we come across the word saw or a period, we know that a name has to come next, so the network will learn to vote very strongly for a name: Jane, Doug, or Spot.

In this formulation, we have a recurrent neural network. For simplicity, I'll take the vectors and the weights and collapse them down to that little symbol with the dots and the lines connecting them. There's one more symbol we haven't talked about yet: a squashing function, which just helps the network behave. How it works is you take all of your votes coming out and subject them to this squashing function. For instance, if something received a total vote of 0.5, you draw a vertical line up to where it crosses the function, then a horizontal line over to the y-axis, and there's your squashed version. For small numbers, the squashed version is pretty close to the original, but as your number gets larger, the number that comes out is closer and closer to 1. Similarly, if you put in a big negative number, what comes out will be very close to minus 1. No matter what you put in, what comes out is between minus 1 and 1.

This is really helpful when you have a loop like this, where the same values get processed again and again, day after day. You can imagine that if, in the course of that processing, something got voted for twice, so it got multiplied by 2, it would get twice as big on every pass and very soon blow up to be astronomical. By ensuring that it's always less than 1 and more than minus 1, you can multiply it as many times as you want; you can go around that loop and it won't explode. This is an example of negative feedback, or attenuating feedback.

Now, you may have noticed that our neural network in its current state is subject to some mistakes. We could get a sentence of the form "Doug saw Doug, period," because Doug strongly votes for the word saw, which in turn strongly votes for a name, any name, which could be Doug. Similarly, we could get something like "Doug saw Jane, saw Spot, saw Doug." Because each of our predictions only looks back one time step, the network has a very short memory; it doesn't use information from further back, and it's subject to these kinds of mistakes.

In order to overcome this, we take our recurrent neural network and expand it, adding some more pieces. The critical part we add in the middle is memory. We want to be able to remember what happened many time steps ago. To explain how this works, I'll have to describe a few new symbols we've introduced. One is another squashing function, this one with a flat bottom. One is an X in a circle, and one is a cross in a circle. The cross in a circle is element-by-element addition. The way it works is you start with two vectors of equal size and go down each one: you add the first element of one vector to the first element of the other, and the total goes into the first element of the output vector. So 3 plus 6 equals 9. Then you go to the next element: 4 plus 7 equals 11. Your output vector ends up the same size as each of your input vectors, just a list of numbers, same length, but it's the sum, element by element, of the two.
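Two of the pieces just described are easy to illustrate numerically: the squashing function (the hyperbolic tangent), which keeps a value that keeps getting amplified around the loop from blowing up, and element-by-element addition. A minimal sketch with made-up numbers:

```python
import math

# Squashing: a value that gets doubled on every pass through the loop
# explodes without squashing, but stays between -1 and 1 with tanh.
raw, squashed = 0.5, 0.5
for _ in range(10):
    raw = raw * 2.0                       # no squashing: doubles every pass
    squashed = math.tanh(squashed * 2.0)  # squashed: always in (-1, 1)
print(raw, squashed)                      # 512.0 vs. roughly 0.96

# Element-by-element addition (the cross in a circle).
a, b = [3.0, 4.0], [6.0, 7.0]
print([x + y for x, y in zip(a, b)])      # [9.0, 11.0]
```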
Very closely related to this, as you've probably guessed, the X in the circle is element-by-element multiplication. It's just like addition, except instead of adding, you multiply: 3 times 6 gives you a first element of 18, and 4 times 7 gives you 28. Again, the output vector is the same size as each of the input vectors.

Element-wise multiplication lets you do something pretty cool. Imagine that you have a signal, like a bunch of pipes with a certain amount of water trying to flow down them; in this case we'll assign that signal the number 0.8. On each of those pipes we have a faucet, and we can open it all the way, close it all the way, or keep it somewhere in the middle, to either let the signal come through or block it. An open faucet is a 1 and a closed faucet is a 0. The way this works with element-wise multiplication: 0.8 times 1 equals 0.8, so the signal passes right through into the output vector. 0.8 times 0 equals 0, so the original signal is effectively blocked. And with a gating value of 0.5, the signal is passed through, but smaller; it's attenuated. So gating lets us control what passes through and what gets blocked, which is really useful.

Now, in order to do gating, it's nice to have a value that you know is always between 0 and 1. So we introduce another squashing function, represented by the symbol with the flat bottom, called the logistic function. It's very similar to the other squashing function, the hyperbolic tangent, except that it goes between 0 and 1 instead of minus 1 and 1.

When we put all of these together, we still have the combination of our previous predictions and our new information. Those vectors get passed in and we make predictions based on them. Those predictions get passed through, but the other thing that happens is that a copy of those predictions is held onto for the next time step, the next pass through the network. Some of them, here's a gate right here, are forgotten, and some are remembered; the ones that are remembered are added back into the prediction. So now we have not just predictions, but predictions plus the memories that we've accumulated and haven't chosen to forget yet. There's an entirely separate neural network here that learns when to forget what: based on what we're seeing right now, what do we want to remember, and what do we want to forget? You can see this is powerful. It lets us hold on to things for as long as we want.

Now, you've probably noticed that when we combine our predictions with our memories, we may not necessarily want to release all of those memories as new predictions each time. So we want a little filter to keep our memories inside and let our predictions get out. We add another gate for that, to do selection. It has its own neural network, its own voting process, so that our new information and our previous predictions can be used to vote on what all the gates should be: what should be kept internal and what should be released as a prediction. We've also introduced another squashing function here, because we do an addition, and it's possible the result could end up bigger than 1 or smaller than minus 1.
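Gating is equally easy to see in code. Here is a minimal sketch using the faucet numbers from the example; the logistic function is just the standard formula, and none of these values come from a trained network.

```python
import math

def logistic(x):
    """The flat-bottomed squashing function: output is always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

signal = [0.8, 0.8, 0.8]   # the same amount of water trying to flow in each pipe
gates  = [1.0, 0.0, 0.5]   # open faucet, closed faucet, half-open faucet

gated = [s * g for s, g in zip(signal, gates)]
print(gated)               # [0.8, 0.0, 0.4]: passed, blocked, attenuated

# In the LSTM, the gate values themselves come from a small neural network
# whose raw votes are squashed by the logistic function so they land in (0, 1).
print(logistic(-3.0), logistic(0.0), logistic(3.0))  # ~0.05, 0.5, ~0.95
```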
So we squash it, just to be careful, to make sure it never gets out of control.

Now, when new information comes in, we generate a lot of possible predictions, we accumulate those together with memory over time, and of all of those possibilities, at each time step we select just a few to release as the prediction for that moment. Each of these things, what to forget and what to let out of our memory, is learned by its own neural network. The only other piece we need to add to complete the picture is yet another set of gates. This lets us ignore possible predictions, possibilities, as they come in. It's an attention mechanism: it lets things that aren't immediately relevant be set aside so they don't cloud the predictions and memory going forward. It has its own neural network, its own logistic squashing function, and its own gating activity right here.

Now, long short-term memory has a lot of pieces, a lot of bits that work together, and it's a little much to wrap your head around all at once. So what we'll do is take a very simple example and step through it, just to illustrate how a couple of these pieces work. It's admittedly an overly simplistic example, so feel free to poke holes in it later. When you get to that point, you'll know you're ready to move on to the next level of material.

We are now writing our children's book, and for the purposes of demonstration we'll assume that this LSTM has been trained on examples of the children's books we want to mimic: all of the appropriate votes and weights in those neural networks have been learned. Now we'll show it in action.

Our story so far is "Jane saw Spot, period. Doug." So Doug is the most recent word that's occurred in our story, and, not surprisingly, on this time step the names Doug, Jane, and Spot were all predicted as viable options. This makes sense: we just wrapped up a sentence with a period, and the new sentence can start with any name, so these are all great predictions.

So we have our new information, which is the word Doug, and we have our recent prediction, which is Doug, Jane, and Spot. We pass these two vectors together to all four of our neural networks, which have learned to make predictions, to do ignoring, to do forgetting, and to do selection. The first of these makes some predictions. Given that the word Doug just occurred, it has learned that the word saw is a great guess for the next word, but it has also learned that, having just seen the word Doug, it should not see the word Doug again very soon. So it makes a positive prediction for saw and a negative prediction for Doug; it says, I do not expect to see Doug in the near future. That's why Doug is shown in black.

This example is so simple that we don't need to focus on attention or ignoring, so we'll skip over it for now, and this prediction of saw, not Doug, is passed forward. Again, for simplicity, let's say there's no memory at the moment, so saw and not-Doug get passed straight through. The selection mechanism has learned that when the most recent word was a name, what comes next is going to be either the word saw or a period, so it blocks any other names from coming out. The vote for not-Doug gets blocked here, and the word saw comes out as the prediction for the next time step.
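Here is a toy sketch of that last selection step, with made-up vote numbers. Nothing here comes from a trained network; it just shows how the gating arithmetic plays out over the little dictionary.

```python
# A toy walk-through of the selection step described above, with made-up numbers.
VOCAB = ["Doug", "Jane", "Spot", "saw", "."]

# Candidate predictions after seeing "Doug": a strong vote for "saw",
# a negative vote for "Doug" ("I do not expect to see Doug again soon").
candidates = [-0.9, 0.0, 0.0, 0.9, 0.1]

# Selection gate: the last word was a name, so block names and let
# "saw" and "." pass (gate values are illustrative, in the range 0..1).
selection = [0.0, 0.0, 0.0, 1.0, 1.0]

released = [c * g for c, g in zip(candidates, selection)]
best = VOCAB[released.index(max(released))]
print(released)  # [-0.0, 0.0, 0.0, 0.9, 0.1]
print(best)      # "saw" is the prediction for the next word
```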
So we take a step forward in time. Now the word saw is our most recent word, and it and our most recent prediction get passed forward to all of these neural networks, and we get a new set of predictions. Because the word saw just occurred, we now predict that the words Doug, Jane, or Spot might come next. We'll pass over ignoring and attention in this example and take those predictions forward. The other thing that happens is that our previous set of possibilities, the word saw and not-Doug, which we were maintaining internally, gets passed to a forgetting gate. The forgetting gate says: hey, the last word that occurred was saw, so I can forget my prediction about saw, I know it already happened, but I want to keep any predictions having to do with names. So it forgets saw and holds on to the vote for not-Doug. Now, at the element-by-element addition, we have a positive vote for Doug and a negative vote for Doug, and they cancel each other out, leaving votes only for Jane and Spot. Those get passed forward. Our selection gate knows that the word saw just occurred and that, based on experience, a name will come next, so it passes through these predictions for names. For the next time step, then, we get predictions of only Jane and Spot, not Doug. This avoids the "Doug saw Doug, period" type of error, and the other errors we saw.

What this shows is that long short-term memory can look back two, three, many time steps and use that information to make good predictions about what's going to happen next. To be fair to vanilla recurrent neural networks, they can actually look back a few time steps as well, just not very many. LSTM can look back many time steps, and it has been shown to do so successfully.

This is really useful in some surprisingly practical applications. If I have text in one language and I want to translate it to text in another language, LSTMs work very well, even though translation is not a word-to-word process; it's a phrase-to-phrase, or in some cases even a sentence-to-sentence, process. LSTMs are able to represent those grammar structures that are specific to each language, and what it looks like is that they find the higher-level idea and translate it from one mode of expression to another, just using the bits and pieces we walked through. Another thing they do well is translating speech to text. Speech is just a signal that varies in time; an LSTM takes that signal and uses it to predict what text, what word, is being spoken, and it can use the recent history of words to make a better guess for what's coming next. LSTMs are a great fit for any information that's embedded in time: audio, video, and of course robotics. Robotics is nothing more than an agent taking in information from a set of sensors and then, based on that information, making a decision and carrying out an action. It's inherently sequential, and actions taken now can influence what is sensed and what should be done many time steps down the line.

If you're wondering what LSTMs look like in math, this is it; it's lifted straight from the Wikipedia page (the standard equations are reproduced just below). I won't step through it, but it's encouraging that something that looks so complex expressed mathematically actually makes a fairly straightforward picture and story. If you'd like to dig into it more, I encourage you to go to the Wikipedia page. There is also a collection of really good tutorials and discussions, other ways of explaining LSTMs, that you may find helpful, and I'd strongly encourage you to visit Andrej Karpathy's blog post showing examples of what LSTMs can do with text.
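For reference, these are the standard LSTM cell equations as they are commonly written (for example on the Wikipedia page mentioned above). The gates roughly correspond to the story's forgetting, ignoring/attention, and selection networks; the exact correspondence of names is my gloss, not something the equations themselves dictate.

```latex
% x_t: new information, h_{t-1}: previous prediction, c_t: memory,
% \sigma: logistic squashing, \tanh: the -1..1 squashing, \circ: element-wise product.
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{forgetting gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{ignoring / attention gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{selection gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{candidate predictions} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t &&\text{memory update} \\
h_t &= o_t \circ \tanh(c_t) &&\text{released prediction}
\end{aligned}
```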
And if you haven't seen it yet, take a look at the video on how neural networks work to get some more details on exactly how you go about implementing something like this in code. Thanks for tuning in, and I wish you a lot of luck on your next project building with recurrent neural networks.