Now that we've seen the basic architectures of recurrent neural nets, let's look in a little more depth at the classic one, the language model. Remember that the language model takes in a sequence of words, often a start symbol followed by something like "I went to", and for each word it embeds that word as some vector x using GloVe, fastText, or word2vec. It takes that embedding and the preceding hidden state, feeds them to a neural net, predicts the next hidden state, and then uses that to predict the next word, which is then used to score the model.

So let's look at one little piece of this ongoing network to make sure we're clear on the pieces. We're taking in the preceding hidden state plus the embedding of the current word. So if we're looking at "I went to", "went" is embedded as, say, a 300-dimensional vector; the previous hidden state and that embedding go into a neural net, some function f, often a deep neural net, which produces the next hidden state. The next hidden state then goes into another neural net g, which produces a softmax, which is to say a probability distribution over all of the possible next words. We then observe the actual next word. Here the next word is "to", as in "I went to", and we get an error, a loss, based on the log of the probability we estimated for "to". We can update the weights using backpropagation and then repeat the process: we take the hidden state h1 and the embedding of the next word, put them into the same neural net f, get a new hidden state, and repeat the process over and over and over. This is done on lots of words, typically hundreds of millions or even billions of words, and at the end we have a recurrent neural net that is a good language model.
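
To make those pieces concrete, here is a minimal sketch in Python of a single step of that loop: embed the current word, combine it with the previous hidden state to get the next hidden state, take a softmax over the vocabulary, and compute the loss from the log probability of the observed next word. This is only an illustration under simplifying assumptions: f is reduced to a single tanh layer, g to a linear layer plus softmax, and the tiny vocabulary and random weights are hypothetical placeholders, not a trained model.

```python
# Minimal sketch of one step of the recurrent language model described above.
# All parameters here are random placeholders; in practice they are learned
# by backpropagation over hundreds of millions of words.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "I", "went", "to", "the", "store"]   # toy vocabulary
vocab_size = len(vocab)
embed_dim, hidden_dim = 300, 128                     # e.g. 300-dim word embeddings

E   = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # embedding table
W_x = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> vocab scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(h_prev, word):
    """One step: embed the word, compute the next hidden state,
    and return a probability distribution over the next word."""
    x = E[vocab.index(word)]             # embed the current word
    h = np.tanh(W_x @ x + W_h @ h_prev)  # f: previous hidden state + embedding -> next hidden state
    p = softmax(W_o @ h)                 # g: softmax distribution over the vocabulary
    return h, p

# Run "<s> I went" through the net, then score the observed next word "to".
h = np.zeros(hidden_dim)
for w in ["<s>", "I", "went"]:
    h, p = step(h, w)

loss = -np.log(p[vocab.index("to")])     # loss = -log P("to" | "I went")
print(f"loss for 'to': {loss:.3f}")
```

With random weights the loss is roughly -log(1/6); training would push the probability of "to" up and the loss down, one step at a time, exactly as described above.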