Let's start by reviewing the basic RNN architecture, one that many of you have seen before, before we get fancier. The core idea is that at each time step we take an input x_t, for example a word embedding. We feed it, together with the hidden state output from the preceding time step, h_{t-1} (at time zero this is just all zeros), into a neural net. Now, here I've shown the dumbest possible neural net: one weight matrix, W_xh, times the current observation x_t, plus one weight matrix, W_hh, times the preceding hidden state. You could also add a bias term, which I haven't shown. We pass that sum through some transformation function; people often use hyperbolic tangents instead of ReLUs here, because the gradients seem to be a little more stable, but you could use either one. That gives you the new hidden state.

That new hidden state is passed on to the future: the next copy of the neural network takes it in along with the next input, x_{t+1}. The same hidden state h_t is also processed to make some prediction. In the simplest model you just take the softmax of it, or run it through another layer of a neural net, to get an output, which might, for example, be a prediction of x_{t+1}. So what do we have? A simple feed-forward neural network inside, taking in the current observation x_t and the preceding hidden state, outputting a new hidden state, and, through one more function, an output. And again, where I've shown the hyperbolic tangent and the softmax, you could put as many neural net layers as you want.

So what does this look like in practice? When we want to unroll this, those of you who know hidden Markov models (which is most of you) will find it incredibly familiar, in the sense of unrolling something with a hidden state, except that the hidden states here are real-valued vectors, rather than one-hot as in a hidden Markov model, or real-valued scalars as in a Kalman filter. So what happens? We take the initial hidden state h_0, plus the first observation, pass them through the neural net, compute a new hidden state and an output, and keep repeating, repeating, repeating, with exactly the same weights in each copy of the neural net.

Note that when the output o_{t+1} differs from the target y_{t+1}, we get an error term. That error term affects the weights in the neural net at that step through backpropagation; it also tells us how much the incoming hidden state should have been different, which affects the same weights again at the previous step, which in turn affects what that step's hidden input should have been, which affects the weights again one step further back. So in this unrolling through time, one output at t+1 is indirectly a function of x_t, x_{t-1}, x_{t-2}, and so on, and therefore the same weights, because each weight is identical across the copies of the neural net, get changed multiple times by gradient descent.

The whole thing is a mess. It's just the chain rule, but I strongly discourage you from trying to write this backpropagation-through-time algorithm yourself at home; happily, the people behind PyTorch have already implemented it. I knew some folks at Amazon who spent six months trying to get this right, and then finally they just downloaded TensorFlow and, oh, it worked beautifully. So it's a mess to get exactly right, but it's doable. Cool.
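To make this concrete, here is a minimal sketch of the cell just described, in PyTorch. The names (TinyRNN, W_xh, W_hh, W_hy) are mine for illustration, not from the lecture or any library. The loop reuses the same weights at every time step, and calling .backward() on the loss is exactly the backpropagation through time described above; autograd handles the chain rule so you don't have to write it yourself.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    # A minimal sketch of the cell described above; all names here are illustrative.
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)  # weight on the current observation x_t
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)  # weight on h_{t-1}, plus the bias term not shown on the slide
        self.W_hy = nn.Linear(hidden_size, output_size)             # readout to the output

    def forward(self, xs):
        # xs has shape (seq_len, batch, input_size); h_0 is all zeros
        h = torch.zeros(xs.size(1), self.hidden_size, device=xs.device)
        logits = []
        for x_t in xs:  # exactly the same weights at every time step
            h = torch.tanh(self.W_xh(x_t) + self.W_hh(h))  # new hidden state
            logits.append(self.W_hy(h))                    # pre-softmax output
        return torch.stack(logits), h

# Backprop through time is just loss.backward(); autograd unrolls the chain rule.
rnn = TinyRNN(input_size=32, hidden_size=64, output_size=10)
xs = torch.randn(5, 8, 32)              # 5 time steps, batch of 8
targets = torch.randint(0, 10, (5, 8))
logits, _ = rnn(xs)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))
loss.backward()                         # gradients flow back through every time step
```

Note that cross_entropy applies the softmax internally, which is why the loop collects raw logits rather than probabilities.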
And as I mentioned before, there's no reason the hidden state has to be a simple hyperbolic tangent of weighted x and h. It can be any function f, right? A multi-layer deep neural network, say. So you can put any neural net in for f, and where I have the softmax of h, you can put any neural net there too. Lots of architectural choices, and they're all trained, as always, by gradient descent. In future sections, we'll see lots of variations on this simple architecture.
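As one illustration of that flexibility, here is a hypothetical variation on the TinyRNN sketch above (imports already in scope), where the single tanh layer is swapped for a small two-layer MLP as the transition function f; the training story is unchanged.

```python
class DeepCellRNN(nn.Module):
    # Same recurrence, but f is now a two-layer MLP instead of a single
    # tanh layer (names again illustrative, not from the lecture).
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.f = nn.Sequential(
            nn.Linear(input_size + hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
        )
        self.W_hy = nn.Linear(hidden_size, output_size)

    def forward(self, xs):
        h = torch.zeros(xs.size(1), self.hidden_size, device=xs.device)
        logits = []
        for x_t in xs:
            h = self.f(torch.cat([x_t, h], dim=-1))  # any neural net works as f
            logits.append(self.W_hy(h))
        return torch.stack(logits), h
```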