Gradient descent in RNNs looks like it does in all of deep learning, but it can be just a little bit messy. The problem really comes because, if you get some feedback based on some output, say way down in some sequence, then the changes from that need to propagate back through the original network, through the preceding copy of the network, and the preceding copy, and the preceding copy, all the way back to wherever the relevant input came from. So maybe this label here is in some way dependent on inputs way back in time. RNNs, first of all, tend to forget what happened earlier. And there are lots of multiplications: every time you go back through a layer, and another layer, and another layer, the gradient gets multiplied again and again. So gradient descent can be a little bit messy.

There are a couple of obvious points of failure. One is relatively easy to fix: sometimes the gradients become really big, which is called exploding gradients. It helps to use a hyperbolic tangent rather than a ReLU, so the activation is actually bounded. And it's very common to use gradient clipping: if any term in the gradient gets above some number, 100,000, whatever, don't let it get any bigger, and of course don't let it get below negative 100,000 either. Gradient clipping is built into PyTorch, and it lets you avoid taking big steps when maybe you shouldn't (there's a short sketch below of how the call slots into a training loop).

The other direction is vanishing gradients, where you multiply something by something by something by something and it gets smaller and smaller. That's harder to deal with. LSTMs and GRUs are architectures that attempt to address this (also sketched below), but it can still mean that training is rather unpleasantly slow.

Now, while I'm talking about training, it's worth noting that each of the different architectures I showed you, language model, sequence-to-sequence, each one has a different unrolling, where you have sets of Xs and sets of weights flowing through hidden nodes, with loss functions on lots of Ys, and it gets messy, messy, messy. I'm not going to walk through any of that. All I want to remind you is that the math looks just like it did before. There's some loss function, which is, for example, the average over t = 1 to T of the losses on each of the outputs, the Ys (written out below). If you're doing a labeling task instead, the loss will be different; it depends on the task. Each of the hidden nodes and the outputs is computed by some function, and you take a gradient using the chain rule. The problem is that as you add more weights, more layers, and more unrolling, you get chain rules of chain rules of chain rules, and it gets messy, but it's all the same machinery. If you want to see it worked out, go look at the Dive into Deep Learning textbook; it has nice examples working through all the math. Frankly, I find the math ugly, but someone has done it, and it's built into PyTorch, which is awesome.
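To pin down the "chain rules of chain rules" remark: in standard backpropagation-through-time notation (this is the usual textbook derivation, e.g. as in Dive into Deep Learning, not anything new), with per-step loss ℓ_t, hidden state h_t, and recurrent weights W_h, the loss and its gradient look roughly like:

```latex
L = \frac{1}{T} \sum_{t=1}^{T} \ell\big(\hat{y}_t, y_t\big)
\qquad
\frac{\partial L}{\partial W_h}
  = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{t}
    \frac{\partial \ell_t}{\partial h_t}
    \Bigg( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \Bigg)
    \frac{\partial h_k}{\partial W_h}
```

That product of Jacobians ∂h_j/∂h_{j-1} is exactly where the trouble lives: if those factors are consistently larger than 1, the gradient explodes; if they're consistently smaller than 1, it vanishes.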
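Here is a minimal sketch of the clipping mentioned above, assuming a toy PyTorch setup; the model sizes, the fake data, and the threshold of 1.0 are all made-up placeholders, while clip_grad_value_ and clip_grad_norm_ are the actual torch.nn.utils functions:

```python
import torch
import torch.nn as nn

# Toy recurrent model -- sizes are arbitrary placeholders.
# nn.RNN's nonlinearity defaults to 'tanh', which keeps hidden
# activations bounded in (-1, 1), per the tanh-vs-ReLU point above.
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
params = list(model.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(4, 10, 8)   # (batch, time, features) -- fake data
y = torch.randn(4, 10, 1)

optimizer.zero_grad()
out, _ = model(x)                  # out: (batch, time, hidden)
loss = loss_fn(readout(out), y)
loss.backward()

# Clip every gradient component to [-1.0, 1.0] before the update,
# so a single exploding term can't cause a huge step.
torch.nn.utils.clip_grad_value_(params, clip_value=1.0)
# Alternative: clip the overall gradient norm instead.
# torch.nn.utils.clip_grad_norm_(params, max_norm=5.0)

optimizer.step()
```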
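And for the vanishing-gradient side, swapping the vanilla RNN for a gated unit is essentially a one-line change in PyTorch (same made-up sizes as above):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 8)  # (batch, time, features) -- fake data again

# Drop-in replacements for the vanilla nn.RNN above.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

out, (h_n, c_n) = lstm(x)  # LSTM also carries a separate cell state c_n
out, h_n = gru(x)          # GRU folds its gating into a single hidden state
```

The gates let information flow across time steps through largely additive updates rather than repeated multiplication, which is what helps gradients survive long sequences; it mitigates vanishing gradients rather than eliminating them.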