This week, we're going to cover recurrent neural nets in great detail, but if you actually want to see lots of the math worked out, I'd encourage you to follow the links to Dive into Deep Learning, d2l.ai, or the other links in the notebook. I don't like talking through lots of equations here, so I'm going to try to convey the intuition. You'll get the coding in PyTorch, and hopefully it will all make sense, but there is plenty of textbook-y material if you want it.

Cool. So what's the inspiration? We're going to have a sequence of observations, x1 through xm. Think of words: "Now is the time for..." and so forth. We want to take that sequence so far, feed it into a network, and get some sort of output, like a label. Maybe the label is happy or sad, liked the movie or didn't like the movie. Eventually we might produce something that is a whole response, like "could you please clarify," or a translation from English to Chinese, but for the moment, think of labeling things.

The problem is, first of all, that this sequence of words is of arbitrary length, longer or shorter, and secondly, that the network can get big if we don't make assumptions similar to those made with convolutional neural nets. So the assumption we're going to make is that rather than feeding all the words of a sentence into one neural network, we feed the first word into some network to produce a hidden state; a second network structure then takes in that hidden state and the second word and produces a new hidden state, and so on, until eventually the last word and the last-but-one hidden state come in and produce the label. So we want some sort of structure that factors things, that breaks things up. But as stated this is not going to work, because we can't have a separate neural net for each position. So instead we'll have a simpler architecture that ties these networks together.

What we believe, and will assume, about the world is that the transition functions that describe it don't change from moment to moment. The world is stationary. This is pretty much the same assumption as having a translationally invariant filter in a convolutional net: the same feature detector is applied at every place. When you make that assumption, you get a nice architecture: take in the first word, run it through a neural net, and produce a hidden state; take in the second word and that hidden state, and produce a new hidden state; and so on recursively, each step taking a hidden state and a word. Note that the first step is a little bit odd, so we need a trivial h0, a hidden state that carries nothing. Every time step then looks exactly the same as every other: it takes in the hidden state from the preceding time step and a new observation, and produces a new hidden state. At the very end, the penultimate hidden state and the final observation produce the final hidden state, which is then transformed, usually with a softmax, to make a prediction. So this is the basic recurrent neural net architecture, and we'll see in detail how to make this happen.

One trick that we're going to see over and over is that we want to train neural nets on lots of unlabeled data, and then use those networks on tasks where we do have labels but far less labeled data.
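To make the recurrence concrete, here is a minimal PyTorch sketch of the architecture just described: one set of weights reused at every time step, a trivial h0 of zeros, and a softmax-style classifier on the final hidden state. The class and parameter names (SimpleRNNClassifier, vocab_size, hidden_size, and so on) are illustrative assumptions, not the course's actual code.

```python
import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # The same weights are reused at every time step: the "stationary world" assumption.
        self.W_xh = nn.Linear(embed_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word ids
        x = self.embed(tokens)                                   # (batch, seq_len, embed_size)
        h = torch.zeros(x.size(0), self.W_hh.in_features,
                        device=x.device)                         # trivial h0 carrying nothing
        for t in range(x.size(1)):
            # new hidden state from the previous hidden state and the current word
            h = torch.tanh(self.W_xh(x[:, t, :]) + self.W_hh(h))
        # final hidden state -> class scores; the softmax lives inside the loss below
        return self.out(h)

model = SimpleRNNClassifier(vocab_size=10_000, embed_size=64, hidden_size=128, num_classes=2)
logits = model(torch.randint(0, 10_000, (4, 7)))                 # 4 sentences, 7 words each
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 1, 0]))
```

The only thing that grows with sentence length is the number of times the loop runs; the parameters stay fixed, which is exactly what lets the same network handle sequences of arbitrary length.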
And time series are wonderful, because for any sequence of words x1 through xm, you can feed those through our recurrent network structure and predict what the next word in the sequence will be; or for any sequence of images, image, image, image, predict what the next image will be; or for a sequence of inventory levels, day 1, day 2, up to day m, predict what the inventory will be on day m plus 1. So with time series we always have lots of data we can use to train up models, and language in particular is wonderful, because we have billions and billions of sequences of English or of Chinese which we can use to train up recurrent neural nets that predict the next word. Once we've trained a network to predict the next word, it has stored up some hidden state along the way. Remember, it took the (m minus 1)th hidden state and the mth observation through the network to produce the mth hidden state. Now, for any new sequence similar to the training ones, we can take that hidden state and use it to predict some other label we care about. That was maybe a little fast, but don't worry, we'll cover all this in detail this week.
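Here is a hedged sketch of that "predict the next word" idea in PyTorch, this time using the built-in nn.RNN for the tied-weight recurrence: the target at each position is simply the following word, so any unlabeled text provides training signal, and the final hidden state can later feed a small classifier trained on a much smaller labeled set. The names (NextWordRNN and friends) are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextWordRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # nn.RNN applies the same weights at every step, just like the loop in the earlier sketch.
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.next_word = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)                   # (batch, seq_len, embed_size)
        hidden_states, h_last = self.rnn(x)      # hidden state after every word, plus the final one
        return self.next_word(hidden_states), h_last.squeeze(0)

model = NextWordRNN(vocab_size=10_000, embed_size=64, hidden_size=128)
tokens = torch.randint(0, 10_000, (4, 8))        # 4 "unlabeled" sentences of 8 words

# Self-supervised objective: at position t, predict the word at position t+1.
logits, final_hidden = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 10_000), tokens[:, 1:].reshape(-1))

# final_hidden has shape (batch, hidden_size); it is the summary of the sequence
# that a downstream classifier with few labels could be trained on.
```

The point of the sketch is the shape of the setup, not the specifics: unlabeled text gives you a next-word objective for free, and the hidden state it produces is what you carry over to the labeled task.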