Recurrent neural networks are very versatile, and we can use them in a host of applications. In the last video, we saw a few architectures: vector-to-sequence models, sequence-to-vector models, and sequence-to-sequence models. You could go back and watch that video, but even if you don't, that's okay. Let's just make sure we're on the same page.

So what is a neural network? We can define a neural network mathematically as a differentiable function that maps one kind of variable to another, like how in classification problems we convert vectors to vectors of class probabilities, and in regression problems we convert vectors to scalars. Recurrent nets just throw sequences into the mix, and so we end up with numerous architectures that can be used in various applications.

One architecture is the vector-to-sequence model: we take a vector and generate a sequence of a desired length. A trending research application of this is image captioning, where the input is a vector representation of an image and the output is a sequence of words describing that image.

A second architecture we discussed is the sequence-to-vector model: the input is a sequence of words, and the output is a fixed-length vector. A typical use case is sentiment analysis. The input could be the words of a movie or product review, and the output could be a two-dimensional vector indicating whether the review was positive or negative.

The third architecture we looked at is the sequence-to-sequence model, where both the input and the output are sequences. In the last video we even coded a model of this type using only NumPy. The input was the words of a sentence, and the output at each step was a prediction of the next word in the sequence. With sufficient training, this word-level language model can generate its own sentences.

But here's the thing: that sequence-to-sequence model had equal-length inputs and outputs, and most applications out there don't. In language translation, a 10-word sentence in English may not have a 10-word German translation. Even in text summarization, the input is a set of sentences, but the output is, by definition, a reduced set of sentences. Clearly, to deal with this new set of problems, we need another type of architecture: one that takes in an input sequence but outputs a sequence of a different length. It exists, and it's called the encoder-decoder architecture.

I'm pretty sure you can predict the two parts of this architecture. The first is the encoder, which converts a sequence to a vector; the second is the decoder, which converts that vector to a sequence. Take the example of English-to-German translation. Each sequence is a sentence, or a set of sentences. Each input x_t is a word of the English sentence, and each output y_t is a word of the German sentence. The encoder takes the English sentence and converts it into some internal representation. This representation, which is a vector, holds the meaning of the English sentence, but it isn't human-interpretable. The decoder takes this meaning vector and converts it into a sequence: the German sentence.

So here's a question: how long can these sequences really be? Theoretically they could be arbitrarily long, but in practice we run into a problem. Let's take a simple example. Consider a simple recurrent net with no inputs and no nonlinearity, just a recurrence on a scalar state x that starts from some value x_0.
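To make that concrete, here's a quick NumPy sketch (mine, not from the video) of what repeated multiplication by w does to that scalar after many time steps:

```python
# The recurrence x_t = w * x_{t-1} has the closed form x_n = w**n * x0.
# Watch what happens after 100 steps for w just above and just below 1.
import numpy as np

x0, n = 1.0, 100
for w in (1.1, 0.9):
    print(f"w = {w}: x_{n} = {x0 * np.power(w, n):.3e}")

# w = 1.1: x_100 = 1.378e+04   -> explodes
# w = 0.9: x_100 = 2.656e-05   -> vanishes
```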
After n time steps, its value would be x_n = w^n * x_0; we can write it in this closed form because we're looking at a discrete dynamical system. Since we have ourselves a network, we need to learn the scalar weight w with the backpropagation-through-time algorithm. But what happens to the value of x_n for a very large n? Well, if w is slightly greater than 1, then w^n * x_0 explodes, and if w is slightly less than 1, then w^n * x_0 tends to 0, that is, it vanishes. And because the forward-propagated values explode or vanish, the same happens to their gradients.

We can generalize this to matrices as well: x_t could be a vector, and W a matrix transformation. In this case, for eigenvalues of W greater than 1, the corresponding components of W^n explode. This means the components of the input in the direction of those eigenvectors blow up to infinity, and we lose input information. The opposite happens for eigenvalues of W less than 1: the corresponding components become nearly zero, and the components of the input in the direction of those eigenvectors vanish, which again leads to a loss of input information.

Now you're probably thinking: but AJ, don't we observe something similar in deep neural networks? Isn't vanishing and exploding gradients the reason we can't go deeper in those architectures as well? What's so special about recurrent nets? My answer is that the effect of vanishing and exploding gradients is much worse in RNNs than in traditional deep neural networks. That's because DNNs have different weight matrices between layers: if the weights between the first two layers are greater than one, the weights in the next layer can be less than one, and their effects can cancel out. But in an RNN, the same weight matrix recurs between every pair of adjacent recurrent units, so the effect compounds and we can't cancel it out. Interestingly, this problem with long sequences was investigated way back in 1991 by Sepp Hochreiter. It's an interesting read, and I'll link the paper in the description.

We have some ways of dealing with this problem of vanishing and exploding gradients. The first thing we can do is add skip connections: extra edges that connect a state to a state several steps ahead of it, so the current state is influenced both by the previous state and by a state that occurred d time steps ago. Gradients now explode or vanish as a function of tau/d instead of as a function of tau, where tau is the number of time steps. This concept is exactly how the popular ResNet architecture works in the convolutional-network space.

The second thing we can do is actively remove connections of length one and replace them with longer connections, forcing the network to learn along this modified path.

The third thing we can do is take our vanilla recurrent neural network, but this time attach a constant alpha to every edge joining adjacent hidden units, giving us so-called leaky hidden units. This alpha regulates the amount of information the network remembers over time. If alpha is closer to 1, more memory is retained; if it is closer to 0, the memory of previous states vanishes, in other words, the unit forgets.
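In code, a leaky hidden unit could look like the following minimal NumPy sketch. I'm assuming one common parameterization, h_t = alpha * h_{t-1} + (1 - alpha) * tanh(w * x_t + u * h_{t-1}); the video doesn't pin down an exact formula:

```python
# A leaky hidden unit: alpha controls how much of the previous state
# leaks through to the current one. alpha near 1 -> long memory;
# alpha near 0 -> the unit quickly forgets.
import numpy as np

def leaky_step(h_prev, x_t, w, u, alpha):
    candidate = np.tanh(w * x_t + u * h_prev)
    return alpha * h_prev + (1.0 - alpha) * candidate

h = 0.0
for x_t in (1.0, 0.5, -0.3, 0.0, 0.0):
    h = leaky_step(h, x_t, w=0.5, u=0.5, alpha=0.9)
    print(f"x_t = {x_t:+.1f} -> h = {h:.4f}")
```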
A modification of these leaky hidden units gives us gated recurrent networks. Instead of manually assigning a constant alpha to determine what to retain, we introduce new parameters that compute a gate value at every time step, so we leave it up to the network to decide what to remember and what to forget.

One of the most commonly used gated recurrent architectures is the LSTM, which stands for long short-term memory. Consider our vanilla recurrent neural network. Now replace every hidden unit with something called an LSTM cell, and add another connection running from cell to cell called the cell state. And that's it: this is now our LSTM RNN. LSTMs were designed to mitigate the vanishing and exploding gradient problem. Apart from the hidden state vector, each LSTM cell maintains a cell state vector, and at each time step the next LSTM cell can choose to read from it, write to it, or reset it using an explicit gating mechanism.

Each cell has three gates of the same shape; think of each of them as a soft binary gate. The input gate controls whether the memory cell is updated. The forget gate controls whether the memory cell is reset to zero. And the output gate controls whether the information in the current cell state is made visible. All three have a sigmoid activation. But why sigmoid? So that they form smooth curves in the range zero to one, and the model remains differentiable. Apart from these gates, we have another vector, C-bar, the candidate values that modify the cell state. It has a tanh activation. Now why tanh here? Its range is zero-centered, so the repeated additions to the cell state distribute gradients well, and the cell state information can flow for longer without vanishing or exploding.

Now that you have an intuition for why LSTMs are constructed this way, and how that helps mitigate vanishing and exploding gradients, the equations shouldn't be too difficult to understand. Each of the gates takes the previous hidden state and the current input x_t, concatenates the two vectors, and applies a sigmoid: i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i), f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f), o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o). The candidate vector C-bar_t = tanh(W_C [h_{t-1}, x_t] + b_C) represents the new values that can be applied to the cell state. Now we can apply the gates. Like I said before, the input gate controls whether the memory cell is updated, so it's applied to C-bar, the only vector that can modify the cell state, and the forget gate controls how much of the old cell state should be forgotten: C_t = f_t * C_{t-1} + i_t * C-bar_t. Finally, the new cell state is squashed by a tanh and multiplied by the output gate to get the hidden vector: h_t = o_t * tanh(C_t).

With three gates per LSTM cell, we have a slew of parameters to model. But do we really need such a complex structure? We could get away with just two gates, an update gate and a reset gate, and that is the basis for GRUs, gated recurrent units. On a wide variety of tasks, LSTMs and GRUs have similar performance. I'll be making a dedicated video on GRUs pretty soon.

We have a pretty good hold on the theory now, so let's build an LSTM text generator, more specifically a character-level language model. The RNN will look at a set of sequences, one character at a time, and learn to generate the next character in the sequence. This is similar to the text generator we built in the last video using only NumPy, but this time we're using Keras, a deep learning library built on top of TensorFlow.

The dataset we'll be looking at is a collection of State of the Union speeches dating back to the 1940s. You can get it from NLTK's corpus collection; NLTK is the natural language toolkit in Python. You can use it for all kinds of text manipulation, but here we're just using it to get the data. NumPy is Python's math library for matrix operations. os is used to read data from my local file system, and random is used for things like picking a random starting point when we sample generated text. From Keras we'll import several modules: callbacks, which are invoked after some event happens, like the end of an epoch during training; models, used to construct our RNN; layers, used to add the LSTM and dense components of our network; and optimizers, which let us define the optimizer for training.
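Before we walk through it, here's a condensed sketch of the kind of script this section describes, modeled on the canonical Keras character-level text-generation example. The corpus path, hyperparameters, and helper names (apart from print_callback and on_epoch_end, which are mentioned below) are my assumptions, not necessarily the exact code shown in the video:

```python
# A condensed sketch of a character-level LSTM text generator in Keras.
import os
import random

import numpy as np
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop

# Read every file in the corpus directory into one string.
# (You could instead pull the data with nltk.download("state_union")
# and nltk.corpus.state_union.raw().)
corpus_dir = "state_union"  # assumed local path
text = ""
for fname in sorted(os.listdir(corpus_dir)):
    with open(os.path.join(corpus_dir, fname), errors="ignore") as f:
        text += f.read().lower()

# Map every character to a number, and build the inverse mapping.
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

# Cut the text into fixed-length sequences. The sequence length is the
# number of LSTM cells in the unrolled network; the label for each
# sequence is the character that follows it.
seq_len, step = 40, 3
sequences = [text[i:i + seq_len] for i in range(0, len(text) - seq_len, step)]
next_chars = [text[i + seq_len] for i in range(0, len(text) - seq_len, step)]

# One-hot encode the inputs and the targets.
x = np.zeros((len(sequences), seq_len, len(chars)), dtype=np.bool_)
y = np.zeros((len(sequences), len(chars)), dtype=np.bool_)
for i, seq in enumerate(sequences):
    for t, c in enumerate(seq):
        x[i, t, char_to_idx[c]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

# One LSTM cell with a 128-dimensional hidden vector, and a softmax
# over all characters as the output.
model = Sequential([
    Input(shape=(seq_len, len(chars))),
    LSTM(128),
    Dense(len(chars), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer=RMSprop(learning_rate=0.01))

def sample(preds, temperature=1.0):
    """Temperature sampling: rescale the softmax, then draw a character."""
    preds = np.log(np.asarray(preds, dtype=np.float64) + 1e-9) / temperature
    probs = np.exp(preds) / np.sum(np.exp(preds))
    return int(np.argmax(np.random.multinomial(1, probs, 1)))

def on_epoch_end(epoch, logs):
    """After each epoch: print generated text and checkpoint the weights."""
    start = random.randint(0, len(text) - seq_len - 1)
    for temperature in (0.2, 0.5, 1.0):
        generated = text[start:start + seq_len]
        for _ in range(200):
            x_pred = np.zeros((1, seq_len, len(chars)))
            for t, c in enumerate(generated[-seq_len:]):
                x_pred[0, t, char_to_idx[c]] = 1
            preds = model.predict(x_pred, verbose=0)[0]
            generated += idx_to_char[sample(preds, temperature)]
        print(f"--- temperature {temperature} ---\n{generated}\n")
    model.save_weights("char_rnn.weights.h5")  # HDF5 checkpoint to resume from

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(x, y, batch_size=128, epochs=30, callbacks=[print_callback])
```

To pick up training where the video leaves off, you'd rebuild the same model and call model.load_weights("char_rnn.weights.h5") before calling fit again.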
The first chunk of code reads all the files in our corpus directory into a single string. Next, we construct a mapping from every character to a number, along with its inverse mapping. The number of characters per sequence will be the number of LSTM cells in the unrolled network; the training set is the set of these sequences, and the labeled output for each sequence is the next character.

I define a callback, print_callback, with a method on_epoch_end which, as the name suggests, is executed after every epoch. An epoch is complete when all the sequences have been read once by the model. Here I just want to generate sample text: we use temperature sampling to generate the next character at various temperatures and print it all on screen. I also save the weights of the model to an HDF5 file, so you can continue training where I leave off.

Now we build our model, adding an LSTM cell with a 128-dimensional hidden vector. The output is simply a softmax vector over characters; the target, however, is a one-hot encoded vector. I chose Keras for the implementation to hammer home the fact that there is only one LSTM cell, whose hidden vector is updated over time. The unrolled version is good for understanding the math and the intuition behind recurrent neural networks, but the looped version is how we construct the network programmatically. We use RMSprop as our optimizer to learn the weight parameters; the default parameters tend to work well for recurrent nets, but you can always play around with them. And then we begin our training.

I trained this network for 30 epochs, and it took about six hours on my little MacBook. If you take a look at the sample sentences generated after every epoch, you can see the generated sentences get more and more coherent. Pretty slick, right? If you want to train this further for better results yourself, just construct the model, load the weights, and you're good to go. I'll leave a link to the code in the description down below.

A few things to take away from this video. Recurrent neural nets are basically feedforward neural network layers copied across time steps. They learn parameters through the truncated backpropagation-through-time algorithm, which is basically backprop applied over a window of time steps. Long sequences are a problem for traditional RNNs because they lead to vanishing and exploding gradients. LSTMs and GRUs are gated recurrent networks that can deal with such long sequences. With the power of these recurrent nets, you can dive into stock prediction, language translation, speech recognition, and so much more.

Hope I left you knowing a little more about sequence modeling. If so, hit that like button, smash that subscribe, ring that bell, share the video, add it to your playlist, add my other playlist to your list of playlists, and I look forward to your support.