Before I get started with this video, I want to thank you for your support. The channel officially reached 5,000 subscribers and 300,000 views after publishing my 50th video, all on the same day. I couldn't have done it without you. And I finally started a Patreon page. By becoming a patron, you can voice your opinions on the type of educational videos you want to see. You can have access to my community and so much more. So if you like my content and support what I do on this channel, become a patron today so I can keep making the content that you deserve. The link to my Patreon page is down in the description below. So check it out. Alright, enough of that, enough pandering. Back to the video. Since there's a ton to cover in RNNs, I'm making this a two-part series. In this video, we're going to start with the intuition behind RNNs and use that to code a text generator that generates sentences using only NumPy. When you hear recurrent neural networks, you think of time series, you think of stock market prediction and cryptocurrency. You may be interested in recurrent nets for their applications in natural language processing, like language translation and sentiment analysis. How are RNNs so versatile? How can one type of network be used in so many applications? We're going to take a look at exactly that. And the best part is that understanding them isn't actually too difficult. RNNs are just your typical feed-forward layers, copied and pasted. So all you need to know is backprop, and you're good to go to understand recurrent networks. After this video, you'll understand exactly how RNNs work on the inside. Whether you've never heard of them before or dabbled with them a little bit using some deep learning libraries, you'll get a new perspective on recurrent neural network theory. This is Code Emporium, so let's get started. Let's first start with the big question. What is deep learning?
Now I'm not talking about the standard wiki definition and its association with machine learning, but more big picture, from a mathematical perspective. Essentially, it is a way of representing differentiable functions that map one kind of variable to another kind of variable. So for example, a vector to a floating point number in regression, or a vector to a vector in classification problems, where the output vector could be the probability of belonging to multiple classes. So what are vectors? In math, a vector is an n cross one column matrix, but more concretely, it is an abstraction of raw data. While dealing with images, we convert them into a vector of pixel values to pass into a feed-forward net. The vector represents the meaning of the image, but in a form that is less understood by humans. While dealing with audio, raw wave information is transformed into a vector of mel-frequency cepstral coefficients. MFCCs carry the meaning of the audio clip in a form that is better understood by a computer. We can see recurrent neural networks as throwing sequences into the mix. So what are sequences? In English, a sequence could be a sequence of words in a sentence. In stock prediction, a sequence could be a sequence of prices over time. In general, a sequence is a collection of data that has some defined order to it. Recurrent neural networks can transform sequences into vectors. They can transform vectors into sequences, or even sequences into other sequences. Because of such general transformations, RNNs find themselves useful in a host of applications, and we'll talk about these in a bit. But before we look at the basic architecture, let's first derive some intuition behind recurrent nets by introducing the concept of dynamical systems. Discrete dynamical systems answer the question: I know the state of a system now at time t, so what will the state of the system be at time t plus n? This is also the question we want our recurrent net applications, like stock prediction, to answer.
You know the current market and some past info, so what will the state of the market be in the future? Let s of t be the state of a system at time t. This state could be represented by some vector. Mathematically, we have a function f that determines the state of the system at the next time step. So with such a function f, we can determine the state of a system at any future discrete time step. But in machine learning, we usually have some form of input to the system. And for this input, we want to make a future prediction. Let's consider a dynamical system with an external input. To avoid confusion, I'll replace the state s notation with h to indicate that it is hidden. h of t plus one represents the information seen before it. But since h of t plus one is a fixed-length vector, and the number of inputs before it can be any number, h of t plus one doesn't remember everything it has seen before. So it is important to determine what to remember and what to forget. For example, if we have a network to predict the next word in a sequence of words, all information from the beginning may not be important. That said, what is retained and what is forgotten can be determined based on a function f. Representing this in neural network language, it can be written as applying transformations and an activation. If you don't see how this is, that's okay. Let's start with writing out the equations. This here is a typical recurrent neural network that maps a sequence x to another sequence o. x of t is an input at time t. This could be a word in a sequence. h of t is a hidden unit. This is a vector that represents the current input and the past inputs seen before it. The amount of information retained is determined by the weights w from the previous time step. It also has a nonlinear activation for modeling complex data. Let's consider the tanh activation. It's important to note that the h's here are all actually the same vector. Its value is just tweaked over different time steps.
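Written out, the recurrence just described looks like this (U, W, and V name the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices from the diagram; the softmax on the output is an assumption, matching the word-probability outputs of the text generator built later in the video):

```latex
h_t = \tanh(U x_t + W h_{t-1})
o_t = \operatorname{softmax}(V h_t)
```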
But each x is different from the others, and together they represent a sequence. o of t is the output at time t. In supervised learning, we have a sequence of labels, y. At every time step, we can compare the prediction, that is o of t, to the actual label, which is y of t, and hence determine the loss. The total loss is the sum of the losses at every time step. The primary idea behind RNNs is that the system's past state influences its next state. We convert the black box form to this nice graphical form, which is basically a copy and paste of the same network. And now we can estimate the parameters of the network, that is v, w, and u, the matrices here, with standard forward propagation and back propagation. To be more specific, the technique uses back propagation through time, BPTT, as the h's aren't actually different systems, but the same system at different time steps, like I mentioned before. But there's a problem with back propagation through time. At every time step, we need to perform back propagation, so deeply unrolled networks have significantly larger computation costs. They take longer, and activations need to be stored at every time step. So the amount of storage you need also increases. So a simple solution is to avoid training with back propagation through time. And we can avoid this using an algorithm called teacher forcing. During training time, instead of feeding the hidden layer of the previous state to the hidden layer of the next state, we feed the output y, the actual label y, from the previous state to the hidden layer of the next state. Now if we perform back propagation at o of t, you go back to h of t, and then to y of t minus one. From there, there's no backward edge, so back propagation stops. At every time step, we only backprop two edges instead of the entire sequence history. During testing time, at every single time step, the predicted output o of t would be fed to the next layer. So it's pretty slick, right? But there's a problem here.
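To make teacher forcing concrete, here is a minimal NumPy sketch (the dimensions, weight matrices, and data below are toy assumptions for illustration, not the video's actual code). The hidden state at each step is driven by the true previous label instead of the previous hidden state, so backprop at any step only has to walk back two edges:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the video).
x_dim, h_dim, y_dim = 4, 8, 4

U = rng.normal(scale=0.1, size=(h_dim, x_dim))  # input -> hidden
W = rng.normal(scale=0.1, size=(h_dim, y_dim))  # previous *label* -> hidden
V = rng.normal(scale=0.1, size=(y_dim, h_dim))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_teacher_forced(xs, ys):
    """Forward pass with teacher forcing: step t sees the ground-truth
    label ys[t-1] rather than the hidden state of the previous step,
    so there is no backward edge through the whole sequence history."""
    y_prev = np.zeros(y_dim)  # nothing to feed before the first step
    outputs = []
    for x_t, y_t in zip(xs, ys):
        h_t = np.tanh(U @ x_t + W @ y_prev)
        o_t = softmax(V @ h_t)
        outputs.append(o_t)
        y_prev = y_t  # feed the actual label forward, not o_t
    return outputs

xs = [rng.normal(size=x_dim) for _ in range(3)]
ys = [np.eye(y_dim)[i] for i in (1, 2, 0)]  # one-hot labels
outs = forward_teacher_forced(xs, ys)
print(len(outs), outs[0].shape)  # 3 (4,)
```

At test time there are no labels, so the predicted o of t would be fed forward instead, which is exactly the train/test mismatch discussed next.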
It is trained on perfect data, because we're feeding in the actual labels y, but it is tested on real-world data by feeding in the predicted outputs o. And since this perfect data may not be representative of the data that's generated in the real world, the predictions could be off. Another architecture is the sequence to vector model. The input x is a sequence, and the output is a single vector. An example is a network that reads all the words of a movie review and determines if the review was positive or negative. A typical application of sentiment analysis. Another architecture is the vector to sequence model. We can convert a single vector x into a sequence o of any length. Usually a vector is an internal representation that's understood by computers, not so much by humans. By converting these into sequences, we can make sense of the abstract vector representation. To give an example, x could be some parameters of a generative model that generates sentences in English. The output sequence could be the result of sampling from this generative model at different time steps. And so from the output, we actually get a meaningful sequence, or meaningful sentences. Another interesting model is the bi-directional recurrent neural network. Now the recurrent networks that we've seen so far use information from the past to make decisions in the present. But in some cases, it's also possible to make use of information in the future to make decisions in the present. That sounds weird, right? So how can you actually make use of information from the future that hasn't happened yet? Well, you can't, and that's not what I mean here. Say you want to model language translation. You have to translate some information from English to German. Just looking at the last word or words makes it possible to make a word-to-word translation. But this wouldn't make sense. Anyone who's bilingual would know that translating every word isn't how language translation works.
There are context cues and rules of grammar that need to be followed. So the output o of t wouldn't simply be the translation of x of t, but some other German word depending on the context. This context is derived from words that appear both before and after x of t in the sequence. In a similar light, another use of bidirectional RNNs would be in text summarization. You need to summarize a paragraph or a document into a few sentences. The RNN already has access to the entire document, so it makes sense to read through a few words into the future to understand the context. However, bidirectional RNNs wouldn't be useful in real-time applications where the future input isn't known, like in the case of self-driving cars, or any type of streaming data. Technically, the text summarization and language translation applications don't use RNNs in this specific architecture. Here, the length of the input sequence needs to be the same as that of the output sequence. This isn't the case, though, for language translation, as a 10-word sentence in English can have a six-word translation in German. So instead, we would use an architecture that allows for unequal input and output sequence lengths. This looks like a good checkpoint for theory. Let's now take a look at these recurrent nets in action with some code. We are now going to build a vanilla recurrent neural network that generates sentences using only NumPy. First, we make library imports. Itertools is used for performing operations on pythonic data structures. NumPy is our math library, typically for matrix operations. NLTK is our toolkit for natural language processing. We use it for word and sentence tokenization, and also to get our data, which I'll explain in a sec. OS is used to access our local file system. Sys allows access to Python interpreter variables to do things like clearing output screens or accessing command-line arguments. First, we download the NLTK data.
We'll be using the State of the Union corpus, which has speech transcripts dating back to the 1940s. We read all of these files and extract all sentences into a list. We add sentence delimiters to every single sentence. This is required to let our RNN know where a sentence starts and ends. Next, we break down every sentence in terms of words, a sequence of words. This dataset has over 18,000 words. But since that's going to take forever to train, I restricted it to the 8,000 most frequent words. And for every word that didn't make the cut, I replaced it with a token called unknown. Now we create our training data. The input is a word, and the output is the next word to predict. Pythonically, it's just zipping a list with a version of itself that's one step ahead. Next, let's get into our RNN class. The constructor initializes our matrix parameters, u, v, and w. The forward method computes and stores the state and output activations of every time step. They are stored for the backprop step. The prediction function will output the word with the highest probability of occurrence. The SGD step performs the gradient update through the back propagation through time algorithm. Gradient checking will determine if our hand-implemented optimizer is actually doing what it should be doing. It's good to include it for manually coded optimizers. Once we see our optimizer is working as expected, we can begin training our model with stochastic gradient descent, decreasing our learning rate if the loss starts increasing. This was trained over 100 epochs, each time having the network look at 1,000 sentences. This took like 30 minutes on my little MacBook. Now for the fun part, generating sentences. We tell our model to generate sentences of at least 10 words. Each time, we first feed the network the sentence-start token, and then let the model take over from there. It performs the forward step and generates the output probabilities at each time step.
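The pieces just described can be sketched end to end in a few dozen lines (everything here, including the tiny vocabulary, dimensions, and sentence, is a made-up stand-in for the real 8,000-word model): the zip trick for training pairs, a forward pass that stores every activation for BPTT, argmax prediction, and sampling a sentence starting from the start token.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tokenized sentence with delimiters already added, as word ids.
# Ids are invented for illustration: 0 = sentence start, 1 = sentence end.
sentence = [0, 4, 7, 2, 1]

# Training pairs: "zipping a list with a version of itself one step ahead".
X = sentence[:-1]
Y = sentence[1:]
pairs = list(zip(X, Y))
print(pairs)  # [(0, 4), (4, 7), (7, 2), (2, 1)]

vocab, hidden = 10, 16  # the real model uses vocab = 8,000
U = rng.normal(scale=0.1, size=(hidden, vocab))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(vocab, hidden))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(word_ids):
    """Compute and store the hidden state and output distribution at
    every time step (storing them is what makes BPTT possible later)."""
    h = np.zeros(hidden)
    hs, os_ = [], []
    for w in word_ids:
        x = np.zeros(vocab)
        x[w] = 1.0                      # one-hot input word
        h = np.tanh(U @ x + W @ h)      # new hidden state
        os_.append(softmax(V @ h))      # word probabilities
        hs.append(h)
    return hs, os_

def predict(word_ids):
    """The word with the highest probability at each step."""
    _, os_ = forward(word_ids)
    return [int(np.argmax(o)) for o in os_]

def generate(max_len=20, start=0, end=1):
    """Sample words until the end token (capped so the demo halts)."""
    words = [start]
    while words[-1] != end and len(words) < max_len:
        _, os_ = forward(words)
        words.append(int(rng.choice(vocab, p=os_[-1])))
    return words

print(predict(X))    # one predicted word id per input position
print(generate())    # a "sentence" of sampled word ids
```

On top of this forward pass, the real class adds the BPTT gradient step, gradient checking, and the SGD loop described above.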
This is basically an 8,000-dimensional vector containing values proportional to the probability of sampling that word. We sample words until we sample an end-of-sentence token, and boom, we have our generated sentence. Let's now take a look at some glorious examples. I into other billion when you world of by most a n of supply. I with economic taking his peace loving the year child the enduring was switch value for enemies American of also. Now you're looking at this and you're like, why does it suck so much? It sucks because we need to train it for longer. I saved the model parameters after training, so you can just reload them and continue training until you're satisfied. I'll put this code on GitHub with the links in the description, and that'll do it for now. Once again, thank you guys so much for the support. Don't forget, it'd be awesome if you guys would be my first patrons so I can continue dishing out some awesome content. Links are in the description. In the next video, we're going to talk about more important architectures. We'll also cover why, for longer sequence inputs, we need better architectures: gated recurrent networks like LSTMs and GRUs. So stay tuned and stay safe.