In this section we will go through the forward and backward paths for recurrent neural networks. We will start by reviewing the equations for a vanilla neural network, using a new notation that will later let us introduce the forward propagation equations of recurrent neural networks. Finally we will see how back propagation can be applied to an unrolled version of the network, a technique that goes by the name of back propagation through time.

A recurrent neural network can be thought of as a fancier neural network in which the current output ŷ depends both on the given input x and on the state of the system. Let's start by reviewing the basic, fundamental equations of a vanilla neural network. In this case we have a three layer network, and the circles now represent vectorial quantities, not scalars as before, where each circle stood for a single number. We go from x to h through the g nonlinearity and the W_h matrix, and then from h to ŷ through the g nonlinearity and the W_y matrix.

Let's write the equations in this new notation and see how they connect to what we have seen in the previous lessons. The vector h is equal to a nonlinear function, which we simply call g, of an affine transformation of x: W_h multiplies the vector x, and then we add the bias vector b_h for the h activation. The prediction ŷ is again a nonlinear function of the product of W_y with h, plus the bias vector b_y for the y layer:

h = g(W_h x + b_h)
ŷ = g(W_y h + b_y)

As a reminder, in the previous lessons g was simply the sigmoid, and the argument here was the first-layer θ multiplying x, to which we had appended a 1 for the bias. We are doing exactly the same thing, except that the bias has been pulled out of x, and the quantities on the left hand side now represent vectors and no longer scalars. In the old notation h was called the vector a⁽²⁾, g was called σ, the content W_y h + b_y was the second-layer θ multiplying a⁽²⁾, and ŷ was our final hypothesis on the input. So the notation has changed slightly, which is going to make our life easier when we switch to recurrent neural networks.

Regarding dimensions, x is still an n-dimensional vector, and y lives in R^k. Let's call the dimension of the hidden state d. Then W_h is d × n, since it multiplies an n-dimensional vector, and the bias b_h is d-dimensional. In the same way, W_y maps to R^k coming from R^d, so it is k × d, and the bias term b_y is of course a k-dimensional vector. The nonlinearity g can be a sigmoid, a tanh, or any other nonlinearity; the specific choice is not that relevant here. All quantities are now matrices or vectors.

So we can think of the neural network's transfer function ŷ as a function of the input x: it maps from R^n, our input space, to R^k. It shoots the vector x into the space of our y labels, or at least that is what we are trying to do.
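To make the shapes concrete, here is a minimal sketch of this forward pass in NumPy. The dimensions n, d, k and the choice of tanh as g are purely illustrative assumptions, not something fixed by the lesson:

```python
import numpy as np

n, d, k = 4, 8, 3          # input, hidden and output dimensions (illustrative)
g = np.tanh                # any nonlinearity would do: sigmoid, tanh, ...

# parameters of the vanilla network
W_h = np.random.randn(d, n)    # d x n
b_h = np.zeros(d)              # d
W_y = np.random.randn(k, d)    # k x d
b_y = np.zeros(k)              # k

def forward(x):
    """Map x in R^n to y_hat in R^k through the hidden state h in R^d."""
    h = g(W_h @ x + b_h)       # h = g(W_h x + b_h)
    y_hat = g(W_y @ h + b_y)   # y_hat = g(W_y h + b_y)
    return y_hat, h

x = np.random.randn(n)
y_hat, h = forward(x)
print(y_hat.shape, h.shape)    # (3,) (8,)
```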
So ŷ does map to R^k, but it does so by going through the hidden state. We could therefore write that ŷ goes from R^n, through R^d, and then to R^k, where d, the dimensionality of the hidden state, is usually quite a bit larger than both the input and the output dimensionality. This is done in order to generate a hidden space, an internal representation, which disentangles the combinations of input features and therefore allows us to discern meaningful characteristics of the input data.

The matrices W_h and W_y contain our linear kernels as row vectors, and each of these row vectors is scalarly multiplied with the corresponding input. In the h layer we compute the projection of the input onto these row kernels; in the last layer we compute the projection of the hidden state onto the output kernels. We then apply the nonlinearity in order to increase separation and break the linearity of the system. So we can think of W as a stack of rows w_1, w_2 and so on; every matrix multiplication is simply the projection of the input onto each kernel, followed by a translation by the bias term, followed by the application of a nonlinear function. The nonlinearity is what allows us to stack more than one layer, since a linear combination of multiple linear systems is still a linear system, and without it the whole hierarchy of layers would collapse.

And finally we come to the recurrent neural network. All that changes here is that we have a loop, with a W_hh matrix, and that we have introduced square brackets with a time index t. The input is no longer just a single vector x but can be a whole sequence of inputs; the hidden state becomes a function of the time index t; and at the output we can have either the sequence ŷ[t] or simply the last element ŷ[T], which can be thought of as the final output of the network once all the inputs have been provided and the processing has been completed.

Let's write the equations for this network and see how they compare to the equations we just wrote for the standard neural network. The hidden state h is now a sequence, so it carries a time index t; we use square brackets because time is discrete (we would use parentheses if we were working in the continuous time domain, but here we are in the discrete domain). It is equal to the nonlinear function g fed with W_hx multiplying the input at time index t, plus the weight matrix W_hh weighting the previous state h[t−1], plus the associated bias term for h:

h[t] = g(W_hx x[t] + W_hh h[t−1] + b_h)

The prediction ŷ[t] is exactly as it was in the previous slide: the nonlinear function fed with the matrix W_y multiplying the hidden state at the same time index t, plus the offset b_y:

ŷ[t] = g(W_y h[t] + b_y)

Here each output element belongs to R^k, the state and the previous state belong to R^d, and the input is a sequence of n-dimensional vectors. Moreover we have to specify that the initial state, the hidden state at time index zero, is by definition set to the zero vector, so that we start with a reset system.
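As a minimal sketch of these two equations (again assuming NumPy, tanh as g, and illustrative dimensions), the forward pass over a sequence could look like this:

```python
import numpy as np

n, d, k = 4, 8, 3
g = np.tanh

W_hx = np.random.randn(d, n)   # input-to-hidden, d x n
W_hh = np.random.randn(d, d)   # hidden-to-hidden (the recurrence), d x d
W_y  = np.random.randn(k, d)   # hidden-to-output, k x d
b_h  = np.zeros(d)
b_y  = np.zeros(k)

def rnn_forward(xs):
    """xs: list of n-dimensional input vectors; returns the sequence of predictions."""
    h = np.zeros(d)                            # h[0] = 0: reset system
    y_hats = []
    for x_t in xs:
        h = g(W_hx @ x_t + W_hh @ h + b_h)     # h[t] = g(W_hx x[t] + W_hh h[t-1] + b_h)
        y_hats.append(g(W_y @ h + b_y))        # y_hat[t] = g(W_y h[t] + b_y)
    return y_hats

xs = [np.random.randn(n) for _ in range(3)]
print(len(rnn_forward(xs)))                    # 3 predictions, one per time step
```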
Regarding the dimensionality of the matrices: W_hx is the same as the W_h from before, mapping to R^d coming from R^n, so it is d × n. The new matrix W_hh maps to R^d coming from R^d, so it is d × d. The last one is unchanged: W_y maps to the k-dimensional output space coming from the d-dimensional hidden state, so it is k × d. The term b_h is of course a d-dimensional vector and b_y is a k-dimensional vector.

What is the main takeaway of these equations? Their recurrent nature tells us that the current output ŷ[t] depends not only on the current input x[t] but also on the internal state of the system, h[t−1]. This means that applying the very same input x multiple times may lead to different outputs ŷ. The current state h[t] depends on the previous state h[t−1] via the matrix W_hh, which models how the current state depends on the previous one.

The hidden state equation can be rewritten in a different and more compact way, which makes it look even more similar to the equation we saw in the previous slide. We can write h[t] as the nonlinear function g applied to a single matrix W_h, just as for the vanilla neural network, multiplying something a bit particular: the concatenation of the input x[t] and the previous state h[t−1], both of which are still vectors, plus the bias vector for the h layer:

h[t] = g(W_h [x[t]; h[t−1]] + b_h)

Here W_h is the horizontal concatenation of W_hx and W_hh, and therefore it maps to R^d coming from R^(n+d). This should not be surprising at all: instead of taking the dot product of the first row of W_hx with the input vector plus the dot product of the first row of W_hh with the state vector, we simply take the whole first row, which is the concatenation of the two first rows, against the concatenation of the two column vectors. The computations are exactly the same, but this notation is more compact, and since we perform just one matrix-vector multiplication it is also more efficient when executed on a computer.

Furthermore, we can see that the recurrent neural network can easily be converted back to a normal neural network by setting the matrix W_hh to the null matrix. So if the input is a sequence that does not carry any information across time, the network can converge to a configuration where the memory matrix is zero, and the system becomes purely combinational, without memory.

So how do we train this system? The nice part is that we simply apply forward propagation, back propagation and then gradient descent; nothing changes from the way we have been training our networks so far.
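Here is a quick numerical check of this compact form (a sketch, assuming NumPy and small random matrices): concatenating the two weight matrices and the two vectors gives exactly the same result as the two separate products.

```python
import numpy as np

n, d = 4, 8
W_hx = np.random.randn(d, n)
W_hh = np.random.randn(d, d)
x_t    = np.random.randn(n)
h_prev = np.random.randn(d)

# compact form: W_h is the horizontal concatenation [W_hx | W_hh], shape d x (n + d)
W_h = np.hstack([W_hx, W_hh])
z   = np.concatenate([x_t, h_prev])            # [x[t]; h[t-1]], length n + d

# one matrix-vector product versus two separate ones: identical result
assert np.allclose(W_h @ z, W_hx @ x_t + W_hh @ h_prev)
```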
So let's see how we perform the first forward propagation. We insert the first input. Since we suppose there is no previous state, we feed in a zero vector, which kills the recursion for the first element. Given just this first input we can compute the internal representation, and then via the matrix W_y we compute the first prediction. Finally we have our loss block, which is fed with the network output and accounts for the first error (in the diagram this first step carries the index t−1).

Then we insert the second input, and we can compute the new hidden state given the previous hidden state and the W_hh matrix. Once the new state is computed we can compute the output through the W_y matrix; the new prediction is sent forward and we again compute the error against whatever is expected. Finally we input, let's say, the last element of our sequence, so there are no more elements: we produce the new state from the previous state with the W_hh and W_hx matrices, produce the output with the W_y matrix, and feed the last prediction into the loss block.

What we have to notice is that every circle here shares the same parameters and also the same gradients with respect to the parameters. So although this may look like a large neural network, it is actually a lighter version, because the parameters and the grad-parameters are shared across every instance. What is not shared are the internal temporal buffers: the activations stored for the first input are different from those for the second and the third instance; otherwise the network would be overwriting its own activations. In the same way, the three output nodes also share the same parameters and grad-parameters.

So we started with a sequence of inputs, which we can call x[1], x[2] and x[3]. Then we have a set of parameters which, although there are multiple branches, is just one set, because parameters are shared. At the end we obtain a sequence of vectorial outputs and the corresponding errors. We can then compute the corresponding gradients and back propagate them down the network: the last gradient, the second-to-last, the third-to-last, each converted into deltas. Of course we do not want any gradient coming from the future: the gradient entering the last step from the future is a zero gradient, because we do not have the gift of precognition or prescience, so we input a zero from the future.

Then each delta coming down splits in two: one part goes down towards the input (we do not do anything to the input, so we do not care, but if we had more layers stacked below, these would be the deltas sent down to that lower layer); the other part goes back in time to the previous step. We do the same at every step, and we stop because we do not go into the past before the initial time. Finally we can update all the grad-parameters, the derivatives of the error E with respect to the parameters θ, given the gradient inputs coming from the higher levels in the hierarchy and from the layers in the future.
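Putting the unrolled forward pass and back propagation through time together, here is a compact sketch of the whole procedure. It is a simplification under stated assumptions (NumPy, tanh at both layers, a squared-error loss at every time step); it is not the lesson's actual Torch implementation. Note how a single set of parameters and grad-parameters is shared by all time steps, how the delta on the state splits between the layer below and the previous time step, and how the gradient arriving "from the future" at the last step is zero.

```python
import numpy as np

n, d, k, T = 4, 8, 3, 5
rng = np.random.default_rng(0)

# one shared set of parameters for every time step
W_hx, W_hh, W_y = rng.normal(size=(d, n)), rng.normal(size=(d, d)), rng.normal(size=(k, d))
b_h, b_y = np.zeros(d), np.zeros(k)

xs = [rng.normal(size=n) for _ in range(T)]   # input sequence
ys = [rng.normal(size=k) for _ in range(T)]   # target sequence

# ---- forward: keep the per-step buffers, they are NOT shared across time ----
hs, y_hats = [np.zeros(d)], []                # hs[0] = 0, the reset initial state
for x_t in xs:
    hs.append(np.tanh(W_hx @ x_t + W_hh @ hs[-1] + b_h))
    y_hats.append(np.tanh(W_y @ hs[-1] + b_y))
loss = sum(0.5 * np.sum((y_hat - y) ** 2) for y_hat, y in zip(y_hats, ys))

# ---- backward through time: one shared set of grad-parameters ----
dW_hx, dW_hh, dW_y = np.zeros_like(W_hx), np.zeros_like(W_hh), np.zeros_like(W_y)
db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
dh_future = np.zeros(d)                       # zero gradient from the future: no precognition
for t in reversed(range(T)):
    do = (y_hats[t] - ys[t]) * (1 - y_hats[t] ** 2)      # through the output nonlinearity
    dW_y += np.outer(do, hs[t + 1]); db_y += do
    dh = W_y.T @ do + dh_future                          # from above + from the future
    da = dh * (1 - hs[t + 1] ** 2)                       # through the hidden nonlinearity
    dW_hx += np.outer(da, xs[t]); dW_hh += np.outer(da, hs[t]); db_h += da
    dh_future = W_hh.T @ da                              # delta sent back to step t-1

# a single gradient-descent step on the shared parameters
lr = 0.01
W_hx -= lr * dW_hx; W_hh -= lr * dW_hh; W_y -= lr * dW_y
b_h -= lr * db_h; b_y -= lr * db_y
```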
There are some assumptions we have made so far. First, for training we fix a capital T, the number of repetitions, or unrollings, of the network. This also limits the length of the influence we can capture from the data: if the data contain relationships spread over a time period greater than T, we will not be able to capture them if we train with a T lower than that period. So for training purposes we fix the training interval T; that was assumption number one.

Second, the zero vector is what comes from the past: we reset to a zero initial state, so we start with a reset system, unless we would like to keep learning a longer specific sequence. In that case we can, say, input three elements and then, for the next element, feed in as the initial state the last state of the previous training shot. In this way the state keeps updating across several inputs, and we obtain a system that remembers what has happened over a longer time period. So: zero initial state for the first element of the sequence.

The other assumption is zero future knowledge, which again means we do not have the ability of precognition: at the end we will have a zero gradient input, a zero delta, coming from future symbols we have not yet fed into the system while training.

These recurrent neural networks are powerful tools: they can extract information from sequences, condense temporal information into vectors, expand vectorial information into sequences, compress and re-expand, and even perform a plain vector-to-vector mapping, which simply amounts to killing the W_hh matrix. So they are very powerful and versatile models, which can be trained with the same tools we have seen so far, but which require a slightly tricky implementation for back propagation through time. This is simply the classical back propagation algorithm, computing the partial derivatives of the error function with respect to our parameters, applied to an unrolled version of the network that shares both the parameters and the gradients of the error with respect to the parameters. In the next episode we will see how we can implement this tricky combination of forward and backward signal connections in a network in Torch.
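As a small illustration of the second assumption, here is a sketch (again just assuming NumPy and tanh, with hypothetical names) of how the last state of one training shot can be fed in as the initial state of the next one, so that the state keeps updating across chunks of a longer sequence:

```python
import numpy as np

n, d, T = 4, 8, 3
rng = np.random.default_rng(1)
W_hx, W_hh, b_h = rng.normal(size=(d, n)), rng.normal(size=(d, d)), np.zeros(d)

def run_chunk(xs, h0):
    """Run the recurrence over one chunk starting from h0 and return the final state."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)
    return h

long_sequence = [rng.normal(size=n) for _ in range(4 * T)]

h = np.zeros(d)                       # reset state only once, at the very beginning
for start in range(0, len(long_sequence), T):
    chunk = long_sequence[start:start + T]
    h = run_chunk(chunk, h)           # carry the last state into the next training shot
```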