In this segment, we will see how to train a recurrent neural network with an easy example. We will start by introducing our case study, which means the problem we will try to solve by training a recurrent neural network. Then we will review how to use the optim package for training a neural network, in this case a recurrent one. Then we will run the code and see the output. We will also try to change some hyperparameters and see the corresponding results. Finally, we will delve deep into the code and see how everything works together. So let's introduce our learning problem. We are going to have a sequence-to-sequence case. We start from a sequence, which is sent to the hidden layer (we are going to use just one hidden layer), and then we expect to output a prediction based on the current input and on the previous inputs. So what is our task? Our x is going to be a sequence of characters, for example A, B, A, B, B, A, A, and let's say we would like to be able to recognize the subsequence of letters that makes up the word ABBA, like the famous music band. More specifically, when we see the last character of the word ABBA, the final A, we would like our model to output a 2; in all the other cases, it should simply output a 1. So this is our sequence of labels. If we get it right, say I expect a 2 and the network's prediction is indeed a 2, then I am going to highlight those four letters in one way. If instead the network predicts a 2 where the sequence is not actually ABBA, I will have to mark those letters as a false positive. And if I have to predict a 2 but I predict a 1 instead, I will highlight those four letters as a false negative. So in our visualization we will have to take care of true positives, false positives, true negatives (which are actually the only ones we won't be highlighting), and false negatives.
And we will see how we can do this highlighting in the UNIX terminal environment. Now, our x has to be one-hot encoded. Take the sequence we have just written on the left-hand side and let's write it here: A, B, A, B, B, A, A. With A corresponding to 1 0 and B to 0 1, it becomes 1 0, 0 1, 1 0, 0 1, 0 1, 1 0, 1 0, and so on. Our labels are going to be: for the first one a 1, for the second a 1, the third a 1, the fourth a 1, the fifth a 1. Then at the sixth, where the ABBA ends, we have a 2, and then again a 1, and so on. So our target labels are going to be 1 for all the positions that do not complete the ABBA sequence, and they turn to 2 when the sequence has been found. So we call this the "find the ABBA sequence" problem. How do we plan to train such a system? We have seen for convolutional neural networks that we can use the optim package very efficiently. We will do the same here, and therefore we will follow a very similar workflow. Let's summarize it for convenience. At the beginning, in the training part, we get the dataset (the training set). Then we craft our model, our recurrent neural network, together with its time replicas. We have to reset the state to the initial state, because we start from a clean network: we haven't seen any sequence before. Then we get the parameters in vectorial form, no longer as a collection of objects, together with the gradient parameters, that is, the partial derivatives of the error with respect to the global parameter vector. Then we define our function feval, to which we send theta as argument and which returns the error E and the partial derivatives of the error with respect to the vectorial form of the parameters. And finally, we can run our optimizer. For testing purposes, we just need to get the dataset, in this case the validation set, and then reset the state, since the state will have been polluted by the training at this point of the code.
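The demo itself is written in Lua/Torch, but the data-generation logic just described (random A/B symbols, labels that turn to 2 where an ABBA run ends, and a two-column one-hot encoding) can be sketched in Python with NumPy. Function and variable names here are illustrative, not the demo's actual identifiers.

```python
import numpy as np

def get_data(size, seed=6):
    """Sketch of the demo's data generation: symbols, labels, one-hot x."""
    rng = np.random.default_rng(seed)
    # Random sequence of symbols: 1 = A, 2 = B
    sym = rng.integers(1, 3, size)
    # Labels default to 1; a position that closes an A-B-B-A run gets label 2
    y = np.ones(size, dtype=np.int64)
    for i in range(3, size):
        if tuple(sym[i - 3:i + 1]) == (1, 2, 2, 1):
            y[i] = 2
    # One-hot encoding: A -> (1, 0), B -> (0, 1)
    x = np.zeros((size, 2), dtype=np.float32)
    x[np.arange(size), sym - 1] = 1.0
    return x, y

x, y = get_data(20)
```

The key point is that y depends on a window of four consecutive symbols, which is exactly why the network needs memory of the past inputs to solve the task.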
We are going to feed each example of x one at a time to the RNN, and we no longer need the time-replica format. As we said in the previous video, the code can be found on the e-Lab website, in the torch7-demos repository, in the folder rnn-train-sample. Let's start by having a look at how this program runs. So let's call th main.lua. We have the training progress, and here we have the output notation. We can observe in green the true positives; in plain white the true negatives; in white with a red background the false positives, which means the network says the sequence is positive but it is not; and underlined the false negatives, which is when the network does not recognize the sequence. So, once more: green is when the network recognizes the sequence; white is when the network correctly identifies nothing; red background is when a sequence has been flagged as positive but it is not; and underlined is when a real sequence has not been recognized. As we can see from this output, after the first few characters we have ABBA twice, which has been recognized and is therefore shown in green. Then there is one more ABBA which has not been recognized and is therefore underlined. Then there is a group of characters the network thinks is ABBA but is not, and therefore it is highlighted with a red background. Then it correctly identifies true negatives, so we have a whole white stretch; then it correctly identifies the ABBA sequence, and we go on like this up to the end of the sequence. Let's see how this training is implemented. This is the main.lua file. We require nngraph and optim so we can use them both in this script. We set the Torch manual seed to 6 in order to have repeatability: if you run this script, you will observe the same results I am showing you right now. Then we set the Torch default tensor type to FloatTensor. This is a common practice in machine learning, since we do not require very high precision during computation.
Moreover, there is a set of hyperparameters: n is the dimensionality of the input, which is 2; d is the dimensionality of the hidden representation; then we have the number of hidden layers, and we use just one hidden layer; and K is the output dimensionality, which is again 2. n is 2 because a symbol can be A or B, so the one-hot encoding will be 1 0 or 0 1. In the same way, K has dimensionality 2 because we will have a probability distribution across the two classes. T is the maximum length of the sequence, which is equal to 4, since ABBA has four characters. The training size is set to 10,000 and the testing size is set to 150. Let's see what happens if we change the hidden dimensionality from 2 to 3. We run the script again and we get all correct predictions. So we have just seen that a single hidden layer with three neurons is able to solve this simple problem. Let's go back to the previous screen. We set a learning rate of 0.02 and a smoothing factor of 0.95, which are then sent to the optim state. We define some coloring shortcuts, which give us the different colors for the foreground and background and the underlining feature. Then we do data = require 'data', and we call data.getData with the training size and the maximum length of the sequence, and we expect to receive x and y. So let's have a look at data, and here we have data. The function getData defines a Torch tensor of dimensionality equal to the training size, which is filled with ones and twos at random. We define a y, which is a tensor of ones; then, scanning from the fourth position to the last element of the training size, whenever we find the sequence A B B A, that is 1 2 2 1, we set the corresponding label of y to 2. Then we define our x: x is going to have as many rows as the training size and as many columns as the features. Since it is a one-hot encoding with two symbols, we will have just two columns.
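The coloring shortcuts mentioned above are, in the terminal, just ANSI escape sequences. The Lua script defines its own shortcuts; the names below are hypothetical, but the escape codes themselves are the standard ones any UNIX terminal understands.

```python
# Standard ANSI escape codes; the variable names are illustrative,
# not the identifiers used in the actual Lua demo.
GREEN     = "\033[32m"   # true positive: ABBA correctly recognized
RED_BG    = "\033[41m"   # false positive: flagged as ABBA, but it is not
UNDERLINE = "\033[4m"    # false negative: ABBA was there, the model missed it
RESET     = "\033[0m"    # back to plain text: true negative

def colorize(text, code):
    """Wrap text in an ANSI style code and reset afterwards."""
    return code + text + RESET

print(colorize("ABBA", GREEN), colorize("ABBA", RED_BG), colorize("ABBA", UNDERLINE))
```

Running this in a terminal prints the same word three times with the three highlight styles used by the demo's visualization.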
So, going from the first to the last element, if the symbol is an A we set it to 1 0; otherwise, if the symbol is a B, that is a 2, we set it to 0 1; and then we return x and y. Then we have RNN = require 'RNN', and we call getModel from the RNN package we have just brought in, with the dimensionality of the input, the dimensionality of the hidden layer, the number of hidden layers, the dimensionality of the output K, and the maximum length of the sequence T. The output is going to be a model, which is the prototype cloned over the capital-T time steps, and the prototype itself, which is one of those clones. Then we define a criterion, which is a ClassNLLCriterion, so it is our classical cross-entropy. If you don't know what a cross-entropy criterion is, just check the CNN loss video lecture. Then we get our one-dimensional vector w containing the weights of the model, and the grad parameters, which are the derivatives of the loss function with respect to the parameters. And then we print on screen the number of elements that are in w, so the number of parameters. We then define the h0 and h tables and fill them with tensors of zeros. These are the initial state and the current state of the network; since we haven't started using the network yet, we fill them with zeros. Then we forward our first four samples through the network that is cloned over time and through the prototype, so that we can plot the two graphs with full annotations. Let's have a look at these two graphs. This is the prototype. We can see that, after the first reshaping and splitting nodes, we have the x at time t in orange on the left-hand side, and on the right-hand side we have in red the first, and actually only, hidden layer at time step t minus 1.
These two inputs are fed into a JoinTable, which is sent to a Linear, which is then sent into a nonlinearity, which computes the first hidden layer output at time t. If we had more layers, they would be stacked one after the other. The output of the first and only hidden layer is then sent to another Linear and then to the final LogSoftMax, which allows us to use the cross-entropy criterion. Here we have the full model cloned over time. In blue we can see the RNN prototypes: RNN time step one, RNN time step two, RNN time step three and RNN time step four. To the first clone, RNN time step one, we feed the previous value of the hidden state, and we also send the first element of our sequence, x(1). Then the output of the RNN, the red circle which contains the hidden layer at time step one, is fed together with the second element of our x tensor into the second clone, RNN two. The output of RNN two, which is h at time two, is fed into RNN three together with x at time three. Finally, h at time three, together with x at time four, is fed into RNN four, which produces the last hidden state h. Moreover, each of these blue RNN blocks produces a LogSoftMax output, so we are going to have here y hat one, y hat two, y hat three and y hat four, which we use for training and for writing the predictions at testing time. Here we define two helper functions: one lets me go from tables to tensors, and the other from a tensor to a table; the latter also prepends the zero tensors that are going to be sent in as the state when we perform the propagation. We will see this in a few lines of code. Here we initialize our training error to zero. Then we go from the first element to the last element of our training set with a step of capital T. We narrow the whole dataset x along the first dimension.
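To make the prototype and its unrolling concrete, here is a minimal NumPy sketch of one RNN step (join the input with the previous hidden state, apply a linear map and a tanh, then a second linear map and a LogSoftMax) and of feeding a length-T sequence through T weight-sharing clones. This is an illustrative re-implementation, not the demo's Lua code, and the sizes (n = 2, d = 3, K = 2, T = 4) mirror the hyperparameters discussed above.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over a 1-D vector
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def rnn_step(x_t, h_prev, Wx, Wh, bh, Wy, by):
    # JoinTable + Linear + nonlinearity: h_t = tanh(Wx x_t + Wh h_{t-1} + bh)
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev + bh)
    # Linear + LogSoftMax: class log-probabilities y_hat_t
    y_hat_t = log_softmax(Wy @ h_t + by)
    return h_t, y_hat_t

def unroll(xs, h0, params):
    # Feed x(1)..x(T) through T clones sharing the same weights,
    # threading the hidden state from one clone to the next
    h, outs = h0, []
    for x_t in xs:
        h, y_hat = rnn_step(x_t, h, *params)
        outs.append(y_hat)
    return h, outs

# n = 2 inputs, d = 3 hidden units, K = 2 classes, T = 4 time steps
rng = np.random.default_rng(6)
params = (rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3),
          rng.normal(size=(2, 3)), np.zeros(2))
xs = [np.array([1., 0.]), np.array([0., 1.]),
      np.array([0., 1.]), np.array([1., 0.])]   # the one-hot word A B B A
h_T, y_hats = unroll(xs, np.zeros(3), params)
```

Each clone emits both the next hidden state, passed to its right-hand neighbor, and a log-probability vector over the two classes, exactly as in the unrolled graph.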
We start from the current iteration index and take just capital-T elements. We do the same for the y tensor. We then define our feval function. We set our model to training mode. Then we have our current state, which is going to be the collection of the output ys followed by the h states: it is equal to our model, to which we forward our x sequence, and we unpack the table of h's. In this case we have just one hidden layer, so we send in just one tensor, but had you chosen multiple hidden layers, the unpack would have sent in each tensor separately. Then we have the prediction: we convert the states table, which contains the y predictions and the h hidden states, into just the predictions. So we basically extract only the predictions from the states. Then we compute the error via the criterion, to which we forward the prediction and the y sequence. We then perform the backward pass. We compute the gradient of the error with respect to the output of the system by forwarding again the prediction and the labels to the criterion, this time using the backward method. Then we convert the output tensor from the criterion into a table, which we use to perform backpropagation through the model. But before sending it into the model, we zero the grad parameters in order to perform stochastic gradient descent; otherwise we would be accumulating the grad parameters on top of their previous values. Finally, we take care of storing all the hidden layer states into the h table. In this way, when we go back above and unpack h, we unpack the current state of the network, and we do not have a zero state every time we switch to the next sequence. And that's it: we simply call optim with the rmsprop optimization function, to which we provide the function feval, the parameters w, and the optim state, which contains the learning rate and the smoothing factor.
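The overall training pattern (a flat parameter vector, a closure that returns the loss and its gradient, and an optimizer that consumes that closure) is worth seeing in isolation. The sketch below uses a toy least-squares problem and plain gradient descent instead of the demo's RNN and rmsprop; it is a hypothetical Python illustration of the optim workflow, not the demo's code.

```python
import numpy as np

# Toy problem: fit w so that X @ w matches y
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # like getParameters: one flat vector of weights

def feval(theta):
    """Closure in the optim style: returns (loss, gradient).

    In the Lua demo the gradient buffer is accumulated, so it must be
    zeroed before each backward pass; here it is computed fresh."""
    err = X @ theta - y
    loss = 0.5 * (err ** 2).mean()
    grad = X.T @ err / len(y)
    return loss, grad

def gd_step(theta, feval, lr=0.1):
    # Stand-in for optim.rmsprop(feval, w, optim_state)
    loss, grad = feval(theta)
    return theta - lr * grad, loss

for _ in range(300):
    w, loss = gd_step(w, feval)
```

The optimizer never needs to know the model's structure: it only ever sees the flat vector theta and whatever loss and gradient the closure hands back, which is exactly why the same optim package works for convolutional and recurrent networks alike.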
We start our testing by setting the model and the prototype to evaluation mode, so we turn off any kind of training-related feature. We again initialize our hidden state to the h0 state, which is a table of tensors set to zero. Moreover, we require a new dataset of size testSize, which was 150, in order to be able to test our system on data that is not the training data. Then we define some pointers for the queue we are going to use later in order to visualize the results nicely. And here we have the test function. It takes one element of the x validation set and sends it to the prototype, which is one of the RNN replicas: it sends the current symbol of the sequence together with the state of the network, which starts from a zero state. We then obtain the output, which contains the next state, and we cache it; then we extract the prediction by taking the value that comes after the states. Then we check whether the current symbol was a 1 0 or a 0 1, and we convert it to the character A or B. If we are processing just the first three elements of the whole validation set, we insert them into our sequential buffer and do nothing else. Otherwise, from the fourth symbol on, we can start evaluating whether the sequence is a correct sequence or a bad one. index is the arg-max of the prediction: the prediction is a log-likelihood and we take the max. So if index is equal to the label, then we are going to draw it in green. Otherwise, if we got it wrong, there are two cases: one case is that the label is 2, so we had a sequence but we haven't identified it, and we have a false negative; otherwise, if it is not a 2 but we have identified a sequence, then we have a false positive. Then we update the pointers, we get the next value from the queue, and we write it on the screen. At the end, we simply print the legend of the visualization and we iterate the test function over the test size.
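The decision logic at each position, arg-max of the log-likelihood against the label, then one of the four outcomes, can be sketched as a small Python function. The function name and the string return values are illustrative; the demo itself maps the same four cases directly to colors.

```python
import numpy as np

def classify_position(log_probs, label):
    """Compare the arg-max prediction at one position against its label.

    Returns 'TP', 'TN', 'FP' or 'FN', matching the coloring rules:
    green, plain white, red background, and underline respectively."""
    pred = int(np.argmax(log_probs)) + 1   # classes are 1 (no ABBA) and 2 (ABBA)
    if pred == label:
        return "TP" if label == 2 else "TN"
    return "FN" if label == 2 else "FP"
```

For instance, a position whose label is 2 but where class 1 gets the larger log-probability comes out as "FN", the underlined case where the network misses a real ABBA.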
And let's run it once more. Since we have restored the dimensionality of the hidden state to 2, we are going to see different mistakes again. So we train here; the number of parameters is 16, a very small network. And then we can see again: we have a true negative, a true positive, then a false negative, where we didn't see the ABBA coming; then again a true negative, and then a false positive, where the output of the network says there is a positive but the sequence is not there. And that's it. Thank you for listening.