All right, a very warm welcome and good morning to all of you. My name is Anmol Krishnadeva, and the topic for today is understanding and implementing recurrent neural networks. Before starting, let's look at an example. Suppose we are reading a book with, say, five chapters. We finish chapter one and start reading chapter two, but suddenly we forget everything that was in chapter one. How will we be able to understand chapter two? It means short-term memory plays a very crucial role. On that note, I would like to start with recurrent neural networks. Martin has already introduced me, so I'll skip this. Prerequisites: you should be aware of the Python language, and you should have a decent knowledge of artificial neural networks and elementary linear algebra. Recurrent neural networks can be thought of as neural networks that persist information, as sequential processes in which the past influences current decisions. In contrast to traditional neural networks, which don't persist information, recurrent neural networks have this advantage over them. So we can think of recurrent neural networks as networks having loops to themselves. I'll explain the architecture now, and these are among the most complex supervised deep learning algorithms we have today. Consider a normal neural network with many layers. X0, X1, X2 are the inputs we provide, and A is the hidden layer. The input X0 goes to A, and the state gets transferred from one hidden layer to the next, and from that hidden layer to the one after it. We can think of squashing the unrolled chain that you see on the right-hand side into the single looped unit known as the vanilla RNN; that is, it has a loop to itself. RNNs can have many architectures.
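To make that loop concrete, here is a minimal vanilla RNN cell in NumPy. This is my own illustrative sketch, not code from the talk: the weight names, sizes, and random initialization are all assumptions chosen only to show the recurrence.

```python
import numpy as np

# Minimal sketch of a vanilla RNN cell. The same weight matrices are
# reused at every time step -- that reuse is the "loop to itself".
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """One step of the recurrence: new state from current input and previous state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Unrolling over a short sequence x0, x1, x2 -- the state h carries memory forward.
h = np.zeros(hidden_size)
for x in rng.standard_normal((3, input_size)):
    h = rnn_step(x, h)
```

Unrolling the loop over time gives exactly the chain of repeated A blocks shown on the slide.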
The first one is the one-to-many architecture, in which you provide one input and get many outputs. For example, you provide an image and it produces a caption: one image is mapped into a caption of, say, five or six words. That is the one-to-many transformation of an RNN. Likewise, we have many-to-one: say we have video frames and we are generating text out of the video, so from a video we can get a sentence. And similarly we can have a many-to-many transformation as well. Recurrent neural networks have the following applications: image captioning, subtitle generation, time series classification, language modeling, and natural language processing. Even chatbot development is based on RNNs nowadays. But there is a major problem in the vanilla RNN: if the RNN has many layers, adjusting the weights at each hidden layer is a problem. Once the information from X0 is transferred to A, it is multiplied by some weight matrix, transferred to the next layer, then the next, and so on, and the output gets generated at the top. Now we have something called a loss function, which is the difference between the actual output and the predicted output. Say the loss function at h(t+1) generates some error, say 0.15, and it needs to be backpropagated throughout the network. Backpropagating through a network that is so large is difficult, because you can think of W being multiplied in at each layer, and multiplying repeatedly by a W between zero and one, say 0.2, drives the value toward something very small.
So the gradient will propagate through the whole chain, but it will take a very long time to train; it is not a feasible solution. The vanilla RNN suffers from the vanishing gradient problem. Likewise, we have the exploding gradient problem, which occurs when the value of W is greater than one: if you repeatedly multiply a number by a factor greater than one, it always grows toward huge values. So the vanishing gradient problem can be thought of as W less than one, and the exploding gradient problem as W greater than one. To solve the vanishing and exploding gradient problems, we use certain techniques. For the exploding gradient problem, we have truncated backpropagation: we divide the whole sequence into certain batches, and we backpropagate only within those batches, sequentially. There is also a system of rewards and penalties, which is like reinforcement learning: we provide rewards if the backpropagation is doing well, else we provide penalties. And we have gradient clipping: if the gradient goes beyond some range, we clip it and don't propagate the full value through the network. For the vanishing gradient problem, we have smart weight initialization, which is something like informed guesswork; then we have echo state networks; and we have the LSTM. Today I'll be talking about the LSTM, long short-term memory, which is one of the most used variants of the RNN. The approach of the LSTM is, in effect, making the weight W equal to one: you don't have W less than one, you don't have W greater than one, so what else can we do? We can make W equal to one. Now, coming to the architecture of the LSTM, C is the new introduction here. h_t is the state of the hidden layer for the current input, and C_t can be thought of as the cell state.
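The shrinking and blowing-up behavior, and the gradient clipping fix, can be illustrated numerically. This is a sketch of my own (the factors 0.2 and 1.5 and the clipping threshold are just example values, matching the 0.2 mentioned above):

```python
import numpy as np

# Repeatedly multiplying by a factor w, as happens to a gradient
# flowing back through many layers with the same weight.
def repeated_product(w, steps):
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

vanishing = repeated_product(0.2, 50)  # w < 1: shrinks toward zero
exploding = repeated_product(1.5, 50)  # w > 1: blows up toward huge values

# Gradient clipping: if the gradient's norm exceeds a threshold, rescale it
# so its norm equals the threshold, instead of propagating the huge value.
def clip_gradient(grad, max_norm):
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```

After 50 steps the first product is astronomically small and the second astronomically large, which is exactly why deep unrolled chains are hard to train.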
So here, we provide the input x_t together with the previous hidden state, and this is transferred to four gates: F, I, G, O. F is the forget gate, I is the input gate, G computes the candidate values, and O is the output gate. We apply element-wise multiplication and addition using this formula: C_t is F, the forget gate, times C_{t-1}, the previous cell state, plus I times G. Multiplying by the forget gate means forgetting some part of the memory. It is like you were reading something and you forgot some part of it but retained a certain part; that retained part is used for training the rest of the network. The cell state is then passed through the tanh function and multiplied by the output gate, and we get the output of the hidden layer. Comparing the RNN and the LSTM, we can say the LSTM has a more sophisticated architecture in that it has a cell state, and this cell state is like a superhighway through which W is not getting changed. We only apply the element-wise fraction F; we do not repeatedly apply the weight matrix. Element-wise multiplication is much simpler than the matrix multiplication involved in the vanilla RNN's recurrence, and that is why the vanishing gradient problem goes away in the LSTM. Now we can start building the LSTM, and I'll provide the code for that, so you can follow along and implement it as is. I'll shift to the implementation now. The major tasks of implementing the RNN are data preprocessing, building the recurrent neural model, and making the predictions and visualizations.
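The four gates and the cell-state update can be written out as one step in NumPy. This is a minimal sketch under my own assumptions (random untrained weights, a single stacked weight matrix for all four gates), just to show the element-wise update described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step. All four gates F, I, G, O are computed from the
# concatenation of the previous hidden state and the current input.
rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4
W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)

def lstm_step(x, h_prev, c_prev):
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * hidden_size:1 * hidden_size])  # forget gate: keep a fraction of memory
    i = sigmoid(z[1 * hidden_size:2 * hidden_size])  # input gate
    g = np.tanh(z[2 * hidden_size:3 * hidden_size])  # candidate values
    o = sigmoid(z[3 * hidden_size:4 * hidden_size])  # output gate
    c = f * c_prev + i * g   # element-wise update: the cell-state "highway"
    h = o * np.tanh(c)       # hidden-layer output passed to the next step
    return h, c

h, c = lstm_step(np.ones(input_size), np.zeros(hidden_size), np.zeros(hidden_size))
```

Note that the cell state c is only ever touched by element-wise gating, never by a repeated matrix multiplication, which is the point made above about the vanishing gradient.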
Before implementing, we need certain libraries: the Keras library, scikit-learn, and TensorFlow. The first task is data preprocessing. We import the NumPy library, the Matplotlib library, and the pandas library. NumPy is for array manipulation, Matplotlib is for visualizing the results, and pandas is for managing the data sets. Since the Keras library doesn't support pandas data frames directly, we need to use NumPy arrays for Keras. Now, before starting, let me introduce you to the problem. Today we'll be predicting the Google stock price for the first month of 2017. These are the records you can see: 20 records, since each month has about 20 financial days, meaning weekdays. This is the set we have to predict on, and the training data set is this one. So we'll be predicting the open stock prices of Google for the first month of 2017, and we'll be training on the years 2012 to 2016, five years of data. The actual prediction should look like this: if I take these two columns and plot them, this is the result we should predict. Our task today is to mimic this behavior of the stock price in January 2017. Now let's get started. The training data set I'm calling Google_Stock_Price_Train.csv, and I'm importing it as a pandas data frame using pd.read_csv. Since this data frame contains many columns but we only require the open stock prices, that is column one, I use the iloc function for selecting. The first parameter of iloc is a colon, which means it selects all the rows, and the second parameter is the column slice, 1:2, where one points to the Open column and two points to the next column.
But since the slice excludes its second index, it takes only the first column, the Open column. And .values converts it to a NumPy array, so we are converting the pandas data frame to a NumPy array. Now we need to scale this data, and we'll be using the scikit-learn preprocessing library, specifically the MinMaxScaler class. So sc is an object of the MinMaxScaler class, and the feature_range parameter tells it to convert the values of the column to the range zero to one, so everything will be scaled between zero and one. After that we just fit and transform. Fit and transform basically means normalizing the data: the fit takes the minimum value and the maximum value from that column, and the transform puts each value through the normalization function. The scaling function could be standardization or normalization. In standardization, we subtract the mean from the actual value and divide by the standard deviation; in normalization, we subtract the minimum from the actual value and divide by the difference between the maximum and the minimum. So it is basic normalization that we are doing here, and that is what fit_transform does. After that we need to create the training list and the ground-truth list: X_train is an empty list and y_train is an empty list, for the training set and the ground-truth set respectively. We have 1,258 records here in the stock prediction data, and we are taking the time step as 60. So what is a time step? The time step is a very important concept in recurrent neural networks: it is how many observations you take into consideration for predicting the next value.
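The column selection and scaling steps can be sketched on a tiny stand-in DataFrame. This is illustrative only: the data values are made up, and in the talk the DataFrame comes from pd.read_csv on the training CSV instead.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in for the stock data (real code would use
# pd.read_csv("Google_Stock_Price_Train.csv") with ~1,258 rows).
dataset = pd.DataFrame({
    "Date": ["1/3/2012", "1/4/2012", "1/5/2012", "1/6/2012"],
    "Open": [325.25, 331.27, 329.83, 328.34],
    "High": [332.83, 333.87, 330.75, 328.77],
})

# iloc[:, 1:2] keeps all rows and only column 1 (the Open column);
# .values converts the selected slice to a NumPy array.
training_set = dataset.iloc[:, 1:2].values

# Min-max scaling: each value becomes (x - min) / (max - min), in [0, 1].
sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)
```

The slice 1:2 (rather than just 1) keeps the result two-dimensional, which is the shape MinMaxScaler expects.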
Or you can say, if you are focusing on predicting the i-th value, it is the set of values you take into consideration before that i-th value. So I'm starting from the 60th record, which means it takes roughly three months of data, trains on that, and predicts the 61st value. It is just an assumption; you could take 20 days or 80 days, anything, but during my testing I took 60 and it produced great results, so I'm using 60 as the time step value. And 1,258 is the number of records. So to the empty list I append the scaled training values from the (i-60)-th value up to the current value. It builds an array in which each record contains the 60 values preceding the current value, so it is like an N-by-60 matrix that is getting created. Now, y_train is the ground truth, and the ground truth just holds the current value, so it is indexed from i to i+1, and zero is the column index, the first column, because we have already cleaned our data and we have only one column, for the open stock prices. Then X_train, y_train = np.array(...) converts them to NumPy arrays, so we are generating the X_train NumPy array and the y_train NumPy array. Then we need to reshape the X_train array. Reshaping means that, to work seamlessly with NumPy arrays and the RNN, we need to add an extra dimension to the existing NumPy array. X_train.shape[0] is the number of rows, X_train.shape[1] is the number of columns, and the final 1 is the indicator dimension for the single open-stock feature. So we are reshaping a 2D array into a 3D array. Now the first step of data preprocessing is done and we move on to the next step of building the RNN. Does anyone have any doubt about this? Yeah. So basically we need to scale the data.
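The sliding-window construction described above can be sketched with a tiny series. This is my own miniature (a series of 10 values and a window of 3, rather than the 1,258 records and 60-step window from the talk) so the shapes are easy to read:

```python
import numpy as np

# Stand-in for the scaled training prices: a tiny (10, 1) series.
series = np.arange(10, dtype=float).reshape(-1, 1)
timesteps = 3  # the talk uses 60

X_train, y_train = [], []
for i in range(timesteps, len(series)):
    X_train.append(series[i - timesteps:i, 0])  # the previous `timesteps` values
    y_train.append(series[i, 0])                # the current value: the ground truth
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape 2D (samples, timesteps) into 3D (samples, timesteps, features)
# as Keras recurrent layers expect; here there is one feature, the Open price.
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
```

With 10 values and a window of 3 this yields 7 training samples, each holding the 3 values preceding its target.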
Say we have values ranging from, say, 700 or 800 up to 1,000, alongside much smaller numbers, in the stock prices. That is not ideal: it will take a lot of time and will not generate good results if the data is not scaled into some range, say 0 to 1. So I have provided the range 0 to 1 to make the data cleaner. It is just cleaning and smoothing the data; that is the use of scaling. Yeah. Okay. Right, you can't see the result in advance; you don't know the result. So the question you are asking is a very good one, and we use something called grid search for this, which is part of hyperparameter tuning. These values are hyperparameters: 60 is a hyperparameter too, and it shapes how the predictions are made. So what we can do, after this is done, is hyperparameter tuning: the grid search library lets you provide arrays of values. Say I have a parameter called time step, and I provide time step equal to 30, 60, 300, suppose. And I have another parameter, say the optimizer, and I take three optimizers which are used for this type of regression. It is a regression problem because we are dealing with continuous values. So we take, say, the RMSprop optimizer, which I'll tell you more about, the Adam optimizer, and a third optimizer. Grid search will then do a cross product: it takes the time step of 30, crosses it with RMSprop, and generates a result; then 30 with Adam; then 30 with the third optimizer. Then it does the same with 60 and with 300. So it generates nine results.
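The cross product behind grid search can be sketched directly. This is an illustration, not the tuning actually run in the talk; the parameter values mirror the examples above, and the third optimizer name is an assumption since the talk only names RMSprop and Adam.

```python
from itertools import product

# The candidate values for each hyperparameter (examples from the talk;
# "nadam" is a placeholder for the unnamed third optimizer).
param_grid = {
    "timesteps": [30, 60, 300],
    "optimizer": ["rmsprop", "adam", "nadam"],
}

# Grid search takes the cross product of all candidate values:
# each (timesteps, optimizer) pair would be trained and scored in turn.
combinations = list(product(param_grid["timesteps"], param_grid["optimizer"]))
```

Three time steps times three optimizers gives nine candidate settings, which is exactly the nine results described above.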
And from those nine results you can form a matrix showing which combination gives good accuracy and low loss. So using grid search you can judge which hyperparameter to tune to which value. That is the idea of grid search. Thanks. No, I have not run it here, because I don't have enough time to show it. So now, building the RNN. We import four classes: Sequential, Dense, LSTM, and Dropout. The Sequential class initializes the neural network as a sequence of layers. The Dense class is for the last layer of the neural network, the output layer. The LSTM class is for making the hidden layers of the network. And Dropout is for something called dropout regularization, which I'll tell you more about shortly. We are making a regressor, since this is not a classification problem; we are dealing with continuous values, so it is a regression problem. So we make an object of the Sequential class called regressor, and we add the first layer to it with regressor.add, which adds layers to the neural network. The first layer is an LSTM layer, the hidden layer we are adding. units=50 is the number of neurons, you can say, that we are introducing to the network; this layer will have 50 neurons. You can take any value here; it is again a hyperparameter, and I took 50. Then return_sequences=True means the output sequence of this layer will be forwarded to the next layer. And input_shape=(X_train.shape[1], 1) points to the time-step dimension of the data set, with the 1 for the single feature in the third dimension we reshaped in. We don't need to provide the first dimension, the number of rows. So this adds the first layer. And adding layers may introduce overfitting.
To deal with the overfitting problem, that is, to avoid fitting the noise, we use dropout regularization. Dropout means dropping out a certain fraction of the neurons in a layer. Out of 50 neurons I have chosen 0.2, that is 20%, to be dropped out, which means only 40 neurons are considered in a given pass: 20% of those 50 neurons, 10 neurons, are dropped at random. This helps avoid the overfitting problem. Yeah. So basically it is a stacked LSTM: I am using four hidden layers, and for the four layers I am adding four dropouts. Yeah. No, no, it happens during the epochs. An epoch is one pass in which the full data is propagated forward and backward to lessen the loss generated by the gradient. For each epoch, the dropout is applied at random: dropping out means simply not considering 20% of the neurons, chosen randomly, in one epoch, and in the next epoch it again chooses a random 20% to not consider. So it is a random process. I have chosen 100 epochs, so the data is fully propagated forward and backward, and the training is done 100 times on this data. For each epoch it chooses some random neurons and drops out those 10 random neurons, and that is how it avoids overfitting. Then, similarly, I am adding three more layers with regressor.add(LSTM(...)). Since you added the first layer already, you don't need to provide the input shape: the LSTM class already knows what the input is, so from the second layer onward you don't provide the input_shape parameter; it is taken automatically from the previous layer. So this is the second layer, then the third layer and the fourth layer.
In the fourth layer, we don't actually need to return the full sequence, because we pass its output to the Dense layer and not to another LSTM layer, so we omit the return_sequences=True parameter. We could state return_sequences=False here, but the default value of return_sequences is False, so we simply leave the parameter out. And Dense is the final output layer: units=1 means it has just one neuron, the result neuron, so Dense corresponds to the result of the neural network. Now the building of the RNN is done, but the compilation phase is still left. Compilation means compiling with a certain loss function and choosing the right optimizer. As I told you earlier, RMSprop and Adam are the two optimizers commonly used with RNNs, and Adam is the one that usually works beyond RNNs as well; it works with CNNs too, so it is a more widely applicable optimizer, and it gives good results. Again, this is a hyperparameter tuning decision; I have chosen the Adam optimizer. And since this is regression, we'll be using the mean squared error loss function. If it had been classification, we would have used the cross-entropy or binary cross-entropy loss function. Then we have compiled this, and we are ready to fit it. For fitting the RNN to the training set, we call the fit function with X_train, y_train, epochs=100, and batch_size=32. So it will take batches of 32: it divides the data set into batches of 32 elements and works on those batches. Again, this is a hyperparameter tuning decision. What I mean by hyperparameter tuning is tuning the parameters of the classes in such a way that they provide good results or good predictions. Now task three is the prediction task: we have done the RNN building and the data preprocessing, and now it is time to predict the results.
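The stacked architecture described above can be sketched in Keras like this. The layer sizes and dropout rates follow the talk; the fit call is shown commented out because X_train and y_train are not defined in this standalone sketch.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Four stacked LSTM layers of 50 units, each followed by 20% dropout,
# then a single-neuron Dense output layer for the regression result.
regressor = Sequential()
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(60, 1)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50))  # last LSTM: return_sequences defaults to False
regressor.add(Dropout(0.2))
regressor.add(Dense(units=1))  # one output neuron: the predicted price

# Regression setup: Adam optimizer and mean squared error loss.
regressor.compile(optimizer="adam", loss="mean_squared_error")
# regressor.fit(X_train, y_train, epochs=100, batch_size=32)  # as in the talk
```

Only the first LSTM layer states input_shape; the later layers infer their input from the previous layer, as described above.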
The test data set is separate from the training data set, so we merge the training data set with the test data set to make one total data set that contains every value. Here we are just using read_csv to create a data frame of the test data set and then converting it to a NumPy array, as we did with the training prices. Then we concatenate the Open column of the training data set with the test column, so it will be 1,258 plus 20 records, 1,278 records in total, because at testing time we need to consider the previous 60 records, and to consider those previous 60 records we need the data set to be complete. So I'm using this complete data set, and axis=0 is for a vertical join. inputs is then another NumPy array, taken from the merged data set, holding the 60 values preceding each record to predict: for each test record, it takes into account the previous 60 values to predict that record's result. Now we reshape it again, and since we have already fitted the scaler earlier, we just transform: we take the inputs NumPy array and transform it for making the predictions. We use another list called X_test, append to it the windows for the last 20 rows that we need to test, then form a NumPy array from it and reshape it again. Now the prediction can be done using the predict function of the regressor object we created.
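The test-input preparation can be sketched with tiny stand-in numbers. This is my own miniature (10 training values, 4 test values, a window of 3) rather than the real 1,258 + 20 records with a 60-step window; the structure of the steps is the same.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

timesteps = 3  # the talk uses 60
train_prices = np.arange(10, dtype=float).reshape(-1, 1)      # stand-in training "Open" column
test_prices = np.arange(10, 14, dtype=float).reshape(-1, 1)   # stand-in test records

sc = MinMaxScaler(feature_range=(0, 1))
sc.fit(train_prices)  # the scaler is fitted on the training data only

# Vertical join (axis=0) so every test record has its preceding values available.
dataset_total = np.concatenate((train_prices, test_prices), axis=0)
inputs = dataset_total[len(dataset_total) - len(test_prices) - timesteps:]
inputs = sc.transform(inputs)  # transform only -- no refitting on test data

# One window of `timesteps` previous values per test record.
X_test = []
for i in range(timesteps, timesteps + len(test_prices)):
    X_test.append(inputs[i - timesteps:i, 0])
X_test = np.reshape(np.array(X_test), (len(test_prices), timesteps, 1))
```

Note that transform (not fit_transform) is used here, so the test values are scaled with the minimum and maximum learned from the training data.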
So regressor.predict makes predictions on the X_test NumPy array we just formed, and inverse_transform is the function used to undo the normalization we did. We scaled the values to between zero and one, so now we invert that transform to convert the predictions back into real values; something like 0.7-something is converted back into, say, 770, a real stock price. And then we plot these results. Since time is short, I have already run this in PyCharm. It was 100 epochs, and you can see that, with an increasing number of epochs, the loss is decreasing. The loss started at some value around 0.0034, and at the end of a hundred epochs it is 0.001315. If I scroll up the log, the loss values get larger, which means the training has been done correctly and we have reached a certain level of accuracy, since the loss has come down from its starting value to around 0.0013. And we can see that the plot we have follows the trend of the ups and downs of the market, but it is not predicting the exact values, even though it follows the trend: if the value of the stock price is going up, it follows that trend. Blue is the predicted one and red is the actual one. Using these parameters I was able to generate this result; doing hyperparameter tuning with grid search would definitely improve these results, and the prediction would then match the actual red line with only some small error.
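The inverse transform step can be sketched on its own. This is illustrative: the min/max values 310 and 770 are made-up stand-ins for the training data's range, and the scaled array stands in for the output of regressor.predict(X_test).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Scaler fitted on illustrative price extremes (stand-in for the real fit).
sc = MinMaxScaler(feature_range=(0, 1))
sc.fit(np.array([[310.0], [770.0]]))

# Stand-in for the network's scaled predictions in [0, 1].
scaled_predictions = np.array([[0.0], [0.5], [1.0]])

# inverse_transform undoes the scaling: x * (max - min) + min,
# mapping the 0-to-1 range back to the real price scale.
predicted_price = sc.inverse_transform(scaled_predictions)
```

So a scaled prediction of 1.0 maps back to the training maximum, 0.0 to the minimum, and 0.5 to the halfway point between them.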
So it is just following the trend of the red line: you can see that at around 2.5 on the x-axis it goes down, then follows the upward trend, and then stabilizes at the end. That is the prediction I was able to make using these hyperparameters. The source code for this is on GitHub at this address; I'll update and upload my slides, and they will appear on the session page. These are my acknowledgements: Professor Martin Christian, my supervisors, and Christopher Olah for creating good blogs. If you have any questions, you can ask, yeah. Thanks, thank you very much; we have time for a couple of questions. Thank you, I wanted to ask how you tuned your architecture. Why four layers? Okay, so when we deal with RNNs we generally choose two, three, or four layers, because it doesn't make much sense going beyond four layers; the results will not improve. For showing the implementation I used four layers, and it gave good results compared with using three or two layers, but generally in RNNs we just don't go beyond four layers. It can be different if we are using RNNs together with some other neural network, say a CNN for an image captioning task. In that scenario we get the vector generated by the CNN and transfer those CNN results into the first hidden state, the h0, and those results form vectors which are then propagated throughout the chain, so they have more impact on the RNN training. In that case we can lower it from four layers to two, because the network already has lots of information from the CNN. So it is based on that. If we consider a basic
example, there is the famous roommate example in the field of RNNs. Suppose you have a roommate who cooks food for you: he cooks, say, apple pie, chicken, and some third dish. On a sunny day he cooks apple pie; if it is another sunny day, he takes the day off and doesn't cook anything; if it is a rainy day, he is at home and cooks the next dish, chicken. So the first input is the weather, sunny or rainy, and the second input is the dish he made previously, and these inputs provided to a state predict the next output. It is like a vector product plus the addition of some of the other things that go on in complex neural networks; you can make sense of it from that. Yeah, I'll be making it public, yeah. Okay, thank you very much again. No, basically I applied the inverse_transform function, so the... yeah. Yes, the scaled values will be inside that range, but the results get inverted back: the inverse_transform function multiplies by the original range and adds the minimum back, so it expands the values from the 0-to-1 scale back to the scale of, say, hundreds. And that is why the prediction does not match the actual result exactly; it has some variation, because you are using smoothed values, and using those smoothed values to predict the actual results introduces variation between them.