Today I will describe how to forecast time series with machine learning and deep learning algorithms, and how we use that in our production system for alerting.

So first, what is a time series and why is it so important to forecast them? A time series is actually a recording of a system's behavior as it changes over time, and it records abnormal behavior as well. We use this recording structure in any series-based application: financial trading systems, medical devices, monitoring of software and hardware systems, cryptocurrencies, and autonomous cars as well. The important point is that, for example in a financial trading system, we are recording stock market data, so it has recorded all the abnormalities of the stock going up and down. At Red Hat our use case is monitoring software and hardware systems. The image shows the ideal time series, a sine wave; unfortunately, in the real world that does not exist very often.

Okay, so why is time series forecasting required for metric analysis? First of all, health and reliability. Time series are recorded over time, here at one-minute intervals, so they capture all the abnormal behaviors of the system; you just have to find them and build a sharp alerting system for the future. Second, reducing false alarms. System admins get emails saying the system is down or some abnormality is going on, or in the stock market case fund managers get alerts saying a fund is going downwards. Some of these are false alarms; they fall under the false positive category of a classification problem, and we don't want that. We don't want a fund manager to take a selling decision based on a false alarm, because that would be a loss, and in our systems domain we don't want to disturb a system admin when the system is working perfectly normally but our alerting system has simply not adapted to the current trend of the data. Third, a glance at the future. With a good forecasting algorithm we will at least know the direction of the data. Suppose in the upcoming week the system will be heavily used by some operation; then higher metric values are maybe not an anomaly. Or say the market for a particular stock is in an uptrend or downtrend for the upcoming week; a normal downtrend is not an anomaly there. That is the glance at the future.

A key challenge in time series forecasting is stationary versus non-stationary data. The stationary series, the green one, is the sine-wave type of signal where you find a uniform mean and standard deviation throughout the series. But as I was saying, this ideal case is not that common in the real world. We usually get the red one, where the time series has an increasing or decreasing trend over time, and in those cases you cannot rely on the mean and standard deviation you compute from the data, because they change over time.
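As a rough illustration (not part of our pipeline), here is a minimal sketch of how you might check whether a series is stationary, assuming the statsmodels package is available; the rolling statistics and the augmented Dickey-Fuller test are standard tools for this, and the window size is just an illustrative value.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(series: pd.Series, window: int = 60):
    """Print rolling statistics and an ADF test for a univariate series."""
    # Rolling mean/std: for a stationary series these stay roughly flat over time.
    rolling_mean = series.rolling(window=window).mean()
    rolling_std = series.rolling(window=window).std()
    print(rolling_mean.tail(), rolling_std.tail())

    # Augmented Dickey-Fuller test: a small p-value (< 0.05) suggests stationarity.
    adf_stat, p_value, *_ = adfuller(series.dropna())
    print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.3f}")
```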
This is the abstract of the system we are planning to build; some parts of it have been completed by our interns as well. We get the data from Ceph storage. The Prometheus metrics data was previously in InfluxDB, and we created data connectors. After the data connectors we should have a structured version of the data, such as a PySpark or Pandas data frame: before the connector everything is storage specific, after the connector it is the same. Then there are some transformation steps, and on the transformed data we run the forecasting algorithm; we will see shortly which transformation steps are required. Then the machine learning module comes into the picture. We used various kinds of models: one was Prophet from Facebook, another was TensorFlow with Keras, and Natasha is also working on some different models. After model building is completed we extrapolate into the future; that is the future prediction part. Then we store the results back to some secondary data storage, which can be Ceph or anything else, pull the data from that secondary storage and spin up alerts. The data science part ends at the future prediction; after that it is mainly the application layer. I will describe the data science part in more detail.

So what are the components of a time series? A real time series is not really "normal"; it has lots of abnormality in it. There are mainly four parts: trend, seasonality, cyclicity and irregularity. The trend is visible to the naked eye: you can see the time series pattern increasing over time, never decreasing within that window. Seasonality means the values increase and then decrease again after some finite time unit; it may not be visible to the naked eye, but you can detect it with statistical methods. Cyclicity is like seasonality, but irregular: seasonality happens at regular intervals, cyclicity does not. And irregularity is absolutely irregular behavior; you can see a unique spike that does not occur anywhere else in the past or the future. Any forecasting algorithm can predict the trend and the seasonality, maybe some part of the cyclicity, but never the irregularity. So irregularity is automatically abnormal behavior.

Let's see what the forecasting algorithms are. The first one is linear regression. Every data scientist starts from it, but unfortunately it is not that useful for this kind of series, where there is a trend in the data. For a normal regression problem you can fit linear regression, maybe with a quadratic model, but not for a time series. Next is exponential smoothing. As I showed, real time series data has an increasing trend, so the mean and standard deviation change over time. What exponential smoothing does is weight the recent data against the past data: the next value gives more weight to the recent past than to the distant past, so distant data points fade out and recent data points have the most effect on the future values. But exponential smoothing also has problems: it can forecast at most one or two steps into the future, so you have to recompute it every time, and it is also not that useful when the data size is very big.
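To make the weighting idea concrete, here is a minimal sketch of simple exponential smoothing with NumPy; the smoothing factor alpha and the sample values are assumptions for illustration, and the one-step forecast is simply the last smoothed value.

```python
import numpy as np

def simple_exponential_smoothing(series: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Return the smoothed series s, where s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = np.empty_like(series, dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        # Recent observations get weight alpha; older ones decay geometrically.
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# One-step-ahead forecast: the last smoothed value.
values = np.array([10.0, 12.0, 11.0, 13.0, 15.0])
forecast_next = simple_exponential_smoothing(values)[-1]
print(forecast_next)
```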
The Holt-Winters model adds trend and seasonal components to exponential smoothing, but you would not use its forecasts in real time, because they actually lag behind the real data by some lag units, like two or three lags. ARIMA was the most sophisticated model until about five years ago; now it is not. ARIMA has three parameters to tune in the normal case, and there is seasonal ARIMA as well, but all of these models do not cope with the big data situation; they fail to leverage big data.

So what are the more recent approaches from the machine learning world, and let's say deep learning, for time series? The first one is the Prophet model. For quick prototyping it is kind of a revolutionary model, created by the Facebook AI team, and it is completely open source, so there are lots of chances to contribute as well. Prophet is actually an additive regression model; it fits and predicts at the same time. The second one is the recurrent neural net. It is part of the neural net family and it learns the temporal features of the data, because it is recurrent in nature: whatever you feed into the model, it keeps track of what to forget and what to remember in order to predict the next output.

Let's see in more detail what Prophet gives us. Prophet can capture the trend and seasonalities in the data; it fits the data and extrapolates at the same time, and it gives us three or four kinds of plots. The left one is the fit and extrapolation together: the black dots are the actual data points of our time series, the continuous blue line is the predicted value, and the light sky blue band is the confidence boundary. You can see that where an anomaly was found, the confidence interval also jumps a little, so it keeps track of those anomalous data points. On the right side there are three plots, mainly the trend and seasonality plots. The real data has an increasing trend, with maybe some decrease at the end, and the first plot on the right shows the same: it is increasing throughout. The second one is the weekly seasonality: you can see that Wednesdays have much higher values than Saturday and Sunday. For the data I had when I made this graph, the system metric we were capturing got higher values on Wednesdays, probably because the system was at rest on the weekends. The last graph is the daily seasonality: it looks something like a sine wave, but don't think the data actually looks like that; it is the seasonality component of the daily pattern. What is happening is that every four hours the values shift, with higher values around 4 o'clock; I can't remember now whether that was a.m. or p.m., but it shifts every four hours.

All of these are very useful for time series, because when we look at the real data on the left we cannot see these patterns by the naked eye; these three plots help us identify them. For a system admin we can say that higher values on a Wednesday may not be an error or anomaly situation, because you usually get higher values on Wednesday, but if you get higher values on a Saturday or Sunday, maybe that is something to look at.
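Before moving on to the prediction part, here is a hedged sketch of how such a Prophet fit and the component plots could be produced with the open-source fbprophet package; the synthetic input frame, the forecast horizon and the frequency are assumptions, not taken from our pipeline.

```python
import pandas as pd
from fbprophet import Prophet

# Prophet expects a data frame with a 'ds' (timestamp) and 'y' (value) column.
df = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=7 * 24 * 60, freq="min"),  # synthetic timestamps
    "y": range(7 * 24 * 60),                                             # placeholder metric values
})

model = Prophet()                       # additive regression model
model.fit(df)

# Extrapolate, e.g. the next 60 one-minute steps (assumed horizon).
future = model.make_future_dataframe(periods=60, freq="min")
forecast = model.predict(future)        # yhat, yhat_lower, yhat_upper per row

fig1 = model.plot(forecast)             # black dots, blue fit, light blue confidence band
fig2 = model.plot_components(forecast)  # trend, weekly and daily seasonality plots
```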
Now the prediction part of Prophet. Here there is a lot more data: the blue spikes are the real data, the green line is the actual predicted value, that is y-hat, the red one is the lower confidence boundary and the purple one is the upper confidence boundary. The problem is that Prophet cannot predict that well when the prediction interval is very small, like one-minute intervals. This data is taken at one- to two-minute interval lags, and you can see it has missed almost all the spikes. That is the problem with Prophet, and that is why it is still an experimental model for us. We can try different Prophet parameters, but Prophet is mostly useful for finding the trends and seasonality in the data.

Next, the residual analysis. Residual plots are created from the actual minus the predicted values. The main hypothesis of residual analysis for time series is that after modeling, the residuals should follow a Gaussian distribution. If you cannot find a Gaussian distribution in your residual data, you are doing something wrong; if you do find the Gaussian, bell-curve pattern, you might be doing something right: your model has found the seasonality and maybe the trend, there may still be problems to work on, but the assumptions are right. The left plot is the actual residual, which I zero-centered by subtracting the mean and dividing by the standard deviation, and it shows the bell curve. That means we can use the Prophet model for time series, at least for the seasonality.
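A minimal sketch of that residual check; the synthetic arrays are placeholders, and in practice you would pass your aligned actual and predicted series.

```python
import numpy as np
import matplotlib.pyplot as plt

def standardized_residuals(actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    """Zero-center the residuals: subtract the mean and divide by the standard deviation."""
    residuals = actual - predicted
    return (residuals - residuals.mean()) / residuals.std()

# Placeholder data for illustration only.
rng = np.random.default_rng(0)
actual = rng.normal(size=1000)
predicted = actual + rng.normal(scale=0.1, size=1000)

# If the model has captured trend and seasonality, this histogram should look
# roughly like a bell curve centered at zero.
z = standardized_residuals(actual, predicted)
plt.hist(z, bins=50)
plt.show()
```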
Prophet also has shortcomings. There is a special spike list you can give to Prophet, a holiday list. Suppose you are predicting something like hotel reservations: on holiday dates and maybe special event dates there will surely be a high or low spike in the reservation series, and you have to give that data to Prophet manually. So if there are, say, ten such days in the event series, you have to adjust the model with them. That is a manual process, and we can't, and shouldn't, do that in a production system. So it is still experimental, and it may not be very accurate for minute-by-minute prediction.

So what is an RNN? The RNN gives us a good promise compared to all those previous models: it learns the temporal context better than any of them, and it can work with big data. Prerequisites: I have used Python 3.6+, TensorFlow 1.9 with Keras 2.2, NumPy and Pandas. Why Keras and not TensorFlow directly? Whoever is starting with deep learning should know that TensorFlow is the most widely used machine learning framework because of the community support and the examples, but it has some complexity around data dimensions: when you have multi-dimensional data you have to specify it correctly in the TensorFlow API, otherwise the model will not work. Keras does all of that automatically in its own internal layers and gives you a very simple architecture to implement any model; you can stack layer on layer very easily. If there is time I will show how to do that.

Now the RNN architecture. There are mainly two types of RNN units, GRUs and LSTMs: GRU is the gated recurrent unit and LSTM is long short-term memory. LSTM is better for time series because it can forget; I know that sounds a little contradictory to what I was saying, but truly it can forget what it needs to forget, and it has a very cool way to do that: it learns over time what to forget and what to remember.

Next, the data in supervised learning format. What we saw in the previous slides is a continuous series, maybe with float values; we need to convert it into a supervised learning format. Let's say we have a series x1, x2, x3, x4, and so on up to xn. We take x1, x2 as the past and predict x3 as the future; then x2, x3 as the past and x4 as the future. So it takes two consecutive items from the past and predicts the third item. I am not saying I actually used two values from the past, there were more like 50 or 60 values, but that is the technique for converting a time series into a supervised learning format, as sketched below.
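A minimal sketch of that sliding-window conversion, assuming a plain NumPy array; sequence_length plays the role of the 50 or 60 past values mentioned above, and the function name is illustrative.

```python
import numpy as np

def series_to_supervised(series: np.ndarray, sequence_length: int = 50):
    """Turn a 1-D series into (X, y) pairs: `sequence_length` past values -> next value."""
    X, y = [], []
    for i in range(len(series) - sequence_length):
        X.append(series[i : i + sequence_length])   # past window
        y.append(series[i + sequence_length])       # value to predict
    # Keras LSTMs expect input shaped (samples, time steps, features).
    return np.array(X)[..., np.newaxis], np.array(y)

X, y = series_to_supervised(np.arange(100, dtype=float), sequence_length=50)
print(X.shape, y.shape)   # (50, 50, 1) (50,)
```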
Train, validation and test split: whatever forecasting you are doing, whatever model you are applying, you always divide the data, for example in an 80/10/10 format, or 95/5 if you have big data.

Stationarity: in the first slides I showed that real time series are non-stationary, so there is no constant mean or standard deviation over time. You have to convert them somehow into a stationary series, and we do that by taking the one-lag difference: x2 minus x1, x3 minus x2, and so on. These are all the transformation steps in our data pipeline.

In the image of the recurrent neural net you can see it is an n-to-1 recurrence: it takes n inputs and gives us one value, just what we want for our time series. x1, x2 and so on are the time steps, you could say the features; those are the hidden layers they pass through, and then it predicts y.

Now the LSTM part. The LSTM is a little more sophisticated than the plain RNN; on the left you can see the picture of a plain RNN, and the LSTM is a bit more complex. The complexity comes from the input gate, forget gate and output gate; as I said, it learns how to forget, so the previous time steps are forgotten according to its optimized parameters. In the mathematical equations, f_t is the forget equation, roughly f_t = sigmoid(W_xf * x_t + W_hf * h_(t-1) + b_f). W_xf means the weight going from X to F; weights in a neural network work like that, the suffix XF says the connection goes from the X layer to the F layer, and W_xf is that weight matrix. When the algorithm learns, it optimizes the W. The forget gate is responsible for removing information from the cell state. What you see on the right is a single cell of an LSTM; multiple cells, multiple neurons, are connected together. A single cell is updated from the previous time step, that is h_(t-1), the current input, and the weights that have been optimized. These five equations are mainly the forward propagation part of the algorithm; I did not show the backward propagation, but that is how these equations get optimized, by partial derivatives.

Now some hyperparameters to tune before using an LSTM; if you just drop an LSTM block in front of a time series it may not work well, because there are lots of knobs to be tuned. I have listed six major ones, and you always have to do some cross-validation over them.

Sequence length is how many past time steps are required to predict the future. I have used 50 time steps, which means 50 minutes of data are used to predict the next time step. Next is the mini-batch size. If you don't use mini-batches and you have, say, one billion records, then before doing any optimization of the weights the algorithm will look at all one billion records, then do its first epoch and update the weights; that slows down the learning, and the weights don't learn enough to reach good values. We should always use mini-batches: we can work with data much bigger than our RAM, and the learning algorithm optimizes much faster. And we should always use a power-of-two batch size, because our computer architecture works best with powers of two. Optimizers are the optimization functions for the learning part: mainly we have RMSprop, Adam and Adagrad. The basic one is gradient descent, and RMSprop, Adam and the rest are built on top of it with some modifications. Most of the time, in my experience, Adam does the job better than vanilla SGD, but RMSprop is also very good for regression problems; I did the cross-validation and Adam won that time. Activation functions: normally a neural network only does linear transformations, multiplying some numbers with other numbers, and that doesn't learn anything. Without activation functions, however many hidden layers you stack, it will always be a linear transformation; when you put in an activation it can learn complex patterns, because that is the non-linearity. We mainly have five activation functions: tanh, ReLU, Leaky ReLU, sigmoid and linear. I will only describe ReLU and Leaky ReLU, because sigmoid you probably all know as the activation for binary classification problems, and linear is nothing: whatever the value of x, it outputs x. ReLU and Leaky ReLU are a little more interesting. ReLU will not give you any negative value: after the transformation with the weights, if a value comes out negative, ReLU dumps it to zero, which shrinks the representation. Leaky ReLU, keep this in mind, instead gives a very small negative number, so it does not completely dump the input. In image processing you should always use ReLU, because apart from the object there is a lot of noise in an image and we need to dump a lot of data; in a regression problem that dumping would just shrink our feature space, so we don't want it there. Finally, regularization with dropout; I will quickly show what dropout is. On the left is a standard neural net where all layers are fully connected to each other; on the right is the dropout version. Dropout means the framework will drop some neurons with a given probability: if you set dropout to 0.2, it will drop 20% of the neurons from every layer, and that gives us better regularization.
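To make the ReLU versus Leaky ReLU difference and the dropout idea above concrete, a tiny NumPy sketch; the 0.01 slope is just a typical illustrative value, and the dropout helper is conceptual (real frameworks also rescale the surviving activations).

```python
import numpy as np

def relu(x):
    # Negative values are dumped to zero.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative values are kept, but scaled down to a very small number.
    return np.where(x > 0, x, alpha * x)

def dropout(activations, rate=0.2):
    # Randomly zero out `rate` of the neurons (training time only).
    mask = np.random.random(activations.shape) >= rate
    return activations * mask

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.   0.   0.   1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5  ]
print(dropout(np.ones(10)))   # roughly 20% of the ones become 0
```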
Now let's look at the structure of the LSTM network. It starts from the first LSTM unit, which is the input unit, and as shown there is a dropout layer between every two connected layers, LSTM 1, LSTM 2 and LSTM 3. What you are seeing is the forward pass of the algorithm: it goes from LSTM 1, then dropout, then LSTM 2, then Leaky ReLU, the activation function, then again a dropout layer, then the third LSTM unit. The green ones are the hidden units, and at the end there is a dense layer, a fully connected layer, from which the regression output comes. The loss function is calculated at the end of the forward propagation, and then the backward propagation starts; the yellow lines are the backward propagation, which is the learning part of the algorithm.

This is the model training part. The first plot is the training loss; in the ideal world it always decreases as the number of epochs increases. The plot below it is the loss on the validation data. The training loss will always look this decent, but the validation loss will not: there are local minima in the data, and as you can see both graphs fluctuate a lot, so when the training loss is at a minimum the validation loss may not be behaving the same way. That is why we always need multiple iterations of the algorithm as well as the regularization part. These plots were made with TensorBoard.

Now comes the prediction part. As we saw, Prophet was not predicting the spikes well at every one-minute gap; here the LSTM could find the spikes at one-minute intervals. It also missed some of the higher towers, and those are our alert situations: we can club them over, say, five-minute intervals and compare those towers with the predicted values to generate alerts. On the right is the residual analysis; as with Prophet it gives us a normal distribution of the residuals, that is actual minus predicted, and the ribbon below the plot is the data density. Roughly 95% of the data should lie within two standard deviations of the mean, and the residual plot shows about that much density, so we can at least say our implementation is finding the right tune to forecast the future.

What can we do from here? As Stephen was saying, we have a common AI library where you just put in your data and get some result out; you don't have to build the model from scratch just to get some ideas, so we can contribute this model there. Then we can also move from n-to-1 prediction, where there are n inputs and we predict just one value, to sequence-to-sequence prediction, which requires an autoencoder architecture. We can use deeper layers: I used only two hidden layers and it still took about three hours to train the whole model and predict the output, so deeper layers require more powerful hardware. And we can increase the feature space: I used only a single time series and created the past data from it, but we could use correlated data points from other time series as well, since some operations have an impact on other series, and make better predictions from that.

If we have some time left I can show the implementation part. Do we have time left? Okay. This whole code is hosted on GitHub, so you can get it from the presentation; I want to quickly show something here. This is the model architecture. As I said, Keras does it very efficiently: you can stack layer over layer and it is as simple as that. You obviously need some idea of what you are doing, but the implementation is very easy. You can see that after every layer there is a dropout layer with rate 0.2, which means it will drop 20% of the neurons from every layer to reduce the overfitting problem. But you should use dropout only while fitting the model: when you are predicting the future you should not use dropout, because you have no idea which neurons it will drop, it is just random. So I have kept a parameter predict: at training time, when it is false, the dropout layers are added; at prediction time they are all removed and only the LSTM units remain. A rough sketch of such a model-building function is shown below.
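This is a minimal sketch of how such a stacked Keras model might be put together, not the exact code from the repository; the layer sizes, the predict flag and the batch-size handling are assumptions for illustration (Keras 2.x style with a TensorFlow backend).

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, LeakyReLU

def build_model(sequence_length=50, batch_size=256, predict=False):
    """Stack LSTM -> Dropout -> LSTM -> LeakyReLU -> Dropout -> LSTM -> Dense.

    With predict=True the dropout layers are left out and the batch size is 1,
    so the net can score one sequence at a time.
    """
    if predict:
        batch_size = 1
    model = Sequential()
    # Stateful LSTMs need a fixed batch_input_shape: (batch, time steps, features).
    model.add(LSTM(64, batch_input_shape=(batch_size, sequence_length, 1),
                   return_sequences=True, stateful=True))
    if not predict:
        model.add(Dropout(0.2))            # drop 20% of neurons during training only
    model.add(LSTM(128, return_sequences=True, stateful=True))
    model.add(LeakyReLU())
    if not predict:
        model.add(Dropout(0.2))
    model.add(LSTM(64, stateful=True))
    model.add(Dense(1))                    # fully connected regression output
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model
```

At prediction time one would presumably rebuild with predict=True and load the checkpointed weights into the new net (for example with model.load_weights), though the exact mechanism in the repository may differ.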
The next part is callbacks. Callbacks are a special kind of method in Keras; they help us store the best result during training, some model metadata, and the logs. I have used mainly three or four callbacks here. The first one is the state reset. As I was saying, we are doing mini-batch gradient descent for the learning: within a single epoch the data is divided by the batch size, and for the first batch the network is initialized with random weights; those weights are then updated by the learning algorithm, and the next batch does not start from random weights, it starts from what the first batch produced. So by the end of an epoch the weights have been optimized as many times as there are batches. But the LSTM also accumulates internal state across those batches, and after an epoch completes we should clear that accumulated state from the model, otherwise it carries over and we end up overfitting; resetting the states does that, and I created a custom class that is called after every epoch. The next one is the model checkpoint. The checkpoint is like the heart of deep learning: when you are training, it is evident that not every epoch will have the lowest validation loss, so the checkpoint makes sure that out of, say, 100 iterations we store the best weights for future use. Maybe the best result came around epoch 50; it will store that and discard the worse results. TensorBoard is for log visualization; the architecture diagram I showed was created from TensorBoard. And the reduce-learning-rate callback: during learning, the weight matrix may stop improving for a few iterations, so if the validation loss has not reduced for, say, five iterations, we should reduce the learning rate; that is part of hyperparameter tuning as well. The parameters here are the monitor, which is the validation loss; the factor, 0.9, which means that if the validation loss has not improved after five epochs the learning rate is reduced by 10%; the patience, which is those five iterations; and the minimum learning rate, below which it should not go. A sketch of these callbacks is shown below.
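A hedged sketch of what those callbacks might look like in Keras 2.x; the file paths, the custom reset class and the fit arguments are illustrative assumptions, but ModelCheckpoint, TensorBoard and ReduceLROnPlateau are the standard Keras callbacks being described.

```python
from keras.callbacks import Callback, ModelCheckpoint, TensorBoard, ReduceLROnPlateau

class ResetStatesAfterEpoch(Callback):
    """Clear the accumulated LSTM states (not the learned weights) after every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        self.model.reset_states()

callbacks = [
    ResetStatesAfterEpoch(),
    # Keep only the weights from the epoch with the lowest validation loss.
    ModelCheckpoint("best_weights.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
    TensorBoard(log_dir="./logs"),  # loss curves and graph visualization
    # If val_loss has not improved for 5 epochs, shrink the learning rate by 10%.
    ReduceLROnPlateau(monitor="val_loss", factor=0.9, patience=5, min_lr=1e-6),
]

# model.fit(X, y, epochs=100, batch_size=256,
#           validation_split=0.1, shuffle=False, callbacks=callbacks)
```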
This is the model architecture at training time: as you can see, 256 is the batch size and 50 is the input sequence length. When you are choosing the model architecture you should make sure the feature space doubles in the hidden layers and then reduces at the end, at the layer from which we take the output; everything before that, LSTM 2, LSTM 3, are hidden layers, and we cannot get at the state of the model directly.

Now we will see the prediction net. It has exactly the same architecture as the training model except for the dropout layers; as I told you, you cannot use the dropout layers in the prediction net. We also need a separate prediction net because at training time we used a batch size of 256: if you use the same model to predict the future, it will always require the test set size to be perfectly divisible by 256, and if you give it 257 test samples it will throw an error. That is why I created a new net with the same architecture but batch size 1; in block 48 you can see the batch size is 1, and here you can feed one sequence at a time.

That is more or less all I had to show. Any questions? The code is available, I have given the link with the presentation, so you can access it and run it. So that's it, thank you.