Wow, it's a full house. Thank you, everyone, and welcome to the talk on real-time anomaly detection on telemetry data using neural networks. I am Keshav Peswani and I work as a senior developer at Expedia Group. My areas of work include building deep learning models and distributed systems.

Before I delve into this complex topic of real-time anomaly detection, if there is one thing you can take away from this presentation, it is this: how do you increase the revenue of your firm's business by reducing the mean time to know? We will see how we can achieve this as we proceed through the presentation. Mean time to know is the time difference between when an incident occurs and when we get to know the root cause of that incident, and more money saved for the firm is more money in our pockets. That was the key agenda I had when I was building this system.

So, let us start with the agenda of the talk. We will first go over what observability and Haystack are and how they led to our problem statement. We will then look at the system architecture of the anomaly detection system and how the algorithm leverages it. We will then see what data sets we have and the results we got while running the algorithm on those data sets, and finally we will wrap up with the future prospects of the algorithm.

So, let us start with Haystack. What is Haystack? Haystack, in a single line, is Expedia's open-source distributed tracing solution that helps in detecting problems in a microservice architecture. Sounds complex? Well, it is not. Let us take an example. Say someone searches for a hotel at Expedia.com. To serve that one single customer request there is a plethora of microservices running behind the scenes, such as the content service, pricing service, and inventory service. What Haystack does is help visualize the entire customer request in a waterfall view; you can see this in the top-left diagram. The entire customer request is called a trace, and the interacting services within that request are called spans. Haystack also aggregates this data to form vital service health trends. These health trends can be the failure counts for a service or its latency metrics, and they help depict the behavior of the service.

Now, what happens if one of the crucial microservices, such as the pricing service, starts failing? The customer who is trying to do that hotel search at Expedia will either get an incorrect result or see a complete failure. This would be a bad customer experience and ultimately a loss of business for us. The problem compounds when there are thousands and thousands of customers trying to do the same thing and getting the same experience. Now, it is simple to spot these failures when there are only a couple or a handful of microservices running in the org. But at the scale of 100 billion spans a day, which translates into millions of time series, it is almost impossible for our developers and our operations teams to look for these failures manually. So, we need an intelligent and smart alerting mechanism that detects these deviations in trends and alerts the required teams. This paved the way for our problem statement. The problem statement is very simple: we need an automated anomaly detection system for univariate time series that can do the anomaly detection in near real time.
These univariate time series are nothing but the failure counts and the latency metrics produced by the Haystack system itself. But there are many challenges when we try to build an industrial-grade anomaly detection system. The algorithm needs to span millions of time series, so labeling the data for anomalous points is almost impossible; hence, the algorithm needs to work in an unsupervised or semi-supervised manner. The hyperparameter tuning for these models should happen in an automated fashion, with no human intervention involved at all. Onboarding a time series or a metric should be as simple as the click of a button. The algorithm and the system should be efficient not just in terms of precision and recall, but also in how fast they can do the anomaly detection. The system should be cost-effective as well: both training and inferencing should be cheap. And finally, the system should have a provision for users to give feedback if necessary and tune its models so as to improve the efficiency of the system.

So, with all these challenges in mind, we built an anomaly detection system. Let's see what the architecture looks like. This is a high-level overview of the entire architecture of the anomaly detection system. I'll take the same example of the failing pricing service and we'll see how anomaly detection works for it. The architecture is divided into two parts: one is the deep learning automated pipeline and the other is the online compute piece. In the deep learning automated pipeline, we have a cron job that fetches the failing pricing service trends from a data store and pushes them into S3. As soon as the data is pushed into S3, a Lambda gets triggered and checks whether training should happen or not. If the answer is yes, a training job is triggered; it does all the hyperparameter tuning and the training, and the final model that is produced is deployed back to SageMaker behind an endpoint. This can then be leveraged by the online compute piece. The online compute piece is a Kafka Streams app that listens to a Kafka topic for the failing pricing service trends. This app leverages the model deployed in SageMaker to compute the anomalies, and if the points are anomalous, they are sent to the alert management system so the teams can be notified. We also have a UI that helps us visualize these anomalies, and there is a provision in the UI to provide feedback so that the model gets improved on a timely basis. So, this is how the entire architecture of the system looks.

You might be thinking that it is very simple, but there are certain hidden jewels in this architecture and I would like to highlight a few of them. Normally, when we deploy each model in SageMaker as its own container, it is very, very costly, but our challenge was to make the system cost-efficient. So what we did was collage all the models together and run them as a single container. This made the system cost-efficient, but there was a trick involved. Most of you who use Keras or TensorFlow will know that the problem with Keras or TensorFlow at inference time is that you cannot naively load multiple models in a single Python process. What we did was tweak the TensorFlow/Keras graph handling and use multiple graphs and sessions so that multiple models can be loaded in a single Python process.
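As a rough illustration of that trick, and not the actual Expedia/Haystack code, here is a minimal sketch assuming TensorFlow 1.x with the standalone Keras API, where each model gets its own graph and session inside one process; the model names and file paths are hypothetical.

```python
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

class ModelPool:
    """Holds many Keras models in one Python process, each in its own graph/session."""

    def __init__(self):
        self._models = {}

    def add(self, name, model_path):
        graph = tf.Graph()                      # isolate each model in its own graph
        with graph.as_default():
            session = tf.Session(graph=graph)   # ...and its own session
            with session.as_default():
                K.set_session(session)
                model = load_model(model_path)
        self._models[name] = (graph, session, model)

    def predict(self, name, features):
        graph, session, model = self._models[name]
        with graph.as_default():                # switch to the right graph before inference
            with session.as_default():
                return model.predict(features)

# Hypothetical usage: one container serving many per-metric models behind one endpoint.
pool = ModelPool()
pool.add("pricing-failure-count", "/opt/ml/models/pricing_failure_count.h5")
pool.add("pricing-latency-p99", "/opt/ml/models/pricing_latency_p99.h5")
```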
This trick makes the system more cost-efficient. Next, we use Kafka Streams to manage the state of the model in a state store. This helps us bring the application back to the same state it was in after any restart, any horizontal scaling, or the deployment of a newer version of the app. Kafka Streams also has an additional advantage: it runs as an embedded application library, so we do not need a dedicated cluster for the streaming part, like a Spark or Flink cluster. The overhead of maintaining such a cluster is gone, which makes the solution even more cost-efficient. We do a blue-green deployment of the newer model alongside the older one. This ensures that there is always a model running behind that SageMaker endpoint so the online compute piece can leverage it, which in turn ensures that the anomaly detection always happens in a near-real-time manner. So, these are the major hidden jewels in the architecture.

You must now be wondering: the architecture looks good, but how does the algorithm leverage it? Let's see that now. The anomaly detection methodology comprises five steps: training an adaptive LSTM model, using that trained model to forecast the next point in the time series, detecting whether the actual value is anomalous or not, providing a way for humans to give feedback, and finally repeating all four steps in a cyclic manner. Let's see these in a bit more detail.

We use recurrent neural networks, particularly LSTMs, to forecast the next point in a time series, and LSTMs are very well suited for this task. I will assume that most of us here know LSTMs or have listened to Catherine's talk about what LSTMs are, so I won't go into those details; what I will go over is how we trained an adaptive LSTM that could span millions of time series. We started by removing the anomalous data from the training data. This can be done using the human feedback given to us, as well as statistical methods such as the elbow method or a normal-distribution cutoff. This is done so that the LSTM doesn't learn from the anomalous data. We use SELU, the scaled exponential linear unit, as the activation function so that the problems of exploding and vanishing gradients are mitigated. This is one of the common problems when we try to train an LSTM; most of us face it, and it is the reason our LSTMs don't converge and we say they don't work, when the actual cause is exploding and vanishing gradients. We use continuous coin betting, or COCOB, as the optimizer. This is an optimizer without a learning rate; it helps the model converge faster, and since it doesn't have a learning rate, there is one less hyperparameter to tune. We do the hyperparameter tuning for the rest of the parameters, such as the number of hidden layers, the number of neurons per hidden layer, the dropouts, and various other hyperparameters, using Bayesian optimization. Bayesian optimization creates prior distributions for these parameters and uses those distributions to explore the objective, or loss, function. What we get in the end is the best set of parameters in a minimal number of iterations. This is more appealing than brute-force ways of doing hyperparameter tuning, like GridSearchCV in the scikit-learn library. A minimal sketch of this training setup follows.
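This is a hedged sketch of the training step just described, loosely following those ideas rather than the production pipeline: an LSTM forecaster with SELU activation plus an automated hyperparameter search. Assumptions: COCOB is not bundled with Keras, so Adam stands in for it here; hyperopt's TPE stands in for the Bayesian optimizer; the sine-wave data and every name are illustrative.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from hyperopt import fmin, tpe, hp

WINDOW = 32

def make_windows(series, window=WINDOW):
    """Turn a 1-D series into (samples, window, 1) inputs and next-value targets."""
    x = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return x[..., np.newaxis], y

# Synthetic stand-in for a Haystack failure-count series.
series = np.sin(np.linspace(0, 60, 2000)) + np.random.normal(0, 0.05, 2000)
x_train, y_train = make_windows(series[:1500])
x_val, y_val = make_windows(series[1500:])

def build_forecaster(num_layers, units, dropout):
    model = Sequential()
    for i in range(num_layers):
        kwargs = {"input_shape": (WINDOW, 1)} if i == 0 else {}
        model.add(LSTM(units, activation="selu",
                       return_sequences=(i < num_layers - 1), **kwargs))
        model.add(Dropout(dropout))
    model.add(Dense(1))                          # forecast the next point of the series
    model.compile(optimizer="adam", loss="mse")  # COCOB would plug in here instead
    return model

def objective(params):
    model = build_forecaster(**params)
    model.fit(x_train, y_train, epochs=5, batch_size=64, verbose=0)
    return model.evaluate(x_val, y_val, verbose=0)   # validation MSE to minimize

space = {
    "num_layers": hp.choice("num_layers", [1, 2]),
    "units": hp.choice("units", [16, 32, 64]),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print("best hyperparameters (indices for choice params):", best)
```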
Finally, we check whether the deployed model is a good fit against a validation data set. If it is not, retraining is automatically triggered. So, now that we have a model trained and deployed in SageMaker, we use it to forecast the next point in the time series. We compare this forecasted value with the actual value to compute something called an anomaly score. As you can see on the slide, the anomaly score is the absolute difference between the actual value and the forecasted value. This is a very important concept for us, since the anomaly score helps us determine whether the actual value is anomalous or not. Let us see how. At any given instant we have a pool of anomaly scores with us, and we use this pool to fit a normal distribution. Now, at any time t, we check whether the new anomaly score lies within that normal distribution or not. If the answer is yes, the score lies within the distribution, which means the actual value is non-anomalous and the score can be pushed back into the pool of anomaly scores; the green region signifies the non-anomalous data. However, if the anomaly score lies outside that normal distribution, the actual value is anomalous and can be passed on to the alert management system so that the teams can be notified. So, this is how the anomaly detection algorithm works in action; a small sketch of this scoring-and-thresholding step follows this passage.

Now, you must be thinking that the architecture looks good and the algorithm looks good, but what about the results? Because everyone in the data science world talks about results. So, let us see what data sets we tried the algorithm on and what the results were. Our first data set was a KPI data set released by the AIOps data competition. This data set has time series from various companies such as Alibaba, eBay, and others. We tried the algorithm on this data set; the comparison is shown on the slide, where our model is the RNN-plus-statistical model, and we got an F1 score of 0.7, which is pretty good considering that the time series came not just from one company but from several. The next data set is our Expedia Haystack time series data set. This is nothing but the actual failure counts and latency metrics from the Haystack system itself. I will divide the results into two parts: the business-level metrics and the model-level metrics. From a business point of view, the most important metric we can look at is the time to know. Before I give the results for the time to know, let me tell you one thing: even an improvement in time to know of just a few minutes bumps up the revenue by thousands of dollars. So, is anyone here willing to guess what the improvement in time to know would be? 20 percent? I will talk in absolute terms: our average time to know has improved by 26 minutes, meaning we could catch or predict incidents 26 minutes earlier than with the previous baseline solutions. That means we have now increased revenue by thousands of dollars, but at what cost? Which brings me to the next important metric from a business point of view, which is cost, because everyone here wants to know the cost of running the model.
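As referenced above, here is a minimal sketch of the scoring-and-thresholding step, assuming a mu + 3*sigma cutoff over a pool of recent scores (the multiplier and all names and numbers are illustrative); in the real system the forecast would come from the SageMaker endpoint.

```python
import numpy as np

def check_point(actual, forecast, score_pool, sigma_multiplier=3.0):
    """Return (is_anomalous, anomaly_score) and grow the pool with normal scores."""
    score = abs(actual - forecast)                  # anomaly score = |actual - forecast|
    mu, sigma = np.mean(score_pool), np.std(score_pool)
    is_anomalous = score > mu + sigma_multiplier * sigma
    if not is_anomalous:
        score_pool.append(score)                    # non-anomalous scores rejoin the pool
    return is_anomalous, score

# Example: a pool of recent scores and one new observation.
pool = [0.8, 1.1, 0.9, 1.2, 1.0, 0.7, 1.3]
flagged, score = check_point(actual=42.0, forecast=10.5, score_pool=pool)
print(flagged, score)   # True, 31.5 -> would be sent to the alert management system
```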
So, again, is someone willing to guess the cost of running this model on a daily, weekly, or monthly basis? Yeah, pretty close. It is less than 3 dollars a month; that is the cost of running this model, and that is less than a cup of coffee nowadays. So you can see how efficient the model is. I will just pause here for a moment for you to digest these numbers; they are fascinating numbers, as even I found. Yes, exactly, the cost of development: once revenue has increased and the model keeps retraining itself, the cost of development goes away, because everything is automated. We have around 1,000 models running for our time series; there are certain figures I cannot share. For the model-level metrics, we have 0.7 as the F1 score, and the latency to detect whether a point is anomalous or not is around 205 milliseconds, which means the model runs in near real time; most of that latency is the network latency between our compute piece and SageMaker, which is around 187 milliseconds. Here are some screenshots of the model in action: you can see that the model detects the deviations quite easily, and the red region is the anomalous region. I will answer this question after the talk.

Now, the future prospects. We have an LSTM plus a statistical model that does the anomaly detection for us, and as I told you, running it on the KPI data set gave us an F1 score of 0.7. What we plan to do is remove the statistical piece and replace it with reward-based networks. The reason is that we feel the normal distribution does not do justice to all time series; not every time series fits a normal distribution. We ran an experiment with the same KPI data set and saw improvements, which is why I am talking about the future prospects of this algorithm: with the combination of the LSTM and the reward-based networks we got an F1 score of 0.84. This result is very promising, but we still need to try it at our scale and on our own data set to ensure it becomes a benchmark for the industry. So, I hope you now have an understanding of how the anomaly detection system works in action and how we can leverage it to reduce the mean time to know and grow the firm's business by increasing its revenue. With this, I will conclude my talk. Thank you.

So, I will first finish answering that question. You were asking how we cater to false alarms. There are two things here. One, any model you train in the data science world is never 100 percent accurate. That is the reason we have a human-intervention loop that takes human feedback: if a false positive comes in, we take the feedback and adjust the model in near real time, not as a batch job. The second thing is the Haystack system itself; because of a shortage of time I did not explain that when we get anomalies, say there are thousands of metrics running and we get around 20 anomalies,
it is not that we go and alert all 20 teams. With the Haystack system we already have a graph, so we know exactly which service is the root cause of all those 20 failing metrics. Although the system became noisy because of one root cause, we now know what that root cause is and we alert only that team, because once that team fixes its service, the alerts go away for all the teams. That graph is already stitched together, as I showed you in the first slide where Haystack visualizes which services interact with each other.

As for how many alerts we get on an hourly or monthly basis, it depends on various factors: whether a deployment or a release is happening, whether a service is going rogue because of some GC issue, and so on. There is a plethora of factors, so I cannot say that today you will get 10 alerts and tomorrow you will get 20; you cannot comment on that, because there are thousands of microservices, everyone is doing their own work, and nobody has a full picture of how these things stitch together and where the root cause lies. So it is almost impossible for us to say how many alerts we will get on a given day, because it is all dynamic.

On the anomaly scores: we take the pool of scores that we have and simply fit a normal distribution over it. No, no, we have it with ourselves: we get an actual value, we get a forecasted value, and then we can compute the anomaly score. We use those scores to formulate a probability distribution. It is just like having, say, an array of 5,000 scores and creating a distribution out of it; it is a mu-plus-3-sigma, mu-plus-sigma kind of thing. Yes, exactly.
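To make that last answer concrete, here is a hedged sketch of the score pool it describes: a bounded pool of recent anomaly scores from which mu and sigma are recomputed. The 5,000-score size matches the example in the answer, and the class and method names are illustrative rather than the production code.

```python
from collections import deque
import numpy as np

class ScorePool:
    """Rolling pool of anomaly scores used to fit the normal distribution."""

    def __init__(self, max_size=5000):
        self.scores = deque(maxlen=max_size)   # oldest scores drop off automatically

    def add(self, score):
        self.scores.append(score)

    def threshold(self, sigma_multiplier=3.0):
        mu, sigma = np.mean(self.scores), np.std(self.scores)
        return mu + sigma_multiplier * sigma   # the mu + 3*sigma cutoff from the answer
```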