Hello everyone. Welcome to Observability Day Europe 2023. The topic we have today is scaling an observability stack for AIOps. I am Ravi Hari, and I work at Intuit as a Principal Software Engineer. Briefly about Intuit: Intuit was founded in 1983 and went IPO in '93. We have about 19 locations, and I work from the Bangalore office. We had about $12.7 billion in revenue last year, we have more than 100 million customers, and we contribute to multiple open source projects.

Coming to the topic: Prometheus metrics are used for various purposes, and with Prometheus being the de facto standard in Kubernetes, they can be leveraged for many use cases. For example, monitoring time series data, alerting, autoscaling, and anomaly detection are common use cases for Prometheus metrics. The problem we had to solve was that we wanted to generate anomaly scores using AIOps tools, leveraging Prometheus metrics data. The thing is that the metrics a Prometheus instance collects are good for a few hours, but we cannot store the data in Prometheus instances for days together. Prometheus also does a lot of other things, like scraping, remote writing, and serving queries, because of which we cannot use it directly as long-term storage and retrieve the data from it. We had to choose other components.

So we looked at the available solutions, felt Thanos was a great fit with Prometheus, and started integrating Prometheus with Thanos to scale horizontally. Thanos has a number of components like Query, Store, Sidecar, Ruler, and so on. One thing we really liked was the Thanos Sidecar: with it running alongside Prometheus, we can write the Prometheus TSDB data over to S3 buckets. So this is the high-level picture of Thanos, where the Thanos Sidecar runs along with the Prometheus instances and writes the TSDB blocks to centralized object storage, S3 buckets in our case. Thanos Query can be used to query the data using Thanos Store as a gateway: Thanos Store reads the data from the S3 buckets and passes it on to Query, which can be leveraged by the clients. Additional components like the Ruler and the Compactor are also helpful. The Compactor is useful for compacting the data in object storage, and the Ruler can be used for evaluating rules across multiple Prometheus instances so that we get consolidated data.

So we decided to use S3 buckets to store the data; about eight days of data is good historical data to query, and we run anomaly detection on top of those metrics. We retrieve this data from Thanos Store and pass it on to Thanos Query, as we have seen. The architecture we initially built essentially looks like this. For querying the live metrics, we went directly to Prometheus; we didn't have too much traffic on that side. But we were anticipating more traffic towards the Thanos long-term storage metrics, because we wanted to run this anomaly detection process on a given cluster. This is the part where the pipeline queries the Thanos Query S3 service; we call these the Thanos Query S3 pods because the data is getting retrieved from S3, and we are keeping that data for three years. These Thanos Query S3 pods get the data from Thanos Store. Thanos Store is a StatefulSet that can also be horizontally scaled, and it calls the S3 buckets and retrieves the data from them.
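To make that read path concrete, here is a minimal sketch of how an AIOps job could pull a few days of history through the Thanos Query S3 service, since Thanos Query exposes the Prometheus-compatible range-query API. The service URL, port, and PromQL expression here are illustrative assumptions, not the actual ones we use.

```python
# Minimal sketch: pull eight days of a metric through Thanos Query's
# Prometheus-compatible /api/v1/query_range endpoint for anomaly scoring.
# The endpoint URL and the PromQL expression are hypothetical placeholders.
import time
import requests

THANOS_QUERY_URL = "http://thanos-query-s3:10902/api/v1/query_range"  # assumed service name/port

end = int(time.time())
start = end - 8 * 24 * 3600  # eight days of history

resp = requests.get(
    THANOS_QUERY_URL,
    params={
        "query": "sum(rate(http_requests_total[5m]))",  # example PromQL, not the real query
        "start": start,
        "end": end,
        "step": "5m",  # one sample every five minutes
    },
    timeout=120,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]  # [{"metric": {...}, "values": [[ts, "value"], ...]}, ...]
```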
So that is the initial model we implemented, and we have seen a number of challenges with it.

This is the first challenge we saw. We thought everything would work with the previous picture itself, but it was working only for the add-on namespaces, the namespaces where we run the Kubernetes controllers; it was not working for us in the application namespaces. The reason is that our application namespaces have high security constraints, and because of this we have network policies in place so that no two application services can contact each other. So we cannot query a service in an add-on namespace, or any other namespace as such, from an application namespace. Then how can we query this data from the AIOps pipeline running in those namespaces? We thought of a solution: an internal ALB, which we put in place in front of this Thanos Query S3 service. The application namespace which is running the AIOps pipeline can then query the internal ALB to retrieve the data, and that eventually calls the Thanos Query S3 service and gets the data from the S3 buckets. So this pipeline can work, is what we figured out, and we thought it would solve all our problems. Then we started running into other issues.

The second issue came when we started load testing. We found Thanos Query failures, with errors coming up quite frequently, and we were also seeing errors on the clients. The error we saw in Thanos Query was the chunk pool being exhausted. We initially thought this was a Thanos Query problem, but later we realized it was essentially a problem coming from Thanos Store. When we looked into the possible options we could configure on Thanos Store, we came across two critical ones: one is the chunk pool size configuration, and the other is the index cache size configuration. For the chunk pool size, we started playing around with different values, and even when we gave around 30 GB, it was not sufficient; we were running out of it under a high-volume load test. Then we decided we needed to keep this minimal, because otherwise the Thanos Store pods are going to need a very large amount of memory, so we tuned it down to a reasonable value. The other thing we found beneficial is the index cache size. This caches the index data from the TSDB blocks, which holds the information about where the chunks are stored in S3. Enabling it optimizes the speed of fetching the chunk data stored in S3, because Thanos Store doesn't have to retrieve the index file every time; it is already in the cache. These two options helped us alleviate the chunk pool exhaustion problem we were seeing in Thanos Store.
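To make those two Thanos Store options concrete, here is a rough sketch of the store container arguments, written as a Kubernetes container-spec fragment in Python dict form (as you would pass to the Kubernetes Python client). The sizes and the image tag are illustrative assumptions, not the values we run in production.

```python
# Rough sketch of a Thanos Store container with the two flags discussed above.
# --chunk-pool-size caps the in-memory pool used for chunk data fetched from S3;
# --index-cache-size keeps TSDB index data in memory so it is not re-fetched
# on every query. Sizes and image tag are illustrative assumptions.
thanos_store_container = {
    "name": "thanos-store",
    "image": "quay.io/thanos/thanos:v0.32.0",  # assumed version
    "args": [
        "store",
        "--objstore.config-file=/etc/thanos/s3.yaml",  # S3 bucket configuration
        "--chunk-pool-size=8GB",
        "--index-cache-size=2GB",
    ],
}
```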
After that, we found another issue: the ALBs were returning 504 Gateway Timeout errors. When we were running the load test initially we didn't see this, but after a certain point, once we had fixed the chunk pool size configuration and so on, we saw the ALBs giving us 504 Gateway Timeout errors. We looked into it and figured out that the requests through the ALB were timing out. There is a configuration on the ALB called idle timeout, and by default it is set to 60 seconds. But because we are querying eight days of data, the amount of time taken to retrieve the data from S3 and pass it back through the ALB is more than 60 seconds under load. That is why we were not able to get the metric data back: the connection was getting terminated after 60 seconds and we were getting a 504.

So the fix for this is to increase the idle timeout on the ALB, and after that we no longer saw the 504 Gateway Timeout errors. Then we thought we were set, but we ran into another problem, and this one is a 502 Bad Gateway, again on the ALB. We thought it had to do with some timeout settings and other things, but later we found that as we increased the load, the Thanos Store memory was getting exhausted quite fast. At the same time, the memory on the node was also getting exhausted and reaching 100% quite fast, and then the Thanos Store pods became inaccessible. When Thanos Query S3 tried to query Thanos Store, it got connection refused, and with Thanos Query getting connection refused, the ALB gave us a 502 Bad Gateway.

Then we thought, hey, whatever we are doing is not right, because we are querying too much data at once and that is overwhelming Thanos Store. We need to somehow reduce the amount of data we query at once from Thanos Store, so that we can limit the memory on the Thanos Store StatefulSet. As we discussed, we saw 100% usage on the nodes, and the CPU and memory on Thanos Store spiking immediately. We looked at a couple of alternatives, and the solution for this problem is to reduce the size of the data retrieved per query. We thought of implementing our own solution, but then we found there is already a well-thought-out solution for this called the Thanos Query Frontend. It provides an option to split the query range with a split-interval flag. We tried different combinations during the load tests and arrived at a decent query split interval, with which we can limit the memory consumption on Thanos Store, because otherwise, even if Thanos Store sometimes performs well, the Query Frontend runs out of memory. So this actually protects both the Query Frontend and Thanos Store while retrieving the information.

So this is our final solution. The AIOps pipeline in the application namespace queries the internal ALB, which queries the Thanos Query Frontend, which can horizontally scale; that in turn queries the Thanos Query S3 service, which can also horizontally scale, which queries Thanos Store, which actually gets the data from S3 and can also horizontally scale. With all this in place, we ran the load test and didn't see any errors after that. We saw that the 99th percentile response time is around 60 seconds and the average is around 30 seconds. These are the results on the back end, where we saw that the Thanos Query S3 CPU and memory are under control, as well as the Thanos Store CPU and memory. We also saw that the HPA kicks in at the right thresholds, so these components can scale, and the node CPU and memory utilization are also within limits.
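For reference, here is a similar sketch of the Thanos Query Frontend arguments that do the query splitting described above. The 24h split interval, the downstream service URL, and the image tag are illustrative assumptions.

```python
# Rough sketch of a Thanos Query Frontend container that splits long range
# queries before they reach Thanos Query and Thanos Store: an eight-day query
# with a 24h split interval becomes eight one-day queries.
query_frontend_container = {
    "name": "thanos-query-frontend",
    "image": "quay.io/thanos/thanos:v0.32.0",  # assumed version
    "args": [
        "query-frontend",
        "--query-range.split-interval=24h",  # illustrative split interval
        "--query-frontend.downstream-url=http://thanos-query-s3:10902",  # assumed downstream Query service
    ],
}
```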
The next big thing, once we have this running, is how much it costs to have this solution. When you look into the components involved, there are three: one is the ALB, the second is the nodes, and the third is the S3 bucket. Looking at the ALB first, ALBs are charged based on LCUs, which are load balancer capacity units. In our load test, based on all the information we have, when we look at the amount of data processed, at the peak of our load profile it is over 31 GB per hour, and we have about 30,000 requests per second against the ALB. Based on all this, we saw that the ALB would cost about $204 per month, which is not bad.

Then we looked at the node usage: how many nodes we need, based on the instance type, and how the number of nodes changes. We started with memory-optimized instances, because as we saw, memory was the constraint and that series is already optimized for it. But later, once we were able to control the query range with the query splitting, we realized that it is no longer a memory-bound problem; it is essentially a compute-bound problem. So we went with C5 instances. When we looked at on-demand pricing for C5 instances it was a little higher, but when we chose reserved instances, the cost came down, even compared to the other class of instances. So the overall node cost with C5 instances is around $1,460 per month.

The last thing is the AWS S3 bucket cost. S3 buckets are charged based on the amount of data we store in them, the amount of data we retrieve out of them, and the scans run over the stored data. Feeding in all the information from our load tests and projections, we arrive at approximately $1,400 per month. So the overall cost, based on this prediction, comes to about $3,100 per month. This is a good insight for us: now we can look into whether it is worth it to build our own solution or to look at alternatives that provide out-of-the-box solutions, but this seems much more optimal than going for paid tools and the like.

Those are all our learnings in scaling Thanos to retrieve long-term data for metrics. Thank you.