Good afternoon everyone. It's been a great day here today at KubeDay India, the first major CNCF event in India. It's a pleasure to be speaking here; there have been some amazing talks all day long, really great, energetic conversations, and we'd like to carry that momentum a little further.

The zero interest rate phenomenon is truly over. Companies worldwide are focusing on managing their tooling costs in parallel with their performance optimizations. We're going to talk about how, a few years ago, we made a few mistakes in our long-term metrics storage that could have cost us upwards of $100,000. We're all human; we're all going to make thousands of mistakes in our lives. It just so happened that these particular mistakes synchronized so beautifully that each one exponentially increased the cost. So let's dive right in.

Who are we? I'm Shubham and this is Ankur; we're the founding team at Zenduty, an incident response platform. We've accumulated years of experience making mistakes, learning from them, and advocating for best practices in production SRE teams worldwide.

What is this talk really about? First, why we needed a solution like Thanos in the first place, and what other solutions were on the market, VictoriaMetrics and Cortex, and how they compare against Thanos. Then our production setup and our expectations from it. And then the good, the bad, and the ugly (not the movie): what went wrong, how we fixed it, the actual cost, and the potential cost of our missing configurations.

To give you some context on what we're building and the scale we could run into: this is what Zenduty, the incident response platform, looks like. We connect with the entirety of our customers' infra, support, and comms channels and deliver alerts to them across all platforms, with context straight from the monitoring stack, and we automate everything from their first alerts onward. So naturally, when our business started picking up, our production scale needed to ramp up just as fast. With that, we took the cue that our observability stack needed to be just as reliable, and that we needed to step it up a little.

So why would we need Thanos? Well, Prometheus has a simple and reliable operational model. It's capable of tracking millions of measurements in real time, with powerful and pretty visualization and querying capabilities via Grafana. But past a certain scale it has some shortcomings, and those shortcomings fail to answer questions that fast-scaling teams naturally run into. Questions like: how can I store petabytes of historical data? How can I do that without sacrificing query response times? Can we access all the metrics from a single API query? How can I merge replicated data collected via a Prometheus HA setup? It turns out Thanos was the answer to all of these questions.

So, a few major reasons why we chose Thanos. First, the global query view. Prometheus encourages a functional sharding approach, and even a single Prometheus server provides enough capability to liberate users from the complexities of horizontal scaling. Regardless, you would still want to access all of that data through a single UI or API, which is the global view we're talking about.
For example, you can have multiple queries behind a single Grafana graph, but they will all run against a single Prometheus server. With Thanos, however, you can query multiple Prometheus servers at the same time, because they're all available from a single endpoint.

Next, reliable historical data storage. The Thanos sidecar watches Prometheus for new blocks of persisted data and pushes them to object storage, which in our case was AWS S3.

Then we have downsampling. Once you start querying historical data, you'll realize there are some fundamental complexities that make your queries slower and slower as you retrieve weeks, months, and eventually years' worth of data. The solution to this problem is good old downsampling: reducing the sampling rate of the signal. With downsampling, you can zoom out to a larger time frame and still maintain the same number of samples, which keeps your queries responsive.

And lastly, high availability. Prometheus has an HA model that essentially collects data twice, which is as simple as it could be. However, the merged and deduplicated view of both streams that Thanos provides is a huge usability improvement.

We considered a few different solutions at the time, but there were problems with every one of them. Some weren't community-driven, some were too young, some didn't support object storage, or their integration with the rest of our observability stack just wasn't as efficient as we expected. Thanos is where all the great engineering teams were at that time, so that's where we decided to be. But if you were having this conversation today, it could go very differently, because there's good competition from VictoriaMetrics and Cortex depending on what you're looking for. I'll also give a quick shout-out to Last9, who are trying to build something in this space; do check them out as well.

So let's take a look at how we implemented Thanos: a beautiful Titanic before it hit the iceberg. Thanos runs as a sidecar to the Prometheus instance and pushes blocks of data to your S3 object storage. We had implemented Prometheus functional sharding, which means we were scraping the same data twice from multiple AZs. This setup requires the Thanos Ruler to evaluate the PromQL rules for alerting, and it in turn connects to the Prometheus Alertmanager for delivering the alerts to their final destinations; in our case, Slack. The Thanos Querier uses the Store Gateway to retrieve all the data from object storage, and it's what teams use for analysis or to create internal and external dashboards for their PMs and other team members. And to get faster query results, the Thanos Compactor I mentioned is used to downsample the data and store it back in object storage.

Now let's see what we had for log aggregation. We implemented log aggregation via Fluent Bit, the underdog that not a lot of us talk about. Fluent Bit is a well-known log aggregation tool and a worthy successor to Fluentd. We deployed it as a DaemonSet and started exporting our logs to OpenSearch. Fluent Bit uses a plugin to add extra labels to the log lines of each log stream, and to get those extra labels it queries the Kubernetes API for things like the pod name, namespace, annotations, and any other relevant information.
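To make that concrete, here is a minimal, hypothetical sketch of what such a Kubernetes metadata filter does conceptually, written with the official `kubernetes` Python client. The function name, retry parameters, and log-line shape are our own illustration, not Fluent Bit's actual implementation:

```python
# Hypothetical sketch of what a Kubernetes metadata filter does conceptually.
# The function name and retry parameters are ours, not Fluent Bit's real code.
import random
import time

from kubernetes import client, config

config.load_incluster_config()  # assumes this runs inside the cluster
v1 = client.CoreV1Api()


def enrich_log(log_line: dict, pod_name: str, namespace: str) -> dict:
    """Attach pod metadata (labels, annotations) to a raw log line."""
    backoff = 1.0
    while True:
        try:
            # Real filters cache this lookup; the call itself is the point here.
            pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
            break
        except Exception:
            # The caveat from the talk: on a Kube API failure, retry with
            # jitter -- and keep retrying, so failed lookups quietly pile up.
            time.sleep(backoff + random.uniform(0, 1))
            backoff = min(backoff * 2, 30.0)

    log_line["kubernetes"] = {
        "pod_name": pod.metadata.name,
        "namespace": pod.metadata.namespace,
        "labels": pod.metadata.labels,
        "annotations": pod.metadata.annotations,
    }
    return log_line
```

The retry loop is the caveat to keep in mind for what comes next: it never gives up.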
So what if the Kubernetes API doesn't respond? What if, for some reason, it fails? If that happens, Fluent Bit retries after some jitter. It retries and retries, and eventually pushes the data to the destination. This small caveat somehow managed to plant a little problem in the system.

And what were our expectations after all this? Well, we spent a lot of time on it. It wasn't easy to configure, but we got it configured. Our expectations were effortless querying, cheap long-term metric storage, leisurely downsampling of data whenever and however we wanted, and, in essence, just a much cleaner metrics and logging system. And this should all work straight out of the box, right? It didn't. Not really. Inevitably, we saw a bunch of anomalies and errors popping up that we couldn't make sense of initially. But sooner or later, it all came together.

The first in this series of unfortunate events was an issue with our network topology. While building our log and metric handling infrastructure, we forgot to keep some basics in mind, and that cost us dearly. Next, playing with high cardinality: we were scraping too much data and too many metrics for our scale. And finally, our friendly downsampler that I've been talking about all along: there was a mismatch between our implementation and our expectations of the Thanos downsampler, causing an increase in network and storage costs. This sounds intense; well, it was. I'll pass the stage now to Ankur, who was off the deep end when this happened. Let's see what he found.

Thanks, Shubham. So to drill down: this is what a general network topology looks like, with multiple availability zones and multiple VPCs, and that's exactly what we were using when we got stuck with this issue. The problem with this topology is that, as Shubham mentioned, we were pushing a lot of log data to our AWS storage, S3, and all of it was traversing the NAT gateway, which means egress cost, and at a certain point that hit us hard, although this wasn't even 10 or 20 percent of the whole thing; we'll come to the majority of it later. So how did we fix it? We added a VPC gateway endpoint, which identifies and separates our public and private traffic to S3. Whatever traffic needs to go to S3 from our private servers now flows through the VPC gateway endpoint, while everything else still goes through the NAT gateway. This is how we actually fixed the problem (there's a short sketch of this fix a little further below).

The other issue was high cardinality; let me explain what that is. Prometheus is very much capable of tracking millions of metrics, but what happens is that we sometimes miss the cardinality issue when setting up these metrics and alerts. For example, if you label a metric with something like a pod name, which is not that dynamic, it's not something that changes every second or every millisecond, then it's fine. But if you include something that does change every second or millisecond, for example a login session labeled with a timestamp, a request labeled with a timestamp, or a session labeled with a request ID, that leads to very high cardinality, because the values change so frequently. It means Prometheus has to store a separate time series for every unique label combination it sees, and that adds up to a lot.
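To make the good and bad label choices concrete, here's a small illustrative sketch using the `prometheus_client` Python library; the metric and label names are made up for the example:

```python
# Illustrative only -- the metric and label names here are made up.
from prometheus_client import Counter

# Fine: pod names are bounded and change slowly, so the series count stays small.
requests_total = Counter(
    "http_requests_total", "HTTP requests served", ["pod", "status"]
)
requests_total.labels(pod="api-7f9c", status="200").inc()

# Dangerous: a unique request ID (or timestamp, or session ID) as a label value
# creates a brand-new time series per request -- cardinality grows without
# bound, and Prometheus must index and store every one of those series.
requests_by_id = Counter(
    "http_requests_by_id_total", "HTTP requests by request ID", ["request_id"]
)
requests_by_id.labels(request_id="9b2e0c-unique-per-request").inc()
```

The bounded label set stays at a handful of series; the request-ID version creates a new series on every single request.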
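And going back to the NAT gateway fix for a moment, here's the short sketch promised above, using `boto3`. The region, VPC ID, and route table ID are placeholders, not our production values:

```python
# Hypothetical sketch of the VPC gateway endpoint fix described above, using
# boto3. The VPC and route table IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

# A Gateway-type endpoint adds a route so that S3-bound traffic from private
# subnets goes straight to S3 instead of traversing the NAT gateway (and its
# per-GB data processing charge). Everything else keeps using the NAT gateway.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.ap-south-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```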
So while we were actually debugging our main issue, which was not related to cardinality, we identified this as another problem that was contributing and leading to a lot of log aggregation.

The third one was downsampling. This was the major issue we faced, the main cause of the whole scenario that brought us here. Anyone who has used the downsampling compactor will know that Thanos provides a very basic one: it gives you retention for specific periods, and downsampling to only two levels, five minutes and one hour, which is what we were using earlier. The other problem is that whenever data has to be downsampled, the original data in the object store needs to be available. For example, if data needs to be downsampled to five-minute resolution after 40 days, the raw data has to be stored for those 40 days for sure. And after that, if you want to downsample to one hour, it still needs the original data. So you're storing the same data thrice. And this downsampling issue, on top of the NAT gateway and high cardinality problems, compounds into a much bigger issue.

So how did we fix it? We went back and wrote our own Thanos compactor, which we're planning to open source in some time. It currently works nicely with Prometheus, and we're also testing it with VictoriaMetrics, so it should be live soon. What we did here is give developers freedom over the downsampling parameters. They can now downsample based on their own requirements, not limited to just five minutes and one hour: they can choose 10 minutes, 30 minutes, or one hour. They also don't need to keep the original data in the object store; they can remove it once the first downsampling is done. For example, in our case we did a 10-minute downsample that needs to be retained for 180 days. Once that's done, you can get rid of the raw data, which we now keep for only three days. That also means the downsampling has to start within those three days, and you can configure both of these: delete the raw data after three days, and start the downsampling whenever you want, in this case before the three days are up. After that, when we have to downsample to 30 minutes, we just depend on the previously downsampled data, the 10-minute resolution in our case, and not on the original raw data. That's what we did (there's a toy sketch of this bucketing idea at the end of this section).

The other problem, which Shubham mentioned, wasn't as big an issue, but we detected it after we fixed the Thanos downsampling: it was Fluent Bit. Since it's part of the same story, we thought we'd discuss it here as well. As most of you probably know, Fluent Bit scrapes your application logs and pushes them to a destination; in our case we were pushing them to CloudWatch. The problem in our case was that while collecting the Fluent Bit data, we were using the Kube API to append a few more fields: the pod name, and also a tracing ID. We missed the basic part of how Fluent Bit behaves: if the Kube API call fails, it keeps on retrying. It keeps on trying. And as requests come in, what starts as 10 retrying requests piles up to 100, then 1,000, all of them still trying to fetch that data.
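Here's the toy sketch of the bucketing idea promised above: collapse raw samples into fixed windows, keep an average and a maximum per window, and derive each coarser level from the previous one instead of from raw data. This is only an illustration of the approach under simplified assumptions; our actual compactor operates on Thanos blocks in object storage, not Python lists:

```python
# Toy sketch: collapse (timestamp_seconds, value) samples into fixed windows,
# keeping the average and maximum per window.
from collections import defaultdict


def downsample(samples, resolution_s):
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % resolution_s].append(value)
    # One (window_start, avg, max) triple per window.
    return [
        (window, sum(vs) / len(vs), max(vs))
        for window, vs in sorted(buckets.items())
    ]


raw = [(0, 1.0), (30, 3.0), (600, 2.0), (630, 4.0)]
ten_min = downsample(raw, 600)  # first level, built from raw data

# Once the 10-minute level exists, the raw blocks can be deleted; coarser
# levels are derived from the previous level, never from raw. (A production
# version would aggregate the avg and max streams separately.)
thirty_min = downsample([(ts, avg) for ts, avg, _ in ten_min], 1800)
```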
Coming back to Fluent Bit: we did not realize it at first, because the data we were suffering with in the Thanos compactor was in terabytes, while this was in gigabytes. That's why it only came to light once the terabyte issue was resolved; it was the second biggest problem for us. And the root cause in our case was very simple, very naive even: in one of the newer services we had released recently, we did not change the subPath in the mount. It worked fine locally, but in production that path wasn't even available through the Kube API calls it was trying to make. So it just sat there costing money.

With proper monitoring we were able to detect these anomalies, and I'd give credit to my team for detecting it, but still, it wasn't as soon as we would have wanted. We've since done better by putting in better alarms and better monitoring systems. But we all know we don't give much importance to logs; how much could they cost? We give a lot of importance to our database data, which is visibly costing us a lot, but with logs we always assume they're cheap, because we don't retain them for long, and we don't pay attention to the network traffic they generate.

So we did a calculation, based on the bill we got, of what it could have cost us. The NAT gateway was processing about 50 TB; if you calculate the hourly cost, that comes to about $32, not much. But in our case it was processing around 51,000 GB per month, which comes to $2,304, and if you add both together it's $2,336. And we were doing this over 6 NAT gateways, since we're in a multi-AZ, multi-region environment. Still not a lot, but the logging cost we estimated comes to around $1,700 based on CloudWatch, and that was just the gigabyte-scale part. Now assume all of that across the 6-7 clusters we were running, and it adds up to about $100K. Thankfully it never actually got that far, but it still cost us a bit to learn that lesson, and that's why we're here to share it with you.
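For the curious, the arithmetic above reproduces in a few lines. The $0.045 hourly and per-GB NAT gateway rates are the common US-region prices and vary by region, so treat this as a back-of-the-envelope sketch rather than our exact bill:

```python
# Back-of-the-envelope reproduction of the numbers above. Rates are the common
# $0.045/hour and $0.045/GB NAT gateway pricing, which varies by region.
HOURLY_RATE = 0.045   # $ per NAT-gateway hour
PER_GB_RATE = 0.045   # $ per GB processed through the NAT gateway

hourly_cost = HOURLY_RATE * 24 * 30          # ~$32 / month just to run it
processing_cost = PER_GB_RATE * 51_200       # ~50 TB / month -> $2,304
per_gateway = hourly_cost + processing_cost  # ~$2,336 / month
print(f"per gateway: ${per_gateway:,.0f}; across 6: ${6 * per_gateway:,.0f}")
```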
A few final words to leave you with. The smallest mistakes are the hardest to fix, and in turn they're much more expensive. There's a quote from Anthony Hopkins that I really like: the evolution of sentient life on this planet was created by one single tool, the mistake. So don't count out single mistakes at all; you never know what other mistakes one will pile onto to become something bigger.

Second, don't rush into things just because they're easy to drop in. We see a lot of chatter in the last year or two about shifting vendors and adopting open source solutions, and I'm all for that, but make sure you're doing your due diligence. Don't use something just because it's easy to drop in, because there's a lot of hype around it, or because it claims to be a silver bullet; nothing is ever going to be a silver bullet. And lastly, if you're a growing, build-fast-break-fast team, it's often worth spending on extensive monitoring, especially if that's going to save you much more by saving you from errors like these.

And that's pretty much it. One small shout-out to everyone at Kubernetes, Thanos, and Fluent Bit who have been tirelessly maintaining these projects; I think there are a few Kubernetes maintainers here today, so thank you for everything you do. And that's all from us. Thank you everyone for listening this late in the day; it's nice to see a full auditorium. We have a few socials up here if you want to get in touch with us. We also have a pretty active Slack community and an SRE meetup community, if you want to see more of these talks and these investigations into site reliability. And that's us: if you're looking for a cost-effective facelift for your incident management systems, check out Zenduty. We're still hiring. And we're ready for any questions you want to share, any mistakes that were made.

I'm sorry — Mimir, from Grafana? So Mimir wasn't really... this is not a recent story; this happened a couple of years back, and at that time it wasn't mature enough. As I mentioned, I'm not even sure Mimir had launched at that point; if it had, it was very, very nascent, so we didn't want to venture there. Our environment is pretty stable and we generally do thorough testing; apart from the mistake we made, we rely on things that are stable, and we look for the most stable version. At that particular time Thanos seemed better, and we're still using it; it's working really smoothly for us.

On the high cardinality problem: like I said, don't rely on very dynamic label values. One of the examples I gave was not to rely on session IDs; if you have to track user behavior, do it on an email ID or a user ID that you have. Another example: stay away from timestamps; that's the high-cardinality problem we tend to hit. Pods are fine, services are fine, if you have to do it.

So once we did the downsampling, we got rid of the original data in the object store anyway, so we didn't have to. Maybe in future we will; we'll see.

You had some questions? I didn't get that, I'm sorry. Okay, so like I said, the tool we've built is an incident management solution; we help companies stay reliable, that's what we focus on, and the reliability of our tool needs to be better than our customers' tools. For that, our DR and everything needs to be multi-AZ and multi-region, and now we've moved to multi-cloud as well. For that particular reason we were moving that much data across regions, so that if we have to switch, the data is always available for monitoring. S3 is a global store anyway; we were moving data from multiple regions into S3. We're on AWS; we just moved some of our workload out to see if that would be useful. But in our experience, and I'm not trying to advocate for AWS here: we've been using AWS for the last 10
years, but we've found the cost and the usage more transparent than Azure. Azure works fine, but the support AWS provides has been smoother; that's why we've stuck with AWS for now. Okay, that's a good note; I'll keep that in mind and pass that feedback on to my team.

Yeah, in the Thanos compactor we've built, you can... I'm sorry, I didn't get that. As a function of the metrics, not the days? So you want to downsample a particular metric and not the critical load? So in our case, yes: you can pick what data you need to downsample and keep for longer than the rest. In our case we're also discarding some of the data that we don't need for 365 days or for eternity; the example we showed you covers only the data that needs to be retained.

And what about once you've chosen a downsampling resolution, you cannot go back? We have tried that. In our case it was workable, because the downsampling we do is based on maximum and average, so we were able to reproduce the original to some level. Frankly, we were doing it to build some intelligence on top of it, but real data is real data.

Thank you, it was a nice presentation. Does anyone want to share a mistake with us? You can also hit us up later; we'd love to talk to you if any of you have similar stories or experiences you'd like to share. We have a pretty active podcast called the Incidentally Reliable podcast, where we talk to SREs about their long careers and histories, the mistakes, the war room stories. So do check that out, and if any of you would like to be on it, do reach out; we'd love to hear from you. And we're continuously hiring for backend and infra roles, so do hit us up on the careers page; we'd love to talk to you about that. It's on YouTube and Spotify; I think the link is also there on the Q&A page, you can find it over there. You want to show it? Yeah, I can share that as well. Currently we have a few episodes up; the last one was with Manoj Sebastian, who served at Flipkart, and it was a treasure trove of stories on how companies like Flipkart and Yahoo manage their scale. So if you're into site reliability stories and war room nightmares, definitely check it out. We'd love to hear from you as well; if any of you would like to be on it, do reach out to me. Any other questions? We have a few minutes. Thank you guys, and thank you for giving us the opportunity. Thank you so much.