We've personally had a great time so far at KubeCon Europe. KubeCon Europe's Observability Day, to be precise. And Paris is everything they said it would be and more. Like, every piece of bread I've broken so far has been better than the last one. Can't sing the same praise about the subway system, though. I hope you can relate. I don't know why they give such bad directions. All the talks today have been really fresh, really intuitive, and the energy on this day zero has been better than at any other KubeCon I've been to so far.

And what we're going to talk about today is related to the zero interest rate phenomenon. How many of you have heard of this before? The zero interest rate phenomenon, ZIRP, yeah? A lot of the chatter around pretty much everything happening today comes back to this, and over the last two years the phenomenon has well and truly ended. It flew by like an innocent night of drinking. And ever since, there has been an increased focus on cost optimization going hand in hand with performance optimization. As companies consolidate their businesses, offerings, and financials, every penny spent needs to be accounted for, more and more now.

This talk is about how RazorPay saved upwards of $500,000 on their observability and infra tooling spend by switching vendors, adopting open source, and implementing best practices which aren't as intuitive as they seem to be. I hope that a lot of the fast-growing teams here today can identify similar patterns and find money where there seemingly is none.

So yeah, nice to meet you all once again. I'm Shubham, and I lead developer relations at Zenduty.

This is Sandeep. Hey, hi, everyone. Nice meeting you here. I'm a lead DevOps engineer at RazorPay. RazorPay is India's biggest payment gateway; you can call it Stripe for India. It takes care of close to 80 million merchants' transactions on a day-to-day basis. Yeah, that's all from me. Yeah.

And once again, I lead developer relations at Zenduty, an advanced incident response and alerting platform. Together, we've helped teams set up their cloud automation and incident response processes across the world.

So what is this talk really going to be about? Firstly, we'll start by discussing the scale at which observability costs really start to pinch hard, a place where RazorPay has personally been for a while. We'll talk a little about how even experienced and proficient teams can find themselves stuck in the spiral and let infra and observability costs run away. And then we'll take some time to cover each pillar of change that RazorPay undertook, talking about their experiments with tools like Fluent Bit, Hypertrace, Victoria Metrics, eBPF, and some other fun stuff, all in the spirit of getting more bang for the buck.

A few of you might think that, hey, any proficient team with cloud-cost alerting and good architecture in place would never end up in a situation like this. However, reducing costs in areas like observability and associated tooling is often something you won't get 100% right while you're still shaping those pillars up. So I'll walk you through a journey that a lot of these fast-growing teams go through. First, the team rapidly expands as the business booms. Infra orchestration kicks into high gear. And the team reaches a stage where they have a lot of problems, but these are good problems to have, and they're happy to be there.
Next, as consumption climbs and climbs and eventually peaks, scaling issues can and do arrive, even for teams that plan for them well beforehand. And at this fateful juncture, every high-performing team will prioritize meeting the evolving requirements of its customers and its internal developers. That means doing everything the team can to boost developer productivity and reduce time to market. And that involves over-provisioning, and choosing tools and processes without a lot of deliberation, with the focus on "what's the easiest way to get reliability seeped into my operations right now?"

Yeah, so I'll just give you a quick overview of the scale RazorPay operates at, basically. We process close to $100 million of payments every year for our merchants, and we do close to 1,000 transactions per second. Just to give you a comparison, PayPal, till maybe last year, used to do somewhere around 190 or 193 TPS across India, and Visa is capable of doing somewhere around 1,700 TPS. We produce somewhere around 100 TB of logs every day and 32 trillion data points per hour. And we have close to half a million pods spread across different environments.

So we'll go through the observability pillars, starting with metrics, then logs, and then traces.

Victoria Metrics is the holy grail solution, or her majesty's solution, for us, and it saved us a lot of money here. When RazorPay was very small, we used to have Prometheus, which could take care of our metrics. But when we started scaling, it became a single point of failure. It was not highly available, and the only way to scale Prometheus was vertically, which made us more and more dependent on DevOps engineers. We could not get high availability and autoscaling with it.

Then we tried Thanos. But the problem with Thanos is that it saves the metrics in S3, fetches them back, and does the calculation in memory. That means a lot of back and forth, and the network cost gets added on top. So we parted ways with Thanos.

And then we came to Victoria Metrics. The flexibility Victoria Metrics gives us is this: we keep our data in a tiered layout, split across the insert, select, and storage components, and then we can tier it down further into two more tiers, hot and cold. Whatever is in use, we keep in the hot tier, and whatever is not used, or whatever we want to archive, we put in the cold tier. And Victoria Metrics still lets Grafana see the previous, older data as well, just as we could fetch it from Prometheus. So yeah, that's all about Victoria Metrics.

Just to give you a brief idea of the cost associated with it: we run a few Victoria Metrics instances per environment, a typical Kubernetes cluster, which is somewhere around five machines, and we can put them on spot instances, so that costs us somewhere around $13. Then two or three more nodes for the storage and select components, which come to somewhere around $3. So the total cost for a cluster comes to somewhere around $18.
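One thing worth making concrete is that point about Grafana: because the vmselect component exposes the same query API as Prometheus, existing dashboards and ad hoc queries keep working after the migration. Below is a minimal sketch of what that looks like from Python; the vmselect URL and the metric name are placeholder assumptions, not RazorPay's actual setup.

```python
# Minimal sketch: querying VictoriaMetrics through its Prometheus-compatible
# HTTP API (vmselect serves /api/v1/query_range, just like Prometheus), which
# is why Grafana and old PromQL queries keep working unchanged after migration.
# The endpoint and metric name below are hypothetical placeholders.
import time
import requests

VMSELECT_URL = "http://vmselect.monitoring.svc:8481/select/0/prometheus"  # hypothetical

def query_range(expr: str, hours: int = 6, step: str = "60s"):
    """Run a PromQL range query against the vmselect component."""
    end = int(time.time())
    start = end - hours * 3600
    resp = requests.get(
        f"{VMSELECT_URL}/api/v1/query_range",
        params={"query": expr, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # Example: p99 request latency over the last 6 hours (placeholder metric).
    series = query_range(
        "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    )
    for s in series:
        print(s["metric"], len(s["values"]), "samples")
```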
And the overall cost saving with Victoria Metrics was somewhere around $40 to $45: clubbed with the spot instances, the setup that used to cost us somewhere around $60 a month is down to $15 a month now.

The other pillar of observability is the logs. We used to have a different logging platform, which ingested the whole of the data we were producing. The main shipper for the logs at that point was Fluentd, and it hogged a lot of memory. If you multiply that across 10,000 to 12,000 nodes at a time, it becomes somewhere around 200 to 300 machines just for the log shippers. So we shifted from Fluentd to Fluent Bit as the log shipper, and kept Fluentd as our aggregator. And then we changed our logging platform as well, to one which does not need the whole of the data to be ingested at the same time. We moved to a different vendor which saves the data into S3 in parquet format; whatever we need at the time of troubleshooting or visualization, it fetches from S3, and the cost is associated only with the ingested or processed data.

Once upon a time, we played with Hypertrace as well. Hypertrace is a cloud native solution for tracing; it saves the data in Pinot and Kafka. But we did 100% sampling, and the cost skyrocketed to somewhere around $3 million a year. So we had to tear it down and move to some other solution. That's when Hypertrace became really hyper for us.

The newest addition in observability, the hot topic, is eBPF. We started playing with it, and we have implemented an eBPF-based network observability solution which hooks into kernel functions for us using kprobes and gives us the same kind of visualization that some of the tools currently on the market give. It also shows us whichever request or service is trying to connect outside, because RazorPay, being a fintech organization, has to be compliant with the central bank's regulations at all times. So whenever we need to move a service to the internal network or the external network, it can be taken care of in a few minutes. And outside versus inside communication has a NAT cost associated with it too, because if you're connecting outside, the traffic goes via the NAT gateway. So a significant amount of money has been saved with the eBPF tool. We call it the egress eye, basically.

So observability is only half the battle. When things eventually break, all the shiny, airtight monitoring and observability systems you have built would be for naught if your teams aren't able to remediate and prevent business-impacting disruptions. So if you're asking how unoptimized incident response processes could cost your teams money, you're not the first one. Let me show you how.

RazorPay's leaders noticed that, on average, every engineer was spending more than three hours per month attending incident disruption calls, which is around 10% of their monthly bandwidth. In almost 75% of these cases, it turned out that the final root cause was something beyond their control, and all they needed to do was run a few SQL queries to reach that conclusion. And this is a very repetitive, repeatable journey that could have been undertaken without getting the engineers involved so frequently.
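To make that "just run a few SQL queries" journey concrete, here's a minimal sketch of the kind of alert-triggered diagnostic runbook we mean: it maps a known-noisy alert to the query an engineer would otherwise run by hand and posts the result back to the incident. The alert name, the SQLite stand-in database, and the webhook URL are all hypothetical placeholders, not RazorPay's actual Shoreline or Zenduty configuration.

```python
# Minimal sketch of an alert-triggered diagnostic runbook: for a handful of
# well-understood alerts, run a canned SQL query and attach the result to the
# incident instead of paging an engineer. The alert name, the SQLite stand-in
# database, and the webhook URL are hypothetical placeholders.
import json
import sqlite3
import urllib.request

# Map each known-noisy alert to the diagnostic query an engineer would run by hand.
DIAGNOSTICS = {
    "payment_gateway_timeout_spike": (
        "SELECT gateway, COUNT(*) AS timeouts "
        "FROM payment_attempts "
        "WHERE status = 'timeout' AND created_at > datetime('now', '-15 minutes') "
        "GROUP BY gateway ORDER BY timeouts DESC"
    ),
}

INCIDENT_WEBHOOK = "https://alerts.example.internal/incidents/comment"  # hypothetical


def run_runbook(alert_name: str, db_path: str = "diagnostics.db") -> str:
    """Run the canned diagnostic for an alert and post the findings to the incident."""
    query = DIAGNOSTICS.get(alert_name)
    if query is None:
        return "no automated runbook for this alert; paging on-call"
    with sqlite3.connect(db_path) as conn:  # stand-in for the real reporting database
        rows = conn.execute(query).fetchall()
    summary = json.dumps({"alert": alert_name, "rows": rows})
    req = urllib.request.Request(
        INCIDENT_WEBHOOK,
        data=summary.encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)  # attach findings as an incident comment
    return summary
```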
The incident management team at RazorPay diligently monitors approximately 11,000 alerts per month, out of which only around 250 are P0 alerts, out of which only about 110 end up being business-impacting incidents. So you can get an idea of the noise-to-signal ratio that comes with improper alerting practices. And having your on-call engineers involved in so many of these calls, diving into something and realizing that it has already been fixed, or is not something you should be losing sleep over, costs you more developer productivity than you can probably measure.

The solution to this predicament is as simple as your first thought: minimize human intervention wherever possible. RazorPay identified the top five most frequently occurring alerts and built runbooks outlining the steps to address them. These were meticulously reviewed by senior engineers who have been around for a while and know that automation can't just be thrown into a system naively; you need to keep some level of manual intervention there. They did this using their tool of choice, which was Shoreline.

And to minimize the noisy alerts, RazorPay implemented a service-based incident response architecture. They set up fairly complex alert rules around the alert payloads they receive from their monitoring tools, which allowed their incident alerting tool of choice, Zenduty, to deliver only the right alerts to the right people. Right now, they've already shed their alert load down to around 3,000, and they're on a journey to bring it down to 1,000 alerts per month.

And the outcome? Well, a significant 20% reduction in the time to isolate, which is the time typically spent finding the root cause of an incident. And the automation of runbooks empowered these teams to swiftly pinpoint the point of isolation in a remarkable span of less than five minutes. Automation needs to be relied upon as we move forward; it is the future. A lot of more established teams try to avoid this fact, but the only way you can maintain the business velocity you want in this competitive market is by relying on automation.

And a few final words before we leave you. Don't rush into things just because they might be easy to drop in; Hypertrace was a good example of that for us. There's a lot of chatter about shifting vendors and exploring open source solutions, and I'm saying this at a CNCF conference, but do your due diligence properly, and don't drop something in just because you feel it's going to solve all your problems. There's no such thing as a silver bullet in today's world. Open source is not free, so make sure you're not conflating those two things. And it's an extremely competitive time in the market, as I said earlier, and companies just can't afford any downtime. So while you're scaling up fast and fighting Goliaths, if you're one of those rapidly built, move-fast-break-fast teams, it's worth spending on extensive monitoring and alerting regardless of your scale or size. You would rather be at the scale RazorPay is at, where they've gotten over the hump and are now fixing their tiny mistakes, than never get over the hump at all.

And that's pretty much all the time we have. Thanks for listening, everyone. We can take a few questions if you have any. And yeah, I hope the next few talks do your attention justice. Thank you so much.