My name is Ashwath, and I work as a staff security engineer at RazorPay, currently focusing on cloud infrastructure security. I worked with Microsoft and Synopsys security teams in the past, and I have presented at conferences such as FSICI, NullCon, c0c0n, 50B, AWS Summit, etc. Welcome to the talk.

Hey, everyone. I'm Amit and I'm a principal engineer with the platform team at RazorPay. My primary focus area is our microservices journey, more specifically the API gateway and its related services. The two of us will be talking to you about how we do DDoS mitigation at real-time scale.

RazorPay enables frictionless payments, banking, and lending experiences for different classes of merchants. These merchants vary in scale: we serve mom-and-pop shops, and we also serve large enterprises. Today, we process millions of transactions for tons of merchants across the country. RazorPay has been at the forefront of innovation over the last few years in terms of transforming the financial ecosystem of India. Basically, we power payments and banking for a large chunk of India's startup companies, and we are proud to be a part of their growth journey. This means that our systems and infrastructure needed to scale to support our partners' scale. Talking about our growth, our engineering team strength had to grow 10x over the past four years to support that journey. For dedicated focus, we have four BUs, 60 pods, and 600 employees in tech. To better handle scaling, we embarked on the microservices journey a couple of years ago. Today, we have 100-plus microservices. We've done about five acquisitions in the past four years, so we have a mixed bag when it comes to the tech stack, with over 2,000-plus deployments a month. The takeaway here is that RazorPay exposes a large number of APIs, and the number and complexity of these APIs will only increase over time.

So why DDoS? Availability was a key focus area for RazorPay last year. An outage would not only hurt us, but also hurt our customers who depend on us for payments and banking, which puts a great responsibility on our shoulders. While going through the possible areas of outages, DDoS stood out. In recent years, DDoS has been one of the largest contributors to outages across industries, and there has been a constant increase in DDoS attacks, especially post-COVID, with companies going digital. Some metrics to look at: ransom-based DDoS attacks are where an attacker tries to extract a ransom because he or she has detected that a company is vulnerable to DDoS attacks. Ransom-based DDoS attacks have increased 29% year-on-year, according to a Cloudflare report. Cloudflare also reported a peak of 17.2 million requests per second. Just pause on that number for a second, right? 17.2 million.

So let's get into the depth of what a DDoS attack is and why it is so bad. Why do people conduct DDoS attacks? Competitors at times, geopolitical situations, and sometimes attackers are just looking for a ransom from companies. They notice that a company is susceptible to a DDoS attack, and then the attacker sends a ransom note saying, if you don't pay X Bitcoin, we will bring down your site or API. The threat here is that DDoS attacks are easy to pull off. There are services that orchestrate these attacks, and it is as simple as paying with a credit card and giving them a target. If you look at the picture on the right, the DDoS services use a botnet in the back end to achieve these tasks. So what is a botnet?
The attackers look for zero-day vulnerabilities, especially on network devices like routers, and make these devices a part of their network. These devices are called the bots, or the dummy hosts. The bots have no idea that they're a part of the botnet. They're spread across the world and just keep waiting for instructions from their master, or what is called the control server or the command and control server, in the second row of the picture.

Let's look at the working of a DDoS attack with an example. Imagine a normal traffic signal where four streets join. Let us say the signal is designed for a traffic flow of 10 vehicles per minute, and the traffic flow is smooth at this volume. The school in the corner lets out at 3 p.m., and 50 buses hit the road exactly at 3 p.m. I'm sure you must have seen this in different places. This will choke up the traffic signal, causing a huge traffic jam. This, by definition, is a denial of service: the signal chokes and the recovery is messy. Now let's say there are 30 schools across the four streets with 50 buses each, all of them leaving at 3 p.m. sharp. That would crowd up the whole signal and the surrounding streets, causing a deadlock that would take hours to resolve, if not days. This is distributed denial of service, with the focus on the word distributed, because the vehicles are all coming from different sources and they all hit at the same time. Coming back to DDoS, how exactly will all the machines hit a server at the same time? The distribution and timing are achieved using botnets, where the control server pushes a command to the bots. The command will say, go hit URL X at this hour and send Y requests per second of volume. Summarizing so far: botnets form the backbone of DDoS attacks, these attacks are easy to pull off, and when they happen they overwhelm the system, recovery is difficult, and the result is potentially an outage.

Our approach: now that we had to prepare for DDoS attacks, we had to be very methodical and focus on safeguarding the most important assets first. We split the response into three phases. Phase one, automated inventory. We have an automated inventory system based on Cartography by Lyft, which captures the list of APIs and websites exposed. We ranked them by exposure and risk of an outage, and at the end of this step we had a prioritized list of assets. Phase two: bucket these assets into groups such as APIs and websites, static and dynamic; subgroup them into unauthenticated and authenticated, and by who uses them; and create the solution for each of these groups. We get into the details in the upcoming slides. Phase three was to battle test our solution. Our chaos team is awesome; they came up with a way to simulate attack traffic so that we could test our own solution under different loads, origins, et cetera.

In the next slides, we'll focus mostly on phase two, our solution. This is where we spent most of our time and effort, and it is the biggest part of the talk. Here is the crisp problem statement: ensure availability without customer impact. This means that the solution should not block legitimate traffic. One such edge case to think about: how do we differentiate, let's say, a Diwali sale from a DDoS attack? Key points of the solution to protect against an attack: we divide the response into three phases. Detect: in this layer, we find out if there is a DDoS attack going on and ring alarm bells, fingerprint, and get more information about the attack.
Find out which routes are getting hit, which merchant is getting hit, what the volume is, et cetera. Prevent: this layer focuses on proactive blocking. We take all the information from the previous step, that is, the fingerprinting, and create automation to block at multiple layers. Mitigate: despite all the automation, sometimes traffic spikes still sneak in and get around the system, and this is where we need some manual intervention based on all of our domain knowledge. So a multi-layer defense strategy is required to effectively allow legitimate traffic and block bad traffic. Going back to the road traffic example, we would need to block the traffic at the school gates, at the intersection of the side road and the main road, and at the traffic signal, and all of this blocking would have to be coordinated, with an exchange of information across all of the blocking points. That is the thought behind the multi-layered defense.

So, DDoS protection by groups. We were looking at the problem through a build versus buy lens. There are quite a few out-of-the-box solutions to handle DDoS attacks, and they do a great job at protecting static websites. You can also achieve a fair degree of protection for dynamic websites, for example VueJS or AngularJS websites which make API requests to fetch data; by tuning the rules and experimenting, you can achieve a fair degree of coverage. However, RazorPay has a few complexities: a large number of APIs and routes, given the complexity of our business, and more so because we are in the B2B category. We enable payments for our customers, so we have very little information about the end user who is using our system. Let's say Amit places an order on Swiggy for biryani and pays Rs. 200 using a credit card. Don't mean to make you hungry. Swiggy will know Amit and all of his details, right? But RazorPay will have little to no context about Amit. RazorPay will only know that a customer has placed an order on Swiggy and that we have processed a payment of Rs. 200. So the takeaway here is that an out-of-the-box solution would not work for RazorPay, and we had to go back to the drawing board.

Over the next three slides, I'll talk about the architecture at a high level and the options available at individual layers to protect against DDoS attacks. When someone makes a request to api.razorpay.com, the request goes to the application load balancer on AWS, followed by our self-hosted API gateway, and then finally the actual microservice serving the request. We've configured AWS WAF, the web application firewall, to attach to the ALB, the application load balancer, and we're also subscribed to AWS Shield Advanced protection. We will focus on the AWS WAF and the Shield Advanced investment. In simple words, WAF can do two operations very well. One, blocking: give it a pattern, where the parameter could be a header, IP, body, country, et cetera, and it will block the requests matching that pattern. It is a very effective tool, but it needs to be used with extreme caution because it could potentially block legitimate traffic. Two, rate limiting: if you expect an IP address to send 100 requests in a five-minute time window, and that IP sends more than 100 requests, it will get blocked. Rate limiting is a very powerful tool and comes in really handy during DDoS attacks.
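To give a feel for what such a rule looks like, here is a sketch of a WAF-style rate-based rule expressed as a Python dict, with a scope-down condition on a path. The limit value, path, and rule name are illustrative assumptions only; in practice a structure like this is sent to the WAF APIs (for example via boto3) rather than hand-written per attack.

```python
# Sketch of an AWS WAFv2-style rate-based rule, expressed as a Python dict.
scoped_rate_rule = {
    "Name": "rate-limit-payments-route",          # hypothetical rule name
    "Priority": 10,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 2000,                        # requests per IP per 5-minute window
            "AggregateKeyType": "IP",
            # Scope-down: only count requests whose path starts with this prefix.
            "ScopeDownStatement": {
                "ByteMatchStatement": {
                    "SearchString": b"/v1/payments",   # hypothetical route
                    "FieldToMatch": {"UriPath": {}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "STARTS_WITH",
                }
            },
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "rate-limit-payments-route",
    },
}
```

The scope-down piece is what keeps the blocking narrow: only traffic matching the pattern counts toward the limit, so unrelated legitimate traffic is untouched.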
Shield Advanced does these operations. One, detect an attack: an alarm is triggered whenever the health checks fail, and these health checks can be configured. Two, fingerprinting: this is basically to gain visibility into an attack. Shield will give you the top five contributors in terms of traffic: IPs, countries, URLs, referrers, et cetera. The third piece, around prevention, is automated blocking. This is where Shield can automatically block what it thinks are the top contributors to the attack. Use this option with extreme care; Amit will talk about a lesson we learned through this operation. The final piece is the Shield Response Team and the support from them: experienced staff from the AWS Shield team will dial in during the attack and provide suggestions and support over a call. Finally, as part of the multi-layer defense strategy, our API gateway plays a major role in the detect phase. It quickly identifies the routes under attack and the customer under attack, and triggers alarms. Edge also helps with prevention; it does blocking under multiple circumstances, which Amit will cover in detail.

Summing up so far: DDoS attacks are easy for attackers to pull off, and they are increasingly one of the main reasons for outages in the industry. A methodical approach of detect, prevent, and mitigate is required as part of the DDoS response; remember those three things, detect, prevent, and mitigate. Multi-layer defense across components is required for an effective response. I also covered the defense approach at two layers, above and behind. Amit will now spend time talking about our end-to-end solution and how all the pieces come together.

Thanks, Ashwath, and thanks everyone for tuning into our talk so far. I'll now dive deeper into the solutions we've been working on and expand on the things Ashwath mentioned earlier. An API gateway is a critical piece of infrastructure for microservice architectures. At RazorPay, we use Kong open source as our API gateway, which we internally call Edge. We chose Kong because it's backed by the proven performance of Nginx and has a powerful plugin architecture for extensibility, among other things. A typical request to the RazorPay API hops through multiple systems in our infra layers before being processed by a backend microservice. The API gateway is the first system on this ingress path that understands RazorPay domain context that prior layers cannot. This is possible because it's where we've defined our services and flows, and where we execute plugins like authentication that resolve our user identity. Here's an example of some of the domain context resolved at the API gateway. The payload here is a JSON blob that is generated at the API gateway and then passed around between the different microservices over HTTP headers. The payload is self-sufficient, meaning it contains all the metadata required by upstream microservices to function. The context you see here is also useful input within the API gateway itself, for some of its other plugins such as the rate limiter and insights, as we'll see in the upcoming sections.

Having good visibility into traffic patterns and insights is pretty essential to a well-rounded DDoS protection strategy. DDoS attacks do not generally have straightforward patterns, right? Attackers tend to change course quickly by hitting different endpoints, switching their request payloads, API keys, et cetera. Further, due to the distributed nature of these attacks, the client context that we can resolve, such as IP address and user agent, is pretty vast.
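One way to make the fingerprinting idea concrete: group recent requests by a handful of attributes and surface the top contributors. Here is a minimal, illustrative sketch; the field names and the in-memory approach are assumptions, since in practice this aggregation happens over the insights pipeline described next, not in a single process.

```python
from collections import Counter
from typing import Iterable, Mapping

def top_fingerprints(requests: Iterable[Mapping[str, str]], k: int = 5):
    """Group requests by (source IP, route, user agent) and return the
    top-k combinations by count. The keys "ip", "route" and "user_agent"
    are an illustrative schema, not the real insights fields."""
    counts = Counter(
        (req.get("ip"), req.get("route"), req.get("user_agent"))
        for req in requests
    )
    return counts.most_common(k)

# The top entries form a rough "attack fingerprint": which IPs, routes
# and clients dominate the current traffic.
sample = [
    {"ip": "203.0.113.7", "route": "/v1/payments", "user_agent": "curl/8.0"},
    {"ip": "203.0.113.7", "route": "/v1/payments", "user_agent": "curl/8.0"},
    {"ip": "198.51.100.2", "route": "/v1/orders", "user_agent": "Mozilla/5.0"},
]
print(top_fingerprints(sample))
```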
Having quick, easy access to these patterns is crucial to detecting and protecting against these attacks. At RazorPay, we believe that you can never really have too much data, and we collect a lot of information from various sources such as the ALB, WAF, Shield, and the API gateway. It's important to do so across different layers so that you get better coverage and avoid relying on a single source of information, especially because some of these sources may themselves get impacted at high request rates. An area we specifically focused on was attack fingerprinting, which broadly attempts to narrow the data down into a few specific patterns. The diagram you see here is an overview of how we collect and store our context data. AWS Shield Advanced provides a useful alarm on any detected event, which we process via Lambda jobs; I'll talk more about that shortly. On the API gateway, we've implemented a custom insights plugin that acts as a source for all our intelligence.

Here's a more detailed look into the insights pipeline I was talking about. The insights plugin on the API gateway collects a lot of rich derived data that includes info on the HTTP request, customer, resource, and more. We are able to derive this, and more than what is available at the infra layers, since we have resolved authentication and authorization by this point, and we can also access contextual data such as previous access patterns for that user or IP address. All of this data is collected and pushed asynchronously to an internal service we call Lumberjack, which is a wrapper over Kafka. Via existing data ingestion pipelines, this data is then ingested into two data stores. One, Druid, which we use for its high-performance ingestion and powerful low-latency aggregation queries; we query Druid for use cases like the count of requests from an IP address in the last minute, et cetera. Two, Pinot with Trino, which power large-window time-series queries for use cases like dashboard visualization. To visualize and query this massive data set, we use Looker, which is a BI and big data analytics tool. These queries run on an intermediate SQL engine called Trino, which is ANSI SQL compliant and allows us to run SQL window functions for time-series queries. Here's an example of a few of the dashboards we've built on Looker. The graph you see here is a time series of the top five IPs making requests over time, and here's a similar graph that displays requests from a single customer. These dashboards allow us to quickly identify spikes and anomalies and drill down for more accurate debugging. The primary consumers of these are on-call engineers and SREs who need to quickly understand the source of a request-rate anomaly. This, along with other dashboards, alarms, and insights, gives us well-rounded visibility into all areas of traffic patterns on our systems.

We'll now talk about what's arguably the most important piece of any DDoS strategy: prevention. Of course, we can't prevent an attack from taking place; what we're rather referring to here is preventing an attack from getting deeper into our systems, preferably at the earliest layers of ingress. Doing this allows us to minimize the impact on our systems by reducing the blast radius of an attack, which is important as you'll soon see.
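As a quick aside on the Druid piece mentioned above, here is a rough sketch of the kind of low-latency aggregation we lean on it for: the top client IPs by request count over the last minute. The broker URL, datasource, and column names are assumptions for illustration, not our actual schema.

```python
import json
from datetime import datetime, timedelta, timezone

import requests  # assumes the third-party `requests` package is installed

def top_ips_last_minute(broker_url: str, datasource: str = "api_requests"):
    """Run a Druid topN query counting requests per client IP over the
    last minute. `datasource` and `client_ip` are hypothetical names."""
    now = datetime.now(timezone.utc)
    interval = f"{(now - timedelta(minutes=1)).isoformat()}/{now.isoformat()}"
    query = {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": "client_ip",
        "metric": "requests",
        "threshold": 5,
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "requests"}],
        "intervals": [interval],
    }
    resp = requests.post(f"{broker_url}/druid/v2/", json=query, timeout=5)
    resp.raise_for_status()
    return resp.json()

# Usage against a hypothetical broker address:
# print(json.dumps(top_ips_last_minute("http://druid-broker:8082"), indent=2))
```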
One of the tools we rely heavily on for prevention is rate limiting, which is a technique that allows us to restrict the frequency of resource access to predefined constraints and limits. The result is that we're able to limit users, applications, and bots that are potentially abusing our resources. Let's dive deeper into the rate limiting solution itself. At RazorPay, we rate limit at multiple layers. This is important because it allows us to play to the strengths and work around the weaknesses of every layer. For example, AWS WAF is our first layer of rate limiting. With AWS WAF, we don't need to worry about scale, and at the same time we're able to configure a wide range of rate limits based on API request metadata, such as request path, headers, et cetera. However, WAF has a few limitations. One, its rate limits work on a granularity of five minutes, which is too large for our use case; WAF also takes 30 seconds to kick in, so the first time it evaluates a rule is after 30 seconds, which again adds a delay. Two, domain context is typically unavailable at WAF, so we are unable to set different rate limits for enterprise customers versus normal customers, for example. To overcome these limitations, we run our second layer of rate limiting on the API gateway. We can perform fingerprinting at the API gateway layer to find out which route, customer, and domain is under attack, and setting fine-grained rate limit policies is possible there. We've built a performant and highly configurable rate limiter plugin on our API gateway. Implemented using the token bucket algorithm, it can enforce limits at one-second window granularity. Moreover, since the API gateway resolves routing, authentication, and other metadata internally, we're able to write much more focused rules for our use cases. While this layer offers better custom control, it cannot replace WAF entirely, since that would expose it to system availability risks. The way WAF and the API gateway complement each other allows for better overall protection. Ashwath will now talk in more detail about how we've implemented rate limits in these systems.

Thank you, Amit. The most important piece here is that to achieve a good degree of protection, we need a good rate limit, specifically a scoped-down one: you identify which merchant is under attack and which route is under attack, and you want a specific rate limit for that. So let's talk about that particular piece. First, we set defaults on the WAF and the API gateway; these are static numbers. These rules are derived from our internal understanding of capacity, user behavior, and business requirements. Second, getting the right rate limits for fine-grained combinations of our access patterns is hard because of our evolving customer behavior. We were manually creating rate limits based on segregating customers into small, medium, and large buckets. This exercise is manual and laborious: any time a customer sale would happen or a customer would grow, we would have to go change the rate limits. So we needed a solution which would, one, evolve with changes in the data, and two, provide the right rate limits for a given date and time, for example weekday versus weekend, 10 a.m. on a weekday versus 2 a.m. on a Saturday morning. The third piece is that the solution and the models should be easy to iterate on. So we chose machine learning to solve this problem; it's one of the very good use cases for machine learning at RazorPay.
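Before getting into the ML piece, to make the gateway-side limiter concrete, here is a minimal token bucket sketch with one-second granularity, keyed by customer and route. This is a single-process illustration with assumed parameters, not the actual plugin, which runs inside the gateway and shares counters across nodes.

```python
import time

class TokenBucket:
    """Minimal token bucket: `rate` tokens are added per second, up to a
    maximum of `burst` tokens. Each request consumes one token."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per (customer, route) combination, so limits stay scoped.
buckets = {}

def allow_request(customer_id: str, route: str, rate: float = 100, burst: float = 100) -> bool:
    bucket = buckets.setdefault((customer_id, route), TokenBucket(rate, burst))
    return bucket.allow()
```

Because the refill happens continuously, a one-second granularity falls out naturally, which is what lets this layer react much faster than the five-minute WAF windows.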
We used a statistical model to predict, for each combination, a rate limit that can be consumed by different tools such as the API gateway or the WAF, and we needed it at different granularities, for example requests per second, and rolling requests per five-minute window, which is mostly for the WAF. We needed bounds on the target variables: one spike on a given day should not skew the data considerably. We also needed to train a model that could predict rate limits for unseen combinations, for example when there's a new route that a particular customer is using, or they're adding a new business, and so on. As I previously mentioned, we collect a lot of data at RazorPay. We use this data to run machine learning models on a regular cadence using Databricks to predict rate limits for multiple combinations, such as route level, route plus customer level, et cetera. Databricks is a third-party ML platform that we use; it provides the ability to run custom ML algorithms on a regular cadence, which here is mostly to predict limits for a given day and time. Additionally, there's a Lambda script, which we'll talk about in the next couple of slides, that pulls these rate limits from the S3 bucket and applies them on the WAF and Edge.

So let's go back to the scenario I spoke about earlier: how do we differentiate Diwali sale traffic from a DDoS? The ML model learns traffic patterns for a given workload in the context of a specific customer. It does this by looking at anomalies across parameters such as time, customer details, country, HTTP request details, et cetera. We also take manual input fed into our system by the sales and support teams, which serves as another parameter to our model. Apart from rate limits, we've added a few more preventative measures thanks to Shield Advanced. We delegate all of the L3 and L4 protection to AWS; internally, our insights and alerts are still set up to detect those types of attacks too. We also built context-aware captcha challenges for UI-based resources. This is again powered by a custom plugin on the API gateway that understands user behavior from the insights that we collect and transparently returns a captcha page to suspicious users. Our stance at RazorPay is that captcha-based solutions are unfriendly to the user experience, so we keep this to a minimum. Back to Amit.

Having good prevention solutions and strategy is great, but we need to always assume that attacks can breach them. Therefore, regardless of those solutions, any DDoS strategy should always build tech for reactive mitigation, and that's what I'm going to talk to you about next. There are three key pieces required for mitigation: one, runbooks; two, good insights and dashboards to understand the patterns of an attack; and three, a workflow that allows us to target that specific set of patterns to block an ongoing attack. Our workflow at RazorPay consists of both manual and automated solutions. The manual one looks somewhat like this. First, engineers receive alerts for DDoS and DoS events. They review patterns across a few dashboards: AWS Shield and other internal monitoring tools like Grafana and Looker. If the attack is already being handled by our prevention tooling, they still review whether more protection is needed. And finally, if required, they end up setting rules on the WAF and the API gateway. We rolled out an automated solution that aims to imitate this workflow.
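To give a feel for how those predicted limits get consumed downstream, here is a sketch of a Lambda-style job that reads a rate-limit file from S3 and hands the values to the enforcement layers. The bucket name and file layout are hypothetical, and the apply step is a placeholder; the real updates go through the WAF and gateway APIs.

```python
import json
import boto3  # available by default in AWS Lambda runtimes

s3 = boto3.client("s3")

def load_predicted_limits(bucket: str, key: str) -> dict:
    """Read ML-predicted rate limits from S3. The file layout below is a
    hypothetical example: {"<customer>:<route>": {"rps": 50, "per_5m": 9000}}."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())

def apply_limits(limits: dict) -> None:
    for scope, limit in limits.items():
        customer, route = scope.split(":", 1)
        # Placeholder: in practice the five-minute limits would go to the WAF
        # APIs and the per-second limits to the gateway's rate limiter config.
        print(f"setting {limit['rps']} rps / {limit['per_5m']} per 5m "
              f"for customer={customer} route={route}")

def handler(event, context):
    limits = load_predicted_limits("ml-rate-limits", "latest/limits.json")
    apply_limits(limits)
```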
The automated solution is driven by a Lambda function that gets invoked when AWS Shield raises a DDoS event alarm. We implemented the first version of this based on IP blocking at the WAF. However, we quickly found that this solution had the potential to block legitimate traffic, especially because our end users could be behind NAT IPs, at their workplaces for example. We then moved to using WAF rate limits instead. Here, the Lambda function pulls attack context such as the route, host, and customer under attack from Druid, pulls the existing rate limits for that particular context from S3, and sets a new rate limit rule at the WAF. Due to this scoped-down rate limit rule, any burst traffic matching the attack context that exceeds the set rate limit gets blocked. We've tested and tuned this over time to minimize the number of false negatives that get through. To quickly summarize until this point: we've discussed the API gateway stack, how we collect and process insights, and specific prevention and mitigation strategies that leverage rate limiting and machine learning. While the pieces I mentioned are individual solutions, on our platform they run together and complement each other as one cohesive solution.

We've discussed the tech that we've built, but through our journey over the last six to eight months, we found that focus on process and people is equally, if not more, important. At RazorPay, we've improved our incident management processes specifically for DDoS. This involved creating specific runbooks, improving monitoring by setting up dashboards and alerts, and training our on-call engineers. One of our biggest wins on this front was creating a chaos testing program specifically for DDoS. What we did was create a cross-functional team of engineers who are responsible for defining and scripting a variety of different DDoS scenarios. These simulated attacks are then run directly on our production systems on a schedule, and the results and recommendations from this team are published to all the relevant teams at RazorPay for action. With all these solutions in place, we were able to successfully create and implement a DDoS protection strategy at RazorPay. The DDoS chaos program helps us test our solutions and iteratively improve and battle test different flows and systems. Thanks to these changes, one key result was that we were able to run chaos tests on production where we handled close to 40 times the earlier baseline for the product under attack. That brings us to the end of this talk. Thanks, everyone. We hope this was an informative session and we'd love to hear your questions and answer them.

Thanks, Amit and Ashwath, for a great explanation of how you folks at RazorPay are tackling DDoS attacks and mitigating them whenever they arise. With that, we have one more person with us on the panel, Mehul, who is an engineering manager for DevOps at Kohler. He has 14 years of experience, starting from tinkering with a Linux box on his own PC to becoming a sysadmin and then moving through different roles in his ever-changing journey. Mehul, I'll give the mic to you: what are your thoughts on this particular topic and the talk we just heard?

So yeah, this was an interesting talk to listen to, about how the whole setup has been thoroughly analyzed: recognizing what kind of different risk patterns you have and how you classify the threats.
How do you protect things like APIs, which are a lot more vulnerable, versus a static website, which is much easier to protect? It was a pretty interesting insight into how you think about classifying your risk as well: not just blindly putting a DDoS protection mechanism in place, but understanding what you are building and what the threat models are for each piece, and putting the right measures in place for them. Also, if anybody is looking to build a more secure mechanism, how do you think about the different layers: places where you want to do early recognition but where there are limitations on what can be done, say at the WAF level, versus going deeper down into your API gateway to get more insights and figure out what is meaningful traffic versus what is completely wrong traffic, which a WAF will probably not be able to easily understand. So you are not just blindly blocking any kind of high traffic that may actually be legit. There is a good balancing mechanism that they have thought about and set up, and that preserves a good experience for legitimate users while reducing the ability of bad actors to carry out a more sustained attack. So that's a solid thought process on how to design your systems to be more useful.

Thanks for your thoughts, Mehul. You rightly said it's been thought through: what edge cases can come into the scenario and how to fix and mitigate them, not just trusting one source of information but going through multiple data sources, getting the data, and taking the decision. It makes sense. Thanks for your thoughts, Mehul. Amit, before we move ahead to the questions on this livestream, I have a couple of questions for you folks. I'm more interested in how, once an attack is over, you remove those bad rules from the picture, given that, as you rightly said, IPs can be NATed in a country like India. How do you remove those rules so that no other customers are impacted by the rules that were set?

Amit, do you want me to take that one? Sure, sure, go ahead. Yeah, so I think there are two pieces to it, right? The first piece is that we scope the rule down to a particular context, the attack fingerprint of what is getting attacked. Once we scope it down, we have the rate limit, so that will allow the legitimate traffic through, and whatever crosses that legitimate level will get blocked. That is part A of the answer. Part B is the cleanup phase: we have good alarms around detection and fingerprinting, figuring out when there is a spike in traffic and what the attack fingerprint looks like, and anything that no longer matches it, after a given period, we clean up.

Thanks for the answer, Ashwath. I have one more question on that aspect: chaos engineering is the buzzword in the industry, right? And I think our audience will be more than happy to hear about your experience of how you implemented chaos testing on production without breaking the actual production systems, what your experience was like, and how you managed to execute it properly. Yeah, great question. In general, I think chaos engineering is a very iterative process, right?
You want to approach it in steps, and essentially you want to cover your bases, probably manually at the start, not even worrying about touching systems, and do a more theoretical exercise of how you would approach a chaos scenario. It was iterative for us as well. We started off with a categorization of the different routes or, I would say, properties that we want to expose and test: the things that impact our service the most. Once we had that in place, it was an iterative exercise of getting the right context involved with the team that was performing this. So for example, if I had to look at the core of RazorPay's business, which is payment processing, how do I size that across the different routes that need to handle those requests? Can I then emulate attacks, first theoretically and then via tooling, to determine what sort of potential outcomes are going to come out? Most of the time we find we don't even need tooling to get to an answer; the answer is right in front of you once you do a theoretical discussion on it. Over time, we ended up building a pretty decent suite of tools that help us launch and track these attacks. How we prevent these tools from causing impact is not really a part of the chaos solution itself; that's handled by a separate set of availability mitigation tooling which we have in place, and essentially we rely on the measures we put in place in production to prevent DDoS from taking place to catch that. Of course, I'm not saying that's bulletproof. We still have people on call, actively monitoring and looking at metrics during that time for those specific sets of requests, routes, properties, et cetera, to make sure that the tool itself is not bringing us down. So yeah, we do have those measures in place.

Just two more points to add on to that. The first piece is around visibility. The existing tools may not scale up whenever there's a high volume; for example, if you use something like Sumo Logic or Splunk, they might not scale up whenever there's a spike in traffic. So that is something that needs to be looked at, and there has to be a thorough effort on that. The second piece is that on the chaos side, it's very important to have a kill switch. Another thing that we are trying to build now is to look at the health of the system and actively kill the test if things are going bad, if the CPU is spiking or the memory is spiking. We try to reduce the load or kill the test if it crosses the threshold. So that's also something that we look at.

Yeah, it makes sense actually. Thanks for these answers, Ashwath and Amit. That makes sense for chaos engineering: if you take the system as a whole, it will not be possible to implement chaos engineering in one go, but it's always good to go iterative in nature, select a small number of APIs, see how the system behaves, and then take the overall system health into the picture. Makes sense, thanks for those answers. We have a couple more questions on the livestream; I'll take the first one. How do you detect and mitigate an unintentional DoS that happens on the system? It might be possible that one of your merchants was testing something and they accidentally did a DoS, unknowingly and unintentionally. How do you handle that? So yeah, this was a part of the presentation that we gave.
So because we are able to resolve a lot of context at our API gateway, we're able to build a pretty intelligent data set, right? It essentially collates what sort of traffic patterns we're getting, down to the level that we're able to slice it by a particular customer and by a particular scope on that customer. So for example, we're able to say, if a customer like Zomato is accessing a particular payment method, what the anomalous traffic on that set of traffic patterns looks like. All of this data set is then consumed by our ML model to determine whether that's an intentional or an unintentional DoS: basic anomaly detection. So that's the gist of it, but again, I'm doing a disservice by summarizing all of that data modeling, at least the ML modeling, into one black box in this way; there's a lot more to it that we've optimized over time. Ashwath, do you want to add on with respect to what we are doing on the ML side?

Sure. On the ML side, there are two sides of the story. The first side is getting the right rate limits for a given combination. For example, for this particular route and this particular customer, that combination, what should the rate limit look like? How do we differentiate a kirana shop versus an IRCTC or a Zomato? That piece is key, and to do that, ML comes into play. The second side of the ML story that we're also looking at is around anomaly detection. This is to say, okay, this merchant, this kirana shop, usually sends so much traffic, and they're doing some testing right now, so I'll allow two or three times their normal traffic. Whereas if it exceeds that, then we start throttling it and we say, hey, what's going on, kind of thing.

Yeah, thanks for your answer. I think the takeaway I get from this is that it's not only one factor or one thing that you're relying on and blocking traffic on; it's a mix and match of multiple parameters, and then blocking based on that. That actually reduces the scope: even if someone is doing an unintentional DoS, it reduces the scope of the block to a given crowd, based on the fingerprinting mechanisms. So yeah, that makes sense. Moving on to the next question we have: if AWS Shield is taking care of distributed denial of service, what is RazorPay doing on their side to make sure that they are following a zero trust model?

So I'll start with the answer; Amit, feel free to add on. AWS Shield has three main facilities. The first is that whenever there is an incident, they'll get on a call and provide support. The second is that whenever there's a DDoS attack, they'll send you an alarm; it's called the Shield alarm. And the third piece, which is interesting, is that they'll give you some visibility to say, okay, here are the top five contributors in terms of referrers, IPs, countries, and so on. They also offer the option of automated blocking. Automated blocking is super tricky, because you can burn your hands: it's only looking at the top five contributors, and for us specifically, India is always going to be a top-five contributor. So if we accidentally block India, then a large part of our traffic is gone. That's why we have to be super careful with part three, and that's where we did all of this work, specifically us being a B2B and API-driven company.
We had to do all of this work to ensure that legitimate traffic does not get blocked. And on zero trust: I think from day one, the way we have architected the system is around zero trust, where the basic idea is that we have components and no component should trust another. That's the way we've architected it from day one. Amit? Yeah, quickly adding on to zero trust. It's not really linked to our DDoS surface, in the sense that it's more of a parallel concern, but zero trust is enforced both at the perimeter, basically at the API gateway where we enforce authorization checks, and also within the perimeter. So when our services communicate with each other, we make sure that none of those requests are trusted by default, and that there is an additional layer of enforcement that happens for all of these requests. That's the way we enforce zero trust both within and outside.
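To illustrate the kind of per-request enforcement being described for service-to-service calls, here is a minimal sketch: every internal request must carry a verifiable identity, and anything unverified is denied by default. The header names and HMAC scheme are illustrative assumptions, not RazorPay's actual mechanism, which could equally be mTLS or signed tokens issued by an identity service.

```python
import hashlib
import hmac

# Hypothetical shared secret per calling service; in practice this could be
# an mTLS certificate or a token issued and rotated by an identity provider.
SERVICE_KEYS = {"payments": b"payments-secret", "orders": b"orders-secret"}

def verify_internal_request(headers: dict, body: bytes) -> bool:
    """Reject any service-to-service request that does not carry a valid
    signature, regardless of where inside the network it originated."""
    caller = headers.get("X-Service-Name")
    signature = headers.get("X-Service-Signature", "")
    key = SERVICE_KEYS.get(caller)
    if key is None:
        return False  # unknown caller: deny by default
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```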