So a quick brief about Uma. Uma is currently the head of Chaos Engineering at Harness. Earlier, Uma co-founded the companies ChaosNative and MayaData, where he co-created the open source projects LitmusChaos and OpenEBS. He's an active maintainer of the popular CNCF chaos engineering project, LitmusChaos, and a regular speaker on chaos engineering and cloud native DevOps at meetups and conferences related to SRE, reliability, and chaos engineering. Uma holds a master's degree in telecommunications and software engineering from the Illinois Institute of Technology, Chicago, and a bachelor's degree in communications from SV University. Thanks a lot, Uma, for being here. The stage is yours.

Everyone, good morning. It's good to see these events coming back to the in-person form, right? I've been missing them for a few years; we all got stuck in our seats, delivering talks into a laptop. It's good to see some real questions coming up. Thanks, Sandil, and not only for inviting me: thanks to the organizers too. It's a big job to run events like this. You have to run around for permissions, find speakers; organizing events is not easy. So congratulations on running such a big event, first of all. And thank you for taking the time and coming here to learn a little bit more. Today I thought of spending some time going technically deep into chaos engineering, but then I saw there is one more talk on LitmusChaos by Atul later this evening, so I'll probably stick to some of the good practices and why you need to do chaos engineering. So with that, let's start.

Reliability is not a new topic. It's been there for quite some time; people have talked about five-nines reliability and all. But it's become more important for various reasons. In recent times there's been so much digitalization. Everybody is on their phone doing business, and we get frustrated if something doesn't work. Customer expectation is really, really high nowadays. Even people who have nothing to do with computers expect payments to just happen: you have a smartphone, and when you want to transfer money, receive money, or pay money, it should just happen. Otherwise, they start scolding whoever is providing the whole technology suite. That's the kind of expectation of reliability. So who provides that reliability is a big question, and how to do it right.

Just a little bit about me. I work at Harness, heading Chaos Engineering there. I've co-founded three companies. The last one was ChaosNative, where I was doing chaos engineering using Litmus, which I co-created about six years ago now. It's a CNCF-hosted project, currently incubating, probably going to graduation sometime next year. There are a lot of users of Litmus, probably in the thousands, and we're seeing about a million Docker pulls a month, so that's great to see. It all happened because Kubernetes has seen a great adoption curve; there's no question of "why Kubernetes" anymore, so things are just moving with that flow.

So let me talk about reliability a little bit more, and then who should own reliability in DevOps. There's a misconception that reliability should be delivered by SREs, that the people who manage reliability are the ones who take care of things when they go down, and the ones who get fired if something doesn't work. All that is fine.
But it's not just SREs that are responsible for reliability. It's really the code and the developers, right? You need to work in tandem; reliability has to be built into DevOps itself.

So what are today's challenges in DevOps? Everybody wants to move to Kubernetes. The sales pitch for Kubernetes is: why are you waiting six months to deliver a change? Put in hundreds of millions of dollars, increase customer satisfaction; you think of a change, you deliver it next week or next month. That's the change that's happening. The pitch for moving to Kubernetes is accelerating change delivery, getting the change to the end user as quickly as possible. No more once-in-six-months upgrades. Speed is a challenge, and then there are a lot of good practices that have come along, and that's also the driving factor for all this innovation. If you look at any container pitch, any cloud native pitch, the real decision makers, the CIOs and CTOs of big banks or retail companies: why are they spending hundreds of millions of dollars a year on projects to go cloud native? It's really that customer expectations have changed. We cannot deliver changes slowly; we have to deliver fast before a competitor takes away our users. So speed is definitely a challenge, and we're doing it. Most people are doing it.

And while you make changes super fast, quality becomes a problem. Everybody says: move to containers, you don't need to worry about the next person, just put that process into a container, expose nice APIs, deliver it, go home, it runs anywhere. That's fine. But does it run properly? Who takes care of that QA testing? It's a headache for the QA folks. The software just arrives packaged one way; that's all a container gives you. So managing quality has been an issue in the microservices paradigm. That's definitely a challenge. And because quality is a challenge, developers are getting pulled into fixing issues more often. You deliver fast, it goes to the next stage of the pipeline, and then it doesn't move fast anymore. There are a lot of tools coming out to measure developer productivity: where are you spending time, writing code or fixing bugs? And what bugs: bugs found immediately in pipelines, or bugs found by SREs or customers, et cetera? So developer productivity is a big challenge today.

Or: I made a decision to invest a lot of money, but where is all this money going? It's going into developers. Developer cost is a trillion-dollar number today across the world. So are we spending that money well? And cloud cost: look at anybody who is on the cloud, and ask what percentage of it goes to actually running the service versus development and other things. It's roughly 30% production, 70% testing and other things. What are these developers doing? It's not only the salary; if they keep running the same testing a hundred times, you are increasing the cost. So developer productivity is super important to make a reasonable ROI on all of this. You're trying to push things fast, somebody is facing the heat for quality, and somebody is being told: be very productive, do some magic, and still push it out faster, with quality. Either that person is a magician, or they leave the company for a hike at another company and do the same thing all over again.
So the main problem in all this big push is that nobody is talking about reliability. Reliability becomes a problem because you put in some wrong code or wrong configuration. And wrong configuration, again, may not be an SRE problem; maybe they were not given the right way to set that configuration. So it could be a developer problem, a design issue. If you have a nice design, SREs can use those parameters to automate something or configure things properly. So developers need to think about the end service goals, not just how my service talks to the next service and how to be interoperable, but how it can be reliable. There is not enough focus on reliability.

So is reliability a goal? That's the point I'm trying to push here. Developers definitely should think about how this will eventually run as a service and how it will be reliable. Because if you want to be really productive, you don't want to be debugging it on calls with SREs all the time. It should just run. And even if they find some issue, it should be fixable by the SREs, because good enough configuration parameters are available; I don't need to be involved, me as a developer. If you're on calls all the time, or 50% of the time, it's because you're a critical engineer; not everybody gets pulled onto those calls. Only if you're very good and you know everything will you be called, because there is an SLA and it has to be solved very quickly, otherwise the pressure mounts even further. So if you are such a good engineer, you should be writing code, not spending all your time with SREs. You should be designing a bit better: find issues in the design and keep improving it. Yes, reliability is a goal for good engineers, so spend time designing with that aspect in mind.

Otherwise you're really leaking these code issues, design issues, architectural issues into production. At some point they have to be found. If not now, maybe after you've moved on; some other person comes six months later, and their head will be on the anvil, because the design bug you put in isn't even known. So somebody has to put up a retaining wall for these leaks in DevOps, and that's a big trend that's coming. The new trend we are seeing is continuous resilience. Continuous integration is how to build the code; continuous delivery is how to deliver the code. All that's good. Now the headache is that we are on cloud native, customers are out there using it, and my money is now going through microservices. Earlier it was just my own friends using it, so who cares, right? It's not like that anymore. The pressure is higher, and there are hundreds of millions of people doing these transactions on a daily basis with seriously high expectations. So it can't just be debugged after the fact; it has to be stopped from leaking to the next stage. I've seen changes take six months, then three months; now, in our own organization, we do multiple builds in a day. Whether we do multiple deployments a day, sometimes yes, sometimes no, but things are happening very, very fast. And when you do deployments and changes that fast, you have to do the resilience work too. So continuous resilience is the topic that's at the innovation trigger stage now. I'll talk about it in a bit.
So how to retain this reliability is the bigger question we need to ask. Reliability of what? It's not just about code. If you think only about the code, "my code has to be reliable," then there's code coverage, checks for memory leaks, all those nice tools to make sure your code is nice and clean. But we have to think differently as developers. Reliability of what? Against what? "It's reliable, it's running, I just tested it," but after six months there is an outage. Against what? And how do you measure it? People generally quote it as five nines, six nines, only so many hours of downtime. I'm a developer; what are these numbers? You're saying developers need to take care of reliability, but tell me how: give me a number, tell me how to measure it. I'm saying my code is reliable; tell me I'm wrong. These are the problems we have to look at. You need the right angles to measure: reliability of business services or deployed services, reliability against some failure.

An outage is something that is seen by your end user. And outages, let's say in finance (somebody talked about it earlier, I think Krishna), amount to only a few hours in a year. That's the number you see. But alerts are firing all the time. That means something failed, but it still didn't result in an outage. When these faults happen, sometimes they do result in an outage. So you need to be reliable against these faults.

And why are we talking about the reliability of Kubernetes and cloud native services? Because Kubernetes has an architecture that induces faults all the time on its own, and people say it's a very good design. Pod delete is the main fault that happens. If you go back one step to the virtualization era, the equivalent is a VM delete: how often does a VM get deleted? And if you go back even further, how often is a node removed or rebooted? People used to celebrate: this server was not rebooted for one year, continuously serving. Come to Kubernetes, and things just move from here to there all the time because of some configuration. We work on the concept of reconciliation: Kubernetes has a desired state, and it has to come back to that state even if something gets deleted. In fact, Kubernetes will go and delete a pod itself because it decides it should run on another node. It's all presented as an architectural advantage, but it's actually a fault that your architecture is inducing. So the service has to be reliable, and Kubernetes doesn't guarantee that; the developer has to make sure of it. It's really on the developers to get that reliability angle into their heads.

So if I have to summarize how you measure reliability: things are happening, some faults are also happening, and my steady-state checks, the five or ten or hundreds of monitors I keep measuring, are all serving as expected. The key thing is that you should expect faults to happen. What are the types of faults? Pod delete is one, but then go to AWS: "We give SLAs." For what? For downtime. Do you give SLAs for latency? "Generally we meet the latencies too, but sometimes there's a bit of a spike." A little latency here, a little latency there, and everything gets piled up, four or five things at once.
Then because of that latency issue, transactions get queued up, that adds load onto some other node, a pod gets deleted or moved, finally some SRE gets called, and an outage happens. That's what's happening, so steady-state checks need to be really, really well measured. Even a latency increase is a fault, not only a pod delete. A service's latency goes up and I haven't done anything; some cable got cut in an ocean somewhere, and there's a small 10% latency increase for an hour until the switch is rebooted or traffic is moved somewhere else. So latency is a problem; you should consider it a fault.

So how do you do all this fault injection and steady-state measurement? People are doing ops; what SREs look at is monitors, continuously asking how they can measure things better. That's a job they're doing very well. But if you look at the reliability journey over the last four or five slides, what we are saying is that you have to measure while faults are happening. You don't need to break everything. One misconception about chaos engineering is that you take down an entire data center and then check whether things still work. No, that's only one aspect of chaos engineering. Chaos engineering is really about introducing, in a controlled way, the faults that are happening all the time but are also being ignored all the time, and then measuring something that helps the developers go and fix either their design or their code. A 5% increase in one service's latency should not eventually increase the end service's latency; something should take care of it, maybe over-provisioning or provisioning of certain resources, et cetera. Each service is unique; its developers can take care of it. This is chaos engineering: measuring the steady state while faults are happening.

And for those reasons it's a natural choice now. Chaos engineering has been around for a long time, but with Kubernetes, microservices architecture, and everybody pushing things out much faster, you don't have time for traditional, thorough testing. Earlier there was an API change; now each API comes through as a microservice, and before I know it a lot of containers have been upgraded to the next version. So how do I go and test that change, not only in production but in pre-prod? You use chaos engineering in a way where it doesn't matter what changed: my test cases stay the same. I go and introduce the hundreds of faults that are possible, and my service should still be reliable. What change happened? Who cares; whatever fault happens, it should be reliable. You're taking a different approach: how fast you change and what you change doesn't matter. I go by the service angle. The service has 100 dependencies and 200 resources; those 200 resources have a finite set of faults that can be injected. I will inject them and see whether things still work. Yesterday it was working, all the steady-state checks were good; now a new build has come, and I'll run the same checks and see. So this is a totally different approach to making sure you're not leaking bugs downstream. And nothing changes in chaos engineering itself; the original definition remains the same: introduce a fault and then verify.
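To make that loop concrete, here is a minimal sketch of "introduce a fault and verify the steady state" against a Kubernetes deployment. It is illustrative only: the namespace, label selector, health URL, and latency threshold are hypothetical, and a real tool such as Litmus would wrap this with scheduling, blast-radius controls, and reporting.

```python
# Minimal "inject a fault, verify the steady state" loop (illustrative sketch).
# Assumes a kubeconfig is available and the target service exposes an HTTP
# health endpoint; all names and thresholds below are hypothetical.
import random
import time

import requests
from kubernetes import client, config

NAMESPACE = "shop"                                     # hypothetical namespace
LABEL_SELECTOR = "app=checkout"                        # hypothetical target pods
HEALTH_URL = "http://checkout.shop.svc:8080/healthz"   # hypothetical probe URL
MAX_LATENCY_SECONDS = 0.5                              # steady-state hypothesis


def delete_random_pod(api: client.CoreV1Api) -> str:
    """Fault injection: delete one pod behind the service (pod-delete chaos)."""
    pods = api.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    victim = random.choice(pods)
    api.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    return victim.metadata.name


def steady_state_ok() -> bool:
    """Steady-state check: the service answers, and within the latency budget."""
    start = time.monotonic()
    try:
        resp = requests.get(HEALTH_URL, timeout=2)
    except requests.RequestException:
        return False
    return resp.status_code == 200 and (time.monotonic() - start) < MAX_LATENCY_SECONDS


if __name__ == "__main__":
    config.load_kube_config()
    api = client.CoreV1Api()
    print("deleted pod:", delete_random_pod(api))

    failures = 0
    for _ in range(12):            # probe for about a minute while Kubernetes reconciles
        if not steady_state_ok():
            failures += 1
        time.sleep(5)
    print("resilient" if failures == 0 else f"steady state broken {failures} times")
```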
It's just that who does it, how frequently you have to do it, and why you need to do it: those reasons are totally different now. So where is chaos engineering being used today? Mostly I've seen the LitmusChaos code being picked up very aggressively by developers; we are seeing a million Docker pulls a month. But now I'm also seeing a lot of enterprise deployments. Financial services are using it because they are moving to Kubernetes, and they're doing good business, so congratulations to that industry. Because everybody is on a smartphone doing something with their money nowadays, not just in India but everywhere, there are a lot of services, there's a lot of load, and customer expectations have increased. So almost all the banks are either doing chaos engineering now or will do it in the next two or three years. The ones who have already been doing chaos engineering are the very popular banks, and even they are changing their strategy around it. So they are definitely doing it.

And then DR scenarios. If you go to RBI or any other federal regulator, they are trying to control how banks operate. Are you doing DR testing? That's a regulation; if you say no, they'll revoke your license. And they have to prove that they have done DR. "Come on, DR testing I do once in six months." Why? Because you do a release once in six months. Your DR works for that software version: you brought down a node, brought down a data center, brought down an AZ, and it worked. That's what you tell RBI. But by the time you report that, one more software change has happened, and you can no longer say DR works; it worked on that version, but now you've upgraded to this version. So DR is a big pain, and proving that you have tested these DR scenarios is a big pain. They don't expect it with every build, but at least every two or three months; these are the kinds of regulations. So use chaos engineering to automate these DR scenarios, in pre-prod and prod (see the sketch below).

And highly scaled environments: NPCI, UPI payments, where they do billions of transactions, so you can imagine the kind of tech that's there. They're one of the users of chaos engineering, and I spend a lot of time with them, because it's very difficult to measure what goes wrong where. You bring down a little bit of those resources to make sure you are reliable. And obviously Kubernetes environments, because pod deletes happen all the time; if you're on Kubernetes, you should just do chaos engineering. In fact, I used to recommend that whenever you write a test case where you spin up a pod or bring down a replica, the developer should automate that test. If you start doing that, then when things scale, all your test cases become very natural in terms of protecting reliability.

There are a lot of other implementation challenges as well. First of all, you have to sell chaos engineering to your management. "Yeah, I have a lot of reliability problems, and basically you're saying you'll break more things?" That's what it is, but now we are saying: do continuous resilience. Yes, break things in pre-prod and QA. What's wrong with that?
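As a rough illustration of automating one small DR-style scenario (rather than a once-in-six-months manual drill), the sketch below cordons and drains a node and then checks that the service still responds. The node name and health URL are hypothetical, it targets a non-prod cluster, and a real run would add blast-radius limits, rollback, and reporting.

```python
# Sketch: automate a small DR-style drill -- drain one node, verify the service.
# Assumes kubectl is configured against a non-prod cluster; names are hypothetical.
import subprocess
import time

import requests

NODE = "worker-2"                                        # hypothetical node
HEALTH_URL = "http://payments.staging.example/health"    # hypothetical endpoint


def drain_node(node: str) -> None:
    """Fault injection: stop scheduling on the node and evict its pods."""
    subprocess.run(["kubectl", "cordon", node], check=True)
    subprocess.run(["kubectl", "drain", node, "--ignore-daemonsets"], check=True)


def service_healthy() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=3).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    drain_node(NODE)

    ok = False
    for _ in range(6):                 # allow about a minute for rescheduling
        if service_healthy():
            ok = True
            break
        time.sleep(10)

    subprocess.run(["kubectl", "uncordon", NODE], check=True)   # restore capacity
    assert ok, "DR check failed: service did not survive the node drain"
```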
If you are afraid to break things in pre-prod, then you're definitely hiding something. And it's not easy if you think about it: there's a maturity model, it takes years (I'll talk about some of the myths in a moment), and you need to invest in chaos engineering, because somebody needs to write these test cases, fix them, continuously improve them, and so on. When a new microservice is introduced, how do you make sure chaos scenarios cover that new service too? There are a lot of implementation challenges, and we are at the beginning of this real chaos engineering cycle.

So with that introduction, we know what chaos engineering is and why: primarily Kubernetes, cloud native, developers, and too many moving parts. Let's talk about some myths and facts. Many people say, "I'm doing chaos engineering because I pull the plug." No, that's not chaos engineering; it's much, much more than that. It's not only about pulling the plug, it's about observing almost everything and doing it as a process. And it's not only about bringing something big down: can you introduce an API change? What if an API does not respond? What if the network does not respond? What if the error code comes back a little differently while latency is up? You need to think of it like a new design. What I tell my teams is: you have a design, you have a test strategy, and you should have a chaos strategy too, at design time, because you are the best person to say what can go wrong and whether your design has protection against it.

And chaos engineering is definitely for SREs, but it is not only for SREs. SREs should do chaos engineering; today the budget usually sits with the VP of SRE, not the VP of development, and that's changing. For the reasons I mentioned, you have to put up the retaining wall; once a bug leaks, it's too costly to fix, so you have to budget for QA and developers as well. At least I've now seen a lot of QA teams using chaos, because chaos engineering gets started at the SRE level, they don't get permission in prod, so they say, okay, let's do it in pre-prod or QA. That's how it's happening, but ideally it should flow left to right, not right to left. Developers will come along in about a couple of years; we're trying to make chaos engineering easy for everyone. Developers want everything to be easy: one simple API and everything should happen, write something declarative and the chaos test should run. If you make it that easy, they will write chaos tests alongside integration tests and unit tests: "Can I write a chaos test? Yeah, why not; make it easy and I'll use it." It's evolving, it will come in a year or two, and you'll already see examples of people doing it (a sketch follows below).

Another myth is that chaos engineering is "break something and go away." It's about 30% breaking and 70% observing. If you simply say "my traffic still works when I bring down one pod, so I'm resilient," no, that's a false positive. Try doing the same pod delete under different conditions, under load, at different times, and at some point an outage will definitely happen.
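On the point about chaos tests living next to unit and integration tests: below is a minimal sketch of what that could look like in a test suite. The helpers `delete_one_pod` and `p95_latency_under_load`, and the `chaos_helpers` module, are hypothetical placeholders for whatever your tooling provides (a Litmus experiment, a load generator, and so on); the point is only that the chaos hypothesis becomes an ordinary, automatable test with an assertion.

```python
# Sketch: a chaos test written like any other test in the suite.
# delete_one_pod() and p95_latency_under_load() are hypothetical helpers that
# would wrap your chaos tool and your load generator respectively.
import pytest

from chaos_helpers import delete_one_pod, p95_latency_under_load  # hypothetical module

LATENCY_BUDGET_MS = 300  # hypothetical steady-state hypothesis


@pytest.mark.chaos
def test_checkout_survives_pod_delete_under_load():
    # Steady-state hypothesis before the fault: latency is within budget.
    assert p95_latency_under_load("checkout", rps=50) < LATENCY_BUDGET_MS

    # Fault injection: kill one replica while the load is still running.
    delete_one_pod(namespace="shop", label_selector="app=checkout")

    # Steady-state hypothesis during/after the fault: still within budget,
    # i.e. the remaining replicas and the rescheduled pod absorb the hit.
    assert p95_latency_under_load("checkout", rps=50) < LATENCY_BUDGET_MS
```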
The real value of chaos engineering comes when you bring down a pod, or delete a pod or a network segment, and then go and observe most of the other parameters: what changed by 5%, by 10%, and why? That tells you there's a weakness that makes a particular service act slow. It's not an outage right now, but for a developer it should be treated like one. What if your SLA covers exactly that? An SLA isn't only about a complete outage. That's where SLIs, SLOs, and error budgets come in; this is what SREs measure. Give me an error budget: I can let you slow down, say, five times in a month, but not more. If a given service slows down more often than that, you are breaching my SLO, and in business terms my SLA, so it has to be treated as an error. An error budget says: I'll give you this much, but not more; if you burn through it, that's a breach. So chaos engineering is about steady-state observation.

And then there's the budget myth. People say, "I have money for my development teams and my QA teams, but for chaos engineering I don't have budget." If you push, they'll just ask the same people to do chaos engineering on the side: "somebody's pushing me to do chaos tests, just add some." That never works. It is engineering: you have to have budget, focus, and an actual strategy, because you're spending all of this to avoid an outage and improve your end customer experience in a big way. So it has to be budgeted. This is probably for the leaders in this room, or go tell your managers: you don't need to buy an enterprise tool, but let me properly evaluate an open source tool and do it; at least my time should be budgeted.

Another myth: chaos engineering is quick. No; as long as the code is there and you keep changing it, chaos engineering has to continue. It's an ongoing addition to the SDLC, because you have to be reliable all the time. That means you keep changing, modifying, and adding your chaos test cases again and again. There's no limit to the kinds of faults that can happen: with 1,000 resources there are easily 5,000 test cases you could run. It's a continuous process, and that's why we call it continuous resilience. Resilience has to be a continuous effort; just like CI/CD, now CR is coming. And it's not only me talking about it recently: Gartner also placed it as an innovation trigger. Maybe Mohan doesn't believe in Gartner, but they talk to hundreds of people and see what's happening. They talk to SREs, who say "this is not working, we are pushing the QA teams to do more reliability testing"; that's the feedback they get, and they find a pattern. Sometimes they influence it too, but it's happening; even if it's influenced, somebody takes that feedback and puts it into practice.

So where should you do chaos engineering? (How are we doing on time, five minutes?) Infrastructure chaos is vast: memory hogs, CPU hogs. And then there are APIs; you call some other cloud's API to get authentication done. What if it responds slowly? What if it responds in a bad way? Even if it does, you should still be reliable.
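One cheap way to rehearse that "the third-party API is slow" scenario, without touching the provider at all, is to wrap the outbound call and inject delay or errors in non-prod. A rough sketch, with the endpoint, delay, and error rate purely hypothetical (service meshes and chaos tools offer the same thing more cleanly):

```python
# Sketch: inject latency/errors into an outbound dependency call in non-prod.
# The URL, delay, and error rate are hypothetical knobs for the experiment.
import random
import time

import requests

AUTH_URL = "https://auth.example.com/token"   # hypothetical third-party endpoint
INJECT_DELAY_SECONDS = 2.0                    # simulated slow response
INJECT_ERROR_RATE = 0.1                       # 10% of calls return a fake 503


def call_auth_with_chaos(payload: dict, chaos: bool = True) -> int:
    """Call the auth dependency, optionally degrading it the way chaos would."""
    if chaos:
        time.sleep(INJECT_DELAY_SECONDS)            # latency fault
        if random.random() < INJECT_ERROR_RATE:     # intermittent failure fault
            return 503
    resp = requests.post(AUTH_URL, json=payload, timeout=5)
    return resp.status_code


if __name__ == "__main__":
    # Does our caller time out gracefully, retry, or fall back? That is the
    # steady-state question this experiment is meant to answer.
    print(call_auth_with_chaos({"client_id": "demo"}))
```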
And then there's application chaos, which is still not widely adopted. Can you go and trigger certain circuit breakers inside your code, the failure hooks that are already there? Can you actually flip that path and test it? Ask the developer to put that hook inside the code; it's not built in, but you can trigger it at test time. That's failure-path testing, and you can go to that level.

And then operational chaos is more about what happens when something goes wrong and the person who is supposed to recover it is on vacation. Is it well documented? Is it automated? Operational chaos issues are commonly seen whenever you introduce a new service: things go down, but you don't know how to recover because only a few people know how, or the keys are not there; the keys are very secure, and that person has also left the company. Gone. A month later you find out you need to retrieve that key. A lot of chaos can happen at that point. So start with infrastructure, because if the infrastructure goes down and an outage happens, the impact is very high; then keep learning and keep moving toward the other types. Operational chaos, for example: don't tell anyone, just ask a very senior SRE to go on leave intentionally for a day, then introduce a major fault and see how the VP of SRE responds. That's the kind of chaos test you need to do.

So now to the topic of continuous resilience. As developers say, if it's not automated, it's not done; chaos testing also has to be automated. I wouldn't recommend automating chaos in production unless you're super clear about what you're doing, but I strongly recommend automating chaos in QA and pre-prod environments. Continuous resilience is verifying the resilience of your services against faults continuously; it's as simple as that. And you do it in dev, QA, pre-prod, and prod. In prod there is the blast radius concept: how much you can take down. You can do a 5% CPU increase on one node in prod; that's chaos. I ran a test case with a very small blast radius at a low-traffic time, not at peak. That's okay to do, and try to automate it, try to randomize it; randomize multiple things within that low blast radius. But you have to know what you're doing, because customer experience can suffer badly if you willfully inject a fault and the blast radius is too high. And especially in production you need auto-remediation built in: you are introducing a fault, so if something doesn't recover, the test case should drive the recovery as well; the tool can do it, depending on the tool and on who writes the test case. So be careful in prod, but try not to be so careful in pre-prod. Let somebody feel the pain; let's see who is at fault, whether a developer design issue or an operational issue comes out. Being too reckless, though, will kill the productivity of your QA teams, and that's also dangerous. So be a little bit aggressive, but not reckless.

On metrics, the common ones are mean time to fail, mean time to repair or recover, and mean time to inject a failure: how fast you can reproduce an outage.
Those are the common SRE practices and metrics. The new ones I've been recommending are resilience score and resilience coverage, mostly targeted at developers and QA. How much coverage have you done? "All the code is covered, white box testing, black box testing." No, I'm talking about resilience, about chaos tests: have you covered all faults in all areas? That's resilience coverage. My services are resilient against 10 test cases; that's good, and there are another thousand out there, but at least you know that against these 10 you're resilient. That is a metric, and it's also a way to ask for budget. Say another incident or outage happens: you can show that you've only covered these 10, this incident falls into some other area, so give me budget to go and implement that test case. Resilience score is how well you're doing against a given fault, or how well a service does against multiple faults thrown at it. It's all common sense, nothing new, but think about these metrics (a sketch follows below).

Again, if you're deep into chaos engineering and you ask somebody what they're doing, the answer is often "we are running game days." That's ad hoc chaos engineering. Continuously educating people on the why, what, and how, and doing it with a purpose, a strategy, and some budget: that's continuous resilience. Do it like engineering, and reliability will improve. How long does it take? Even if you go super fast with a high budget, it takes about three years to get to expert level. If you just start casually and your business is doing well, some support will come automatically; but you always start with one test, then slowly automate what you've done, then encourage other team members to do the same automation. Expert level is when 80% of your services are under chaos testing in non-prod, 20% is automated testing in prod, and in non-prod things are randomized. At that point you're very, very comfortable. Look back three or four years: you had no test cases then; now a lot of design changes have been forced because of this chaos testing. The ROI is very high, and you help your business scale faster. Bring another 1,000 users onto the same system? No problem, because I've tested it; it can take the load. Without this kind of data, people over-provision services: "It's not my problem, let my boss put another $100,000 into another big set of racks; I don't want to run my service above 50% load." You don't need to do that. If you're confident that when something goes down the service can still run at 80% load and work well, because you've done that kind of testing, you can optimize your cost.

So chaos engineering is a must. If you're a good developer, a good SRE, a good leader, you have to look at it proactively, and it definitely gives good returns once you start implementing it. That's the way to take care of reliability; that's the summary of it. On the open source side you have Litmus and other tools; it's a good project, I can vouch for it, I wrote it. And there's also an enterprise edition through Harness, which is now pretty stable and already deployed.
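As a back-of-the-envelope illustration of those two metrics, here is one way they could be computed from experiment results. The fault catalogue, the weights, and the results are hypothetical, and tools define their own formulas; this only shows the common-sense idea of "fraction of relevant faults covered" and "weighted pass rate."

```python
# Sketch: resilience coverage and resilience score from chaos experiment results.
# The fault catalogue, weights, and results below are hypothetical examples.

FAULT_CATALOGUE = {  # all faults considered relevant for the service
    "pod-delete", "node-drain", "cpu-hog", "memory-hog",
    "network-latency", "dependency-timeout", "dns-error", "disk-fill",
}

# Experiments actually run: fault -> (passed_steady_state, weight/criticality)
RESULTS = {
    "pod-delete":      (True,  3),
    "network-latency": (True,  2),
    "cpu-hog":         (False, 1),   # steady state broke: a finding to go fix
}


def resilience_coverage() -> float:
    """Fraction of the relevant fault catalogue that has any experiment at all."""
    return len(RESULTS) / len(FAULT_CATALOGUE)


def resilience_score() -> float:
    """Weighted pass rate across the experiments that were actually run."""
    total = sum(weight for _, weight in RESULTS.values())
    passed = sum(weight for ok, weight in RESULTS.values() if ok)
    return passed / total


if __name__ == "__main__":
    print(f"coverage: {resilience_coverage():.0%}")   # 3 of 8 faults -> 38%
    print(f"score:    {resilience_score():.0%}")      # 5 of 6 weight  -> 83%
```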
So whether you're a business or a developer, you have choices to take control of reliability. Thank you.

Thanks a lot, Uma. Any questions? Feel free to ask. Sorry it took more time; it's 11:30.

Chaos Monkey is just one of the tools; there are many others. Chaos Monkey was started by Netflix, and they did good marketing: almost everybody thinks chaos engineering means Chaos Monkey. Thanks to them, they really marketed the term. Chaos Monkey, I don't want to say it's a bad tool or a good tool; it's a tool. I started with Chaos Monkey and, being a developer, I wanted to do something very cloud native, with containers, and that's where Litmus was written. Chaos Mesh is another one, and Chaos Blade; there are three chaos projects within CNCF. If you want to do modern chaos engineering, pick up Kubernetes-related tools and they will help. And if you are an enterprise, at some point you'll want to see whether there are enterprise options too, so it's better to pick something like Litmus: if you're really going to invest a lot and want a good ROI, there is enterprise tooling and support available.

Yes, there are a lot of chaos engineers out there now. But it's not that difficult anymore. With tools like Litmus it has changed; four years ago, if I wanted to do chaos engineering, I had to hire somebody specific to that field, because it wasn't common knowledge. Now the tools have evolved, things are easy, we provide a SaaS service, so you can just plug in and start doing chaos within five minutes. And eventually it has to be done by developers anyway. It is a skill: you can ask somebody on your team to write 10 test cases, but you really want thousands of test cases, so who will write them? Whoever writes the regular test cases has to write them, but you definitely need a champion. You can pick a QA lead, ask them to read up on chaos engineering, motivate them, give them some benefits or goals, and they can pick it up. But yes, there are new roles called chaos engineers; I've seen job portals saying "we are looking for a chaos engineer." Then I tweet it: look, chaos engineering is important. Thanks a lot, and thanks for the question as well.

Config chaos, data chaos: yes. I would not recommend data chaos unless you know what you're doing. Config chaos is really resource chaos, and it's the first thing you do: change the number of replicas, scale down. Either you change the config to bring down the replicas, or you just delete one replica and see what happens. The tools provide various options; mostly they work at the infrastructure or config level. Data chaos also exists. For example, our enterprise version supports file system chaos: while data is being written, inject some corrupted data and see whether the database has redundancy for that or not. You can do that, but if you end up causing compliance problems, it's not only you; your CIO will also be in trouble. So you have to be very careful. Don't do it in production, but you can definitely do it in non-prod: try to corrupt the database in non-prod and see what happens.
Do you have the operational readiness to recover from it or not? Who knows whether you can recover from that kind of failure? But that's definitely expert level.

Hi, sir. Thanks for the presentation. Are there any books you recommend on chaos engineering?

docs.litmuschaos.io; that's the latest book. Don't read any other books. I get asked all the time whether I should write a book; I'd rather write something new in the code and the docs, and then people can read that. We are in the open source world, so just go and read. Come to the chaos community and we'll help you learn; it's all easy. Associate with the community. All of this is happening because of the community, and that's the fastest way. Don't read books; let's get something done with code. Thanks.

Hi, sir. This is Karthik from Presidio, I have one question. Is chaos engineering done at the code level? I'm asking about the infrastructure. And is disaster recovery dependent on chaos engineering?

Yes, as I said, chaos engineering should be done at the infrastructure level first; you can go to the code level at the end, but you have to start with infrastructure chaos. Is that your question? Sorry, so DR: is it dependent on chaos? Sorry? Is disaster recovery dependent on chaos engineering? Yes, that's one of the use cases. A disaster recovery scenario can be kick-started in various ways; you can bring down some infrastructure to kick it off. So DR is one of the definite use cases for chaos engineering: automating DR testing. Thank you, sir. And read a lot about it. I strongly recommend not ignoring this; even if it's not possible to implement right away, it's a good topic for you to learn.

(Is this working?) Hi, sir. There are a couple of questions. One: in the automotive world, they have Global NCAP ratings after the quality checks are done. Likewise, for chaos engineering, when you sell a product in the market, how do you differentiate? Is there any kind of tagging attached to the software?

Generally, we have not gone to that level. "Has your product gone through chaos testing?" You should ask that before buying.

Is there any kind of rating for chaos engineering, such that your product gets preference over a competitor's?

See, it's all tied to reliability. People always market how resilient they are: "we are the most resilient, reliable product," and chaos engineering is a way to achieve that. I wish I could give a better answer, but we are not there yet. I wish it were true: "buy this product because it is chaos engineering certified," something like that.

And the follow-up: is chaos engineering something that always comes at the end of software development, or should it be part of the design principle itself?

It's the latter. We have something called the chaos-first principle that I've been advocating. Chaos should not be done last; chaos should be done first. As soon as the developer writes code, they should write functional test cases, performance test cases, and also chaos test cases. Chaos testing should be added as a strategy to your design document, done together with the QA engineers.
Because if you take control of chaos at that level, you will unearth the problems early enough and your leaders will see the value; they'll encourage you to do it and to automate it. So definitely, our recommendation is to do it at the beginning. That's where the continuous resilience topic comes from, and people are saying, "yes, I think it makes sense to do it." So yes, it should be at the beginning, left to right.

Any key advice to convince management? I'm currently in the telecom world, and it's always tough to get a pre-prod environment equivalent to the prod environment, because a lot of cost and infrastructure is involved.

That is another myth. You don't need an equivalent environment to reproduce the same outage. Whenever an outage has happened, there is an RCA: what happened because of what. Some network slowness occurred, something else changed in the code and didn't respond, or some configuration was wrong, and hence the outage happened. Take the RCA and try to reproduce those symptoms in a lower environment using various chaos test cases. Load doesn't have to come from putting in thousands of users: when you put in thousands of users, the CPU goes high and database utilization goes high, and you can simulate those conditions using chaos tests. If something breaks when those go high, somebody didn't do their job well, and now you can write the test case like that. So with chaos, even in a lower environment, you can still reproduce production faults. That's a fact; it's possible with modern chaos tools. Try to sell that. There's a lot of new knowledge available, and a lot of research reports are saying chaos engineering is needed, so you can use that to make the case.

All right, thank you very much, everyone. I would like to invite Mani onto the stage. Thank you.

Thank you, Uma, for that. All right, thank you all. We have a couple of announcements. Zoho has set up a stall outside in the reception lobby, so you can collect your raffle coupon there, and the raffle results will be announced at the end of the day. We'll be breaking now for 15 minutes and we'll be back by 11:55. Thank you.