Welcome, everyone. We have Vilas with us, and Vilas has a very interesting topic: minimum viable resiliency and production readiness, and how to invest iteratively in reliability and observability. It also carries a tag of chaos engineering, which is very interesting. So without further delay, over to you, Vilas.

Thank you so much, Manjunath, and really good to see everyone here. The topic, as Manjunath mentioned, is minimum viable resiliency. Let me introduce myself first, and then we'll go forward. What do I do? For the last decade or so, my work has focused on improving developers' lives: contributing to developer productivity and making sure we invest in continuous integration, continuous delivery, and quality improvements. I was lucky enough to be a participant very early in the Netflix chaos engineering program, and performance testing and analysis is also near and dear to my heart. Right now I'm a senior director at build.com, focusing on developer experience.

So let's talk about the elephant in the room. All of us work for companies that invest in the public cloud, and that investment is growing year over year. Over time, people are also investing in what we now call a hybrid cloud, with some footprint in the private cloud and some in the public. Annual public cloud spend has been increasing for years; this data is from last year. Now that said, are we making sure that our applications and services actually work in the public cloud? Are we thinking about cloud-deployability criteria?

In my experience, there are five criteria that applications and services need to meet. First, they have to be resilient to degradation, failures, and latency issues in the cloud. Resilient means your customer doesn't see 404 pages or otherwise have a bad experience. Second is superior performance: the very existence of the cloud enables you to scale capacity rapidly, so customers get what they need and your business can grow quickly. That brings us to the third, scalable design. Is the design scalable enough to handle a growing set of customers without redesigning or rearchitecting from scratch? Fourth, it has to be portable, meaning you can change public cloud providers. There is more than one public cloud provider available today, and among the variety of options, can you move your service from one to another? And finally, compliance and security are needed for all of our applications; we know there are security challenges with workloads in the cloud, so extra precaution is required there.

But today I'm only going to talk about resiliency, because I believe it is one of the most important properties for your service or application to have. Why is that? I'm going to share a few snippets from the last few years: major headlines caused by a cloud outage, on Azure, Amazon, Google, or Cloudflare, as examples. These are cases where a business suffered an immense loss because of problems in the public cloud it was using.
And if the workload was not resilient and could not perform during that time, customers obviously did not have a good experience, and that led to a lot of pain and to these public articles about it.

So there are some truths we learn from this. One, customers don't really care what solution you're using. If they log on to a page or open an app and can't do what they wanted, they are not going to blame the cloud provider; they're going to blame you. Two, the downtime penalties are high for the business. And the trust that is lost when a customer hits degradation at the moment they're trying to use your application can be lost for the session or for a lifetime: the customer may never come back. Worse, the customer may spread the word that this is really crappy software and no one else should use it; if that sticks, you get extended loss. And finally, many companies think, let's just add redundancy, let's just scale up. Redundancy is not necessarily the complete answer. We'll talk more about what the answer is.

I'm going to talk about five important aspects. Reliability: how reliable is the service? Availability: is it always available for the customer to use? Usability: how easy is it to use? Predictability: can I say at any given time that I can definitely perform this function with this product? And finally scalability: if a million people log on to the site, will it stand up or will it fall over? These are the customer-perceived "-ilities" of your product. Remember this line, because we will refer to it again and again. Customer perception is what we're going to talk about, and that's how resilience is measured.

The customer doesn't just perceive these -ilities; they also look at consistency. If you're reliable today and not tomorrow, that is inconsistency. If you're available on weekdays but not on weekends, or the other way around, that is inconsistency. This lack of consistency leads to a loss of trust, and that's exactly what happens when there is an outage. When there is a cloud issue, or any issue you can't control, you lose your ability to be consistent for your customer, and that results in loss of trust.

So what we've established is that a lack of consistency means a loss of trust, and any loss of trust means we lose that customer's business, for the session, the month, the year, or possibly for their lifetime. And not just theirs: we may lose their friends' business, their relatives' business, whoever else they can influence. Even well-designed and well-tested products face major losses of customer trust, and it takes a long, dedicated effort to bring that trust back. There are examples where companies have done it (I've linked one here), but winning trust back takes an extreme amount of effort, an investment that companies can ill afford.

So the question I want to answer today is: can a product, application, or service maintain these customer-perceived -ilities during turbulent conditions in production? All this time we've been talking about inconsistency, and I've been hinting that something could go wrong.
What are those things that go wrong? We have seen that they are possible. Every one of us surely remembers at least one incident, at least one news article, where a cloud provider went down and it impacted a company.

So what is the answer? Let's answer it together. My end goal with this presentation is to share a vision for maintaining a resilient application ecosystem in the cloud, where failures in infrastructure dependencies cause minimal damage to the end-user experience, or no damage at all if you are very lucky.

Let's treat this like any other work item and start with the requirements. There are three. First, resiliency testing: we constantly check whether our product can survive degradation and failures, whether internal (code-level) failures, critical dependency failures, or infrastructure dependency failures. Second, the deployment model: how is this product deployed? Are we using best practices for redundancy? Can we fail over from one end to another? Are we choosing the right-sized VMs? And third, KPIs: what are we monitoring, both for the actual product in play and as the overall measure of how we succeed in this entire endeavor? Those are the three requirements; let's dig into each one.

Resiliency testing. I'm going to focus on the testing part. All of us are very familiar with CI/CD: continuous integration and continuous delivery. What are the stages? A team starts planning: they work with a product owner or other stakeholders and figure out what they want to build. They start coding. Once the code is in, the merge request or pull request goes through code review, then some small verification happens. If it gets merged into the trunk, a verify pipeline runs all the unit testing, integration testing, and end-to-end verification. Then there's a profiling stage where you do performance testing, resiliency testing, and so on. Then you deploy, and finally it goes to production. In production there is monitoring: you measure, you see what's happening, customer issues come up, and you go back to planning. The cycle continues.

Why am I talking about this? Because resiliency testing has a place in the later half of this cycle. Once you have written your code, from the first point of verification, there is a need to write tests that check failure paths: failure injection tests. It is important to experiment with what happens when something fails, when the happy path is not executed (a minimal sketch of such a test follows below). As you come closer to deployment, it's important to run failover exercises; earlier, in pre-production, we do fallback testing. We ask: what output does the customer see? What does the UI show when it can't fetch data from the API call it's making? That's fallback testing. And then finally, we have to tune the configuration.
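To make failure injection concrete, here is a minimal sketch in the pytest style. Everything in it is hypothetical: a toy `get_homepage` that depends on `fetch_recommendations`, and a test that forces that dependency to fail and asserts the customer still gets a working, degraded page rather than an error.

```python
# Minimal failure-injection test sketch (pytest + unittest.mock).
# The tiny service below stands in for real code; all names are hypothetical.
from unittest import mock


def fetch_recommendations(user_id: str) -> list[str]:
    """Pretend critical dependency (e.g., a downstream API call)."""
    return ["movie-1", "movie-2"]


def get_homepage(user_id: str) -> dict:
    """Happy path uses the dependency; on failure, serve a generic fallback."""
    try:
        rows = fetch_recommendations(user_id)
        return {"status": 200, "rows": rows, "used_fallback": False}
    except Exception:
        return {"status": 200, "rows": ["generic-row"], "used_fallback": True}


def test_homepage_survives_recommendations_outage():
    # Inject a failure into the critical dependency and verify the fallback.
    with mock.patch(f"{__name__}.fetch_recommendations",
                    side_effect=TimeoutError("simulated dependency outage")):
        page = get_homepage("u123")
    assert page["status"] == 200          # the customer never sees an error page
    assert page["used_fallback"] is True
```

The same pattern scales up: one test per critical dependency, each asserting the documented fallback behavior rather than a 500.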
For example, say you do everything right and you create a circuit breaker that fails over when 20% of the traffic is seeing failures. Is it really working? We need to test it, and that test has to happen before production. (Some people do it in production, but not everyone can succeed at that.) And then finally, in the monitoring phase, on-call training is needed, and a playbook needs to be defined; we'll talk more about the playbook in the next few slides. That is essentially where resiliency testing comes into play.

But all of this costs money. As we go from left to right, closer to production, each test gets more expensive. The infrastructure needs pile up, and so does the amount of time the developer needs to fix something. For example, if a performance test fails in the profiling phase, the developer now has to spend significant time improving performance; if the performance requirements are not being met, they may need to go all the way back to design and rearchitect the solution. So it's an expensive proposition. We have to keep this in mind; it will become useful later.

That's the first requirement. The second is the deployment model. As I said earlier, redundancy is part of the answer, but before we get to redundancy, let's talk about the key variables in the deployment model. Every cloud provider has an underlying network connecting one VM to another, one rack to another, one data center to another. The network can go down; there can be issues and challenges there. How does the cloud provider provide relief against that? By offering multiple regions and mechanisms like that. VM infrastructure: what size of VM is being used? Is it the right number of CPUs and the right amount of memory for the workload you need to run? Is the load balancer configured correctly: round robin, lowest-traffic-node-first, or is it mis-tuned so that traffic only ever goes to one node and no one else? And failover capacity: you may choose to fail over or not, but you may not have the capacity or the readiness, or even if you have the capacity, cold start may prevent that traffic from going to that region. All of these are variables.
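Before moving to redundancy, here is what a circuit breaker like the one I just mentioned might look like: a minimal sketch that opens once the recent failure rate crosses 20% and short-circuits to a fallback. This is illustrative only (a real service would typically use a library such as pybreaker); the window size and timeout are made-up defaults.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the recent failure rate crosses a threshold (e.g., 20%)."""

    def __init__(self, failure_threshold: float = 0.20,
                 window_size: int = 50, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.results = deque(maxlen=window_size)  # True = success, False = failure
        self.reset_timeout_s = reset_timeout_s
        self.opened_at: float | None = None

    def _failure_rate(self) -> float:
        return self.results.count(False) / len(self.results) if self.results else 0.0

    def call(self, fn, fallback, *args, **kwargs):
        # While open, short-circuit to the fallback until the cool-down passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback(*args, **kwargs)
            self.opened_at = None          # half-open: try the real call again
            self.results.clear()
        try:
            result = fn(*args, **kwargs)
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            # Note: a production breaker would also require a minimum request
            # volume before tripping, so one early failure can't open it.
            if self._failure_rate() >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback(*args, **kwargs)
```

Testing it before production means asserting that the breaker actually opens at the configured threshold, not just trusting the config value.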
To solve some of these, though not all, we can introduce redundancy. Networking redundancy means that if no route is available from one data center to another, there might be another route; there might be express routes, fast lanes for important critical traffic, or back-end options that improve networking reliability. Hardware redundancy: within the VM infrastructure there are multiple racks, and if you run those racks at 100% utilization you can never provide hardware redundancy, so that headroom has to be reserved somewhere; it's a contract setting you can negotiate with the cloud provider. Geographic redundancy: I want servers on both the east coast and the west coast of the United States, so that if one goes down, I still have the other.

Then there are scaling policies. Apart from geographic redundancy, you need smart scaling policies suited to your workload. For example, if you are a service that operates 24x7, your scaling policies still need tuning so that you provide good service during the day and perhaps slightly degraded service at night, simply because you have fewer customers then. If you are not a 24x7 service, you may even choose to scale down to a minimum number of VMs at night and scale back up automatically in the morning. Scaling policies are key.

And the final one is application fallbacks. In the application code itself, when the developer writes the code, or even at design time, they create a fallback path, so that if there are challenges or failures, customers see a fallback instead of nothing. As an example, previously at Netflix, when there was a failure, users would see a generic Netflix landing page that still had movies in it they could play. They could continue with what they wanted to do, watch a movie, without their experience breaking. When the service came back, the fallback was taken out and the personalized page returned. Policies like that were set in place.

As you can imagine, this is expensive as well. How does the cost work? Networking redundancy tends to be relatively cheap, and as we move toward application fallbacks, which require developer time, intelligent design, and execution, it starts getting expensive. There are no hard numbers here, but everyone has a sense of it. Tuning scaling policies, for example, needs really smart staff-level engineers and above who understand the system and can take the risk and make the call on right-sizing. So it's an expensive problem to solve. Let's remember this as well.
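Here is what the application-fallback pattern can look like in code: a minimal sketch modeled loosely on that Netflix-style generic landing page. The names and the page contents are invented for illustration.

```python
import functools
import logging


def with_fallback(fallback_value):
    """Serve a pre-built generic result when the personalized path fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Log for the on-call, but never surface the error to the user.
                logging.exception("falling back for %s", fn.__name__)
                return fallback_value
        return wrapper
    return decorator


# Hypothetical use: a personalized page degrades to a generic landing page.
GENERIC_LANDING_PAGE = {"rows": ["popular-movies", "trending-now"]}


@with_fallback(GENERIC_LANDING_PAGE)
def personalized_page(user_id: str) -> dict:
    raise TimeoutError("simulated personalization outage")


assert personalized_page("u123") == GENERIC_LANDING_PAGE
```

The design point is that the fallback is built and cached ahead of time, so serving it costs almost nothing at the moment everything else is on fire.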
So this brings us to the topic I'm here to talk about: minimum viable resiliency. Before I get into what it means, there is one other thing I want to share: in today's day and age there is a trade-off of quality versus speed. A lot of companies, startups or companies early in their lifecycle, sometimes choose to ship innovation as fast as possible. If speed is paramount, they say, we'll accumulate this tech debt, let's get the release out, and even if there are a few critical errors, we'll manage. For most others, it's all about quality; quality is paramount. So there is a trade-off, and in the current economic scenario there is even more of a question about investing in the right things. That brings us to the word minimum, and that's why it's important that we call it minimum viable resiliency.

Minimum viable resiliency means: what is the minimum investment, the minimum amount of work, we need to do to make sure our production readiness is solid? The word minimum leads me to create levels, something I have worked on in my past and used in multiple areas. So, resiliency levels. We will define five levels, which allows any team to pick exactly where they need to be in their resilience journey. We'll start from one and go up to five, and for each level, using the requirements we described, we'll slot things in and see how the resilience picture changes.

Before we get into levels, you obviously need to do a health check first. You need to see how healthy you are before you run a marathon; you don't want to get straight out of bed and start running. So you get an audit, and then you invest. Let's start with the audit, which I'd call a readiness check. For each application and service, here are the questions you should be answering. One, is there a standard failover playbook? If you wanted to do a failover, is there a place, a wiki article, a Google Doc, whatever your team uses, where someone can go and follow instructions that are well tested and verified, that you can rely on? Does that exist, yes or no? Two, are all the critical dependencies well defined? What is a critical dependency? Any dependency whose failure stops the service from functioning and serving its bare-minimum promises. Are those noted down somewhere? Three, are there well-defined mitigation procedures for when a critical dependency fails? If a critical dependency fails, what is your answer? "Oh no, I don't know what to do, let's call the other on-call engineer" is probably not a solution. There has to be a mitigation procedure so your customer is unblocked while someone solves the problem in the back end. Four, is there a list of non-critical dependencies? Non-critical dependencies could be things like asynchronous logging, or a transitive dependency. They may not immediately impact the functioning of the product, but over time they can cause lag or significant degradation, to the point where customers start seeing issues. And five, are there alerts tuned to warn or notify the team on any dependency failure, critical or non-critical, upstream or downstream?

This is your readiness check. If you fail this audit, you are not ready to build a resiliency model. You go back, finish the readiness check, complete all of the prerequisites, and then move forward. If the audit fails, you learn, you do it again, and then you may succeed.
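If it helps to make the audit operational, it can be as simple as a tracked checklist that blocks the next step until every answer is yes. A toy sketch, where the five keys mirror the questions above and the answers are placeholders you'd fill in:

```python
# A sketch of the readiness audit as a simple checklist; every item must pass
# before a team starts climbing the resiliency levels.
READINESS_CHECK = {
    "standard_failover_playbook_exists_and_tested": False,
    "critical_dependencies_documented": False,
    "mitigation_procedures_for_critical_failures": False,
    "non_critical_dependencies_listed": False,
    "alerts_tuned_for_dependency_failures": False,
}


def audit(checks: dict[str, bool]) -> bool:
    failed = [name for name, ok in checks.items() if not ok]
    for name in failed:
        print(f"AUDIT GAP: {name}")
    return not failed


if not audit(READINESS_CHECK):
    print("Not ready: complete the prerequisites, then re-run the audit.")
```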
So that's great: the audit succeeds, you've done all the prerequisites. Now, how do you invest? Note that none of the prerequisites should be additional engineering work; all of that should already have been compiled as part of your delivery process. No additional investment is needed there; it should already be part of your engineering investment.

Level one. Your readiness check is complete, all good. You have performed a manual failover exercise using those prerequisites. You have enabled alerts for critical dependencies and you're following them religiously. You have also created a single dashboard to track anomalies, meaning that at any given time, if a metric moves by some agreed delta, say 5% or 10%, the team can see it: latency jumped 5%, what happened; error rates jumped; this page stopped loading entirely. Some set of metrics, from Datadog or your APM, that comes back to the team to look at. And who looks at it? The on-call. So there has to be a set on-call rotation model, where every engineer takes a turn making sure the service is up and owns that for the week, or whatever the sprint size is. And the on-call is not just let out into the wild: they have a playbook in place and they use it to solve problems when they're called. That is level one. If you've completed level one, that's great, congratulations.

Now we move to level two. Level two is all of the level one requirements, plus running failover exercises quarterly. A one-time failover exercise makes the entire playbook obsolete very quickly; you want your prerequisites always up to date, so your failover exercises need to run quarterly. Critical dependency failures have to be tested in pre-production. Remember the CI/CD model I showed: this is where you start investing. If you want to reach level two, the investment has to be in critical dependency failure testing. Third, every week the engineers, or the on-call, discuss whether the alerts they received last week were tuned correctly and make sure they're prepared for the next week. Weekly scheduled alert tuning is important; otherwise alerts fire left, right, and center, eventually turn into noise, and everyone starts ignoring them. That's not a good thing. And finally, alerts are enabled for all the non-critical dependencies as well: not just present, but turned on, with dashboards to track them and someone responding to them. All of that is level two.

Once you complete level two, the next step is monthly failover tests: every month you fail your application footprint over from one geographic region to another, or one cloud region to another. You introduce automated failure injection testing for critical dependencies. And you start doing chaos testing in prod. What is chaos testing? It is the ability to inject random failures into critical paths to see whether your degradation model is good enough, with good fallbacks and a good customer experience. I want to call this out: do not attempt chaos testing until you have accomplished all of these activities. I have tried it both ways; teams have gone head-first into chaos testing, and it has not been pretty. You want to build that muscle first. As I said earlier, you do not start running a marathon straight out of bed; you train for it, and this is your training. The final two items for level three: the on-call playbook gets automation. On-calls often get paged at weird hours of the night, and if you have scripts you can just run, "let's run this while I actually wake up and see what is wrong," you can solve the problem faster. And finally, as a dev team, you define fallbacks for the key workflows, not all workflows: you say, here are the top ten workflows we serve, and we need fallbacks for each of those. That's level three.
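For teams that do reach this point, in-process fault injection can start very small: a kill switch, a tiny blast radius, and a failure that exercises exactly the fallbacks you built. The sketch below is hypothetical and deliberately minimal; dedicated tools (Chaos Monkey, Gremlin, and the like) do this at the infrastructure level.

```python
import os
import random

# Chaos is opt-in and rate-limited: an env-var kill switch plus a small
# blast radius, so an on-call can disable it instantly.
CHAOS_ENABLED = os.environ.get("CHAOS_ENABLED") == "1"
CHAOS_FAILURE_RATE = 0.01  # fail at most ~1% of calls


def maybe_inject_failure(path_name: str) -> None:
    """Randomly raise on a critical path when chaos mode is on."""
    if CHAOS_ENABLED and random.random() < CHAOS_FAILURE_RATE:
        raise RuntimeError(f"chaos: injected failure in {path_name}")


def get_homepage(user_id: str) -> dict:
    try:
        maybe_inject_failure("recommendations")
        rows = ["personalized-row"]          # stand-in for the real dependency call
        return {"rows": rows, "used_fallback": False}
    except Exception:
        # The injected failure should land on the same fallback a real one would.
        return {"rows": ["generic-row"], "used_fallback": True}
```

The point of running this in production is to verify, on real traffic, that the fallback rate rises by roughly the injection rate and nothing else moves.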
Level four is when the automation of failover testing starts: semi-automated failover capability, built with scripts. You have automated failure testing for all critical and non-critical dependencies within the code, running earlier and earlier in the CI system, so it's getting less expensive to add. By now you've built the muscle: you know exactly where things are, your on-calls practice failover regularly, and you're doing chaos testing. You can increase that to quarterly chaos game days, injecting failures randomly, because you now have confidence the application can handle it, and verifying that all the fallbacks work just as you planned. And you move to predictive auto-scaling policies based on what you've seen in your alerts. For example, if you see that every day at 10 a.m. traffic shoots up, you can put in a scaling policy that starts scaling at 9:00 or 9:30, depending on how much cold-start time you need. All of that is level four.

And finally, if you've done all of this, level five: you enable failover testing in the CD pipeline itself. Before you even fail over in prod, with some investment in infrastructure of course, you can find out whether the failover will work in a pre-prod environment, verifying your failover playbook early. You do automated fallback testing, meaning every single fallback in the code is executed, aiming for as much as 100% coverage to verify the fallbacks behave correctly. And you hold monthly chaos game days, which become an event; it could be part of a sprint retrospective or something like that, where the whole team sits together and says, let's try to fail this and see what happens. It becomes much more of a learning exercise, a confidence-building exercise, when you're at that level.
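As one concrete version of the level-four scaling example above (scale at 9:30 so you're warm before the 10 a.m. spike), here is a sketch using AWS Auto Scaling scheduled actions via boto3. The group name, sizes, and schedule are placeholders, and other clouds have equivalent scheduled-scaling APIs.

```python
# Sketch: pre-warm capacity ahead of a known 10 a.m. traffic spike, assuming
# an AWS Auto Scaling group. All names and sizes are illustrative.
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up at 09:30 (Recurrence is cron syntax, evaluated in UTC) so cold
# starts finish before the 10:00 peak.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="checkout-service-asg",
    ScheduledActionName="prewarm-before-morning-peak",
    Recurrence="30 9 * * *",
    MinSize=10,
    MaxSize=40,
    DesiredCapacity=20,
)

# Scale back down at night for a non-24x7 service.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="checkout-service-asg",
    ScheduledActionName="overnight-scale-down",
    Recurrence="0 22 * * *",
    MinSize=2,
    MaxSize=40,
    DesiredCapacity=2,
)
```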
So: five levels, each with a different level of effort. We've talked about cost for everything else; what about cost for this? If you follow it, this is not a linear model, it's a bit of a hump model. You start at level one with roughly one or two engineers for one week as a fixed cost, and very minimal recurring cost, because you're not doing much on a recurring basis. At level two, you continue with those one or two engineers, but extended to two or three weeks of effort, because they now need to add what isn't there, and the recurring cost might be one engineer for one day per quarter. Level three takes two or three engineers for a month, with a recurring cost of about two engineers for one day a month. Now, this is where it gets interesting. At level three you're already doing chaos testing, so there is a lot of confidence, knowledge, and expertise within the team. If you then go to level four, the entire team is practicing it by default, which means the recurring effort actually drops: it becomes part of the process, part of the product you build. And by the time you reach level five, it is the least expensive of all, but there is a hump to get over, and that hump is at level three. Keep that in mind when we talk about minimum viable resiliency going forward.

Support costs are obviously lowest at level five. As you can imagine, if you have lots of good fallbacks, your on-call work is automated, and you're practicing automated failover, you may need less than one engineer per week, because your on-call engineer isn't being woken up all the time. At level one, you might need three engineers every week just figuring out how to keep the system up; that is expensive. The same goes for revenue loss: you lose the least revenue at level five and the most at level one, which is why the chart runs from red to green. And obviously, the level of resiliency increases as you go from level one to five.

Finally, remember that at the start I said there are three requirements. The third is monitoring. We always need KPIs that tell us whether we're doing this right. The typical KPIs we've heard in the past are mean time to recover, mean time to detect, change failure rate, things like that. I'm going to change that a little and talk about something I call confidence metrics. These are measurable metrics. One is mean revenue loss per incident. All of us are familiar with production incidents and how they affect us; what is the amount of revenue lost per incident, are we tracking it, and is that number decreasing over time? Second is average customer impact duration. You could resolve your problem very quickly: you know this binary had a problem, you blue/green it, blue was live, you made a mistake, you bring green online. That is your own recovery time, but the customer may still be unable to make a payment or do what they want. So the average customer impact duration needs to be measured, and again it should decrease over time. Third, the mean time to enable fallbacks gracefully: if something initially fails and then recovers, what is that gap, and can we shorten it? Can we make sure that the moment there is a failure, a generic fallback is available so the customer's experience isn't lost or hurt? Once again, a decreasing metric. Fourth, variance in response time. If your latencies fluctuate wildly, and there is a vast difference between your P75 and your P99, you have a long tail to deal with. What is causing it? Verifying that and minimizing variance gives you the ability to say, I know how to predict the behavior of my product at different times of the day, the week, the month, the year. And the final one is the average time it takes to fail your application over. There are companies that can fail over their entire cloud footprint in less than six minutes. We can all aim for something like that, but it isn't strictly needed; everyone may have a different answer.
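To show how lightweight these confidence metrics can be to compute, here is a toy sketch with made-up numbers; the incident records and latency samples stand in for whatever your incident tracker and APM export.

```python
# Sketch: computing the confidence metrics from incident records and latency
# samples. All data shapes and values are invented for illustration; each of
# the incident metrics should trend downward over time.
import statistics

incidents = [
    # (revenue_lost_usd, customer_impact_minutes, minutes_until_fallback_on)
    (12_000, 42, 9),
    (3_500, 15, 4),
    (20_000, 75, 22),
]

mean_revenue_loss = statistics.mean(i[0] for i in incidents)
mean_impact_duration = statistics.mean(i[1] for i in incidents)
mean_time_to_fallback = statistics.mean(i[2] for i in incidents)

# Variance in response time: compare P75 to P99 to spot a long tail.
latencies_ms = [110, 120, 125, 130, 140, 150, 160, 400, 900, 1500]
pct = statistics.quantiles(latencies_ms, n=100)
p75, p99 = pct[74], pct[98]

print(f"mean revenue loss/incident: ${mean_revenue_loss:,.0f}")
print(f"mean customer impact:       {mean_impact_duration:.0f} min")
print(f"mean time to fallback:      {mean_time_to_fallback:.0f} min")
print(f"P75={p75:.0f}ms  P99={p99:.0f}ms  tail ratio={p99 / p75:.1f}x")
```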
But you can see there is a theme here, and the theme is maximizing trust. It's all focused on the customer. If we minimize the impact, we automatically maximize the amount of trust the customer can place in our solution. That's why I call these confidence metrics.

So what are the outcomes if we do all of this? Based on empirical data collected from my experience over time: it allows executives to make stepwise investments in resiliency, which means they don't need to commit all the money up front; they can say, let's get to level one, then level two, and so on. The team gets the ability to build the muscle gradually, instead of being told to do all of it at once, "let's get into chaos engineering," you get to dip your toe in and lower yourself into the water slowly. It is basically gamification: you can create internal rewards for teams going from level one to level five, so for those of you who are managers and directors, you can create incentives for teams to move up, and that's when investment begins. A customer-centric focus is needed, and this provides it. And we improve things like MTTD and MTTR automatically, organically, without having to target them as metrics; we can use customer-centric metrics instead and still get there.

So, minimum viable resiliency, as a conclusion. As I said, there is a hump at level three, and many of you may have rightly identified that level three is what I would suggest as minimum viable resiliency for anyone who wants to practice a resiliency model. That is where the investment has the most ROI. The learnings I want to share with you, and maybe for me as well: cloud workloads need resilient applications and services, and if you really want to protect customer trust, you have to test early and often, fail fast, and learn fast. And minimum viable resiliency, based on my experience working with hundreds, maybe thousands, of teams over the last decade, is level three.

And that's all I had. Let's answer a few questions. Manjunath, I think we have a little bit of time here; I'll answer a few questions and then we'll jump to the wrap-up. What do you think? Yes, yes.
Makes sense. I think it's a request rather than a question: Mohammed wants you to go back to the slide before the resilience one, just to absorb what's on it. I can do that, and I'll share the slides for sure, Mohammed, after this, so they'll be available to all of you. But yeah, any other questions or anything specific? I didn't see anything in the Q&A. Okay, sounds good. Yeah, it was awesome. Thank you so much for giving me the chance to share this. And thanks again for taking the time out. Thanks, Manjunath.