Hello everyone, good morning. Welcome to day two of KCD Chennai. I'm very happy to be here talking to you about how to build reliability in cloud native. I am Uma Mukkara, Head of Chaos Engineering at Harness. I am also a co-creator and maintainer of the LitmusChaos project, which is an incubating project at CNCF at the moment.

Before we begin, let me talk a little bit about where I work and what I do. Harness is a modern software delivery platform. It has various features like continuous integration, continuous deployment, feature flags, cloud cost management, security testing orchestration, service reliability management, and of course chaos engineering. Litmus is an end-to-end chaos engineering platform. It is a complete framework for building chaos engineering capability for DevOps in your cloud native organization.

So what are we talking about this morning? We are talking about how to grow in cloud native. Many of us are moving to cloud native, or are already well into it. But how do you make sure that you grow in cloud native confidently, without hiccups, and also efficiently, meaning that you spend the right amount of resources, keep the cost in check, and get the right reliability, or returns, within cloud native? To grow within cloud native efficiently and confidently, you need to have reliability in cloud native. So we will talk about cloud native a little bit, then what reliability is, and then what it means to have reliability in cloud native, which will help you grow within cloud native confidently.

Let's do a quick recap of what cloud native is. As we all know, cloud native usually means lots and lots of microservices. You are basically moving away from bare metal or virtualized environments into containers, where your focus is really a small piece of software; you write a good API around it, and it can run anywhere. So eventually your application is broken up into multiple microservices, and in cloud native you see these microservices being shipped out very fast. That's one of the benefits of moving to cloud native: your software ships fast. And of course, these microservices now run on any cloud wherever there is a Kubernetes abstraction. Many of the cloud providers offer a Kubernetes abstraction layer or a Kubernetes service, so you can expect your microservice to run almost everywhere: on clouds, or within enterprise boundaries wherever Kubernetes distributions are deployed.

So this, in summary, is cloud native. You have a heterogeneous environment, with multiple microservices in containers that are being shipped out faster. What's also new in cloud native compared to a legacy system is that you generally see more and more dynamism: things ship faster, there are many more components, and you also have lots of new personas, people bringing in capabilities unlike legacy system capabilities. These people are needed because you are bringing in new technologies; you are building new pipelines for your software delivery, operations, management, and so on. Eventually this also leads to the creation of new personas in your software delivery and management lifecycle. For example, the concept of an SRE, a site reliability engineer, is becoming more and more common nowadays in any cloud native system.
Cloud native developers are a class of new software developers who have to assume that their software is configurable declaratively, can run anywhere, and has to have a good API. So you need to be using a new set of tools to make sure all of these are satisfied. Basically, the cloud native development and DevOps ecosystem is a bit different, and we sometimes call this cloud native DevOps. That's also one thing to note about cloud native. As you can see here, there is a big change underway, which we call the move to cloud native, and you see products, services, and vendors participating in every area to make the cloud native journey successful. So it's happening, and CNCF is leading the entire change from the front.

So that's about cloud native. Let's also talk about what reliability is before we talk about reliability in cloud native. What do we really mean when we say reliability? Reliability is actually a kind of perspective. Everybody says that they are reliable, or their products are reliable, or their services are reliable. But then why are we talking about reliability with such importance? Reliability is a perspective: it's about how many nines you have in your service when you say your service is reliable. Is it four nines, five nines, six nines? It is always measured in terms of how many outages you had in the last year or so. When outages increase, you are less reliable; your reliability decreases. The way to increase the reliability of your service is really by reducing the outages. So outages are the point to be noted here.

Why is reliability so relevant now, in this modern digital era? Digital services are growing very fast. We are seeing a lot of businesses moving to internet-based traffic because of COVID. We have seen this digital transformation accelerate, and the traffic is probably 10 times more than what we were seeing a few years earlier. Reliability is really important because there is so much new traffic coming onto your digital systems, and when things are happening at such a rate, outages, if they happen at all, are going to be very expensive. As you can see from some examples here, however much you try, outages are kind of a common thing. It only matters how often, or how rarely, they happen. But they do happen, and when they happen, there are going to be either financial losses or reputational losses. As we can see, even popular products and services have gone through these outages, which of course are expensive to their owners.

The impact of outages can be of various types. There can be reputational damage, as we are seeing here in the Slack example. Sometimes they can cause huge financial losses as well: if you are under an SLA, or if you run a really critical business, even a bit of downtime can cause huge losses. And here is another example, from 2019. Of course, what also happens because of outages is poor user experience. Your users are in the digital age, and they can get really frustrated by these outages; bringing them back to your service or business is going to be a huge challenge. So the impact of outages is very difficult to quantify; it is usually bigger than you would assume, both short term and long term.
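As a quick aside (standard arithmetic, not from the slides), the "nines" mentioned earlier translate directly into an outage budget:

$$\text{allowed downtime per year} \approx (1 - \text{availability}) \times 525{,}960 \text{ minutes}$$

So four nines (99.99%) allows roughly $0.0001 \times 525{,}960 \approx 52.6$ minutes of downtime per year, while five nines allows only about 5.3 minutes; each extra nine cuts the outage budget by a factor of ten, which is why the expensive outages described above eat reliability so quickly.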
Now let's try to understand what causes these outages. There is not really one thing that causes them, but a set of things. Usually there can be application failures: for example, a disk gets filled up because of logging, or a write goes bad, and something breaks. These are characterized as application failures. We have also seen infrastructure failures causing outages. Most of the time there is redundancy in the infrastructure, but still, you never know; whenever there is a failure in infrastructure, things can start to fail somewhere else. Sometimes there are big infrastructure failures, like a whole region becoming unavailable. It's not very common, but it does happen and can happen. The other type of failure that can cause outages is operational failure. For example, capacity issues: you have not provisioned your service with enough capacity, and under high traffic things can go very bad, with the service not scaling the way you expect under those conditions or that traffic. And whenever an incident happens, how well do you manage it? Are you recovering fast, and is there auto-recovery for such a failure? These are all part of operational failures. Good incident management and monitoring are very important; when you don't have very good monitoring and ops systems, that itself can be a major factor in a longer outage. So these are some of the generally observed causes.

If these are the general reasons why outages happen, let's talk about what can cause an outage in a cloud native environment. Before we go there, let's remind ourselves of the typical nature of a cloud native stack. It's a kind of pyramid: as a developer, what you concentrate on is really the top of the pyramid, your cloud native app, which really means a container. Then there are a bunch of services your app requires to function; they can be message buses, databases, or other cloud native services, and Kubernetes itself. And then there is another layer, the platform on which Kubernetes runs. The important thing to note here is that a bunch of these are also microservices. It's not just your application or your container; the layers below, all the way down to Kubernetes, are all microservices based.

In microservices, especially on Kubernetes, the deletion of a pod is generally not seen as a real failure: whenever some pressure is put on the system, Kubernetes can just delete your pod and relaunch it. The architecture itself is like that. So this is actually a kind of failure, but Kubernetes is designed to withstand it. Whether your app or service is designed to withstand such a failure, and whether it will keep providing continuous service as expected, is the bigger question. So, in summary: there are a lot of microservices, and failures like the deletion of microservice-based pods are not at all uncommon. The other important thing to consider in cloud native is how fast everything is shipped out. Changes to your pyramid happen 10 times faster than they would in a legacy system. So you have continuous change of the stack happening all the time in cloud native, and pod deletes can happen anytime in your stack, including in your application. These are on top of the regular infrastructure failures, operational failures, and application failures.
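To make the pod-deletion point concrete, here is a minimal sketch of a Deployment built to withstand a pod delete. The names (checkout, the image, the port) are illustrative assumptions, not from the talk:

```yaml
# Hypothetical service designed to survive pod deletion.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-svc
spec:
  replicas: 3                 # redundancy: losing one pod leaves two serving
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0   # placeholder image
          readinessProbe:                   # traffic only routes to ready pods
            httpGet:
              path: /healthz
              port: 8080
---
# A PodDisruptionBudget limits how many pods voluntary disruptions may remove.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
```

With three replicas and a budget of two always available, deleting any single pod leaves the service answering traffic while Kubernetes relaunches the missing replica; the chaos tests discussed later verify exactly this assumption.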
So outages can happen more often if you are not taking care of them proactively in cloud native, and that's why we are talking about reliability as a separate subject in cloud native today. Just to summarize: reliability in cloud native is more important for two reasons. It is the combined effect of the proliferation of microservices and the faster shipping of these microservices into deployment or production.

So we talked about what reliability is and why it is so important in cloud native. We also talked about what a real cloud native environment looks like through the lens of reliability. Now let's talk about how you can plan to achieve this reliability in a systematic way. Most of the time, systems are deployed with redundancy; there is no single point of failure. Whether it's two or three or four physical systems or replicas, you always have this redundancy in mind when services are deployed. But outages still happen, because systems are very, very complex and are continuously in a state of flux. It is very difficult to define the state of the whole system at a given point in time. As traffic increases, the system state changes; the load and a lot of other things can be different from time to time. And when something goes wrong, does your redundancy come into the picture the way you expect? Most of the time, yes, it works, but it may not be the case all the time. That's the criticality of service that we are talking about.

The problem with some of the approaches we are seeing with respect to reliability is that they are not proactively managed. What that means is that most of the time they are reactive: when problems happen, you go and try to recover. They are not collaborative: you don't have systems in place to collaborate on failure and recovery. And the failure testing that teams do have is maybe ad hoc in nature; it is not well integrated into your CI/CD systems.

So what is the right approach to build reliability in cloud native? The right approach is what we call taking the chaos-first approach in your DevOps. What is the chaos-first approach? It is to introduce chaos engineering as a tool to systematically build the culture of improving reliability into your DevOps. Reliability is not one person's job; it is everybody's job in DevOps. You need a certain predefined approach while building the software, while delivering it, while deploying it, and then while managing it. The entire DevOps organization needs to look at chaos engineering as a tool that helps them build systematic improvements into reliability, making sure that reliability is always there. So what is chaos engineering?
Let's talk about that. Chaos engineering is the practice of breaking things on purpose: before failures happen on their own, you break things and verify that there is no system weakness, and even if there is a system weakness, you now go and fix it. You start with small faults, introduced into various layers and various systems, and keep proactively introducing these faults. You compare the result with the expected behavior: if it is the same, you are good; your system is resilient, at that moment, against such a failure. If it is not, then you have something to learn. And because you are the one introducing the failure, you are well prepared to deal with the outcome: you can either reverse it quickly or recover automatically. That is always better than a fault happening without your knowledge. Chaos engineering is always done in a controlled environment, you have the ability to control the blast radius, and you are proactively verifying your system against such faults, one by one. That is chaos engineering.

Chaos engineering is now emerging as an effective tool in cloud native to improve reliability in all areas of DevOps. In other words, chaos engineering has to become a kind of culture in DevOps: you introduce chaos tests in development, in CI pipelines, in your delivery mechanisms in CD, and in your ops, where you continuously validate your SLOs, service level objectives, against such failures. You also keep creating some random tests, like game days, and even before your code actually goes into the production environment, you can do this in pre-production. If you approach failure testing from all angles with good tools, chaos engineering becomes very effective, and it actually helps retain reliability even when your systems are dynamic and running at high scale.

So who can use chaos engineering? Chaos engineering is typically thought of as a supporting practice or tool for SREs or operations teams, but it is now increasingly being used by both QA teams and developers. Developers use chaos engineering practices to test their microservices code in various environments right at development time. Similarly, QA teams have their large test setups, and they can now test against various failures before qualifying a release completely. And of course, SREs use chaos engineering to perform manual game days, and once they are comfortable, they move to automated game days, where faults are introduced randomly and the system is checked for reliability. So basically, all personas can be involved in chaos engineering, and we are seeing all of them slowly taking up chaos engineering as a common practice in their day-to-day life.

Let me talk a little bit about one such tool. The tool that we work on is LitmusChaos. It is a complete chaos engineering platform to practice end-to-end chaos experiments, observe the after-effects of the experiments, and effectively manage the entire chaos experience on Kubernetes. It is a CNCF project now, and it has been in use and development for about four years. It is very stable: we have done more than 50 enterprise releases, it is used by many enterprise users, and you can see its adoption growing quarter over quarter. So it is a very stable project and feature complete for basic chaos engineering needs in cloud native.
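Before going into the architecture, here is a hedged sketch of what one such chaos experiment looks like as a Litmus ChaosEngine manifest, reusing the illustrative checkout app from the earlier sketch. The namespace, labels, service account, probe URL, and timing values are assumptions, and exact field names and units vary across Litmus versions:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos
  namespace: default
spec:
  engineState: active
  appinfo:                        # scope the blast radius to one application
    appns: default
    applabel: app=checkout        # assumed label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # assumed RBAC service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"         # run the fault for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"         # delete a pod every 10 seconds
            - name: PODS_AFFECTED_PERC
              value: "50"         # limit blast radius to half the pods
        probe:
          - name: checkout-availability
            type: httpProbe
            mode: Continuous      # validate the SLO throughout the chaos
            httpProbe/inputs:
              url: http://checkout.default.svc:8080/healthz  # assumed endpoint
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:        # units differ across Litmus versions
              probeTimeout: 5
              interval: 2
              retry: 1
```

The env values bound the blast radius (only half the pods, for 30 seconds), and the continuous httpProbe encodes the expected behavior: if the service stops answering 200s while pods are being deleted, the experiment verdict fails and you have found a weakness before your users did.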
At its core, Litmus has what we call the Chaos Center, a centralized platform where different personas can come and collaborate on chaos; this is what we call the chaos control plane. You have a bunch of chaos experiments available in public or in your own private chaos hubs. You use those experiments to build the chaos workflows, or scenarios, or chaos tests that make sense for your service or application, and then you can launch those chaos workflows, schedule them, or execute them against various infrastructures or systems, such as Kubernetes on various platforms, bare metal, your on-prem VMware, etc. That is how Litmus is structured.

In Litmus, as I just said, your teams collaborate around chaos tests, and these can be launched in various ways: you can fire them as a trigger on a certain event, you can schedule them weekly, or you can put them into your CI/CD pipelines. Developers use Litmus chaos workflows the same way they run other programs with kubectl: a developer can do a kubectl apply of a chaos workflow wherever they want. That gives a bigger capability to include this in various CI/CD pipelines, or in unit tests, integration tests, and so on.

So how do developers really use LitmusChaos and introduce chaos tests into their pipelines? They log into the Chaos Center, where they have a bunch of experiments available, and they use those experiments to create a new scenario that matches the application they are writing. At the end of it, you have a YAML file, a declarative file that describes your chaos scenario, and you push it into your Git; that is one part of it. Then you can inject, or call, that file through kubectl in a pipeline. When you do that, the full chaos execution happens: the Litmus execution plane gets spun up automatically, the chaos tests run, and the chaos results are pushed back into the Chaos Center. So the moment you introduce this chaos stage into the pipeline, it runs automatically, your chaos analytics are already available in the Chaos Center, and you can compare them over a period of time to see how your application has been improving, or continuously verify that you are not introducing a new bug before you merge your code into your repository; a sketch of such a pipeline stage follows below.

Another way developers are using this is through GitOps. In your pre-production environment, your apps get upgraded, and the moment an app is upgraded, a certain test can be run. So before your app or code goes into production, the pre-production environment can verify the new functionality against certain failures, and those failures can be triggered automatically without developers having to do anything. This capability helps developers make sure that no new bug is introduced at a larger scale, and that they are verifying their code against such real-life failures before it actually gets shipped.

Where do you start is another question. There is no specific place to start chaos tests: you can start in pipelines and move toward pre-production and production. Sometimes people start in pre-production; if there is an SRE who is interested in chaos engineering, they bring chaos engineering into the organization and then start introducing failures, or automating those failure tests, in the CD pipelines. So we are seeing chaos tests being done in all areas; there is no particular order for starting and progressing later on.
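As an illustration of the pipeline stage mentioned above (not from the talk), a chaos gate might look like this. The GitHub Actions syntax, manifest path, engine name, and fixed wait are all assumptions, and the runner is assumed to already have kubeconfig access to the target cluster:

```yaml
# Hypothetical CI job that injects a chaos scenario and gates the merge
# on the ChaosResult verdict.
jobs:
  chaos-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Inject the chaos scenario stored in Git
        run: kubectl apply -f chaos/pod-delete-engine.yaml
      - name: Gate the merge on the chaos verdict
        run: |
          # Crude fixed wait; a real pipeline would poll the ChaosResult.
          sleep 120
          verdict=$(kubectl get chaosresult checkout-chaos-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos verdict: $verdict"
          [ "$verdict" = "Pass" ] || exit 1
```

The idea is simply that the declarative scenario lives in Git, kubectl applies it like any other manifest, and the pipeline fails fast if the resilience hypothesis does not hold.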
This is how a good chaos maturity model looks in reality. You always start with the basic infrastructure: you introduce failures in the infrastructure and verify that your code or service is reliable against them, and it can be various combinations of failures even within the infrastructure. Slowly you move up, introducing failures related to the middle layer, which has message queues or API servers; then you go to your data services, or databases; and finally you start writing chaos tests that are specific to your application. All of this happens over a period of time. It can take a few quarters, sometimes a few years, before you have a full-fledged chaos engineering practice with all sorts of chaos experiments and chaos tests fully automated into your DevOps practices. Chaos is not a magic bullet: you have a good tool to start off with, but it primarily has to become a culture. That is when you have a good way to measure reliability, and you are measuring the chance of not introducing new failures into your software before it gets shipped.

Just to summarize the benefits of chaos engineering in your DevOps: introducing chaos engineering into your DevOps leads to three major benefits. First, you get the ability to identify failures quickly, called MTTI, the mean time to identify. This is a common problem in most systems: you know a failure is happening, but it is very difficult to reproduce. When you practice chaos engineering, failures are well rehearsed; there are a bunch of chaos tests already available, so a new failure scenario can be quickly reconstructed, and your time to identify a bug or a scenario directly reduces. Second, because you are reproducing or introducing these faults all the time, your capability to recover from a potential outage will be very high, and the recovery times will become shorter. And third, because you are now going deep and debugging, and your development systems are also practicing chaos engineering, failures in general will reduce, and that increases the distance between failures, the MTBF. This is the overall ROI from chaos engineering: when you decrease the time to identify and recover, you are basically reducing the number of outages, and then you are increasing reliability. When you do this as a practice, you are giving your customers confidence that the service is reliable, because you are already testing all the time against potential failures in various areas.

So, just to summarize: reliability should be a DevOps focus in cloud native, and introducing chaos engineering into your cloud native DevOps will result in high ROI. It is not a magic bullet; it yields structural improvements to your product and to your DevOps practices. Your teams will be better because they are aware of how the product behaves whenever failures do happen. They debug the system many, many times as these failures are introduced and weaknesses are found, so you end up with a much better engineering team that knows your product very, very well. So reliability is a kind of structural ladder here, when you follow the chaos engineering practice in DevOps.

That is mostly what I wanted to talk about. You can use Litmus, the open source version, from GitHub; it is very easy to start. You also have a readily available hosted version at litmuschaos.cloud. Give it a try: when you sign up, you get a control plane, you can connect your target plane, or an agent,
and then start running chaos. That's all I have. I can take some Q&A now in this session.