Hey, good morning, good afternoon, good evening, wherever you're joining us from. Welcome to KubeCon 2022. My name is Raja Vadheraju, enterprise architect with FIS, and here with me is Nilanjan Manna. Nilanjan, I'll let you introduce yourself.

Hey, everyone. Thanks for attending this session. I'm Nilanjan, a software engineer at Harness and a core contributor to the CNCF LitmusChaos project, where I'm responsible for developing chaos experiments for various types of failure scenarios and the requisite tools for injecting chaos into various environments. I'm really excited to be here with all of you. Over to you, Raj.

All right. So today we're going to talk about chaos engineering applied to the fintech domain. Before we jump into chaos engineering, let's spend some time on the fintech domain and what it is.

Here's a problem statement: non-banking services companies would like to offer banking services and products. An example could be a cab company such as Uber, which provides a digital platform to its drivers and passengers, and would like to offer auto loans. Why would they do that? For obvious reasons: they want to increase their revenue streams, and they want to serve their loyal customers and make them happy by providing competitive interest rates.

So why do they have to look at fintech providers; why can't they do it themselves? Again, for some obvious reasons: (a) Uber's primary business is not banking; (b) because of that, they don't have a banking ecosystem; and (c) they don't have a banking license. And even if they did, they would have to go through a lot of the compliance and regulation of the banking sector, which is too much for them. This is where FIS helps fintech companies, or fintech consumers: it exposes banking services in the form of APIs, and these fintech consumers consume those APIs, build innovative products, and provide a great customer experience to their end users. It's a win-win for everyone.

So how does a typical fintech technical architecture look? It would be something along these lines. Here I want to convey a couple of things: one, the complexity of the architecture, and two, how that complexity motivates chaos engineering.

The complexity of this architecture is not intentional; it evolved over a period of time, just like any other organization's. For the last three decades or so, FIS has been building products in financial services, on all sorts of technologies: mainframe, Windows-based web applications, and so on. Right now we are in the cloud-native computing era, where architectures are built on microservices, API gateways, Kafka, and everything in between. But these modern architectures need to integrate with the legacy architecture, because certain capabilities are still provided by the legacy systems while FIS goes through its digital transformation phase.

So in this context, you have to understand this is a classic distributed systems architecture, and distributed systems architecture brings its own challenges: challenges around resiliency, where things could go wrong anywhere in the system. What do I mean by things could go wrong? There could be a network partition between systems. There could be a service disruption.
There could be a fault happening somewhere in the stack. This is where chaos engineering comes into play: it helps identify those issues and resolve them, thus providing a superior customer experience.

So what is chaos engineering? It is the discipline, or practice, of experimenting on a system in order to find out how resilient it is, meaning whether the application can withstand turbulent or faulty conditions. What do I mean by that? Let's work with an example. There could be network latency, or latency in general, between two microservices in your stack. If there's latency on the downstream system, how is the upstream system impacted? You want to understand how resilient your application, and your overall stack, is.

Just like any other engineering practice, the chaos engineering practice has three pillars, depicted here: people, process, and ecosystem. For the next few minutes I'm going to double-click on the ecosystem, and a little bit on the process, and explain the journey that FIS went through, the thought process behind it, and how we put together this ecosystem. Because a chaos tool alone is not enough to implement a chaos engineering practice; you need to envision and think about the whole ecosystem.

So here is the ecosystem that we have put together. Why have we done this? One, we want to find the toil within the chaos engineering practice, automate it, and make it a repeatable process. Second, we need to scale it. Why do we need to scale it? Because an organization like FIS has multiple hundreds of products, and we need to build an ecosystem, tools, and processes so that we can apply this consistently and repeatably across many products. That's how we scale it. And we want to automate it so that we can implement chaos engineering for every release that goes into production. Because chaos engineering itself brings in some additional work, we want to identify the toil around it and automate it, so that it is an easy, consistent, and repeatable process for the application teams.

As you can see, there are four or five pieces in this ecosystem: on top, your CI/CD pipeline; then the application under test; load generators; APM tools; and the chaos tool itself, plus Keptn. I'm going to talk about each one of them here.

A precursor to executing the chaos test is generating load on the application under test. Why do we want to do that? Because most of the issues that happen in production happen under load. That's why we want to generate some load on the application we are injecting chaos into, and that's where the load generators help.
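As a rough illustration (not our actual setup), a load generator can run in-cluster as a simple Kubernetes Job. The image, namespace, and target URL below are hypothetical placeholders; any containerized load tool works the same way.

```yaml
# Hypothetical load-generator Job: drives steady traffic at the
# application under test across the whole chaos window.
apiVersion: batch/v1
kind: Job
metadata:
  name: chaos-load-generator
  namespace: fintech
spec:
  template:
    spec:
      containers:
        - name: load
          image: williamyeh/hey          # any containerized load tool would do
          args:
            - "-z"                       # run for a fixed duration...
            - "20m"                      # ...covering pre-, during-, and post-chaos
            - "-q"                       # rate limit per worker (requests/sec)
            - "10"
            - "http://payments.fintech.svc.cluster.local:8080/api/accounts"
      restartPolicy: Never
```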
Okay, that's great. You have the load generator generating load on the application and the chaos tool injecting chaos through its agent into the application. But we also want to monitor the health of the application while we're injecting the chaos. What do we want to monitor? You have several metrics up and down the stack: metrics at the application level, at the process level, and at the host level. At the application level, that's things like response time, throughput, and error rate. At the process level, that's process CPU and memory utilization; if it's a Java, JVM-based, or JBoss application, you have thread pools and connection pools, plus JVM-level metrics such as garbage collection. And at the host level you have host CPU, memory, and network and disk I/O. Those are the kinds of metrics you want to monitor. To monitor them you need tools like Dynatrace, Splunk, and Prometheus, which is why they are a critical part of this ecosystem.

Now, these tools are out there collecting the metrics, you are executing the chaos experiment, and you want to measure the chaos experiment. Meaning: when you introduce a network latency, you expect, say, 2% errors, but you are seeing 10% errors. Today you do that evaluation manually, without any automation. But tools such as Keptn help automate these evaluations. You define your SLIs (in the example I just mentioned, error rate) and your SLOs (not more than, say, 10%). You codify that in a YAML file and put it in Bitbucket, pretty much a GitOps model, and you integrate Keptn with the APM tools down below.

Now when you execute a chaos test, you tell Keptn: "Hey Keptn, I executed this chaos test for this period; go evaluate these five minutes, or these ten minutes." Keptn pulls those SLIs and SLOs, pulls the metrics from Dynatrace, evaluates them, and tells you: "For this test, you said a 10% error rate is acceptable because you introduced chaos, but I saw a 30% error rate, so your chaos test failed." Then you dig further to understand why there are so many errors, for example whether introducing latency in one API caused errors in another API. That triaging of the evaluation you still have to do yourself. But what Keptn helps you with is automating the evaluation process, so a bunch of toil is eliminated.

Now, this is all good, and if you want to automate this further, that's where CI/CD helps. What CI/CD can do is help you define a workflow around the chaos engineering ecosystem: trigger the load generator through Jenkins or your CI/CD pipeline, then trigger the chaos experiments, and finally trigger the evaluations that talk to Keptn, which gives you a boolean value, pass or fail. Using that, you can decide whether or not to move the software to the next environment, or to production, or wherever you want.

So this is the kind of ecosystem we have built to (a) automate the toil and (b) scale this across the board by implementing it consistently and in a repeatable manner.
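To give a feel for what that SLO YAML can look like, here is a minimal sketch following Keptn's SLO file format for the error-rate example above. The indicator name and thresholds are illustrative; a matching SLI file would map error_rate to a concrete query in the APM tool (for example, a Dynatrace metrics selector).

```yaml
# slo.yaml: minimal Keptn SLO sketch for the chaos window.
# "error_rate" must be defined as an SLI in the corresponding
# sli.yaml, mapped to the APM tool's query language.
spec_version: "1.0"
comparison:
  compare_with: "single_result"
objectives:
  - sli: error_rate
    pass:              # the evaluation passes if...
      - criteria:
          - "<=10"     # ...the error rate stays at or below 10%
total_score:
  pass: "90%"
```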
So now I'm handing the floor to Nilanjan, who is going to spend some time on the LitmusChaos tool itself. Nilanjan, the floor is yours.

Litmus is a tool set for doing cloud-native chaos engineering. It helps both developers and SREs automate chaos experiments at different stages within the DevOps pipeline, like development, during CI/CD, and in production, which leads to increased resiliency of the system. It adopts a Kubernetes-native approach to define the chaos intent in a declarative manner via custom resources, using the operator pattern and relying on custom resource definitions to define the experiments. On top of that, Litmus provides the ChaosCenter dashboard, where one can orchestrate chaos and visualize the results and metrics of the conducted chaos experiments, as well as Litmus APIs for programmatically injecting and managing the chaos.

At a high level, LitmusChaos uses CRs to define the chaos intent and manages the chaos orchestration via operators. There are different CRs, such as ChaosExperiment, ChaosEngine, and ChaosResult, and the chaos operator reconciles the ChaosEngine CR in the course of a chaos experiment execution. Given that a service account with sufficient RBAC permissions for the experiment has already been defined, one needs to define the ChaosExperiment CR manifest, which specifies low-level experiment attributes such as experiment tunables, container images, etc., and the ChaosEngine, which binds the experiment to an application instance and defines how to perform the chaos experiment: mounting volumes for the experiment pods, overriding experiment tunables, retaining or deleting experiment pods post-chaos, and so on.

Upon the creation of both these resources, the chaos operator reconciles the ChaosEngine CR to create the chaos runner, which consumes the ChaosExperiment CR data and creates the Kubernetes jobs that spin up the requisite experiment pods for running the experiment logic. In the end, the job also updates the ChaosResult CR, which summarizes the result of the experiment runs, and updates the ChaosEngine to bring the chaos execution to an end. You can schedule chaos experiments to run later, export metrics of the experiments via a Prometheus exporter, and get chaos execution events from the ChaosEngine and ChaosResult during the various phases of execution.
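To make that declarative intent concrete, here is a minimal ChaosEngine sketch that binds a network-latency experiment to a target application. The namespace, labels, and service account names are hypothetical placeholders, and only a subset of the experiment tunables is shown.

```yaml
# Minimal ChaosEngine sketch: binds the pod-network-latency
# experiment to a target deployment and tunes the fault.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payments-network-chaos
  namespace: fintech
spec:
  appinfo:
    appns: fintech
    applabel: "app=payments"       # selects the application instance
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-network-latency-sa   # granular RBAC per experiment
  jobCleanUpPolicy: delete         # retain or delete experiment pods post-chaos
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY        # injected latency in ms
              value: "2000"
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "300"
```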
The unique value proposition offered by Litmus includes, first of all, cloud-native chaos experiments that allow you to validate your entire application stack, be it cloud-native Kubernetes resources, cloud infrastructure, or even legacy infrastructure, with the broad range of experiments that Litmus offers. Second, least-privilege-principle chaos injection, which allows for chaos engineering in security-sensitive environments using granular RBAC over individual experiments and just-in-time execution of privileged containers, limiting abuse and misuse of target environments. Third, it provides declarative pre-checks and hypothesis validation, leveraged using Litmus probes, which can validate experiment steady-state conditions with little to no programmatic intervention, in a simple and declarative manner. Fourth, we have conditional auto-stopping of chaos injection, one of the features of Litmus probes, where the failure of a probe's condition check leads to a safe abort of the chaos experiment, preventing any harm to the target resources. Fifth, it provides custom chaos recovery actions, which can be defined as part of the chaos scenario in a declarative manner to introduce custom recovery steps that execute conditionally. Sixth, it provides declarative custom tasks, which aid in running tasks that simulate real-life conditions for the chaos execution; for example, running a load generator to simulate network traffic. And finally, quantification of system resiliency, using weighted experiments and probe success criteria to express the resilience of your system as a metric score.

Now let us take a look at how some of the LitmusChaos experiments can be leveraged in real-life situations to evaluate system resiliency.

Thank you, Nilanjan. Thanks for that nice overview of the Litmus tool. So let's dive a little deeper into the experiments that LitmusChaos provides. Here I want to focus on how you measure the experiment. Just like any test, you want a Litmus experiment to be measured. Litmus offers a wide variety of experiments; I took three as a sample, and I want to show how we measure each one when we execute it.

For example, when we execute pod-http-latency, we measure thread pool and connection pool utilization, error rate, and throughput. Why? Again, back to that microservice A and microservice B example: if we introduce latency between A and B, we want to understand how the resources on A are impacted.

For pod-memory-hog, you introduce a memory-saturation type of event, and in a microservices or Kubernetes architecture the pod is killed with an OOM-kill event. We want to understand whether the APM systems alert on that event, what happens to the response time of the APIs exposed on that pod, and whether the memory hog or OOM kill of one particular pod causes any wider impact. That's how it helps you understand the broader stability and resiliency of your application.

For the HTTP status code experiment: say you inject 500 errors on a microservice running on a pod in, for instance, an account-creation workflow. What you want to understand is how those 500 errors impact your business metrics, such as the account creation rate, and how resilient your overall application stack is.
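The Litmus probes mentioned earlier are one way to encode such hypotheses declaratively, right inside the ChaosEngine's experiment spec. Here is a minimal httpProbe sketch; the health endpoint is a hypothetical placeholder, and the exact units of the run properties vary across Litmus versions.

```yaml
# Hypothetical httpProbe fragment (under an experiment's spec in the
# ChaosEngine): continuously verifies that the account-creation service
# keeps answering 200 while chaos runs; a failed check can safely abort
# the experiment (conditional auto-stopping).
probe:
  - name: check-account-api
    type: httpProbe
    httpProbe/inputs:
      url: http://accounts.fintech.svc.cluster.local:8080/health
      method:
        get:
          criteria: "=="
          responseCode: "200"
    mode: Continuous            # evaluated throughout the chaos window
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      stopOnFailure: true       # abort the chaos safely if the probe fails
```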
So earlier I talked about Keptn doing a chaos evaluation. What does a chaos evaluation look like? The way we envision it, a chaos evaluation happens in a three-phase manner: a pre-chaos phase, a during-chaos phase, and a post-chaos phase.

The pre-chaos phase you can think of as a steady state. During the steady state you define the metrics you want to measure and their thresholds. For example, you want to measure throughput, error rate, response time, CPU, memory, connection pool utilization, and thread pool utilization. Say for response time you want three seconds, for throughput a hundred transactions per second, and an error rate of less than 2%. Now you execute the test in steady state for five minutes, as depicted here, and you tell Keptn: "Hey Keptn, here is my SLO and SLI definition, and here's my five-minute window. I executed the test; go evaluate it." Keptn comes back saying it's all looking good, green; the steady state is good.

Now you introduce chaos, for example a network latency. Typically, when you introduce latency, the throughput goes down and the error rate goes up. But your hypothesis is that the error rate should be impacted only on the one API where you introduced the latency, and it might cause, say, a 5% error rate. You feed the SLIs you want to evaluate, and their objectives, or thresholds, again in this YAML file, and you tell Keptn: "Hey Keptn, I executed this chaos; my chaos phase is between the tenth minute and the fifteenth minute, and here is my SLI and SLO definition for it; go evaluate." Keptn evaluates and comes back with pass or fail: within the bounds is a pass; outside the bounds is a fail. A fail leads to further investigation, which helps you find out why there are more errors, or whether there are other issues. Why did the connection pool utilization go up during the chaos and saturate? You expected it to go up to maybe 50%, but it went to 100%. Or CPU: you expected it to go up by 5%, but it went up by 50%. Those are the things Keptn can evaluate automatically, giving you a sort of heat map and a pass/fail result.

And post-chaos, you define your metrics very similarly to your steady state. The post phase is more of a recovery phase: you remove your experiment at the fifteenth minute, and from minute 15 to minute 20 you want to understand whether the application metrics are back to normal. So your SLIs and SLOs are very similar to what you had in steady state, and you tell Keptn: "Hey Keptn, go evaluate my post phase; these are my SLOs and SLIs." Keptn evaluates and comes back: if everything is within the bounds, pass; otherwise it marks it fail. That's the power and automation that Keptn brings to evaluating your chaos experiments.

So finally, I'll conclude the talk here, and I'll leave the session with a thought: usually chaos engineering is looked at as more of an SRE problem, but it can also help you validate your architecture and measure business metrics, which helps the business teams, ultimately providing greater customer experience and product quality. And through the architecture validation, it obviously helps with stability, mean time to detect, and mean time to resolve types of issues, which helps the SREs. With that, I open the floor for Q&A. Thank you very much.