on using GitOps to increase system resiliency with LitmusChaos. This is Saranya. I work at Harness as a Senior Software Engineer, and I am one of the core team members of LitmusChaos, which is now a CNCF incubating project. I've been contributing to it for around two years, and it has been a great journey so far. Now Amit will introduce himself.
Hey everyone, I'm Amit Kumar Das. I'm a Senior Software Engineer at Harness and a core contributor to LitmusChaos. I've been contributing to the project for the past two years, and I'm very excited to be a part of KCD Chennai. That's pretty much it about me. Looking forward to it. Thank you.
Thank you, Amit. Today's agenda: first we'll talk about chaos engineering and why it is required. Then we'll introduce LitmusChaos and talk about its core components and features. Then Amit will take us through GitOps and give a small demo of how you can use GitOps to increase system resiliency with Litmus. So without further ado, let's get started.
To start off, we need to know what resilience is. Resilience is the system's ability to sustain a fault and bring itself back up. For example, say a pod gets evicted from a node. What is its state? Is it healthy or not? Does it bring itself back up? If it does, it is resilient, and the time it takes to go from down to back up reflects that resilience. The same applies to node failures and memory leaks.
Talking about downtime: downtime is expensive, not just in terms of money but in other respects as well, such as loss of customer confidence, damage to brand integrity, and loss of productivity and employee morale.
Considering all these aspects, we definitely want to avoid downtime at any cost, and one way to do that is by adopting the practice of chaos engineering. Chaos engineering is the process of testing a distributed computing system by injecting faults intentionally. The goal is to identify weaknesses in the application through controlled experiments, to check whether it can withstand unexpected situations or not.
So how is it done? First of all, you identify the steady-state conditions. Steady-state conditions are the desired behavior of the application in a given scenario when it is healthy. Once you have identified them, you introduce a fault in the application, then you check whether the steady-state conditions are still met. If yes, the application is resilient. If not, you can go and fix it, and if a similar case happens in production, you are already covered.
Next, the foundations of cloud native chaos engineering and how it can be practiced effectively. The cloud native definition itself includes some mandatory principles: it requires declarative configs, and being scalable, flexible, and supportive of cross-cloud. These are some of the main principles we have been following for the past few years. Cloud native communities and technologies revolve around open source, and a chaos engineering framework being open source gives it the benefit of community help to get better and add more and more features. The chaos experiments need to be simple to use, highly flexible, and highly tunable, with little or no chance of false positives or false negatives. Then, with more and more people getting involved in chaos engineering, changes happen very frequently as requirements are altered.
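The identify, inject, verify loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Litmus implements it: `run_experiment` and `ToyService` are hypothetical names, and the "service" is an in-memory stand-in for a pod rather than anything running on Kubernetes.

```python
from typing import Callable


def run_experiment(is_steady: Callable[[], bool],
                   inject_fault: Callable[[], None]) -> str:
    """Check steady state, inject a fault, then check steady state again."""
    if not is_steady():
        return "aborted"  # never inject into an already unhealthy system
    inject_fault()
    return "resilient" if is_steady() else "weakness found"


class ToyService:
    """Hypothetical in-memory stand-in for a pod; not a Litmus or Kubernetes API."""

    def __init__(self, self_healing: bool) -> None:
        self.self_healing = self_healing
        self.running = True

    def kill(self) -> None:
        # The fault: stop the service. A self-healing one is restarted at once.
        self.running = self.self_healing

    def healthy(self) -> bool:
        return self.running


healed = ToyService(self_healing=True)
fragile = ToyService(self_healing=False)
print(run_experiment(healed.healthy, healed.kill))
print(run_experiment(fragile.healthy, fragile.kill))
```

In a real experiment the steady-state check would be a probe against the application (for example an HTTP health check), and recovery would take time rather than happening instantly.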
So it becomes very important for the chaos engineering framework to enable proper management of the chaos experiments, and to do that in the Kubernetes way. Then, as you practice chaos engineering more and start fixing the little issues that come up, more and more chaos scenarios come into the picture, and gradually the set becomes very large and comprehensive. These chaos scenarios need to be automated and triggered whenever changes are made either in the application or in the chaos experiments, and tools around GitOps are one way to achieve that. This is the section we'll be talking about in detail and giving you all the demo on. And lastly there's open observability, which is also one of the principles. Introducing chaos engineering should not require any new observability system. The existing ones should fit in perfectly.
With that, I would like to introduce LitmusChaos, which is an open source, cloud native chaos engineering framework. It also has cross-cloud support, it is currently a CNCF incubating project, and it has adoption across several organizations.
Talking about the features that ChaosCenter provides, starting with chaos workflows. A chaos workflow is a collection of several chaos experiments, which can be combined either sequentially or in parallel, in any manner. These workflows can be created using custom templates that you upload, or you can create your own custom workflows from ChaosHub, which is the place where all the chaos experiments are present, and you can choose from there. Some predefined workflows are also there, and you can choose from them as well. Then you can schedule your workflows either as recurring ones, the cron workflows, or as a single one-time workflow.
Then lastly, you can attach a priority to each of the experiments in a particular workflow according to your own requirements. Workflow management is a section we'll talk about a bit later. Litmus also allows you to add your own image from your own custom image server, which can be public or private. Once the chaos injection is done, you can measure and analyze the resilience score of each workflow. You can analyze how your application performed in that particular chaos workflow.
Litmus also supports multi-tenancy, which means you can create your own team and invite other users to it with viewer or editor permissions. It has fine-grained role-based access controls, which give the necessary privileges to the users. Scope support is also there: you can install it in namespace or cluster scope. And authentication is there. You can choose to have local authentication or the OAuth one.
Coming to monitoring and observability: you can connect your own data source and monitor the workflows, and there are graphs where you can visualize the workflow run statistics or the schedule statistics. Once the workflows are running or have completed execution, you can compare how two or more workflows performed. In case you do not like the interleaved dashboard that is present, you can upload your own dashboards from the ones available in the community. You can edit them and tune the dashboards according to your own requirements. And lastly, you can monitor the chaos in real time with the interleaved events and metrics from the Prometheus data source.
With LitmusChaos, you can target not only Kubernetes applications; you can also inject chaos on infrastructure resources, such as bare metal or machines.
Lastly, GitOps for chaos.
It basically integrates with any Git-based source control manager to provide a single source of truth, provided that you have enabled GitOps. Once GitOps is enabled, it kind of switches off MongoDB as the data source, and Git then acts as the single source of truth. It is also bidirectional in nature. All the workflows are stored in the Git source, so if any change happens either in ChaosCenter or in the Git source, both of them will automatically sync. It also provides an event tracker as a microservice, which launches the subscribed workflows automatically if there's any change in the application, such as an upgrade. So that's that. Now Amit will talk about GitOps in more detail and give a demo. So, over to you, Amit. Thank you.
Thanks, Saranya. Before moving on to the demo, I'll talk about GitOps and why we need chaos engineering with GitOps. GitOps is basically an operational framework which uses Git as a single source of truth, and any change in the code or in the Git repository needs to be fully synced with the cloud infrastructure of the organization. It follows the principle of infrastructure as code, where the infrastructure is managed and provisioned through code rather than through manual processes.
Now, moving on to the main question: why do we need chaos engineering with GitOps? Chaos engineering with GitOps enables a vast scope of automation with CI/CD pipelines. Currently, chaos engineering is performed in a closed environment or in a pre-production stage, but what if we enable chaos engineering in the CI/CD stage? That would make faults known to developers before a change goes to the pre-production stage. GitOps also has some advantages of its own. The first is increased productivity.
Developers are more focused on development rather than on the CI/CD of the infrastructure, and it reduces the mean time to deployment. The second point is high reliability. GitOps is considered one of the best practices because it reduces the mean time to recovery: if we have any fault, we can simply roll back to a previous stable version. The third point is better security. Git is a very secure platform because of its strong cryptography, and the ability to sign your changes provides proof of ownership of a change to the source code. And it improves auditing. Since GitOps uses Git, we can keep track of the audit logs and know about any change going into the Git repository.
Now, moving on to the demo. I have set up the Litmus ChaosCenter. For this demo, I have installed ChaosCenter on GKE, and along with it I'll be using two cloud native applications, which are the Bank of Anthos application and the Online Boutique application. The Bank of Anthos application is a banking application, and we can perform a lot of operations like sending a payment or making a deposit. Similarly, the Online Boutique application is an e-commerce application. You can see a lot of products listed here, we have a catalog, we have functionality to change the pricing according to different currencies, and we have a cart option here. So we'll be performing some chaos engineering on these two applications.
For this, I'll be using ChaosCenter. Enabling the GitOps functionality of ChaosCenter is very simple to do. In the settings, we have a tab named GitOps. Simply select the Git repository option, and I'll provide a Git repository. Moving here, this is an empty repository which I've created for this demo.
To connect this Git repository, I'll use the repository link and the branch where I'll be pushing all my changes, which is the main branch. We can choose between two authentication methods, access token and SSH. I have my access token with me, so I'll use it. Don't worry, I'll delete my access token later. I'll just click connect. It will take a few seconds. So we have successfully enabled GitOps for our project, and to verify that, we can go to the Git repository again. If I refresh this, I should see a litmus directory being created, and the directory structure shows me the project ID here. This 205ED is actually my project ID. We can also verify it from here, where the project ID is shown as well. So we have successfully configured GitOps within our application.
Now we'll start to do some chaos engineering, and let's get started with the Bank of Anthos application. I have deployed this application along with all its services in the namespace called bank. Here we can see a lot of services, like balance reader, contacts, load generator, and transaction history. What I'll try to do is delete the transaction history pod, which shows me all the transaction history within this application.
So I'll schedule a workflow. I'll click on the self agent. Here we have four options to create a chaos workflow: we can run a predefined workflow, we can clone an existing chaos workflow, we can use ChaosHub, which is a marketplace of all the chaos experiments, or we can import a workflow manifest YAML. For now, I'll just use ChaosHub. I'll click next and provide a name here: delete the transaction pod. I'll click next and add the pod delete experiment. Pod delete, yeah, here it is.
To target the pod, I have to select the namespace, which is this bank namespace, and we have the transaction history label here, so I'll select this one. For the time being, I'll not add any probes. I'll just continue to tune the experiment. Here I can provide different environment variables to my experiment. For this experiment, I'll set the total chaos duration to 60 and the chaos interval to 30 seconds. Now I'll finish up all my changes, and I'll turn off this revert schedule option, since I want to see the logs and other details of my workflow. I'll click next, and I can select the weights of the experiment here. I'll select the schedule now option and verify all my changes. It's the delete transaction pod workflow, and I'll check that the labels are correct over here: the namespace is bank and the label is transaction history. And I'll finish my changes here.
So we can see that the workflow has started. If I click here, I'll get an Argo graph which shows the live changes taking place in the workflow. And interestingly, if I go to my Git repository and do a refresh, I'll see that this workflow manifest is also here. So any change which happens in this workflow will also be reflected in my Git repository.
Let's wait a few seconds or a few minutes for the workflow to complete. Meanwhile, we can observe the chaos happening in this Bank of Anthos application. We can see that the pod delete experiment pods have just started up. If we go to the litmus namespace, I can confirm that the pod delete runner has just started and the transaction pod is actually terminating. So if I refresh this page, I should see that this service is under chaos and we don't have any data related to the transaction history. Once the transaction history pod is back in its running state, we should see the details over here. Let me refresh this page again. It's still under chaos.
Once the workflow is finished and this service is in the running state, we should get the details. So let's wait for a while. Yeah, we can see that the workflow has completed and the pod delete experiment has also run successfully. So we'll go back to the Bank of Anthos application and just refresh, and we can see that the transaction history is now available. To cross-verify this, we can also see that the transaction history pod is back and running. So we have induced chaos on the transaction history service of the Bank of Anthos application.
Currently, in this manifest, we can see that the chaos duration was 60 and the chaos interval was 30 seconds. What if I need to change these environment variables? Instead of creating a completely new workflow, I can go to my Git repository and simply update the values in my workflow manifest. So let me go here and change the environment variables. I'll change the chaos duration to 100 and the chaos interval to 50 seconds, and now I'll commit these changes. So in our Git repository we have made the required changes, and it will take a few minutes for them to get synced with ChaosCenter. Let's wait a few minutes over here.
If I refresh the page and load the manifest again, I can see that previously the values were 60 and 30 seconds, but since I've changed the environment variables, the new values can be seen here. These are the updated values which I provided in my Git repository: the total chaos duration is 100 and the chaos interval is 50, and these changes are now available in my ChaosCenter. To run this workflow, I just have to do a quick rerun, and the same workflow will start with the updated values, which we can cross-verify from our manifest. We can see that the chaos duration value is 100 and the chaos interval value is 50.
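The manual edit above, changing two tunables in the manifest stored in Git, could also be scripted before committing. The following is a minimal sketch under several assumptions: the helper `set_tunable` is hypothetical, the sample text is a simplified fragment rather than a full workflow manifest, the variable names `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` are assumed to be how the "total chaos duration" and "chaos interval" appear in the manifest, and a plain text substitution is used in place of a real YAML parser.

```python
import re


def set_tunable(manifest_text: str, name: str, value: str) -> str:
    """Return manifest_text with the value of the env entry `name` replaced.

    Assumes the entry is written as a `- name: X` line directly followed by
    a `value: Y` line, as in the sample below.
    """
    pattern = re.compile(
        r'(-[ \t]*name:[ \t]*' + re.escape(name) + r'[ \t]*\n[ \t]*value:[ \t]*)'
        r'(["\']?)[^"\'\n]*\2'
    )
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{value}"', manifest_text
    )
    if count == 0:
        raise KeyError(f"{name} not found in manifest")
    return new_text


# Simplified, hypothetical fragment of an experiment's env section.
SAMPLE = """\
env:
  - name: TOTAL_CHAOS_DURATION
    value: "60"
  - name: CHAOS_INTERVAL
    value: "30"
"""

updated = set_tunable(SAMPLE, "TOTAL_CHAOS_DURATION", "100")
updated = set_tunable(updated, "CHAOS_INTERVAL", "50")
print(updated)
```

After the updated file is committed and pushed, the sync back to ChaosCenter happens as shown in the demo; the script only prepares the file.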
So all the changes from my Git repository as well as from my ChaosCenter are synced together. And that's basically it. Apart from that, you may want to add changes through other methods, for example through a pull request. We can also do that. For that, let me go ahead and create a new branch, this test branch. In this test branch, I'll add a file, a new chaos workflow that I have already created. This is the delete catalog workflow, and it will target the Online Boutique shop. Here we can see the namespace is shop and the app label is product catalog service. So instead of configuring it from ChaosCenter itself, I'll just add a new file over here. I'll choose upload a file and drag and drop this file over here, and I'll provide a workflow name for it. One thing we need to keep in mind is that the workflow name must be the same as the file name. Now I'll just commit the changes.
From this branch, I'll make a pull request to the main branch, where all our changes are being sent. Let me compare and cross-check that the manifest is uploaded. Yeah, it has been uploaded, the delete catalog service workflow. I'll raise a PR to the main branch and create the pull request, and the pull request has been successfully created. Once I merge these changes into my main branch, we should see a schedule getting created over here, as well as the workflow getting started, since it's a one-time workflow. So let me merge this pull request. The pull request is successfully merged, and in the main branch of my Git repository I can see that this catalog service from PR workflow has been added. Let's wait a few minutes for the changes from Git to get synced with ChaosCenter.
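The rule mentioned above, that the workflow name must be the same as the file name, is easy to check automatically before raising a pull request. Below is a rough sketch: `workflow_name` and `name_matches_file` are hypothetical helpers, the workflow name is assumed to live under `metadata.name` in the manifest, the path and project ID are made up, and a simple regular expression stands in for a real YAML parser.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def workflow_name(manifest_text: str) -> Optional[str]:
    """Return the first `name:` found in the indented block under `metadata:`."""
    match = re.search(
        r'^metadata:[ \t]*\n(?:[ \t]+.*\n)*?[ \t]+name:[ \t]*(\S+)',
        manifest_text,
        re.MULTILINE,
    )
    if match is None:
        return None
    return match.group(1).strip('"\'')


def name_matches_file(path: str, manifest_text: str) -> bool:
    """True when the file name without its extension equals the workflow name."""
    return PurePosixPath(path).stem == workflow_name(manifest_text)


# Hypothetical manifest fragment and repository path.
MANIFEST = """\
kind: Workflow
metadata:
  namespace: litmus
  name: delete-catalog-service-from-pr
spec: {}
"""

print(name_matches_file(
    "litmus/my-project-id/delete-catalog-service-from-pr.yaml", MANIFEST))
print(name_matches_file("litmus/my-project-id/something-else.yaml", MANIFEST))
```

A check like this could run on the test branch so that a mismatch is caught before the merge triggers the sync.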
Yeah, so now we can see that since the PR got merged and the changes are in the main branch, it has triggered the Git operations. We can see that the schedule named delete catalog service from PR, which is the same as the file name over here, has been created. Similarly, the workflow run has also started. Let me click here and see all the related information. We can see that it's currently installing the chaos experiments, and in a few minutes we should see the catalog service going down.
Let me just show all the services over here. This is the shop namespace, where I have all the services running, like the cart service, the currency service, the frontend, the email service, the payment service, and the catalog service. With the current experiment, we'll be terminating the catalog service, and we can see that its status is terminating. If I refresh this, yeah, it says something has failed, with some details below for debugging, and the service is down. If I refresh again, I'd expect it to still be down, but I guess it's back in its original state, since the chaos injection time was pretty low in this case.
So we can see that we have injected chaos from the Git repository, and the effect was visible in the application as well. These were a few operations which can be performed from ChaosCenter. The major scope here for GitOps with LitmusChaos is to add this GitOps functionality to your CI/CD pipelines, or to use it in your GitHub Actions to run chaos within your CI/CD stage. I think that's it for the demo, and that's pretty much it from my side as well. Thank you.
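As one way to picture chaos in a CI/CD stage, a pipeline step could turn experiment results into a pass or fail decision. The sketch below is an assumption-heavy illustration: it treats the resilience score as a weight-averaged success percentage across experiments, which may not match exactly how ChaosCenter computes it, and `resilience_score`, `ci_exit_code`, and the 80 percent threshold are all hypothetical.

```python
from typing import List, Tuple


def resilience_score(results: List[Tuple[int, float]]) -> float:
    """Weighted average of success percentages.

    Each item is (weight, success_percentage) for one experiment.
    This formula is an assumption made for illustration.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    return sum(weight * pct for weight, pct in results) / total_weight


def ci_exit_code(results: List[Tuple[int, float]], threshold: float = 80.0) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise."""
    return 0 if resilience_score(results) >= threshold else 1


# Hypothetical run: one passing experiment (weight 10), one failing (weight 5).
run = [(10, 100.0), (5, 0.0)]
print(round(resilience_score(run), 2))
print(ci_exit_code(run))
```

A CI job could pass the returned code to `sys.exit` so that a low score stops the pipeline before the change reaches pre-production.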