Hi everyone, good evening. Hope you're all having a great conference — we're almost at the end of day two, so thanks for making it. We have about 30 minutes, so we'll cover a general introduction to what this project is about, and then the actual topic, which is security controls: the security considerations when you're doing chaos engineering. I'll go a little bit deep into it. Before we start, a bit of introductions. I am Uma Mukkara, head of chaos engineering at Harness. I co-created this project, LitmusChaos, in 2018, along with Karthik and a few others, and I continue to maintain it — we have a great line of maintainers who are at Harness right now. And I have with me today, Raj. Hi everyone, I'm Raj Das, one of the Litmus maintainers and a senior software engineer at Harness. Apart from this, I'm also a contributor to projects like Kyverno, Prometheus, and a few others. That's all about me. All right. I'd like to know a little bit about the audience here — how many of you are aware of chaos engineering, or how many of you are using Litmus? Okay, thank you. That means it's worth spending some time on what Litmus is, why it exists, and what its journey has been. Now, about today's topic, there were some questions: are you talking about security chaos engineering, or security for chaos engineering? I want to make that clear up front. Chaos engineering for security is about running chaos experiments to assert whether or not there are security weaknesses in your system. Security for chaos engineering means you're already running chaos experiments in your environment, and you want to make sure you're doing that securely. What we are talking about today is the second topic.
And towards the end, if time permits, I'll cover in general what security chaos experiments look like, if you want to run some — the Litmus project is a good place to start there, and I can talk through some of those scenarios. So what we'll do today is introduce Litmus in general, then look at the security challenges or considerations you face once you start implementing chaos engineering as a practice, and the possible mitigations — how to solve them. We'll take each of those five or six examples, and my colleague here, Raj, will actually demonstrate how you would do that. Then, depending on how much time we have, we'll talk about where Litmus is going later this year and beyond, and some general guidelines on how you can contribute to the project. So, what is chaos engineering? You must have heard a lot about chaos engineering as "break things on purpose" — kill everything, just monkeying around. Chaos engineering has evolved a lot in the last four to five years. It has become a regular DevOps practice where you introduce structured, expected faults as part of your software development lifecycle, stay in control of those faults, and observe whether your software has enough resilience or not. You put chaos tests into your DevOps pipelines and complete your reliability testing — that's what it is mostly used for today. Chaos engineering still has a lot of relevance for avoiding or reducing outages, but it is being used more and more on the left side of the software delivery lifecycle, in QA pipelines, primarily for Kubernetes. And there's a strong reason for that. One of them is that Kubernetes itself is architected on the basis of reconciliation.
That means there's always chaos happening on Kubernetes — pod deletes, for example. Pod deletes are not a problem; pod deletes are what is expected, in order to get to a better state. If you're a Kubernetes developer, you're always writing code to reconcile yourself into a particular state; you assume that pod deletes are happening. So you think of that as one type of chaos, then try to introduce more such faults into what you do as a developer, and build resilience tests, or assertion tests, around them. That's really what chaos engineering is, and it's a culture that's been introduced into development teams recently. Litmus is a chaos engineering project. It has evolved from being a small set of tests — a Kubernetes job you could use to introduce some faults — all the way to what it is today: a complete infrastructure to think about, design, implement, execute, and monitor chaos tests on an end-to-end basis. It's a complete platform. I'll spend thirty seconds on how we got there. We started with some Kubernetes jobs in 2017 and 2018. Then we wanted to make sure you could write chaos tests the way you write any other Kubernetes YAML file — make it completely declarative. So we introduced custom resources — chaos custom resources, an operator, et cetera — and that's where the chaos experiment became a well-understood concept within Litmus. One of the important topics when you do chaos experimentation is how you measure your steady state. Introducing a fault is one thing, but being able to measure your steady state to a very granular extent is even more important. I can bring down the pod, but how do I measure certain key states? That's where we introduced probes. Resilience probes, or chaos probes, allow you to declaratively define and manage your expectations.
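As a sketch of what such a declarative probe can look like: the snippet below attaches an HTTP steady-state check to a pod-delete fault. The application names and URLs here are made up, and the exact probe schema varies between Litmus versions, so treat this as illustrative rather than canonical.

```yaml
# Illustrative ChaosEngine with a resilience probe (names are placeholders).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos
  namespace: shop
spec:
  engineState: active
  appinfo:
    appns: shop
    applabel: app=checkout
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: checkout-health-check        # the "steady state" assertion
            type: httpProbe
            mode: Continuous                    # keep checking while the fault runs
            httpProbe/inputs:
              url: http://checkout.shop.svc:8080/health
              method:
                get:
                  criteria: "=="
                  responseCode: "200"           # expectation: service stays healthy
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
```

If the probe's criteria fail while the pod is being deleted, the experiment verdict is marked failed — which is exactly the "break something, observe something" loop described here.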
So chaos engineering is: break something, and also observe something. It need not be a complete downtime. It could be that latency is 5% higher than what I expect — that's not a good thing; it could lead to an eventual outage or a business issue, so it could be a problem for your resilience. Being able to declaratively write resilience probes was an important thing introduced in our journey. Then, once your developers are able to write experiments and steady states, how do you actually expand those chaos tests into your CI/CD systems? Because that's where most of your work is done. So we introduced APIs: you can construct a chaos experiment using our UI tool, or have some experts build it manually, and then use the chaos APIs to inject the experiments into pipelines. And how do you share these chaos experiments across your teams? We introduced chaos hubs. A chaos hub is a placeholder for the fine-tuned chaos experiments within your teams. If you have, say, a team of 15 to 20 developers, only one or two need to write the chaos experiments; the others can just go and insert them into pipelines or schedule them. That's how chaos engineering eventually gets adopted within an organization. So today, Litmus is fully featured in terms of being highly declarative. It has a chaos operator, probes, and chaos hubs all together — an end-to-end tool set for practicing chaos engineering on Kubernetes. Here's just a glimpse of what we do today. It has deep scheduling capabilities: you can schedule chaos experiments as a workflow — it uses Argo Workflows underneath — and you can run chaos experiments in parallel or in sequence, however you want. We have Prometheus chaos metrics to put the context of chaos onto your observability systems. And we also discover some of your assets and present them during the workflow creation process:
here are the assets — pods or resources — that you can go and inject chaos on. Litmus also provides a centralized control plane. That means when you're expanding chaos across your systems, you need to install Litmus on only one cluster; you can then connect hundreds of clusters to it and continue to manage chaos from one single place. So that's a bit of an introduction to chaos engineering and Litmus itself. Let me now talk about the common security questions that get asked, or challenges that you generally face, when you go ahead and implement chaos — and then Raj will show how you actually address them in Litmus. First: who can run chaos experiments? We have a control plane, but you need to integrate with the authentication and authorization systems in your organization — you want to be able to give secure access to the control plane. That's one. The second is that you don't want to open up all your environments for chaos; that becomes really chaotic. You want to be able to give certain environments to certain people. So how do you isolate namespaces separately for chaos engineering? That's the second question. The third challenge: I want to start chaos on only some services to begin with, but my cluster has a lot of services, including critical ones. For example, I want to start chaos on a logging service but not on my order service, where payments are going through — let me spend this first month doing chaos on my less critical services. How do you actually configure such controls? That's a security question as well. And then, chaos can have a high blast radius — zone failures or network failures, for example, can bring down your entire system. I would rather start with only pod deletes, CPU hogs, memory hogs, maybe network latency, but not network loss.
And I want only certain people to be able to run those network loss experiments, for example. For that we use Kyverno — another project that's part of the foundation. You can integrate with Kyverno and set some security policies around that, and we'll talk through it. Privilege escalation is very important too: how you manage privileges through service accounts is another challenge. And ultimately, once you roll out your chaos practice in your organization, you don't want everybody to come and keep touching it. This is another good security practice: keep it automated. If you are running something securely, better to run it in an automated fashion rather than have your team members come and execute it manually. So how do you keep things low-touch? We use GitOps — we have GitOps capabilities, so chaos can be totally automated and integrated with GitOps. These are some of the security questions we keep hearing from the community and other users, and we'll try to answer them — how exactly you do all this — through a few demos. Yeah, thanks. Hey everyone, thanks Uma. So I'll try to answer all these common questions that our community and users keep asking. To start, we have the first question: who can run the chaos experiments? Is it open for everyone? Is there any authentication or authorization? Let's go to the next slide. Here you can see we have an auth integration with Dex. Dex is an open-source OIDC provider where you can define your own OAuth providers. For example, I have defined providers like GitHub and Google. You define them in the Dex configuration, where you put your client ID and client secret — and that's all you have to do to set up any of the OAuth providers. So here you can see we have a user, right?
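Before we follow the user flow: for reference, the Dex connector configuration being described might look roughly like this — the issuer URL, IDs, and secrets below are placeholders, not the speaker's actual setup.

```yaml
# Sketch of a Dex configuration with two OAuth connectors (values are placeholders).
issuer: https://dex.example.com
storage:
  type: kubernetes
  config:
    inCluster: true
connectors:
  - type: github
    id: github
    name: GitHub
    config:
      clientID: $GITHUB_CLIENT_ID          # from your GitHub OAuth app
      clientSecret: $GITHUB_CLIENT_SECRET
      redirectURI: https://dex.example.com/callback
  - type: google
    id: google
    name: Google
    config:
      clientID: $GOOGLE_CLIENT_ID          # from your Google Cloud OAuth credentials
      clientSecret: $GOOGLE_CLIENT_SECRET
      redirectURI: https://dex.example.com/callback
```

Each connector only needs the client ID, client secret, and callback URI registered with the provider — which is the "that's all you have to do" being referred to here.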
The user is first shown this login screen, and they can use any of the OAuth providers. Dex communicates with the chosen OAuth provider, gets the callback on successful authentication, and then the user is authenticated on Litmus. You can see there's a MongoDB — we store credentials securely, in a proper hashed format. For a normal user login we store the hash; for OAuth we store the relevant IDs. Once authenticated, we have a few roles: the owner role, the editor role, and the viewer role. The owner is like an admin and can do anything — deleting projects, creating projects, inviting users, and so on. Next we have the editor, who has a little less privilege than the owner: an editor can execute experiments, stop experiments, and a few other things. And we have the viewer permission, which is basically only to view the experiments and their results. All of this lives under one project — Litmus is project-based, and you can create multiple projects; I'll come back to that in the next slides. Now let's see it with Dex. I have already set up one environment. This is the Dex login page; I'll go with GitHub and continue. It is authenticated: Dex authenticates with the OAuth provider, and once that succeeds, it gets the callback, and the callback takes us to the ChaosCenter dashboard. This is the ChaosCenter dashboard. We have other OAuth providers like Google as well; I won't go through those. Now I'll go to the second question: how can we isolate chaos to a particular environment? You may have a production environment, or you may have other applications running in different namespaces, right?
How do we isolate those environments? Let's start with that. Here you can see Litmus has two scopes: one is cluster scope and the other is namespace scope. With cluster scope you can access all of the namespaces, but I'll talk about the namespace scope only. In namespace scope, you can see we have three users, each communicating with a different delegate, and each delegate communicating with the main project. All the namespaces are isolated — but not fully isolated, because here all the users are still tied to one project. That's why we have the multi-project system, where you can assign users to a particular project. For example, if you have a test project, you can assign your user to the test project and give access to the test delegate; if you have a production project, you can give access to the production delegate. This way you can isolate whole namespaces based on the requirements of the application. The third question is about how to ensure critical services stay untouched. Here you can see we have three users under one project, one delegate, and certain namespaces. For example, you have the kube-system namespace, and you usually don't want to run chaos on that, right? Here you can create a service account, a role, and a role binding, and tie them all together for a particular namespace. For example, if I don't want to grant execute permission on kube-system, I can do that. And we have another mechanism for restricting users: teaming. Teaming is basically user management — you can invite users to a project, or create multiple projects where you grant access.
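The per-namespace service account, role, and role binding just described can be sketched roughly as below. The names and the exact rule list are illustrative, not the official Litmus manifests — the point is that a namespace-scoped Role grants nothing outside its own namespace, so kube-system is untouchable by construction.

```yaml
# Illustrative RBAC: a chaos service account confined to the "shop" namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: shop-chaos-sa
  namespace: shop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: shop-chaos-role
  namespace: shop            # namespace-scoped: grants nothing in kube-system
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: shop-chaos-rb
  namespace: shop
subjects:
  - kind: ServiceAccount
    name: shop-chaos-sa
    namespace: shop
roleRef:
  kind: Role
  name: shop-chaos-role
  apiGroup: rbac.authorization.k8s.io
```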
For example, owner access, editor access, or viewer access — you can invite your team members directly to the project, and once they accept the invitation, they can access the whole project. Now we have the fourth question: how do you control which faults can be injected? For that we have something called the chaos service account. The chaos service account is the service account created during the delegate installation, called litmus-admin, and litmus-admin has broad privileges: it can run any experiment, from pod delete to network loss to CPU hog. To restrict that, we introduced dedicated chaos service accounts per experiment. If you look at this slide, you can see the service account here: the delegate can access this pod, but it can't access this other pod, because it doesn't have that permission. We'll demonstrate this with a small demo. I have created some scenarios, or we can directly go and execute one. We'll schedule an experiment — this is a cluster-mode agent, so we'll select the agent, select an experiment from the hub, and add one called network partition. We'll give this network partition experiment a deliberately faulty service account, just to check whether it will be able to execute the experiment or not. So we'll edit it and set the faulty service account. You can find the field — the chaos service account. By default it is litmus-admin, which can execute any experiment, but we'll give it a faulty or deleted service account instead. Save the changes, next, next, and schedule it. Here's the summary. This will take some time, so in the meantime I can show you an older result.
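For reference, the edit being made in the UI amounts to changing one field in the ChaosEngine manifest. A hedged sketch, with placeholder names — `network-chaos-sa` stands in for whatever restricted (or, in the demo, deliberately broken) service account you bind:

```yaml
# Sketch: running a fault under a restricted chaos service account
# instead of the permissive litmus-admin default.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: partition-demo
  namespace: shop
spec:
  engineState: active
  appinfo:
    appns: shop
    applabel: app=checkout
    appkind: deployment
  chaosServiceAccount: network-chaos-sa   # placeholder; default is litmus-admin
  experiments:
    - name: pod-network-partition
```

If the bound service account lacks the permissions the fault needs (here, creating a NetworkPolicy), the experiment fails at that step — which is exactly the negative test being demonstrated.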
So this is one of the experiments I ran before the demo, and if you check the result, you can see the experiment has failed. To see why it failed, we can click on the node. We have two sections: one is logs, where you can also find it, and the result, which gives a very quick overview. Go to the result, and at the bottom you can see the error: it is not able to create the network policy. The partition experiment works through a network policy component, but it cannot create the network policy because we gave it a faulty service account. So you can define your own service account for each dedicated experiment, which reduces a lot of risk. Now the fifth question: with what privileges do you run the chaos? Here we have an integration with Kyverno. Kyverno is a policy engine made for Kubernetes, and I'll quickly explain how we integrated with it. You can see there is a delegate, and there are some policies installed in Kyverno — you could use Pod Security Policies or OpenShift policies as well, but we have used Kyverno. Let's go to the hub. This is the hub Git repository; I'll go there and check this policy. We're going to execute this network loss experiment under a restrictive policy. If we open the policy — this is a typical Kyverno policy; Kyverno supports validation, mutation, and other rule types, and we are using only validation, which will block the experiment from running. You can see at the bottom — we can focus on this area only — all our required permissions: hostPID, hostIPC, and hostNetwork are all false. That means it will not be able to run this network experiment; this is a kind of negative test we want to do. And we'll use this policy.
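A Kyverno validation policy in that spirit might look roughly like the sketch below. The policy and namespace names are made up; the `=( )` markers are Kyverno's conditional anchors, meaning "if this field is present, it must equal this value" — so any pod requesting a host namespace is rejected at admission.

```yaml
# Illustrative Kyverno policy: deny host namespaces for pods in "shop",
# which blocks network faults whose helper pods rely on them.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-host-namespaces
spec:
  validationFailureAction: Enforce   # reject, don't just audit
  rules:
    - name: deny-host-namespaces
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["shop"]
      validate:
        message: "hostNetwork, hostIPC and hostPID are not allowed here."
        pattern:
          spec:
            =(hostNetwork): "false"
            =(hostIPC): "false"
            =(hostPID): "false"
```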
This policy is already installed in the cluster, so we'll quickly execute the experiment. To do that, we again go to schedule an experiment, select the namespace scope and the chaos delegate, go ahead, and select the scenario from the hub — we have that KubeCon hub, and this is the experiment we want to run. Our policy is already installed on that delegate. Go next. This is just a predefined description and name; we can ignore it for now. Here you can see we are going to target this namespace, with this label, and we'll see whether the experiment is able to execute or not. Select the defaults, next, and we'll schedule it now. The summary looks good. Again, it will take some time — the experiment may take three to five minutes — so let's check an older result. I'll search for the same thing, network loss, and you can see some of the experiments failed during the run. I'll click here and check the node — again, you can look at the logs or go to the result. If you go to the bottom here, you can see the permission error: unable to create the helper pod. Kyverno uses an admission webhook, so this is the error where the admission webhook rejects the request — it is unable to create the experiment's pod. So that's the Kyverno integration. And we can also run some positive cases. For example, there's an experiment called pod CPU hog, and for pod CPU hog, if you open the policy — again, there's a lot of validation — you can see we are granting SYS_ADMIN access, which means it has the access to run the stress on that pod, and we allow a few more capabilities. You can see privileged mode is true — at the bottom, privileged is true — and the container runtime socket path is already set; that is containerd.
And this is one of the positive test cases, so let's run this experiment. Again, we'll select the namespace scope and the agent, next, select from the hub, next, select the scenario — the scenario is pod CPU hog with restrictive policies. Next, next. We'll again target this application, and for this we have to add one annotation — actually not an annotation, it's a label — that the policy requires. Select this label and add it in the ChaosEngine, save it, next, select the default values, and schedule it now. It is scheduling right now. And again, if you look at an older result on this page — this is one of the old runs — you can see that with the correct policy in place, the experiment is able to run. So that's how we can use Kyverno; it's one way to reduce the risk. And we have one last question: how can we make chaos low-touch? We know that we humans make mistakes — even a single manifest change, a single configuration change, can make production go down. For that, we have GitOps with Litmus chaos. We have two kinds of GitOps: one is frontend GitOps and the other is backend GitOps. I won't cover frontend GitOps; we'll talk about backend GitOps, which is all about automation. For backend GitOps, there is a controller called the event tracker. You can see — I'll just make it larger — there's a pod called event-tracker, which is basically a Kubernetes controller, and it is backed by policies. Just like Kyverno, the event tracker also has policies. So what does a policy look like? The controller is tracking all of the pods and checking them against a policy, which you have to define.
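To make that concrete, an event-tracker policy might be sketched as below. I should stress that the API group, field names, and operator values here are assumptions for illustration — check the Litmus event-tracker documentation for the real CRD schema; only the idea (a JMESPath condition over the changed resource that triggers a chaos run) comes from the talk.

```yaml
# Illustrative EventTrackerPolicy (schema is an assumption, not verified):
# fire when a tracked deployment's replica count drops to 2.
apiVersion: eventtracker.litmuschaos.io/v1
kind: EventTrackerPolicy
metadata:
  name: replica-drop-check
spec:
  condition_type: and
  conditions:
    - key: spec.replicas        # JMESPath query into the resource
      operator: EqualTo
      value: "2"
```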
For example, if my image changes from image one to image two, the event tracker is automatically informed of the event, and it sends an event to the GraphQL server — that is the main server. The GraphQL server then sends some chaos to the particular application that is subscribed to it. So, for example, say I'm changing the replicas from three to two, and I want to check: if we decrease the replicas, is the application still going to be resilient? We can create that kind of policy — we use JMESPath, so you can define your own query using JMESPath. If you decrease the replicas, it automatically triggers the event tracker, which informs the server; the server sends some chaos — a pod delete, for example, or some other fault — to that application, executes it automatically, and once it has executed, it collects the result. Thank you, Raj. We had to rush through a lot of these scenarios because we have a 30-minute slot; ideally we would have run a detailed demo, but all the examples we just showed will be added to the documentation, and you can go through this recorded session as well. So let me quickly talk through the second question that was brought up at the beginning: what if you had to run chaos experiments to check security itself — what are the common scenarios for that? Some of the security chaos experiments you can think of: do I get access to my S3 buckets — did anyone publish a public S3 bucket in my organization? Is anybody able to open public ports on public instances, or load test and perform denial-of-service attacks? Is anybody able to mount a host path into my pods? Et cetera.
These are the scenarios where you can write chaos experiments and see whether any security vulnerabilities are being introduced into your organization. There are going to be more and more examples of how you can use Litmus to verify security within your deployments, but that's a topic for a future session. Litmus has evolved over a period of time — quickly, in about thirty seconds: we are now at Litmus 3.0 beta. If you look at our journey so far, we have a full, stable infrastructure stack for running chaos, and from here we are going to make Litmus easier for developers to use. What that means is you're going to see chaos libraries that are thin and lean, so developers can very easily pull them into their pipelines — into GitHub Actions, et cetera. We have some examples of that already in the beta release, and we expect to complete all of this in the next six months. So hopefully by the next KubeCon, 3.0 will be out. I'm happy to receive any feedback and contributions — it's all out there in the discussion threads on GitHub. And please join the community: we expect more issues as you use Litmus, or requests for what's missing — whether you need more experiments, and so on — and we'll be happy to look at them and provide feedback. Our Slack is part of the Kubernetes Slack; there's a Litmus channel there with a good amount of traffic and chatter, so please join and give us your feedback. With that, thank you again for this short session, and have a great rest of the conference.