I'm sorry. I mean, oh, hello, my friends. It took me a little bit to get the meeting going this morning. I had to try three times. All right, let's get this thing started. So it's three minutes after the hour and I was the late one. Let me go ahead and share the agenda. All right, so it is March 18th, 2020, and this is the CNCF SIG App Delivery biweekly meeting. Just a little bit of a reminder: this is recorded and it is public. Do not say anything that you don't want to be judged about later. All right, so we actually do not have a large agenda today, but I would like to welcome any new members. Anyone want to introduce themselves? Yeah, let me start if you want. So my name is Fran Méndez. I am the founder of the AsyncAPI specification and the AsyncAPI Initiative. Not sure if you've heard about it. I was joining here today just to familiarize myself with how you folks work. And yeah, I'm happy to help with anything that you need me for, and you will see me very often, I guess. Well, hello and welcome. Anyone else? Yeah, this is Shank. He's still talking. Yes, I'll quickly go, and he can probably go next. So this is Shank. I'm from Delhi, and I really want to thank Uma for helping us out with organizing the October 1st meetup at Delhi. He came in and gave a really good session on Litmus, and I've been following Litmus since then. So really, thank you, Uma, for inviting me here. And really nice to meet you all. All right, welcome. All right, anyone else? Oh, good. This is Kiran here. I am a maintainer of the OpenEBS project. I just saw that Litmus was added onto the agenda, so I joined this call, and I'll try to be more regular here. Thanks. Yes, well, welcome. Yeah, as part of the agenda today, there will be a presentation by Litmus Chaos. I was going to talk about that in a little bit. Yeah, so welcome. Anyone else? All right. Yes. So my name is... my name is... I'm sorry, I think the Internet's not fantastic. I'm home.
My name is Winston. I work at Springer Nature as a platform engineer. Our organization recently joined CNCF, so I'm new to this meeting. Welcome. Thank you. All right, last chance. OK. Hi, my name is Ashok. I'm from Wipro, and we use Litmus for chaos experiments on some of our Kubernetes workloads. And Govindan Guhti, my team member, has also joined this call. OK, welcome. Thank you. I'm going to assume there is no one else. So the next item on our agenda is an update on the Air Gap working group. Jeremy, or is anyone here to give an update? Any updates? All right, well, the leaders of that working group do not appear to be here, so I will ping them offline to get us an update. So next up is an Operator working group update. Jared, are you here? Yeah, yeah. Hey, I'm here. Can you hear me OK? I think so. There we go. Yep, I clicked the unmute button. So the charter is, I think, forming up pretty nicely. We haven't had a chance yet to actually meet. I think we intended to set up a Doodle, and then everyone working from home threw a wrench into it. At this point, for the remaining parts, like our goals and our non-goals, we really just need to figure those out on that call. And then hopefully in two weeks we'll be able to send that charter back for a vote, or be ready to present it here. You're muted. So did you push the meeting out a week, so it'll happen next week? We haven't set up the Doodle. We pinged Alois and Chris, and we're just waiting on a response there. I can add it, but I guess we weren't sure where to go from here on setting that meeting up. Yeah, so I think that meeting was supposed to happen yesterday, right? If I'm thinking correctly. I know I had a conflict. OK. Well, I will figure out when we need to have this, or when it's going to happen, and we'll make sure that folks know. All right.
Next up on our list is Litmus Chaos, and they're going to do a presentation. Yes. Hi. This is Uma, one of the maintainers of the Litmus Chaos project. Can I go ahead and share my screen? Yes, please do. All right. Hopefully you can see it. First of all, I just wanted to give a little bit of background on this project and where we are with the process. Last meeting, I checked here in this SIG how to follow the new sandbox proposal process, and then created an issue and a PR. So we created the issue first — this is the issue in the TOC repo — and then Amy asked me to create a PR as well. So we created the PR; the idea is that we do a review at this SIG. So I'm here. Thank you for taking your time to listen to this presentation. What I'm trying to do here is — I presented to this group back in October, a very general introduction to the project when it was newly formed — but now I have a more detailed presentation, as per the details required by the process. One question, even before I start: we present, and I know it's going to be recorded, and then what would be the next step in general for the TOC? Should I go and present to the TOC, or do we wait for a few more meetings to happen? I'm just wondering. Yes. So what will happen is, after this, we'll take this conversation offline — but this is for Sandbox, so we will contact you. There is a little questionnaire that we like people to fill out, and we'll get that to you, Uma. What I'll do is get your email, and then we'll get it started from there. All right. Perfect. Thank you, Brian. So with that, let me start. So Litmus Chaos — the project itself provides an infrastructure to do cloud native chaos engineering. I'll talk in a bit about what cloud native chaos engineering is. A bit of background and history: this project is not new.
It's been around for almost two and a half years now. We announced it at KubeCon 2018 as an open source project for Kubernetes. Since then, it's been majorly sponsored by MayaData, but recently we are seeing more vendors contributing to it, and that's one of the reasons why we wanted to present here. Originally, it started as a chaos engineering framework for OpenEBS — Kiran is on the call; he's the co-maintainer of the project. We were trying to do chaos engineering for OpenEBS, and we did not really see a proper tool that is really cloud native, so we started writing our own. We soon realized this could be useful for a lot of other applications in the Kubernetes environment, and we started building it out as a project. And here we are. It's been in production in OpenEBS CI for the last two years. What that means is that Litmus Chaos has been continuously used for the various negative chaos tests on the OpenEBS.ci platform. It's well baked, and a lot of feedback has come in. Sometime around the middle of last year, there was feedback that Litmus is not really easy to use — you have to know Ansible and so on. So what the community and the team did is say: let's create application-specific chaos charts, similar to Helm charts, and start putting them on Chaos Hub. And then we really saw a lot of other community members picking them up and using them for their own applications. That was really the start of the community growth, I would say. Right now we have full-fledged charts for Kubernetes and OpenEBS, decent charts for CoreDNS and Kafka, and a lot more are coming. In terms of project status, we achieved 1.0 general availability a couple of months ago. The latest release is 1.2, and 1.3 is in progress. It's completely Apache 2 licensed. I'll share the details of the FOSSA check, compliance, and everything later. So that's the background.
And this project itself is not new to CNCF. Soon after starting it, there was a chaos engineering working group that was started by Chris, and we presented there in July of last year. Once we got more feedback from the community about the CRDs, we defined the general principles of cloud native chaos engineering and published them on the CNCF blog. That blog post receives a lot of visitors looking for how to do chaos engineering for Kubernetes in general. It's not super specific to Litmus alone; it's about what you need in order to do chaos engineering in a Kubernetes native way, and how you would do it for other applications. Then, once the project had made decent progress and Chaos Hub was ready, we presented here to the App Delivery SIG sometime in October. After that we really saw more charts coming in, and we also had the intention of providing Litmus as a chaos stage for CNCF CI itself. So we presented to the CNCF CI working group earlier this year, and one of the Litmus team members has now created a PR to introduce a chaos stage into the CoreDNS pipeline. Once we get through that, there will be definite progress towards doing the same for other projects in CNCF CI. So we've been around here, and of course last week we were here as well, just to talk about that. Looking at our community, the community is reasonably big, I would say. We have about 630 stars, and we participated in Hacktoberfest, so there were a lot of small contributions from all over the world. But majorly, I would say there are about 40 to 50 contributors, and many of them come from the following companies: Wipro, Intuit, and MayaData, the primary sponsor. We have a fully committed team of about 10-plus people contributing to Litmus. And we've been drawing on the experience of running the OpenEBS project.
We've been running monthly community calls, with multiple people joining — it's organic growth that we are witnessing. We have the #litmus channel on Kubernetes Slack, and there are quite good conversations happening there. That's the website, and the Twitter account is reasonably active. In terms of primary users, I will name a few who have got legal clearance from their side to say publicly that they are using Litmus: Wipro and Zebrium are two of those companies, and MayaData, of course — we have our commercial products where we use Litmus for chaos testing. And then, of course, there's OpenEBS, a CNCF project which is using Litmus; the genesis of Litmus is really attributed to OpenEBS. I also have two other very big companies that are using Litmus for their general needs. They're also part of the CNCF ecosystem, but I would need to wait for their approval to display their names. I'm pretty excited about their contributions to the community going forward, and to the Litmus Chaos project in general. With that, I want to give a top-level feature list. Primarily, Litmus Chaos is an infrastructure. It's not just a few ways to inject chaos — in order to do chaos engineering, you need a lot of things. You need to do it through the regular, proven DevOps practices, or through GitOps. So you need well-defined CRDs, a chaos operator, and application-specific chaos experiments. And that's what it is. To do chaos engineering, you also need very good monitoring, and the ability to schedule chaos in a way that fits into your CD and DevOps practices — so you need a chaos scheduler. Those are also in progress. So as you can see, it is a real infrastructure for chaos engineering, and the chaos experiments are one part of it.
And we are actually very proud to say that we have a hub, very similar to OperatorHub or the CNCF hub that's coming up, where we pull together the chaos experiments. The whole idea, which I'll talk about later in the presentation, is to have readily available chaos experiments for all, or most, of the applications in the Kubernetes environment. That's probably why I think we are very applicable to this group: name an application that you need to deliver through your CD or upgrade pipeline — you need chaos engineering as part of that process, and you can use Litmus for it. We are proud to say that we have about 12 experiments, including the one that went out yesterday or a couple of days ago. It's a fully-featured chaos experiment suite for native Kubernetes: you can inject chaos into almost any Kubernetes resource. And there are a lot of application-specific chaos experiments, for example for OpenEBS, CoreDNS, and more. The other major feature: as a design goal, we did not want a monopoly on the way to do chaos — "this is how you kill a pod, this is how you kill a node." We didn't want to say: use Litmus and do it this way. Rather, we wanted chaos to be pluggable. If you know a way to kill a particular resource, and you have been using it, and you just need a chaos engineering infrastructure to run your chaos, you can use Litmus — the chaos should be pluggable into this infrastructure. And we have well-defined examples here. For example, Pumba is reasonably well known for injecting network latencies, and we wrapped Pumba up as a plugin in Litmus. PowerfulSeal is another well-known chaos framework, by Bloomberg, and it's been integrated into Litmus. And Chaos Toolkit, I think, was recently merged into Litmus — Intuit had been using Chaos Toolkit for about a year.
They found Litmus to be a good way to inject chaos in a cloud-native way, so they put a wrapper on top of Chaos Toolkit using our plugins, and now they're using Litmus. In fact, they're now among the core maintainers of the project itself. So this is really the summary: application pluggability, the Hub, and a very generic operator-based framework to manage the lifecycle of chaos itself. Let me talk about the architecture. I just want to simplify it a little bit. At install time, you get a chaos operator, an exporter to monitor what exactly has happened, and a scheduler to schedule chaos at the required interval — these are the chaos components — and we install the chaos schemas, the CRDs, to do the basic job of managing the chaos lifecycle. The actual experiments are listed on a hub, just like Helm charts, and you can pull them in. Right now we have about 20 to 25 experiments, but we expect this to go into the hundreds. You don't want all those custom resources on your cluster all the time, so you pull them in whenever you want. At runtime, you use one of these CRs; the CRs are watched by the operator and its controller, and a chaos runner is started. The chaos runner, through a chaos library, actually injects chaos into a given Kubernetes resource. So it's a loosely coupled, well-defined chaos framework, but it does the job of end-to-end chaos engineering. That's it at a very high level. And why do we call Litmus Chaos cloud native? Because the entire chaos management happens through an operator. We are using the Operator Framework, which is reasonably popular, I would say, but it could be written using other operator frameworks as well — KUDO, for example. It has complete lifecycle manageability right now, and the entire chaos can be managed through declarative YAMLs.
So you can use your existing GitOps model and DevOps infrastructure to inject chaos and actually make it a process, all through declarative YAMLs. The chaos runner — the job that manages chaos while it is being injected — runs in a container and can survive node reboots. One of the defining traits of cloud native is that it should be highly scalable and highly available. So what if the node running the component that injects chaos itself goes down — what happens to the management of the chaos? Because the runner runs in a container, orchestrated by the operator and by Kubernetes in general, it survives node reboots and the chaos management continues. That's why it's cloud native. The other aspect is pluggability: the entire cloud native ecosystem can use this infrastructure, not necessarily the Litmus core libraries. The examples I talked about are Pumba, PowerfulSeal, and Chaos Toolkit. Any chaos logic that you have, if it is in the form of a Docker image, you can plug into the Litmus framework. So that's why we think it's completely cloud native. And "how can Litmus Chaos be used by other CNCF projects" is one of the other questions asked in the sandbox proposal. Well, Litmus Chaos itself is a Kubernetes app, and it provides, as I described in the previous slides, well-defined declarative YAML APIs. All you need to do is use the ChaosEngine CR and define your experiments. So any project that runs on Kubernetes can use Litmus for its own chaos engineering. It's actually a very easy fit — a perfect fit, I would say. Next, I want to talk about the developer experience with Litmus, as well as the SRE or admin experience. As a developer, how do you use Litmus Chaos, and how does this entire infrastructure work? You have an application under development, and you are a cloud native developer.
In general, you're developing some Kubernetes microservices — for example, an application pod and a service that fronts those application pods. And because it is a Kubernetes environment, you have a lot of other microservices available, which are part of the CNCF landscape, and you will be constantly interacting with those microservices as well. So the quality and resilience of your application also depend on the quality and resilience of those microservices, which are managed by somebody else, but which you are 100% dependent on. So what do you do, in general, to build in resiliency or quality? You build a pipeline; the functional tests are up to the developer. But here we are talking about chaos, or negative testing, right? You need to include negative scenarios for your application. It need not be about your application or your services alone — it may be something like: hey, MySQL goes down; how does my application behave? So you need to actually kill MySQL, the database underneath, and see how your application behaves. You need a lot of ways, or experiments, to inject chaos into those other applications. And that's where Chaos Hub comes into the picture. You need not write those experiments — they're already available. You just select them and pull them in, and they're available in your pipeline, and the results are readily available. And of course, for whatever application you're writing, you need to write the application-specific chaos experiments for your own application. So you develop them, and then, as a good Samaritan, you can push those chaos experiments back onto the Hub. So what happens is: once you've built all this, the CI pipeline is done, your product is reliable now — you think — and you ship it out. And the users now have a way to test the resiliency of your application.
It could be another developer who is using your application — the green box here — or it could be an SRE. They could take the tests that you developed during development and use those chaos tests later in production to do chaos engineering. That's the idea: bring developers and SREs together and have them work in tandem, in an open way, in a Kubernetes native way, to do chaos engineering and increase the reliability of the entire delivery of the application into production. So what about the SRE? For the SRE, it's actually pretty simple. We expect that when Chaos Hub is well developed, you will have chaos experiments for most applications, all out there on the Hub. You just need to pull those CRs, tune them, and schedule the chaos engineering CRs so that the chaos experiments run. You observe the resiliency metrics and take the appropriate decisions to improve resiliency — tune the configuration of your application, or the cluster, or the hardware, or whatever. So with this infrastructure and the Hub, chaos engineering goes from being a challenging exercise to an almost seamless one. So what is this Chaos Hub? How does it look? Let me show you. We wrote Chaos Hub as a fork of OperatorHub, because it's open source and Apache 2. We have about 11 experiments for Kubernetes resources — these are the charts, or experiments, that you can use. You can inject almost any type of chaos that you need: memory hog, CPU hog, or the recent ones we have introduced. For example, pod network latency — underneath we use Pumba, another chaos library, through the pluggable mechanism, and it works well. And similarly, let me show the upcoming charts — those are some of them.
There are some in staging — for example, Elasticsearch; Longhorn, which is another storage engine; MongoDB; MySQL — many are coming. And we're hoping that with the introduction of neutral governance for this project, more and more CNCF projects will submit their chaos charts to the Hub, so it becomes more usable and friendly, and driven by a larger community. With that, I just want to do a very quick demo of what I just talked about. As you can see, I have a WordPress server using a MySQL database, and intentionally we set just one replica, so they're not highly resilient — when you kill something, the application just goes down and comes back up. Litmus generates a lot of Kubernetes events for every chaos action — it's very important to know what happened and who is doing what — so I'll be watching Kubernetes events while I do this killing. Let me go back. I have a WordPress application; let me refresh it — it's live. I have a namespace called sig-demo, with WordPress and MySQL running, each a single pod. They've been running for about two hours, and I have no events here. Now let me actually install Litmus. Right now there is no Litmus at all on this cluster. Through a simple operator YAML, I'm going to install Litmus. Let me move this to one side. OK. Let me see if the CRDs are installed — yes, they are. And let me see which pods are running: you have the operator, and that's pretty much it for us. Now I'm going to pull some experiments onto this cluster. I pulled the generic chart, so I got a lot of experiments; let me see which CRs are installed — right, just now installed. Next is security: I have RBAC policies already set up in this YAML file. It basically says that a given persona has access to inject chaos. So I'm going to apply those RBACs.
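[Editor's note] The setup steps narrated in the demo can be sketched roughly as the following command sequence. This is a sketch, not the exact commands from the recording: the manifest URL, chart path, namespace, and file names are illustrative assumptions.

```shell
# Install the Litmus chaos operator and its CRDs (manifest URL is illustrative)
kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-v1.2.0.yaml

# Confirm the Litmus CRDs were registered
kubectl get crds | grep chaos    # expect chaosengines, chaosexperiments, chaosresults

# Confirm the operator pod is running
kubectl get pods -n litmus

# Pull the "generic" chaos experiments from Chaos Hub into the demo namespace
kubectl apply -n sig-demo \
  -f "https://hub.litmuschaos.io/api/chaos?file=charts/generic/experiments.yaml"

# Apply the RBAC (service account, role, role binding) that permits chaos injection
kubectl apply -n sig-demo -f rbac.yaml
```

These commands assume a live cluster with `kubectl` configured, so treat them as a reading aid for the demo rather than a copy-paste recipe.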
Before injecting chaos, I just want to show you what it takes: you need to create a ChaosEngine CR. The spec says which app, which namespace, and some annotations, and then which experiments you're going to run. I'm going to do a pod-delete on an application whose app label is given above, with a total chaos duration of 10 seconds — meaning one pod kill at the end of 10 seconds. So with that, I'm actually going to inject chaos. Once I do, you can see that in that namespace the chaos container comes up — the chaos runner is now running. There's an event saying things related to chaos are happening here. And the actual experiment that does the killing, pod-delete, itself runs inside a container — that's what I meant by cloud native. In about 30 to 40 seconds, once the pre-checks are done, the chaos injection happens, and you can see the termination of MySQL happening. Now we should see this WordPress app become unresponsive — and it has. It went down, but we expect it to come back — the entire stateful application — and then the application recovery happens. The containers are still not fully up. As you can see, Litmus is really waiting: its job is not just injecting chaos, but also verifying the resiliency of the application — once the pod is up, it waits for the service to be available, et cetera. Because WordPress is looking for its database, it keeps crashing, and Litmus is waiting for it to come back. If it doesn't come back after some time, then there is a problem with your application: the experiment will exceed its threshold and report that if the pod goes down, your application goes down entirely.
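[Editor's note] A minimal ChaosEngine CR along the lines of the one described in the demo might look like the following. The names, label, and service account here are illustrative assumptions; the `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` environment variables follow the Litmus 1.x pod-delete experiment.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: wordpress-chaos          # illustrative name
  namespace: sig-demo
spec:
  appinfo:
    appns: sig-demo
    applabel: "app=wordpress"    # label selector for the application under test
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # service account from the RBAC step
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "10"        # total chaos window of 10 seconds
            - name: CHAOS_INTERVAL
              value: "10"        # one pod kill per 10-second interval
```

Applying this CR with `kubectl apply -f` is what triggers the operator to spin up the chaos runner and the pod-delete experiment pod seen in the demo.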
So in about 10 to 15 seconds it should recover; if not, Litmus will report that the application has an issue, or that the experiment has failed. While we wait — I think it's come back up — you'll see that the Litmus post-check says everything worked well and the AUT, the application under test, is running successfully. We now delete the chaos resources — whatever new resources we introduced are cleaned up, and we're back to normal. And let me show the result as well. The result is another beautiful thing: it's also a CR, and you can use it to collect metrics and do other sorts of work with the result. Now let me go back — it should come back up. Yep, it's all good, and we have the chaos page as well. So in this case the experiment succeeded and the application is working well. That's how you define and test your resilience using Litmus. So that's the demo part. Just to summarize: in these few minutes, I was able to define my test, pull the test case, and run it — all in this live demo. It's very easy: you install Litmus and get the operator and libraries, you pull the charts, you create the CR to inject chaos, and the chaos runs inside a container, in a cloud native way. The result is a CR, and you can use it. That's how simple it has become, from being a very, very complex process of chaos engineering. So why do we think this project should be owned by CNCF, or why do we want to donate it? Well, the idea is that it should be useful for a lot of other projects. That's why we open sourced it and architected it in a way that Kubernetes applications can easily use. Most CNCF applications and projects are Kubernetes centric, so it is a very natural fit. And just like OperatorHub and Helm Hub, we have Chaos Hub.
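[Editor's note] The ChaosResult CR mentioned in the demo can be inspected with `kubectl`; a rough sketch of its shape is below. The resource name and field values are illustrative; the `phase`/`verdict` status fields match the Litmus 1.x ChaosResult schema.

```yaml
# kubectl get chaosresult wordpress-chaos-pod-delete -n sig-demo -o yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: wordpress-chaos-pod-delete   # typically <engine-name>-<experiment-name>
  namespace: sig-demo
spec:
  engine: wordpress-chaos
  experiment: pod-delete
status:
  experimentstatus:
    phase: Completed                 # Running while chaos is in progress
    verdict: Pass                    # Pass/Fail, based on the post-checks
```

Because the result is itself a CR, it can be scraped by the chaos exporter and turned into metrics, which is what the speaker means by doing "other sorts of work" with it.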
So it's a proven way for the CNCF community to use Litmus as part of their application delivery. We believe that if it's part of Sandbox, and later on moves up in the ecosystem, the Hub itself will grow faster and the community will embrace it much faster. Most of all, we want a vendor neutral home for this project, because it's not just MayaData — there are two or three other very big companies among the contributors. So we think it is now time to find a vendor neutral home for this project. There are a lot of other details that we put together as part of the sandbox process, including the FOSSA checks and governance — thanks to Kiran for helping with many of them. For example, we have an adopters list of who is using Litmus: big companies as well as individual users. We also have e2e for Litmus: a well-defined CI/CD platform that tests everything before it goes out. It's based on GitLab — we followed infrastructure very similar to that of CNCF CI — and it has a well-defined CI and e2e platform. So with that, I would say thank you for giving us the opportunity to present here. I'm very happy to answer any questions. All right, thank you, Uma. So before we get started, there are two questions from the chat, and I'll read them both. I want to keep this part fairly short, but I still want to give people a chance to ask questions. First: what is the distribution method for these chaos charts? Is it using OCI, or should we rope Chaos Hub into the discussions that we've been having? I guess there's another conversation going on about hubs — and that was it. Yeah, I mean, right now the hub is nothing special; because we were not part of a CNCF hub, we created our own hub, right? If you want to create a new experiment and share it, you create a CR spec and just put it out there. We expect this hub will be merged into a CNCF hub.
For example, users come in: I need an operator for my application — go get it from the hub. I need a chaos experiment — go get it. I need an installation spec — go get the Helm chart from there. So we expect that there will be a common hub, where chaos is one more generic delivery method. I hope that answers your question. All right, and then there's one more question: have you seen, and is there any value in, customizing experiments from the Chaos Hub for a particular environment? Yes. So I guess the question is: how do I adjust an experiment before applying it to my environment? Yes, it is totally customizable already, and in fact there is value in having already-customized experiments in the hub. It's not just about doing a pod kill. For example, what I showed in the demo: I killed a pod that belonged to MySQL. That is Kubernetes-generic, and the post-check of the chaos experiment just checked, hey, did the MySQL pod come back up and running? That's all it could do. But if you have an application-specific chaos experiment for MySQL itself, then the post-check of that experiment will go and talk the language of the application itself. And that's again totally tunable — you can vary various parameters. That's the whole idea of creating the hub: you will have specific post-checks and pre-checks around injecting chaos. That probably answers the question; if not, I can take detailed questions in the community. All right, are there any more questions? OK then. Well, thank you, Uma, for the presentation today. I sent you a message on Slack already; I just need to get some information from you. So go ahead and answer that, and we'll work with you on the process of getting into the sandbox. Thank you. Yes, so that's actually the end of our agenda today.
Yes, yes, it is the end of our agenda today. So our next meeting is in two weeks — that would be, what, April 1st? And I guess I'll see you then. Yep, that's correct. All right. All right, bye. Thank you, bye. Bye.