Okay, hello and welcome to ServiceMeshCon Europe 2021. We're going to talk about creating chaos in the university using Linkerd and Chaos Mesh.

A little bit about me: my name is Jessica Stria, I'm a science and systems engineering student at San Carlos University. Currently I'm working at IDT in technical support, checking network stuff across Latin America, and I'm also a freelance developer and a cloud native enthusiast.

And a little bit about me: I am Sergio Mendez, a professor of operating systems at the University of San Carlos of Guatemala. I really like to play with cloud native technologies, especially CNCF projects. I do a little bit of DevOps at Yalo, I'm the organizer of Cloud Native Guatemala, a CNCF community group, and I am a Linkerd Hero of February of this year.

Chaos in the university. How? Well, let's get started with a cartoon. This cartoon reflects how chaos engineering appeared in our systems and how we create chaos with these kinds of tools.

This presentation is based on my thesis project, which I'm working on with Sergio, and it came with some challenges: learn about cloud native technologies literally from scratch, find quality sources of information about chaos engineering, understand and implement chaos experiments, and share this knowledge in my country, which is currently a big challenge.

So, what is a service mesh? A service mesh is a layer that gives you a stable network between your services. It is just a proxy instance in your deployments, a sidecar container that intercepts the traffic and decides what to do with it, making your networking more reliable, stable, and secure. The benefits: you can check the state of your services on the network, it scales well and helps you implement services that are stable in general, you can split traffic, and you can add security at the TLS layer, all through the sidecar proxy. And most of the time you don't have to modify your application at all; that promise of no modifications is one of the main selling points of service meshes.
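To make that last point concrete, here is a minimal sketch of how Linkerd adds its sidecar without touching the application. The `linkerd.io/inject` annotation is Linkerd's real injection mechanism; the namespace name is just a hypothetical example.

```yaml
# Minimal sketch: opting a namespace into Linkerd's automatic proxy injection.
# Every pod created in this namespace gets the linkerd-proxy sidecar added,
# with no changes to the application images or manifests themselves.
apiVersion: v1
kind: Namespace
metadata:
  name: demo                      # hypothetical namespace for the experiments
  annotations:
    linkerd.io/inject: enabled
```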
Okay, now, what is chaos engineering? Chaos engineering means making experiments on a system to improve its capacity to withstand extreme conditions in production. The objective of chaos engineering is, of course, to cause failures on purpose, to identify where and under which conditions our system can fail, and to improve our system. Chaos engineering is based on five principles: build a hypothesis around steady state behavior, vary real-world events, run experiments in production, automate the experiments to run continuously, and minimize the blast radius. Some benefits of chaos engineering: identify vulnerabilities and potential failures, improve our systems, and get a proactive system design. Here are some examples of where we can apply chaos engineering tests: at the application and network level, by slowing traffic, dropping packets, or causing DNS failures; at the pod level, by killing pods or producing failures in them; and at the resource level, with the CPU, RAM, and operating system, to give some examples.

For chaos engineering experiments we define a group of steps that we can follow. First, define the framework we need: an architecture with containers and orchestration, and this architecture has to be observable, manageable, safe, and extensible. Define a healthy state, and be sure of what kind of experiment we want to try; we can try a lot of experiments, of course. Define a hypothesis of what could happen. Identify the critical points of the experiment we implement, understand those critical points, and verify what could go wrong: do I know what will happen if it fails, or if it doesn't? Set a small blast radius at the beginning, define our KPIs, unleash chaos (that is my favorite part), verify the results, and then we have a decision to make: improve our system if we found a vulnerability, or increase the chaos and keep doing experiments.

Now, the technologies we are going to use. Linkerd covers one part, just the faulty traffic. Linkerd is a service mesh for Kubernetes; it's really easy to install, and it gives you more observability, security, and help debugging the services in your network. You can split traffic and get the golden metrics, evaluate graphs about that traffic, and you can even use the traffic splitting feature for progressive delivery, among other Linkerd features. The other tool we're going to use is Chaos Mesh. Chaos Mesh is a chaos engineering platform that orchestrates chaos in Kubernetes environments through fault injection methods; we can provoke faults in pods, the network, file systems, the kernel, and a lot more. Why Chaos Mesh? This tool works well on a service mesh architecture applied to microservices, creating mesh-oriented failures.

Let's go to the demonstration part. Talking about how this demonstration is built: we have a Kubernetes cluster installed on GKE, with two nodes, each with two CPUs and four gigabytes of RAM. We have two main applications: a client that sends traffic to an Apache server on Kubernetes (we run siege inside the client deployment to generate the requests), and we use Linkerd's traffic splitting feature so that Linkerd sends 50% of the traffic to Apache and the other 50% to an error injector deployment that returns faulty responses; the traffic is split 50 and 50, 50 good, 50 bad. That's the first part: a pretty basic chaos experiment that uses the service mesh just for faulty traffic, to see what happens. On the other side, we are going to use Chaos Mesh to kill pods and make pods fail in the same experiment, putting all the things together. We'll do a walkthrough over the different dashboards of Linkerd and Chaos Mesh and explain a little bit of the code of these experiments.

Okay, let's check the experiments; because of the time, I already have everything running. Here is the Apache service deployment file: we define four replicas for the Apache server, replicas for the client that is going to be constantly sending requests to the server, and the error injector. Then we have two additional files. First, the faulty-traffic file for Linkerd, in which we define the traffic split that sends 50% to the Apache server and 50% to the error injector; sketches of the client deployment and of the traffic split follow below. And for the second experiment, we define two experiments with Chaos Mesh: the first one kills pods, in this case the Apache server pods, every 12 seconds, and the second one just causes a failure in the pods, but this one every six seconds.
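To make the architecture concrete, here is a minimal sketch of what the client deployment could look like. The `linkerd.io/inject` annotation and the general Deployment shape are real; the image name, labels, and exact siege arguments are hypothetical placeholders, since the real manifests live in our repository.

```yaml
# Minimal sketch of the load-generating client, assuming a container image
# that bundles the siege HTTP benchmarking tool (the image name is hypothetical).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client
spec:
  replicas: 2                       # illustrative replica count
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
      annotations:
        linkerd.io/inject: enabled  # mesh the client so Linkerd can split its traffic
    spec:
      containers:
      - name: siege
        image: example/siege:latest                         # hypothetical siege image
        args: ["-c", "2", "-t", "60M", "http://apache:80"]  # hammer the apache Service
```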
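And here is roughly what the faulty-traffic file looks like. It uses the SMI TrafficSplit resource that Linkerd supports; the service names match the demo, but treat the exact weights and apiVersion as a sketch rather than the verbatim file from the repository.

```yaml
# Sketch of the Linkerd traffic split: half of the traffic aimed at the
# apache Service is routed to the error injector instead.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: faulty-traffic
spec:
  service: apache            # apex service the client actually calls
  backends:
  - service: apache          # 50% reaches the real Apache pods
    weight: 500m
  - service: error-injector  # 50% hits the deployment that returns errors
    weight: 500m
```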
Okay, now I already have the two dashboards running: the Linkerd one, where we're going to go inside the namespace, and the Chaos Mesh dashboard. As you can see, there is no experiment running right now. Let's wait for Linkerd to load. As you can see, here are the four pods for the Apache server and the pods for the client. In this part we have the Linkerd dashboard showing the connections and the traffic between Apache and the client. If we apply the traffic split, we are going to see how the traffic starts failing, and the relationship between the faulty traffic and the success rate in this dashboard. So let's move to the traffic splitting part, the fault injection on the traffic.

First, we're going to inject the faulty traffic: I apply the faulty-traffic file, as you can see, and here is the error injector starting to run. This is the part that produces the failures in the service. Let's go to the traffic split view, the first one. Here it shows how the traffic is moving: the success rate drops to around 50%, because right now the traffic is split 50% and 50%. At the bottom of this page we can see the individual requests crossing the service. That's basically the traffic split part; let's come back to the main dashboard.

Okay, so let's go to the client part. The client view shows the relationship between these deployments: the client sends the traffic to the Apache service on Kubernetes, and the split routes part of that traffic to the service that connects to the error injector deployment. Let's go to the Grafana dashboard for the client. No, sorry, for Apache. Linkerd provides a dashboard that shows the requests per second of the service that is running; right now it's only affected by the faulty traffic we generated with Linkerd's traffic splitting feature. These are the general metrics that this dashboard shows: the success rate, the request rate, and the latency, which are basically the golden metrics that Linkerd exposes. So that's the way we use the service mesh to generate faulty traffic.

Now let's move on to generating chaos with Chaos Mesh. First we're going to stop this experiment, and then we run the Chaos Mesh experiment: I apply the pod-kill file and start injecting this experiment. Okay, as you can see, it starts loading, and here one pod was already killed and the system is bringing it up again. If we watch some live metrics on the pods, the experiment has killed two pods and is making another one fail, and it's going to keep going this way, constantly killing them and making them fail. Let's check the Chaos Mesh dashboard and refresh it. Okay, as you can see, there are now two experiments running in the experiments area, the pod-kill and the pod-failure, and you can see how every six seconds the pod-failure experiment provokes a failure.
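For reference, the two Chaos Mesh experiment files look roughly like the sketch below. The `PodChaos` kind, the `pod-kill` and `pod-failure` actions, and the label selector are real Chaos Mesh concepts; the names, durations, and the embedded `scheduler` syntax are a sketch assuming the Chaos Mesh 1.x API that was current at the time.

```yaml
# Sketch of the pod-kill experiment: kill one Apache pod every 12 seconds.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: apache-pod-kill
spec:
  action: pod-kill
  mode: one                  # pick a single matching pod each round
  selector:
    labelSelectors:
      app: apache            # target only the Apache server pods
  scheduler:
    cron: '@every 12s'
---
# Sketch of the pod-failure experiment: make a pod unavailable every 6 seconds.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: apache-pod-failure
spec:
  action: pod-failure
  mode: one
  duration: '5s'             # how long each injected failure lasts (illustrative)
  selector:
    labelSelectors:
      app: apache
  scheduler:
    cron: '@every 6s'
```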
And let's refresh the Grafana dashboard. Okay, as you can see, all the metrics are going crazy: the request rate and the latency jump up and down with every failure. And of course, we can also check this with Linkerd: as you can see, every time an experiment starts running, the Apache server starts reloading, and the experiment keeps going and going, killing the pods and turning them on again, at least for this experiment.

Now let's continue with the presentation. Okay, some lessons we learned through this process: first, how to implement hypothesis-based chaos experiments; second, that implementing chaos experiments to improve our systems is not as difficult as it sounds; and third, an understanding of how a student thesis can help open source projects and the industry grow and solve real-life problems. Here are some references and websites you can check to join the Linkerd and Chaos Mesh communities, visit their channels on the CNCF Slack and their own communities, and find general content about chaos engineering. And of course, because of the time, we couldn't explain every detail of the experiments, but we are going to leave every detail in the repository, every step and command you could use to replicate these experiments, along with the link to the slides. You can follow us, in my case through my Linktree, and me through my personal website. And well, that's all. Thank you very much. Thank you, guys.