Okay, we'll start the next session: Improving Service Mesh Reliability, by Atul Priya Sharma. Atul is a senior developer advocate working with InfraCloud Technologies. He works on and talks about cloud native, Kubernetes, and DevOps to help other developers and organizations adopt cloud native. Welcome, Atul.

Hey, good afternoon, everyone. That was loud. I'm not going to ask how the lunch was, or whether you guys are awake or not, so I'll get straight into the talk. I'll be talking about service mesh reliability using chaos engineering principles.

Before that, who am I? I am Atul Priya Sharma, and I work as a senior developer advocate at InfraCloud Technologies. I'm also one of the 18 CNCF ambassadors in India, and one of the organizers for CNCF Hyderabad. As part of my job, I help developers adopt open-source technologies by creating developer documentation, blog posts, webinars, Twitter Spaces, and talks like this one. Outside of work, I am a big-time foodie and a travel blogger, and I post my stories on my blog, socialmaraj.com, if you're interested. On Twitter, you can find me at TheTechMaharaj, and you can find me under my name on LinkedIn as well.

The agenda is fairly simple. I'll talk a little bit about service mesh, something around reliability and resiliency, chaos engineering, and how we can do chaos engineering for a service mesh, and then I also have a short demo.

Before I get into this, how many of you are from outside of Chennai? Can I have a quick raise of hands? OK, there are quite a few people. My first visit to Chennai was in 2016, when I was here to attend a family function. Being the traveler that I am, I took a day off to explore Chennai. Chennai is a very vibrant city; different areas have their own specialty. If you want to shop, go to T. Nagar, go to Pondy Bazaar; you can get some really nice stuff there. If you are interested in cultural things, go to Mylapore, go to Marina Beach. These are the areas that are famous in Chennai, and each is known for its own specific thing.

Software applications, especially microservices, are similarly vibrant. We have different microservices, each responsible for one particular task. I might have a microservice responsible for authentication, another responsible for observability and collecting metrics, and so on. Now, I won't talk much about why we moved from a monolith to a microservice-based approach; we'll dive straight into what a service mesh actually is.

When I spoke about Chennai and all these vibrant areas, it's really important to ensure that all these areas are in sync, that they are working fine, and that there are no issues. In the city, you have the municipality, or the police, ensuring that everything works: transportation runs from one area to the other, communication is fine, the signaling, the transport, et cetera. Similarly, for a microservice-based application, we have something called a service mesh, which is responsible for managing your services and the communication between them, as well as adding security.
The reasons why we need a service mesh I'll probably skip, because right from the morning we have been hearing about microservices, monoliths, and why we needed the transition. My talk got shorter, so thanks to everyone who spoke about it earlier in the day.

Architecturally, a service mesh is fairly simple. You have a control plane, you have a data plane, and for most service meshes, such as Istio, you have a sidecar container. The sidecar is responsible for service-to-service interactions and sits as a proxy: whatever traffic comes into your cluster is routed through these proxies.

Now, the benefits of a service mesh. When we moved to a microservice-based architecture, we suddenly had a lot of services, but there was a reason we adopted it over the monolith. The first benefit is service discovery. When you're working on a microservice application and a new microservice comes in, instead of manually configuring your application and telling it where to reach that service, the service mesh automates that for you. Service discovery is taken care of.

It also takes care of traffic routing. For any traffic that comes into your cluster, you provide a rule, and based on that rule the traffic is routed to a particular destination or instance. So when you talk about app deployments, if you have heard about blue-green or canary releases, your service mesh helps you with those, thanks to traffic routing. It also has features for load balancing: if you have multiple instances, you can define weights, and your traffic will be load-balanced based on those weights.

It also comes with a built-in mechanism for observability. Once a service mesh is implemented in your cluster, you get some visibility into how your services are interacting, how requests are flowing, and how everything else works.

Service meshes also enhance security. When I started working with Istio a few months back, I thought it was just another service mesh where I could put in an ingress and have traffic routed. But just last month, I did a CNCF webinar on Keycloak and Istio, where authentication and authorization were handled by Keycloak and Istio approved requests based on the tokens that were generated. So if you think a service mesh is required only for traffic management, no; there are some numbers I'll point to as we move further. Also, a service mesh is platform-independent: it won't always run on a Kubernetes cluster. It can work on a virtual machine; it can work on a server as well.

Here are some stats from the CNCF survey done in 2022; that's the source. As per the survey, almost 60% of the respondents use a service mesh in production, so you know service mesh is a tried and tested model. And almost 19 or 20% were evaluating it, so they would want to use a service mesh in their production as well.
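To make the traffic-routing and weighted load-balancing idea concrete before moving on, here is a minimal sketch of what such a rule looks like in Istio. The host and subset names follow Istio's Bookinfo sample (the same app used in the demo later), and a matching DestinationRule defining the subsets is assumed to exist:

```yaml
# Sketch: split traffic between two versions of the Bookinfo "reviews" service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90          # 90% of traffic stays on v1
        - destination:
            host: reviews
            subset: v2
          weight: 10          # 10% goes to the canary v2
```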
Coming back to the survey: one of the main reasons respondents want to use a service mesh is security. 79% of the respondents said they want a service mesh to improve security, and 78% for observability. There are also challenges when it comes to implementing a service mesh. The first, as per the respondents, was integration with other tools, and the second was reliability and consistency, which will be our focus now. What essentially happens with a service mesh is that you are adding an extra layer on top of your entire infrastructure, so you add some complexity, and that is why you need to ensure that your system, including the service mesh, is reliable.

So, on to reliability and resiliency. Reliability, as you all know, is how consistently your system behaves: if 1 plus 1 equals 2, you know that in every scenario 1 plus 1 will equal 2. That is essentially what reliability is. Resiliency is how fast your system can come back from a disaster. Say something goes wrong and your system experiences downtime: how fast can it return to its previous state?

I did prepare these slides and was supposed to talk about this at length, but if you attended the keynote by Uma earlier in the day, he covered a whole gamut of things when it comes to reliability and resiliency. His was an umbrella talk; mine is just a subset of that. So thanks to him, I can skip a lot of things here, assuming you all attended it, and I'll quickly go through this.

Why do you need to check the reliability of your systems? A lot of reasons. Start with traffic surges: if you're using, let's say, IRCTC, you know that the moment ticket booking opens, the traffic just goes up. How does the system behave in that scenario? Does it respond to your queries the way it does in its steady state? Multi-cloud deployments: the applications we build today are spread across many different service providers, clouds, and regions, so you need to see whether your system performs well in those scenarios too. Dependency outages: Uma mentioned this as well. A lot of the applications we build today depend on third-party APIs or services, so what happens if such a service goes down? How reliable is your application in that situation? Autoscaling and elasticity: terms we use often in cloud native. Your application should be autoscalable and elastic, but does it actually perform as required in that scenario? You need to check that. And obviously, disaster recovery: if something goes wrong with your system, does it come back in time or not?

So how do you test for reliability? There are different ways, different tools, different techniques. The simplest one is observability. If you are an application developer, you write console.debug or console.log; that's a very basic form of observability, because it helps you understand how your application behaves in a particular scenario. So we have logs, metrics, and traces as part of observability. The next thing you can do is load testing. There are a lot of tools available, something like Apache JMeter.
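You don't even need a dedicated tool to get a feel for it; the demo later in this talk just fires curl requests in a loop. A minimal sketch along those lines, with a placeholder URL, could be:

```sh
#!/usr/bin/env bash
# Crude load generator: fire batches of concurrent curl requests at an endpoint
# and print the status code and response time of each. The URL is a placeholder.
URL="http://localhost:8080/productpage"
for i in $(seq 1 1000); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" "$URL" &
  (( i % 50 == 0 )) && wait   # cap concurrency at roughly 50 in-flight requests
done
wait
```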
The idea is the same either way: throw load at your application and see how it responds. Then there's fault injection. What you essentially do is insert a fault into your infrastructure or application: you can cut the network and see how your system responds, or add a delay and see how it responds. You can do security testing: a penetration test, or pen test as they call it, checking things like authentication and authorization. And lastly, you can do chaos engineering. A larger part of this, again, was covered by Uma earlier in the day.

So, chaos engineering. How many of you have been on a roller coaster ride? Can you raise your hand? Anywhere: roller coasters, amusement parks? Cool. I have never been on a roller coaster; I'm scared, and that's not the kind of adventure I like. What happens when you sit on a roller coaster? You are all buckled up. You know you are getting into an exciting ride, and that it's safe. You sit on it, the roller coaster starts going up, and it reaches a point. After that, it swoops up and down, and you're just all over the place. Eventually, once you come back to the station, you think: OK, finally, I'm back, I'm alive, and that is why you are here. If I have to explain chaos engineering in layman's terms, that is what it is: a roller coaster ride for your application and infrastructure. You know your application is steady. You know your infrastructure is steady. But you purposely add faults at different places at the same time, and you see how your system responds. That is a very easy, layman's way to explain what chaos engineering is.

So what is it? It's the practice of purposefully breaking things to make systems stronger. When you know your system and infrastructure are in place, you add faults and failure scenarios to them and see how your system responds.

When did chaos engineering start? In the early 2010s, Netflix pioneered it with Chaos Monkey and the whole gamut of things around chaos engineering. We all know Netflix is one of the pioneers when it comes to microservices-based applications; they have a mammoth application in place for their streaming service. Why do we do this? To ensure that our systems are robust and reliable. When you add faults and run chaos experiments, you are essentially testing your system, and that tells you how well your system is built. And how do we do it? Intentionally, by running controlled experiments and seeing how the failures play out.

Now, chaos engineering, as Uma also said earlier, is a concept gaining popularity, and it matters from a distributed-systems point of view. Today, the applications we build are distributed: you have microservices running in different places and regions, a lot of network, a lot of complexity. What you see on the screen are the eight fallacies of distributed computing. What are these? When you build a distributed application, these are the things you unknowingly assume will hold. You always assume the network will be reliable, that the network will never go down. But you know how networks are; even in this hall, if you try to post a tweet, it won't work, because the network here isn't working.
So you cannot assume what's listed here: that the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, transport cost is zero, and the network is homogeneous. These are the eight things you unknowingly have in mind when building applications or infrastructure, and because of them you need chaos engineering: while you're building anything, you just assume these things hold, and you need something like chaos engineering to verify your system is robust despite them. I didn't come up with this; it's documented as the eight fallacies of distributed computing. These are established principles.

While Uma spoke about chaos engineering earlier in the day from a business and very high-level point of view, I'm going to go one level deeper and talk about the specifics. And mind you, I'm also very new to chaos engineering; it's been maybe a month since I started exploring it. I've been working with service meshes, and I'm also a cook, so I thought: just like I make recipes in the kitchen, let me try a recipe with chaos engineering and service mesh and see what comes out. Whether the result is good, you'll tell me after the talk.

The chaos principles: you basically start with a hypothesis about what happens when you introduce a failure into your system. For example: if I introduce latency, the response time will be affected. That's your hypothesis. Next, you define the steady state of your application: how it behaves under normal conditions. With a hypothesis and a steady state in hand, you run chaos experiments, simulating real-world scenarios on your application or infrastructure. You get the results, compare them against your hypothesis and steady state, and whatever gap exists in between, you work on closing it. So when Uma spoke about your latency increasing from 5% to 10%, you need to know where that happened; the gap you find here is exactly what tells you what to fix.

Designing chaos experiments is something really interesting I came across, because when I was using Litmus Chaos, which I'll show you next, you might think it's just about picking an experiment: what do I do, maybe just disconnect the internet and see what happens? But there's an interesting matrix we can use, with knowns and unknowns on two axes: whether we know the input and whether we know the output. The easiest quadrant is known-known: things you are aware of and understand. For example, you introduce latency into your application and you know the response time will be affected; you know both the input and the output. At the other extreme is unknown-unknown: you don't know what will happen, and you don't know how your system will respond. These are the cases you are, unfortunately, least prepared for. For example, EC2 going down, or a whole region going down.
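As an aside, to make the latency hypothesis from a moment ago concrete: for some faults you don't even need a separate chaos tool, because Istio supports fault injection natively. A minimal sketch, assuming the Bookinfo ratings service as the target, might look like this; delaying every request by seven seconds lets you check whether response times degrade the way your hypothesis predicts.

```yaml
# Sketch: inject a fixed 7s delay into all traffic to the "ratings" service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
    - ratings
  http:
    - fault:
        delay:
          percentage:
            value: 100.0     # apply the delay to 100% of requests
          fixedDelay: 7s
      route:
        - destination:
            host: ratings
            subset: v1
```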
Coming back to the matrix: outages like those are things you are not prepared to handle, and you don't know how your system will respond, so they fall in the unknown-unknown bracket. The interesting quadrant is known-unknown: things you are aware of but don't fully understand. For example, say you have a deployment on Kubernetes; I'm assuming a lot of people know Kubernetes. Yes? Cool. When you have a deployment in Kubernetes, there is a pod attached to it. You know that when a pod is killed, it will come back again, but you don't know how much time it will take to come back. So you have a known and an unknown. Similarly, for unknown-known: say you don't know why or when your cluster will go down, but you know that if it does, there is a secondary region sitting there that the traffic will fail over to. You don't know the input, but you know how the system will respond. The majority of the chaos experiments you design will fall into one of these four quadrants. The easiest place to start is obviously the known-known scenarios. Uma spoke about maturity models; when your organization starts with chaos engineering, the known-knowns are the easiest to begin with, and that is what I will talk about too.

Now, coming to tools: how many of you are aware of the CNCF? I'm assuming a lot of you. Being a CNCF ambassador, I'll talk about the CNCF as well. It's a foundation that hosts a lot of open-source projects; name a project and it's likely part of the CNCF: Kubernetes, Prometheus, Envoy, they're all there. They maintain the CNCF landscape; you can just Google it or go to their website. It categorizes projects across different areas of the lifecycle: orchestration, deployment, build, test, and chaos engineering. Among the tools you see here, Chaos Mesh and Litmus are the CNCF incubating projects; the others are vendor tools or are in the sandbox or another stage. The focus for today is Litmus, and we'll get into why chaos engineering is required in a service mesh.

I think my slides are not updated; anyway. When you talk about Litmus Chaos, there are five things you need to know. One is the ChaosCenter, the control plane from which all the experiments you define are orchestrated. Then you have the Chaos Operator, which takes instructions from the ChaosCenter and carries them out on the cluster where you want to run the particular experiment; that cluster can be the same one where Litmus is deployed, or a different cluster under test. Then there is the ChaosExperiment, which defines the experiment you want to run, and the ChaosResult, which captures the result of the experiment you have run. And lastly, you have probes. Litmus provides health checks in the form of probes: tests you can run before the chaos experiment starts, while it is running, and after it ends. For example, one of the probe types is the HTTP probe.
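To give a flavor of what that looks like, here is a sketch of an HTTP probe as it would sit under an experiment's spec in a Litmus ChaosEngine. The URL and thresholds are illustrative, and the exact runProperties format differs between Litmus versions (2.x took millisecond integers where 3.x takes duration strings), so treat this as a shape, not a copy-paste:

```yaml
# Sketch: HTTP probe that checks the app before, during, and after the chaos.
probe:
  - name: check-productpage
    type: httpProbe
    mode: Continuous               # runs for the duration of the experiment
    httpProbe/inputs:
      url: http://productpage:9080/productpage   # illustrative in-cluster URL
      method:
        get:
          criteria: ==             # pass while the response code equals 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 1
```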
For example, if you're running a network chaos experiment, you can attach such a probe at different points in your experiment, so you know how your service is doing or how your external application is responding.

I don't think this slide is much needed at this point; it just talks about why we need chaos in a service mesh. As I told you, a service mesh itself adds a layer of complexity to any application or infrastructure, and because of that it becomes all the more important to have something like chaos engineering in place and see how the system responds.

Now, the demo. I've kept it very simple; like I told you, I'm new to both of these technologies. I'm using Istio; I'm sure a lot of you are aware of it, it's a popular service mesh. And I'm using the standard Bookinfo application. If you go through Istio's documentation, they have a Bookinfo application, a microservice-based app: basically a portal where you can list books and rate books, made up of about four or five microservices. So on my local system, on Minikube, I have Istio deployed along with the Bookinfo application. After that, I'm configuring Kiali. Kiali is an observability dashboard that gives me insight into my cluster: what workloads are present and how the traffic is moving. On top of that, I add Litmus Chaos, configure the operator, run a chaos experiment, and then we see how our system responds.

So let me go to the demo. One learning for me from today is that I need to increase the font size next time I give a talk, because I'm sure even the first row won't be able to see what's going on, forget the last row. What is happening here is that I have the Bookinfo application configured on my local cluster, and this is what you see now: the Bookinfo application that comes with Istio, with books, ratings, and stars. Istio is deployed and the Bookinfo application is deployed. The next thing I'm doing is enabling Kiali, which comes as an add-on with Istio; you'll find it in the getting-started documentation. This is the Kiali dashboard, and it lists all the services and workloads present on your cluster. Here I'm adding some load: basically firing a lot of curl requests at the Bookinfo application at once. And this is what Kiali generates for me. You can see the Istio ingress gateway, and when I send the load, you see an animation of how the requests are flowing: those dots going from the ingress to the other microservices. That is what Kiali does for me; it shows how traffic flows between my various services.

The next step is installing Litmus Chaos. Litmus comes as a Helm package, so you can install it like any other Helm package. The first time, it asks you for the default login, and once you are into Litmus Chaos, this is how the UI looks. By default, it actually creates a Chaos delegate, the item highlighted there.
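For reference, the setup described so far boils down to a handful of commands, roughly following the Istio getting-started guide and the Litmus Helm chart. Paths are relative to the unpacked Istio release directory, and the release name and namespace for Litmus are just common choices, not requirements:

```sh
# Istio, Bookinfo, and Kiali (run from the unpacked Istio release directory)
istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled
kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f samples/addons/kiali.yaml
istioctl dashboard kiali    # open the Kiali UI

# Litmus Chaos via Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace litmus --create-namespace
```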
Coming back to the delegate: the delegate is where your experiments are executed; it is the component responsible for executing them. In this case, you can see the Chaos delegate is the self-agent. The first time you deploy Litmus, it sets up a self-agent by default, and that is what is responsible for executing your tests. It takes a few seconds, deploys some custom resources, and then you see the status as active. Once it becomes active, you can create a chaos scenario, a chaos experiment, and go ahead.

We'll use the self-agent in this case because I want the chaos experiment to run on the same cluster where Litmus is deployed. To run a chaos experiment, you can either write your own experiment as a YAML file, or use something called ChaosHub. On ChaosHub, there is a set of ready-made templates you can use to run your chaos experiments. As you see here, there is a generic list: container-kill, pod-delete, pod-cpu-hog. These experiments are available on Litmus ChaosHub, so you can just choose any of them and run it on the namespace you want. In this case, since Istio uses a sidecar proxy container for service-to-service communication, I am going to kill the sidecar in one of the pods, and we'll see what happens.

This is the chaos experiment screen, where I choose where I want the experiment to run. The namespace is default, the kind is deployment, and I'm selecting the container based on the label provided by Istio; the service.istio.io label helps me identify the container where I want to run this experiment. Then there are a few other settings, and I have the option to set the schedule: run it immediately, or run it repeatedly on a cron schedule. Once that's done, I've increased the load so we can see what's going on. At this point, there is no traffic going to the application, absolutely nothing. And now you see the traffic going: the load is being generated, and you get this graph because traffic is moving. Meanwhile, the container for the scenario we defined is being created; Litmus is deploying the container that runs the chaos experiment.

The moment the experiment is created, let us try to access the website and see what happens. It takes a few seconds for the experiment to run, and the moment it starts running, you get an error in your application: it says "no healthy upstream". Why is this happening? Because I have killed the sidecar container. If I go to Kiali immediately, you see yellow and red lines, because that traffic is now blocked; the container is not able to handle it. And when I click on it, I can see the error 500s increasing: from 3%, going all the way up to 4%, depending on the schedule I've given. I've given a schedule of only 30 seconds.
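For the curious, the experiment just shown boils down to a ChaosEngine resource along these lines. The field names follow the LitmusChaos container-kill experiment; the app label, service account, and durations here are illustrative stand-ins for whatever you pick in the UI:

```yaml
# Sketch: kill the Istio sidecar ("istio-proxy") in the Bookinfo productpage pod.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: bookinfo-sidecar-kill
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=productpage      # illustrative; the demo selects by an Istio label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: istio-proxy   # kill the sidecar, not the app container
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # seconds, matching the 30s schedule in the demo
```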
This was just one very simple experiment: a known-known scenario where I killed the container knowing the traffic would be affected. But in a real production-level setup, you need to identify the scenarios that are specific to your use case, and instead of running one experiment, you can run multiple at the same time, like the roller coaster example I gave. This was just to show you how chaos engineering works. In a practical scenario, you will have many more: you can kill a pod, kill a container, tamper with an HTTP header, add latency, and do a lot of other complex things to check the reliability of your system.

If you look here, the status is still showing as unhealthy. And now you see the status turn green: that's because my chaos experiment has completed. Once it completes, everything shows green, the experiment is done, and you see the result on the chaos scenario screen, which says the test has passed. If I had set up probes in between, they would give me additional details about the experiment I ran. Just a few seconds and it should turn green. So this is how you run a chaos experiment on a service mesh; really, on any application, as long as it's deployed on a Kubernetes cluster. And this is the output we get.

So, what next? We did the chaos experiment: we hypothesized a scenario, we knew the steady state, and we found the gap in between. What are the next steps? Now that you know the gaps, you analyze the data. This was a very simple scenario, just one experiment, but at production scale you will have a lot of experiments running at large scale. You gather the data, analyze why something is happening, and look for the gaps you can plug and the things you can improve, whether from a developer's point of view or an SRE's, whether it's an infrastructure fix or a code improvement. You assess the impact and do a root cause analysis of why that particular incident happened. And then documentation is really important; I think Uma spoke about this as well. Whatever scenarios have happened, you need proper documentation in place so that the next time something like that happens, your team is geared up to handle it. Remediate, repeat, and re-iterate: chaos is not something you do for one version and then stop. Uma gave the example of a banking company with the RBI. You cannot run chaos on version 1.2 and then, when you have 1.24, say it worked on 1.2 and skip running it. Iteration is, again, important, just as it is in any other form of testing.

I'm out of time; I think I timed myself perfectly. I've been practicing for a couple of days, and I knew I was hitting the 30-minute mark. In case you have any questions, I'm around; you can ask me. Thank you so much for being such a wonderful audience. I hope you found this talk insightful, and thanks to the team for having me here.

Thanks, Atul.

And one last shameless plug: you can reach out to me on Twitter at TheTechMaharaj.
You can also connect with me on LinkedIn as atulpriyasharma, and you can email me at atulpriyasharma at infracloud.io. If you're interested in travel and food stories: socialmaraj.com, and atulmaraj on Instagram. I'm making an Instagram story because this will be posted there. Thank you so much.

Thanks, Atul. We'd like to give you a small token of appreciation.