Thanks for joining, everyone. I'm Andre from Gremlin, and I'll be presenting this webinar on serverless resilience. Before we really get into it, I just want to start with a quick survey. This isn't something that you necessarily have to answer; it's something you can just think about as we go through the presentation. But if you do have an answer and want to share your experience, feel free to enter it in the chat box. So first: if you're running a serverless application, or you've run one in the past, or you've tested with serverless, whatever it might be, have you ever had it fail unexpectedly? If so, what did you do to fix and prevent the issue? And once that was done, are you confident that the application can handle failure in the future? Now, if you're lucky and you haven't had a serverless application fail, first of all, congratulations. Second of all, I'm sorry to say that it's only a matter of time before it does fail, and that's not because you're doing something wrong. That's just the nature of the systems that we run today. My hope with this presentation is that you'll get some ideas about how you can stay ahead of failure so you can keep your systems running reliably.

As Marisa mentioned, my name is Andre. I'm a product specialist at Gremlin. My job is really simple: I help SREs, developers, testers, engineers, folks in tech make their systems more reliable by providing them with the tools and knowledge to do so. Topics like chaos engineering and reliability engineering are incredibly important to me and to us at Gremlin, especially as more and more companies are becoming cloud native and moving to these complex systems. So my goal in this webinar isn't really to showcase a specific product, but to share some ideas, thoughts, and practices, and how we can apply these reliability practices to serverless platforms.

A quick look at our agenda: I'll start by setting the baseline of what reliability and resiliency are and what serverless is, then jump into how serverless platforms can fail. Then I'll move into what chaos engineering is and explain a bit about how it works. We'll talk about how to apply chaos engineering practices to serverless, and I'll conclude with a quick recap and a jump into Q&A.

So let's start with the absolute basics. What is reliability? Reliability is the ability for a system to remain available, or in other words, how dependable the system is. This is often used interchangeably with resiliency, but there is a slight difference. Resiliency is the ability for a system to quickly recover from adverse conditions. The difference that we try to point out is that reliability is the end goal and resiliency is how we achieve that goal. In other words, implementing ways to make our systems more resilient, making them able to recover from incidents faster, leads us to greater reliability.

And I think we can all agree that reliability is important, but why, really, is it important? Why should engineers focus their time and effort on improving reliability when they could be doing cool stuff like developing new features? The big one that we hear about all the time is, of course, downtime. Unreliable systems break, and this usually means an interruption in service. That in turn can lead to lower revenue due to lost sales if you rely on your application to generate income. You get customers who are angry or frustrated that they can't use your service.
You have engineers who are exhausted because they're spending all their time in emergency response mode, working late hours and over weekends to fix a problem. And this time spent fixing problems is time you could have spent doing valuable work like feature development or handling customer tickets. Downtime isn't just an expense, it's also an opportunity cost. And lastly, you slow down your ability to create and turn around new software, since you're dedicating this valuable time and these resources towards fixing rather than creating. These stakes and costs get higher depending on the size of your application and the industry that you're in. For example, an online store might lose a few thousand dollars per hour of downtime, but downtime for a bank or a hospital or an airline could actually cost lives. So it's important to tie that cost to the size of your business and your industry.

The problem, as we all know, is that no system is perfect. Failures happen all the time, even in systems that are designed to be quote-unquote failure proof. And these failures are caused by unknown variables in our systems. Even the most thoughtfully designed systems can't account for every possible scenario. Complexity makes this problem worse: the more complex a system is, the more likely it is to contain failure modes, ways of failing that we're not aware of. Unfortunately, modern systems are only getting more complex by the day. And on top of that, our companies are always looking for ways to increase development velocity. We need to move faster, develop features faster, test faster, push to customers faster, fix bugs faster. We have that real competitive need, and these systems are working to enable that, but that velocity means failures are more likely to creep in. So not only are systems and applications becoming more complex, they're also constantly changing, and we're changing them faster than ever before. So how can we possibly improve reliability when there's so much chaos going on?

I think a good way to illustrate this is to look at an example of an application that a lot of us have probably worked with, or at least know about. Kubernetes is something that I talk about a lot and something that I really enjoy myself, but it goes without saying that it's really hard to learn and really hard to get right, especially when you're starting out. Just as a quick recap: in Kubernetes we have multiple different components working together. We have worker nodes, the gray boxes here, that are running pods, which contain our applications. We also have the control plane, which is basically the brain of the cluster; it maintains cluster state and might redirect traffic or do load balancing and things like that. And if we're running this on a cloud platform, we might have an integration with a cloud provider through an API as well. Each of these components on its own has its own set of failure modes, like the infamous pod CrashLoopBackOff, node failures, network outages, network latency. You can even have a cloud provider outage or a control plane outage. In addition, there are failure modes in how these components interact with each other. So for example, if we have a worker node fail, then all of the pods and applications that were running on that node will also fail.
And when that happens, we need the control plane to detect it, report it up to the cloud platform, and then start redirecting traffic to a healthy node. All of this, this whole domino effect, happens because of one component failing. And every step, every component that we add to the system, introduces additional risk of failure and additional unknown variables. This isn't to knock on Kubernetes; Kubernetes is great at what it does, and when set up with high availability, it really can be rock solid. But it does have a lot of moving parts, and these parts can fail in unexpected ways if we don't proactively root them out and address them.

This leads into our next slide, which is all about serverless. Just as a quick recap, serverless platforms let us deploy and run applications without having to provision, configure, or manage infrastructure. As a developer, you can basically upload your code to the platform and the platform handles the rest. The great thing about this is that developers don't need to think about nodes, networking, replicas, or even containers, really. They just take their code, push it up, and it gets deployed for them automatically. Now this is great in theory, but it does pose a reliability challenge. The great thing about serverless platforms is that they separate infrastructure from the applications, and they reduce the workload for developers. The problem is that something has to manage that code; something needs to handle converting that code into an actual deployable executable. And that means having additional software on top of your existing stack to manage that for you, which is your serverless platform. Adding this layer, like adding any sort of layer, adds complexity. And if you remember from a few minutes ago, more complexity means more risk of something going wrong and something failing. So when we talk about serverless reliability, we're not just talking about building reliable applications and developing reliable code, but also building a reliable serverless platform for that code to run on, because you can imagine what the consequences would be for the applications running on it if the platform failed.

I feel like the best way to illustrate this is with an actual serverless platform, and I'm going to pick on Knative for this webinar. Knative is a really fantastic serverless platform that runs on top of Kubernetes. It makes it super simple to take a container and deploy it onto Kubernetes without having to worry about setting up really complicated networking rules, really in-depth resource allocation rules, or a lot of the other typical boilerplate you usually have to deal with when creating Kubernetes manifests, like deployments and replicas and all that fun stuff. Basically, you just tell Knative which image you want to run, what port to open, and maybe a few other optional parameters, and it handles the rest. There are two main components to Knative, if you're familiar with it, but the one we'll be focusing on today is called Knative Serving. This is the one that actually handles deploying and managing serverless applications. So we're going to take a look at some of the ways Knative Serving in particular can theoretically fail and how you might want to mitigate that risk.

Knowing what we know about Knative and Kubernetes, we can pinpoint four key areas where reliability is a concern. First is the application itself.
This is the workload, the serverless workload, that developers are deploying to and running on Knative. Next is Knative itself, specifically Knative Serving, which is the component that actually runs our applications. The third is the Kubernetes cluster that Knative is running on. And last but not least is the cloud platform, on-prem infrastructure, or standalone server: basically the hardware and environment that our Kubernetes cluster is running on.

If we look at these layers a bit more closely, or think about them a bit more, we can start to identify areas of ownership. We know the application layer is owned almost exclusively by developers, while the Knative and Kubernetes layers are owned more by SREs. The environment or cloud platform can be owned by SREs, operations teams, or a cloud provider, depending on where you're running your cluster. And this idea of ownership will become more relevant in the next slide.

If you think about how different teams interact with Knative, we start to see a picture of how reliability plays into the overall system. In this case, we'll just ignore the contributors for now. No offense to them, they do great work, but just for this webinar we're going to focus on the users, the developers, and the operators. What do users want? They want the application to be usable and accessible. That's it. They don't care about Knative or Kubernetes or whether you're running in the cloud or on-prem. You could be running your application on a smartwatch for all they care, as long as it's responsive and reliable. That's what matters to users. Developers are more interested in their application's stability. The underlying platform is mostly irrelevant to them, just as long as it's reliable, they can easily deploy to it, and they can quickly address any problems with their applications. And operators and SREs care mostly about the systems that they manage. They might not even know or care about the applications themselves, but they do care about providing that reliable platform, that reliable foundation, for developers to build on. That's mostly because they know that if something does go wrong, if something fails, they're most likely to get the blame for it, and they're most likely to have to get involved and get their hands dirty in resolving it.

The reason I'm taking so much time to call out the fact that different teams own different parts of the stack is that, ultimately, reliability is a shared responsibility. In other words, it doesn't fall under one single role. It's not just an SRE thing or an ops thing or a developer thing or a system administrator thing. Everyone needs to participate in setting reliability targets, monitoring for issues, and identifying and addressing these failure modes when they come up. For developers, it's pretty clear-cut: your job is making your applications resilient. Fix critical bugs in your code, add things like exception handling, retries, and timeouts to any network calls your application makes, and use any other techniques at your disposal to make your code as bulletproof as you possibly can; there's a small sketch of what that can look like just below.
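Just to make that developer-level hardening concrete, here's a minimal sketch in Python using the requests and urllib3 libraries. The endpoint URL, retry counts, and timeouts are hypothetical placeholders, not anything specific from this webinar.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures a few times with exponential backoff
# instead of giving up on the first error.
retries = Retry(
    total=3,                           # up to 3 retries
    backoff_factor=0.5,                # 0.5s, 1s, 2s between attempts
    status_forcelist=[502, 503, 504],  # retry common upstream errors
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch_profile(user_id: str) -> dict:
    # Always set a timeout so a hung dependency can't hang us too.
    resp = session.get(
        f"https://api.example.com/users/{user_id}",  # hypothetical endpoint
        timeout=(3, 10),  # (connect, read) timeouts in seconds
    )
    resp.raise_for_status()  # surface HTTP errors as exceptions we can handle
    return resp.json()
```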
SREs, your job is to keep the serverless platform up and running for developers. If you're using Knative, learn the best practices for highly available Knative and Kubernetes clusters, and learn how to configure things like redundancy, monitoring, and logging. Make sure you read up on best practices, and even horror stories from other teams and organizations, to learn what issues could happen and what to look out for. The more of this you can do before going into production, the better, but in some cases you just need to move fast and get it out there. So it's still important to make sure that you're constantly keeping up to date with the latest practices and techniques for reliability so that you can keep your systems running and stay ahead of issues. For ops teams and SREs, the organization depends on you to keep the lights on and to provide the hardware and resources that developers need to build resilient systems. Make sure that your systems are redundant and fault tolerant, that you're able to stay ahead of things like capacity requirements, and that you swap out any failing or near-failing components.

And there are a few things that everyone needs to take part in across the organization. First is collaborating on creating service-level objectives. Service-level objectives, or SLOs, define the standard of service that customers should expect from your service or your application. This is especially important because SLOs aren't just a technical objective. They're not just uptime or latency or any individual metric; rather, they're a business objective. SLOs are used to define service-level agreements, or SLAs, and depending on how your business defines its SLAs, failing to meet them could have financial or even legal ramifications. Nobody, especially engineering, wants finance or legal on their back because it took a minute or two too long to restore a service. So make sure that you're aware of your objectives and that you're taking measures to satisfy them, regardless of whether you're developing applications or maintaining the platform.

On the same note, be aware of your error budgets. An error budget is the maximum amount of downtime that is allowed for your systems before you violate your SLOs. In a sense, it's easier to think of it as cash or money: you start each month or year or SLO period with a certain amount, and you pay out of it for each moment of downtime. But it doesn't roll over. So if you have extra error budget, consider using it to try different things. Maybe test out new resiliency measures, run reliability tests, maybe run some chaos experiments. You're paying out of your error budget if something goes wrong or something goes down, but the error budget provides that buffer to make sure you don't violate your SLAs while you're still able to try new things, innovate, and make your systems more resilient. Again, error budgets go beyond a single team, so if you do start a resiliency or reliability initiative that could affect the error budget, get everyone involved first. The short sketch below shows the basic error budget arithmetic.
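To make the error budget idea concrete, here's a tiny sketch of the arithmetic, assuming a hypothetical 99.9% availability SLO measured over a 30-day window; the numbers are only an example.

```python
# Error budget = allowed downtime before the SLO is violated.
slo_target = 0.999              # 99.9% availability objective
window_minutes = 30 * 24 * 60   # 30-day SLO window, in minutes

error_budget = (1 - slo_target) * window_minutes
print(f"Error budget this window: {error_budget:.1f} minutes")  # ~43.2

# Each incident "spends" part of the budget; whatever is left is room
# for chaos experiments, risky deploys, and other planned work.
downtime_so_far = 12.0          # minutes of downtime already spent
remaining = error_budget - downtime_so_far
print(f"Remaining error budget: {remaining:.1f} minutes")
```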
All right, so we've talked about what reliability and resiliency are, what serverless is, some of the challenges of making serverless reliable, and why reliability is so important. Next, let's dive into how you can actually test, validate, and improve reliability, and we're going to do that by going over a method called chaos engineering. If you're not familiar, chaos engineering is a practice: thoughtful, controlled experiments that are designed to reveal weaknesses or failure modes in our systems. Chaos engineering developed as a practice about 10 years ago in enterprise environments like Amazon and Netflix. These companies needed a way to test their systems' ability to withstand turbulent and unfavorable conditions, like random system failures and region outages. Of course, just going into a data center and pressing the power button on a server or cutting a fiber optic cable is really bad and really destructive, so they built tools to help simulate these conditions. If you've heard of tools like Chaos Monkey, Gremlin, Litmus, or Chaos Mesh, this is what they were all born out of: this idea of being able to test these honestly kind of destructive conditions in a safe, controlled way. And that's why we approach it with a scientific process in mind, because it has the potential to genuinely impact our systems, and we want to make sure that we're really thoughtful and considered about how we go about it.

If you look at the general process of how chaos engineering works, we start by creating a hypothesis. It's an assumption: if we inject this type of failure, this is the outcome we would expect to see on our systems. It's a cause-and-effect thing. One thing to keep in mind when you're creating your first hypothesis, or when you're starting out with chaos engineering, is to start small. Don't go with an example like "what happens if I shut down an entire production region or an entire production data center?" Maybe it starts with "what happens if I increase CPU usage on this one server by 15%? How does that affect the performance of the application?" So when we say start small, just start with the basics: be able to observe the impact and draw conclusions, but without actually creating a severe impact on your systems.

Once you have your hypothesis, you then test your systems to isolate issues and mitigate risk. And we say that you should do this progressively, because again, you start out small. Maybe you add 15% CPU usage to a server and see a performance issue come out of that. You can then bump up that experiment to use 20 or 25% CPU and observe the issues coming out of that. Then you're able to see, based on CPU usage, how that impacts your systems, and working progressively like this helps you isolate issues. You gradually increase the magnitude, which is essentially the severity of the experiment, and scale it up. So maybe instead of consuming 25% CPU on one server, you consume it on two or three, or even across an entire cluster, just to get a better idea of what that performance impact would be. There's a rough sketch of what a scripted CPU experiment like that can look like a little further down.

While you're doing this, make sure to observe, document, and share the results with your team. This isn't just one engineer doing these tests; it's really everybody, and everybody has the potential to learn from it. So as you identify more issues with your systems and learn more about how they work through these experiments, make sure to document your findings and share them with your fellow engineers so that everybody can learn together. Ultimately, what this leads to is more resilient systems, which leads to greater reliability, which leads to better customer experiences. It's this cycle of test, fix, verify, and tweak that really gives chaos engineering its power and its popularity.

With experiments, you have a lot of flexibility in how you might approach them; there's not really a fixed set of experiments that you have to run. It really depends on your platform and your applications.
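Different chaos engineering tools expose this in different ways, but just to illustrate the idea of starting small and scaling up, here's a rough Python sketch that keeps a configurable number of CPU cores busy for a fixed time while you watch your dashboards. It's a generic stand-in, not any particular product's API, and it approximates "add X% CPU" by burning whole cores.

```python
import multiprocessing
import time

def burn_cpu(seconds: int) -> None:
    """Keep one core busy for `seconds` by spinning in a tight loop."""
    end = time.time() + seconds
    while time.time() < end:
        pass  # busy-wait on purpose

def run_cpu_experiment(workers: int, seconds: int) -> None:
    """Consume roughly `workers` cores for `seconds` while you observe the impact."""
    procs = [multiprocessing.Process(target=burn_cpu, args=(seconds,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    # Start small, observe, then increase the magnitude on later runs.
    run_cpu_experiment(workers=1, seconds=60)    # first run: one core
    # run_cpu_experiment(workers=2, seconds=60)  # later run: scale it up
```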
So just as some examples of experiments or hypotheses that you might come up with for serverless: what happens to your application when memory usage reaches 100%? How does adding 20 milliseconds of network latency to all network traffic, or to certain types of network traffic, impact the user experience? Does your application remain available when we terminate one of your worker nodes? If you're shutting down a server that has serverless applications running on it, do those applications get migrated to another node? Or, if you already have replicas running on another node, does traffic load balance over successfully? Again, if you lose a node or some part of your serverless platform, does the platform itself remain available? That's the key thing when you're running a serverless platform: the platform itself needs to be tolerant to failure, because if it goes down, that will have a cascading, domino effect on all of your applications.

Before I go on to the next slide, I will note that different chaos engineering tools support different types of experiments. So it depends on the tool that you're using and even the platform that you're running on; some cloud platforms don't allow certain types of experiments, and that's just a security thing. This is just to give you a general idea of how you can approach experimentation. Generally speaking, you have a lot more power and a lot more flexibility over what types of experiments you run when you're running the serverless platform yourself, as opposed to using a cloud platform provider.

So just to get into the weeds a bit, let's walk through one possible experiment. To set the baseline, one of the most important aspects of distributed platforms like Kubernetes and Knative is that they can maintain a minimum number of container or pod replicas. So if we deploy a serverless application with a minimum of at least one replica, and all of our replicas fail or crash, Knative should detect this and automatically deploy a new replica. That becomes our hypothesis. What we would expect is that this happens in a few seconds at most, and automatically, so zero intervention is required from developers, SREs, or operators. If we translate this into an actual experiment, we'll say: when we terminate all of our application pods and monitor Knative, or whatever observability tool we're using to monitor Kubernetes and Knative, we should see our application terminate and then come back up automatically without any intervention. And we can even set a success metric for this: we consider it a successful test if Knative automatically deploys a new replica with minimal downtime, and this can be, let's say, 30 seconds. That proves to us that we can trust Knative to keep our application resilient in a live environment and not significantly impact customers.

And we can test this even in pre-production. Let's say we have our serverless platform up in test or staging or QA or whatever environment you call it. You can use your chaos engineering tool to run this experiment, to actually terminate one of your nodes or your application pods, and keep an eye on your observability tool to make sure they come back up. And since you've already tested this in pre-production, you can feel confident that if you happen to lose your pods in production, Knative will be able to jump in and bring them back up for you. A bare-bones scripted version of this experiment is sketched just below.
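If you wanted to script a bare-bones version of that experiment yourself rather than use a chaos engineering tool, a sketch with the official Kubernetes Python client might look like the following. The namespace and label selector are hypothetical, and a real tool would add safeguards like automatically halting the experiment.

```python
import time
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "demo"                 # hypothetical namespace
SELECTOR = "app=hello-serverless"  # hypothetical label on our serverless pods

def ready_pods() -> int:
    """Count pods matching the selector whose Ready condition is True."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR)
    return sum(
        1 for p in pods.items
        if p.status.conditions and any(
            c.type == "Ready" and c.status == "True" for c in p.status.conditions
        )
    )

# Hypothesis: after terminating every replica, a new one becomes Ready
# within our 30-second success metric.
start = time.time()
for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items:
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)

while ready_pods() == 0:
    if time.time() - start > 30:
        raise SystemExit("FAIL: no replica became Ready within 30 seconds")
    time.sleep(1)

print(f"PASS: a new replica was Ready after {time.time() - start:.1f} seconds")
```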
Now, if this fails, if it doesn't meet our hypothesis, you need to pinpoint the root cause so that you can implement a fix. The good thing is that once the fix is in, you can repeat this experiment to verify that it works. And once it does work, once your fixes are effective, you can trust that the fix will also work in production and that it will keep your applications running even if some of your replicas were to fail. As long as our cluster is relatively stable, as long as it remains generally the same, we can feel confident that our application will keep working. But our systems don't remain stable forever. They change: settings and configurations and applications change. So we might want to periodically repeat this experiment to make sure that our applications always remain resilient, even as we deploy new code or scale them up or scale them down or whatever it might be.

And just to give you some ideas of other experiments and tests you can run: this isn't an exhaustive list, but hopefully it gives you some good ideas or helps you start thinking about some of the types of scenarios you might run into. If we think about the application layer as a developer: can our application scale quickly, and can it scale consistently? We can run an experiment where we consume extra CPU or RAM within the application itself, or within the container or the pod or whatever it might be, and see whether it's still responsive. Does our application automatically restart after it crashes? Well, we can run something like a pod shutdown or a pod crash and observe our application or Knative to make sure that another pod gets deployed to replace it. Are our health checks working the way we expect them to? We can test this by blocking network traffic to the endpoint that we're using for health checks, and then see if Knative or Kubernetes detects that the health check isn't responding, or that it's failing, and restarts our application automatically.

If we're on the platform team, we might look into experiments like: is our Kubernetes or Knative cluster highly available? We can test this by shutting down, or dropping network traffic to, one of the nodes in our control plane. Remember, the control plane is responsible for maintaining the state of the cluster, so if the cluster is highly available, losing one of those nodes shouldn't take the cluster down. Is our monitoring or observability set up correctly? We can test this by consuming CPU or RAM across several of our nodes, maybe adding network latency, or terminating certain workloads; whatever it might be, it's changing the state of our cluster. And if we watch our monitoring tools, our dashboards, and our alerts, we'll be able to make sure those are set up correctly to track the metrics that are relevant for us.

Can we scale our cluster without impacting our applications? And by scaling, I don't just mean scaling up; it's not just adding nodes to increase capacity. It's also scaling down to save money or to reduce capacity you don't need, and scaling down could result in applications, like serverless applications, actually being terminated when we don't mean them to be. We can make sure that if we add a worker node, by consuming CPU or RAM to trigger autoscaling, or if we terminate a worker node to scale back down, neither of those actually impacts our applications in a noticeable way.
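One other small thing that helps with experiments like these: keep a simple external prober hitting your service for the whole run, so you can see availability from the user's point of view rather than only from inside the cluster. Here's a minimal Python sketch; the URL is a hypothetical placeholder.

```python
import time
import requests

URL = "https://hello.demo.example.com/healthz"  # hypothetical health endpoint

checks = 60
failures = 0
for _ in range(checks):
    try:
        ok = requests.get(URL, timeout=2).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        failures += 1
    time.sleep(1)  # one check per second for the duration of the experiment

availability = 100 * (checks - failures) / checks
print(f"Observed availability during the experiment: {availability:.1f}%")
```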
And lastly, when we think about our infrastructure, the actual environment or foundation that all of this is running on: is our infrastructure redundant? Well, let's shut down a worker node in our cluster, or a zone, or a region, and see whether our applications keep running. Is our zone or region failover set up correctly? Let's say we have a disaster recovery plan where, if one of our zones fails, we fail over to a backup zone. We can test this by dropping network traffic to our primary zone, to our entire primary zone, and see if our traffic fails over to our backup zone.

And then, this is actually a non-technical test, but non-technical tests are something we can cover with chaos engineering too, so it's important to include them: we can test whether or not our incident response runbooks are up to date. The way we do this is by reproducing an incident that we might have experienced in the past, and then having our team respond to it as if it was a real incident. So for example, a zone outage is something that a lot of teams need to prepare for; they need to have a runbook or a plan to respond to it. We can repeat our zone failover test by dropping network traffic to our primary zone, and then have our engineers actually debug, troubleshoot, and respond to that as if it was a real incident. At Gremlin, we actually call these game days. A game day is where you bring your entire team together, or a specific team together, to run experiments, observe and triage them as a team, and learn and practice responding to similar incidents if they happen in production. But there are also what we call fire drills, which are essentially game days where the team isn't notified ahead of time. They treat it as if it was a real issue. That's a more surefire way, no pun intended, to verify that your incident response runbooks are working, but it's also a bit more chaotic, so you just need to find that balance depending on how comfortable your team is with running these types of tests.

Many of these tests aren't specific to Kubernetes and Knative either. For example, health checks and auto-restart policies also apply to other container runtimes, orchestration tools, cloud platforms, infrastructure automation tools, and monitoring tools. And I know that all of this might seem daunting at first, thinking of all these different possible scenarios for each of the layers of your stack. But when you start running these reliability tests, uncovering failure modes, making improvements, and then validating the results, the risk of outages drops significantly. Your team becomes more comfortable thinking about different types of failure modes and more able to respond to them effectively. So it doesn't just empower your engineers; it also makes your customers happier and your systems more resilient.

I know we covered a lot today, but if there's anything you should take away from this presentation, it's this. Reliable systems are those that we can trust to stay up and running, and the way we achieve reliability is by making our systems more resilient. In other words, improving a system's resiliency leads to greater reliability. The more complex the system is, the higher its risk of failure.
Serverless platforms add a layer between our applications and our infrastructure, which creates a lot of complexity and abstraction and introduces a bunch of unknown variables, which makes it harder for us to account for the different types of failure modes. Chaos engineering is a method of testing reliability by simulating unfavorable conditions like broken network connections, node failures, zone or region outages, and runaway processes. By running chaos engineering experiments, you can verify that your serverless infrastructure, from the applications to the platform down to the hardware itself, is resilient to conditions that can impact it in the real world. The idea is to proactively consider these problems so that they don't happen in production, which, again, is ultimately about making your users happy.

So that's the meat of my webinar. I want to thank you all for joining me today; I hope this was informative. Again, my name is Andre Newman. I'm a product specialist at Gremlin. Feel free to say hi to me on Twitter or shoot me an email. I'm always up for chatting about reliability and systems and just geeking out over tech trends in general. That concludes my webinar, so let's get started with Q&A. Feel free to enter any questions in the Q&A box and I'll be happy to answer them.

Andre, we've got a couple of questions in. The first one is: does Google Cloud Run allow running these serverless chaos experiments?

I can't say specifically for Google Cloud Run. I don't believe they have their own chaos engineering tool or fault injection tool. But there are some chaos engineering tools out there that can inject these types of faults at the application level. So in this case, instead of running it on Google Cloud Run itself, you would deploy your application to Cloud Run and then inject something like an unhandled exception or application latency directly into your code. So that is an alternative that will work across different providers, or really on any sort of system or platform, but it depends on the chaos engineering tool.

Okay. What tools let you do serverless testing?

So again, I can't really speak to specific tools, but there are tools out there that can inject faults into applications as well as into the platform itself. You just have to see which platforms they support. The good thing about the platform-based tools is that some of them will let you use your cloud provider's API so they can actually hook into AWS or Google Cloud or Azure. But we're also seeing some providers build their own chaos engineering tools. AWS has a tool where you can inject faults into some of their services, and I believe Azure does too, or they're coming out with one. And there are also a bunch of chaos engineering projects that are sponsored by the CNCF, or that are CNCF projects you can look into, that might support some of those cases too.

Okay, great. Malcolm's asking: how seamlessly does Gremlin integrate with Azure?

Well, we're more of a platform-based product. Essentially, as long as you're running a virtual machine instance or Kubernetes or a container runtime like Docker or containerd, you'd be able to use Gremlin. With our product specifically, we use agents to inject the actual failures, and agents can really only run on those platforms, for now anyway. We're looking into ways to better support serverless.
But as long as you're running a virtual machine instance or Kubernetes or Docker or something similar, then Gremlin will be able to integrate with it.

Okay, great. And a question from Anup: when should we start chaos engineering in a product development life cycle?

Ooh, that is a great question. The short answer is: as soon as you have something deployed into a running environment. So let's say you have code that you put into a test or staging environment and you're able to run your integration tests on it. That's when you should start running your chaos engineering tool. In short, you want to get your systems into a position that's as production-like as possible, so that you get more accurate results that better reflect production. So even if you have just a basic testing or QA environment, you can get in there, start running chaos experiments, and start learning about how your system responds to those experiments so you can make those changes before they bite you in production. And the goal is to eventually run chaos experiments in production, which I know is a terrifying concept for a lot of engineers, including me; I've done engineering and QA in the past. But ultimately, production is where your customers are. It's where you'll get the most realistic and the most relevant results. So my recommendation is: start as early as you can, and move as far right into the process as you can.

Great. And from VJ: is the Gremlin ecosystem only a software-as-a-service model, or can it be on-prem too?

We are only SaaS right now. We do offer ways to better support on-prem; for example, we know that if you're running on-prem you might have really strict firewall rules, so we do what we can to support that from a security perspective. But as far as being able to actually use our tool and run experiments, it's SaaS only.

Okay, great. And what do you do about serverless platforms that you don't control, like AWS Lambda?

Honestly, there's not much that you can do; you're kind of at the mercy of the platform provider. The good news is that most, if not all, platform providers are also interested in reliability, because they know that if your systems go down, you might migrate away from them, and they don't want to lose that money. So they'll often provide tips and resources, or even frameworks, for better handling failures on your own. For example, AWS has their Well-Architected Framework, which includes an entire section on reliability and covers things like setting up redundancy, planning for disaster recovery, setting reliability goals, and improving reliability for their own tools and services. So basically, do what you can with your own code and your own applications, and then check with your provider to see if there's anything they can do to help you become more resilient on their platform.

Okay, and from Malcolm: do you have any leadership-focused videos on why they would want to get started with chaos engineering, containing a low-risk, high-reward story that can sell the sizzle of chaos engineering while laying out a roadmap so they can get started in a low-risk manner?

So there's a lot to that question; give me a second to digest it. I will say that we have a bunch of videos on our YouTube channel if you want to check it out; youtube.com/GremlinInc, I believe, is the URL. We do have a lot of talks that we offer on the principles and practices of chaos engineering.
And some of those talks do involve our own customers and folks who have used our product in an enterprise environment, so that might be helpful for you. I can't think of a specific video we have that would highlight exactly that, but if you check our website we also have some case studies with our customers that could help. I see you're looking for a series of three-minute videos; those case studies might be helpful for you. If you go to our website, gremlin.com, there may be information there that you could use.

If anybody else has any questions for Andre, you can put them in the chat or the Q&A. Andre, you do have another question in here about showing a demo using Gremlin.

I probably can't show a demo here, but if you do want to see our product you can go to gremlin.com/demo-center, or just go to our website, and you'll be able to find a bunch of demo videos there. Again, we also post those on YouTube, so if you go to our YouTube channel you can find them there too. I did see a question come up on recommended resources for learning Kubernetes fundamentals. I personally recommend the Kubernetes documentation site; it has a ton of great info. I know it's a lot, but once you're able to work through it, they also provide getting-started and basic materials there too, as well as training courses. I would recommend getting hands-on with it; at least, that's the best way for me to learn. But yeah, definitely check out the Kubernetes website; they have a lot of good resources there.

Okay, amazing. Well, thank you so much, Andre, for your time today, and thank you everyone for joining us. As a reminder, this recording will be up on the Linux Foundation's YouTube page. Thank you so much and we hope that you will join us for future webinars. Have a wonderful day.