Okay, hello everyone, thanks for joining today. My name is Michael. I'm a Senior Developer Evangelist at GitLab, and if you want to find me online, I'm dnsmichi, that's D-N-S-M-I-C-H-I; Michi is the lovely version of Michael in German. But today it's not about me. Today I want to talk about what observability is, how we can gain confidence by introducing some chaos, and what we can do with Kubernetes around all of that. From my background, I was an open source monitoring tool maintainer for about a decade before joining GitLab, and I do many, many things with metrics, Prometheus, Grafana, kube-prometheus, and so on. In this talk we will also look into alerts, service level objectives, chaos engineering, maybe proving that it's always DNS, doing some chaos tracing, or vice versa, and then, depending on time, looking a little bit beyond observability. You might recognize that I love building Lego models, so maybe you can spot them all in my slides.

But before we dive in, let's think of an ops story for why we would need monitoring or observability. We have Kubernetes up and running, and we need to understand what we should be monitoring. There are the different components: nodes, pods, containers, deployments, services, the API server, data sources, and probably things I don't even know about yet. And then someone says: we need monitoring, because developers cannot work, deployments are broken. What should we be doing there? Is it availability monitoring? Should we be looking into performance and resources, identifying the load or the blocking deployments? The classic, traditional service monitoring approaches don't really work here. We have metrics, logs, maybe even more than that. Do we need to understand everything in order to observe or monitor it? And what are the best practices? It can be pretty overwhelming when you start for the first time, so it needs focus at some point.

In the first iteration, let's look at metrics. Within Kubernetes we have different data sources and service discovery. Luckily, in the CNCF ecosystem we have Prometheus, with the /metrics endpoint being provided and scraped, and a time series database built in. We can calculate trends, use dashboards, define service level objectives, and from there go into alerting, incident response, and fixing, and then come back and iterate. This is the overall Prometheus architecture for this talk; the nice part is how it integrates with the services within Kubernetes, and everything else builds on top of that. The other thing which is important to really get going is learning the Prometheus query language, called PromQL: getting the latest sample, calculating something, and even calling functions or doing comparisons, which is important later on for defining service level objectives and alerts. There is a lot to learn, so I've added links on my slides, which are already available on my website, and you can look everything up later asynchronously. It's similar for visualizing the data: there is the Prometheus UI which comes out of the box, and you can build dashboards in Grafana, for example. I've also seen that there is a new dashboards-as-code, GitOps-style project in development at the moment, called Perses. I don't know its current state, but I think it's worth following to see where it is going in the future, making our lives easier.
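To give a rough idea of those PromQL building blocks, here is a minimal rule-file sketch; the metric names come from cAdvisor, but the recording rule names and label matchers are illustrative assumptions:

```yaml
# A minimal Prometheus rule group sketching PromQL building blocks.
# Metric names are from cAdvisor; rule names and labels are illustrative.
groups:
  - name: promql-basics
    rules:
      # Latest sample: an instant vector selector with label matchers.
      - record: demo:pod_memory_bytes
        expr: 'container_memory_working_set_bytes{namespace="default"}'
      # Calculating something: per-second rate over a five-minute window.
      - record: demo:cpu_usage:rate5m
        expr: 'rate(container_cpu_usage_seconds_total{namespace="default"}[5m])'
      # Calling functions and aggregating: total CPU usage per pod.
      - record: demo:cpu_usage:sum_by_pod
        expr: 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'
```

Comparisons, such as `container_memory_working_set_bytes > 10 * 1024 * 1024`, then become the basis for the alerting rules discussed next.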
Speaking of making our lives easier, I think the Prometheus Operator and kube-prometheus are a pretty good start. They provide everything out of the box, and the great thing is you also get dashboards for seeing and learning, which means getting started is easier and we can see what is going on. The other thing is that when you see something, we again have a lot of things to analyze. Is it green now? Is it okay? Is it healthy? What is going on? It's nice to have all these views, node status, resource usage, deployments, and different other things, but it can be a little too much to look at all of that data at once. For example, with container metrics and the different integrations: should I be looking at the system CPU usage, or maybe the pod memory usage? Another example is getting the health of the deployments with kube-state-metrics; there are different ways to actually use it, and while it nicely comes out of the box, we really need to figure out what is going on with it.

So the idea is to say: metrics are great, but what's next? First, the definition of failure, the what. Then do something, because a threshold was violated or a specific rule was matched, and raise alerts. Then the who, the responsible persona or team. And then the how, which can be documentation for incidents, runbooks, corrective actions, and so on. That also means reducing the mean time to respond, or the mean time to resolve, depending on the definition. And it got me thinking: with Prometheus, we can define Prometheus alerting rules and send the alerts to the Alertmanager, which gives us grouping, inhibition, silences, and different transports. That's really awesome, and it's provided out of the box for Kubernetes itself. I've also found a website called awesome-prometheus-alerts.grep.to which provides additional alerting rules that you can integrate, via the Prometheus Operator, into the monitoring of the Kubernetes cluster and get these alerts ready to fire. Integrating them into the Prometheus Operator can be done with the PrometheusRule custom resource definition; it just wraps the rules, so you don't use two different configuration formats, which I think is great, especially for beginners who then don't need to learn different things.

In terms of an alert receiver, we can use chat, we can document incidents in ticket and issue systems, maybe even mailing lists. But at some point it can become overwhelming, and the so-called alert fatigue comes up: I have 10,000 alerts, I have no idea what's going on, and basically I either mark everything as read or leave it somewhere while the counter goes up. Maybe we need some way of managing these incidents better. Now, how would I get an alert in the first place? The visual way is looking at the Alertmanager and Prometheus and seeing that something is broken, but I don't want to do that manually. So I thought: maybe I can break something in a fun way, like installing Kube DOOM in production and killing some pods. It's probably not applicable for work, or might not be allowed, but it can still be a way of learning and breaking things in a fun way. To get more serious: simulating a production incident is really hard.
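As a concrete illustration of wrapping an alerting rule in the PrometheusRule CRD, here is a hedged sketch; the alert name, labels, and the 10 MB threshold mirror the demo later in this talk, but are otherwise illustrative assumptions:

```yaml
# A sketch of a PrometheusRule for the Prometheus Operator. The alert name,
# labels and 10 MB threshold are illustrative and mirror the later demo.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: container-memory-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match your Prometheus ruleSelector
spec:
  groups:
    - name: container.memory
      rules:
        - alert: ContainerMemoryLimit
          # Fire when a container's working set exceeds 10 MB for one minute.
          expr: 'container_memory_working_set_bytes{namespace="default"} > 10 * 1024 * 1024'
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory above 10 MB"
```

The community rules from awesome-prometheus-alerts can be wrapped the same way, so everything stays in one configuration style.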
You might have a staging environment, or you might not, and maybe we can add automated chaos and break things in a professional way in order to trigger the alerts, verify the service level objectives, iterate on that, and take corrective actions based on what we're seeing. This brings me to the idea of combining observability with chaos engineering and running so-called chaos experiments. The great thing is that within the cloud native ecosystem, for our clusters and deployments, there are existing chaos frameworks as open source projects. They allow you to define experiments, and if you want to extend everything, you have the instrumentation as well. Or you do it the German way of chaos engineering, on the right-hand side; just kidding. One example I found is Chaos Mesh, an open source project in the CNCF landscape which allows you to fail specific things in the Kubernetes cluster or on hosts: failing a pod or failing the network and seeing how the application behaves and whether the entire deployment breaks, or breaking HTTP, not just the responses but also injecting headers, and seeing what's going on. If you have scheduling or time-dependent things inside, breaking time and NTP is also a good idea to see what happens. And with DNS, where probably everything is a DNS problem anyway, it's interesting to inject failures or just random responses and see how everything goes nuts, or maybe even continues to work. Running chaos experiments can be done once in Chaos Mesh, or you define a schedule, like running the experiment every week in the morning. You also shouldn't just run it and then say, okay, I've run it, what's next; you should take action and inform the teams which might be affected.

If we want to generate some chaos, we can start with killing some pods (a minimal Chaos Mesh definition for this is sketched at the end of this section), and this is a nice way of seeing something happen. On the other hand, when the pod ends up in a CrashLoopBackOff, it tries to heal itself. So is this the correct chaos to simulate failure in a cluster? Maybe we can find something else. Thinking of a more real-world example, I came back to DNS. In a previous project, roughly six years ago, we had an application which was creating a buffer, then doing some DNS resolution, then making some connections, and when everything worked, it closed the connection again. The thing was: only when DNS was failing did it leak the memory buffer. So one megabyte leaked every second, but only when DNS was failing. We as developers couldn't really reproduce that on our systems; we didn't have a staging environment, we only had our dev machines, and once it was released to the customers in production, it wasn't really fun to debug. Really, really late we figured out that it was actually DNS which caused the problem. I think this is a nice scenario for verifying whether everything is working in an environment. So the question is: how can I break DNS? A simple way is something I learned at a recent meetup of ours: just scale CoreDNS to zero replicas. That kind of works for achieving the result, but it breaks everything else. So I thought: maybe we can do that with Chaos Mesh in a controlled way.
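For reference, the simple pod-kill experiment mentioned above looks roughly like this as a Chaos Mesh definition; the namespace and label selector are illustrative assumptions:

```yaml
# A minimal Chaos Mesh PodChaos sketch: kill one random matching pod.
# Namespace and labels are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill        # other actions include pod-failure, container-kill
  mode: one               # pick one random pod from the matching set
  selector:
    namespaces:
      - default
    labelSelectors:
      app: demo-app
```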
And Chaos Mesh provides a type for chaos experiments called DNSChaos, where you can, for example, define the action, which should be error here, returning an error for any DNS query, and then define namespace selectors and different patterns specifying which domains should fail. For this event, I've added events.linuxfoundation.org and some others, and Chaos Mesh also allows you to preview what will be affected. And if the WiFi is working, I want to try that now, live, and see if we can actually break things. Just to get an idea, I have some pods running here, and they are basically doing not much except trying to allocate a buffer and doing the DNS resolution. Everything else is just working fine; we're getting some IP addresses, even IPv6, and it's working. The other thing I need to do is access the Chaos Mesh dashboard, and later on Prometheus and the Alertmanager. When you open up Chaos Mesh, it greets you with an overview, nice tutorials, and so on. The thing I want to do now is create a schedule, or actually look into an existing schedule. The nice thing is you can take this definition in its entirety and just apply it manually as a Kubernetes YAML definition, or create a new schedule, which we can do now, and import it from YAML, which means you can either use the graphical UI or you go with copy and paste. I potentially need to rename it because it already exists; I can change that. The schedule runs every minute and the duration is 60 seconds, which means it runs all the time. This example is not production-ready, because it would break everything all the time. But, oops, not a good idea, I should be submitting at the bottom, and then it allows me to verify what will happen. It again shows which domain patterns will be affected, and I can also see a preview, in the default namespace for example, of which pods will be affected. And at some point there is, I'm doing too many demos, the Open Observability Day pod over here. When I submit this, I can actually start it. Yes, I want to confirm this, because, as I said, this is a breaking operation. And at some point we should see some results.

Now, the thing I really want to see is the container memory usage within Prometheus. Let me see, I need to cheat, I don't know the PromQL query off the top of my head, and we want to see it as a graph. Right now the memory looks okay, it's just 200K or something like that, but at some point we should see the memory going up. Let's see what happens. Ah, okay: "host not found, non-authoritative, try again later" means we are already injecting the DNS failure. So at some point we should already be leaking some memory, which hopefully, yeah, it goes up and up and up. I've already defined a Prometheus alert for this: when the memory usage goes beyond 10 megabytes, it will trigger. That is a different container, but hopefully this will work in the meantime. Let's see, we have the Alertmanager over here, and at some point we should see the container memory limit alert coming up. If not: backup screenshot. The idea is really to trigger the alert, see it, detect it, and be able to say: this is helpful information for me.
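For those trying this later, the schedule from the demo looks roughly like this as a Kubernetes YAML definition. events.linuxfoundation.org is the pattern from the talk; the resource name, the extra pattern, and the selector are illustrative assumptions:

```yaml
# A sketch of the demo's DNSChaos schedule: inject DNS errors every minute
# for 60 seconds, i.e. effectively all the time. Not production-ready.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: dns-chaos-schedule
  namespace: default
spec:
  schedule: "* * * * *"     # every minute
  concurrencyPolicy: Forbid
  historyLimit: 2
  type: DNSChaos
  dnsChaos:
    action: error           # answer matching queries with a DNS error
    mode: all
    duration: "60s"
    patterns:
      - events.linuxfoundation.org
      - "*.example.com"     # illustrative extra pattern
    selector:
      namespaces:
        - default
```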
Maybe something is going on with the memory leak, which in my environment can only be triggered by DNS chaos engineering. Let's see if we're able to trigger it or not. The problem with demos is that one thing works one minute, the other one works the next minute, and this adds up. But let's see how much memory we're already leaking. A little bit of time... and, oh yeah, okay, the container memory limit alert has been hit. And which one was it? One of the pods already. So this is now the alert, and if we had defined some service level objectives, they would also have been violated. So this can be one way of injecting chaos and seeing whether everything works or doesn't. The demo application and all the examples are available online as open source, so if you want to use them in your own environments, you can totally do that. What I want to continue with, beyond the demo: this is just the alert definition again, documented for you to maybe try later on.

Now that we have generated some alerts and potentially have some red dashboards, think about optimizing the alert count: grouping, additional context in the alerts, and also focusing the dashboards. Use what is already available, correlate data, and often reduce the amount of visible data, so that teams who get paged, potentially at 3 a.m., really see what's going on and how to fix it fast, because otherwise it's really not fun. To gain confidence here, we can use the metrics from Prometheus, define the alerts as PromQL queries, and have the service level objectives; this is a step-by-step storyline. The other important thing is to see what is going on and start with the golden signals: latency, traffic, errors, and saturation. This helps ops teams, DevOps teams, SREs, it doesn't really matter. Avoid having too many dashboards and alerts, keep learning and documenting, and also ensure that onboarding for new team members works, because it can be rather overwhelming; the goal should be to immediately see what is important and to reduce the overall mean time to respond.

If you want to customize kube-prometheus, you need to learn Jsonnet for that, but I find it rather easy, or at least well documented, to learn. You can develop your own rules and dashboards, for example by monitoring other namespaces and adding applications. You can also create custom dashboards: add a data source in Grafana for Prometheus, add a panel, add a dashboard, and even automate everything instead of creating it all by hand. The other thing which is great is the ServiceMonitor custom resource definition provided by the Prometheus Operator, which allows you to monitor your own applications, and existing applications, with Prometheus via their metrics endpoint. The TL;DR basically is: deploy an application that has a metrics endpoint, add the ServiceMonitor to the deployment, and get even more observability data out of the box. But this was a lot about metrics, and observability is more than that; we've already heard today that there are six types of events, or three, or even more than that.
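To make the ServiceMonitor TL;DR concrete, here is a minimal sketch; the label names, port name, and namespaces are illustrative assumptions:

```yaml
# A minimal ServiceMonitor sketch for the Prometheus Operator: scrape every
# Service labeled app=demo-app on its named "metrics" port. Labels, port
# name and namespaces are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: demo-app
  endpoints:
    - port: metrics      # the Service port exposing /metrics
      interval: 30s
```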
Potentially we will be seeing a lot more types in the future, and that is observability in a way: collecting all the data so you can ask the questions later, detecting the unknowns. For logs, there are a lot of decisions to make, so it's really hard to answer questions like: what types of things do we want to see? How long do we want to store them? Is it helpful to ship all the pod logs to a central location and store petabytes of data with a retention time of one year because of compliance and things like that? It's really rough. With tracing, we get a different perspective: adding spans with a start and end time, more context, more metadata. The problem on the other side is that it needs code changes; developers need to add it to the code. Auto-instrumentation is another thing coming up which makes it easier, but it's still challenging. The great thing is that within the community, going beyond vendors, there is work on a specification and framework with OpenTelemetry: having the collector, bringing your own backends, Jaeger for traces, for example, and Prometheus for metrics, having client libraries, and everyone working together to provide the best support for the different languages, scenarios, and environments. Focusing on traces in Kubernetes: the components themselves, I think that's work in progress, are sending traces, and so are the applications.

So this is a great way to look at the different observability data types. One thing I recently did was try out the OpenTelemetry webserver SDK to instrument NGINX and Apache: a client sends an HTTP request to the server, the backend does something, and the server sends the client a response. And it got me thinking: can we maybe add a chaos experiment to that, like slowing down the response? I could add a sleep in my code, deploy, and see what happens, but that isn't really the best idea. A better idea is to use a chaos experiment for HTTP, for the network, or for CPU and memory stress tests, to see what is going on. I reproduced one of the first things with Chaos Mesh: stress-testing the CPU and memory of a deployed NGINX container, which sends its traces to Jaeger, and the request time increased over time (a rough sketch of that stress test follows at the end of this section). For me this was the five-minute success to get going and to really think about what else could be possible with chaos engineering and traces, and maybe even more observability data in the future.

Some other ideas which are always at the top of my head: exemplars, linking metrics with traces to be able to correlate and debug more, and combining that somehow with chaos engineering. Another thought, which I've seen at KubeCon EU, is aggregated trace metrics: being able to create metrics from traces in OpenTelemetry, and again thinking about how we can trigger that with chaos engineering, combined with Kubernetes system component tracing. Okay, lots of things, but there is even more than that. I'm currently trying to learn eBPF, understanding what it actually does and how it can help. There are great ways to think about collecting observability data differently, and great tools for getting started. But one of my questions was: how does this complement or fit in with Prometheus metrics? How does it fit with traces? Can we do the same: defining service level objectives and alerts, and then adding some chaos engineering?
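Coming back to the NGINX stress test mentioned above, a rough Chaos Mesh sketch of it could look like this; the selector labels, duration, and stressor sizes are illustrative assumptions:

```yaml
# A sketch of the NGINX stress test: Chaos Mesh StressChaos loading CPU and
# memory on one matching pod. Labels, duration and sizes are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: nginx-stress-example
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: nginx
  duration: "5m"
  stressors:
    cpu:
      workers: 2       # two busy-loop workers
      load: 90         # target roughly 90% load per worker
    memory:
      workers: 1
      size: "256MB"    # allocate up to 256 MB
```

With the webserver instrumentation in place, the effect should then show up as increased request durations in the traces sent to Jaeger.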
I've also seen that Cilium Tetragon was open sourced at KubeCon Europe, and I thought: we should combine that somehow. It needs an open standard, a way to handle the different observability data types. Since my talk is too long already, I will be looking into this next year, and I'm hoping to find answers together and learn together in the open. Last but not least, looking from the security perspective into observability and chaos engineering: it could, for example, be an idea for a supply chain attack to create a chaos experiment that downloads and installs malicious software, somehow combined with HTTP chaos or something like that. That could also be an interesting approach to explore in the future. Let's see about this.

Now, finally, gaining confidence, and maybe building some Lego in between. There are different types of chaos experiments for SRE, dev, DevOps, and maybe even DevSecOps: overloading the CPU, failing DNS, clients that are not closing TLS connections, container pulls not succeeding because of rate limits, and even breaking security policies. So there are many, many ways to break things, but you should also know the limits of chaos, like avoiding chaos inception. What I mean by that is: don't run all of the chaos experiments all the time, because it could break existing workflows or page teams, and things like that. Maybe think of starting with a staging environment to prevent data loss, because running a chaos experiment could also cause, at a certain point, a failed database write, and then something really goes wrong. Also, chaos engineering doesn't solve all the reliability issues, but it can help bring new perspectives into what is going on, and the simulated production incident becomes a little more like reality.

If you want to do it continuously, for example within CI/CD, I would love to have that out of the box in the future, and potentially we will all build this in some way: having feedback in a merge request before something even gets merged to the main branch and later released, letting developers who have never been on-call use and benefit from observability and chaos engineering; a hypothetical pipeline sketch follows after this section. And with continuous delivery workflows, for example, running the chaos experiments in production, with some sort of rollback and ways to detect problems, could be seen as the red team for observability: either you announce it, or maybe you don't announce it and see what happens. This really depends on what is needed. As a recap, I would say bringing chaos into observability is super helpful; it can be a way to verify the alerts and service level objectives. The idea is to iterate and innovate, taking small steps, and also to think about what could be next. Many folks are talking about machine learning or MLOps, and maybe we can combine that in a good way, not like Skynet, but thinking of what could be next. Since this is a lot to learn and look into, I recently did a workshop, around three and a half hours, diving more into Kubernetes observability. I started o11y.love as a knowledge base, I'm writing my newsletter, and this slide deck is already available on my website. Okay, this was a lot. Thanks for listening, thanks for your attention, and I think we have time for questions.
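For the CI/CD idea mentioned above, a purely hypothetical .gitlab-ci.yml sketch could look like this; this is not an out-of-the-box feature, and the manifest path, image, and checks are illustrative assumptions:

```yaml
# A hypothetical merge-request chaos job; not an out-of-the-box feature.
# Manifest path, image and checks are illustrative assumptions.
chaos-experiment:
  stage: test
  image: bitnami/kubectl:latest
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    # Run a time-boxed experiment against the staging/review cluster.
    - kubectl apply -f chaos/dns-error.yaml
    - sleep 120
    # Very rough health check; replace with real SLO/alert verification.
    - kubectl -n default get pods
  after_script:
    # Always clean up the experiment, even if the job fails.
    - kubectl delete -f chaos/dns-error.yaml --ignore-not-found
```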
And yes, I think we might have some time for questions. Any questions for Michael? Yes.

Yeah, Michael, fantastic talk, thank you. I just wanted to thank you for mentioning Perses. We are driving Perses, so if you want to find out more, just talk to me, or come to the Chronosphere booth at G15.

Perfect, thanks.

Great. Any question? Yes.

Thank you for your talk. I was wondering about the Chaos Mesh program that you showed here. I'm not familiar with it at all, but it basically messes with whatever you want it to mess with. Does it run in a rootless environment, or do you need elevated privileges to run it?

I think it works in a rootless environment. The thing you need to configure, for example, and I was running it on Civo Cloud, is exposing the containerd socket so it can inject certain things in between; I would need to look up exactly what it needs in the documentation. The thing is, when I started trying it out, I think in April or March this year, using the Helm chart, just installing it, the first success was really a matter of 10 or 15 minutes, and I really liked this first-time experience of getting started; I kind of felt addicted. The thing to mention is that it's still somewhat like being root on the cluster. So you should be using RBAC, and there is also authentication and authorization for Chaos Mesh which you can configure, to ensure that not everyone can run a chaos experiment, because then it's like: I'm breaking something because I think it's funny, or maybe I'm a malicious attacker. So really also make sure that the security for chaos engineering is in place.

Great, thank you.

Okay, any other question for Michael? By the way, I think I don't need chaos engineering in my production clusters; I have chaos built in. It's kind of next level to have that experience: once everything is cleaned up and done on the clusters, I can play with chaos engineering on top of it.

Yeah, I was just wondering, with chaos engineering, do you regularly just run it as part of a pipeline? What's your process for executing chaos engineering? Is it just for SRE disaster scenarios, or do you run it on a regular basis? Do you run it on each deployment? How do you implement that as part of your workflow?

I would say there are different use cases. I know that, from the GitLab perspective, our SREs are looking into ways of running chaos engineering, or chaos tests, for the production environment on the SaaS platform, but I'm also thinking about how we can enable, for example, developers, really everyone, to use it in the CI/CD workflow. The thing is, running it continuously needs documentation and defined workflows: what are we doing with the results of that chaos experiment? You also have to watch the cost, because you can end up with a $10,000 cloud bill due to chaos engineering, which is something you should prevent at all costs. So it really needs some time to try it out and figure out whether it fits into your systems. If the experiment can be, for example, a deployment into a staging environment from a merge request or pull request, with CI/CD and GitOps and whatnot, this can be helpful.
If it takes too long, if it needs to run five minutes and then another five minutes and so on, maybe do it more on a canary deployment basis, maybe on a different branch, not the release branch, and move it out of the regular feature branch development. But I think it can be helpful to really see something in advance as a developer, and not have to fix it in production and burn out because customers are calling.

Okay, thank you. I think other questions can wait; you can grab Michael in the corridor. Thank you very much.