Hi, everyone. Nice to finally meet you in person. I'm a little nervous, but I'm really happy to be here and to talk a bit about going from monitoring to observability: left-shift your SLOs with chaos. I'm Michael, a senior developer at GitLab. On the internet you can find me as dnsmichi; Michi is the affectionate short form of Michael in German, and I figured out later that it doesn't really work in English. Please connect if you want to. Now let's dive in and start with an SRE tale. Let's build some Lego as well.

A while ago, like ten years ago, we had things like black-box monitoring, state changes, measuring SLAs, reporting over time. At a certain point metrics were added, along with trending, service level objectives were defined, and we moved into things like white-box monitoring. One of the things that inspired me in my past life as an open source monitoring maintainer was the Prometheus metrics endpoint being added to Docker; it made it much easier and more convenient to look inside the application. From there we defined service level agreements, objectives, and indicators. There was much to learn and much to move on with: saying, hey, I want an availability of 99.5% in the agreement, the objective is set much higher, and we also need to define an indicator and the error budget we want to watch to see whether the SLO is actually violated. There are many terms to consider and figure out, like the golden signals, which help you immediately identify specific things going wrong, for example in your Kubernetes cluster or in your environment: latency, traffic, errors, and saturation. At a certain point you needed to instrument the code, making a code change to really export the metrics you wanted to see. The term site reliability engineering, SRE, was coined. Maybe it solved everything, maybe not, but in the end it was a really nice way to move forward and go in the metrics direction.

As a developer, on the other side: something goes wrong, you make a mistake, there is a bug. One of the stories I want to tell today: we had a monitoring system with a central server, satellites, a REST API, and JSON-RPC. It was not that fast, and we thought, well, let's just add more threads. The problem was that the CPU was then locked up, so maybe we should do something else. The application was written in C++. We looked at Go at the time and thought, well, let's use goroutines, and we found a library which implements goroutines in C++. It was stackless, putting the function pointer on the heap, with stack unwinding and continuations. It was pretty complex, but we were confident that this would solve our problems. Well, in reality, "works on my machine" doesn't mean it works in a large-scale environment. There was a crash, but only with 1,000 API clients. There was memory corruption. Sometimes memory was exhausted, so maybe a leak or something like that. And it behaved differently across operating systems: on Windows there were stack overflows which caused the crash, and when a security scanner was running, it was a different crash again. This was super hard to debug, and to be honest, it burned me out in 2019. And I thought, well, oops, maybe we could have done something about this. Or: how would this have looked if I had known about metrics and SLOs before? So, defining that the heap memory should meet the ops requirements:
We define the service level indicator as the memory usage level, for example, and the SLO as: it shouldn't increase by more than 10%, or whatever other number you pick. Meaning: whenever I make a change in my software and it reaches a merge request, CI/CD deploys it, I can measure it and get immediate feedback, and it never hits production. Or, with continuous delivery deploying to production, it immediately does a rollback when the violation is detected. And on the other side, the API clients and the connections could have used something like chaos engineering or fuzzing to really figure out whether the problem is being hit.

Another thing is thinking about switching roles into ops. My nickname is dnsmichi, and I keep saying it's always DNS; it's probably true. I was working at the University of Vienna back then, and there was a DNSSEC rollout for the .at domain. We had signing hardware, we had a state machine of steps. On a Friday afternoon the script was changed and deployed to production. The signing stopped: no DNS updates. So whenever you registered a domain under .at, nothing happened. Monitoring was in place, which meant we did monitor things; we had the serial and the offset defined. The first alarm came, I think, on Saturday at 3 AM via email, so I wasn't reading that. But at 4 AM there was a flood of SMS text messages from all the name servers involved, and I think there were 25 at the time. Every minute, 25 SMS is not fun. And after the fifth or whatever alarm, it wasn't really fun logging into a terminal and trying to figure out what was going wrong. We later learned that the change was persisted in Git and rolled into production, but back then we didn't really have CI/CD or quality gates or anything like that. So the idea, as a retrospective now, is to think of having a staging environment, having everything rolled out with GitOps or infrastructure as code persisted in Git, and verifying that the changes are tested and that nothing else goes wrong. The SLI, for example, could be the zone serial age. The SLO is a way of saying, hey, the zone should not be much older than one hour. And I thought of chaos engineering: maybe intercepting DNS traffic, denying zone updates, doing specific things software doesn't expect, and making improvements in that direction.

Another story, switching gears from dev to ops to DevOps: we are used to using containers, and one of the things which happened, I think it was in September 2020, was Docker announcing rate limits for Docker Hub. And we didn't know what was happening. Is my CI/CD pipeline not working because I cannot pull the containers? What about cloud native deployments rolling out your application in Kubernetes, where the first container pull works and the second doesn't: what is going on? And since the limit is based on an IP address, what happens to organizations behind NAT or on bigger cloud providers? So, well, we had a certain known state. We saw that there was an API with response headers containing the limits: simulate a pull, get the header response, parse it. Back then we also wrote a Prometheus exporter to be able to monitor that quickly. This was solvable. The unknown state was: maybe there was something in the logs.
Maybe there was something else telling me "too many requests", but I needed to dig deep down into a CI/CD pipeline interface or somewhere else. And the problem which could have occurred: the application or the shop is presenting a specific price, but only to a third of your customers. And they say, well, it's awesome, it's on sale. But actually, those who are seeing the new price think it's too expensive, and the others say, oh, it's actually cheap, and they buy. And then you really need to honor that price. And actually it was just because you couldn't pull the container in your production environment, because the rate limit was hit. Thinking about this, one could say: we define the rate limit as a service level indicator, and the objective is the remaining pull count, which should be at least 10, or in a bigger environment maybe 100 or 1,000 or something like that. And when the SLO is failing, we shouldn't even start a deployment. Because when you're looking at CI/CD deployments as a developer, nothing happens, and you don't know what to do, that shouldn't be needed, and it can be quite frustrating if nothing works.

Now, the idea is to define service level objectives. And the thing is: OK, what is that, and where do I start, whether as an SRE, a dev, ops, or DevOps? We're going into monitoring, we can go into metrics, we define keys and tags, we have values. But what's next? To get started more easily, I found Prometheus and PromQL pretty good: it allows you to really collect the metrics, query them, and combine them with logical operators. It is, I would say, a rather easy language to learn, and it's a way to define your service level objectives and verify they are not violated. The other thing to understand is: what are your metric sources? Classically, for infrastructure monitoring with memory and CPU, you might have a Prometheus exporter on the node, on the pod, on the cluster. For services, specific other Prometheus exporters. When you have a custom application, or your own application, you might need to instrument the code.

Defining the SLO with PromQL can be done, for example, using alerting rules, which then also trigger alerts, and which also allow you to define the errors that are allowed in the error budget. So when there is a specific point, say I'm pinging a service or using a probe exporter to verify that something is reachable over time, I really want to ensure that it stays at 99.9%, and when the SLO is violated I want to get an alert and the possibility to figure it out via an API or something else. I'll show a small sketch of what such a rule can look like in a moment. To try that out more easily, I've played around with things over the years and found: build a small application, build a container image, use the CI/CD container registry, use the Prometheus Operator and kube-prometheus to monitor the application inside the Kubernetes cluster, where the ServiceMonitor custom resource definition is super helpful (there's a small sketch of one below as well), and then you can get going and inspect the metrics with Prometheus, and in the future with OpenTelemetry, for example.

Now, my talk is also about left-shifting SLOs. Service level objectives and observability shouldn't sit just at the end of the DevOps lifecycle; we really want to see the value as a developer, while writing code.
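Before moving on, here is a minimal sketch of what such an alerting rule could look like for the 99.9% availability objective. It assumes the probe_success metric from the blackbox exporter; the job label, the one-hour window, and the thresholds are just placeholders you would adapt to your own setup:

groups:
  - name: availability-slo
    rules:
      - alert: AvailabilitySLOViolated
        # average probe success over the last hour drops below the 99.9% objective
        expr: avg_over_time(probe_success{job="blackbox"}[1h]) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Availability SLO of 99.9% violated for {{ $labels.instance }}"

The same pattern works for any SLI you can express as a PromQL query; the threshold is effectively where your error budget runs out.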
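And for the kube-prometheus part, a ServiceMonitor telling the Prometheus Operator to scrape your application is only a handful of lines. This is a sketch with hypothetical names: the app label and the metrics port just have to match whatever your Service actually exposes, and depending on your installation the Prometheus resource may require an extra label to pick the ServiceMonitor up:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics        # named port on the Service that serves /metrics
      interval: 30s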
So how can I benefit from service level objectives, similar to, for example, shifting security left? One of the ideas: we're using SLOs, we have Prometheus, we are calculating the alerts, and we have the PromQL rules, which can also be written in the OpenSLO format. That's a new specification which started, I think, last year in May and reached version 1.0 just last week or so. Using that knowledge, we can bring it into CI/CD, use environments, and do metrics monitoring. And one of the ideas was to have a quality gate: you monitor the SLOs with Prometheus and let Keptn, as an application, verify whether the SLO is violated or not. If it is violated, it blocks the deployment, and it never reaches production. Keptn itself works as a quality gate, but it is actually more than a quality gate: you can do more with it, ensuring a specific state is there. I would say it's basically an observability platform for continuous delivery as well. It has a UI where you can test certain things and certain workflows, and I would really encourage you to try it out and see whether a quality gate in your CI/CD pipeline makes sense to measure the SLOs. A rough sketch of such an SLO file follows below.

The thing which came up next was: I have the quality gate now, and we know how to use SLOs with Prometheus. But how could I simulate a production incident? Fail the database, fail the connection, fail DNS, and see what is going on. And so I learned about chaos engineering: not adding it only to the production environment, but also deploying a staging or testing environment from a merge request, letting it run for some minutes, generating some chaos (we'll come to that in a bit), and then seeing how the application behaves. Is it crashing? Is there some deadlock or something else? And if I really want to dive into monitoring and collecting metrics, I can also trigger alerts, I can integrate it into merge requests, I can use alerts and incident management, and so on.

So now it's left-shifting with chaos. And again: where do I start? On the right-hand side there is "German chaos"; I found it on the internet, pretty funny. I'm not from Germany, I'm from Austria, sorry about that. The thing is, I started with cloud and network, and then we have a cluster with Kubernetes where we can deploy workloads and deployments and so on. We can use a chaos framework which defines experiments, and it might have instrumentation SDKs where I can write my own chaos, my own rules for what should happen. I found these terms really helpful to keep in mind in order to keep going and dive deep into the tools and frameworks, which are also CNCF projects. One of them is Litmus Chaos, which I think I found last year, and we also had the folks in our Everyone Can Contribute cafe meetup a while ago. The idea is: you fail your infrastructure and cluster, you want to see how the application behaves, like I just said, you verify whether the service level objective is really still met, and from there you define actions and improvements. It provides a nice UI, the getting-started guide is awesome, so it's really like five minutes and then you can get going, and there is the ChaosHub with experiments and workflows. The community is amazing, helping each other and adding more things; you don't need to invent everything by yourself, which is a good thing in chaos engineering. And I really like the UI. So this is one of the tools to try out.
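Coming back to the Keptn quality gate for a second: the gate is driven by an SLO file, and a rough sketch for the memory story from earlier, failing the evaluation when memory usage grows by more than 10% compared to the previous good run, could look something like this. The SLI name memory_usage is hypothetical and would have to be defined in the accompanying SLI file, so please double-check the exact format against the Keptn documentation:

spec_version: "1.0"
comparison:
  compare_with: "single_result"
  aggregate_function: "avg"
objectives:
  - sli: memory_usage
    pass:
      - criteria:
          - "<=+10%"      # fail the gate if memory usage grows more than 10% against the last good evaluation
    weight: 1
total_score:
  pass: "90%"
  warning: "75%"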
The other chaos tool I found at CNCF was Chaos Mesh. I don't know which one is older or younger, but it provides similar functionality to Litmus: you can run chaos experiments just once or on a schedule. So you can say, hey, I want to introduce chaos in my cluster for just 30 seconds, and then again every five minutes, which is a different error pattern from failing just once, because the application might survive a single failure and just come back. And then I was thinking that DNS might always be the problem, and I heard from a friend that it's super interesting to turn off DNS in a Java environment, for example, and see what happens. I was like, OK, maybe we should try that as well. In the UI, Chaos Mesh also provides some sort of previewing and scheduling strategies and so on. So I would really encourage you to try it out too; I think it also takes a couple of minutes to install it using a Helm chart, then you start up the interface and keep going.

When I thought about how all the stories I told previously match with chaos engineering: for the SRE story it could be CPU overload or HTTP requests being blocked, something similar to the golden signals. For my nightmare as a developer with the many API clients, it could be something not closing a TLS connection correctly, or intercepting DNS, or something else. For ops, again something around DNS: it doesn't resolve, or it resolves to some funny IP addresses, or it only returns an IPv6 address, and then you see how things behave. And for the DevOps story, maybe a container registry proxy which does some sort of rate limiting, and then seeing how that actually behaves and defining the actions you want to take.

When you want to dive deeper into this, into your own chaos, there are experiment SDKs out there; I think Litmus has Go and Python, if I'm not mistaken. You can integrate that into CI/CD. There are a lot of tutorials and documentation out there, and I think Azure is using Chaos Mesh as well. The thing is, there are some limits with chaos engineering: it consumes more resources, and it might harm different teams and different workflows, so it can have an impact on the system as well. So don't try to enable chaos everywhere and then watch everything break. This needs some planning and maintenance, and also awareness that it is now being deployed in the systems. The other thing to keep in mind: it doesn't solve all the reliability issues, but it can bring in another perspective in order to really see what is going on and potentially fix it in the future.

Now, I thought about my own story with the failing connections, DNS failing, and the memory leak. So I hacked together a quick demo which uses some C++ code. It gets deployed to Kubernetes, kube-prometheus does all the monitoring, Chaos Mesh injects some chaos, and there is potential to use SLOs and alerts and so on. The image on the right-hand side I generated with a small tool. So in this case, what I'm trying now as a short demo: I have written a short application. Let me see. The application is actually a short piece of C++ code; you don't need to understand everything that's going on in here. The idea is to create a buffer, do some DNS resolving, and then error out.
When it's successful, it does something and then it deletes the buffer. So at this specific point there is a condition missing to delete the memory in all cases. The error is intentional, of course, for this demo, but it reminded me of where I was back then. What I've prepared already: this application is built in GitLab CI/CD as a Docker image, and the manifests have been deployed into Kubernetes already. And I should be navigating over here. I do have certain parts in here, and I'm hoping the Wi-Fi works. Does it or does it not? Probably not. Oh yeah, just a slight delay.

So the idea is to do DNS resolving for different domains, and I was just using o11y.love, cncf.io, and gitlab.com. Now, when we're inspecting the cncf.io ping, and it works, hopefully, yes: we are getting some results back and everything is just fine. And hopefully my terminal doesn't die. The thing I want to do now is inject some chaos into the cluster. To do that, we have Chaos Mesh and a scheduled workflow in there. I think we cannot edit it here, but the main idea is to create a DNS chaos as the chaos type and then define the schedule and the domains which I want to fail, o11y.love and so on. But basically I really want to fail DNS. (The YAML for this schedule is sketched below.) On the other side, I want to keep track of the memory, and this currently looks good; the containers are not consuming that much memory. But when we introduce chaos into the containers, we should see the memory going up, and this is the problem with live demos: hopefully it works. I can click on it now, yes, I want to do that. Oh, hmm, it doesn't like it. OK, then let's do something else: let's pause this and archive the schedule and create a new schedule. The cool thing is you can just upload the YAML file and submit it. Now it's pre-filled, it also runs continuously, and, let's see, I submit that and it's now running. OK, I probably should fix the duration of 60 seconds. I'm impatient, so I'm pressing reload.

The other thing I want to verify: OK, currently it's still resolving. No events found. That's the memory change over here. So in theory, if the experiment is starting, it should look like this: at a certain point the chaos experiment kicks in, the DNS configuration is set to fail (and not to random, though you could also say, hey, give random responses), the bug is hit, and I see the memory going up. To be honest, I didn't have time to add more to this demo, but it shows that you can really detect a programming mistake which only shows up over time when DNS is failing; otherwise it just works. I'm just quickly checking: if the live demo doesn't work, something happened. Oh yeah, actually nothing is working right now with DNS, at least for these domains, and maybe the Wi-Fi is more stable today, at least I hope so. OK, it's probably not working, but I will continue; we'll just pretend it worked. You can try it yourself: the slides are available on sched.com, and everything which is linked in the slides you can try later on. The repository linked at the bottom has everything documented in a readme, with all the descriptions of what I did, just for this short demo, to show that chaos engineering with DNS works. It's a public GitLab repository.
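For reference, the scheduled DNS chaos from the demo is just a small YAML document you can upload in the Chaos Mesh UI. This is a sketch along those lines; the resource name, the target namespace, the cron schedule, and the duration are placeholders for my demo setup, so adapt them to yours:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: dns-error-demo
spec:
  schedule: "*/5 * * * *"        # run the experiment every five minutes
  concurrencyPolicy: Forbid
  historyLimit: 2
  type: DNSChaos
  dnsChaos:
    action: error                # fail resolution; "random" would return random IP addresses instead
    mode: all
    duration: "60s"
    patterns:
      - "o11y.love"
      - "cncf.io"
      - "gitlab.com"
    selector:
      namespaces:
        - default                # placeholder for the namespace where the demo app runs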
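And instead of eyeballing the memory graph, the "memory shouldn't keep growing" SLO can become an alert through a PrometheusRule, roughly like this sketch; the pod name pattern dns-demo and the 10% threshold over one hour are hypothetical values for this demo:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dns-demo-memory-slo
spec:
  groups:
    - name: dns-demo
      rules:
        - alert: DemoMemorySLOViolated
          # working set memory grew by more than 10% compared to one hour ago
          expr: |
            container_memory_working_set_bytes{namespace="default", pod=~"dns-demo-.*", container!=""}
              > 1.10 * (container_memory_working_set_bytes{namespace="default", pod=~"dns-demo-.*", container!=""} offset 1h)
          for: 10m
          labels:
            severity: warning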
Now, when I want to move on from chaos engineering and look beyond monitoring into observability, which is the second thing: we just talked a lot about metrics, how you measure them, and how you define SLOs, but observability is much more than that. You might have logs and events, there is distributed tracing, continuous profiling, error tracking, RUM agents (real user monitoring agents), and there is a certain shift from monoliths to microservices. This is quite a lot to learn as a developer, so really: take a step back, breathe, and maybe start by looking into metrics and tracing in the beginning. Traces themselves are a little different from logs: a span has a start and end time, it allows you to attach metadata and more context, and you can do application code instrumentation. As for the specification, you've probably attended OpenTelemetry talks this week already; it's really coming up. For me, the most important things I learned on the journey with OpenTelemetry: you need to bring your own backend, whether that's Jaeger, Elasticsearch, ClickHouse and so on for traces; you can build your own distribution, and I think AWS announced that metrics in the AWS OTel distribution are generally available today or yesterday, Michael Hausenblas tweeted about it; and it allows you to use auto-instrumentation for specific languages and SDKs so you don't need to go deep into the application code. I think the Java SDK provides that. (A minimal collector configuration is sketched below.)

Another idea is to dive deeper into auto-instrumentation and observability and look into eBPF, which I personally find super interesting. It would also be interesting whether we could use it as a source for SLOs, and as a source for combining all the different data, signals, and events we are collecting, and really move forward with that. I could talk ten hours about this, so this is the call to action: navigate to ebpf.io, for example, and check it out. And with that, shifting left, or left shift, the thing I want you to remember is: see the value in observability. It also provides application insight for everyone who is not the developer or the author of the code, which allows you to find problems fast: is the problem inside the code, or is it something else? Also, use boring solutions: start with the minimal viable change, like metrics with Prometheus and tracing with OpenTelemetry, and then continue adding more observability data which you might already have in your environment.
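As a tiny illustration of the "bring your own backend" point: with the OpenTelemetry Collector, the backend choice is essentially a configuration file. This sketch receives traces over OTLP and ships them to a Jaeger instance; the endpoint is a placeholder, and exporter names can change between Collector versions, so check the documentation of the version you deploy:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  jaeger:
    endpoint: jaeger-collector:14250    # placeholder address of your Jaeger collector
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]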
The other thing is that when observability is there for everyone, you need to teach your teams, you need to do onboarding, probably write documentation on how it's done, how app instrumentation works, how the service level objectives and the alerts in CI/CD are defined, thinking about merge requests with staging environments, and going deeper into alert channels, incident management, and so on. From a cloud native perspective it's great to have the benefit of the deployments: you can do auto-scaling, and you can learn from all the CNCF projects now starting to adopt OpenTelemetry, for example, getting the best practices directly and getting inspired. So I kept looking into how OpenTelemetry is showing up in Kubernetes, in Prometheus, and everywhere else, and we will potentially see much, much more in the future, and we can all learn together from the amazing open source community.

Shifting left with chaos, to conclude: try out the chaos workflows, the built-in ones and the custom experiments, verify the SLOs, think of quality gates, think of reliability, and then iterate and innovate. My own personal wish list with regards to observability, chaos, and so on: maybe we get some machine learning in the future which allows us to correlate events and auto-generate these SLOs; potentially vendors are already working on that. Chaos out of the box, so we don't need to add it on top, but really have it in the platform or something like that. And it should be accessible to everyone, not just the one developer who gets everything and then burns out from it; it really should be a team effort. And OpenTelemetry being adopted more widely. I also started something for myself around CI/CD observability, adding OpenTelemetry into GitLab, but that's a different story; I will be talking about this topic at cdCon in two weeks, I think.

Just to recap: do app instrumentation with metrics and traces, consider learning PromQL and SLOs, evaluate quality gates with Keptn and Prometheus, do the shift left, try chaos engineering, and benefit from observability everywhere. And if you want to learn more about observability, I've started a small learning platform, o11y.love. It really needs everyone to contribute, and I really want to encourage everyone to learn. I'm also here for you if you have questions around app instrumentation. I was overwhelmed myself, but I really want to encourage everyone: we can learn together, in the open, and ensure that our systems are running and that observability is fun. Thanks for your attention.