Hello, good morning, everyone. I'm thrilled to be here today to talk about a topic that is crucial to the health and performance of Kubernetes applications, and applications in general: testing observability features. As developers and DevOps engineers, we understand the importance of monitoring our applications, diagnosing issues, and ensuring that our alerts are accurate and timely. Today, we'll delve into how we can achieve this using Prometheus, Alertmanager, and various testing frameworks and libraries.

A little bit about me: I'm John Villasa. I work at Red Hat; specifically, I'm contributing to the KubeVirt project. I don't know if you've had the chance, but on the first floor here we have a small booth where we are giving some demos and handing out some swag, so if you have the chance, please stop by. I started working on a KubeVirt tool named the Hyperconverged Cluster Operator, but then slowly I started focusing more on monitoring and observability features across all the components.

So, the outline for today: we'll start by seeing how to set up a test environment and how to test metrics, then move on to alert testing and how to ensure alerts are actionable, relevant, and real. And at the end, I'll show you a small demo of how I put everything together.

So let's start with setting up the test environment. First, we need to understand what the test environment should look like. It needs to be, obviously, a controlled space where we can simulate the conditions of a production environment, but without the risk of causing disruptions to our actual users. And in our tests, we'll need to create and delete a lot of resources, remove permissions, cause network problems. This is where the concept of a disposable local cluster comes into play. A disposable cluster is nothing more than a temporary Kubernetes cluster that we create just for the purpose of testing. The beauty of this approach is that we can spin up a new cluster whenever we need to run a test, and we can just tear it down once we are finished with it. This ensures that every test of ours starts with a clean slate, and we don't have to worry about any leftovers from previous tests or the weird permission changes we might make to test our components.

Spinning up a cluster can be done with a lot of tools. Some that are easy to use and available right now are, for example, kind (Kubernetes in Docker), Minikube, MicroK8s, or any other. On KubeVirt, we actually already have a cool tool for this; I literally use it every day at work. It creates a cluster and even has a flag that sets up everything Prometheus-related, so I really like it. In the automated tests nowadays we don't use it, we provision full clusters, but I see that more as a bonus for when projects are more mature; small projects that are just starting don't really need it at the beginning.
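Just to make the disposable-cluster idea concrete, here is a minimal sketch of a test-suite hook that creates and tears down a kind cluster; it assumes the kind CLI is installed locally, and the cluster name and package layout are arbitrary, not something from the demo project.

```go
package e2e

import (
	"fmt"
	"os"
	"os/exec"
	"testing"
)

const clusterName = "observability-tests" // arbitrary name for the throwaway cluster

// run shells out to a CLI tool and streams its output to the test logs.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// TestMain creates a fresh kind cluster before the suite and deletes it
// afterwards, so every run starts from a clean slate.
func TestMain(m *testing.M) {
	if err := run("kind", "create", "cluster", "--name", clusterName); err != nil {
		fmt.Fprintln(os.Stderr, "failed to create cluster:", err)
		os.Exit(1)
	}
	code := m.Run()
	_ = run("kind", "delete", "cluster", "--name", clusterName)
	os.Exit(code)
}
```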
So let's start now with metrics and events. Metrics are obviously one of the pillars of observability, right? They give us insights into how the application is behaving, and they help us diagnose issues, and we'll explore how to test those metrics. I'm posing the question: are unit tests enough? Obviously, it's a dumb question; of course unit tests are not enough. In an ideal world, everything would be fully tested, right? But then again, we can't have everything, and we know time and workforce are limited.

So in the end, I guess, at least in the beginning, unit tests are kind of enough. In my opinion, it's more important to start with end-to-end tests for alerts. Usually, a lot of our alerts already use metrics in their calculations, so if our alerts are working correctly, we have some degree of confidence that the metrics they use are also working correctly, right? And even though we start simple, most of the time the simple things are the ones that end up saving us the most time in the future.

When we talk about metrics, it's very important to first validate that we follow the right naming conventions and have the correct labels and, if possible, to check that the metrics are prefixed with a component name, so that later on we can easily find out where the metric is being created and more quickly trace it back. And to have unit tests for the functions that update those metrics, validating at the end of the test that their value was correctly updated, as we'll see later on in the demo.

Now, let's move on to the alerts. One of the main concerns in this topic is that we don't want to be flooded with false alarms or miss critical alerts, right? So we'll discuss how to ensure our alerts are actionable, relevant, and real. And we'll also cover how to configure them correctly and ensure that they react to the appropriate triggering conditions.

This is probably the best-known quote in the area, and if you have worked with alerts, you've probably already seen it: every time the pager goes off, I should be able to react with a sense of urgency; I can only react with a sense of urgency a few times a day before I become fatigued; and every page should be actionable. This is taken from the Site Reliability Engineering book, so it's like one of the bibles of observability, right? And we have to be aware that an urgent alert might actually wake someone up in the middle of the night. This is even more important when we are producing software for external clients, because we don't know how many people they will have on duty. I actually worked at a small company before, where we were two people managing the infrastructure, and we didn't have someone on call at all hours of the day, right? And since we had clients on the other side of the globe, if an alert fired in the middle of the night, one of us would need to wake up and look at it. So it had better be a real alert.

So how should an alert be configured? It should have an owner: a contact person who is able to quickly understand the problem that triggered the alert, or the process it refers to. Sometimes it might be the developer of the feature, other times it might be someone from a monitoring team. This is really important, because we don't want to start on something we don't know very well, bang our heads against it, and lose valuable time. And as with metrics, we should be able to quickly identify the component to which the alert refers. For example, if you have a Kubernetes operator that is managing, say, resources on IBM Cloud, we might want to have an alert like "IBM Cloud is not available". We want to identify which component created the alert so we can quickly understand where it came from, go there, navigate the logs, and try to understand the problem in more detail. Alerts should also have a summary and a description, and I think those are actually pretty straightforward, but we should also have a link to a handbook, which we'll see in more detail in the next slides.
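Just as an illustration of those fields together, a rule like that "IBM Cloud is not available" example could look roughly like this, using the prometheus-operator Go types (monitoringv1). The expression, the owner and component names, and the runbook_url annotation are placeholders following one common convention, not something the demo necessarily uses.

```go
package alerts

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// ibmCloudNotAvailable sketches an alert that carries an owner, the component
// it refers to, a summary, a description, a severity, and a handbook link.
var ibmCloudNotAvailable = monitoringv1.Rule{
	Alert: "IBMCloudNotAvailable",
	// Placeholder expression; the real one would use whatever signal the
	// operator exposes about its connection to IBM Cloud.
	Expr: intstr.FromString(`my_operator_ibm_cloud_reachable == 0`),
	Labels: map[string]string{
		"severity":  "critical",
		"component": "my-ibm-cloud-operator", // which component raised the alert
	},
	Annotations: map[string]string{
		"owner":       "cloud-infra-team", // contact person or team; naming varies
		"summary":     "IBM Cloud is not available.",
		"description": "The operator has not been able to reach the IBM Cloud API for the last few minutes.",
		"runbook_url": "https://example.com/runbooks/IBMCloudNotAvailable", // link to the handbook
	},
}
```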
Usually for severity, some people use different severity levels, but Prometheus actually recommends these ones, and they are useful to distinguish which actions we should perform for each alert and to sort our priorities, right? Critical alerts are usually the ones that will page people in the middle of the night. For warning alerts, we sometimes just want to create a ticket that should be looked into the next day; they are useful for things like some component reaching critical memory usage, where if you don't do anything in a day or two it will reach a critical state. For those we usually don't perform any immediate action; we just create a ticket that goes to the bottom of the queue.

About the alert handbooks, I think those are really, really the important part, because they serve as a comprehensive guide for the cluster owners or operators, and they should provide step-by-step instructions on how to handle the specific alert. You all know that if we don't provide these handbooks, owners or operators will usually need to go through a lot of documentation pages or their personal notes to understand how they can debug and fix the problem, and this always leads to losing valuable time. Usually that time also means losing a lot of money, right? Even worse, sometimes we force them to rely on memory or improvisation because they handled some related issue in the past, and that also leads to mistakes and more delays.

So, how can we test the alerts? In our tests, we should make sure that all the alerts include all the mandatory fields we mentioned before; that each handbook URL is valid and the handbook actually exists; that the alert includes a reference to the instance or pod, which might be the name of the component, a label on the pod, or even something in the description; and that the alert is triggered when the expected conditions are met, right? Again, most of these checks are actually very simple, but as I said for metrics, they might just save us a lot of time and headaches in the future.

So, now let's put this theory into practice, and I'll try to show you a small demo, which usually goes very badly. To start with, I have here the creation... I don't know if you can see, or if I should increase the font size. Can you see, or is it better to increase it? It didn't even start, first issue. Maybe it's good.

So, in our project, we want to create the metrics, right? Actually, let me start with this: this is a simple operator for Kubernetes. It legitimately has no logic at all for now, just some simple metrics and alerts, to show you how you can start, right? From there, I used, in this case, kind to create a new cluster, and I just installed Prometheus. I'm doing this locally, but obviously you can follow these steps on a cluster you have for testing; it's even possible to do it on GitHub Actions or a similar CI. So, really simple steps that you can easily do in five or ten minutes.

After having the cluster, I then created my metric. I have here the controller label to refer to the operator. We are also looking, for example, to have stability levels that will allow us to deprecate metrics in the future. All these labels are really important, and it's worth putting some time into thinking about them, because these are the kind of things that will help you, your team, and your customers.
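Here is a rough sketch of what such a metric declaration can look like with client_golang. The metric is the reconcile counter from this demo, but the exact name, the stability_level label, and the _total suffix (added so it passes the linter we'll see in a moment) are my assumptions, not necessarily the demo's exact code.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileCount counts how many times the operator's reconcile loop has run.
// The name is prefixed with the component, and the const labels identify the
// controller that owns it; the stability_level label is only an illustration.
var reconcileCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "test_operator_reconciles_total",
	Help: "Number of times the test-operator reconcile loop has run.",
	ConstLabels: prometheus.Labels{
		"controller":      "test-operator",
		"stability_level": "alpha",
	},
})

func init() {
	// Register with the controller-runtime registry so the metric is exposed
	// on the operator's /metrics endpoint.
	crmetrics.Registry.MustRegister(reconcileCount)
}
```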
And then, for these metrics, we have here the reconcile loop of the operator. For those of you who are not familiar with operators: an operator lets us create a custom resource on Kubernetes, and then, when I create that resource, it runs this reconcile loop for me to perform any actions. Imagine that, as I said before, I want to work with IBM Cloud; I might want to create a machine on IBM, or a route, something like that, and this is the place where I would be doing it. And here, whatever the logic is, I want to increment this metric that will tell me how many times the reconcile loop was run.

And as I said before, this is where we can start simple and write the tests for our metrics. In the first step, we can start with the easy validation that, for example, the metrics follow the Prometheus conventions. We list all the metrics and then we lint them. Here I'm using the promlint tool, which already brings a lot of validations: for example, that the metrics have the necessary structure, or that counters have the _total suffix. And as I said, this is really important to look at right at the beginning, when we start adding metrics. In KubeVirt, we have a lot of developers adding metrics, and we had no validations; this here is from one component only. What ended up happening is that when we added the linter, we had all these issues: non-histogram and non-summary metrics should not have the _count suffix, and you can see there are a lot of errors in terms of units and so on.

Now, we saw before that we are adding stability levels. Why are we doing that? Because this project is used by clients. And which metrics are clients using? The ones we created before. So we can't just simply go there and rename the metrics, because they are using them and it would cause them a lot of trouble. So we are now thinking that we should deprecate these metrics over two versions, create new metrics with the correct names, warn the customers that the old metrics will no longer be supported, and try to see whether they are already using those metrics so we don't cause them issues, and this will take a lot of time. And you saw that it really takes very few lines to call the Prometheus linter and save us from this kind of trouble. But that's life, and that's why we are now trying to expose these issues to the community and write down some best practices so these problems don't happen again, right?

So moving forward, here is the unit test for the metric, right? We get the initial reconcile count value, and then we run the reconcile loop. As you saw, it is very simple, so it is obviously updating the metric; in the future your logic will be much more complex, but even so, in the end you expect, in this case, the initial value plus one to be the final value, right? Pretty simple stuff.

From there we move on to, for example, recording rules. This is where we are creating the recording rules. For this project I created two simple ones, right? The first is the number of operator pods in the cluster, which is simply a Prometheus query that counts how many pods of that type are up in the cluster, and the second is the number of ready pods, which is a sum over the pods of that type, because the underlying readiness metric has the value one when a pod is ready to be used, right? And you can also notice that we are already trying to use stability levels on these rules and alerts, for example alpha; the alpha is just an example, to show that this is not something settled yet and is still being worked on.
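Going back to that unit test for the reconcile metric for a second, here is a minimal sketch using client_golang's testutil helpers. It reuses the reconcileCount counter from the earlier sketch, and runReconcile is only a stand-in for the real Reconcile function.

```go
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

// runReconcile stands in for the operator's real Reconcile function; the only
// thing that matters for this sketch is that it bumps the counter.
func runReconcile() {
	reconcileCount.Inc()
}

// TestReconcileMetricNaming lints the metric against the Prometheus naming
// conventions (suffixes, units, help text, and so on).
func TestReconcileMetricNaming(t *testing.T) {
	problems, err := testutil.CollectAndLint(reconcileCount)
	if err != nil {
		t.Fatalf("failed to lint metric: %v", err)
	}
	for _, p := range problems {
		t.Errorf("metric %s violates conventions: %s", p.Metric, p.Text)
	}
}

// TestReconcileMetricIsIncremented checks that running the reconcile loop
// increases the counter by exactly one.
func TestReconcileMetricIsIncremented(t *testing.T) {
	initial := testutil.ToFloat64(reconcileCount)
	runReconcile()
	if got := testutil.ToFloat64(reconcileCount); got != initial+1 {
		t.Errorf("expected reconcile count %v, got %v", initial+1, got)
	}
}
```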
From those two recording rules we build our alerts, and one of the most important alerts to start with is obviously one that tells us the operator is down or the operator is not ready. We make use of the recording rules we saw before: if the number of operator pods in the cluster is zero, we trigger the "test operator is down" alert, and if the number of ready pods is less than the number of pods, we trigger the "test operator is not ready" alert.

For those, we also have some validations. For recording rules, we are linting them as we did for the metrics, because they should follow the same conventions, and then, for example, I want all recording rules to be prefixed with the name of the operator. The same goes for alerts, which should follow their own conventions. And here I also put the link I mentioned before: we are trying to add these observability best practices to the Operator SDK, so a lot of these validations come from there, right? Alerts must be in PascalCase format, they must have an expression, and we are validating labels and annotations, so we are just following the recommendations there. And these are basically the unit tests. I can run them just to check that, as a good developer, I follow the rules I created myself. This is the one for the metrics (it would be very funny if it failed), and this one for the rules. Pretty simple stuff.

So now let's move on to the end-to-end tests, which is what we really want, right? I already have a cluster here. Actually, I see that I still have the resource and the pod created from before, but running the tests will clear all that. I'll start them running now because they take a few minutes, I think. So, let's see.

For metrics, we are doing the same thing. Here is some setup: deploying the operator, then port-forwarding the Prometheus service so that we can access it locally, and deleting any previous resources that exist. And our test says that it should increase the test operator reconcile count when the reconcile loop is run. So we just get the initial value of the test operator reconcile count, and we create a new resource. As we saw before, in the reconcile loop, any resource that is created is supposed to update the metric. So in the end, we know that eventually, when getting the metric, it should be equal to the initial value plus one. This is a really simple test for metrics, but as I said before, it makes sure that everything is working as expected. We saw that this is very similar to the unit test, but in the unit test it's easier for us to know that the metric was updated; here it's trickier, because there are a lot of other operations involved: we need to make sure that Kubernetes is actually sending the right events, that they are being caught by our operator, and that the reconcile function is then executed correctly. And here we have the reconcile count, but we could have a number of resources created, a number of resources deleted, anything we want.

And for alerts, we first have the verification I mentioned for the handbook: we are checking that the handbook URL is available. I created just some gists on GitHub for the purpose of this demo, and I can actually show one, because I copied it from one of our handbooks and cleaned up some stuff.
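Before we get to the handbook, here is a rough sketch of what that end-to-end metric check can look like, assuming Prometheus has been port-forwarded to localhost:9090 and using the Prometheus Go API client; the metric name matches the earlier sketch, and the helpers and timings are placeholders, not the demo's actual test code.

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// queryMetric runs an instant query against the port-forwarded Prometheus and
// returns the value of the first sample, or 0 if the metric has no samples yet.
func queryMetric(ctx context.Context, query string) (float64, error) {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		return 0, err
	}
	result, _, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		return 0, err
	}
	vector, ok := result.(model.Vector)
	if !ok || len(vector) == 0 {
		return 0, nil
	}
	return float64(vector[0].Value), nil
}

// waitForMetricValue polls Prometheus until the metric reaches the expected
// value or the timeout expires.
func waitForMetricValue(ctx context.Context, query string, expected float64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if value, err := queryMetric(ctx, query); err == nil && value == expected {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("metric %q did not reach %v within %v", query, expected, timeout)
}
```

The test itself then boils down to reading the initial value of the reconcile counter, creating the custom resource, and waiting until the counter is the initial value plus one.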
So let's take a look, because it's also useful to know what a handbook should look like. First, we have the meaning of the alert: this alert supposedly fires when no test operator pod is running in the cluster. Then the impact the alert has: sometimes, in KubeVirt, the operator being down might not have that big of an impact, because the operator itself is not responsible for the virtual machines; but if the alert is something like "virt-controller is down", then users might start to have a problem, because virtual machines are running wild in the cluster and nothing is controlling them. And then we have all the diagnosis steps, and here we should have clear steps so that in the end we understand exactly what the problem is.

So, moving back to the tests, we are now following more or less the same approach as for metrics: deploying a new test operator, and then making sure that the "test operator is down" alert is thrown when we want it, right? In this case, for example, we just scale the deployment down to zero, and since we then have no pods in the cluster, we verify that the alert is being triggered. Here, "pending" is enough for me, because you might want to have a delay on the alerts and only fire the alert if the condition is met for more than five minutes, for example; but if I see the alert is pending, the condition was triggered and it's just waiting for those five minutes, so I think the alert is working fine, and I'm okay with it. And for the "operator is not ready" alert, I just come here and set a random image, which might mean, for example, that the repository is not available, as happened earlier in the other demo. It's a problem that sometimes we think never happens, but it happens a lot, that's what I want to say. And we then validate that the alert is thrown, once again checking whether it's firing or pending.

As time goes by, we start adding a lot more alerts. For example, we might want to see if... oh, it failed. We might want to see if the operator is creating the right resources on Kubernetes, and for that it needs permissions. One of the things we can do is just go there, delete the RBAC permissions, and see if the correct alert is triggered, saying that this permission does not exist anymore and you should look into it, right? That's one of the things we test. We can also test, for example, whether HTTP requests are failing; that's actually a metric that Prometheus already gives us. So the number of things you can alert on is limited only by your imagination, really.

And just to finish, I want to present this really useful tool, Chaos Mesh. Actually, we don't use it on KubeVirt, but I used it in my master's thesis, and it's really cool. It's really simple to use, and it allows you to create a lot of problems in your cluster: deleting pods, causing network issues such as high latency or requests being dropped, causing CPU and memory pressure. You should really take a look. It's simple to use, simple to configure, and it has a lot of potential for this kind of testing.
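And going back to the alert tests for a moment, here is a rough sketch of how that "operator is down" check can be automated: scale the deployment to zero with client-go, then poll Prometheus until the alert shows up as pending or firing. The namespace, deployment name, and alert name would be whatever the project uses; everything here is an assumption of this sketch, not the demo's actual helpers.

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleDeployment sets the replica count of a deployment, for example to zero
// to simulate the operator being down.
func scaleDeployment(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = client.AppsV1().Deployments(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
	return err
}

// waitForAlert polls Prometheus until an alert with the given name is pending
// or firing; pending is enough to know the triggering condition works.
func waitForAlert(ctx context.Context, prometheusURL, alertName string, timeout time.Duration) error {
	client, err := api.NewClient(api.Config{Address: prometheusURL})
	if err != nil {
		return err
	}
	promAPI := promv1.NewAPI(client)
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		result, err := promAPI.Alerts(ctx)
		if err == nil {
			for _, alert := range result.Alerts {
				if string(alert.Labels["alertname"]) == alertName &&
					(alert.State == promv1.AlertStatePending || alert.State == promv1.AlertStateFiring) {
					return nil
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("alert %q was not pending or firing within %v", alertName, timeout)
}
```

The "operator is down" test then scales the operator deployment to zero replicas and waits for that alert; the "not ready" test patches in a broken image and waits for the other one.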
So, to wrap up, I just want to say that it's really important to add observability features. They really help us. Usually when we start projects, we tend to overlook these kinds of things, but they end up being very important to help us later on. And it's just as important that they are actually working correctly. As I said, we don't want our DevOps people and our clients waking up in the middle of the night just to find out that the alert was not real, and we don't want real problems happening in the cluster that we are not alerted about, losing money and losing clients. So those are the main takeaways. So that's it. If you have any questions, feel free.

Yeah, because those metrics are not up to standard, right? The idea here is that when we mark a metric's stability level as deprecated, we will also flag that in its help text. Because if we keep metrics that we want to replace, we'll have a lot more things to manage in the future, right? And we already have a lot of metrics, and it's hard to keep up with them. We are also trying to add tools to generate documentation and to centralize the metric creation, but there are so many of them. If we have 10 or 20, it's manageable, right? But when we get into the hundreds, across a lot of components, it becomes a problem. And maybe, as you said, we cannot remove them in a version or two or three, because those problems happen. But eventually, in the future, I think we need to end up removing them, or it will become unmanageable, because this is an open source project, right? People come and go, and if we don't do anything about these metrics, they will eventually be forgotten, and nobody will know why we have this metric and another similar one. So that's why it's really important to validate them in the first place. Yeah. You might not agree, right? But that's my take. So, any more questions? Please.

Yeah... this is a big question. Ah, yeah. So the question was how to create metrics for reconcile errors. Basically, that example was very simple, but imagine you had an error; usually you exit the reconcile loop with an error, right? There I have just these test operator metrics, like the reconcile count, but I could create a new one, like a reconcile error count, and in our loop, when some operation throws an error, I just increment that new metric, the reconcile error count. And you can create any metrics you want, right? You might want one for an error like not being able to connect to an external provider. You can create metrics at whatever granularity you want, because metrics are really cheap, and you should create the metrics you need to help you debug. But when creating alerts based on those metrics, we should be more careful, right? Some of the errors might not be worth alerting on, because they might come from a user configuration that could not create the resource with the properties the user chose; that might be just a warning or an info alert for them. But yeah, my advice would be: create all the metrics you want; if you think the information is valuable, create the metric. With the alerts, you should be a little bit more careful. Any more questions? So I think this is it. Thank you, everyone, for being here.
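To make that last answer about error metrics concrete, here is a minimal sketch of an error counter next to the reconcile counter from the earlier sketch; the metric name and the helper are only illustrative, not the demo's actual code.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// reconcileErrorCount counts reconcile loops that exited with an error; the
// name is just an illustration, following the same prefix convention.
var reconcileErrorCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "test_operator_reconcile_errors_total",
	Help: "Number of times the test-operator reconcile loop returned an error.",
})

// observeReconcileResult is a placeholder for the end of the real Reconcile
// function: always count the run, and additionally count it as an error if
// one occurred.
func observeReconcileResult(err error) {
	reconcileCount.Inc()
	if err != nil {
		reconcileErrorCount.Inc()
	}
}
```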