Awesome. Well, the good thing about this is there's not going to be any questions, so lucky me. Good afternoon, KubeCon. I'm super excited to be here. Welcome to Alerting in the Prometheus Ecosystem, a story of the past, present, and the future. If we're going to spend the next 30 minutes together, we might as well get to know each other a little bit better. I'm Josue, commonly known as Josh, or gotjosh. Yes, it's a reference to the 1993 "Got Milk?" advertisement. When I'm not coding, I'm usually snowboarding, which wasn't an easy sport to pick up, given I come from the Caribbean. I work as a software engineer leading the Alerting team at Grafana Labs with a bunch of other folks from the Prometheus community, which you might know about, such as Tom Wilkie, Björn Rabenstein, and Ganesh Vernekar. Ganesh has a talk just after this one in this room, so please make sure you stick around, because that's an interesting one. I'm very lucky to have spent most of my tenure at Grafana Labs working with Prometheus, Mimir, Cortex, and the Alertmanager, and this has helped me shape the alerting service offering at Grafana. Before we begin, I want to see a show of hands. Who here is familiar with Prometheus and the Alertmanager? Cortex and Mimir? Cool, a sea of hands for the first one and not so much for the second one. Hopefully I get to explain a little bit about how Mimir and Cortex handle alerts today, and you'll leave the room with that knowledge.

OK, great. I'm going to be talking about the past, the present, and the future of the Alertmanager, and I'm going to do that with four topics. As groundwork, I'll review the Alertmanager and remind you of its features. Then I'll guide you through how we achieved horizontal scalability for the Alertmanager in Cortex and Mimir. Moving to the present, I'll show you why Grafana now includes the Alertmanager to power its alerting system. And with eyes on the future, I'm going to share my Alertmanager wish list and what I hope to contribute back. And if I can do it, so can you.

So let's begin with the past. Originally started as a one-file program by Julius Volz and Matt Proud, the Alertmanager is a binary that's in charge of receiving alerts generated by Prometheus or any other client application. There, you can configure and send notifications to other services, including third parties. Simply put, Prometheus is an alert generator and the Alertmanager is an alert receiver. We have coined these terms as part of the alert generator compliance specification to better reference both actors. One would think that sending an alert is as simple as receiving it and then forwarding it to another service, such as Grafana OnCall or PagerDuty. However, there's much more than meets the eye. The lifecycle of an alert can be broken down into five fundamental functions: grouping, deduplication, routing, silencing, and templating. Please keep in mind that this list excludes things such as high availability or the HTTP API. But what I want to draw your attention to is that the Alertmanager implements a set of five separate and portable components for each of these functions. This makes it really simple to use them as libraries within your own projects. Before we delve into these functions, let's break down what an alert is. As you can see from what we have here on screen, this is the alert reception endpoint of the Alertmanager.
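To make that concrete, here's a rough Go sketch of pushing a single alert to that endpoint (the Alertmanager's v2 alerts API). The hostname, label values, and annotation values are just placeholders; the fields are the ones broken down next.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// postableAlert mirrors the fields described in the talk: start/end timestamps,
// identifying labels, free-form annotations, and a link back to the source.
type postableAlert struct {
	StartsAt     time.Time         `json:"startsAt,omitempty"`
	EndsAt       time.Time         `json:"endsAt,omitempty"`
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations,omitempty"`
	GeneratorURL string            `json:"generatorURL,omitempty"`
}

func main() {
	alerts := []postableAlert{{
		StartsAt: time.Now(),
		EndsAt:   time.Now().Add(5 * time.Minute),
		Labels: map[string]string{
			"alertname": "HighErrorRate",
			"cluster":   "prod-us-east-1",
			"namespace": "checkout",
			"severity":  "critical",
		},
		Annotations: map[string]string{
			"runbook_url": "https://example.com/runbooks/high-error-rate",
			"description": "Error rate above 5% for 10 minutes",
		},
		GeneratorURL: "https://prometheus.example.com/graph?g0.expr=...",
	}}

	body, _ := json.Marshal(alerts)
	// The Alertmanager receives alerts as a JSON array on its v2 API.
	resp, err := http.Post("http://localhost:9093/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```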
An alert is composed of a start timestamp to determine when the alert should become active, an end timestamp to indicate when the alert should expire, and two sets of key-value pairs: labels, which identify the alert, and annotations, which provide a way to include additional information. Think of it as the place where you would put a runbook link or a better description of your alert. Finally, there's a link back to the query and the graph that generated the alert, called the generator URL. With this definition in mind, let's move on to the juice.

In the case of an alert storm, with tens of thousands of alerts being generated, you certainly don't want a one-to-one mapping at notification time. Instead, the Alertmanager is capable of grouping alerts based on label criteria. If you use Kubernetes, think of grouping my alerts by cluster and namespace. This translates to buffering received alerts in memory to eventually receive a single notification with multiple alerts in it.

Moving on to deduplication. By design, the Alertmanager has a highly available mode. In its highly available mode, it's an AP system: it's meant to be available and partition tolerant. To achieve this, it's expected that alert generators send all alerts to each alert receiver, even if duplicated. However, under normal operation, you don't want duplicate notifications. The Alertmanager uses a hash of the alert labels to determine whether it has already notified for that alert.

Moving on to routing. It's possible to have label-based granularity over where you want to send the alert via a tree-like structure. Your alerts descend through a set of defined routes that allow you to configure which service your alerts should be delivered to based on their content. Think of having your alerts with environment equals production be sent to Grafana OnCall or PagerDuty, while the ones that have a development label are sent to Slack.

Silencing refers to exactly what it says on the tin: being able to mute an alert to avoid any current or pending notifications. And finally, templating. Yes, you have the ability to template your notification into anything you would like. There's a plethora of different functions to help you craft that beautiful Slack notification with a clear call to action, or that perfect email that, when you're woken up at night, makes sure you have all the information to infer the size of the problem right then and there.

Given the ample feature set of the Alertmanager, and it being a required companion to complete the Prometheus alerting experience, you can see how it made perfect sense to include this service within Cortex and Mimir. But we reached a point where vertical scaling, which is what the upstream version supports, was not enough. We had to bring heavy machinery into play. Before I begin, I want to clarify that when I say Alertmanager instance, as you see up here, I'm referring to the embedded Alertmanager that belongs to a single tenant. And when I say replica, I'm referring to the multi-tenant Alertmanager, the Cortex and Mimir one. A tenant's Alertmanager, the one on the corner right here, is a one-to-one copy of the Prometheus Alertmanager. The first version of the Cortex and Mimir Alertmanager supported high availability by using the same protocol and mechanism as the upstream Alertmanager: gossip via HashiCorp's memberlist. It gossips silences and the notification log between replicas to achieve eventual consistency, ensuring your alerts won't notify twice.
But I must note that alerts themselves are not part of the state that's gossiped. Alert generators must not load-balance traffic and should instead deliver alerts to all the different replicas of the Alertmanager. This operational mode, as I said before, enables partition tolerance and survival of machine failures without interruption of service. However, it does not allow for horizontal scaling, right? Because every tenant needs to be present on every replica. In this example, tenant number one is present, with all of its data, on every Alertmanager replica; with N replicas, the same tenant exists N times.

I'll admit that vertical scaling, with the three replicas we ran at Grafana Labs, worked pretty well for us for a very long time. But as our Grafana Cloud offering kicked off, we got to a point where reaching 10x the current load was going to be a heavy challenge. It was the right time to bring horizontal scaling to the Alertmanager.

When Tom and I designed this, we started off with a set of requirements, as every engineer should do. Survive machine failure: a single replica crashing or exiting abruptly should not cause any externally visible downtime or failure to deliver notifications. Eventual consistency for reads and writes: users should have an eventually consistent view of all the alerts currently firing and under active notification, favoring availability over strong consistency. Zero-downtime deployments: scaling up, scaling down, or rolling out a new version of the service should be done without any sort of service interruption or data loss. With that, we decided to split this grandiose design into four areas: routing and sharding, persistence of state, replication and consistency, and the service architecture itself.

For routing and sharding, all of the Cortex and Mimir services rely on hash rings. Hash rings are a distributed consistent-hashing scheme that is widely used by Cortex and Mimir for sharding and replication. We weren't in the market for reinventing the wheel, so the only question to answer was: what should we shard on? We settled on tenant ID, accepting the trade-off that a high variance in tenant workloads could lead to poor balancing across the replicas.

Next, the Alertmanager is a stateful service. You already know that it persists notification state and configured silences to disk after a certain amount of time; this is mostly to avoid what we call write amplification, right? With horizontal scaling, we needed to move this state around as the number of replicas grew or shrank, and we also needed to preserve it across restarts or as we rolled out new replicas. We settled on pushing and pulling the state to and from object storage, while making sure we try other replicas first when starting a new instance. This would help us save on storage costs at the end of the day.

For replication and consistency, this is where it really starts to get interesting. The Alertmanager replicates the notification state between replicas to ensure notifications are not sent more than once. Remember that "firing twice" I mentioned before? Before moving to a model where the Alertmanager instances for a given tenant are only present on a subset of the replicas, we had to decide how we were going to keep those replicas in sync. I don't think anyone here enjoys getting paged twice.
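Going back to the sharding decision for a second, here's a toy Go sketch of picking which replicas own a tenant by hashing the tenant ID and walking a ring of tokens. This is only an illustration of the idea; the real Cortex/Mimir ring uses many tokens per replica, heartbeats, and zone awareness.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy hash ring where each replica owns exactly one token.
type ring struct {
	tokens   []uint32
	replicas map[uint32]string
}

func newRing(replicaTokens map[string]uint32) *ring {
	r := &ring{replicas: map[uint32]string{}}
	for name, tok := range replicaTokens {
		r.tokens = append(r.tokens, tok)
		r.replicas[tok] = name
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// ownersFor hashes the tenant ID and walks the ring clockwise to pick
// `rf` replicas that should host that tenant's Alertmanager.
func (r *ring) ownersFor(tenantID string, rf int) []string {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	key := h.Sum32()

	start := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= key })
	owners := make([]string, 0, rf)
	for i := 0; i < rf && i < len(r.tokens); i++ {
		tok := r.tokens[(start+i)%len(r.tokens)]
		owners = append(owners, r.replicas[tok])
	}
	return owners
}

func main() {
	r := newRing(map[string]uint32{"replica-1": 1 << 30, "replica-2": 2 << 30, "replica-3": 3 << 30})
	fmt.Println(r.ownersFor("tenant-123", 2)) // e.g. [replica-2 replica-3]
}
```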
Managing cluster membership is the job of HashiCorp's memberlist in the upstream implementation. However, in Cortex and Mimir, the hash ring is the one responsible for that function. We settled on keeping the same notification log and silence replication algorithm that the upstream Alertmanager was built upon. The difference here is that we were going to rely on gRPC as the transport for replication messages and on the hash ring for knowing where to send what, effectively dropping the direct use of memberlist. I've got to admit that this worked out pretty well for us, despite it only being on paper at first. We didn't have to rip out any of the Alertmanager internals, and the only change we had to make upstream was turning the clustering mechanism into an interface in the code so that we could override it within Cortex and Mimir. I would call that a massive win for open source.

The last piece of the puzzle was the service architecture. To implement the sharding strategy, we needed a component in charge of handling incoming API requests and distributing them to the corresponding shard, meaning replica and instance. The idea here is to have a sort of Alertmanager distributor, which plays the same role as the Cortex or Mimir distributor, as the first step in request reception. Once a request is received, this component is in charge of sending the alerts in parallel to all the replicas and instances needed to fulfill it. We settled on not adding yet another service and instead simply baking it into the Alertmanager service, the trade-off being that we can't scale this component individually.

So now, with everything in place, this is how it looks. Writes of alerts, highlighted in red, are first sent to any Alertmanager replica. Alerts are then replicated in parallel to each other replica that hosts that tenant's Alertmanager, using what we call a Dynamo-style quorum for consistency. Do note that because the Alertmanager replicas all have access to the same hash ring, we no longer require the alert generator to send alerts to each of them. This is great for a cloud offering because it means users wanting to send their alerts to their cloud instance can do so without any sort of inconsistent views.

Speaking of reads, we have multiple strategies depending on the API call, but the gist of it is that we can either merge responses if we need to talk to multiple replicas, or send the request to any single replica if it only requires one response. As for the state that needs to be replicated, which is highlighted in purple, it's constantly communicated between replicas directly using gRPC. This happens whenever an instance receives a silence or sends a notification. And as I said before, the optimization you see here, of replicas talking to each other rather than to object storage, is that we'll always try to request the state from other replicas instead of object storage, to avoid any major compute and storage costs at the end of the day.

We have been successfully running this architecture for over nine months as part of Grafana Cloud, with huge success: we didn't observe any major problems or any major increase in our costs.
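Here's a minimal Go sketch of that Dynamo-style quorum write: fan the alert out to every replica that owns the tenant and succeed once a majority has acknowledged it. The `send` callback is a stand-in for the gRPC call a real implementation would make; the replica names are placeholders.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// quorumWrite fans a request out to every replica that owns the tenant and
// returns as soon as a majority ((n/2)+1) has acknowledged it. The remaining
// replicas catch up asynchronously via state replication.
func quorumWrite(ctx context.Context, replicas []string, send func(ctx context.Context, replica string) error) error {
	n := len(replicas)
	quorum := n/2 + 1
	results := make(chan error, n)

	for _, r := range replicas {
		go func(replica string) {
			results <- send(ctx, replica)
		}(r)
	}

	var oks, fails int
	for i := 0; i < n; i++ {
		if err := <-results; err != nil {
			fails++
		} else {
			oks++
		}
		if oks >= quorum {
			return nil // enough replicas have the alert
		}
		if fails > n-quorum {
			return errors.New("quorum not reached")
		}
	}
	return errors.New("quorum not reached")
}

func main() {
	replicas := []string{"replica-1", "replica-2", "replica-3"}
	err := quorumWrite(context.Background(), replicas, func(ctx context.Context, replica string) error {
		// Placeholder for a gRPC call delivering the alert to one replica.
		time.Sleep(10 * time.Millisecond)
		if replica == "replica-2" {
			return errors.New("replica unavailable")
		}
		return nil
	})
	fmt.Println("write error:", err) // two of three replicas acknowledged, so no error
}
```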
But enough of Cortex and Mimir. I now want to come to the present and show you my most recent project in the alerting world. Fast forward to where we are now: after working with the Alertmanager on such a complex project, I realized its value and how it perfectly complemented our vision of delivering the two most requested features for Grafana. I want us to look at my reasoning for including the Alertmanager within Grafana, because it truly is a love story. By exploring the implementation details of these features, my hope is that I can convince you to avoid the trap of not-invented-here and choose open source software when possible. While Grafana supports a high number of data sources, the examples I'm going to provide are based on Prometheus data for the sake of simplicity.

Let's start off with support for queries using template variables and multiple alerts per graph. Even though this evolved into two separate issues, as you see here, in reality they're the same thing: you have a single query, you want that query to produce multiple alerts, and you want individual control over all of those alerts. Number one and number two have already been a part of Grafana for a while now; it's baked into Grafana's data model, and it's exactly how you visualize time series data. Number three is the new concept here. Imagine grouping all the alerts that come from my production environment and delivering them to Slack. I think you've heard that word "grouping" before. Exactly. The Alertmanager does grouping and, more importantly, routing of notifications as well. With that in mind, we can have Grafana use this existing data model and turn any sort of data into a set of labels that we can use as the alert definition. Then I can send the definition to the Alertmanager, just like Prometheus does, and take advantage of all the features I mentioned before: grouping, routing, silencing, and notification templating.

Next up: silencing and time-of-day restrictions. For silencing, Grafana supported a simple form of "do not deliver me this alert". It was called pausing, and it halted the alert evaluation completely. The pausing feature was per alert, meaning there was no way to pause multiple alerts at the same time that match a single criterion. Let's say you want to pause all your alerts coming from a development environment because you're having a game day or you're testing something; there was no way of doing that. Personally, I think this was not great. Why would you keep an alert around if you don't want it to be evaluated? Silences within the Alertmanager semantically work very, very differently. First, they're based on label matchers, meaning you can match a set of alerts that come from completely different rules. And second, they have an expiration date. They're meant to be ephemeral so that you don't forget about them; if you're not ready to remove the silence, you just extend it, right?

Now, time-of-day restrictions are kind of a hard problem if you think about it. Consider not having your alerts firing during weekends. I knew that the Prometheus community had been trying to solve this problem for a very long time, with plenty of input from the Alertmanager issues, and it settled on a fantastic implementation named time intervals. Time intervals allow you to specify multiple criteria for time selection that can then be reused across multiple routes. This is exactly what I needed for Grafana.
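As a concrete illustration of how these silences differ from the old pausing model, here's a minimal Go sketch of creating one through the Alertmanager v2 API; the host, matcher values, and comment are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// A silence is defined by label matchers plus an explicit expiry, so it can
// cover alerts from many different rules and never lingers forever.
type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

func main() {
	s := silence{
		// Mute everything coming from the development environment...
		Matchers: []matcher{{Name: "environment", Value: "development"}},
		StartsAt: time.Now(),
		// ...but only for the duration of the game day.
		EndsAt:    time.Now().Add(4 * time.Hour),
		CreatedBy: "gotjosh",
		Comment:   "Game day: muting all development alerts",
	}

	body, _ := json.Marshal(s)
	resp, err := http.Post("http://localhost:9093/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Because the silence matches on labels, it mutes every alert carrying environment=development regardless of which rule produced it, and it expires on its own after four hours.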
So I think it's pretty obvious at this point, right? Using the Alertmanager within Grafana gets us both of the features we wanted, and we can continue to collaborate with the Prometheus community by upstreaming any improvements that we make: the absolute beauty of open source.

Open source is all about giving back, and in my next section, the future, I'm going to cover exactly that: what's my plan for giving back to the community? The first order of business was to improve how we gather feedback. We have already created a discussions section within the Alertmanager repository that allows us to prioritize just a tiny bit better. If you have a feature request or something that you would like to see, please share it with us. My wish list is based on feedback that I have gathered after months of the Alertmanager being part of Grafana and of running the Alertmanager as a service as part of Grafana Cloud.

I'm going to start with a quality-of-life improvement that was developed by my colleague, George: being able to test your receivers directly via an API. We have included an experience within Grafana that enables you to test how an alert would look on a particular receiver, such as Slack, PagerDuty, or Grafana OnCall. No more guessing with those always-firing test alerts, waiting for them to fire as you make a change; now you make a change and you get instant feedback. We have already offered to upstream this work and are working out the design details as we speak, so you should expect this to land within the next few months.

Another challenge that our users face is the lack of visibility into the notification pipeline, the most common issue being failed notifications due to templating errors. Others, such as "why is my alert not being delivered? I have already unsilenced it", are also present. Prometheus offers information about the state of your alert rules, as you can see here: the evaluation time, the error, what state they're in. I want the same thing for the Alertmanager. As examples, directly from the UI, I want to be able to know: why am I not being notified about this active alert? Is it because I've said to wait for an hour before notifying me, or is it because it failed to render due to an error in my template? When did a notification for this alert fire? How long did it take to notify? And for a particular receiver, have there been any problems recently? While the current logs are extremely useful to debug some of these problems, you need access to them in the first place, right? I want to better serve the case where the operator is not necessarily the end user.

To end, my moonshot ambition: a historical and stateful view of the alerts received and notified, the answer to the question "what was received and notified in the last 48 hours?". The Alertmanager is already semi-stateful, as we know: silences and the notification log are stored to disk, we already saw this. Perhaps importing the Prometheus TSDB, or any other database with a lower retention period and a query interface similar to Prometheus, is an idea here. I'm going to be thinking long and hard about this for the next few months and hopefully bring something to fruition that can help tackle this problem.

I really want to end by saying that the Alertmanager is an amazing piece of software; take a second to celebrate the efforts of the Prometheus community. I'm excited about bringing the Alertmanager forward, but I feel like there's so much more that we can do as a community.
Please get involved and catch me in the hallway track if you want to talk about anything that has to do with alerting. Thank you. We don't have time for questions and I apologize, but please catch me in the hallway track if there's anything you want to talk about.