I should say that Julian was supposed to be here, if you have read the program in advance. On short notice, he unfortunately couldn't make it, so I just made up a talk that I always wanted to give but never found the right occasion for. So this is something I have wanted to talk about for a long time, but the talk itself is completely improvised; I hacked together some slides yesterday in the hotel. That's essentially it. Julian wanted to talk about new stuff in the Prometheus world. We will still have a very compressed version of that in the maintainer session on Thursday at 2:30, called "Prometheus intro and deep dive": there will be an intro, and then a deep dive with the news. It's very compressed, but Julian also told me he will record his talk anyway and publish it on his blog or somewhere, so just follow him on Twitter or something and you will get to enjoy the talk.

Anyway, okay, cool. So, my topic: alerting and anomaly detection, best friends forever? Short answer: no. And talk done, right? I guess we have 20 more minutes, so I should offer a more nuanced view on the topic.

Let's go back to 2015. What happened in 2015? You should all know, as fans, right? This was when Prometheus got publicly announced. Remember, Prometheus had been an open source repository on GitHub ever since November 2012, meaning we are approaching a 10-year anniversary, which is pretty incredible. But 2015 was the year when we made a bit of noise, and Prometheus became very popular quite quickly, which we didn't really expect. We expected people asking us questions, weird questions about the weird things we were doing, but there was one question we didn't expect at all, and it turned out to be one of the most frequently asked: can you do anomaly detection? Because back then in 2015, the hotness was anomaly detection, and whenever somebody claimed to create a next-gen monitoring and alerting system, obviously it had to involve anomaly detection, right? And we were like: what? We were kind of surprised, and I'll explain why.

One thing is: seven years later, anomaly detection is still the hotness. I have to read this one out. So, marketing language: "Imagine, for example, a food delivery app that has lots of usage at lunch and dinner times, but is pretty quiet in the early hours of the morning. The same threshold doesn't work well for both scenarios and could lead to missed incidents or noisy alerts. What if you could learn from your metrics in the past and create alerts that adapt to your data and context over time?" Wouldn't that be great? I'm carefully navigating the competitive landscape here; I don't want to put blame on any of the competitors of my employer, which is Grafana Labs, as you might have seen on the first slide. So I took our own marketing language.

My honest answer is: then you would almost certainly miss incidents and get noisy alerts. That's the sad truth. Why is that the case? First of all, outages do not necessarily look like anomalies. Especially nowadays, you might run a high-availability service, three nines, four nines. If you have just a fraction of a percent of errors, you are already out of your SLO, but the traffic pattern looks normal, so your anomaly detection system might not even see that you are slowly burning your error budget. So you will miss outages if you rely on anomaly detection for alerts. The other way around it's even more obvious: an anomaly is not necessarily an outage.
On the contrary, most anomalies are not outages. Here is my favorite example: the World Cup final. Here in Spain, if I say World Cup final, you know what I mean, right? In most countries of the world you know what I mean; there are a few countries where World Cup final means something else. I mean the one this year, in that sport which I personally don't really like, but most people do. So the World Cup final happens, and most of the world is watching it. You might be the engineer on call, and you also want to watch it. But since your alerting is based on anomaly detection, every World Cup final, every four years again, you get paged, because your traffic drops; like 70% of your traffic falls away. Everyone is watching the World Cup final. You want to watch the World Cup final? No, you have to ack the page and realize this traffic drop is actually fine: people are just not using my service, it's not an outage. So, really bad. You will never get to watch the World Cup final. There's even a little anecdote from inside Google. I don't know if it's true, but it's so nice that I like to tell it. Where is Google from? A country called the United States of America. They don't watch the World Cup final, right? So their engineers, at least initially, when they were more US-centric, didn't know there was a World Cup final. They just saw a 70% traffic drop. What? Something must be wrong, for sure, right? So even if humans do the anomaly detection, and they are supposed to be smarter than your machine learning, they will still produce false positives and panic because the traffic has dropped.

Another thing is that anomalies are just anomalies; nothing judges your anomaly for you, right? So if you are a crappy startup, you are not making any profits, your product is unpopular, and suddenly your product becomes popular and traffic goes up, you should be happy, but it's an anomaly, right? So you get paged because you have success. Or you were just duct-taping your site together, you were down most of the time, and suddenly you are up because you did proper work. You're up! It's an anomaly; you have never been up for longer stretches of time. Anomaly, anomaly, anomaly. But no, right? It's not really an outage.

Okay. So, coming back to Prometheus in 2015: we kind of knew that anomaly detection is problematic for alerting, but we really didn't have it on our radar. We had something else on our radar, which was this philosophical discussion of symptom-based alerts versus cause-based alerts. That was the hotness for us in our little Prometheus bubble, and you might still have heard about this discussion. Traditionally, you would monitor packet loss on the network, servers that go down, disks that fail, right? This is traditional monitoring, and this is what we call cause-based alerting. In distributed, complex systems, first of all, you have so many nodes that these things will happen all the time. Luckily, your systems are designed to tolerate that, which in turn means that if you alert based on causes, you will get very noisy alerts, and you will get woken up in the middle of the night because of something your system can cope with. So what you should do is symptom-based alerting: you should alert on conditions that actually mean your users are having a bad experience. If you have an SLA signed, or you have set SLOs, you actually know what your service should do.
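To make the distinction concrete, here is a minimal sketch of the two styles as Prometheus alerting rules; the metric names and thresholds are made up for illustration:

    groups:
      - name: symptom-vs-cause-sketch
        rules:
          # Symptom-based: alert on what users actually experience,
          # here the ratio of 5xx responses to all responses.
          - alert: HighErrorRatio
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m]))
              > 0.001
            for: 5m
            labels:
              severity: critical
          # Cause-based: a single instance being down is something a
          # replicated system is designed to tolerate; paging on it
          # in the middle of the night would be noise.
          - alert: InstanceDown
            expr: up == 0
            for: 15m
            labels:
              severity: warning

The first rule fires on what users see; the second fires on a cause that a well-designed system usually absorbs without anyone noticing.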
So you can alert directly on your SLOs, which is called SLO-based alerting. That's another new hotness; our discipline has the habit of riding the hype. So nowadays anomaly detection is still a hype, but so is SLO-based alerting, and people have overblown expectations there, too. But in principle it's a very good approach, symptom-based alerting, SLO-based alerting: if you get paged, it should be a real issue which has to be dealt with now. That's why you got woken up.

So we were fighting that battle, symptom-based alerting versus cause-based alerting, and now anomaly detection comes into the game. Where does it even fit in here? Anomalies could be symptoms, in rare cases; we will talk about that a bit later. Anomalies could be causes, or correlated with causes. But most anomalies are neither symptoms nor causes, so why would you base your alerting on them? For us this was so far off the radar that we didn't even think about it, and the rest of the world kept asking us how to do anomaly detection in Prometheus.

Now, there is hope. You could argue there are cases where you can use anomaly detection for alerting. Strictly speaking, that's the case when anomalies are the symptoms. There are very rare cases where an anomaly is a symptom of a failure: systems that you just want to run completely normally all the time. It's hard to come up with really good examples; I always think of medical stuff, like if my heart rate is too low or too high, that's an anomaly and it should simply never happen. There are definitely technical systems where that's true; they are just very rare. Perhaps one or two people in this room have a system where this is really the case.

However, the worse the signal-to-noise ratio you are willing to accept, because the stakes are so high, the more anomaly detection becomes something that might make sense. The stakes are really high if you run an airplane or a nuclear power plant, right? You could argue that you should have an error budget there too, and in reality you have one: nuclear power plants blow up, planes fall from the sky. But you shouldn't really tell your passengers, or the users of your electricity: yeah, there's a 0.1 percent chance this will blow up or fall from the sky. You try to keep that really, really low. So in those cases you also want to know that something is going to happen before it happens, and a failure is so expensive that you are willing to accept noisy alerts. If anything is weird in the airplane, you want to know; if anything is weird in your power plant, you want to know, even if 90 percent of those alerts are false positives. You don't want to operate your, whatever, search engine that way, right? Because nobody dies if your search engine throws a few errors.

I also think of fraud detection or something like that, where anomaly detection might be useful, because with intrusion or fraud you have an attacker who deliberately tries not to trigger your alerts. You have an intelligent antagonist, and then you might say that anything that is not normal, that looks weird, should ring a bell. Maybe that's a good thing. But I also remember, a couple of years ago, whenever I traveled to another country, the first transaction with my credit card got blocked, because normally, right, I never travel to France, and suddenly I do.
And the whole point of a credit card is that I can travel to another country without cash, put in my credit card and get money; but the bank thought that's an anomaly and blocked my credit card. So even there, you have to use it responsibly.

Okay, there's another thing, and this is way more relevant for most of us: we should actually think of alerts in a wider sense than just pages. As Prometheus people, which you probably all are, you are familiar with that, because what we call an alerting rule in Prometheus isn't necessarily a rule that creates a page. We have all kinds of alerts, right? But the lingo is really different, and everyone uses different terminology. There's the famous blue Google SRE book, which established a terminology that I find a bit weird: it uses alerts synonymously with pages. And then there's what I call SoundCloud lingo, because that's where Prometheus was first used, and this is what I personally tried to establish: let's not call pages "alerts", because that's the general term; let's call them pages, right? In Prometheus you can attach free-form labels. We use the severity label; you can use something else, you can use a different terminology. Prometheus is really inclusive and open-minded here, but it comes at the price that terminology might differ.

So, severity critical: this is something that you act on immediately. It should only ever be based on symptoms, and it goes to your pager.

The next kind of alert, and here I like the word from the SRE book, is tickets. I also tried to establish that at SoundCloud, but most people call it an email alert, because they send it to email, although it should really go to an issue tracker. This should still be actionable, so anomalies are still bad for this, because they are so noisy. It should be a symptom, but it could also be a cause, because most causes have to be dealt with eventually; they just shouldn't wake you up. You should act on it eventually, but you can do it during work hours. If you have packet loss in your network, you should not ignore it; it will eventually lead to a real outage, or it's at least definitely something bad, but nobody has to be woken up for it. The network people dealing with it can get an alert during their work hours. So this category could even be based on causes.

And then there's a less well-known kind that I like; I call them informational alerts. The SRE book calls this category "logs", which is a bit weird; maybe such alerts are usually of that nature, but they don't have to be logs. It could be a Prometheus alert that goes nowhere, or that goes to a dashboard. What is that good for? I like informational alerts because they make sense when you have a real outage and you want to find the cause. Contrary to common belief, a monitoring and alerting system is not only there to tell you that something is broken; it also helps you find out what is broken. And those alerts help you there, because they tell you if something is weird, which smells like anomalies, right? Many alerts that you might have in your system, that are currently paging or creating tickets, might actually be better as informational alerts: the developer of a piece of software thought, if this value goes above this threshold, that's kind of weird. It's not really breaking anything, but if you see that something else is going wrong, you would want to know about it. So: informational alerts.
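As a sketch of what this can look like in a Prometheus rule file; the severity values are just the convention described here, nothing Prometheus enforces, and the metric names and thresholds are hypothetical:

    groups:
      - name: severity-sketch
        rules:
          # Page: symptom-based, act immediately. An Alertmanager
          # route matching severity=critical sends it to the pager.
          - alert: SlowRequests
            expr: |
              histogram_quantile(0.99,
                sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
              > 0.5
            labels:
              severity: critical
          # Ticket: actionable, but during work hours; routed to an
          # issue tracker (or, commonly, to email).
          - alert: PacketLoss
            expr: rate(network_packets_dropped_total[15m]) > 0
            labels:
              severity: ticket
          # Informational: routed nowhere, or onto a dashboard;
          # useful context once you are debugging a real outage.
          - alert: UnusualQueueDepth
            expr: job_queue_depth > 1000
            labels:
              severity: info

The actual routing happens in Alertmanager, matching on the severity label, so the same rule file can feed a pager, an issue tracker, and a dashboard.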
And I think anomaly detection has a great place in that last category. This is actually where this whole machine learning, blah, blah, blah, buzzword business makes sense and is a billion-dollar market, because that's really what you do when you are troubleshooting, right? You want to see what's wrong. You got paged because you have a real outage; that page is not anomaly-based, it's based on real measurements against your SLOs. But what do you usually do then? We all have this intuition: what has changed? Ten minutes ago an outage started. Was a new software version pushed? Did the traffic spike? Did something in the network go wrong? It would be so nice if you could, Star Trek style, ask the computer: please list all the things that have changed in the last ten minutes, or perhaps in the hour running up to it. What we humans do instead is what we call staring at dashboards, right? This is an anti-pattern: you just look through all your dashboards and you might find something. It would be so cool if machines could do this for us, and that's what machine learning can actually do.

This also inverts the requirement for the signal-to-noise ratio. If 90% of the pages you get are false positives, you will hate your machine learning system. But if you are fighting an outage, you have no clue what's going on, and your machine learning system presents you with ten theories about what could be wrong, and one of them is right? That's awesome, right? I only have to test ten theories and I will find the real cause of the outage. Great, right? So I think this is where we really get value.

Okay, let's come back to 2015. Everyone is asking us: can you do anomaly detection? We slowly understood what people actually wanted. We tried to explain why you shouldn't use it for alerting, but we also told them: listen, PromQL is super powerful, of course you can do anomaly detection; all the building blocks are there. Brian, back in 2015, wrote a blog post on the Prometheus blog; it's still there, explaining how you can do anomaly detection with Prometheus. So finally we could claim we can do anomaly detection, and it might even make sense. This is the example query from the blog post, and you can see the building blocks: we can take an average and compare it to the standard deviation of latencies between instances. That's a reasonable thing, right? Your load balancer should spread the load reasonably evenly, so if a single instance is more than two standard deviations away from the others in its latency, probably something is wrong. It might not be a problem you should page on, but if you get paged based on your nice SLO-based, symptom-based alerts and then see, oh, one instance is much slower than the others: great, right? Perhaps a noisy neighbor, perhaps a broken node, I don't know, but that's helpful. In the blog post, this query is refined a lot to make it less noisy and so on; you can look it up. There's also a lightning talk about anomaly detection during the Prometheus day later today, where you will probably see all those tricks, I assume.
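To give a flavor of it, the basic form of such a query looks roughly like this; this is a sketch, not the literal query from the blog post, and instance:latency_seconds:mean5m is assumed to be a recording rule for the per-instance average latency:

    # Instances whose average latency is more than two standard
    # deviations above the average across all instances of the job.
    instance:latency_seconds:mean5m
      > on (job) group_left()
        (
            avg by (job) (instance:latency_seconds:mean5m)
          + on (job)
            2 * stddev by (job) (instance:latency_seconds:mean5m)
        )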
So you can do this, but all of it is rule-based, right? What people back then were all looking for was the buzzword: machine learning, right? So I want to close with Cartesian coordinates. Quadrants, that's always cool, right?

So, one axis is the anomaly detection method you use, which can be more on the machine learning side or more on the rule-based side; the other is the purpose, which can be alerting or what I call troubleshooting. And Prometheus is here, in this quadrant: rule-based, and for troubleshooting. That's what we want, right? You could use it for alerting, of course; there are Prometheus rules for anomaly detection, but we kind of philosophically don't want this. Again, there are exceptions; there are cases where this might make sense. For example, I would never just alert on a traffic drop, because: World Cup final, right? But there are limits where you can be pretty confident, where you say: if my traffic drops by 99%, that's weird even during the World Cup final. You can make up rules where you are reasonably confident that something is wrong. But then you could also say it's not even an anomaly anymore; it's something where you just know this must be wrong, so it's kind of more of a symptom again.
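As a sketch of such a guard-rail rule, with a hypothetical metric name and the 99% from above turned into a comparison with the same time last week:

    groups:
      - name: traffic-guard-rail
        rules:
          # Page only if traffic drops below 1% of what it was at
          # the same time last week; no World Cup final explains a
          # 99% drop.
          - alert: TrafficNearZero
            expr: |
              sum(rate(http_requests_total[10m]))
                < 0.01 * sum(rate(http_requests_total[10m] offset 1w))
            for: 15m
            labels:
              severity: critical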
But anyway, be careful with this. Alerting with machine learning: just don't, right? Dangerous territory. It's hardly ever useful, with exceptions like power plants, or maybe fraud detection, right? But this quadrant over here is the interesting area. Here you want machine learning, and this is essentially the Star Trek computer style, right? Captain Picard still has to ask the computer the right questions, but the computer will do all the hard work of analyzing and correlating everything. This is just nothing you can do in Prometheus, and my prophecy is that Prometheus will never have those machine learning parts. Not just because we want to stay focused on what Prometheus does well, but also because your machine learning should look at all signals. Prometheus is looking at metrics, but you have so many other signals in your modern monitoring systems, and you should put them all into your machine learning system. So what do machine learning systems do? They query Prometheus; they can retrieve the data, there's a query API, right? Then they can process it, and maybe it will help your troubleshooting. That's where I think we will see the real progress, the real money and the real convenience coming from. But not over here, right?

Okay, I think that's it. We have a few minutes for questions. Thank you. Someone just put their mask back on; yes, thank you. Questions? We've got five minutes.

Are there any projects currently doing that automated troubleshooting that you talked about, that we could use?

Can you hear me? Now you can hear me. I don't know of any in particular, but I talk to the people. I'm not following the marketing side of this so much, but I talk to the engineers who are doing machine learning, including the machine learning experts at my own company, because I'm always keen on: please don't tell people that they get automatic alerting and never have to think about thresholds again, because that is just wrong, right? And they all tell me: yes, we understand, we have already tried that, it's super noisy. But then they talk about exactly this, right? They want to build this whole operational troubleshooting part; they want to support you there. And I think they can. I'm just not sure whether somebody already has something on the market that actually works, but I know that people are working on it. So if it's not on the market already, it soon will be, and I guess multiple vendors will offer it. I hope it will be good. More questions?

Could you put your mask on? Thank you. I don't want to be obnoxious or anything, but it is highly unfair if everyone does a thing and then a few think they're special. They're not. I hate those masks; I wear them. Done. Any more questions? The people watching this virtually can also ask questions in the chat. Oh, yes, a question.

So what are we supposed to do for alerting? The graph implies that we should only do troubleshooting, not alerting.

Yeah, you should do alerting, of course, but not based on anomaly detection, right? The usual marketing story, which you could see in the example from Grafana, but everyone is telling you this, is that if you just have static thresholds, it won't work. That's a bit unfair, because modern monitoring approaches also teach you not to just use a static threshold; they tell you to do something smarter: look at your SLOs. And even then, you don't simply alert on the error rate from your SLO, which is already a bit smarter because it's relative to traffic; you alert on the speed at which you burn your error budget. There's the SRE Workbook, the second SRE book, chapter five, I think. I myself did something like that back at SoundCloud, there's a blog post, then I did the same thing at Grafana Labs, and it's in podcasts and all over the place. This is what we call SLO-based alerting.
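A minimal sketch of such a burn-rate alert, roughly in the spirit of that chapter, assuming a 99.9% availability SLO over 30 days and hypothetical metric names:

    groups:
      - name: slo-burn-rate-sketch
        rules:
          # Page when the error budget burns at more than 14.4 times
          # the sustainable rate, which would exhaust a 30-day budget
          # in about two days. The long window (1h) catches the
          # trend; the short window (5m) makes the alert stop firing
          # quickly once the problem is over.
          - alert: ErrorBudgetBurnRateTooHigh
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[1h]))
                  / sum(rate(http_requests_total[1h]))
                > 14.4 * 0.001
              )
              and
              (
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m]))
                > 14.4 * 0.001
              )
            labels:
              severity: critical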
It's real science, rocket science in a way; it's quite elaborate, and it has its problems. There are also a lot of talks these days about why SLOs aren't the silver bullet for everything. Nobody ever said they were, but they help so much. And, sorry, shameless plug: Grafana has a new podcast called Big Tent, and there's an SLO episode where they interviewed me and Matthias Leuvel, who is perhaps in the room, I don't know. Matthias also has a lightning talk today, I think, about this SLO-based alerting, so watch that; watch all the talks today, of course. But this approach is kind of ingenious, because it really alerts you in a non-noisy fashion. Tom Wilkie, who is my boss's boss at Grafana, put it so nicely: they had very noisy pages initially, and then I came along and said, we have established procedures, do this; and it was day and night, right? Of course there are shortcomings, and it's not the perfect solution, but it's so much better than what you had initially. And if you ever go in this direction, you also realize: why did I even ask for automatic alert thresholds, push the button and never worry again? You should know your system better, right? That's where your alerting should come from: from your knowledge of the system and of what your users expect, not from some magic that will automatically tell me if something is wrong. No, it won't; it will tell you that something is abnormal, and that's not the same thing.

I'm wondering: you said that we're supposed to avoid noisy systems, noisy alerting, but sometimes it depends on context, like you said with the example from Google. Do you have any advice on how to avoid noisy alerting and at the same time keep the context and keep the alerts that are important, which may depend on time or other things outside the system?

Yeah, I mean, there really is no silver bullet; you cannot stop thinking. You have, what you call, context, right? You have to think about your system. For most of those online serving systems most of us run: read chapter 5 in the SRE Workbook, go through the blog posts and podcasts; that is probably what you want. But don't apply it without thinking, because if you run a nuclear power plant, you don't want a 0.1% blow-up rate per month or something, right? Obviously. But if you run just a normal service, that might be just fine. It really depends on what you are doing, and then you have to pick the right method. I can't give you the three rules for finding out what it is; it's your own engineering decision, of course. But most of the time: don't do anomaly detection for your paging. That's the rule that's most often true, with very few exceptions.

We've got five more minutes. Any more questions?

What question didn't the audience ask, which you would like to have seen asked?

Oh, I think those were good questions. We can also... Anything more from online? Okay. Okay, then let's wrap up. Thank you very much. Five minutes to relax. Thank you, too.