So thank you for being here. This is Test-Driven Alerting. I'm Rick. This is Diego. Sean, the other guy on the author list there, couldn't be here. He's busy installing a Cloud Foundry foundation at the 15th-century Inca citadel known as Machu Picchu. Not really. He's actually on vacation and bailed out on us. Thanks, Sean. Hope you see this on the video later. Either way, we're here alone, and we do have our boss in the audience, Amber Austin, who is in charge of monitoring and alerting. So if you have any real questions, come talk to her after it's all said and done. Here's what we're going to cover today: the great digital flood, why alerting is hard, how we can maybe use a little data science here and there to help us out with some of that, and then what our vision of test-driven alerting actually comes out to be. Whenever I think of alerting, I think it's forever going to be tied to pagers, right? Has anyone here ever had to carry one? Oh wow, that's a pretty good number, so that means you're a little older. My dad's pager passed to me after he died. It only went off once after nearly a year of waiting. It was at dinner, and we had to go to the hospital. If a pager is going to interrupt you, it had better be very, very important. In this case it was. We were going to the hospital because he had been notified that they found a match for his heart and he was going to have a heart transplant. So the problem with pagers is that they don't necessarily work very well when you're out in the remote reaches surfing like I like to do, or walking the hill country like my friend Diego, or in Sean's case skiing the Aspen snow like he does all the time. And that's probably not a bad thing, right? We don't want to be paged if we're out in the remote reaches of the wilderness somewhere.
But maybe a bigger problem is that it actually works too well in places where you don't want it to work. One of Sean's stories, and I thought this was hilarious, was that he was actually paged while his wife was in the delivery room, because somebody broke payroll and he had to go in and fix it. That's not a good thing. And Diego? Yeah, one of my stories is, we were running our system and we didn't have any metrics set up. And you don't really get paged if you don't have metrics and you don't have alerts, right? The problem is, I'd get calls in the middle of the night from users saying, hey, our system is kind of broken, can you do something about it? And that was my paging. So Diego never got paged because his monitoring sucked and his alerts sucked, and he was called in the middle of the night when everything went bad, and there was no good reason for it. Everybody who's been around long enough, and I think that's everybody who raised their hand about having carried a pager, probably has some bad pager story I'd love to hear after this. It would actually be pretty fun to exchange war stories. But a lot of them, at least for me, come down to false positives, right? And those things kind of suck, especially when we have to actually take action to prove that we were alerted wrongly. We've got to log in to the VPN, pull out our computer, SSH into 5,000 boxes, who knows what, just to find out that there's nothing wrong. But they are a learning opportunity. I think that's one of the things data science will help give us: we can teach the machine to weed out the noise and tune us into only the signal we care about responding to. And there are things that, if we'd had the ability earlier on, we could have automated and self-healed. A lot of trivial cases, frankly.
There are even hard cases. I mean, look at BOSH. What does BOSH do? BOSH does a lot of self-healing, and it runs on these feedback loops. But automation and self-healing, that's a hard problem. It requires a lot of work to figure out the signal in that chaos of noise. Frankly, I think the only reason for this talk is that Diego, Sean, and I only ever want to get paged if Godzilla walks across our data center and now we have to shovel the kaiju droppings out of the servers just to get them running again. We're kind of lazy that way. Although, true story: Sean was literally working at a credit card processing company, and he tripped over a 30-amp cord and took out half a rack of servers. If you know Sean, you can believe it. Luckily it wasn't anything mission critical. So the bottom line is, in the world we live in, especially with Cloud Foundry, and we'll talk about that in a second, we're in a great flood of data of biblical proportions. A data deluge that is unmanageable if you're going to handle it manually. It's just not going to work. Here's an interesting example, because we've worked a little bit with our friends at GE, and maybe some of them are in the audience today. Aircraft engines are fitted with more than 5,000 sensors that generate up to 10 gigabytes of data per second. A single twin-engine aircraft on a flight from, say, London to LA can produce up to 844 terabytes of data. By comparison, at the end of 2014, Facebook was only producing 644 terabytes. And then compound that problem with the fact that one of those engines is landing somewhere on Earth every five seconds. That's what we call big and wide data. But they're smart on the IoT side, right? They've taken things like Cloud Foundry and data science and all this other cool stuff and merged it into this artificial intelligence.
And what that's allowed them to do is predict demand levels and engine thrust and all sorts of other cool stuff that's way beyond my pay grade. These engines are now more energy efficient. They're safer. And there are a lot of really interesting things they've been able to do, including increasing fuel efficiency by 10 to 15 percent, I think. That's a crazy number. You can look at the Aviation Week story to find out the details. So they're taking these jet engines, detecting anomalies on these big, wide data sets, processing the data, and making corrections so that the jets don't fall out of the sky. And all we have to do is something much simpler: make Cloud Foundry work better, right? So that we don't have to deal with it so much. So we do need help with the data deluge. I'm Rick. I'm part of Pivotal's ground game on the services side. I work with a lot of our up-front clients, on the ground, helping them build out their Cloud Foundry capability. I do a little bit of data science-y stuff on the side, so I won't claim to be the expert, but I know it's a really good avenue for us to eventually use for our own purposes with Cloud Foundry. And I'm Diego. I used to be part of the government. I'm now part of a team called Customer Zero, where we help our Cloud Foundry users make sure they have sustainable platforms they can run for the long term. So, we have Cloud Foundry, right? Cloud Foundry is really cool and you can do a lot of good stuff with it. But the reality is, you have a lot of information coming at you about all the applications and everything else. And the reality is, crying won't help you. I tried. You need to make some sense out of all the stuff you're getting, right?
It's a huge amount of data. You have your logs and you have your metrics, and they're coming from the operations side, the application side, the services, the tiles, the load balancer. There's a huge amount of data, and the problem is you need to make sense out of all of it. And once you start using it, you run into this problem called alert fatigue, right? You have a lot of alerts coming in from a lot of different places. Are they real alerts? Are they false positives? Am I even the right person to receive this alert? And the problem is that you build up this muscle memory of just ignoring them. And that's no good. So how do we solve this? How do we solve the problem of where to focus our alerts? The reality is, the first thing most people do is say, I'm just going to use whatever's out of the box in my metrics solution, my alerting solution, you know, just monitor CPU usage, right? And that's okay, but it's not great. You're going to miss a lot of stuff if you're just monitoring whatever comes out of the box in a solution that isn't Cloud Foundry-specific. Another approach is, let's set up an alert for every little metric we have in the Firehose. And there's a huge amount of metrics. I don't even know how many. How many metrics are there? A lot of metrics, yes. And where do you set the thresholds, right? That's not really good either. My favorite is the approach I was taking before: let's not set any metrics, let's not set any alerts, and let's just hope the gods of the Internet help us out someday. And what ends up happening is, you enter this meltdown mode where your users are saying, hey, everything's on fire, and you're like, what? I don't know about that.
And after that, you're like, oh yeah, everything was on fire, so now let's set every alert again. And then you have a million different alerts going off on the pager at the same time, and you're like, okay, you know what? I'll just throw the pager into the ocean and not pay attention to it anymore. And that's not good either. A lot of these approaches are really reactive, right? So you have to think about how to be proactive in this alerting space. One of the things people think about is, okay, let's set up a couple of different dashboards. Yeah, we can put in a dashboard for BOSH, a dashboard for the cells, a dashboard for the router. And what ends up happening is you have these 35 dashboards that no one is really looking at, and they might or might not have thresholds. Is that really good? I mean, you can use those dashboards to understand what's going on in your system, but if anything goes wrong, it's like finding a needle in a haystack. So why don't we take a step back and look at what we're really looking for out of the system, right? If you're familiar with the Google SRE book, one of the things it talks about is the four golden signals of a system. One is latency: how much time are we taking to serve a request? And one of the cool things is, if you use the Firehose, there's a latency metric for a lot of different components, and you can use that to build alerts. Next, traffic: how much volume of stuff are we getting? In the Firehose, again, you can get how much traffic you're getting on almost every single component. Then errors: are we getting 500 errors, 502s? Is someone trying to use a resource they're not supposed to? We should alert on that. And saturation. This is the basic one, right? Are we using too much RAM?
Are we using too much CPU? Is the router able to handle all the requests we're sending it? Yeah, so the Greek root of anomaly roughly equates to uneven in English. This is why detection is hard, right? There's unevenness in the data. There's a lot of variation. There's a lack of uniformity. There's no way to find that perfectly well-defined needle in the haystack, as it were. Ultimately, the problem is that the data just makes it hard for us. There's just so much of it. What we'd love to see is a perfectly curled wave that we can just jump on and predictably ride out to the end; then we'd know exactly what we're looking for, right? But what we're really looking for is anomalies, unevenness, within this flood of data. So at the end of the day, this really comes down to too much noise. We don't know when something catastrophic happens because the signal-to-noise ratio is off. And between the noise and the avalanche of data, we really need some sort of help, right? So together we put down a few ideas that we think may be helpful, based on experience and some of the emerging things happening out in the real world. One of the things we're very fortunate about is that, corresponding roughly with Cloud Foundry becoming this rising technology, we've seen a parallel rise in this thing called data science. And data science, if you don't know, has been called the sexiest job of the 21st century by the Harvard Business Review. I think we can all disagree on that a little bit. The real truth is that it's really the Cloud Foundry scientist, right? Which is all of us, I think. So what is data science?
It's really just the hyperconvergence of statistical modeling with big data and big compute: the things people couldn't do before with pen and paper that they can now do because we have this unbelievable facility that everybody can use called compute. Data science is basically wrapped around the process of discovering and creating mathematical models of one kind or another to help us do one of two things. We want to either predict, so we can read the tea leaves and say, okay, based on these inputs X, we think Y is going to happen, because we used some previously known data to build the model. The other is to categorize, so that we know where things fall. For instance, we want to extract some information about how nature connects things together, such that from the coffee beans going into the grinder, we know the resulting coffee and how it was formed. But for alerting, what we really want to know is: given the current state of the data, of the real world, what's likely to happen next, or what category is this going to fall into? Essentially, what is the color of the next bean through the grinder? Is it brown? Is it green? Is it purple? So data scientists build these models to do prediction using sophisticated and mysterious techniques, lots of dark arts and black magic done from ivory towers where they worship at the altar of Thomas Bayes, all to draw a line. That's what it basically amounts to, right? It's a bit anticlimactic, but it seems like that's most of what data science is all about. So I would recommend this: it's from the UC Berkeley machine learning crash course by Daniel Geng and Shannon Shih. It gives you a really good explanation of what data scientists are trying to do.
So for instance, the first step is to take some training data set of things we already know about. We know there are apples. We know there are oranges, right? And the goal is to figure out the line that says, hey, this is categorized as an orange, that's categorized as an apple. They call this really cool squiggly line, which ultimately derives from some sort of mathematical function, a decision boundary. And that classification helps us because when we see something we haven't seen before and don't know what it is, for instance that blue X, we know it's probably going to be an orange, because of what we already know about the data set. The other thing, which gets us more to the predictive side, is regression, which relates to regression to the mean: everything returns to the average, a bottom-line thought in a lot of statistical contexts. So we have this data set; in our case, home prices and square footage. Maybe not in San Francisco, where the cost of housing is so crazy right now that 100 square feet will cost you a million dollars, but in most places there is some sort of correlation between the two, right? As your square footage goes up, your price goes up, or vice versa, depending on what you're looking at. So as these houses get bigger, when we're introduced to some new house, the green X, we can actually make a prediction: we have this much square footage, which is known, and we can find that the price is going to land right about there, right?
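That regression picture boils down to fitting a line by least squares. As a sketch, with y standing in for price and x for square footage (symbols chosen here for illustration, not from the talk's slides):

```latex
% Ordinary least-squares fit of price y against square footage x
\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}

% Predicted price for a new house of known size x_{new} (the green X)
\hat{y}_{\mathrm{new}} = \hat{\alpha} + \hat{\beta}\, x_{\mathrm{new}}
```

The slope is the covariance of size and price over the variance of size; the predicted point is just where the new square footage lands on that fitted line.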
But the cool thing about this is that because we know this nice band where everything should live, given that we think the universe works in a way where everything clusters within a standard deviation, we wind up knowing that when anomalies happen, something that shouldn't exist is happening: some sort of error. And again, all of this, at the end of the day, is just to draw a line, right? That line we call the predictor. All that means is that for anything running along that line, within plus or minus some range, we believe that given this X input we can predict the Y output. And that way we know. So that's data science in a nutshell. Now, the cool thing for us is that we need some sort of tool to help us set alerts. One very exciting one is Prometheus, right? And Diego, who we're very fortunate to have with us, is an expert on Prometheus. Ask him anything. I guarantee he'll know it. Put him on the spot. And you agree, right? So one of the things we want is something to help us dial in these signal-to-noise ratios. If you're not familiar with Prometheus, it's an open-source monitoring system and time-series database. It's got a really cool, flexible query language to help us with alerting. It's open source, so if you have any interest at all in this, I recommend you go take a look. One of its components, Alertmanager, lets us take all this information, group and route the alerts, and send them out to other things like PagerDuty. Again, the word pager is still there, even in our vocabulary. It helps us take some sort of action. And it also supports custom anomaly detection, right? So let's get to what we're talking about.
These are the five points we hope you take away. We want to stop the noise. That's what this is all about, right? Because it's going to give you peaceful dinners. We want to use the predictive methods that are available. We want to look at things and ask: do we care about this? If we don't, let's just ignore it. There's a lot of noise that comes out of things we just don't care about. And if we do care about something, let's make sure it's consistently ringing; make sure the bat phone has been ringing long enough before we pick it up. We also want to do some sort of intelligent alerting, which happens to be built into Prometheus. Obviously, there are a lot of other systems that will do this. And finally, we want to grow this out, using some of the data science methods we're going to talk about, to build great feedback loops that help us kill the noise. So the first approach is to use predictive and categorical methods to tune that signal. Prometheus has these methods, which we'll talk about in a second, that help us find the patterns in the data. For instance, do you have a linear pattern? Is there some slope to it? Is there a trend? Is it up? Is it down? Is there some seasonality associated with it that we need to be concerned about? There is a fantastic article by Brian Brazil on practical anomaly detection, specifically using Prometheus. I'd recommend it as a starting point for anybody interested in this topic. We've truncated a little bit of what he talked about here, but basically he starts with a simple scenario where you have a small number of servers that are not performing as well as the rest, and as such are responding with increased latency, one of the four golden signals. So we want to look for the instances that are more than two standard deviations above the mean.
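That "two standard deviations above the mean" check can be sketched as a Prometheus alerting rule. This is an approximation of the shape of the queries in Brian's article, not a verbatim copy, and the aggregated metric name `instance:latency_seconds:mean5m` is illustrative:

```yaml
groups:
- name: latency-anomalies
  rules:
  - alert: InstanceLatencyOutlier
    # Fire when an instance's 5-minute mean latency sits more than
    # two standard deviations above the average across its job.
    expr: |
        instance:latency_seconds:mean5m
      > on (job) group_left()
        (
            avg by (job) (instance:latency_seconds:mean5m)
          + on (job)
            2 * stddev by (job) (instance:latency_seconds:mean5m)
        )
    labels:
      severity: page
```

The `on (job) group_left()` matching compares each instance's latency against the job-wide average plus two standard deviations, so only genuine outliers cross the threshold.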
And so here's some of the query language you can see being built by Brian, where he tries to eliminate false positives when the latencies are very tightly coupled. Because as you peel the layers back, you're going to find that other things cause false positives too, and you'll have to adjust for them. In this instance, for example, he adds an adjustment where the latency has to be at least 20 percent above the average. Then he eliminates false positives at low traffic levels by adding a requirement that there's enough traffic, something like one query per second. So it's just intelligently knowing your data, understanding what's coming out of it, and working from there. Prometheus also provides very specific data science methods that are already built in. For instance, there's the holt_winters function. It lets you forecast demand over seasonal data, stuff that's essentially repetitive. We know that around November we're going to have the Black Friday spike. We know that around February we're going to have the Super Bowl, when everybody's going to want to join in watching our commercials and things like that, right? So we have to be able to work around that. Additionally, there's predict_linear, which is a really cool feature: you can take a time series of data and say, I know what's going to happen next, or at least what should happen next, based on a simple linear regression that's already baked in. And they have the building blocks of a lot of statistical models already present in the deriv function, which uses linear regression under the hood, plus standard deviation and standard variance, both of which are, like I said, fundamental building blocks of most statistics. A lot of this is, you know, data science, not rocket science.
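A rough sketch of those built-ins as Prometheus rules. The metric names (`node_filesystem_free_bytes`, `job:http_requests:rate5m`) are assumptions for illustration, not anything from the talk's slides:

```yaml
groups:
- name: predictive
  rules:
  - alert: DiskWillFillIn4Hours
    # predict_linear fits a simple linear regression over the last
    # hour of samples and extrapolates 4 hours (14400s) ahead; alert
    # if the projected free space goes negative.
    expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: page
  - record: job:http_requests:rate5m:smoothed
    # holt_winters(v, sf, tf) double-exponentially smooths a range
    # vector: sf weights recent samples, tf weights the trend. Useful
    # as a baseline for repetitive, seasonal-ish traffic.
    expr: holt_winters(job:http_requests:rate5m[1d], 0.3, 0.1)
```

Both functions operate on a range vector, so the window you pick (`[1h]`, `[1d]`) is as much a tuning knob as the thresholds themselves.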
These are pretty simple approaches you can take to stop the noise and make your alerting better, right? So one of the things we were talking about is, how can you improve your alerting so you don't get alerted by stuff you don't care about? Filter out the noise. Most monitoring solutions have filters, so we should use them, right? A lot of people think, I'm going to set all the alerts, and then you're getting alerts from your compilation VMs. I only care that the compilation VM compiles, right, if you're using BOSH. If it doesn't, I'll figure out what the problem is, and BOSH is going to try to fix it. And if you have test or non-essential deployments, just filter that noise out. Another thing: if you're using self-healing infrastructure, you should let it self-heal, right? If you alert on BOSH unhealthy instances right away, you're going to get alerted while BOSH is trying to fix something. You shouldn't try to outsmart BOSH. It's a pretty cool tool. With that, we can ignore some of that noise. And another thing is that we should try to see when something is consistently on fire. I don't want an alert that my daughter is at the pool splashing, but if someone's drowning, yeah, that's a problem. So here's an example: if there's a metric that is just barely touching a threshold, I don't care, right? It's okay. But if something is consistently over, then I want an alert. And again, Prometheus has a cool feature where you can say, only let me know if this has been going on for 10 minutes. And that's something you can do with pretty much any alerting solution.
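That "consistently on fire" idea maps directly to the `for` clause in a Prometheus alerting rule. A minimal sketch; the metric name and threshold here are made up for illustration:

```yaml
groups:
- name: consistency
  rules:
  - alert: RouterLatencyHigh
    # Hypothetical metric and threshold; the point is the `for`
    # clause. The expression must stay true for 10 straight minutes
    # before the alert fires, so a single spike that barely grazes
    # the threshold (the splashing, not the drowning) never pages.
    expr: router_latency_ms > 500
    for: 10m
    labels:
      severity: page
```

While the expression is true but the 10 minutes haven't elapsed, Prometheus holds the alert in a pending state rather than firing it.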
And one more thing I think is interesting: a lot of people say, we're going to have all the alerts go to the same person, or the same team, and then they'll figure out how to route them. Like, all alerts go to Rick, and he'll figure out who can fix it. But the reality is that we can use tools for that. Alertmanager has this idea of different receivers, so you can say the front-end team receives the front-end alerts, database service alerts go to the database team, and that kind of stuff. And I think that's pretty important. Also, make sure you have the right settings to group your alerts, right? I don't want to receive the same alert 200 times. Maybe send me one every five minutes, with all of them together. Another interesting thing, and this is something Prometheus doesn't currently have but PagerDuty does, is automated scheduling and escalation. That's very useful, right? I don't want to get paged because someone forgot to change the schedule on the pager, and you can build that into tools. I think you can even set out-of-office stuff. And you should have escalation policies too, so that if I'm napping, like I do every day, and I miss an alert, it escalates and Rick gets pinged. Thanks, Rick. So these are simple things you can do to reduce that noise. Yeah. So in the end, test-driven alerting is all about building feedback loops that work with this noisy data coming out of Cloud Foundry that's waking us up from our siestas, and all the other little problems that get in the way of our dinners, vacations, and everything else, right?
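The routing and grouping Diego described can be sketched as an Alertmanager configuration. The receiver names and the `team` label are assumptions; wire the receivers to PagerDuty, email, or whatever you actually use:

```yaml
# Alertmanager config sketch: route by team label, and group
# duplicates so the same alert arrives once per interval instead
# of 200 times.
route:
  receiver: platform-team          # default catch-all receiver
  group_by: [alertname, deployment]
  group_wait: 30s                  # wait briefly to batch related alerts
  group_interval: 5m               # one grouped notification per 5 minutes
  routes:
  - match:
      team: frontend
    receiver: frontend-team        # front-end alerts go to the front-end team
  - match:
      team: database
    receiver: database-team        # database service alerts go to the DB team
receivers:
- name: platform-team
- name: frontend-team
- name: database-team
```

Grouping by `alertname` and a deployment label means one page summarizes every instance of the same problem, rather than one page per firing series.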
And in Prometheus, the built-in data science functions are really just scratching the surface, frankly. I think we can bring a lot more data science methods into it, because it's an open-source project, right? There's no reason we can't contribute, just like we do with Cloud Foundry. So, as food for thought for the future, here are things that are interesting and I think applicable from the data science realm to our world of alerting and monitoring: the k-nearest neighbors algorithm, which is fantastic; multiple linear regression; logistic regression; Bayesian thinking, which includes LDA and Bayesian inference; decision trees and my favorite, random forests; along with my other huge favorite, support vector machines, which get us much closer to what we think of as AI. I won't drill into all of that because there's a lot of detail there, but the slides are available on the schedule site, so please take a look. So, all combined: use the predictive methods; if you don't care, ignore it; if you do care, look for consistency; do intelligent alerting; and build these feedback loops, contributing back to Prometheus and other ways for us to alert on Cloud Foundry. That will take us our first steps toward test-driven alerting. And we know we cannot stop the flood. It's only going to get worse. So our only option is really to eliminate the noise. Bottom line, prod should have some alerts; I think we all agree. But manually setting those alerts won't scale with the data deluge that's coming for us, and doing nothing is a really bad option. We've seen that, unfortunately, at other places. Leverage data science where you can. There's a lot on the subject that you can dig into; start ripping the hood off and figure out what's inside.
Use the false positives that wake us up to actually learn, and get the machine involved in creating those good feedback loops, instead of just using our intuition like we do now. Bottom line: you should have some alerts, really, right? So at the end of the day, get more sleep. Y'all look a little tired to me. It's probably been a long convention, right? Have dinners. Have those weekends. Have those vacations, so that they're no longer disrupted. There's no reason for that anymore. Let's control what we can control. The data deluge is beyond our control, but we can stop the noise. But be careful not to trip on the power cord like Sean did. We can't really help you with that. Maybe we can buy you some duct tape. So we'd like to thank everyone who helped build these slides. And if you want to reach out or contact us or collaborate on anything, these are our Twitter addresses. And finally, if you're interested in the SpringOne Platform event coming up, we do have a discount code for you that Pivotal's offering, so take a look at that. And I think that's it. Anything else, Diego? Thank you very much. Yeah, I mean, if you have stuff that is very sporadic, like you have an event every month, like a scan every month, it's hard to apply data science to that because there aren't enough metrics. You can set up alerts on that; the problem is, what are the right thresholds? I mean, you can alert if there's one of something, but is that a good alert? I don't know. Yeah, this in particular is not a product; it's an open-source project. At Pivotal we're working on some solutions like this, and there are other solutions out there, so I would say yes. I think the cool thing about this is that it's an open-source project; if you're in a closed-source shop, the commercial solutions are going to be different. But there are options, yeah.
No, we just have the Prometheus repos that are already out there. There's a link in the deck that will get you to the actual functions.go, it's written in Go, and you can take a look at how they've built those pre-baked functions and queries. Yeah, actually, there's a BOSH release for Prometheus, and it has, I'd say, 30 or so dashboards that are pre-built and around 50 alerts that use some of these functions. The alerts I had on the slides came from that release, and it's actually a pretty good release. Any other questions? All right, well, thank you so much, everyone. Have a great rest of your convention.