Hi. Thank you everyone for coming. My name is Mike Tujeron. I work at Lithium Technologies. And I'm Ilan Rabinovich. I'm the director of community at Datadog. So we're going to talk a little bit about monitoring OpenStack at Lithium with Datadog. You can kind of go through that. So we use OpenStack in a variety of ways at Lithium. We have a bunch of production-facing communities. We're a SaaS, so a lot of communities for Fortune 500 and enterprise companies. The service is running on OpenStack. We've got Redis, Java, MySQL, Node, Elasticsearch, Cassandra, and a whole bunch of other things in smaller usage. We run a lot of our infrastructure services on there: Kubernetes on there with Docker, our Chef, DNS. We run Consul and other kinds of infrastructure services. All of our development environments are running on there. A lot of teams use it for R&D. It's OpenStack. It's a cloud. So random stuff gets to run on there, which is kind of nice for the developers. But it kind of makes monitoring a bit of a challenge. So a little bit about what our stack looks like. We run in two regions. In the EU, we're running Kilo. In the U.S., we're running Icehouse. Kilo is a fairly vanilla installation on the newer services, kept mostly separate from Icehouse just because Kilo was a new introduction for us. We have about 60 hypervisors. Probably 80% of those are in the U.S. We're running about 1,000 instances across the two different clusters, 10 terabytes of RAM. We're using Contrail for our SDN, which is really nice. It does add some monitoring challenges that we'll talk about later. Within Cinder, we're using both Ceph and SolidFire for provisioned IOPS. This is all managed primarily by a team of three engineers. There's five of us on the team, but about half of us are dedicated to managing this OpenStack cluster. Monitoring. So we have tools out of the box. You have Horizon. I'm assuming everybody's seen the Horizon dashboard. Not very useful. It's a quick snapshot, sure, but you don't get anything cross-tenant easily available. We're multi-region, which is even worse, because then you've got to look at the tenants across the regions and so on. And it's a single view. Each of those different teams wants to look at OpenStack, but Horizon just gives you that one view. It also doesn't give you time-based metrics. Graphs over time are super important to us. You also have the Nova APIs. You also have Keystone, Cinder, etc., all the Python APIs, really. And you can roll your own monitoring. But I mean, I'm sure you're all busy enough with the rest of your job; you don't want to write your own monitoring solution if there's stuff already out there out of the box that you can use. But we have used these. We've used both of those before we switched to the Datadog integration. We would write some Python code, and then we'd either push it to Datadog, or we'd push it to StatsD, or we'd push it to log files, whatever it happened to be. And then we'd have to visualize it somewhere else, yeah. So the way I look at it is, you know, this graph here, this is from what, the '60s? Yeah, it's a telescope from NASA. Revolutionary at the time. You look at it now and you go, what the heck is that? And that's the way I kind of look at rolling your own, especially making API calls on the stack itself. It just doesn't work well and it doesn't scale. As you add in multiple stacks, multiple regions, new tenants, new whatever, it gets complicated. So we went with Datadog.
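To make that concrete, here's a minimal sketch of what "rolling your own" looked like, assuming the python-novaclient and datadog libraries; the credentials, endpoint, metric name, and tags are illustrative, not what Lithium actually ran:

```python
# Poll the Nova API and push a gauge to Datadog: the "roll your own"
# approach described above. The library calls are real (python-novaclient
# legacy v2 signature, datadog); everything else here is illustrative.
from novaclient import client as nova_client
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY")

nova = nova_client.Client("2", "admin", "password", "admin",
                          "http://keystone.example.com:5000/v2.0")

# Count instances across all tenants and ship the number up as a metric.
servers = nova.servers.list(search_opts={"all_tenants": 1})
api.Metric.send(metric="openstack.custom.instance_count",
                points=len(servers),
                tags=["region:us"])
```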
We were already using Datadog for our application metrics. So on all of our VMs, you know, we're running, I think, 2,500 servers of some form, some flavor: public cloud, private cloud, bare metal, VMs on bare metal, in Datadog already. So being able to pull in our OpenStack metrics was really powerful. But one of the most important things up there are those top two bullet points. Users want graphs and dashboards. Nobody wants to see raw numbers. They want it to look nice and pretty. Historical comparisons and time series. It is way more important to see how I'm doing today compared to last week for some teams. From an admin perspective, perhaps not so much, but from a finance perspective, maybe you do care what our rate of growth of resource utilization is so that we can do budget planning for the future. We also wanted something easy to implement, and Datadog is: just a quick yum or apt install, really super easy to configure with all the standard provisioning stuff: Ansible, Puppet, Chef, Salt, you name it. And the open source aspect of it is really important to us. We worked very closely with Datadog when they were first working on their OpenStack integration. They were taking one approach to looking at the data, and we needed another approach too. We needed more than what they were looking at because of our custom use cases. And so we were able to take their code and put in PRs to get it to where we needed. And we were also able to build a second integration that extended the one they wrote and gave us all of the metrics that we needed. So it was really nice to be able to customize the monitoring solution but still push it all back up into their SaaS solution on the web to get those dashboards and graphs. Yeah. So to give a quick overview of what Datadog is and what we do: we're a SaaS-based infrastructure and application monitoring solution, as Mike was saying. The agent is open source. That's all the bits that run in your OpenStack environments or on your VMs, written in Python. We're always happy to accept contributions, but also feedback on how we can improve it. It's all up there, BSD-licensed, in GitHub. But really what we look at is time series data. So that's metrics and events, and we'll talk about the difference between the two in a little bit. We're looking at about a trillion data points a day at this point across our customer base. So we're working at quite large scale with folks that are running OpenStack and folks that are running in the public cloud on various other solutions. We like to help you build insightful dashboards, not just graphs that you kind of scratch your head about, help you democratize that so folks across your organization can build their own dashboards, and then bring in things like intelligent alerting. As your environments are changing constantly (you're in the cloud, whether you're doing auto-scaling or some other programmatic dynamic infrastructure), you can't really stick with threshold-based alerting. And so we bring things like outlier detection, and soon more algorithmic approaches to alerting, into dashboards and monitors to help you there. I'll also mention that we're hiring. If you want to work on integrating with various open source technologies, stop by the booth. We'd love to chat with you. So we're here talking about monitoring in OpenStack, but it's good to take a second and just talk about why monitoring is important.
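That "second integration that extends the first" is worth a sketch. In the agent of that era (v5), a custom integration is just a Python class dropped into checks.d plus a matching YAML config. The AgentCheck base class is real; the check name, config key, and metric below are illustrative:

```python
# checks.d/openstack_extended.py: skeleton of a custom Datadog agent
# check (agent v5 style), the mechanism for extending the stock
# OpenStack integration with metrics it doesn't collect out of the box.
from checks import AgentCheck

class ExtendedOpenStackCheck(AgentCheck):
    def check(self, instance):
        # `instance` carries one entry from the instances: list in the
        # matching conf.d/openstack_extended.yaml file.
        region = instance.get("region", "us")
        # ...query the Nova/Cinder/Contrail APIs here...
        hypervisor_count = 60  # placeholder for a real API call
        self.gauge("openstack.extended.hypervisor_count",
                   hypervisor_count,
                   tags=["region:{}".format(region)])
```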
You wouldn't drive with your headlights off in the rain. Why would you deploy anything in production without monitoring? You might have heard of test-driven development. I would suggest that you take it to the next level and consider monitoring-driven development. Start writing those monitors in your staging environments or in your development environments, even on your local machines. Not because I want you to send us more metrics, but because when you get to production, it's really important to know what normal is. Collecting that data is going to be cheap, whether it's in your development environments or in your production environments. I mean, we're here at a cloud conference. We know that we've commoditized a lot of the things that we interact with in terms of storage and compute. So yes, it's a lot of metrics, but you're going to find a place to keep it. You're a cloud operator or a cloud provider if you're in this room, most likely. The thing is, though, not having that data when you need it can be really expensive. How many of you run postmortems in your organizations when you have incidents? Cool, so most of the folks in the audience. And how many of you have done a postmortem where the only thing you can say to your boss is, "I don't know what happened, but I promise I'll add monitoring for next time"? People are shy about raising their hands on that one, but I guarantee you've all done it, and that's really what it comes down to. You're going to have incidents occur no matter how much monitoring you have. Hopefully you'll be able to get ahead of them and have a proactive response rather than a reactive one, but you're going to get some bugs on your windshield, and you want to know why and be able to move around them. So we say instrument all the things. That's collecting data from your OpenStack environments, from your underlying hardware, from the services running in your OpenStack VMs, and from all the stuff that powers OpenStack, whether it be RabbitMQ or MySQL or all the other bits involved. It all fits together and it all tells a story, and even if you only look at it 20% of the time, that 20% of the time that you look at it, you really, really want it. So let's talk about the operational complexity in our cloud environments and how things are starting to change. Operational complexity we define as the number of things you want to measure and the velocity at which things are changing in your environment. On average, taking some numbers from our customers, folks are collecting about 30 metrics per node or per instance from OpenStack. They're getting about 100 metrics per instance from the operating system, and about 50 metrics from any applications, whether they be off-the-shelf things like Redis or MySQL or what have you (some more, some less) or something they've written in-house. Just to add to that: that 30 metrics, that's just monitoring pure OpenStack. Once we started adding in SolidFire, we've got to monitor that: add in 50 metrics from there. We've got to monitor Contrail: add in another 50 metrics from there. And so it grows a lot beyond just monitoring OpenStack. So in there I'm mostly speaking about Nova, yeah. So you start to multiply that out. How many folks are doing things with containers here on top of OpenStack?
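Monitoring-driven development is cheap to start. Here's a minimal sketch using the real dogstatsd client from the datadog Python package; the metric names, tags, and handler function are made up for illustration:

```python
# Instrument application code from day one, even in dev, so you know
# what "normal" looks like before production. statsd.increment and
# statsd.timed are real dogstatsd client calls; the rest is illustrative.
from datadog import statsd

def handle_request(request):
    statsd.increment("myapp.requests.count", tags=["env:dev"])
    # Time the unit of work; this shows up as a timing metric.
    with statsd.timed("myapp.requests.latency", tags=["env:dev"]):
        return process(request)  # hypothetical business logic

def process(request):
    return "ok"
```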
Okay, good set of hands. So you're now taking that 150 and multiplying it by the number of containers you're running per node. So let's do some math. Let's say you had 100 instances. On average we're seeing people run about four containers per instance. That number is increasing over time, but that's the median of what we've seen. So that's about 400 containers. Again, this is all simple combinatorial math. You take the 150 metrics per container (the 100 operating system metrics plus the 50 application metrics we had), multiply by four containers, and add the 30 or so from OpenStack on the underlying instance. That's about 630 metrics a host. Again, not so bad, but you start multiplying that out and that's 63,000 metrics you're looking at right there, and hopefully you're collecting those at somewhere in the second to sub-minute granularity. In a world where things are changing all the time, you really don't want to be looking at five-minute averages or hour or week averages. The RRD methodology that we've come to accept from our Cacti and MRTG installs over the years really doesn't work in a world where things are changing by the second. So that's a lot to look at. It's a bit of metrics overload, and what we'll talk about today is a bit of how Lithium has managed to bring all of this together to put together some insights into their environment, as well as some monitoring theory from Datadog, what we call Monitoring 101, and just how to interact with your metrics and your data. So we've talked about the number of things that we have to monitor. Let's talk about the velocity, right? We did a study back in October on Docker usage and cloud usage, and what we're finding is that VMs are now lasting about 12 days. So it's not years; people aren't treating their VMs as pets like they might have their servers. They're not naming them after Greek gods or after planets and lovingly speaking of them anymore. They're churning pretty regularly, and in many cases, if you're in a dynamic environment, it's likely even less than 12. On the container side, we're seeing about three days, and that's decreasing all the time as well. So these things are changing constantly. The host half-lives that we used to see: if they were hours before, they're minutes now; if they were days before, they're hours now. And this applies to your deployments, this applies to your infrastructure. All of these things are changing much more rapidly as we move into everything being API-driven. Kind of makes you wonder. I'll change this slide before it gives somebody a seizure, but this is the kind of response people have as they're trying to figure out: what do I do with all of this data and all of this change in my environment? So the first challenge that people run into, and the type of thing they need in this kind of environment, is more dynamic management of their monitoring systems. If your goal is to detect normal, and normal changes all the time, you can't codify this in a Nagios config. You take a Heat auto-scaling implementation, you shove a Nagios config around it, and you're going to have a bad time. Your Chef runs or your Puppet runs or your Ansible runs, whatever it might be, are not going to converge fast enough in an auto-scaling world where things are changing all the time. This guy's hilarious, you should follow him on Twitter if you haven't yet. Honest Update; it's honest status updates.
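Spelled out, the arithmetic from that slide looks like this (the counts are the medians quoted above, not universal constants):

```python
# Back-of-the-envelope metrics math: per-container metrics times
# containers per instance, plus OpenStack's own metrics per instance.
os_metrics_per_container = 100       # operating system level
app_metrics_per_container = 50       # Redis, MySQL, in-house apps, ...
openstack_metrics_per_instance = 30  # Nova-level metrics

instances = 100
containers_per_instance = 4          # observed median

per_host = (containers_per_instance
            * (os_metrics_per_container + app_metrics_per_container)
            + openstack_metrics_per_instance)

print(per_host)              # 630 metrics per host
print(per_host * instances)  # 63,000 metrics across the fleet
```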
But yeah, I mean, we've got so much stuff going on here. These things are moving, you don't know where they are, you can't keep it in your head. And so you need to find a way to do it in a more dynamic manner. The other thing we need is a way to look at data from all of these different places. You know, Mike was mentioning, we talked about Nova a little bit; he mentioned that they're using SolidFire. You might be looking at Ceph. You probably have some hardware underneath all of this, because I know we're talking about the cloud, but it all has to run on something. So you probably have a pile of Cisco or Juniper gear somewhere that's powering all this, maybe some Cumulus gear. You've got the RabbitMQs and Cassandras, and you're probably using Contrail for your networks. And so all of this stuff is important to bring together into a single place to look at. And then, I know we're all in OpenStack here, but you're probably in some public clouds too. So I've got Amazon EC2 up there, but maybe you're in GCE, maybe you're in Rackspace, maybe you're in any of the other cloud providers. And so being able to look across your infrastructure and your applications across all of these different providers in a single set of dashboards and alerts is pretty important. And then finally, we were talking earlier about these ideas of programmatic alerting. So this is an example of outlier detection from Lithium's environment. This is one of their MySQL clusters. And what you'll see here is, rather than setting thresholds, what we've done is said: identify the nodes that are different from the rest of the cluster. And that's important, especially if you're in a load-balanced environment or a cluster environment, where you want all of your things to behave the same. Another lesson is that tooling really needs to look at things from a service-centric point of view. Stop looking at things from a host-centric point of view; the host is dead. We've got some amazing abstractions on top of our hosts and on top of our infrastructure with OpenStack and with other tools in the world. Stop trying to measure where things are. This is the Ptolemaic model of the solar system, right? Earth's at the center, and the sun and all the other planets are revolving around us. Look at all the crazy lines one has to go through to figure out where the other planets are in relation to yourself. That's what monitoring is like when you try to figure out where your services are and how your customers are interacting with them starting at the host, rather than coming from the service side in. This is much simpler. This is the modern model we look at today. So as we start to look at this, I guess the question is, where do our normal monitoring tools fit in? As Mike was saying earlier, they start to feel a little bit like a relic from the past, and something we might want to reconsider, both our strategies and the tools that we're using. So, with all of that data that we just talked about and the views into it, it's really important to decide how to show what it means to you and to others. For us, visualization is key. You may have seen this at the Tokyo Summit. I didn't re-grab the dashboard, but for a quick snapshot, this is a quick, easy way to see how our stack is running. It's a reduction of information, but it has comparisons and you can see some over-time type information.
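As a sketch of what that outlier detection looks like in practice, here's how you might create such a monitor through Datadog's Python API. The outliers() query function and DBSCAN algorithm are real Datadog features; the metric scope, tags, and message are illustrative:

```python
# Create an outlier monitor: alert when one MySQL node diverges from
# the rest of its cluster, instead of picking a static threshold.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Monitor.create(
    type="query alert",
    query=("avg(last_1h):outliers(avg:mysql.performance.queries"
           "{cluster:mysql-prod} by {host}, 'dbscan', 2) > 0"),
    name="MySQL node behaving differently from its cluster",
    message="A node has diverged from its peers. Check replication "
            "lag and slow queries before it pages someone.",
)
```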
You have some colorization for whether this is in a good state or not. And you're basically distilling that information overload down into a piece of data that somebody can glance at and know if something's working or not. Another important thing about visualization is that it's not just you that cares about this. I mean, don't get me wrong, you do care, but your users care. Your XYZ team cares. Your clients care, because they're running their services on here. It's not just you managing your OpenStack cluster. So your dashboards, even ones that are kind of private to your team, you still want to make them accessible. So if somebody's going through to see how the stack is doing, they have their view, but giving them the information for the other views as well is really empowering for them. Plus they don't bug you asking questions: hey, how much of this do I have going on? The comparisons of now versus last time are also super important. You know, here's the number of instances changed since yesterday or last week. So, hey, we're running 27 more. You can start to see the growth of how the stack is going. So that goes back to: your perspective matters. Let's call A your admins, the OpenStack admins, the people making sure the stack is up and running and the hardware's working. Then you've got B, your operators. They're the ones doing user management, tenants, quotas, all that kind of stuff: I'm managing this OpenStack application, not necessarily the hardware and infrastructure underneath it. And then you have C, the users, who are just like, hey, I've got an open highway, I'm in the cloud, woo, I just wanna go. Everybody looks at data differently. As an admin, you care about your standard system metrics, of course: your disk, your CPU. But you know how to manage that already on your hypervisors and the other pieces; there are a lot of guides out there for how to manage RabbitMQ. What's not always obvious is that you want to be able to store your data in a way that lets you compare hypervisor to hypervisor. Not just the CPU of this hypervisor versus the CPU of that hypervisor, but the CPU of a hypervisor that's running twenty 16-gig instances versus one running 24-gig instances. How does the performance match up? Is this one's CPU usage the same as that one's, even though it's not running as heavy a set of instances? That type of monitoring is crucial for admins: are you distributing your instances properly? And so tagging comes into play as really important there. With a lot of standard tools, it's very difficult to tag. Datadog's really nice because we can tag the hypervisors, not just with the class of machine of the hypervisor, but with the instances running inside those hypervisors; that kind of tagging can go along with it. Yeah, so tag by instance flavor. As an operator, you care a lot more about your quotas and your usage. How is this team or this client utilizing the stack, and is their growth going to go outside of what I have available for them? Are they able to launch new instances of this size, or do they only have that size available, and where am I placing them? From a developer perspective, they care a lot more about: is it up and running?
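Here's a hedged sketch of attaching that kind of tag to a hypervisor with the Datadog API. api.Tag.create is a real call; the hostname and tag values are illustrative, and in practice you'd drive this from your provisioning tooling rather than by hand:

```python
# Tag a hypervisor with its role, region, and the mix of instance
# flavors it's carrying, so dashboards can compare like with like.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Tag.create("hv-us-07",  # hypervisor hostname (illustrative)
               tags=["role:hypervisor",
                     "region:us",
                     "flavor_mix:16gb-heavy"])
```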
If I'm having application problems, I want to be able to dial back into B and A to troubleshoot that down, because, I don't know how many of you run into it, but when there's an application problem, sometimes it's "oh, it's the cloud that's broken," not the application. No, that happens. Let's go to the next one here. So, an important concept to remember when you're monitoring OpenStack: you are no longer monitoring an infrastructure stack. You are monitoring an application. It's actually a microservices application. There are n components and services in OpenStack, but it's not as simple as saying, oh, Nova, that's one service. No, because it has databases, it interacts with Rabbit, it has the control planes, and all of this stuff is tied together. You're talking dozens of services that need to be monitored. It's kind of a mind shift for a lot of people on ops teams who are used to managing infrastructure and are coming to the cloud, and it's very important to remember. With that, it's very difficult: you've got to be able to monitor how those things are working together. Your servers may be responding fine while the application is not. And the other thing to remember is that you're going to have more than one network going on there. You may have the network that runs the hypervisors, and then maybe a network for your control plane, and then, oh hey, you've got Neutron running your network inside OpenStack, but maybe Neutron's running Contrail behind the scenes, and all of a sudden you've got four different networks to monitor. If any one of those screws up, it may not be visible until you look deeper in the stack and see how it's actually impacting things. So that's why I take the "shared everything" view: OpenStack kind of is all tied together and shares everything. So with that, people always talk about how you can't compare apples and oranges. In this particular case, you're making a fruit salad. I don't know about you, but I hate it when I get a fruit salad and there's barely any watermelon in it. That's just me, I like watermelon in fruit salads, right? You may have a perfectly fine fruit salad and everybody loves it, but there's no watermelon, so something's not right with it. The whole thing, you've still got your fruit salad, it's still up and running, but your stack may not be. So here's an example of some graphs that we have. Top left there is the current workload in OpenStack. What you can see there is that there are jobs going into the OpenStack queue; in this case, I believe it was launching instances. The next one over: how many messages are in my RabbitMQ queue? Then you go into: what's my publish rate of messages going in there? Well, that kind of looks okay, right? But then all of a sudden you see I have this unacknowledged rate inside Rabbit. What this means is that somewhere along the line, somebody tried to launch an instance and it didn't start up. In this particular case, it wasn't Nova. It wasn't Rabbit. It went all the way down: was it Cinder? No, it was SolidFire. And in this particular case, it was because the tenant didn't have permission to attach a Cinder volume using SolidFire at this particular IOPS level. A user doesn't care about any of that. They just want their instance to boot up. "OpenStack is broken, right? Or Nova is broken."
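A sketch of catching that pattern earlier: alert when messages sit unacknowledged in the Nova queues. Monitor creation uses the real datadog API; the RabbitMQ metric name follows the Datadog integration's naming style, but treat the exact name, threshold, and queue tag as assumptions:

```python
# Alert on messages going unacknowledged in RabbitMQ, the first visible
# symptom in the instance-launch failure story above.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Monitor.create(
    type="metric alert",
    query=("avg(last_10m):avg:rabbitmq.queue.messages_unacknowledged"
           "{queue:nova} > 5"),
    name="Nova RabbitMQ queue backing up",
    message="Messages are going unacknowledged. Work down the stack: "
            "Nova -> Rabbit -> Cinder -> SolidFire QoS permissions.",
)
```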
But in reality, it was something very much behind the scenes. And this is one of the complexities of OpenStack: being able to find charts that show this. We talk about outliers being really important. Yeah, that's an outlier there in your RabbitMQ unacknowledged rate, but it went away. And honestly, do I really want to get blipped and paged on a 0.15 average rate of unacknowledged messages? I probably don't want to get woken up for that. So it doesn't matter to you, but your users care. They don't have their instance. And they hate the answer that, oh no, it wasn't this, it was way down here. We had an incident, a series of incidents really, where Elasticsearch would just lock up and crash. Oh, crud: you can't send any searches, can't index anything, it's just straight-up broken. What you'll see here again is that we were doing checks on the average current workload. Not very big blips there; some stuff's happening in OpenStack, no big deal, right? Well, you look at the TCP retransmits; there's a small blip, but, shoot, four of them in what is that, a five-minute window? It's TCP, that's fine. We're not gonna worry about it. It's not a high enough outlier for long enough that we care. But then when you look at the sum of the TCP retransmits for all the VMs running in that stack, the numbers start to get kind of high. It's like, oh, okay, but we weren't looking at that. We were looking at perhaps just the Elasticsearch servers. Where we finally tracked this down to was Contrail: our flow-limit-exceeded counts. You see the nice big chart there of those going up for an extended period of time. What was happening was that instances would be launched or destroyed in large numbers, which would then screw up the flow table in Contrail. We were running an older version of Contrail, by the way; this bug has been fixed. It was fixed a while ago, but we hadn't upgraded. And so that flow issue would tie up the network. Packets would be dropped and need to be retransmitted, and we still don't know why Elasticsearch didn't recover from it, but it still got wedged. The problem here is that we would be looking at individual graphs, and everything looked fine. Who cares about one TCP retransmit in a five-minute window? But you have to take all of this together. This took us weeks to resolve, by the way. Not a fun one. So, Mike's given us some great examples of troubles he's had in the Lithium environment and how good monitoring tooling helps. I hope you choose Datadog, but if you choose something else, the point is that good monitoring and good visibility into your infrastructure is important. But we've talked about all these metrics, and the question is, which of these things do you alert on? Which of these things do you just keep around as resources for your investigations? We talked about how you want to monitor everything so that you have it when it comes down to that postmortem time or that incident response. And so we wrote this guide, Monitoring 101. It's up on our blog. I'm gonna give you the TL;DR edition for the Reddit fans out there. The links are in the slides; we'll post them up on the summit site at some point, and you can check them out. But we'll dive into this real quick.
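The aggregation lesson from that incident fits in one query. A per-host average hid the problem; the fleet-wide sum exposed it. system.net.tcp.retrans_segs is the Linux TCP retransmit counter the Datadog agent reports (treat the exact name as an assumption), and the tag scope is illustrative:

```python
# Query the same metric two ways: the per-host average that looked like
# noise, and the fleet-wide sum that finally showed the Contrail problem.
import time
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

now = int(time.time())
per_host = api.Metric.query(
    start=now - 3600, end=now,
    query="avg:system.net.tcp.retrans_segs{region:us} by {host}")
fleet_sum = api.Metric.query(
    start=now - 3600, end=now,
    query="sum:system.net.tcp.retrans_segs{region:us}")
```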
So as you're trying to figure out how to avoid pager fatigue and what's important to alert on, we encourage you to break your metrics down into one of three categories: work metrics, resource metrics, and events. We'll get into defining those in a second. But think of your application or your OpenStack environment as a bit of a factory. I think this slide is a textile factory of some sort, but I'm gonna go with cars, because we all know what those are. So you've got a factory and you're trying to figure out what you're doing here. Work metrics are the actual output. They're the things that your customer is consuming. If you're an OpenStack provider, it's likely the number of VMs you're letting folks spin up and shut down; that's your throughput. In the car factory example, it's the number of cars coming off that assembly line. Success versus error rates are things like: how many of those cars show up without all their hubcaps? How many of them are ready to go to the showroom floor and be sold to your customers? And performance is how long it takes you to make one individual car. Put those together and these are the symptoms. These are the things that your customers care about, the things that your customers are paying you for. Resource metrics, on the other hand, are all the parts that go into making those cars. It's this pile of tires, and how many of those do you have available to put on the next set of cars? In OpenStack, it's the capacity in your environment: how many more VMs could you be spinning up and down, and what does your utilization look like? And the last one is events. Let's say, again, we're a car factory in this example. Let's say your CEO got on stage, promised some car that doesn't exist yet, and sold 350,000 of them in a week. That's an event that has now changed your queue: you have way more things to produce in a very short amount of time. You can put that on your graph, overlay it, and understand why your metrics went one way or another. Maybe you changed the formula of your car, or in this case how your team assembles the car on the assembly line, and it got you more out of your resources or it slowed you down. These are the types of things you pull in around all of the metrics you're collecting to figure out why things are going one way or another in response to the changes you made. So in your OpenStack environment, maybe you've upgraded to the latest release on the first day it came out, or maybe you've made some other change in your environment. You're gonna want to overlay those on top of your graphs to figure out what's going on. So let's take some examples from Nova. The work metrics are what customers are consuming from you: the number of VMs running at a given time, being able to create and start VMs, the number you're letting them spin up or shut down at a given time, things like IO performance and network performance. On the other side, the resources are things like: are the hypervisors up, available, and usable? How many VCPUs do you have ready to go, so you can keep track of utilization? RAM and disk, things like that.
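As a sketch, here's how that bucketing might look for Nova, written down as data. The metric names follow the style of the Datadog OpenStack integration, but treat the specific names as illustrative; the categories are the point:

```python
# Monitoring 101 applied to Nova: classify what you collect, because
# the category decides whether it pages you or just aids diagnosis.
NOVA_METRIC_CATEGORIES = {
    "work": [                                 # what customers consume
        "openstack.nova.running_vms",         # throughput
        "openstack.nova.server_boot_errors",  # success vs. error
        "openstack.nova.server_boot_time",    # performance
    ],
    "resource": [                             # what the work consumes
        "openstack.nova.hypervisor_up",
        "openstack.nova.vcpus_available",
        "openstack.nova.free_disk_gb",
        "openstack.nova.free_ram_mb",
    ],
    "events": [                               # context, not numbers
        "upgrades", "code deployments",
        "auto-scaling actions", "instance migrations",
    ],
}
```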
And on events, again, these are gonna be the upgrades, things like we talked about before: code deployments, whether in your VMs or to your VMs, auto-scaling events in your environment, instance migrations and creations, adding and removing nodes, all of that. These are the things that are gonna give you context as to why the utilization on your cluster is higher or lower. So we've talked about events and the idea of overlaying. Here, for example, we have a graph of IO usage in the cluster. When I say overlaying, I mean those pink bars that run over it. Those are events in our environment: we've added or removed nodes from the cluster, and you can see how that's impacted IO. So when do we wake our teams up? Nobody likes to be woken up at three in the morning. What we encourage you to do is page on those work metrics, the things that your customers are consuming from you. Can't spin up a VM, can't shut down VMs: those are the things your customers in the OpenStack environment care about. If it's an API running on top of all of this, it's when your customers can't make those API calls. These are the things that make you money, the things that customers are coming to you for. You then use the resource metrics, the events, and other work metrics from your environment to find out why the situation is occurring. I realize I'm at an operators' conference and this joke probably falls a little flat here, but your CEO probably doesn't page you in the middle of the night to say, everything's working fine but CPU usage is kind of high. What they call to say is: a customer tried to spin up a VM, a customer tried to make an API call, it didn't work, why is this happening, fix it now. And it's important to remember that when we say alert, alert doesn't necessarily mean page. It could be email. It could also just be creating a record in some sort of time series, as we were talking about earlier. If it's going to wake you up, make sure that it's important and that you have to act on it now. And you want those alerts to include what to do about it and who to call next if you get stuck. Every one of these should be actionable. The minute your team gets so many alerts that they're ignoring their pager, that's called pager fatigue, and it's not a good situation. I used to work with teams back when I was an infrastructure operator at Ooyala, and I'd see their PagerDuty queue: 10,000 alerts in one week to their on-call. And I'd say, when do you sleep? And they'd say, oh, I looked at it, it looked fine, I went back to sleep. That's not a good situation. If your pages are what tell you that you're doing OK rather than that you're doing poorly, you've got it flipped around. You want to fix that. So avoid cryptic alerts like this. Don't tell people "Nova node 1 is down" or "DB server 1 is down." Nobody knows what to do with that at 3 in the morning. They've swapped all of that information out to disk. They're dreaming about something, they're talking to their family, they're having a good time. And what do you do with this alert? Nobody knows; you've got to go look in some wiki page. You want these things to say exactly who to call and what the business impact is, and even link from your alerts to the wiki page, or include the contents of that wiki page directly in the alert, so that you know what to do next.
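Here's a hedged sketch of what that actionable alert body might look like as a Datadog monitor message. Conditional template blocks like {{#is_alert}} and @-handle notifications are real monitor-message features; the runbook URL, handles, and wording are illustrative:

```python
# An actionable page: symptom, business impact, first steps, runbook,
# and escalation path, instead of "nova node 1 is down".
ALERT_MESSAGE = """\
{{#is_alert}}
Customers cannot boot new VMs (work metric dropped to zero).
Impact: instance provisioning is failing for all tenants.
First steps: check nova-api health, then RabbitMQ unacknowledged rates.
Runbook: https://wiki.example.com/runbooks/nova-boot-failures
Stuck for 15 minutes? Escalate to @cloud-oncall-secondary
{{/is_alert}}
@pagerduty-cloud-team
"""
```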
But yeah, you're going to get these alerts. And when you do, the flow you'll take is: look at your work metrics, see which one changed and what symptoms your customers are experiencing. Hopefully these are leading indicators rather than trailing indicators, and you're responding before they've paged you via a customer ticket or what have you. Then you use those, look at your resources, look at the events, and diagnose. If you can't find it within the service that you built, every other team in your environment that you depend on has some sort of work metric of their own. So if your work metric is the number of VMs you're creating and shutting down, and all of a sudden there's an issue there, maybe below that the next work metric is something coming out of Contrail, in terms of packets coming in and out of the environment. To Nova, that's a resource; to Contrail, that's their work metric. Each layer all the way down the stack has a work metric, a thing it provides. And it's true in your application stacks as well. The work metric your customers notice might be those API calls; the work metric that you as an application developer notice might be something from your DBA team, maybe SQL query latency or something to that effect. Every team has one. So figure out what your work metrics are, figure out what your resources are, graph that out, and that's how you know what to alert on. And so that brings us to what's missing. Some of these are complaints on my part, but some of them are things you need to think about for your own stack. With what Ilan was talking about here, the different services and teams with those metrics, tagging is important so that you can aggregate and overlay them against each other: oh, this did that, so I can look at the graphs for this and see the whole thing together. The problem is that it's very difficult to see at a glance how OpenStack as a whole, across those dozens of services, is actually performing. You still have to dig into each one to figure out what's going on if something isn't working. So that's one of the problems I see right now with OpenStack monitoring as a whole. Graphing this stuff isn't gonna help us if there's so much there. It's hard to say, when those different services go outlier here, outlier here, outlier here, now I care about a problem. So that's one of the challenges right now with monitoring OpenStack: that correlation of disparate metrics. You've got to tag and monitor the applications you're running just like your OpenStack application, because when your hypervisor suddenly gets a huge CPU spike, you want to be able to look at the applications running in the VMs on that hypervisor to see if they're suddenly sucking up a bunch of resources and creating a noisy neighbor. Or vice versa: all of a sudden your hypervisor is doing something weird, maybe okay for the hypervisor, but now it's causing your applications to perform poorly. So you've got to be able to correlate those kinds of things together. Sorry, one last thing on that one, real quick: service degradation, when services don't perform as well as they should. You've got to know, for your stack, what's important. If this service isn't working right, it may not bring down the whole stack, and that may be an acceptable risk.
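One way to make that noisy-neighbor correlation possible is to tag guests with the hypervisor they land on at provisioning time. Here's a minimal sketch; the hypervisor tag is an assumption (something your provisioning tooling would attach), and the metric names and scopes are illustrative:

```python
# Two dashboard queries meant to sit on the same graph: the hypervisor's
# CPU next to the CPU of the guest workloads running on top of it.
# Assumes each instance carries a `hypervisor:<name>` tag applied when
# it was provisioned.
hypervisor_cpu = "avg:system.cpu.user{role:hypervisor,host:hv-us-07}"
guest_cpu_by_app = ("avg:system.cpu.user{hypervisor:hv-us-07} "
                    "by {service}")
```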
So with your monitoring and your alerting, you've got to decide for yourself what matters. If you can't shut down instances in your private cloud, does that really hurt you compared to not being able to start instances? These are judgment calls for your stack, which makes it really hard, because it creates these missing pieces in the monitoring that aren't easy to solve. I haven't come up with a good way for us to do it at Lithium, so if you've got great ideas, let's share them with the community so that everybody can solve this problem. So, I've got that one. Anyways, we're at about 12:45 now, which is the end of the session, but we've got a break here. We're happy to hang around and chat with you a little bit about how we're interacting with OpenStack at Datadog, how Lithium's using it in their environment, and some of the monitoring challenges they've encountered or solutions they've come up with that might be helpful to you. Also feel free to stop by the Datadog booth. I'd love to give you a demo of what we're doing with OpenStack and talk with you about some of the roles we're hiring for this week at the summit. So thanks.