Hey, everyone. I'm Jonathan Kelly. I'm currently working as a solutions architect with the Rackspace OpenStack private cloud group, but I have the dubious honor of having had pretty extensive monitoring experience over multiple roles in my career. At some point, I realized I was never going to be able to escape from monitoring, so I've learned to love and embrace it. And that's what I'm here to talk to you about today: monitoring OpenStack in the enterprise.

So, show of hands: how many of you travel by car on a regular basis? How many of you have ever experienced an unexpected breakdown? Wouldn't it be great if your car could tell you, "Your alternator just broke. I know everything seems OK, but if you don't get this fixed now, your car is not going to start tomorrow morning"? Or if it could tell you, "Hey, judging by the wear level on your wheel bearings, you're going to need to replace those within 1,000 miles"? Or when your car does break down, if it could tell you what specifically broke and what parts you're going to need to replace to get it running again? That would be pretty great, right? And that's what monitoring can do for you, and that's what I'm here to talk about today.

So here's what we're going to cover. We're going to talk about the benefits of a robust monitoring solution. We're going to talk about the differences between alerting and metrics, and why both are critical to a well-designed monitoring system. We're going to talk about the importance of correlation and suppression to remove noise from your monitoring system. We're going to talk about creating effective alerting and walk through an example of how to do that in the context of OpenStack. We're going to talk about service level validation: how to prove your environment is functioning within certain parameters. And we're going to talk about forecasting growth, so you can grow your environment before you start to run into performance issues.

So let's look at two scenarios. Scenario one: you wake up at 3 o'clock in the morning. You have 700 emails in your inbox. Your phone is blowing up. Every second, more alerts are coming in. You take a look through them and try to figure out what's going on while you're still trying to shake off the sleep. There are hundreds of alerts. It looks like everything's broken. Some things are going down and coming back up. Your main site is down. It's really not good. You know there was a scheduled maintenance last night, so you call the dev on call, and he tells you, "Oh, we exited the maintenance window two hours ago. QA passed. Everything was still working properly." So you're digging through all your systems, trying to figure out what broke and how to fix it. An hour passes. Your boss calls you up and tells you you've blown your SLA for the month. A really bad situation to be in. I'm not going to ask for a show of hands on this one, but I will say this is not an uncommon scenario in the IT world. I'm sure there are several of you in the audience who've experienced this before, perhaps every night for a month, and it's not a good place to be. That's an example of how a bad monitoring system can almost be worse than no monitoring at all. Although the worst thing of all is to have a customer or an end user call and tell you that your system is down, because you didn't know about it.

Scenario two: you wake up. You see about a dozen messages in your inbox, and it basically just looks like your MySQL VIP is flapping.
So you log into your monitoring console and look at the related alerts. You see that all of your MySQL read slaves are hitting max connections. That, of course, is causing them to fail health checks with the load balancer and drop out of the pool; then, after 60 seconds, when all the connections time out, they start passing health checks and get re-added. So it's causing your MySQL VIP to flap. You log into one of the servers that's currently failing health checks, take a look at the connections that are active, go back to one of the web servers that has a connection open, and take a look at the application. And you see, hey, this application was modified 30 minutes ago. That's an hour and a half after the maintenance window supposedly ended. So you take a look, and you see it's not closing out database connections. That's great. So you call the dev on call, let them know what's happening, disable the application, write up and send out a root cause analysis, and within 20 minutes you're back in bed.

As someone who's personally lived through both of those scenarios, I can tell you that the latter is a much better place to be, and that's what taking the time to invest in building a good monitoring solution will provide for you. This was something I learned out of necessity, because I wasn't sleeping, and getting woken up every night at 2 in the morning is no fun. So I guess the point of that is that taking the time to develop a good monitoring solution is probably one of the best investments you can make in your platform, for a variety of reasons.

When I first started out in my IT career, I worked as a sysadmin, and I really thought of monitoring in terms of: something breaks, you get an alert. More things break, you get more alerts. And then there are those alerts that come in at 3 in the morning when your cron jobs run, and you know you can ignore those, because those don't matter; nothing's really actually broken. And you develop kind of an apathetic attitude towards monitoring. To some extent, that mindset makes sense when your job is entirely based around incident response and triage. But when you become a platform owner, a lead engineer, or something along those lines, where you're actually responsible for the uptime of your platform, you realize just how lacking that approach is. You realize that you need to know, ideally, before things break. You need to know before you run out of memory. You need to be able to forecast your growth requirements and track your utilization. You need to take a more proactive approach to monitoring. And when you do encounter an incident, you need to create a new set of monitoring tools to ensure that if it occurs again in the future, you catch it, even if you feel like you've permanently resolved it, because regression is a real thing in IT. If there are any developers in the audience, we all know how much time we spend on regression testing in the software world, but regression is a real thing in infrastructure as well. Just because something's fixed today doesn't mean you're not gonna see it again, probably six months or a year from now, when you've forgotten exactly what caused it and how to fix it.
So I hope that, at least for some people in the audience, these concepts reflect a change in the way you think about monitoring, and that there's something you can take away and apply to your job in a way that's going to make your life a little bit better.

So what specifically can monitoring do for you? Monitoring provides a variety of benefits. It can improve application uptime by reducing the amount of time it takes to resolve incidents, by helping you perform a better quality root cause analysis, and by permanently resolving a greater percentage of incidents. It can reduce the burden on staff by ensuring that every alert that comes in is something that's meaningful and needs to have action taken on it; people don't have to waste time sorting through "Is this something I need to do anything with? Does this matter?" And you can improve performance by forecasting growth requirements before you start to run into performance bottlenecks and have outages or performance degradation.

Before we dive in, we're gonna talk a little bit about monitoring semantics. There are two main facets to monitoring: performance metrics and alerting. When a lot of people think about monitoring, they think about performance metrics, which are, you know, CPU, disk, memory, and others, which we'll talk about in just a sec. This is something most people are probably familiar with. The purpose of performance metrics is to let you know the utilization of various aspects of your system; in a more abstract sense, it's to measure and report on quantifiable data from your system. Performance metrics provide you with information that allows you to identify whether resource constraints are relevant during an incident or while performing a root cause analysis, as well as to forecast growth. To contrast that, alerting is oriented around identifying failure states in the environment. We'll dive a little deeper on that in a sec, but in short, alerting is taking the events that occur in a system, the discrete changes that occur in a system, and identifying which of those actually indicate a failure state and require action to be taken upon them.

So, performance metrics, also called quality of service or QoS metrics (Ceilometer just calls them metrics): they're basically a way to track and record time series data. While an incident's occurring, or while you're performing a root cause analysis, it's useful to be able to determine whether it's the result of a logical system failure, something breaking in an application, or whether it's the result of insufficient resources somewhere in the infrastructure. That's where QoS, or performance metrics, come in handy in an incident response context. It's really profoundly useful; it can reduce the resolution time of incidents dramatically. If you don't know what your system resource utilization looks like across the cluster and on individual machines, you can spend a lot of time trying to identify application failures when there's no real application failure; the application is just not responding because it has inadequate memory or network resources, what have you.

So, CPU, disk, and memory: everyone's familiar with those, and when a lot of people think about monitoring, that's what they think about. But there's a lot more you can do with performance metrics than just that. Some really useful ones at a platform level are page load time and application response time, and then, at a system level, disk I/O.
For most people who have a content delivery system, page load time is one of the most important metrics that you have, and it's one that not many people think about. I don't remember the statistic off the top of my head, and this is a bit extemporaneous, so pardon me, but the abandonment rate for web pages goes up exponentially as the page load time increases beyond one second. This is something that every major site tracks and attempts to drive down as low as possible, and it's something you should be tracking.

So what do performance metrics look like? This is just an example with some mocked up data, in case anyone's not familiar with this at all. In this case, we're going to say that we're measuring page load times once per minute, and this is what some data may look like. You can see you have a timestamp, you have a metric name, and you have a measurement. In this case, our main page is taking 0.73 seconds to load, and you can see we had another record and another record, and over time that aggregates into some very useful data, which we'll talk about in a sec. Here's some other sample data, measuring the five minute load average of web servers: a bunch more simulated data where you have a main page, a login page, a mobile page, and three web servers you're monitoring. You can see that if you were doing incident resolution and you saw a load average like 382.7 on a server, you'd know that there's probably something going wrong there. Typically, this data is going to be stored in a database somewhere, and most monitoring solutions have a way to view it in summarized form via a graph over various time periods: a five minute graph, a one hour graph, a one day graph, a one week graph.

So, alerting, as we discussed, is when events meet criteria indicating an action is required, with an event just being something that happens in the environment. You create a VM in Nova: that's an event. You create a network in Neutron: that's an event. Someone logs into Horizon: that's an event. As you can imagine, in a large environment, there are potentially hundreds or thousands of events being generated every second. So there's this vast amount of things occurring in the environment, and the true art of monitoring, in my mind, is building a strategy for taking all of those events and turning them into a meaningful alerting strategy. We'll show you how to do that in just a couple minutes. I also wanna point out that the "something happening", the events in the environment, includes things like threshold alerts. So if you say that 80% CPU utilization is something meaningful and significant to you, crossing that threshold above 80% would be an event. Conversely, crossing back down under 80% would also be an event. That's something not many people consider, but it's useful in the case of transient error states, things you may not necessarily wanna wake someone up at two in the morning for, but that you definitely want logged.

So, some sample alerts. Here's an example of what some alerts might look like. These are made up of a timestamp, a severity level, and a message; they might also have a host associated with them, in monitoring systems where that's relevant. In this case, we can see our flagship website, hatsforcats.com, is currently down: the HTTP VIP is failing, one of the web servers is failing health checks, and our DB read VIP response time is exceeding the threshold of two seconds.
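To make those semantics concrete, here's a minimal sketch, not from the talk's slides, of how a single page load check can feed both facets at once: a time series record, plus threshold-crossing events in both directions. The URL is a placeholder, and print() stands in for whatever metrics store and alert queue you actually run.

```python
# A minimal sketch, assuming a hypothetical URL and using print() as a
# stand-in for a real metrics store and alert queue.
import time
import requests  # any HTTP client works

THRESHOLD = 2.0  # seconds, matching the DB read VIP example above

def check_page(url, metric_name, last_ok=True):
    """Measure one page load; return whether we're under the threshold."""
    start = time.time()
    requests.get(url, timeout=10)
    elapsed = time.time() - start

    # Performance metric: timestamp, metric name, measurement.
    print(f"{int(start)}\t{metric_name}\t{elapsed:.2f}")

    # Events: crossing the threshold in *either* direction is an event,
    # just like crossing back under 80% CPU in the example above.
    ok = elapsed < THRESHOLD
    if ok != last_ok:
        print(f"{'clear' if ok else 'warning'}: {metric_name} "
              f"crossed the {THRESHOLD}s threshold")
    return ok

ok = True
while True:  # one sample per minute, as in the example data
    ok = check_page("https://hatsforcats.com/", "load_time_main_page", ok)
    time.sleep(60)
```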
So talking about alerting leads us to our next topic: correlation and suppression. According to Gartner, 80% of the mean time to repair is wasted trying to figure out where the issue occurred. And that's the purpose of correlation and suppression: to remove noise from your monitoring system and to make alerts meaningful. Imagine you have 200 web servers and an update breaks all of them. You probably don't want to receive 200 alerts; ideally, you want to receive one alert that tells you all of your servers are down. It'd be even better if it could tell you specifically how or why. And during day-to-day operations, maybe they don't all go down; maybe one or two of them go down. If you have 200 web servers and one web server is down, is that going to have an impact on your customers and users? Probably not, right? So you may not want that to be something that disturbs one of your sysadmins. You may decide, hey, we don't really care unless there are three web servers down. And that's what you could use correlation for. Correlation is basically taking multiple related alerts and either summarizing them into a single alert or generating a new alert based upon that data. We'll cover another example in a sec.

Suppression is just preventing notification on alerts, and this is useful for a couple of reasons. One: if an alert triggers over and over again, you don't necessarily want to get spammed with it every minute when the check runs. Another, more meaningful, use case is when you have a more complex application where a bunch of services depend on each other. If a downstream service breaks, you don't necessarily want to be notified of every upstream service that's broken. If you can be certain that that's the only thing causing the issue, then you basically want to know, hey, this is a database problem, I need to go fix the database. You don't want to see that all your web servers are down, all your miscellaneous applications are down, and your API services are down, because then you end up with 20 alerts and have to sort through all of that to figure out what's causing the issue. And that increases the amount of time it takes to resolve the issue, as Gartner points out.

A quick example of correlation would be replication lag on MySQL slaves. Lag on a single slave is not really unusual: a long running query, for instance. There are a variety of circumstances where you're gonna see some replication lag on a single server, and it may lag a couple seconds behind the master. But if all of your slaves are lagging behind the master, that usually indicates that there is an issue in the environment. So instead of seeing something awful like that, where you're just getting spammed with database alerts, you could use correlation to turn it into a single meaningful alert. An example of suppression: let's say your MySQL read VIP is down. You know that all your dynamic content health checks for your web services are gonna fail; they can't access the database to generate dynamic content. So rather than seeing a whole bunch of alerts that your flagship site's top 10 hats page is down, your login page is down, et cetera, in addition to the read VIP being down, you would wanna suppress those and just see where the problem is actually occurring.
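To make the correlation example concrete, here's a sketch of a rule that collapses per-slave replication lag alerts into one meaningful alert. The alert format and check names are illustrative assumptions, not from any particular monitoring product.

```python
# A sketch of the MySQL replication lag correlation described above;
# the dict-based alert format and check name are illustrative.
def correlate_replication_lag(alerts, all_slaves):
    """Collapse per-slave lag alerts into one summary alert when every
    slave is lagging; pass isolated lag through untouched."""
    lagging = {a["host"] for a in alerts if a["check"] == "mysql_repl_lag"}
    if lagging >= set(all_slaves):
        # Every slave lags the master: almost certainly an environment issue.
        return [{"severity": "critical",
                 "message": "All MySQL slaves lagging behind master"}]
    return alerts

slaves = ["db1", "db2", "db3"]
spam = [{"host": h, "check": "mysql_repl_lag"} for h in slaves]
print(correlate_replication_lag(spam, slaves))
# -> one critical alert instead of three per-slave alerts
```

A real engine would also handle the "three of 200 web servers" style thresholds described above; the all-slaves test here is just the simplest case.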
And this brings us to what is most important, which is ensuring that our feline friends have access to fine quality headwear. That's what this is all about, after all. And by applying these principles, we can do that. Of course, it's a metaphor for whatever quality service you're providing to your customers or end users. I don't think hatsforcats.com is registered, by the way, if anyone wants a great business idea. Oh, someone beat me to it.

Before we dive in on actually building an alerting strategy, I wanna cover one more thing real quick: monitoring perspectives. There are a variety of different ways that you can approach monitoring, but two perspectives have, in my experience, been the most useful: one being what I call transactional monitoring, and the other being system level monitoring. Transactional monitoring is monitoring from the end user's perspective. You're not looking at running processes on the system; you're not looking at load average or anything like that. You're looking at web requests, synthetic transactions, API calls: simulating things from the perspective of your end user. This approach has the advantage of showing the actual health of your environment from the perspective of your users. If you have a VIP with 200 web servers behind it and a couple web servers are down, there's not gonna be noticeable impact to an end user, and you wanna know that. You wanna know that there's no disruption in the environment. You also wanna know what the page load times look like, et cetera, et cetera. This is the approach you take to identify how the outside world is seeing your system, the outside world potentially being internal customers, of course, or what have you.

The system level perspective, as a platform owner, is what you care about more, I think. This is what lets you identify that an incident is actually occurring. Maybe 100 web servers are down; the end users are noticing pages are maybe a little slow, but to the outside world, everything's still operating. Underneath the covers, though, everything is going crazy. You wanna know that. And in addition to knowing that there's an incident, you wanna know what specifically is broken, and you wanna have access to all the data you need to troubleshoot it quickly, get it resolved, identify what the root cause was, and hopefully fix it permanently so the same issue doesn't occur again. That's the purpose of the system level view.

So with that context, how do we apply all this to OpenStack? We're gonna take a look at Horizon and walk through a simple design exercise on how we could monitor it. We start off by thinking about what we wanna check on Horizon. Well, let's keep it simple: let's say we wanna perform a content check to make sure that Horizon is loading properly, and we wanna perform an authentication check to make sure that people can log into Horizon. So we've created two alerts there and assigned an alert ID to each of them. I'm gonna keep the actual method of health check fairly abstract, but there are a lot of good examples on the OpenStack documentation site on how you can make these calls remotely via curl or what have you, and whatever monitoring system you're using will probably have some plugins for HTTP health checks. The goal is to keep this as monitoring platform agnostic as possible, so it's hopefully useful to everyone here. So let's think about this real quick. We're performing a content check on Horizon. If that fails, if Horizon isn't even loading, are we gonna be able to authenticate? Do we care about an authentication failure?
No. If the platform's not working at all, if Horizon's not working at all, you're not gonna be able to authenticate. So we're gonna add a suppression ID there and say, hey, if this Horizon content alert triggers, we wanna suppress all alerts that match that Horizon auth ID. That way, if Horizon is down, you don't see both "Horizon is down" and "you can't authenticate to it", because that's fairly redundant.

Let's think about dependencies. What services does Horizon depend on to operate properly? Obviously, Apache is what serves out the web content, so we need a way to monitor Apache via a health check. And we know that if Apache fails, both our content check and our authentication check are gonna fail, which means that if that alert occurs, we need to suppress both the Horizon content alert and the Horizon authentication alert. We'll walk through some examples of what the suppression would look like, and what actual alerts you'd see in the event of various failure states, in a sec.

Other dependencies: Keystone. Horizon's authentication depends on Keystone, so we need a health check for that. You could do that via just a curl call to the API or using a Python script, whatever your monitoring system allows. If that health check fails, your content checks are not gonna fail, and your Apache health checks are not gonna fail, but you are not going to be able to authenticate into Horizon, which means we're gonna need to suppress that authentication alert from Horizon. So if Keystone's down, you can't authenticate into Horizon, and we don't care that you can't authenticate into Horizon, because the root cause of the issue is that Keystone is down, and that's the thing you need to fix first.

Finally, what does Keystone depend on? Keystone depends on MySQL; that's where it gets its user data from. So you need to perform a health check on MySQL. And at this point I'll mention something that applies to all of these. When we were talking about the system view versus the transaction view, one of the big differences is that with the system view you're typically monitoring an individual system, while with the transaction view you wanna run your monitors against the VIP. So in this example, if you're running an HA configuration where you have multiple instances of Horizon, multiple instances of Apache, multiple instances of Keystone, et cetera, then in order for these alerts to be meaningful in terms of showing you the health of the overall platform, you need to be monitoring against the VIP, not against individual nodes. Because if you have three Apache servers and one's down, and it's been pulled out of the load balancing pool, everything's still functioning properly, and you wanna know that the overall platform is still working. From this perspective, you don't necessarily care that that one node is failing. So when we're talking about these health checks, if you're running an HA environment, you wanna make sure they're against a VIP rather than an individual node. I'll take questions in a sec.

Finally, if you wanna be super paranoid, which I am: I think it's almost impossible to over-monitor edge cases, because while it takes you five minutes to set up an alert for one, those are the ones that will take you three hours to troubleshoot if they actually occur in reality. And as someone who's spent three-plus hours troubleshooting a memcache cache key corruption issue: if you can monitor it, you probably should.
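Before adding the paranoid extras, here's a rough sketch of two of the checks already on the table: the Horizon content check and the Keystone health check, run against VIPs as discussed. The hostnames, the expected page text, and the monitoring credentials are all placeholder assumptions; the Keystone call is the standard v3 password authentication.

```python
# A sketch of the content and Keystone checks from this design exercise,
# run against VIPs. Hostnames and credentials are placeholders.
import requests

def horizon_content_check(vip="https://horizon.example.com/auth/login/"):
    """Content check: the Horizon login page loads and looks like itself."""
    r = requests.get(vip, timeout=10)
    return r.status_code == 200 and "Log In" in r.text

def keystone_auth_check(vip="https://keystone.example.com:5000"):
    """Health check: Keystone issues a token to the dedicated monitoring user."""
    body = {"auth": {"identity": {"methods": ["password"], "password": {
        "user": {"name": "monitoring",
                 "domain": {"id": "default"},
                 "password": "CHANGE-ME"}}}}}
    r = requests.post(vip + "/v3/auth/tokens", json=body, timeout=10)
    return r.status_code == 201  # Keystone v3 returns 201 Created with a token

for name, check in [("horizon_content", horizon_content_check),
                    ("keystone", keystone_auth_check)]:
    print(name, "OK" if check() else "FAILED")
```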
So in this case, we'll monitor the MySQL database record as well. That would probably be a simple select against the Keystone user database to ensure that the user you're using for your monitoring platform still exists. In the event that there's a restore to the MySQL database, there's some corruption, or something inadvertently gets deleted and your user's no longer there, this will tell you that that's the problem. If that alert was not there, you could just drop the Keystone table and your Keystone health check would fail, but the issue would not actually be at the Keystone level; it would be at the MySQL level. And you can see the suppression's a little different for that. If the DB record is not there, we want to suppress the Keystone health check and the Horizon auth health check, because authentications are gonna be failing across the board. This brings us to another point, which is that if the MySQL health check fails, that DB record health check is gonna fail as well, because you can't run a select against a database server that's not operating.

So what we've got from this is, well, first you can see that just saying "hey, let's monitor Horizon" is a little more complex than it sounds. If you want a really robust monitoring solution, you don't just want to monitor the thing you're trying to monitor; you want to monitor all the dependencies down the chain, so that when something breaks, you know where it's broken, and you get a single alert that tells you what you need to start troubleshooting.

To blow that up, in case it was hard to read: these are the suppression rules we've created. We've got a set of six alerts and a set of suppressions associated with each of them that makes them meaningful in the event of a failure state. If MySQL fails, you're gonna see a critical alert, "MySQL read VIP down", and we'll suppress the Horizon auth and Keystone auth alerts. (I just made up these criticalities and messages; depending on your system, the criticality is gonna vary.) If Apache fails, Horizon content gets suppressed and you get an error saying that your OpenStack admin Apache server's down. If Keystone fails, we're gonna suppress Horizon auth alerts and tell you that the Keystone VIP is down, et cetera.
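That suppression table translates almost directly into data. Here's a sketch, using the alert IDs from this walk-through, of a tiny engine that applies the rules; the exact mechanics will differ in any real monitoring product.

```python
# A sketch of the suppression table from this exercise as data, plus a
# tiny engine that applies it. The dict format is illustrative.
SUPPRESSES = {
    "mysql":              {"keystone_db_record", "keystone", "horizon_auth"},
    "apache":             {"horizon_content", "horizon_auth"},
    "keystone":           {"horizon_auth"},
    "keystone_db_record": {"keystone", "horizon_auth"},
    "horizon_content":    {"horizon_auth"},
    "horizon_auth":       set(),
}

def visible_alerts(firing):
    """Drop any firing alert that some other firing alert suppresses."""
    suppressed = set()
    for alert in firing:
        suppressed |= SUPPRESSES.get(alert, set())
    return [a for a in firing if a not in suppressed]

# MySQL dies and everything downstream fails; only the root cause surfaces:
print(visible_alerts(["mysql", "keystone_db_record", "keystone",
                      "horizon_auth"]))
# -> ['mysql']
```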
So now I'm gonna try to quickly cover service level validation and forecasting, and then I'll take questions. Service level validation is very simple; the catch is that you need to spend a little bit of time thinking about what you're going to do and getting buy-in before you do it. You need to answer: what metric are you validating? Page load time, we'll use that as our example. So if you wanna measure page load time and validate that 95% of page loads are occurring in less than a second, you need to think about, first of all, how you're going to measure that. Are you doing it from a local monitoring point, or from a monitoring point located at your primary customer site? Because the actual results you're getting are gonna vary based upon where you measure from and how you measure, and that should be built into your SLA. And how are you gonna record and report upon that metric? What time period does it span? Is it from 12:01 at the beginning of the month to 11:59 at the end of the month? That shaves off two minutes, which is actually extremely relevant if you're talking about a five nines or six nines SLA. And then, are there time periods that should be excluded? Do you have a regularly scheduled maintenance window that needs to be excluded from those SLA calculations? That should all be set in stone before you start doing any technical work on this.

So we'll take a quick example: validating Keystone auth availability. One thing to note: my recommendation, when you're doing monitoring via APIs, anything that involves a username and password, is to create a separate user, ideally even a separate tenant, for your monitoring system. Keep it secure, use a secure password, follow whatever your password policies are, but you wanna make it so that if anyone captures those credentials, they're not getting access to anything critical. A little bit of paranoia goes a long way. In this scenario, we're gonna perform an authentication every five minutes and look at a one month long SLA, which should give us a decent number of samples, roughly 8,600 a month. And since this is a very simple example, we're just gonna record a failure as zero and a success as one, which allows us to very simply report on SLA uptime. We're gonna report by selecting the results from the appropriate time periods. So we're running from the first day to the last day of the month, and we're gonna exclude 2 a.m. to 4 a.m. on Tuesdays for a maintenance window. And in order to get SLA uptime with that data, all you have to do is divide the sum of the results of selecting that data from the database by the count, and you get something like this. (This is made up data; you'd actually have more samples than that in a month.) We have 5,668 samples: 5,644 good samples and 24 bad samples, which provides us with an uptime of 99.577%. If our SLA target is 99%, then we've met it for the month. So you can see that SLA validation is really simple. It just takes a little bit of pre-work ahead of time, but doing something meaningful with it is not challenging.
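As a sketch of that report, assuming the 0/1 results land in a hypothetical sla_samples table with a ts timestamp column (SQLite here, just to keep the example self-contained), the whole calculation really is a sum divided by a count, with the maintenance window excluded in the WHERE clause:

```python
# A sketch of the SLA report; table and column names are assumptions.
import sqlite3

conn = sqlite3.connect("monitoring.db")
good, total = conn.execute("""
    SELECT SUM(result), COUNT(result)
      FROM sla_samples
     WHERE ts >= '2015-05-01' AND ts < '2015-06-01'
       -- exclude the Tuesday 2 a.m. to 4 a.m. maintenance window
       AND NOT (strftime('%w', ts) = '2'
                AND strftime('%H', ts) IN ('02', '03'))
""").fetchone()

print(f"SLA uptime: {100.0 * good / total:.3f}%")
# e.g. 5,644 good out of 5,668 samples -> 99.577%
```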
Forecasting, on the other hand, can go from extremely simple to extremely, extremely complex. We're gonna stay towards the simple side. First, we're gonna talk about the most basic system of forecasting, which is suitable for smaller infrastructures and systems where you probably don't have a very stringent SLA around response time or anything like that. In order to create a basic forecasting system, you need to define your threshold for growth in terms of specific resources: if CPU utilization is over 80% in the cluster, on average, you need to grow; if memory usage is over 80% on average in the cluster, you need to grow; if your max CPU is over 95%, you need to grow; things along those lines. The typical things you're gonna measure in an OpenStack environment would be CPU usage, vCPU allocation (that is, your actual allocation ratio given your current over-commit ratio, hyper-threading, et cetera), disk usage, and memory usage. Ideally, you wanna measure the average and the max for each of those. Then, based upon the data you've recorded, you wanna calculate the rate of change over time. So you end up with something like this. Let's say our growth threshold is a CPU average of over 50% in our compute cluster, a memory average of over 80%, a max disk usage of over 80%, or a vCPU allocation over 85%. And here's some fictional data: CPU utilization at 34%, with a growth rate of 4% per week. With that, we can forecast that four weeks out we're going to need to expand capacity.

You can look at vCPU, memory, and disk the same way. You can see we've got four weeks till we need to increase CPU and 3.8 weeks till we need to increase memory. It's actually good if you see that, because it means you've done a very good job sizing your compute nodes in terms of the CPU to memory ratio, since they're very close to one another. And in this case, you can see we're 19 weeks out from needing more disk, and since we're probably gonna add more nodes to meet the CPU and memory requirements, that's probably not a problem. I'll point out the vCPU ratio: if you see something like this, where you're at 73% vCPU allocation but only 34% CPU utilization, that might be a time to reconsider your over-commit ratio and maybe increase it. If your hypervisor is 75% full based upon your vCPU allocation, you only have a few more VMs you can create, but you're only using 34% of your CPU on average; you can probably fit a bit more into that infrastructure than your current over-commit ratio allows.
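The basic forecast is just arithmetic. Here's a minimal sketch; the CPU numbers are the fictional ones from the slides, and the memory inputs are invented values that reproduce the 3.8 weeks mentioned above.

```python
# A minimal sketch of the basic forecast: given current average
# utilization, weekly growth rate, and a growth threshold, how many
# weeks until you need to expand?
def weeks_until_growth(current_pct, growth_pct_per_week, threshold_pct):
    if growth_pct_per_week <= 0:
        return float("inf")  # flat or shrinking: no expansion needed
    return (threshold_pct - current_pct) / growth_pct_per_week

print(weeks_until_growth(34, 4.0, 50))  # CPU:    4.0 weeks (from the slides)
print(weeks_until_growth(65, 3.9, 80))  # memory: ~3.8 weeks (invented inputs)
```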
To get a little more advanced: for advanced forecasting, you're looking at more statistical models, and even this is extremely simple from that perspective; I'm sure there are folks out here who've done stuff 10 times more complex. But in short, you wanna define your target measurement and your threshold. Say we want 95% of our page loads to be under one second. And what's your growth threshold? Do you wanna wait until you've just finally hit that threshold of 95% of page loads under one second, and then grow? Or is that a hard threshold, where you never want more than 5% of your page loads to take longer than a second? That needs to be defined ahead of time, and ideally you get buy-in from leadership and everything, so it's not a battle to get the hardware ordered once you start getting towards having problems. You wanna measure your target metric, page load time (once again, determining how to measure that is very important), and then you wanna calculate the average, the standard deviation, and the rate of change. With that data, you can easily forecast the date that you're gonna violate that threshold, and you'll end up with something like this, where you can say, hey, based upon where we're at currently and how fast we're growing, we can see that in 13 days we're gonna need to increase our capacity by 10%, and then another 13 days after that we're gonna need to add an additional 10%. And this is based on your statistical projections of what the utilization, or whatever metrics you're tracking, is gonna look like.

Some additional considerations: cyclical traffic patterns. When I say this model is very simple, this is primarily what I mean. Your traffic patterns are gonna vary intraday, weekly, monthly, and seasonally, depending on what line of business you're in. So maybe 95% of page loads under one second is not adequate for you, because during peak times only 20% are under one second. So maybe that's what you wanna measure instead: maybe you want 95% of page loads between 4 p.m. and 6 p.m. Eastern Standard Time to be what you're measuring. Another thing to consider in a cloud context is dynamic resource provisioning. You probably don't want extremely low utilizations in your environment if you have the ability to dynamically provision resources. You don't wanna be running at 30% utilization; you wanna deprovision some assets and be running at maybe 50%. It follows the same principles as what we've discussed, but on a shorter time scale: we're talking seconds or minutes rather than days, weeks, or months. And back to alerting: a lot of this involves creating these QoS performance metrics, creating thresholds on them, and saying, okay, now 99.9% of our page loads are under one second; that means we can deprovision some assets until we get back down towards 95%. That dynamic resource provisioning can allow you to save a lot of money over time, and that's a big part of cloud computing.

Finally, one of the advantages of this slightly more advanced model: if you see a high standard deviation, that indicates that there are probably a lot of opportunities for improvement in the environment. Is there a resource problem that's causing a certain subset of your requests to take longer to process? Is there a geographical distribution problem? Do you need to look at a content distribution network or something like that to put your data closer to your customers? Are your overseas customers blowing your SLA? And do you need to leverage automated resource management? Do you need to be able to provision additional capacity during peak hours and deprovision during off-peak?
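Here's a sketch of that slightly more advanced model, assuming daily 95th-percentile page load samples. A plain linear fit stands in for the rate of change, which glosses over exactly the cyclical patterns just discussed, and the sample history is invented.

```python
# A sketch of the statistical forecast: average, standard deviation, and
# rate of change of a metric, projected to the threshold violation date.
import numpy as np

def forecast_violation(daily_p95, threshold=1.0):
    """Days until the daily 95th-percentile page load exceeds threshold."""
    y = np.asarray(daily_p95)
    mean, std = y.mean(), y.std()  # a high std flags improvement opportunities
    slope = np.polyfit(np.arange(len(y)), y, 1)[0]  # seconds per day
    days = (threshold - y[-1]) / slope if slope > 0 else float("inf")
    return mean, std, days

mean, std, days = forecast_violation([0.78, 0.80, 0.81, 0.83, 0.84, 0.86])
print(f"mean={mean:.2f}s std={std:.3f}s; threshold violated in ~{days:.0f} days")
```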
So, here's what you now know. Monitoring can provide you with a lot of benefits. Alerting and performance metrics are different, but they're both important; they're both critical to your environment. Correlation and suppression are absolutely critical for making alerts meaningful: for ensuring that your alerts indicate a failure state, that that failure state is actionable, something you can respond to, and that the alert also indicates the point of failure. Creating a high value alerting system is awesome; it will make your life better, I promise you. SLA validation is simple, growth forecasting is slightly less simple, and we also looked at some pictures of cats. So my challenge to you is to think about the things we've discussed here, and how you can apply them to take your monitoring to the next level, and hopefully make your life a little bit better. Thank you. Are there any questions? Yes. Oh, let me get you a mic.

"What has been your experience in using commercial monitoring tools versus open source monitoring tools? Do you find that you can do this using open source quite easily?"

I would say no. My feeling is there are a few pretty good commercial monitoring tools, and there are a lot of really good open source tools, but very few of them do all of this in a single tool; suppression and correlation is the point where most open source tools are weak. However, you do have the ability to take several open source tools (there are some open source correlation engines, for example) and cobble them together to create an effective monitoring solution. Any other questions?

"Yeah, so the question is kind of related to that, because what we have found is one tool is good for one thing, but not for the other. The way I look at monitoring is you have an agent, which basically captures data, and then you have an aggregator that takes in all this data. And one thing is, let's say in OpenStack, for instance, as you add more computes, how do you automate installing your monitoring agents on the new computes, or make sure that the agent itself is running so that you get events? And sometimes there are things that don't come in as an event, like maybe the node actually has gone down."

Yeah, so, Andy, do you wanna take that one? We actually have one of our monitoring folks here who can tell you exactly what we're doing in the real world.

"So basically, when the agent dies, we have an alert that'll fire off. You have an agent and a server, right? You don't just have an agent running on your compute host, for example; you obviously fire the data somewhere else. So when the agent stops responding on the server, you definitely wanna have an alert for that. That's actually one of the key components, because the agent itself could be a little bit flaky and die off, and that's just as important as the server going down: if my metrics aren't going through, if my alerting isn't happening, then I'm in a pretty bad place. So yeah, you definitely want that kind of thing set up. That's pretty much the main piece on the agent thing. But the automation? We just use Ansible. You definitely wanna include your monitoring in your automation. I mean, as far as we're concerned, when your servers go down, that's a pretty big deal, and you need to know about that just as much as when you set them up. So we actually have monitoring set up as part of the automation when we deploy things, so you'll know pretty much straight away if services aren't working from the deploy. And actually, we've used it in some cases to find out when our deploys have gone wrong, right from the start, which is pretty useful for the deployment teams as well. And it's the same monitoring we use for normal day-to-day accessing of the service, et cetera."

"And this monitoring tool, is it your own homegrown, or is it using open source?"

So what we use for private cloud is the Rackspace Monitoring as a service solution. So it is, well, is it actually open sourced yet? Is it available in public? We've used a couple over the years: we used CA's CAS for a while, we used Nimsoft's NMS, formerly Nimbus, and now we're using our own monitoring as a service cloud monitoring platform. I'll say, based on my experience in IT, I really like NMS. Nimsoft is not paying me for this, but I've used it in several roles in the past, and in terms of having a good balance of built-in functionality, it has great built-in suppression, and you can do correlation via just writing Lua. So it's pretty nice; it's been my favorite so far.

"Essentially, we've written some Python scripts that are available open source that do some basic monitoring things that we plug into the MaaS system. And then we couple that with some of the Nimsoft stuff to cover off the things we haven't actually done yet."

Yep. Next question. Thanks, Andy.

"I just want to add that earlier today, at nine o'clock, IBM, HP, and Rackspace actually had a session about monitoring as a service. We do have a product called Monasca that's under development. So if you're interested in a monitoring as a service product for OpenStack, take a look at Monasca."

How do you spell that?

"M-O-N-A-S-C-A. We had a session at nine o'clock earlier today."

Excellent, I'll check that out. Any other questions? All right, well, thanks everyone for coming.