Let's talk a little bit about monitoring OpenStack at scale. The very first question you should be asking is, what does scale mean in this particular case? At Rackspace, we have six different data centers across the world. Three of those are in the US; we have one in London, one in Sydney, and our latest one is actually here in Hong Kong. Across those six data centers, we have tens of thousands of hosts. These are the hypervisors that customers are spinning up instances on, spread all across the world. And on those hosts, we have hundreds of thousands of customer instances running right now in OpenStack, which we're really excited about. So that's the scale we're looking at here. The next question is, why are we monitoring? A lot of this should be fairly obvious, but I want to call out a few of the main areas we cover explicitly, just so you know what our view of monitoring is. The first view is uptime, and uptime matters in a couple of different ways. The first is pure API availability: how many of your requests are making it in without throwing 500s, making it through all the network layers, and so on. Right now we're shooting for four nines, and once we feel like we're meeting that regularly, we'll push it to five nines and go from there. The next view of uptime is the build success rate. These are requests that make it past the API, so you get your 200 or your 202, and then the question is whether the asynchronous request actually completes successfully, since those are two different things. Here we're looking for three nines. The reason that's a little lower is simply the number of ways something can fail, or even a customer interaction can fail: you can have snapshots that never go active, and things like that. So there's a certain level of variability in there, and again, this is a number we'll try to push further as we go along. The last facet of uptime is data plane availability. Your customer instance is already up and running, but can the customer actually access that instance? Are things working OK on there? Does it have network connectivity? Is it doing the things that customer wants it to do? And since it's so important to a customer's business, we're shooting for 100% uptime here. After uptime, the next thing we're monitoring for is performance. Again, there are a couple of different ways to look at performance. The first is build time: how long does a build take, from when the API request is received all the way to when the instance goes active? Actually, we even care about when a customer can really interact with it, when they can SSH into it and it's serving the content the user wants, because there is a little bit of a gap between when it goes active and when it's truly usable. The other big area of performance we look at is API latency: how long does a particular API request take? This is something we watch very closely as well, because it can be a very bad experience if you're waiting a couple of seconds for an API response to come back.
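To make those targets concrete, here is the quick arithmetic on what each "nines" figure allows per year (my own back-of-the-envelope illustration, not how we formally report it):

```python
# Allowed downtime per year for a given availability target ("number of nines").
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} ({availability:.3%}): about {downtime:.0f} minutes of downtime per year")
```

Four nines leaves roughly 53 minutes of unavailability per year, which is why, as comes up again later in the talk, a handful of impactful deploys can consume the whole budget.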
The third thing we look at, after uptime and performance, is capacity. For those of you who are running clouds, whether public or private, you know that your customers expect a basically infinite cloud. And you also know that there's a very real physical deployment underneath that you actually have to manage to make sure you're meeting those needs. At Rackspace, we look at a couple of different things. The very first is pure memory utilization. This might look a little different for you depending on how you set up your flavors; the way we set up ours, memory is the constraint we care the most about, so we monitor for a strict percentage of the memory we have available to use. It's not just memory, though. The other thing we look at is empty hosts. The reason is that there's a sort of Tetris game you have to play when you're placing instances, and you need to make sure you're landing instances on hosts in the most optimal way. At Rackspace, our largest instances actually consume an entire host. So you can imagine a situation where you had very low memory utilization across the entire cloud, but a single instance on every single host, and you wouldn't be able to build any of your largest instances at all. That would be a serious problem. This isn't the only thing we care about, but it's been one of the best indicators for us of whether we need to move forward on any of our hardware orders to keep things moving a little quicker. The last one we look at is kind of interesting: we have to manage our IPv4 addresses very carefully. These are public addresses, and everybody has hopefully heard the scares about IPv4 addresses running out; it is a very real thing we need to manage. ARIN assigns a certain number of addresses to Rackspace, and Rackspace needs to be very careful about which groups inside Rackspace get those. Fortunately we're growing pretty fast, so our requests are processed pretty quickly, but we do need to make sure everybody knows exactly what that looks like and that we can communicate how quickly we're consuming these addresses. So that's why we monitor and the scale we monitor at. Now I want to talk through some of the tools we actually use to monitor this at Rackspace. The first one we use is Nagios, however you want to pronounce it. This is a tool that lets you manage alerts: you set up thresholds for the various components of your infrastructure that you want to watch carefully, and as those alerts fire, you can choose to do different things, whether that's just a notification, taking some action, or whatever that might be. I wanted to share some of the different alerts we have. It's not a humongous list, but it's long enough that it doesn't fit on a slide very well, and I also like word clouds, so I threw it up in a word cloud. What this word cloud shows is the number of times each of these words appears in our list of alerts. At the end of the slideshow, I'll have a URL where you can get the full list, so you can know exactly what it is we're monitoring. Obviously, API availability pops to the top. That's simply a function of the number of APIs in an OpenStack deployment. We're monitoring not only the Nova API but Glance, and what used to be Quantum and is now Neutron (I can never remember the new name), and so on.
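To give a feel for what an API availability check boils down to, here is a minimal sketch; the endpoint URLs and ports are generic placeholders rather than our actual configuration, and in practice checks like this run against every individual API server as well as the aggregate endpoints:

```python
#!/usr/bin/env python3
"""Tiny availability probe for OpenStack API endpoints (Nagios-style exit codes)."""
import sys
import urllib.error
import urllib.request

# Hypothetical endpoints; adjust hosts and ports for your deployment.
ENDPOINTS = {
    "nova-api":    "http://nova-api.example.com:8774/v2/",
    "glance-api":  "http://glance-api.example.com:9292/",
    "neutron-api": "http://neutron.example.com:9696/",
}


def check(name, url, timeout=5.0):
    """Return 0 (OK) or 2 (CRITICAL), like a Nagios plugin would."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        code = err.code  # a 401 from an auth-protected endpoint still means "up"
    except OSError as err:
        print(f"{name} CRITICAL: {err}")
        return 2
    if code < 500:
        print(f"{name} OK: HTTP {code}")
        return 0
    print(f"{name} CRITICAL: HTTP {code}")
    return 2


if __name__ == "__main__":
    sys.exit(max(check(name, url) for name, url in ENDPOINTS.items()))
```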
We look at all of these different APIs, and all of them need to be up. Not only do they need to be up in aggregate, but we also look at the individual servers to make sure each of those is up. So that one jumps out. But I want to point out a couple of other things here. First, you can see Rabbit and queue are pretty high up. This is a combination of two factors. It's very impactful when RabbitMQ servers go down and you can't access those queues, and if the queues are growing excessively, that can have an impact on your build times. It's also a function of the number of cells that we have: each of those cells has its own Rabbit deployment, so we need to monitor each of them individually. MySQL is, honestly, basically the same story. It's incredibly impactful if it goes down, and there are several MySQL databases for each cell, so we need to watch that very closely. And Glance is another area we monitor from a couple of different angles, which is why it popped up as a relatively large word in this cloud. I warned you I like word clouds, so I made a second one. This is actually the exact same list, except this time it's weighted: these are the alerts that have fired in the last year, weighted by how often they fired. A lot of these make sense. nova-compute is the single most common service across a deployment, so if everything failed at the exact same rate, you would expect nova-compute to be the largest. Same thing with the nova-compute log and disk usage checks, which can fire on any of our hosts. The Glance API ping there is actually a check we do from the compute nodes to make sure we can reach the Glance API servers, because again, it's super critical if that breaks. So this just gives you an idea of the areas where we actually see alerts fire as we operate this. The next thing I want to talk about is sampling. So far I have focused on the problems that happen on a day-to-day basis: an alert fires, an operator sees it, takes some action, and goes and fixes it, because presumably it is actively impacting a customer. We also use the same monitoring tools to look at things from a larger view. We want to know how we're doing over the week or over the month; we want to know how our build times are improving over that time frame. One of the things we decided to do was use sampling. The reason we went with sampling is that, like I mentioned earlier, there is a difference between when a server goes active and when a user can actually SSH into it. With sampling, we test all the way through that SSH, until the machine is fully up, and that gives us the total view of how long it really takes us to build machines. Of course, we don't rely only on sampling. We're also using a tool called StackTach. What StackTach does is consume the events produced by any OpenStack deployment; it can take all of those and store them in data stores for retrieval later. StackTach has been really great in a lot of different ways. First of all, it allows us to get the picture of all of our clouds: not just the samples, but all the different images people are using, all the different flavor types, whether they're using a snapshot image or a base image.
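To give a rough idea of what consuming those events involves, here is a minimal sketch using kombu. The broker URL, exchange, and routing key are assumptions based on a stock nova notification setup, not StackTach's actual code, so adjust them for your deployment:

```python
"""Minimal OpenStack notification consumer (illustrative sketch, not StackTach itself)."""
import json

from kombu import Connection, Exchange, Queue

# Assumed defaults: nova publishes notifications to a topic exchange named "nova"
# with routing key "notifications.info". Broker host and credentials are placeholders.
BROKER_URL = "amqp://guest:guest@rabbit.example.com:5672//"
exchange = Exchange("nova", type="topic", durable=False)
queue = Queue("notifications.info", exchange, routing_key="notifications.info")


def on_event(body, message):
    """A real consumer would persist the event; here we just print the key fields."""
    event = body if isinstance(body, dict) else json.loads(body)
    print(event.get("event_type"), event.get("publisher_id"))
    message.ack()


with Connection(BROKER_URL) as conn:
    with conn.Consumer(queue, callbacks=[on_event]):
        while True:
            conn.drain_events()
```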
StackTach also gives us a lot of detail about why things are failing when there have been failures: it gives us the full stack trace, the request ID, and things like that. So StackTach has been really important for that. We are also looking at Ceilometer; that's probably something we're going to roll out a little bit later. We have a team focused on metering in particular. We had written most of StackTach before Ceilometer was a project, and now we're working closely with the Ceilometer team to move forward there. It's a little weird to talk about email as a monitoring tool, but again, this is for that longer view. We take the sampling data and the StackTach data and create reports that go out. Honestly, these are most useful for somebody like me; they're not going to be useful for somebody doing operations on a day-to-day basis. But it's a nice aggregate view for me to see how we're doing over the last 24 hours, the last seven days, the last 30 days. This is just a sample from one of those reports we generate. These goals we've actually met and moved past, but you can sort of see, and I had to gray out some of the data here, that the red areas are places where we failed. That gives us a benchmark to say, hey, we need to go dive into this and find out what the real problem is; it's a pointer to dig into things a little further. Next is a tool called Slinky. (And my star isn't there. That's funny.) Slinky is a tool we wrote internally; it's a tool our operations teams wrote at Rackspace. I'm going to talk more about it later, but to give you an overview right now, it's basically a way for us to manage all the alerts that are going off, manage the hosts our operations team is looking after, as well as some of the on-call schedules and things like that. The last tool I want to talk about is Graphite. (By the way, we do have several other tools; most of them are custom scripts that feed into the things I've already mentioned. But for the purposes of this talk, this is the last tool I want to cover.) Graphite is a service that lets you push data into it, and then you can take that data and graph it in tons of different ways. The biggest piece of advice I can give you about Graphite is that as soon as you start to think there's some metric you want to gather, start pushing it into Graphite as soon as possible. I can't tell you the number of times I wished I had just one more week or one more month of data about some metric so that I could compare a little further back and look for patterns. You need to know: are you seeing a pattern? Is this a monthly pattern? Every time there's an OpenStack Summit, suddenly some event happens, and you want to have that extra month of data just to see what you're comparing against.
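Part of why that advice is easy to follow is that getting a data point into Graphite is trivial. Here is a minimal sketch using Carbon's plaintext protocol; the Carbon host and metric path are made up for illustration:

```python
"""Push a single metric sample to Graphite via Carbon's plaintext protocol (sketch)."""
import socket
import time

CARBON_HOST = "graphite.example.com"  # hypothetical Carbon relay/cache
CARBON_PORT = 2003                    # default plaintext listener port


def send_metric(path, value, timestamp=None):
    """Send one '<path> <value> <timestamp>' line to Carbon."""
    if timestamp is None:
        timestamp = time.time()
    line = "{} {} {}\n".format(path, value, int(timestamp))
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example: record the current depth of a RabbitMQ queue in a given cell.
send_metric("cells.cell01.rabbit.notifications_info.depth", 42)
```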
I want to show you a couple of the Graphite graphs we have. (It looks like the top is being cut off a little bit there, which is unfortunate.) In the upper right-hand corner, that's looking at our RabbitMQ queues across several different data centers and several different cells. The little bumps there are actually not that big a deal; those are just jumping up to 40 and 50 messages. The graphs you can't see are the ones showing a couple of data centers where we had an incident of some sort, and the queues were jumping up to 1,500 messages or so, which just impacts build times. The operations team was able to look at that, identify what the problem was based on these graphs, fix it, and the queues dropped back down almost immediately. So that's the kind of information that's available in this stuff. Second, over on the left, the blue slide there: we've actually deployed Havana at Rackspace. We had that out pretty much the day it was released; it was the same day we did a deploy, and we had all the changes in for the Havana release. This graph is showing build times for our sample data, and as we deployed Havana, the build times dropped very significantly. I'll show you the exact number in a little bit, but we were very happy to see that. Not only did the average times drop, but the spikiness dropped as well, so we're very happy with where Havana is going. The bottom graph is one we created in response to a deployment of an earlier version of Havana, somewhere in the milestone-3 time range. What happened is that we deployed it, and we had some graphs showing a large increase in network traffic. We didn't know exactly where the traffic was coming from just yet, but it basically almost doubled the network traffic we were seeing. Fortunately it was in one of our smaller data centers, so it did not impact the service overall, but if it had been in one of our larger data centers, it absolutely would have impacted not only our service but possibly other services as well. We did some digging and, like I said, found that it was MySQL traffic. So we worked with some people in the community, got some of the patches in, deployed a post-havana-3 release, and saw that network traffic drop down immediately. This release is actually a little bit lower than where we were in Grizzly, so that's even better to see, and we got some great features at the same time. And then I just happened to notice on the plane coming over here that as we deployed Havana, there was a little uptick. That's not something I'm concerned about, but it's an interesting thing to note and something to be aware of as you're looking at this. So now we know the scale, we know what we're monitoring for, and we know what tools we use. But what do you do to actually fix things? Because there will be some sort of an issue, no matter what; there's going to be some kind of a problem. The real trick is that you want to drive down that mean time to resolution. You want to identify problems as quickly as possible, know where the problem is, and be able to fix it as quickly as possible. One of the big reasons we wrote Slinky was to do exactly this: to give you this view of the data centers and let you know what's going on. So Slinky is a user interface, and there's also an API to it, so you can get information out of it and post information to it.
It has a view where you can look at each of our data centers, dive in to see each of the different cells, and very quickly see how many hosts are having problems in a particular cell and how many services are having problems in that cell; it just lets people know what's going on immediately. Sometimes the fastest way people are alerted to some sort of a problem is that their IRC clients suddenly fill up with a whole bunch of alerts, and if there's a sufficient number of alerts, it's really hard to keep track of all that. A tool like this just lets you switch over, do a refresh, and see what the state of the system is right now. On that same theme, this is another view that lets you do that. This is showing what are called host groups. The idea here is, again, imagine you get a large flood of host machines that are alerting for some reason, and let's say part of the team is already working on that problem. Well, you still have the rest of the cloud to worry about; there could still be other alerts happening. So in this flood of alerts you're dealing with, what this view does is: if a single service in one of these host groups is having a problem, the whole circle goes red. It just lets you know, hey, these are all the areas where we have problems. If people are already looking into them, that's fine, and we can go ahead and fix them. But in case people aren't aware of the other problems yet, we can start to dive in through this different view. This next one is a view our operations team spends a lot of time in. It's a little hard to see, but there are actually three different colors on this screen. What it's doing is showing all the alerts as they come in: every time an alert fires in Nagios, it shows up on this screen, and that gives you a really nice view. You can read that surprisingly well, but I'm going to zoom in a little bit. This is just showing those three different colors. You get information about each of these alerts: you can see which cell they're in, you can see some details about them, and you can see which host (I've blurred those out here), which links over to a page I'll show you in a little bit. The three colors there are the red one on top, green as the second one, and then the bottom three, which are all yellow. Red means that an alert has come in and nobody has acknowledged it; nobody has done anything about it yet, and it's an alert that needs to be handled in some way, shape, or form. The green one is an alert that has fired, but even though it fired, it has actually been resolved; it's more informational, saying, hey, if you saw this a second ago, you're not going crazy, but it's resolved now and you don't have to worry about it. And the last three are all yellow, which just means the alert has come in, there was some sort of a problem, but it's been acknowledged by somebody. You can see up there at the top there are these different categories, and you can see who each alert is assigned to. Since that top one is still red, it's assigned to an entire shift; you don't know who's going to take it yet, it's just been triggered. And you can see that the bottom three have all been acknowledged by the same person.
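Those three colors boil down to a tiny piece of state logic. Purely as an illustration of the rules just described (Slinky's internals aren't public, so this is not its actual code):

```python
"""Illustrative mapping of alert state to display color (not Slinky's real implementation)."""
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    resolved: bool = False       # the underlying check has recovered
    acknowledged: bool = False   # an operator has taken ownership


def display_color(alert: Alert) -> str:
    if alert.resolved:
        return "green"   # fired, but already back to normal; informational only
    if alert.acknowledged:
        return "yellow"  # someone owns it and is working on it
    return "red"         # fired and unowned; needs attention


print(display_color(Alert("nova-compute")))                          # red
print(display_color(Alert("glance-api", acknowledged=True)))         # yellow
print(display_color(Alert("rabbit-queue-depth", resolved=True)))     # green
```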
And then there's this green one, which I wanted to point out. This one has an auto-fix status, and the auto-fix status is successful. If you remember the word cloud, we had several of our major alerts that fire very often, and these are the ones that both fire often and were easy to automate a fix for. So these are things we actually fix automatically: as soon as Slinky sees there's a problem here, it will interact with that device, interact with that server, and go ahead and address the problem as quickly as it can. If you hover over that result, it'll tell you what it actually did; in this case, we just dropped the disk usage down to below 90%, and everything looks better. It's a very pluggable interface, too: as you find different things you can automate, you can go ahead and plug them in and go from there. The next view is a host view of a particular server. Imagine you're looking at one of the alerts and you just want a little more information about this host and what's been going on. This gives you an overview; this is actually just the top half of the page, and I'll show you the bottom half in a little bit. Up at the top, we have some information about the host. It interacts with Rackspace's inventory system; I think that's actually one of the areas where we need to figure out how to make it interact with non-Rackspace inventory systems. But this just gives you some information about the host overall. It also gives you this list of the services on the host, with the status of each of those, when they changed, and things like that. I wanted to highlight a few of the services in here, though. Those top five, you can see they're all disabled. It's not because they're not important, and it's not because the host isn't active or anything like that. The reason they're disabled is that we made these non-active checks: basically, we switched to using SNMP traps for them. What that does is effectively solve one of the major scaling problems we had with Nagios. Initially, we had Nagios going out and polling for a whole bunch of this information, and you can imagine that even at hundreds of nodes it starts to get painful, but at tens of thousands of nodes it's extremely painful for your Nagios server. So we switched these to more passive checks. What that means is that we're still actively monitoring for network connectivity and making sure that certain Nagios daemons are up and running on the machine, but as long as there's no state change on any of these checks from the perspective of the machine itself, none of them will fire an alert. As soon as there is a state change, we communicate it up to Nagios, and it acts just like a normal Nagios alert for anything else.
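For reference, a passive result ultimately reaches Nagios as a PROCESS_SERVICE_CHECK_RESULT external command. Here is a minimal sketch of submitting one; the command-file path and the host and service names are assumptions, and in practice the result usually arrives via a trap handler or NSCA rather than a direct local write:

```python
"""Submit a passive service check result to Nagios (illustrative sketch)."""
import time

# Common default location; an assumption, adjust for your install.
NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"


def submit_passive_result(host, service, state, output):
    """state: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN."""
    line = "[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}\n".format(
        ts=int(time.time()), host=host, service=service, state=state, output=output)
    with open(NAGIOS_CMD_FILE, "w") as cmd_file:
        cmd_file.write(line)


# Hypothetical example: a compute node reports that nova-compute has stopped.
submit_passive_result("compute-node-0123", "nova-compute-proc", 2,
                      "CRITICAL: nova-compute process not running")
```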
Next is the bottom half of the screen, which gives some more information about this host. A lot of this interacts with communication tools at Rackspace and lets you see past alerts and some of our internal ticketing. I wanted to dive into a couple of the sections here. The first one is the Crash Log. This is just a tool that lets you communicate with other members of the operations team: you can put in whatever text you want to say, hey, this is what happened on this machine, and this is what I did to fix it, which can be useful for troubleshooting in the future. In this particular case, somebody has also marked this as a strike three. We have a policy on our operations team that, at a maximum, a server can fail the same way three times. As soon as it fails the same way three times, we're going to evacuate that host: we're going to move customers off of it and then take some remediating action, whether that's a full re-kick of the server, replacing the hardware, or whatever that looks like. Sometimes we might do something sooner than that, depending on the type of failure we see; of course, we're not just going to wait for impending failure when we know what's going to happen. But sometimes it's a little hard to tell whether something is just a fluke or an actual problem on the machine. The next section I wanted to highlight is the instances section. This gives you a view of all the instances on this host: it lists the number of servers on there, the total amount of RAM being consumed on this host, the VM state, the task state, and the power state. It gives you a link to the tenant, which shows you all the servers that tenant has, so you know what this customer looks like and how severe the problem on this host is, whether you're affecting one instance, as in this case, or 25 instances, or however many might be on there. It also gives you access to the IPs and so on. I didn't pick this host completely at random, by the way. I picked it specifically because it has Rackspace's new 120 GB performance flavors on it. I normally don't do sales, but I literally just got the text message that they launched, so I was excited about that. So what have we actually seen over the last several months? We've had a big initiative since the Grizzly release to improve both our build time and our build failure rate, or make them go down, anyway. The biggest single impact we've seen is a 45% reduction in build times. This has been absolutely huge. A lot of this has come from the larger community, of course, and a lot of it has been driven by the monitoring that we set up. Again, those reports coming in from StackTach tell us exactly where we're seeing failures, and a lot of those failures impact build times, because what happens is that a server will start building, hit some problem on a host, and we'll say, OK, that's fine, we'll just reschedule it somewhere else. But you've potentially doubled or even tripled the amount of time that build takes, even though it's eventually successful, because of the intermediate failures along the way. The next thing we've seen is about three nines of API availability. This is actually something I'm really happy with, because the important thing to note is that we punish ourselves for any downtime we have during deploys. During this entire period, we've been doing several deploys to our system, and depending on how quickly the deploys go, anywhere from two to three deploys can effectively eat that fourth nine away and keep us from meeting our goals for the year. So one of the things I would really like to see from OpenStack is to continue pushing on making these deploys non-impacting. We will continue to do deploys.
We do think it's really important that we keep doing them and that we communicate about them, but they definitely affect our API availability. Next, the build success rate: we have a 99.5% build success rate. Again, this is caused by several different things. One of them is the deploys that happen, and then there can be failures on some of the other infrastructure in our system. And like I said, it's easy for customers to have images that are not quite correct. So this is something we really think we can improve upon, and we're looking forward to working with the community to do that. And then I said we had a goal of 100% uptime for data plane availability. We haven't quite met that. We've been pretty good, at about four nines of data plane availability, but we have had some issues on our infrastructure that we've recently made some pretty drastic changes to help correct. So what's next? There are a couple of different areas I want to talk about here. First of all, I mentioned this idea of auto-corrections. Auto-corrections are a really great thing to do because they save a lot of time on the operations side of the house. The problem is that if you automate things so well that you mask real problems in your system, that's going to be a problem for you. So one of the things I would definitely like to do is make sure we do a better job of communicating out to the development teams the state of the system when we had to take an auto-corrective action. For some things it may not matter so much; some of them are just about configuration management and things like that, so they're not necessarily OpenStack issues but problems on our deployment side. But a lot of them are OpenStack issues. We have a case, for example, where nova-compute will just kind of pause. We know where a lot of this work is; we suspect, anyway, that a lot of it is in the eventlet driver, and we have people working on helping to fix that. But really getting good snapshots of exactly what is going on in the system when we have to take these actions would, I think, be really important. There's also the possibility of taking some more complicated actions. For example, we don't have to just restart nova-compute. We could instead do two things: make it so that we don't send any more instances to this server, and also evacuate it, while leaving everything else exactly where it is, so that somebody can go investigate and take a look. So that's one of the areas where I would definitely like to see us improve, both at Rackspace and as part of the community.
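To sketch what that "stop scheduling to it and drain it" action could look like with python-novaclient: the host name and credential handling here are placeholders, and the exact calls vary a little between releases, so treat this as illustrative rather than as the tooling we actually run:

```python
"""Quarantine a suspect compute host: disable it, then migrate instances off (sketch)."""
import os

from novaclient import client

# Placeholder credentials; real tooling would pull these from a config or credential store.
nova = client.Client("2",
                     os.environ["OS_USERNAME"],
                     os.environ["OS_PASSWORD"],
                     os.environ["OS_TENANT_NAME"],
                     os.environ["OS_AUTH_URL"])

HOST = "compute-node-0123"  # hypothetical suspect host

# 1. Stop the scheduler from placing any new instances on the host.
nova.services.disable(HOST, "nova-compute")

# 2. Live-migrate the existing instances elsewhere, leaving the host itself untouched.
for server in nova.servers.list(search_opts={"host": HOST, "all_tenants": 1}):
    print("migrating", server.id)
    server.live_migrate()
```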
Another big area I'm really interested in is doing some more complicated filtering on the StackTach data, and eventually the Ceilometer data. What happens right now is, like I said, we get a lot of really good information: we know how long builds are taking, we know what types of failures we're seeing for particular builds, and we can get a whole lot of aggregate information. The problem with aggregate information is that things can look really, really rosy for the vast majority of our customers, but those aren't the customers who actually call in. So what we really want to do is figure out ways to slice and dice this data in a couple of different ways. I would love to be able to filter by, for example, which image ID people are using. Is there a particular image that's failing or taking an especially long time? Is there a particular flavor where that's happening? It would be really great, when a customer calls in, for our support teams to be able to quickly look up the StackTach data for just that one tenant and see what that customer looks like. And even in the more active monitoring situation, where you're trying to solve a problem of some sort, you might want to know when a particular node is taking a particularly long time or has a high failure rate, and dive down to the cell level, or maybe it's just a cabinet-level problem. So it's these different ways of viewing the same data that, again, help us identify problems as quickly as possible. And that's it. This is the gist, which you can't see on here, but I think it's uploaded somewhere so you can go take a look; it's just the list of all the Nagios alerts that we have. So thank you.