All right, I'm expecting this to be working. Microphone, great. All right, so this is "Only You Can Prevent Forest Fires": a proactive approach to monitoring your cloud. We're part of the Time Warner Cable OpenStack team. We operate an internal cloud for Time Warner Cable, with customer-facing workloads on it, and one of our main initiatives is to avoid customer work interruptions and incidents. Quick agenda: I'll cover the what and the why, Ryan will cover visualization, and Brad will cover technical challenges.

So, quickly, I just want to expose our thought process on this, starting with what's really motivating us. I find that most people assume they know why they're collecting performance metrics. It's not always obvious, and there are lots of different reasons to do it, but these are our two number-one reasons. Number one is to avoid customer work interruptions, and number two is to make our work-life balance better. Number two is becoming more important all the time, just because there's quite a bit of stress when the on-call rotation comes around.

So what are we talking about here? Nothing really special, but we try to be deliberate and specific about how we go about our monitoring, and then use that monitoring information to produce a desired result. Basically: prioritize what kinds of problems you're trying to address, and create system models. When I ask people how they expect their cloud to operate, I get a variety of answers, and usually they conflict. Start simply and progressively build a model your team can work around. That all drives toward creating questions — questions to be answered with your data. At that point you're ready to source data, visualize it, and hopefully answer a question. It's very common, for us anyway, not to get the question answered the first time, so we have to iterate. And then, for me, one of the most important things is to define an action. Going through all this work without being able to do something about it gets you nothing; the action is really where we get our value. But that's the problem we often find: we'll source data and make slides or graphs, and still not really have a defined action behind them.

So, to answer the question of why this is really different: what I'd call the traditional thought process is to collect as much data as you can, keep it forever or as long as you can, and hope you'll have a chance to use it to make a difference. From my standpoint, that's a false assumption. What I'd call the new thought process starts with a question. Once you have a question to be answered, go look for the data that will answer it, put it in a format that lets you use it, and then ask yourself: did that help? If not, repeat. This is a very simplistic slide, and I really think the last point is the most important one: come up with a question that, if you can answer it, will lead you to an action that makes a difference in your cloud. You can avoid a problem. That's what you want to do.

These guiding principles are really my guiding principles, developed over the last year of working with this, and they're truly open for debate. The potential value of data decreases over time, but the cost of that data increases over time. The true value of data is driving a decision or an action. And graphs are tools; they're not the product.
And humans aren't good at this. If you accept that last bullet, then taking a structured approach is much more natural. But most people think they're very good at this, and I think that leads them to make a lot of mistakes. Lastly, this whole approach was really prompted by exposure to data science. Coursera has a number of classes on it — if you're interested in this type of approach, that's a great way to get started. Johns Hopkins University has a specialization in this, and they do a good job with those classes. Okay, Ryan, you're up.

All right. So when we got started with this, we were just using Nagios, a purely alarm-based approach. That's great — it helps you respond to problems quickly and fix them, and you know where the problems are. But it doesn't help you get ahead of the problems, and it's hard to go back and diagnose a problem from alarms once it has passed. So we've been moving toward a time-series visualization approach. It's a lot easier to consume the data and to understand it. You can see trends, whether you want to use those for planning or to see whether you're headed for failure. And you can see correlations, so you know if one thing is causing another problem, or if one thing is a predictor.

For instance, this is a snapshot of one of our storage graphs. The top panel was designed so we could see how our capacity is being used. I'm not sure how well you can read it, but the green is available space, and the other colors split out how storage is being used — for instances, volumes, that sort of thing. The bottom panel was designed so we could see the health of our Ceph cluster; it shows the different states of the placement groups. You can see the little blue section at the bottom — those are placement groups being remapped — and the red spikes are degraded placement groups. From the combination of the two, you can see that we were actually undergoing a Ceph cluster expansion when the remapping started, followed by the degraded placement groups. So it was obviously a Ceph cluster expansion gone wrong.

For data collection and display, we're using a combination of Monasca and Grafana. Monasca is the OpenStack monitoring-as-a-service solution. It's flexible, it's scalable, and it's multi-tenant, which means not only can we use it for our own monitoring, we can provide this information to our customers as well. For visualization, we're using Grafana, an open-source time-series data visualizer. It also supports multi-tenancy, so our customers can have individualized dashboards too.

So where has this worked well for us? Right before we started setting this up, we were having an issue with network slowness. It would appear randomly, last for a random amount of time, and then disappear without warning, and we had a hard time tracking it down. All we really knew was that we'd traced it specifically to the APIs — it seemed to affect all of them — and we were getting conflicting results from different people who were testing it. So we started monitoring our APIs as soon as we set this up.
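To make that concrete, here's roughly what such a probe can look like — a minimal sketch, not our production code. It assumes a Keystone v3 endpoint and Monasca's POST /v2.0/metrics call; the URLs, credentials, and metric name below are placeholders:

```python
# A minimal sketch of an API-latency probe, assuming a Keystone v3
# endpoint and Monasca's POST /v2.0/metrics call. All URLs, credentials,
# and metric names below are placeholders.
import json
import time

import requests

KEYSTONE_URL = "https://keystone.example.com:5000"  # placeholder
MONASCA_URL = "https://monasca.example.com:8070"    # placeholder

AUTH_BODY = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "monitoring",           # placeholder
                    "domain": {"id": "default"},
                    "password": "secret",           # placeholder
                }
            }
        }
    }
}


def time_token_issue():
    """Time how long Keystone takes to issue a token."""
    start = time.time()
    resp = requests.post(
        KEYSTONE_URL + "/v3/auth/tokens",
        headers={"Content-Type": "application/json"},
        data=json.dumps(AUTH_BODY),
        timeout=30,
    )
    resp.raise_for_status()
    return time.time() - start, resp.headers["X-Subject-Token"]


def post_metric(token, name, value, dimensions):
    """Push one measurement into Monasca."""
    metric = {
        "name": name,
        "dimensions": dimensions,
        "timestamp": int(time.time() * 1000),  # Monasca wants milliseconds
        "value": value,
    }
    resp = requests.post(
        MONASCA_URL + "/v2.0/metrics",
        headers={"X-Auth-Token": token,
                 "Content-Type": "application/json"},
        data=json.dumps(metric),
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    elapsed, token = time_token_issue()
    post_metric(token, "keystone.token_create_seconds", elapsed,
                {"service": "keystone", "region": "region-1"})
```

Run from cron every minute or so, a probe like this produces exactly the kind of time series we're showing on these slides.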
From this graph, we got our first confirmation of the problem, and also an idea of what was actually happening. This is just the time it takes to create a server, and you can see it slowly gets worse and worse until it hits a point of failure and bottoms out there, and then it just recovers randomly. We also found that Keystone tokens were another great indicator of the problem. When we were testing this before we had the visualization, people would get conflicting results, and here you can see pretty clearly why: sometimes tokens take the normal amount of time, and sometimes they take far longer. By comparing these graphs, we also found that this was a leading indicator — the tokens would often slow down up to half an hour before we started seeing problems in our other services. That helped us get ahead of the problem: when we saw Keystone starting to slow down, we could go in and do some maintenance. In our case, we discovered that restarting the APIs helped, so we could restart our APIs quickly and prevent this from ever becoming noticeable to customers. Like I said, this was a great example of success for us, and it was the first time we used this. Of course, it's still a slightly reactive approach — we were waiting for Keystone to show there was a problem and then reacting to that, even though we were getting ahead of the real issue. Getting farther ahead of issues is something we're still working on.

Of course, just having the data doesn't mean we always succeed. Recently we were rolling out new kernels to all of our nodes. The new kernel performed fine in our dev and staging environments, but once we got to prod, we started seeing node failures with soft lockups. Once we'd dealt with the immediate issue and went back looking for any indication of why this might have happened, we found that when we deployed the new kernel, there was a significant change in the pattern of CPU usage: the user percentage doubled or even tripled in some cases. That might not be immediately traceable back to the kernel issue, but if we had noticed it at the time, we would have taken a closer look at the kernel and maybe not deployed it to production without further testing. There were several reasons we didn't catch this. One was simply too much data: we didn't have enough people to sift through it, and we didn't have the visualization for it — a lot of our dashboards hadn't been created at that point. Also, this was a new process for us, and we didn't know what to ask. We deployed the new kernels to dev and staging, saw no immediate failures, called that good, and went on to prod. If we had asked ourselves whether there were any signs of failure — anything indicating it might fail in an environment with higher load or other factors — we might have caught it earlier.

So we're still improving, and these are the three big points I wanted to make. First, we're trying to capture less data. Early on, network data was a big culprit for us: by default, when we set up Monasca, it was capturing network information on all of our devices. When we went back and reviewed that, we found we could cut out nearly 70% of our data, and that has helped us narrow in on network problems a lot faster.
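As an aside, "capture less data" can be as simple as filtering before you post. The sketch below is purely illustrative — the metric shape and the whitelist are made up, and this is not the monasca-agent's actual plugin interface:

```python
# Purely illustrative: filtering metrics before they're posted, so data
# you'll never query is never stored. The metric shape and the whitelist
# are made up; this is not the monasca-agent's actual plugin interface.

# Only the interfaces we actually ask questions about.
INTERFACE_WHITELIST = {"eth0", "bond0"}


def keep(metric):
    """Return True if a metric is worth storing."""
    device = metric.get("dimensions", {}).get("device")
    if metric["name"].startswith("net.") and device not in INTERFACE_WHITELIST:
        return False  # drop taps, bridges, veth pairs, and so on
    return True


def filter_batch(metrics):
    return [m for m in metrics if keep(m)]


if __name__ == "__main__":
    sample = [
        {"name": "net.in_bytes_sec", "dimensions": {"device": "eth0"}, "value": 1200.0},
        {"name": "net.in_bytes_sec", "dimensions": {"device": "tap3fa2"}, "value": 80.0},
        {"name": "cpu.user_perc", "dimensions": {}, "value": 12.5},
    ]
    print(filter_batch(sample))  # the tap-device measurement is dropped
```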
We're also adding better visualization. We started on a branch of Grafana 1; we recently rolled out Grafana 2.6 to our production environment, and Grafana 3 is coming out soon, so we have high hopes for that. We've also just been creating more dashboards, so that more of our data is easily consumable. And lastly, testing is something we haven't done much of up to this point — at least this sort of testing — but we're hoping to do more of it, because if we know where the stress points are, and at what point our cloud fails, then we'll be able to see whether we're actually headed for a disaster.

One last point I want to make is that we don't just use this ourselves; we also give our customers access to this data. We monitor all of our customers' VMs for basic health information and provide them with this default dashboard, and we're in the process of adding router monitoring so we'll be able to provide them with that information as well. And this isn't just to help them — it helps us; we're selfish that way. When customers have a better view into what's going on in their instances, they're much less likely to come to us with problems they can fix themselves, or that have nothing to do with us. And when they do have to come to us, they come with information that lets us solve the issue faster. Of course, when customer instances are more stable and their applications are performing better, we just get to deal with happier customers. Brad, you're up.

Okay, thanks, Ryan. I should start by apologizing: the Mirantis bear was going to make an appearance as Smokey the Bear, but it got into a barrel of fermented apples at StackCity last night and is still recovering, so apologies for that. We're going to shift gears a little and talk about some of the nuts and bolts behind our monitoring solution.

So we stood this up a little more than a year ago, and we'd heard really good things about the Monasca project. We liked that it could graph data over time and that you could alert on metrics as they came in. Currently — and even then — the two supported backend metrics databases are InfluxDB and Vertica. Vertica is an HPE proprietary database, and InfluxDB is an open-source database — or was until a few weeks ago; their clustering has recently become closed source, I think. But at the time, unfortunately, version 0.9 of InfluxDB was just coming out, and its clustering really wasn't stable. After spending several months with it, we decided to meet our delivery deadline by using Vertica. And so most of our technical challenges — what turned out to be scale issues for us — were related to Vertica, and ended up being Vertica-specific.

The biggest problem we had when we initially stood our cluster up was that we had no way to guarantee that, irrespective of the number of queries coming into the system, data was still being written into it. The main thing we did to fix that was to use a Vertica feature called resource pools. I don't know how familiar people are with Vertica, but resource pools let you allocate more threads and memory to one process, or one set of queries or database statements, versus another. By allocating more resources to our persister pool, we can now guarantee that writes still make it into the database, no matter how many queries are coming in.
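For anyone who wants to try this, here's a sketch of the resource-pool change using the vertica_python driver. The pool name, memory size, concurrency, and the persister user are illustrative values, not our production configuration:

```python
# A sketch of the resource-pool fix, using the vertica_python driver.
# Pool name, sizes, concurrency, and the 'persister' user are
# illustrative values, not our production configuration.
import vertica_python

CONN_INFO = {
    "host": "vertica01.example.com",  # placeholder
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",             # placeholder
    "database": "mon",
}

# Reserve memory and threads for the persister's writes so read queries
# can't starve ingestion.
STATEMENTS = [
    """CREATE RESOURCE POOL persister_pool
           MEMORYSIZE '8G'
           PLANNEDCONCURRENCY 16
           PRIORITY 10""",
    # Route everything the persister user runs through that pool.
    "ALTER USER persister RESOURCE POOL persister_pool",
]

conn = vertica_python.connect(**CONN_INFO)
try:
    cur = conn.cursor()
    for stmt in STATEMENTS:
        cur.execute(stmt)
    conn.commit()
finally:
    conn.close()
```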
The other thing: Vertica stores its data in what are called projections, and when you first set up your schema, you decide how you want Vertica to store that data. We had been using a segmented projection, segmented by OpenStack project. What that did was put my data on one or two nodes in the cluster, and maybe another project's data on another couple of nodes, which sounded like a good idea for distributing the data — except that as we tried to horizontally scale our database cluster, it didn't matter how much we scaled it, because from any one project's perspective, all of the data still lived on only two nodes. It took us a while to figure that out, but once we did, we switched the main measurements table — the metrics table — to a replicated projection, and that was great. Now we can grow the database cluster and any node in it can satisfy any query, which finally gives us horizontal scaling.

The other thing we learned about Vertica is that it prefers larger batch writes at less frequent intervals. We were very chatty when we initially set our cluster up: we had bunches of threads writing little bits of data, to get it in as quickly as we could, and it was just too much for our Vertica cluster. So we slowed that down and increased our batch sizes, and now it's great. Vertica handles big batches really well and stuffs them in very quickly, and it frees the database up to get data out, which is what you want.
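Here's a hedged sketch of both of those Vertica changes — the replicated projection and the batched COPY — again using the vertica_python driver. The table and column names are illustrative, not Monasca's real schema:

```python
# A hedged sketch of both changes, using the vertica_python driver.
# The table and column names are illustrative, not Monasca's real schema.
import time

import vertica_python

CONN_INFO = {"host": "vertica01.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "mon"}

conn = vertica_python.connect(**CONN_INFO)
cur = conn.cursor()

# 1. Replicate the measurements table on every node (UNSEGMENTED ALL
#    NODES) so any node can answer any query, instead of segmenting by
#    project and pinning each project's data to a couple of nodes.
cur.execute("""
    CREATE PROJECTION measurements_replicated AS
    SELECT * FROM measurements
    UNSEGMENTED ALL NODES
""")
cur.execute("SELECT REFRESH('measurements')")
# (Once the refresh completes, you'd drop the old segmented projection.)

# 2. Write in large, infrequent batches with COPY instead of many small
#    inserts from many chatty threads.
now_ms = int(time.time() * 1000)
rows = ["%d,%d,%f" % (i, now_ms, 0.5) for i in range(10000)]
cur.copy(
    "COPY measurements (metric_id, time_stamp, value) FROM STDIN DELIMITER ','",
    "\n".join(rows),  # one 10,000-row batch, not 10,000 round trips
)
conn.commit()
conn.close()
```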
Another big one — Ryan touched on it a little — is just maintaining the size of our database and making sure it doesn't grow without bounds. It took us a while to figure out the network metric thing he talked about, and he's not kidding: we really did delete about 70% of the data that was in the database at the time. In addition to that, we're pruning our metrics — I've got a slide coming up on our retention policy and pruning — but keeping control of the size of that data has been really helpful.

Okay, if you've been to some of the Monasca talks this week, you've hopefully seen Roland and company's slides on the Monasca architecture, so I'm not going to repeat that here. But here's a slide you might not have seen, and it's something that would have been helpful to us when we were setting our cloud up: a working example of how people have deployed Monasca. On the left is essentially all of the client-side nodes we have, and this slide shows a single region — we have two regions, two data centers. So we're monitoring about 250 physical nodes, of various types, on the left side of this diagram. All of those nodes run the lightweight Monasca agent — a Python agent — sending default system metrics into the Monasca backend. Then we also have our Icinga monitoring node, and that's typically where we run the custom Monasca plugins we write, for data that's not part of the default Monasca install, and push metrics from there.

All of that goes through load balancers and hits the backend. On the right, in each region, we have a three-node API cluster, and that runs the whole Monasca stack: the API, the persister, and — if you've been to some of the architecture talks — Apache Kafka, Zookeeper, Apache Storm, and a notification process. Now, with Grafana 2, the Grafana API itself is also hosted on that three-node cluster. Behind that, in each region, we've grown our database cluster to six nodes. And this is working — it's humming right along for us right now, and we're happy with it.

But what does that get us? Here are some measurements our QA team took maybe two weeks ago with that configuration: how much data can we push into the cluster, and how much can we get out of it? Pushing data in, about 41,000 metrics per second is our threshold — the rate at which the persister, the Kafka consumer processes, can still keep up. We can go over that; it's a message bus, so the backlog builds up and the persisters eventually catch up. But with this cluster size, that's about the threshold beyond which our consumers can't keep up. Not that data gets lost — it just doesn't immediately make it into the database. For getting data out: with our current configuration, a JMeter test simulating 30 concurrent users hammering the database for about 10 minutes, retrieving an average of about 15 days' worth of data per query, we can pull 55,000 metrics per second — and that's just at our current cluster size. Now that we know we can scale horizontally, we know roughly where we are capacity-wise and what to do if that's not enough. I should also point out that these tests ran against our production system, not an idle test system, so these are conservative numbers — it's the system that's up and ingesting our infrastructure metrics. Just last month we released this as a beta for our internal Time Warner Cable projects, so it's an active system, and these are pretty conservative numbers given that it was still servicing all of our dashboards and our day-to-day monitoring.

In terms of footprint: I told you we stood this cluster up about a year ago — almost a year and a half, really. We've only got five months of data, because we had a bug in the prune script we wrote and had inadvertently been deleting our own infrastructure data. That's bad, and I'm glad Jason didn't fire me for it — thank you. But five months translates to a bit under two terabytes of data in Vertica right now. Vertica is a paid product — I think we have a 10-terabyte license — so we're still well under it; it's really about 1.6 terabytes, not a huge storage footprint. We've taken a couple of measurements of a day's worth of metrics coming into the database, and our infrastructure runs at about 5,200 metrics per second right now, which is a little over 10% of our ingest capacity. And, I don't know if it's worthwhile or not, but if you do a row count on our measurements table, five months of data is about 40 billion measurements.

These last two bullets are our current retention strategy. We've got a script that runs nightly: for non-infrastructure projects we delete anything older than six weeks by default, and our plan, now that we've fixed the bug in our pruning script, is to keep 13 months' worth of data for ourselves. The script does let us take special requests from customers if they feel that isn't enough, so we can work around that, but going forward this is our default plan.
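The nightly prune is conceptually just a scheduled DELETE per project. Here's a minimal sketch, assuming the vertica_python driver and an illustrative schema — the six-week and 13-month windows are the defaults just described:

```python
# A minimal sketch of the nightly prune, assuming the vertica_python
# driver and an illustrative schema. The six-week default and 13-month
# infrastructure window are the retention values described above.
from datetime import datetime, timedelta

import vertica_python

CONN_INFO = {"host": "vertica01.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "mon"}

RETENTION_DAYS = {
    "default": 42,          # six weeks for customer projects
    "infrastructure": 395,  # ~13 months for our own metrics
}


def prune(cur, project_id, days):
    """Delete a project's measurements older than its retention window."""
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
    cur.execute(
        "DELETE FROM measurements WHERE project_id = %s AND time_stamp < %s",
        (project_id, cutoff),
    )


conn = vertica_python.connect(**CONN_INFO)
cur = conn.cursor()
cur.execute("SELECT DISTINCT project_id FROM measurements")
for (project,) in cur.fetchall():
    days = RETENTION_DAYS.get(project, RETENTION_DAYS["default"])
    prune(cur, project, days)
conn.commit()
conn.close()
```

The bug we mentioned was exactly the dangerous part of a script like this: get the project-to-window mapping wrong and you quietly delete your own history, so it pays to test the WHERE clause as a SELECT first.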
Let me check the time. Okay. So we tried to think about this from the perspective of: if we were standing this up today — if there's a team in the room thinking about trying Monasca — what might we do differently? We hinted at it already, but we really didn't follow the process Steve talked about when we first started. We were so excited to get it up and running that we pretty much took every agent plugin that was available and worked, and started shoving data into the database. We quickly found out that — in our opinion, at least — that's not the right approach. It's better to be intentional and only push in data that you're going to try to answer a question with, and that you can actually do something with.

The other mistake we made was putting the entire Monasca API stack and the database on the same node. We met with the Vertica support team and got our hands slapped in the first 30 seconds: "You can't do that. Vertica wants the whole node; we're a greedy database." So that led us to split the database off of the API node that runs the rest of the stack.

Another thing came out of our onsite with the Vertica support people — and it's a tool that should be part of the product, in my opinion. The first thing they do when they show up at a customer site that's bitching about — we're not recording, are we? Yeah? Good — complaining about their database not working is run a little Python script that runs the magic commands against the database and tells them where your bottlenecks are. It's called vBuddyLite. They were glad to leave it with us, and they gave us permission to put it in our GitHub repo, so you can find it if you look for Puppet Vertica, the little module that we wrote. We now put it on our nodes, and it's a great tool. They kind of minimized its value — "all it does is run a bunch of commands and SQL queries" — but we didn't know what we should run, and that thing does, so it's pretty awesome. Even two or three weeks ago — we run it periodically — it found another bug, in the API process, where a query wasn't fitting in memory inside our API resource pool. So it's a great tool; if you're going to use Vertica, I highly recommend it, and we have some links at the end that can help you find it.

And then, just recently, we've grown the Vertica cluster to try to increase our concurrency numbers, and it seems to be working really well. We're at a two-to-one ratio of database nodes to API nodes, and when we really hammer the system, things are pretty balanced in terms of CPU and memory across the cluster. So, at least where we're at right now with our workload, it feels right.

Finally: if you installed Monasca today, you'd be where we were — which database do we use? InfluxDB is stable now, or more stable than it was when we started out, and we're kicking the tires on it. But we have a great track record with Vertica now, and we're very happy with it.
I mean, things could always be better, but it's stable, and with the things we've worked out, it performs. The upstream community is also talking about supporting Cassandra, so sadly I don't have an answer for that one if you were looking for one — if we were doing it today, we'd just have to evaluate all three and see where we'd go from there. Let me check my time... doing okay. This is the last slide.

Okay, so this is Time Warner Cable's upstream wish list; we talked a little about this in our breakout sessions with Roland and company today. The first item is that we're using a forked version of Grafana right now. We forked it — Ryan did all the work — to get Keystone integration into Grafana, but he had to change the Grafana code proper to make that happen. So we're on a fork of Grafana 2.6. We're in conversation with the Raintank folks, and they're excited to engage with OpenStack, so they're talking about adding that to Grafana officially. I don't know half of what Ryan knows about it, so if you have questions, talk to him — but we'd love to get off the fork we're on now.

Another item: right now we just have a hand-built script, run through cron, that decides how much data to keep in the database — or rather, how much to prune out of it — and we'd like to push that into the API. It would behave much like Nova or Neutron, where you can set a quota for a number of instances or floating IPs, and we'd want it scoped to a project, so one project can keep more or less data than others.

The third bullet: give customers the ability to prune their own data. We're keeping six weeks' worth of their data, but even in our own process of writing custom plugins and pushing custom metrics into Monasca, there's a life cycle there, and what we've found is that we don't get it right the first time. You push some metrics in, you get to the point of trying to graph them in Grafana, and you realize: I should have added this dimension, so I could split it out by region, or by whatever dimension makes sense. Part of that life cycle — and we anticipate seeing this with the customers using Monasca in our cloud right now — is that they're going to push the wrong metrics in, and they're going to want to delete them. So we'd like the ability for customers to delete their own data if they choose. They could wait six weeks, because it drops off after six weeks, but that's an irritating amount of time to stare at a mistake.

Then there are a few more things we've identified to improve performance. We've got a patch upstream right now that HPE is working on, and it will make our graphs lightning fast compared to the way they are today, by having the API return multiple metrics in a single query. One of the dashboards Ryan showed displayed CPU utilization for ten instances or so, and today that translates to ten API calls and ten database calls; it's easy enough in the database to do that in one shot. That patch is close to landing, and we're excited about it — it's going to be great. We've also talked about caching in the API layer. If you really get down into the guts of what the API is doing in the database right now, it makes a lot of repetitive queries to figure out which dimension-set IDs to use to query my data, or anybody else's, and they just get repeated over and over, typically returning the same thing — so it's begging to be cached in the API layer.
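To illustrate the idea, here's the kind of small TTL cache that could sit in front of those repeated lookups. This is a sketch of the general technique, not Monasca's actual internals — dimension_set_ids and run_query are hypothetical stand-ins:

```python
# A sketch of the general technique, not Monasca's actual internals:
# a tiny TTL cache in front of the repeated dimension-set lookups.
# dimension_set_ids and run_query are hypothetical stand-ins.
import time


class TTLCache:
    """Cache computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.time()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                 # served from cache
        value = compute()                 # miss: run the real query
        self._store[key] = (now + self.ttl, value)
        return value


dimension_cache = TTLCache(ttl_seconds=300)


def dimension_set_ids(tenant_id, metric_name, run_query):
    """run_query is the (hypothetical) repeated database lookup."""
    return dimension_cache.get_or_compute((tenant_id, metric_name), run_query)
```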
And then, finally, a couple of last things that we haven't finished or tied off. After we met with the Vertica folks, they gave us some recommendations on horizontal scaling. We could actually reduce the amount of chatter between the Vertica nodes if we provide a key-value hint that basically says: you don't have to talk to the other nodes in the cluster, we know the data is local. You can tell Vertica that, and it will reduce traffic between nodes and speed things up even further. They also made a recommendation about our main measurements table: in order to dedupe data being pushed into some of the tables, we're writing data into a temporary table and then merging it into the real table, and Vertica really doesn't like that — it's an extremely inefficient way to do it. With Vertica 7.2 there's a better way: you can push the data straight in with a COPY statement and tell Vertica to do that deduping for you. So there's our list — that's our wish list. Roland, I hope you took a screenshot of that.

Okay, and here are some links to the things we talked about — that top one will lead you to a whole bunch more if you're interested. I think we're right on time, and after three days I don't think we have to remind people to come up to the microphones if they have questions. That's it for our slides; I'll leave that up there. I can't see... All right, thanks.