Okay, so today we're going to talk about capacity management and provisioning. The subtext is "the cloud's full, you can't build here." It's a riff on the scene from Forrest Gump where he's trying to find a seat on the bus and no one will let him sit down. That's been our experience at Rackspace with the public cloud in terms of what happens, capacity-wise, that prevents instances from building. My name's Andy Hill; Joel Prizes is with me, along with Matt Van Winkle.

So, public cloud capacity at Rackspace: we've deployed over 100 cells in the last two years. We usually have five or six cells in flight, and we're constantly adding new capacity to every region. Our process used to take a systems engineer three to five weeks after the gear was handed over to them with a bare-metal install. We've taken that process from three to five weeks for a skilled engineer down to the operator on shift adding capacity, with new cells coming up in as little as one day, though usually it takes around a week for a new cell to come online. The constraint there is networking: top-of-rack network configuration still has some issues around automation, and we don't yet have the right pieces in place to make it as well-oiled a machine as the rest of our provisioning.

Around control plane sizing — the Nova DB, the Nova scheduler, the sizing of those nodes that we run — we have to have some idea of the data plane and its impact on those nodes. For example, if we have a very, very wide data plane doing lots and lots of downloads and uploads of images from Glance, we have to scale out our Glance architecture. The other impact of the number of compute nodes — the breadth of the data plane, I guess — is how large the Nova DB should be. If we have 600 hosts in a given cell, and each one could have up to 25 instances on it, and the instances are churning all the time, that's a ton of records in the Nova DB. So we have to pay close attention to the sizing of our cells in terms of the number of compute nodes involved, and then to the sizing of the control plane components that orchestrate that data plane.

The other sizing consideration we make for cells is around private addressing space. If you think about one of those very large cells I just mentioned, it could be up to a /17 for the private addressing space, and that's a really large broadcast domain for the instance traffic. Once you get to broadcast domains around that size, it gets kind of interesting on the Open vSwitch side of things, to say the least. We have a presentation later today that goes into detail on that — shameless plug — but that's just one of the considerations we make for our cell-sizing build-outs.

And then there's the unsung, underappreciated consideration for your cell control plane sizing: overhead and complexity. We don't have infinite resources for the control plane; we have to right-size everything. You can't just keep adding nodes, services, and components to each cell, because that's more complexity for the operators, more complexity for the deployers, more complexity overall. So we really try to right-size with the size of the cell and its nodes, and minimize that, so that the control planes we spin up are as cheap as they can be — and as simple as they can be as well.
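To put rough numbers on that Nova DB point, here's a minimal back-of-envelope sketch; the density and churn figures are illustrative assumptions, not Rackspace's actual numbers.

```python
# Back-of-envelope cell sizing sketch (illustrative assumptions only).
hosts_per_cell = 600          # compute nodes in one large cell
instances_per_host = 25       # maximum instance density per hypervisor
monthly_churn = 0.5           # fraction of instances deleted and replaced each month
months_retained = 12          # how long soft-deleted rows linger before pruning

active_rows = hosts_per_cell * instances_per_host
deleted_rows = int(active_rows * monthly_churn * months_retained)

print("active instance rows:   %d" % active_rows)                   # 15,000
print("soft-deleted rows kept: %d" % deleted_rows)                  # 90,000 under these assumptions
print("total instances rows:   %d" % (active_rows + deleted_rows))  # 105,000
```

Because Nova soft-deletes instance records, a churny cell keeps accumulating rows until they're pruned, which is why the control plane database has to be sized against the data plane rather than in isolation.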
Some of the hypervisor provisioning considerations we have to make are around CoW images — copy-on-write images. If you're coming from a VMware world, it's like the snapshot of a VM that hung around forever and ended up filling your drive. You have to plan for some overhead with XenServer: the CoW image size can be two times the size of the disk, so that can really be painful. And sometimes there are problems with cleaning up these snapshots afterward. The software we use to create and clean up the snapshots sometimes fails, and that can lead to scenarios where the disks on the hypervisors themselves are full.

We also do pre-caching of images at Rackspace. The standard images you get when you sign up with Rackspace — your Ubuntu image, your Fedora image — are what we call base images, and we cache those directly on the hypervisors themselves to make build times as fast as possible. That's also drive space overhead we have to account for when we're sizing everything out; otherwise we could fill up the drive just like we were talking about. So the combination of the pre-cached images and the CoW overhead can produce some interesting scenarios that you should really be aware of when sizing out your hypervisors. And these are just the Nova config options for that pre-caching option.

Okay, so the other things to be cognizant of when you're thinking about sizing out a cell and its control plane: sometimes pretty awful things happen to the hypervisors that are being used, so you need some excess capacity for emergencies. If you don't have a spare chassis to put customers on when an existing chassis crashes, that's a really bad situation. There are options within nova-cells to set a reserve. The other thing is that an instance is bound to its cell — it's not like we can just move you to another cell that has capacity; cells itself doesn't support moving instances from one cell to another.

And then the final gotcha we encountered in sizing our data plane and control plane was VM overhead. There's a small amount of overhead for each VM that's directly related to things like the number of vCPUs it has, the number of disks attached, all that kind of stuff. However tightly you're trying to pack your hypervisors — if you're trying to maximize those hypervisors — that overhead can really start to matter and become an important factor. There's now accounting for that overhead in the scheduler, and there are links to the patches that landed; this is all available today. And I'm going to hand over to Joel for the bigger problems that we've seen in production.
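Before Joel picks up: as a reference point for the options Andy just mentioned, here's a hedged sketch of the kind of nova.conf settings involved — the option names reflect the XenAPI driver and nova-cells of this era, so treat them as assumptions to verify against your own release.

```ini
# Hedged sketch of the Nova options being referenced (verify names against your release).

[DEFAULT]
# XenAPI driver: keep base images cached on the hypervisor so builds
# don't have to re-fetch them from Glance. Historically: all / some / none.
cache_images = all

[cells]
# Hold back a slice of each cell's capacity for emergencies, e.g. for
# evacuating customers off a failed chassis.
reserve_percent = 10.0
```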
Hey, everybody, I'm Joel. I'm going to be talking about, as Andy mentioned, some of the big gotchas we hit when we're trying to scale out our infrastructure to meet the needs that we see. There are way more than we could cover in the time we have here, but these are some of the main ones we run into.

So, specifically, load balancers — and that can be expanded out to load balancers, firewalls, really any single point you might have, say between cells and your regional control plane, or between the regional control plane and your ingress. Oh, you went ahead real fast. Glance and Swift are a big thing from the control plane standpoint: in terms of pure networking, that's where you're going to generate the vast majority of your bandwidth, so that's where you hit a scaling point before pretty much anything else in terms of running the cloud. Fraud and non-payment instances: if you're not running a public cloud — if you're deploying private clouds or it's just something for internal use — this might not come up, but when you get into a large public deployment, it becomes a major deal for capacity. Routes, routes, routes — I'll get into that a little bit; just routes, so much routes. And road testing, and by that I mean testing the new capacity that you're adding and how to do that adequately.

So, load balancers. Yeah, they can very quickly become a single point of failure, or at least a single point that you can easily saturate. Up there we have "alternate routes needed for high-bandwidth operations." For us that pretty much refers to Glance traffic. We have an alternate routing path that circumvents those devices, specifically for Glance uploads and downloads, so that we're not saturating them. It slows things down when we do, obviously, and then the other shared services going through those connections can start to fail — your database connections and that kind of stuff.

That's another one we have listed up there specifically: database queries. If you were in the talk before this one in this room, they did a pretty good deep dive into some very nasty queries that come in. We've seen some pretty bad scaling issues where unoptimized database queries were sending way more traffic through load balancers than they should have been, to the point that we were saturating, or nearly saturating, those connections. A lot of really good work was done to catch a lot of those and bring that data down, but it's something you have to be really aware of when you're adding new capacity, especially if you're deploying to multiple environments. We've seen cases where things looked great in some of our smaller regions, and then we moved on to the larger ones and the growth for that particular problem was exponential — that step up crossed the tipping point. You really have to keep an eye out in pre-production, and when you're analyzing potential changes coming down the pike, look at the per-instance increase in things like database traffic so that you can plan around it accurately.

Swift and Glance bandwidth — I talk about this quite a bit. Again, a single bottleneck. One thing we do to keep an eye on this, so we know when we need to add capacity to handle it, is alerting based on build times and imaging times. If we see those start to slow down, we know we're hitting some type of scaling point in load for Swift uploads and downloads.
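Here's a minimal sketch of that kind of build-time trigger; the thresholds, the percentile choice, and the metrics feed are all assumptions for illustration, not the actual Rackspace monitoring.

```python
# Hypothetical feed of recent durations, in seconds, from whatever system
# records server build and imaging (snapshot) times.
BUILD_TIME_THRESHOLD = 120     # assumed "normal" ceiling for a server build
IMAGE_TIME_THRESHOLD = 600     # assumed ceiling for an imaging operation

def p95(samples):
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def bandwidth_alarms(build_times, image_times):
    """Slowing builds/imaging usually means Glance/Swift (or the path to them)
    is saturating, i.e. it's time to add or re-architect capacity."""
    alarms = []
    if p95(build_times) > BUILD_TIME_THRESHOLD:
        alarms.append("95th percentile build time over threshold")
    if p95(image_times) > IMAGE_TIME_THRESHOLD:
        alarms.append("95th percentile imaging time over threshold")
    return alarms

print(bandwidth_alarms([80, 95, 90, 300, 85], [400, 450, 900, 420]))
```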
A big thing here that would help considerably, I think, is the Glance API nodes — we don't share the cache between them. So say you have 20 Glance API nodes handling your image downloads: if 10 of them have an image cached and 10 don't, you have a 50% cache hit ratio. It would be a pretty neat feature for the Glance API nodes to talk to each other and share that knowledge among themselves, so that if one or two of them have a cached image that's pretty popular — say a customer's image, not one of your base images — a node could defer to one of the nodes that does have it, and you're not using bandwidth you might not need to by reaching out to Swift or whatever your image store is.

We need to get image transfers out of path if we can. By that I mean that using the Glance API node as essentially a proxy for your traffic from the hypervisors to Swift can be a real pain. You're creating a single point that you're funneling basically all of your traffic through for those transfers, and you can run into some really interesting problems there — I'll get into that towards the end of this slide.

We try to avoid these things by caching the base images, as Andy talked about. The base images are the ones you see when you sign up for Rackspace and do an image list, and if we're rolling out a new Glance API node, we make sure to at least seed it with all of those images so it's ready to go as soon as it goes into production. We also try to pre-seed images to hypervisors ahead of time if possible, and we saw some really big improvements from that in specific use cases. For instance, our big-data-as-a-service offering — the Hadoop offering — uses a very limited set of images, so we made 100% sure that all of their hypervisors have those images on them, and that was a dramatic improvement for their build times, obviously; everything was just good to go. And we're using fast cloning for XenServer, which you can read about on the wiki, so the image-provisioning part becomes near instant at that point — it's basically just make a new CoW and you're good to go.

For troubleshooting problems on this side: Glance and Swift — and, I guess, OpenStack projects overall — aren't as great as they could be at sharing request IDs, so trying to track those problems down can become pretty interesting. It would be really great to have a single request ID you could use throughout the entire stack; having to switch between this request ID and the Swift transaction ID, which can be different, to try to find where your bottlenecks are can become very difficult, and that in turn makes planning for adequate capacity much more difficult.

So what happens if you can't scale out anymore — horizontal scaling, specifically? We reached a point where we were relying almost entirely on horizontal scaling for our Glance API nodes, and, as Andy mentioned, unfortunately we don't have unlimited capacity or unlimited resources to throw at things. We reached a point where we couldn't do that anymore: we couldn't just add more Glance API nodes like we had been, because we were running into saturation problems at the next layers up on the networking side. So we had to actually re-architect and move over to new hardware entirely, using 10-gig NICs. We had to start spreading across different segments of our network to spread that traffic across different aggregation switches, different top-of-racks. You have to really keep an eye on that; otherwise you're going to start impacting other services, not just Swift and Glance specifically.
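Looping back to the image seeding: here's a hedged sketch of one way to warm a new Glance API node's local cache by pulling each base image through it once, so the caching middleware keeps a copy. The endpoint, token, and image IDs are placeholders, and it assumes the node actually has Glance's cache middleware enabled.

```python
import requests

# Placeholders -- substitute a real token and the API node being seeded.
GLANCE_NODE = "http://glance-api-new.example.com:9292"
TOKEN = "<admin-token>"
BASE_IMAGE_IDS = ["<ubuntu-image-id>", "<fedora-image-id>"]

for image_id in BASE_IMAGE_IDS:
    # Downloading the image data through this node causes the caching
    # middleware (if enabled) to store a local copy for later requests.
    resp = requests.get(
        "%s/v1/images/%s" % (GLANCE_NODE, image_id),
        headers={"X-Auth-Token": TOKEN},
        stream=True,
    )
    resp.raise_for_status()
    for _ in resp.iter_content(chunk_size=1024 * 1024):
        pass  # discard the bytes; we only care about populating the cache
    print("seeded %s" % image_id)
```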
So, fraud and non-payments. Fraud — how do we handle it at Rackspace? We immediately mark the instances as suspended because the account is flagged as fraud. And that's great: it stops somebody from being able to use resources you're effectively providing for free. However, they still take up capacity while they're around. So what do you do about that? You don't want a bunch of these suspended fraud instances just sitting there taking up room. About a year and a half or two years ago we developed an internal tool called Account Actioneer, which basically looks at our internal auth system — our user base — and finds the accounts that are flagged as fraud. It has a set of business logic to say: hey, this is fraud, it was marked as fraud X amount of time ago, we're confident at this point that it was in fact real fraud and we're not just suspending and deleting somebody's instance that shouldn't be — go ahead and clean this up. Our rules for fraud are pretty aggressive. We have a whole team devoted to deciding whether somebody is legitimate or not, and we take their word for it, more or less. So once something is marked as fraud, it's pretty aggressively deleted so it's not taking up space in our cloud.

Non-payment is very similar to fraud, but it's worse in terms of capacity management. If it's non-payment, that means it probably got past the fraud stage, and at some point they did give us money — and we're a company, and we like money quite a bit, actually; it's almost one of our favorite things. So we want them to keep giving us money. Somebody might have gone through a rough spot and couldn't make their bill but wants to; we want them to come back into the fold, so to speak. So we give them much more time to come back to Rackspace, basically. It's the same Account Actioneer handling those; it just has a different set of rules that's much more lenient in terms of turnaround time before it deletes the instances. But from a capacity-management standpoint, those instances stick around quite a bit longer. This is where something like shelving instances might be a nice feature to have, or that we could leverage, so that we could completely shelve the instances someplace where they're not taking up that capacity — and then, say, several months down the road, if somebody decides they want to come back: hey, yeah, we've still got that instance, sure, welcome back.
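A minimal sketch of the kind of business logic being described — the grace periods are assumed for illustration and the account feed is hypothetical; this is not the actual Account Actioneer code.

```python
from datetime import datetime, timedelta

# Assumed grace periods: fraud is reaped aggressively, non-payment gets a much
# longer window so the customer has a chance to come back.
FRAUD_GRACE = timedelta(days=3)
NONPAYMENT_GRACE = timedelta(days=60)

def instances_to_reap(accounts, now=None):
    """accounts: iterable of objects with .status, .flagged_at and .instances,
    e.g. a hypothetical feed from the internal auth/billing systems."""
    now = now or datetime.utcnow()
    doomed = []
    for account in accounts:
        age = now - account.flagged_at
        if account.status == "fraud" and age > FRAUD_GRACE:
            doomed.extend(account.instances)    # already suspended; safe to delete
        elif account.status == "nonpayment" and age > NONPAYMENT_GRACE:
            doomed.extend(account.instances)    # lenient window has expired
    return doomed
```

If shelving were available, the non-payment branch could shelve instead of delete, freeing the capacity while keeping the instance recoverable.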
Road testing. You have to aggressively test any new hypervisors you throw out there — we can't just bootstrap something, assume it worked, and hope for the best. So how do we do that? With a new cell it's pretty easy: we use bypass URLs, and by that we mean cell-specific API nodes. Those are basically configured — well, how many people in here are actually using cells? All right: us up on the stage and our friends in the front row. Yeah. What we call bypass URLs — cell-specific API nodes — are probably more similar to what the rest of you who don't use cells see. It's just an API node that doesn't know it's in cells, and it's built to talk directly to the cell that we're provisioning.

So when we're doing our testing, before we link that cell up to our regional control plane, we build using that API node, which bypasses the region. There's no way customers can inadvertently build to that cell. Once we're finished testing, we link it up to the region and delete the cell-specific API node, because once we're on the region we can use scheduler hints to send builds to that cell directly. We're looking at potentially moving to cell tenant restrictions, so we could link the cell up entirely but have it set up so that only our QE tenants, basically, could build to it. That would let us more fully test the entire pipeline for building to the cell and make sure everything is working.

This is a lot harder when you're re-adding capacity to an existing cell. Sometimes we don't provision an entire cell at once; we get certain cabinets at certain times and we put them on as soon as we can. It might just be a new cab that got delivered, or it could be a re-kick — say a host we failed out for hardware problems got rectified and we're putting that hardware back in. How do we test that? That cell is already linked up to the region; it might already be in production. We can't — and don't want to — stop customers from building to that cell as a whole, and there's no way to disable a compute node or a hypervisor but still be able to run test builds to it. We can disable it, obviously, but then our test builds fail. That makes things a lot trickier, because we want to test that node, and we want to test it in the cell it's in and everything, but we can't prevent customers from potentially landing on it. We've got a huge number of builds coming in — even if you flip it on for just one second to try to get your build in and then close it back down, odds are pretty good that somebody might land on there. That's a real interesting challenge for us.
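Here's a hedged sketch of what that road-testing workflow can look like with python-novaclient: a build through the cell-specific (bypass) endpoint before the cell is linked, and a directed build via the cells target_cell scheduler hint afterwards. The endpoints, credentials, and cell path are placeholders, and the hint name is the one the cells TargetCellFilter used in this era — verify against your release.

```python
from novaclient.v1_1 import client

# 1) Before the cell is linked to the region: build through the cell-specific
#    "bypass" API node, which talks straight to the new cell.
bypass = client.Client("tester", "password", "test-tenant",
                       auth_url="http://cell23-api.example.com:5000/v2.0")
bypass.servers.create(name="roadtest-1",
                      image="<base-image-id>",
                      flavor="<flavor-id>")

# 2) After linking: go through the regional API, but pin the build to the new
#    cell with the target_cell scheduler hint (an admin-only cells filter).
regional = client.Client("admin", "password", "admin-tenant",
                         auth_url="http://region-api.example.com:5000/v2.0")
regional.servers.create(name="roadtest-2",
                        image="<base-image-id>",
                        flavor="<flavor-id>",
                        scheduler_hints={"target_cell": "region!cell23"})
```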
Matt's stuff. So these guys, just to put it in perspective, do most of the heavy lifting — I'm the BDI manager type. But even capacity planning and management, with respect to OpenStack and the business, has a few tricks to it at that level: specifically around the supply chain and how you handle this ever-flowing stream of boxes of different configurations, some of our own decisions from a product development standpoint and how they change the way we plan capacity, and, ultimately, some challenges from upstream that the guys touched on and that we'll get to in a second.

So when you start dealing with a public cloud — or even internally, if you start expanding an internal cloud to more and more groups — you get different constraints around capacity planning. My favorite is the large customer request: hey, I'm a sales guy, and I know this customer is going to need X. You have to look at those kinds of requests and think about them in terms of overall capacity, oversubscription, those sorts of things, and sometimes that actually impacts the speed at which you pull in your supply chain, when you expect delivery of cabinets, and how you plan what's being built out next and in which regions or areas of the deployment.

We have a lot of triggers we look at. The easiest one is the percent used in a given space — you know, at 70%, okay, it's probably time to order. The trickier one, at least when you're dealing with finance departments and some of the other folks you sometimes have to deal with, is the largest number of slots available. We offer a range of sizes in any of our flavor classes, and some of them include whole-host sizes — you can have an instance that effectively takes up all the RAM on the host. In those flavor classes it's very important to track the number of empty hosts available, regardless of percentage, because we can run into problems where, say, a new or smaller region may only be at 56% capacity but there's no room for the largest instances in that flavor class to build — and that's just as bad as being out of capacity altogether. So that's the trickier one, but we try to keep an eye on it.

And then, ultimately, we're a public cloud provider, so IPv4 addresses become an interesting constraint on deploying capacity. We'd love a lot more, but that's kind of hard to do these days. As ARIN wraps up its allocation, for example in the US, the process of getting more is very stringent, so we're stuck waiting on the company to negotiate its next round of IP addresses, both for us and for our dedicated business. Sometimes that means we're watching some graphs get dangerously close to numbers we don't like while we have gear sitting there, and all we need is the public IP addresses to drop on it to support the customers. That'll only get more interesting as time goes on. The other big thing here is the schedulers: the cells and scheduler services in Nova aren't aware of IP addresses as a constraint for scheduling builds. That's something we're trying to start talking to people about, because it's actually probably more useful for us right now than RAM — we typically run into more problems with IP addresses and provisioning than we do with RAM and provisioning.

Then we have a couple of services we've built, auditor and resolver. There's a small project in the community called Entropy that does some of the same functionality, and we've talked with those guys a little bit. Basically these are services that help us keep an eye on various aspects of the environment and, in some cases, take action. The simplest one I can describe: we have the ability to weight down a cell for builds as soon as IP thresholds drop below a certain amount — it goes in and updates that information. Now, we're still fighting for an actual disable flag, because — well, for those who use cells — cell weighting is kind of magic voodoo. It's not an on-or-off thing; it's a relative value, so you can attempt not to build to a cell and yet it still wins because it has the most available RAM.

These guys alluded to this, but it's an interesting aspect of how we manage capacity: our control plane actually runs on an OpenStack installation itself. We stand up a small private cloud, if you will, and in there we build the instances that become the control plane for the public cloud. So when we talk about horizontal scaling, we literally were just spinning up more and more instances until we saturated the top-of-rack switches supporting that little environment, and to do the Glance expansion we actually went out and took performant hardware from the customer fleet, made those into bare-metal boxes, and we're now circling back and making other changes for that.
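Coming back to those capacity triggers, here's a minimal sketch of the two checks just described: the simple percent-used threshold and the trickier "room for the largest flavor" count. The inventory feed, the flavor size, and the thresholds are assumptions for illustration.

```python
# hosts: list of dicts like {"ram_mb": 131072, "ram_used_mb": 98304},
# coming from whatever inventory/monitoring feed is available (assumed here).

PERCENT_USED_TRIGGER = 0.70      # "time to order more gear" threshold
LARGEST_FLAVOR_RAM_MB = 122880   # a whole-host flavor that needs an (almost) empty host

def percent_used(hosts):
    total = sum(h["ram_mb"] for h in hosts)
    used = sum(h["ram_used_mb"] for h in hosts)
    return used / float(total)

def largest_flavor_slots(hosts):
    """How many hosts can still fit the biggest flavor right now?"""
    return sum(1 for h in hosts
               if h["ram_mb"] - h["ram_used_mb"] >= LARGEST_FLAVOR_RAM_MB)

def capacity_alarms(hosts, min_slots=5):
    alarms = []
    if percent_used(hosts) >= PERCENT_USED_TRIGGER:
        alarms.append("region over 70% used - time to order")
    if largest_flavor_slots(hosts) < min_slots:
        # A region can be only ~56% used overall and still have nowhere to put
        # the largest instances, which is just as bad as being full.
        alarms.append("too few whole-host slots left for the largest flavor")
    return alarms
```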
All right, so from a product management perspective, we're always trying new things. In just over a year — about 13 months or so — we've added four completely new flavor classes and renamed two of them, and each one of those brought with it a completely different hardware footprint, which meant a different VM density, which meant different sizing per cell and a different number of cabinets per aggregation switch. I think we have six or seven different types of flavors that we manage across all the different regions, so that's just more variables you have to play with.

We're also constantly doing code deploys. Our goal is to try to stay relatively close to trunk, usually within a few weeks — I want to say the one we're testing right now is from mid-October, so we're not too far off — but that has its own pace of deployment, and then you have all this new capacity coming along, so there's always that period where you have to sync up and make sure that, okay, we started this cell, we deployed a new version to the region, so we go back and patch this cell before we flip it on and cause all kinds of havoc.

Even within our different hardware classes we have multiple vendors, and sometimes we find subtle differences in using hardware from vendor X versus vendor Y. In our oldest flavor class — it's called standard — we actually have three separate vendors, and in two of those cases we have multiple revisions of the hardware from that vendor. Typically that doesn't hurt us at the Nova level, but from a total capacity management standpoint we do find quirks along the way, so it ends up meaning all of our tooling has to be able to determine which of the seven hardware variants a host is. And then, of course, non-production environments put additional strain on capacity — not only hardware resources, but the time, effort, and tooling spent getting those up, in competition with the production environments.

So, upstream. Like I said earlier, a disable flag for cells — a lot of our upstream needs really circle around cells. There are actually three sessions at this design summit specific to cells, one today and two tomorrow. The Nova team is looking to feature-complete it, possibly even make it the default in a sort of no-op, single-cell, don't-have-to-worry-about-it kind of model. A disable flag for cells is big. Also, for the host, like the guys talked about: just being able to manage an admin-only flag or a disable flag that lets me run test builds to a specific host without exposing it to customers before testing. Scheduling based on IP capacity, which I talked about. And just overall the cells feature-completion pieces — I know that's being pushed very heavily by Michael Still, and quite a few of our core devs are involved in those discussions; I think we're going to be sitting in them today and tomorrow to make sure, from an operator's perspective, those things see their way through. But, to give folks time for questions — is that the last one? Yeah, we have a couple of minutes left, so: any questions?

It's most likely "disable host," yes, but we've had trouble with it. So, availability zones are part of the feature completion for cells — this is wrapped up in the first-class-citizen work for cells. We can disable the host, and we can deploy to it and do everything we want to set it up.
But then it's difficult to test it while it's still disabled without exposing it to customers to build to as well. Even if you use a scheduler hint directed right at it, it will fail if it's disabled. So we can flip it on, do a full-host build to it real quick, and hope we get in there first — but yes, if the flag is set to disabled, it fails for us; it comes back with a scheduler failure. I'd like to know more, but at the same time we want a full-fledged VM, because we also test a bunch of stuff inside the VM, so we would prefer to have an admin-disabled state that lets us do a full, complete test — that's what we would like. Our goal is road-testing the entire node, making sure all the networking and everything for the instance comes up properly. We can do some stuff to get around it and get a build there and do it safely, but we'd prefer for it to be supported and in trunk, basically, so you're not trying to do anything funky. Being able to say "only let admins build to this" would be ideal.

So, the number of nodes in a cell versus the number of control plane nodes we put with them: cells are going to vary. We have some that run around 100 hosts, up to some that are around 600 hosts, or hypervisors; a lot of that depends on which type of hardware they are. For that we run — what is it — cells, scheduler, a couple of databases. We actually don't run all of the OpenStack services on a single node; we run a separate node for the Nova scheduler and separate nodes for the various components. It comes down to, order of magnitude, I think our target right now for most of our cells is database pairs that are about 8-gig instances running in our private cloud, and then the scheduler and cells services run as 4s or 8s as well, and I think that's pretty much all you need for the cell itself. Where I'd say we don't have the math completely right is the — I won't say auto-scaling, but — scaling of the API nodes and the Glance nodes and so on in conjunction. We adjust it over time, but I don't think we have exact math that says when I get to 15 cells I need 12 API nodes.

Yeah, and it's also a function of how the cell is used and what happens over time. One cell may have a bunch of instances that are booted and then stay online forever, while another cell may have a bunch of instances that are booted and deleted, which really adds database constraints, because it keeps all those deleted rows. So in conjunction with all of this we have to come up with a consistent pruning plan to give us some assurance that the database nodes are under similar load and that we can size them consistently, all that stuff. I'll give you a specific example, because we were looking at this after the last talk on database performance: we've just recently pruned our global database in two of our regions, and in both cases, just keeping 90 days of deleted instance information means I have several hundred thousand deleted instances in those databases. It's a good problem to have, right? But it does pose interesting challenges when you start to get to those kinds of things.
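As a hedged illustration of what that pruning pressure looks like, here's a read-only sketch that counts soft-deleted instance rows inside and outside a retention window straight from the Nova database via SQLAlchemy. The connection string is a placeholder, and the actual pruning/archiving should be done with whatever tooling your release supports, not ad-hoc deletes.

```python
from datetime import datetime, timedelta
from sqlalchemy import create_engine, text

# Placeholder DSN -- point at (a replica of) the Nova database.
engine = create_engine("mysql://nova:password@nova-db.example.com/nova")

RETENTION = timedelta(days=90)
cutoff = datetime.utcnow() - RETENTION

with engine.connect() as conn:
    # Nova soft-deletes: rows keep a non-zero `deleted` marker and a
    # `deleted_at` timestamp, so churny cells accumulate them quickly.
    kept = conn.execute(
        text("SELECT COUNT(*) FROM instances "
             "WHERE deleted != 0 AND deleted_at >= :cutoff"),
        {"cutoff": cutoff}).scalar()
    prunable = conn.execute(
        text("SELECT COUNT(*) FROM instances "
             "WHERE deleted != 0 AND deleted_at < :cutoff"),
        {"cutoff": cutoff}).scalar()

print("deleted rows inside the 90-day window: %d" % kept)
print("deleted rows old enough to prune/archive: %d" % prunable)
```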
Not yet — host aggregates are one of those features we get when they finally feature-complete cells. I actually like the idea of host aggregates for a couple of possible reasons: they might allow us to start mixing flavor classes and/or hardware vendor variations within the same cell, and to control live migration and some other really touchy things that want stuff to be pretty exact. The testing pieces may also become a lot easier, because you can start to say, well, these cabs are already in production, these are new, we're adding them to an existing cell, so we're going to throw them in their own host aggregate and route builds based on that. So we don't use them today, but I'm actually pretty excited about the potential for host aggregates, especially in the capacity management space.

They're internal right now — we call them auditor and resolver — and we're exploring how we can open them up. Actioneer, that's another one; it literally just looks at the account services feed coming out of our auth system. Yeah, there's the open source tool Entropy, which mirrors some of the auditor/resolver functionality, and we're looking at a couple of other projects where we might bring some of this code. These are all things we've built in the last six to nine months, except for Actioneer, so we're still at the phase where we're getting it right for ourselves and then understanding which upstream projects they line up most closely with. Anything else? Awesome — well, thank y'all for coming. If you have any questions, you can catch us up here.