So my name is Paul Voccio. I work at Rackspace; I'm the director of engineering for our public cloud. I'm going to talk to you about some of the challenges and scaling problems we have in our public cloud, how we've addressed them, where we've been, and where we want to go.

This is a little haiku that I like to remind all our new engineers of when they start: it doesn't really matter how good your code is if we can't get it to production. That's the key thing we try to make everyone understand — it's great that you write really awesome code, but we need to get it out to production. So we're going to talk about how we deploy at scale at Rackspace, where we've been over the past two years with OpenStack, and where we want to take it in the next 18 to 24 months.

Just a little reminder from what we talked about at the Portland Summit: we launched our open cloud in August 2012. The thing I want everyone to realize is that we had actually been running OpenStack in production for the year before that public launch, deploying code that was roughly two weeks out of trunk, and we were pretty good at doing that. What this graph shows is that over time our releases got slower as our installation got bigger. That was due to a number of factors, which we're going to talk about, but a lot of it was just that the scale of our deployment got so big that the mechanisms we had written — which worked in the beginning when we were really small — just didn't scale out. Once we got to around January 2012, our environment had gotten so big that our deploys were taking up to six hours. Our engineers were getting very, very tired of this and realized we had to rewrite the process from the ground up. And that's what we did.

A little background on our public cloud: we actually run two clouds at Rackspace. One is the public cloud that you interact with on the API. The other is a large internal cloud that we call Inova. It's based on OpenStack, and it provides all the services we offer publicly, but on internal APIs in our internal IP space. The reason we did this is that we realized that if we were going to offer our customers an OpenStack solution, we really needed to run it first to make sure it scaled. This was before OpenStack had really taken off and before it was, quote-unquote, mature, and we knew we had to make sure it was up to our standards. So we built it internally and ran it for many, many months before we turned it up externally. The conversations we had with our engineers when they realized that things really didn't work were our motivation to go fix it, because we had to use it and we felt the same pain our customers had running it. As we encountered these problems, we logged the bugs upstream, fixed them, redeployed, and kept running that cycle.

By running everything on an internal cloud, we get some of the same benefits and challenges our customers have running an internal cloud. When we spin up our API nodes, we're actually calling an OpenStack API to create a VM and then installing the OpenStack code on it. And that's what I'm about to show you here. This part, I'm sure you guys are very familiar with.
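To make that "use the cloud to build the cloud" idea concrete, here is a minimal sketch of booting a control-plane VM against an internal Compute API using python-novaclient. The endpoint, credentials, image and flavor names, and the post-boot install step are all illustrative assumptions, not Rackspace's actual tooling.

```python
# Sketch: ask the internal (Inova) cloud for a VM that will become a
# public-cloud control node. Names and credentials here are assumptions.
import time
from novaclient import client as nova_client

nova = nova_client.Client("2",                       # Compute API version
                          "deploy-svc",              # assumed service user
                          "s3cret",                  # assumed password
                          "inova-project",           # assumed tenant/project
                          "https://inova.internal/v2.0")  # assumed auth URL

image = nova.images.find(name="golden-master")       # the versioned base image
flavor = nova.flavors.find(name="8GB-control")       # assumed control-plane flavor

server = nova.servers.create(name="api-node-01", image=image, flavor=flavor)

# Wait for the VM to become ACTIVE before handing it to the deploy tooling
# that lays down the OpenStack release and its config.
while nova.servers.get(server.id).status not in ("ACTIVE", "ERROR"):
    time.sleep(10)
print("api-node-01 is", nova.servers.get(server.id).status)
```

From there, the same deploy tooling used for the rest of the fleet installs the OpenStack services onto that VM.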
So we take all our OpenStack services and bake them onto a kind of golden master image. We then take that master, store it in Swift to keep a canonical, versioned record of it, and deploy it onto a hypervisor. We use XenServer at Rackspace — we run a couple of different versions. We take that XenServer and blow it out wider: a handful of hypervisors brought up as a very, very small OpenStack cluster. This is what we call our seed environment, our small Inova starting point. Then we take that same image, deploy it across a bunch more hypervisors as computes, wire it all together with some scripts, and that becomes the internal capacity for our Inova cluster. This part you're very familiar with.

That Inova capacity then becomes our internal API endpoint. What we do next is create more VMs: using our internal scripts and deployment tools, we build out the production environments — Nova, Glance, Neutron, and a bunch of other supporting tools we use to run our cloud — and then do the same thing, pointing those at hypervisors to build our compute. What we end up with is two clouds, kind of inception-style: our production capacity online, backed by our internal cloud.

So let's talk about our "agile" deploy process. The reason I put that in quotes is that we started out building Debian packages. That was very easy for us — we're all Linux systems engineers, we knew how to do this, and we'd been doing it for a long time. We took all our packaging, our scripts, our OpenStack code, worked with the community, used some of their packaging tools, and pushed it out using Freight and Apache. This worked really well in the beginning. We were able to iterate very quickly and modify our scripts; we use Puppet with Puppet masters. And, like everyone else, Bash and SSH will take you very far, very quickly. It's very iterative: our engineers would check out the code, rewrite it, push it back, and Bash and SSH were very quick to modify.

We'd run this behind a load balancer, and one of the ways we could do upgrades is simple A/B node rotation. This is normal website stuff — everyone should be familiar with it, and website stacks handle it very well. But this was a problem for us: we knew we couldn't deploy our computes this way because of long-running operations. We do resizes, migrations, host maintenance — anything that is transferring a lot of data to or from another host — and you can't just rip that node out and reinstall it. So this kind of process didn't work well for us. Another problem, which I'll talk about later, is database migrations: when you have a long-running migration, you can't just swap out your API node if it expects the database schema to be in a certain state.

Load balancing plus Puppet, when we scaled out — like on my first slide, which showed our deploys slowing down as our capacity went up — didn't help us very much. The reason is that our load balancers have dedicated pipes coming into them, and the more nodes pulling traffic through them, the more we saturated some of those network links. We tried upgrading them and adding more load balancer nodes, but again, that only gets you so far.
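On the A/B rotation just mentioned: for the stateless API tier, the order of operations looks roughly like the sketch below. The helpers (lb_remove, upgrade_node, health_check, lb_add) are hypothetical stand-ins for whatever load balancer API and deploy tooling is in use, not our actual scripts.

```python
# Sketch of A/B node rotation for stateless API nodes behind a load balancer.
# All helper functions are hypothetical; the point is the order of operations.
def rolling_upgrade(api_nodes, release):
    for node in api_nodes:
        lb_remove(node)                  # drain: stop sending new requests here
        upgrade_node(node, release)      # install packages / flip the release
        if not health_check(node):
            # leave the node out of rotation for a human to look at, and stop
            # the rollout rather than upgrading the rest of the fleet
            raise RuntimeError("upgrade failed on %s" % node)
        lb_add(node)                     # healthy again: re-enable and move on
```

As the talk points out, this works for stateless API nodes but falls apart for computes with long-running operations, and for releases that carry schema migrations.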
Because of the way we do our upgrades, we need to do them in as short a window as possible, and those upgrades weren't fitting in our time windows. When we talk about things taking six hours to run, what would happen is a couple of nodes would fail, we'd keep retrying, and that's what stretched things out. So the timing became very, very critical for us; we wanted to get it out as soon as we could.

Let's talk about breakage. When you're running 10,000 nodes or more in a region, 0.1% of them being down is something you're going to experience. On a 100-node cluster, you're going to know if one node is out of rotation, and you can manage that. But because we have such a large fleet, it becomes difficult to manage anything being down. If a tiny fraction of your environment is down when you do your deploy, those nodes don't get the update, and you have to have a way to deal with that when they come back online. This is one of the things we've spent a lot of time and effort figuring out: scripts for auto-healing, and inventory management that knows when a node is offline and what to do when it comes back, are things we've paid a lot of attention to.

On to the deploy mechanism. A lot of people should be familiar with this kind of strategy: you write your code, build your packages, deploy, and verify. This is what we use our Inova environment for, along with a lot of our testing and automation scripts. We build our packages, push them out to a dev cluster, and run our automated suite against it. If that looks good, we promote it to our continuous integration environment, which is where other groups at Rackspace integrate their systems with us. We let it bake in that environment for about two weeks before we push to production. That's where we stress test it, run nightly builds, have other groups run their tests against it, and look for corner cases. This is normal; this is what everyone should be doing.

But we take a different kind of approach at Rackspace in how our developers develop: we don't develop internally and then push our code out. We actually develop in the open. When we're developing a new feature, we commit upstream first, go through the whole OpenStack process, pull that back into our environment, build it, and then push it. I tried to illustrate that here: if we start on the internal side with the coding, we actually go to the external merge proposal first and then go through the tests, the approvals, and the merge. The problem on the external side is that it takes an indeterminate amount of time. It could be a few hours for a small patch, a few weeks for a larger patch, and on some of the larger blueprints it could be months before we see something land.

This is a new conversation we have when other product groups want to come and deploy something on Nova. It's a little bit iffy, because you can't just engage an engineer and then expect the engineering team alone to go out into the community. The product owners inside Rackspace really need to be part of the community too, and I think we've done a really good job of teaching people that you don't just drop a Redmine ticket or a blueprint on an engineer and expect them to be the advocate for it. It really takes having people integrated into the community from the product side, the QE side, and the engineering side.
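Coming back to the breakage and auto-healing point from a moment ago, here is a rough sketch of the reconcile-on-boot idea: when a host that missed a deploy comes back, compare what it is running against what the inventory says the fleet should be on. The inventory and deploy helpers are assumed names, not our actual systems.

```python
# Sketch: a host that was down during one or more deploys gets brought current
# before it rejoins the cluster. All helper functions are hypothetical.
def reconcile_on_boot(host):
    desired = inventory_desired_release(host)    # e.g. what the fleet is on now
    actual = running_release(host)               # read from the deployed tree
    if actual != desired:
        # the host missed an update while offline; bring it current before it
        # starts taking work again
        targeted_deploy(host, desired)
    mark_in_service(host)                        # tell the scheduler / LB it is usable
```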
So our process, again, is code, then merge proposal, test, merge, and then we package it. We pull in from upstream at least once a day, sometimes multiple times a day, build those packages, deploy them into our CI and staging environments, and then go through our verification process. Again, this can sometimes be a few hours if we're lucky and we know who minus-one'd a particular patch, so we can walk through it with them, fix it, and push it — or it can be a few weeks, or sometimes even longer, depending on what the feature is.

I said earlier that we had Debian packages and used Freight and Apache to push them out, and we realized after some time that this was getting difficult for us and we wanted another way to do it. So we sat down with our engineering team — at the summit about a year ago, I think — and decided how we really wanted to do this. One of the approaches we talked about was Python virtual environments coupled with our Puppet configs and our source code. We ended up with essentially a tarball that has everything you would need to run that environment. And sure, Apache worked, but in some internal testing BitTorrent actually ran a lot better. So we ended up distributing these with BitTorrent, and the way we actually run the scripts and execute them is with mCollective and Ansible.

The reason we did the packaging that particular way is that we found it was OS independent — OS in the Linux-distribution sense. It was also easier on our developers: they could check out a package that had the source code, the libraries for that particular release, and the configuration that went with that particular node. Say our API nodes: all the libraries that go with them, the source code, and the configuration for that particular API node get pushed to that box. If there's a problem with a release, we know exactly what the config was; there's no confusion. When Puppet and configs are coupled independently of your source code, the Puppet revs can change while your source code doesn't, and then it becomes an investigation to figure out which source rev you were at versus which configuration. By coupling them together, there was never really any question: this release is the problem, and either this code or this config was the cause. It helped our debugging a lot.

And here's the other thing we solved with BitTorrent. Doing this with Debian packages, Freight, and Apache was very easy with 100 nodes, and it scaled up to some degree — a couple of thousand nodes. But when we started having tens of thousands of nodes in a region, it became a problem. Like I said, going back and doing some actual benchmarking with Apache and with BitTorrent, we were able to prove that BitTorrent was faster and much more efficient for us. I'm going to show you some numbers on that in a second.
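Before the numbers, here is a rough sketch of what building one of those per-role release tarballs could look like: a virtualenv with the release's libraries, the service source, and the config for that role, bundled into one versioned artifact. Paths and build steps are illustrative assumptions, not the actual Rackspace scripts.

```python
# Sketch: bundle venv + code + per-role config into one immutable, versioned
# release artifact that can then be seeded via BitTorrent.
import os, subprocess, tarfile

def build_release(role, version, src_dir, config_dir, out_dir="/releases"):
    staging = "/tmp/%s-%s" % (role, version)
    venv = os.path.join(staging, "venv")
    subprocess.check_call(["virtualenv", venv])
    # install the service and its pinned dependencies into the virtualenv
    pip = os.path.join(venv, "bin", "pip")
    subprocess.check_call([pip, "install", "-r",
                           os.path.join(src_dir, "requirements.txt"), src_dir])
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    tarball = os.path.join(out_dir, "%s-%s.tar.gz" % (role, version))
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(venv, arcname="venv")        # interpreter + libraries for this release
        tar.add(config_dir, arcname="etc")   # the exact config for this role
    return tarball
```

The design point is that code, libraries, and config travel together, so "which config went with which source rev" is never a question.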
So this is a test we ran a while ago. We had 200 hosts and a single Apache host serving our packages, with about a 100 MB payload, and that distribution took roughly just under 6,000 seconds from that single Apache process. Doing the same delivery with BitTorrent across those hosts ran just under — I think it was about 1,300 seconds to completion. This was just a one-afternoon test. After we scaled this up, we were able to speed up our distribution dramatically: at one point it was taking about an hour and a half to distribute some of the packages, and we're now able to get it under 15 minutes onto a very, very large number of nodes. The other thing to keep in mind here is your layer 2 domains when you're doing BitTorrent. I don't know if you've worked with it in a legitimate sense, but BitTorrent can consume a lot of connections on firewalls and push a lot of data. So one of the things we had to really think about is how you design your BitTorrent distribution architecture across multiple layer 2 domains. We'll talk about that a little later.

We knew we still needed configuration management. This is, again, why we had to move away from a Puppet-master scenario where the nodes check in and get their configs. In an environment where you're pushing packages and you can do asynchronous upgrades, Puppet is really awesome for this — it's what it's built for. But we had a specific time window we knew we wanted to target for pushing our configuration. Normally you change your Puppet config, check it in, and Puppet periodically checks in and pulls the configs — but we needed to be very deterministic about when that happened. So what we would do is change our Puppet configs, check them in on the Puppet master, and then run a parallel SSH action out to each host to say, all right, check in now — and that's what would come back and slam the Puppet master. The masters would get overloaded, we'd have failures, and we'd have to go back and keep repeating that cycle until it worked. By moving away from that and distributing it, we removed some of that infrastructure from the problem and pushed it down to the hosts, where everything runs at the same time. Every host has the source, we say "upgrade now," and a nice distributed action unpacks the release, moves the symlinks, and restarts the services.

Right now we actually have a combination of mCollective and Ansible. The reason I think we're moving a little more toward the Ansible side is that it's agentless, it's much easier for our teams to write, and we don't have to run Ruby on the hosts. And again, going back to our layer 2 domains, one of the things we've noticed with the versions of mCollective we're using is that some of the mCollective nodes on the hypervisors would lose their connections back to the RabbitMQ queues. As the Rabbit traffic traverses firewalls, one side thinks the connection is open while the other side isn't aware that it's closed, and again we have a failure — another failure we have to go track down, re-push, and redo. Moving to Ansible helps us get around some of that.
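Here is a rough sketch of what that per-host "upgrade now" action amounts to, run in parallel across the fleet by mCollective or Ansible rather than pulled on Puppet's schedule. The paths, service names, and the exact ctorrent invocation are assumptions for illustration.

```python
# Sketch of the per-host upgrade action: fetch the release payload from
# in-cell torrent peers, unpack it, flip a symlink, restart services.
import os, subprocess

RELEASES = "/opt/openstack/releases"         # assumed layout
CURRENT = "/opt/openstack/current"           # services run out of this symlink

def apply_release(torrent_file, tarball, version, services=("nova-api",)):
    # pull the payload from in-cell peers; "-e 0" (exit once complete rather
    # than keep seeding) is the intent here, exact ctorrent flags may differ
    subprocess.check_call(["ctorrent", "-e", "0", torrent_file])
    target = os.path.join(RELEASES, version)
    os.makedirs(target)
    subprocess.check_call(["tar", "-xzf", tarball, "-C", target])
    tmp = CURRENT + ".new"                   # build the new link, then rename:
    os.symlink(target, tmp)                  # renaming over the old symlink is
    os.rename(tmp, CURRENT)                  # atomic, so there is no half state
    for svc in services:
        subprocess.check_call(["service", svc, "restart"])
```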
On staging the packages: you can stage packages with Debian and Freight, push the command out, and the hosts kind of suck them down, but that doesn't get us out of the problem of having to move a lot of data in a small amount of time. By doing it through BitTorrent, we only have to push a very small amount of data through our firewalls, and inside the clusters it's all distributed among the nodes, as BitTorrent does. We actually see decreased network usage when we do this, and we can wait until later in the night — we don't have to start early in the afternoon pushing packages and saturating our network all afternoon. We can distribute right up to the minute as we're getting ready to go. The other thing our configuration lets us do is target a single node or an entire cell. We're able to do a cell at a time, a region at a time, or a single node at a time. And when we talk about a node that has failed, or that's coming up after a re-kick and a reinstall, we can target that single one and say: now you're rejoining the cluster as an API node or a compute node.

Things we still need to do to make it better: DB migrations are still painful. When you have millions of rows in your database — logs, instances, snapshots — every schema change that comes through is something we have to evaluate to figure out whether it's going to work for us or not. One of the things we've done is set up database slaves that pull our data off; we run the migrations there and time them. In our most recent upgrade last week we were looking at about a 30-minute migration — and that 30 minutes means 30 minutes of downtime on our API. For us, that's not really acceptable. We're trying to do everything we can to have a zero-downtime deploy, or at least a couple of seconds where we can just call it a latent API request instead of a 30-minute API outage. I saw that Russell said Nova is moving toward rolling upgrades, which I'm very excited about, so I'm hoping to follow up with him and find out how we incorporate our database migrations into that as well.
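Here is a minimal sketch of that migration rehearsal idea: point nova-manage at a replica (or a restored copy) of the production database, run the schema sync, and time it so you know how long the API outage window would be. The config path and connection details are assumptions for illustration.

```python
# Sketch: rehearse the pending schema migrations against a replica and time
# them. nova-manage reads the DB connection from the config file it is given,
# so this touches the replica, not production. Paths are assumed.
import subprocess, time

def time_migration(replica_conf="/etc/nova/nova-replica.conf"):
    start = time.time()
    subprocess.check_call(["nova-manage", "--config-file", replica_conf,
                           "db", "sync"])
    elapsed = time.time() - start
    print("migration took %.1f seconds" % elapsed)
    return elapsed

if __name__ == "__main__":
    # anything much past a minute or two means rethinking the deploy window
    time_migration()
```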
I know everyone pictures the NASA control room — giant screens and launching rockets — but for us this is a little bit how it actually feels. We have a large team online when we do our deploys. We run all our deploys in IRC; we have a bunch of status screens that everybody's watching. One team is watching for alerts on the notifications, another team is watching our network, another team is watching the transactions as they go through to make sure we're seeing what we want, and a QE team is online, ready to run our smoke tests when we're done. But that feels like overkill. We shouldn't have to have a team of 15 to 20 people up watching migrations and watching an entire environment; what we really want is a small team that does this — it could be a few people. One of the goals we're really pushing for, in OpenStack and within Rackspace, is to make it so anyone on our team can deploy — a third-shift person, a developer, an operations guy, or me if I had to do it. We actually use Jenkins to do a lot of our automation, so it's an automated tool: we can kick it off, record who ran what, and record the output. It's very well suited for this.

So this is kind of our deploy timeline. When we talked about where we were in Portland a little while ago — starting in May, we look at what our deploys were. The gray dots here are major releases and the blue dots are minor follow-up revisions. We'll do a major release, watch the environment, watch the logs, figure out where we want to go and which bugs need to get fixed, and then go through the lifecycle we talked about before. If we see anything, we go work with the community: is this a patch we need to short-circuit? Is it a security issue? Is it breaking something we need to address? If it is, we'll bypass that process, get it fixed, push it out into production, and then work with the community to get it patched upstream. So we do carry a small diff of patches, but for the most part we're running a couple of weeks off trunk. I think as of, what is it, 10/26, we were running the full Havana release. That's something I'm very proud of. Our team spends a lot of time making sure that as we pull in from trunk, it works in our environment and at our scale, and then we feed all of that back into the community.

These are our priority deploys. The thing we don't talk about much is that we're actually able to deploy a lot, very often. It's something we try not to do, but if we have to, we can. Some of these aren't API-impacting releases, but using the tools we talked about — Ansible and mCollective, and sometimes PSSH — we'll do a targeted deploy to a particular cell or a particular node to fix an issue or anything we see in the environment that we need to go take care of.

We still have a ways to go, I think. We build infrastructure as a service, but we try to treat it like an application, and I think those are some of the conflicts we find in the infrastructure world. We use cloud, we build cloud, but it's very hard to deploy a cloud using cloud thinking and processes. What we're trying to figure out is: what are the things that are transactional? What are the long-running processes we can restart, or hand off to another node while they run? If a VM is migrating to another host and it has a terabyte of disk, that's not something that's going to happen instantaneously. If I have to restart that compute node, or restart the scheduler, or restart something else, it needs to know that state — and that's something we've been trying to figure out how to do for a long time.

And again, there's sticky data. Our VMs actually have their disks on our hypervisors; we don't necessarily use volumes. So if the host reboots, that data is stuck to that host. How can we treat these upgrades and rolling deploys knowing the data lives on that particular compute node and isn't something ephemeral that moves around? That's a problem we're trying to solve that may or may not impact other cloud deployments in different ways. I would really love for us to get to a point where we can deploy it like a website — swapping nodes out, A/B nodes, swapping out computes. One of the things we're trying to figure out is whether we need the one-to-one compute-to-hypervisor relationship we have right now.
If we could pull that back and say we have a pool of compute services that can interact with a particular hypervisor, that could be one way to do it. And I know there's some preliminary work going on around that one-compute-to-many-hypervisors model.

Just as a recap: we reduced our deploy time from, sometimes, hours down to minutes, and that was the result of a lot of work our teams have done over the past six months. Most of the time in our deploy is now actually spent testing. We'll get our deploy out in 15 to 20 minutes and then probably spend the next 30 to 45 minutes just doing our smoke tests before we're given the all-clear. But a lot of the work we've done is to streamline this, speed it up, and make the impact on our customers as small as we can. The thing we also try to make sure our teams and our customers — and hopefully the community — know is that deployment tools are part of the product. It's not something you build as an afterthought. A lot of times what we see in the OpenStack community, and in other communities, is that you write a bunch of code and then it's this long, complicated process to get it started. One of the things we try to stress is that the deployment and the tooling around it are actually part of the product: they're part of what you write and what you distribute. It's not something you try to bolt on afterwards or hope the distros wrap an init script around. It's something you should take care of from the beginning.

So that's all I had. If anybody has any questions, I'd be happy to answer them. Yep. So your question is: between releases, what's the delta in the schema changes? From what we see — because we space our releases out roughly, if I go back and look at the bigger slide, one a month, and we're trying to shorten that — it's very rare that a release doesn't have some sort of schema change. Sometimes these are small: they may be index changes, sometimes a column add. But when you have millions of rows in those tables, something very small that should be easy on a 100-node cluster ends up taking 10 or 15 minutes for us. So we'll try to figure out: do we need this change? If we don't particularly need it, we may push it off to a later migration — we'll go ahead and modify the schema files and delay it, or we'll schedule an outage at a later time, or we'll do some optimizations. We'll clean up the table if we don't need all the rows — but our analytics team might, so we have to let downstream systems know: hey, when you consume our slave databases later, we've cut the records off at this point to reduce migration times. So it becomes more of an operational issue, and about letting everyone know how we're having to handle it. I think there are some good ways we could fix this in the community, and we'll be talking about it for a while, but I think as we solve some of the larger problems we have, particularly in Nova, we'll start getting a lot more focus on how we operate these things at scale.
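As an aside on that table cleanup: here is a rough sketch of trimming old rows in small batches before a migration, so the schema change has fewer rows to touch and replication keeps up. The table, columns, cutoff, and connection details are assumptions for illustration — and, as noted above, downstream consumers of the replicas have to be told about the cutoff.

```python
# Sketch: delete soft-deleted rows older than an agreed cutoff in small
# batches ahead of a schema migration. Connection details are placeholders.
import MySQLdb

def trim_old_rows(cutoff="2013-01-01", batch=10000):
    db = MySQLdb.connect(host="db-master", user="cleanup",
                         passwd="...", db="nova")
    cur = db.cursor()
    while True:
        # one small batch at a time so replication and foreground traffic keep up
        cur.execute("DELETE FROM instances "
                    "WHERE deleted != 0 AND deleted_at < %s LIMIT %s",
                    (cutoff, batch))
        db.commit()
        if cur.rowcount < batch:
            break
```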
And did you have a second piece that maybe I didn't answer? So I think what we're also trying to do — and some of these guys, Sandy and others, may know this — is that part of it is on us to be more participatory upstream: to provide sample data or seed data, or to actually test in the 100-million-row range, provide a sanitized database or sample data and run the migrations against that. I don't think we do a lot of that now.

We use CTorrent. It's a project that already exists — it's in the Debian distribution now. Really all we do is use cells, which are our layer 2 domain construct in our public cloud, and we run a torrent seeder per cell. Using the Ansible scripts, we push the torrent file, which is usually very, very small, push the payload to a single node within that cluster, and then distribute the torrent using Ansible and mCollective, and light it up per cell. They don't cross the layer 2 domains, because what we would see is that the layer 2 domains are connected by the routers and the firewalls, the torrents would start crossing traffic, they'd shut down those links, and then our network guys get pretty annoyed at us.

We have network nodes per cell — we're running an older version of Quantum — and our software-defined networking cluster handles that interaction. We run one SDN per region, and the networking nodes call into the SDN for IPs, MACs, and our networking setup.

How do we do our bare metal provisioning and our OS provisioning after that? Working at Rackspace, one of the things I get for free is that we've had bare metal provisioning solved for a long time. Our racks roll in, we plug them in, they light up, they register back into our inventory management system, and then we pull a lot of that data out and configure and update them from there. So we kind of cheat, in that I get this for free, which is why we don't use TripleO or any of the other deployment tools — I already get that.

Sure — so the impact of the torrents on the network comes down to how you lay out your physical network. If you have one large layer 2 domain, you may not have this problem. The problem manifests for us when we have to traverse network appliances, and then it really becomes a function of how many connections your network appliances can hold, based on how big they are. For our particular size and scale, it's a cost factor: we scale our network appliances at a certain size for our normal management traffic, and because we introduced this torrent concept later, after our networks were built, they hadn't quite scaled out. So what we had to do was push the seeding into the cell. It wasn't a big adjustment — it was more like, hey, we tried that and, oh sure, that didn't work, so we had to go fix it and push it a little lower into the stack.
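Here is a rough sketch of that per-cell seeding layout: copy the payload and the tiny .torrent file to one designated seeder per cell, and only then fan out inside the cell, so torrent traffic never crosses the layer 2 / firewall boundary. The helper names and inventory lookups are assumptions, not the actual scripts.

```python
# Sketch: one seeder per cell; only the copy to the seeder crosses the
# firewall, everything else stays inside the cell. Helpers are hypothetical.
def seed_release(tarball, torrent_file, cells):
    for cell in cells:
        seeder = pick_seeder(cell)             # one host per cell does the seeding
        copy_file(seeder, tarball)             # the only transfer that crosses the boundary
        copy_file(seeder, torrent_file)
        start_seeding(seeder, torrent_file)    # e.g. a torrent client on the seeder
    for cell in cells:
        # every other host in the cell pulls from in-cell peers only
        for host in hosts_in_cell(cell):
            fetch_via_torrent(host, torrent_file)
```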
I think there was one over here, maybe — yeah. So we run kind of two versions. We run XenServer as our hypervisor — that's our base — and we do apply all our security patches to our hypervisors; and our control plane right now is Debian, so we run a lot of our updates on our Debian control plane as we see fit. Yeah, that's going to be a little bit of a different process for each. We'll push those individual Debian packages down, and the Xen packages are handled via some of the other mechanisms: they're distributed by torrent depending on the size. If it's a small patch, we can probably push it with a PSSH script or something like that, but if it's a larger update to the hypervisors, we'll use the torrent distribution.

A lot of this — and I think I touched on this a little bit — is partly going to be about your network architecture. For us, our layer 2 domains are really the boundary of where we're able to push a lot of this. We can't really centralize where the packages come from in a single spot per region in a data center; we're just too wide, we have too many nodes. With tens of thousands of boxes in a region, you just can't update them within a small window without breaking up and distributing the problem, which is what we ended up doing. Part of it is you need to know where you're going to end up. Like I said early on, PSSH and Bash will get you pretty far, very quickly, but that will only take you up to a certain level, and that level is going to depend on your payload, your network architecture, and where you want to end up. If you know you're going to be scaling out wider, I would start looking at some of these other distributed tools; but if you're going to be under two or three thousand nodes, you can probably get by with — I'll say the word — more primitive but easier-to-consume tools that don't require a lot of infrastructure on the back side to support them.

How does a cloud controller scale? I'm not sure what you mean — how do we scale the queues and the database and the APIs, is that what you're asking? I mean, most of this is just monitoring the infrastructure and making sure you're not overloading it. We monitor everything we can in our DCs, because if you're not measuring it, you can't really tell what's going on. So we try to measure all the things, and then when we see something getting abnormal, or if we hear about an incident or get alarmed on something, we have all the data to go back and look at. We use a combination of Graphite, Nagios, and a couple of other tools — I think Gabe Westmaas, my counterpart, gave a talk earlier in the week about how we monitor and how we measure. And as we monitor our API nodes, if they start getting hot, then — because we run in that Inova environment I was talking about earlier — we're able to provision a new VM, have it join the cluster, make a call out to the load balancer, and join it to the pool, and we're talking 10 or 15 minutes. That's a lot different from how it used to be, where we had to call DC ops, have them provision a physical server, then install it and configure it — which could be maybe a day if we were putting pressure on folks, or a couple of days if they didn't have the hardware ready. So by having this cloud that we can use, we're able to use cloud infrastructure to deliver cloud infrastructure.

Okay, thank you very much.
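One footnote on that last answer: here is a rough sketch of the metric-driven scale-out idea — watch an API load metric in Graphite and, when it stays hot, boot another API VM on the internal cloud and join it to the load balancer pool. The metric name, threshold, and helper functions are illustrative assumptions; only the Graphite render API call is standard.

```python
# Sketch: check a Graphite metric and, if the API tier is running hot, add a
# node via the same bootstrap path as the rest of the fleet. Helpers are
# hypothetical; the metric name and threshold are assumptions.
import requests

GRAPHITE = "http://graphite.internal/render"   # assumed Graphite endpoint

def api_nodes_running_hot(threshold=0.8):
    # Graphite's render API returns JSON datapoints for the requested target
    resp = requests.get(GRAPHITE, params={
        "target": "averageSeries(api.*.cpu_util)",
        "from": "-15min",
        "format": "json"})
    points = [v for v, _ in resp.json()[0]["datapoints"] if v is not None]
    return bool(points) and (sum(points) / len(points)) > threshold

if api_nodes_running_hot():
    node = boot_api_vm()           # same bootstrap path as the rest of the fleet
    deploy_current_release(node)   # lay down the release tarball and config
    lb_add(node)                   # taking traffic roughly 10-15 minutes later
```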