Yeah, it's all good. All right, hello, everybody. I am Raina Moser. I am a software development manager at Rackspace in the public cloud shared services group, working primarily with the deploy and release of the public cloud infrastructure services, so Nova, Glance, Neutron, and all of the extra good bits that go into making a cloud work. With me today is Jesse Keating. He is a senior engineer on the deploy and release team. And we are going to take you on a little journey through our last six months of learning to scale the OpenStack cloud at Rackspace, and then also give you a preview into what our current challenges are and what the plan is for the Juno release cycle.

So in case you didn't know, the Rackspace public cloud is made up of six public regions located all over the United States as well as Hong Kong, Sydney, and London. We have three pre-production regions that we use to validate code in our pipeline. We have tens of thousands of nodes, and once you actually add in the physical hypervisors plus all of the virtual instances, it's a whole bunch. I counted them today. I was shocked. And we're growing continually. We are constantly adding new capacity, pulling in new hypervisors, pulling in new cells, all of that stuff.

And we do frequent deployments, not as frequently as we would like, as you will see during the next few slides. But we do deploy very frequently from upstream trunk so that we stay as aligned as possible with upstream OpenStack. We try to do as much development as we can directly in trunk. And we do have some customizations that we've needed to make in order to integrate with billing and integrate with our legacy identity systems and things of that nature.

So this is the third time we have been at an OpenStack summit talking about learning to scale OpenStack. I gave the update in Portland a year ago. Paul Vocio, the Senior Director of Public Cloud Infrastructure, gave it last summit in Hong Kong. And so Jesse and I are here to kind of update you on where we are.

So six months ago, and really even if we go back a year ago to that Portland summit, we couldn't deploy code in a reasonable amount of time. It was taking us over six hours to get code out. The downtime to the control plane, where there was the potential that a customer could make a request and we wouldn't be able to fulfill that request, was well past 30 minutes. And it was really very, very difficult. We did not have a high level of confidence in the code that we were deploying, not because upstream OpenStack didn't do a great job of gating and testing, but because once you pull down upstream code, you're going to do something to it, whether it's your configurations, whether it's the quirks of the hypervisor that you're using, or whether you're actually having to make changes to the code so that it works with billing, so that it works with identity, so that it does some things that maybe aren't part of OpenStack yet. And we could not keep up with upstream. We were trying to pull it every day. We were trying to go as fast as possible and really do that classic CI/CD. And we were failing pretty miserably at it.

And so now I actually am happy to report that we have met at least some of those challenges. It's with great pride that I say that for the last couple of deployments, my team and I haven't had to be awake at 2 a.m. helping to do them. There have been people awake at 2 a.m., because that's the maintenance window we had to use, but it hasn't been my team.
It's been engineers and admins, some of them even third shift, paid to be awake at 2 a.m. And now we can do deployments consistently within an hour, with around 30 minutes of impacting downtime depending on the database migration. If you've been to any of the sessions around Nova, you understand that database migrations are probably the number one pain point we still have. And we can do a deploy to even our largest regions, such as the Chicago data center, in as little as 10 minutes. It depends on the nature of the change that's happening: how much is happening, what services need to restart, what database changes need to be made.

We used to consistently have to restart a deploy, and maybe it would be really fast once it finally went, but we'd have to kind of try again. We'd have several false starts, two or three or four, before we finally got everything in alignment. Now, with our new approaches and new tooling, we're very successful the first time out. There may still be some stragglers or some one-offs, but those are easily noticed and easily caught up.

So I mentioned migrations were an unknown factor. Happily, because of some really amazing work done upstream by Rackspace and other companies, migrations are now actually tested. Anytime the database changes, anytime a commit comes in that's going to have a database change, upstream is actually testing that, seeing how long it's taking, and reporting that in the review gate that is the check and balance for what code is going to come in and what code is going to merge. And if a change is impactful enough, the review teams for projects like Nova in particular, where the data is the largest, are doing an amazing job of pushing back and saying, don't make this change just because you don't like the way the column is named or you wanna change the data type; without a really solid reason to do it, no more hygienic changes. And we've seen a huge reduction in the pain. We know when a problem migration is coming and we can plan accordingly as a result.

Back in Hong Kong, back in Portland, we were about two months behind upstream. We're still about two months behind upstream. So that's really one of the areas where we're still struggling: finding the right cadence and the right rhythm to deliver upstream code smoothly and consistently, especially as we continue to grow and start reaching new scaling challenges at the size that we're at, without completely burning out all of the teams and all of the engineers and developers that are working on this.

So I think this is coming into our transition point. My friend Jesse here is a cyclist. He lives up in the Seattle area, so anytime it's not raining, that's where he is on his bike, except in the summer, because it's always sunny in Seattle in the summer. Don't let them fool you into thinking it's never nice up there. So for the rest of the presentation, I'm gonna turn it over to Jesse and he's gonna take you on a tour of where we are in the OpenStack landscape, how much we're still learning, some of the challenges we've met and solved, and some of the challenges we still have that we're tackling head on. So take it away, Mr. Keating.

All right. So for our new challenges, we're not just dealing with scaling out our deployments and making our deploys go faster. There are a few other things that we wanna highlight as scaling issues when you're dealing with a cloud of our size.
In particular, we're gonna talk about scaling our services, we're gonna talk again about scaling our deployments, and we're also gonna talk about scaling our frequency. We are trying to be the thought leader and the front runner in issues at scale, but in order to do that, we need to collaborate with the greater community, and that's gonna be the key to our success. The developer, operator, and tester communities need to be aware of these scaling issues and work together on solutions.

So our first topic is scaling services. As the size of our cloud grows and the features of our cloud grow, the services that we use need to scale along with them. Here we'll walk through a couple of examples of some scaling issues that we've had and how we've had to overcome them.

So let's talk about Glance. Glance is pretty cool. Glance is how we get images from our storage system into our compute system in order to run a machine, and how we get images from a machine, a snapshot or whatnot, into our storage system. The way that we have it set up, our computes talk to a Glance server, our Glance server talks to a storage server, and Glance acts as a medium to pass the bits back down. Well, we introduced a new feature called scheduled images, which allows our customers to pick times when they want to make images of their machines. It's a pretty cool feature and it got a lot of use, but a lot of use for it meant a lot of use for Glance. The Glance servers became saturated, builds and snapshots slowed down to the point where we couldn't keep up with the number of requests coming in, and we kind of went into a death spiral.

So to resolve this, we had to scale the number of Glance API nodes that we were running. We just had to have more of them. More means better, right? But we also had to scale the size of the Glance API nodes that we were using. The ones we were using weren't exactly the most efficient size for the work we were asking them to do; they worked great up until we hit this point. So now we've adjusted them and made them better. We're also scaling the use of a Glance bypass feature. There isn't really a strong reason why a Nova compute has to talk to Glance, which talks to the storage system. Why can't the Nova compute talk directly to the storage system? That's a feature that we're introducing, and that's going to make the Glance API less of a bottleneck and get it out of the way of that data path. Now that the Glance issues are mostly resolved, we do need to be sensitive that Swift, our storage system for images, is potentially gonna become the next bottleneck. We can't just kick the performance issues down the road and say that's your problem. It's something we're gonna have to grow and monitor and watch.
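To make the two data paths concrete, here is a minimal sketch assuming the standard python-glanceclient and python-swiftclient libraries; the endpoints, credentials, and container layout are made up, and this is an illustration of where the image bytes flow, not Rackspace's actual bypass implementation.

```python
# Illustrative sketch of the two image data paths discussed above.
# Endpoints, credentials, and the direct-from-store shortcut are
# hypothetical; the real "Glance bypass" feature may work differently.
from glanceclient import Client as GlanceClient
from swiftclient.client import Connection as SwiftConnection


def fetch_via_glance(glance_endpoint, token, image_id, dest_path):
    """Traditional path: compute -> glance-api -> object store.

    Every byte of the image is proxied through the Glance API node,
    so Glance sits in the data path and can become the bottleneck."""
    glance = GlanceClient('2', endpoint=glance_endpoint, token=token)
    with open(dest_path, 'wb') as out:
        for chunk in glance.images.data(image_id):
            out.write(chunk)


def fetch_direct_from_swift(auth_url, user, key, container, object_name,
                            dest_path):
    """Bypass path: compute -> object store directly.

    Glance still owns the image metadata, but the bulk transfer goes
    straight to the storage system, keeping glance-api out of the
    data path."""
    swift = SwiftConnection(authurl=auth_url, user=user, key=key)
    _headers, body = swift.get_object(container, object_name)
    with open(dest_path, 'wb') as out:
        out.write(body)
```

The only point of the sketch is the shape of the traffic: in the first path every image byte rides through a Glance API node, in the second it doesn't.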
Let's talk about Nova Cells. Nova Cells is pretty cool also. Cells is a feature that Rackspace uses to scale our cloud. In each of our regions we have a global cell that is just the top-level control plane. All of our compute capacity exists in individual cells, which are smaller subsets of the hypervisors that exist in a region. Each cell has a layer of control on top of it, and then all of the cells together talk up to the global cell as a coordination point. So it allows us to have smaller sets of hypervisors be controlled by smaller sets of control nodes. But then we introduced performance cells, and performance cells are much less dense cells, so we added more and more and more cells.

More cells means more traffic. Bigger is better, right? Maybe. What happened was we only had a single Nova Cells service running at the top layer. So all the individual cells were talking to one Nova Cells service, and that Nova Cells service was relaying messages down and up and down and up, and when you tossed a whole lot more at it, it just couldn't keep up. We knew from the get-go that Nova Cells was gonna be a problem; we just didn't expect to run into the problem quite so fast, quite so hard. So eventually what happened was our single Nova Cells service could not consume messages faster than they were being produced, and again it's a death spiral. Same sort of story.

The fix was to scale out the number of Nova Cells services. But we couldn't just do that, we couldn't say, hey, let's have eight instead of two or eight instead of one, without collaborating with upstream and fixing a lot of the issues that came up from having multiple of these services trying to do the work at the global cell. We also had to optimize instance healing calls. We had to look at what work our Nova Cells service was doing and what messages were coming through. Some of those messages were instance healing calls, trying to make sure that the data the global side has about an instance matches what's down below, and what's down below matches what's up above. There were a lot of calls going back and forth, and a lot of not very optimized calls going back and forth. With collaboration with the developer side we got a lot of those optimized, and now they're doing things smartly and it's reduced the number of calls that we're making. We also had to make adjustments to how Nova Cells was addressing our database. That reduced the load on the cells service and on the database service and allowed them both to work more optimally.

These challenges will repeat. Those were just two examples of pretty similar problems, but if you run OpenStack you know that there are tons of services. There are tons of potential for these types of things to happen. The challenge is finding these ahead of time; staying ahead of the pain is the key. We will not be the only ones to experience this, but we may be the first to experience it. What we are looking for is collaboration on how best to manage that kind of scale: how best to monitor our usage, how best to adjust our ratios of one service to another, our ratios from one service to hypervisors. How can we set things up so that when we scale, we scale all the right bits along with it? That's gonna be a collaborative effort.
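To illustrate why a single consumer at the global cell death-spirals, here is a toy back-of-the-envelope model; the rates are made-up numbers purely for illustration, not measurements from our cloud.

```python
# Toy model of the nova-cells bottleneck described above: if messages
# arrive faster than the single relaying service can consume them, the
# backlog only ever grows; adding workers brings the consume rate back
# above the produce rate. All numbers are invented for illustration.

def backlog_over_time(produce_rate, consume_rate_per_worker, workers, minutes):
    """Return the message backlog at the end of each minute.

    produce_rate            -- messages/minute arriving from all child cells
    consume_rate_per_worker -- messages/minute one cells service can relay
    workers                 -- number of cells services at the global level
    """
    backlog = 0
    history = []
    for _ in range(minutes):
        backlog += produce_rate
        backlog = max(0, backlog - consume_rate_per_worker * workers)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # One worker: arrivals outpace consumption, so the queue depth climbs
    # without bound -- the "death spiral".
    print(backlog_over_time(1200, 1000, workers=1, minutes=5))
    # -> [200, 400, 600, 800, 1000]

    # Four workers: consumption outpaces production and the queue stays empty.
    print(backlog_over_time(1200, 1000, workers=4, minutes=5))
    # -> [0, 0, 0, 0, 0]
```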
So our next scale challenge is about deployments. We made really, really good strides in Havana, and what have we been doing since then? Our theme after Havana, our theme since Portland, has really been about a higher form of orchestration: taking a look at exactly what we're doing, when and how, and making it the right path. So we continue to iterate on the parts of the deployment that are hard, the parts that are slow, and the parts that are painful, and we make it better each time.

One of the key things we did was deciding that trying to put all the content on all the machines at the same time we were deploying was not a great idea. So we split our deployments into two stages: we can pre-stage all the content, and then we can execute that content inside of our outage window. That took redesigning a little bit of how we package our software, how we do our configuration, and how we do our deployments, but now we have it completely split out. We also had to increase our tolerance for downed hosts. When you're dealing with 10,000 or 20,000 machines, the likelihood that one of them is not gonna respond is pretty high. The likelihood that 10 of them are not gonna respond is pretty high. So we have to have software that's tolerant of this fact and can recognize, yep, that machine went down, I'm not gonna try to talk to it anymore, instead of coming to a full stop and not knowing what to do. We're also adjusting the way that we bring up our services, so that we can bring up our API nodes first, rather than in the mix of all 10,000, and get our customer experience back to the place we want it to be.

We've introduced new deployment options: being able to deploy just facts that drive a little bit of the configuration of a machine instead of a whole new set of code; delivering new code or facts directly into a specific cell; and doing deployments where we don't even attempt a migration. Save some steps, save some time, make it faster. We've also taken a look at the complexity of our deployments and tried to reduce that down. We no longer have multiple ways of doing a deploy. We have a single entry point. No matter what type of deploy you're doing, no matter where you're doing it, single entry point. And we have a single orchestration system driven by Ansible that does all the things that we need to do during our deployment.

All that said, we're still treating OpenStack like a legacy software deployment. We have a lot of pets and not a lot of cattle. We are still doing in-place upgrades of machines. We still have hard-coded IPs floating around in multiple machines. We're not taking really good advantage of our load balancers. We're not doing a lot of the things that we should be doing as if it were a cloud application, even though we are running it on a cloud. There are a lot of barriers to getting there and that's gonna take collaboration. That's what we've been talking about in our design sessions, summits, and meetings. It's what we've been talking about in these sessions and it's what we're gonna continue talking about in the next iteration cycle.
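To give a feel for the shape of that two-stage, failure-tolerant flow, here is a minimal sketch; the host groups and the `pre_stage()` and `activate()` helpers are hypothetical stand-ins, not our actual Ansible playbooks.

```python
# Minimal sketch of a two-stage deploy that tolerates downed hosts and
# brings API nodes back first. The group names and the pre_stage() /
# activate() callables are hypothetical placeholders.

def run_stage(hosts, action, failed):
    """Run one action against every host, tolerating individual failures.

    A host that errors out is recorded and skipped for the rest of the
    deploy instead of halting the whole run -- with tens of thousands of
    machines, a few will always be unreachable."""
    for host in hosts:
        if host in failed:
            continue
        try:
            action(host)
        except Exception as exc:  # sketch only: swallow and record
            print(f"skipping {host}: {exc}")
            failed.add(host)


def deploy(host_groups, pre_stage, activate):
    failed = set()

    # Stage 1: push content everywhere ahead of the outage window.
    for group in host_groups:
        run_stage(host_groups[group], pre_stage, failed)

    # Stage 2: activate inside the outage window, API nodes first so the
    # customer-facing experience comes back as early as possible.
    for group in ("api", "control", "compute"):
        run_stage(host_groups.get(group, []), activate, failed)

    return failed  # stragglers to catch up later
```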
Finally, we're gonna talk about scaling frequency: doing things just much, much more often. This is another great bike quote: it never gets easier, you just go faster. That's true on a bicycle, but in the DevOps world, doing things faster and more often usually makes them easier, especially if it hurts, because you find ways to make it not hurt.

So, scaling change. We have new features coming in all the time. The number of projects getting added into the OpenStack ecosystem grows really fast, and the number of features being added to those grows really fast. If you were in here last time, the slide showing how many commits get thrown at Gerrit every day and how many changes get put into the code every day is just pretty astounding. We have new configurations coming, new ways of shaping our hardware, new ways of doing our networks, new ways of delivering things to our customers, all coming at us all the time. We have to accommodate all of those without interrupting our customer experience. We have to change faster, we have to change frequently, and we have to change on an ever-growing fleet of machines.

And the way that we are tackling this is by understanding change before it happens. We have to have much more clarity into what's going on with our code and much more clarity into what's changing in our code, and we have to understand that before we try to make that change in our environment. We have to schedule the types of changes so that we don't have conflicts in what's going on. Again, that's vision into what's changing and what's going to come in. We have the flexibility to dedicate release iterations that put risky changes on top of code that's already known good. That way we're changing one bit, not three. Otherwise, if we have a failure, we don't know if it's because of new code or because of this risky new change. If we're just doing the risky change on top of known good code, then we have confidence that any sort of problem is going to be that risky change, not something that's changing out from under us. But in order to do that, we have to be able to deploy very, very fast, so that we don't hold everything back trying to get this risky change out. And then we're going to have custom deploy modes based on the change type. On the last slide I showed how we can do fact-only and cell-only deploys. We're growing those into even more, so that we can do deploy types just to computes, just to control planes, just to certain types of service.

Improving all of these, adding multiple release pipelines, improving our tests, moving our tests upstream, and having greater vision into what's going on helps us to go fast, but we're still not fast enough. And this is our limit: the customer experience is the absolute limit on how fast we can go. If we interrupt that customer experience, we can't do it more often. But if we can find ways to make our changes without disrupting the customer experience, that opens the gateway for us to change as much as possible and as frequently as possible, and to reduce the delta of what we're changing every single time. In order to get this, because every user of OpenStack needs this, we have to collaborate. We have to engage the community of developers and operators and testers, understand these types of issues, work together, and make that our focus.

So that's enough about what we've done. How about what we're going to do in this next iteration, what we're gonna do in Juno? So, zero perceived downtime. In Icehouse, the Nova project made some really great strides towards live upgrades. They've introduced the object model for referring to data, which frees the services from having to care about how it's actually stored in the database. They've also introduced conductor, which shields Nova computes from changes in the database as well. With the object model and with conductor, we're able to run intermixed versions as well. We can upgrade our Nova APIs to a newer version while leaving the data plane behind them, and the computes behind them, talking an old version. So we can use our load balancer to do a rolling update of our API nodes, get them talking the new stuff, and they keep functioning, and the users never see a downed API. Once those are updated, we can go back and update all the things behind them in a safer, more practiced, more careful way.
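Here is a sketch of that rolling-update idea, assuming hypothetical `disable_in_lb()`, `upgrade_packages()`, `health_check()`, and `enable_in_lb()` helpers; the object model and conductor are what make the mixed-version window safe, the loop itself is just drain, upgrade, verify, re-enable.

```python
# Sketch of a zero-perceived-downtime rolling update of the API tier.
# The four helper callables are hypothetical; the point is the ordering,
# which relies on Nova's object model and conductor to keep old and new
# versions interoperable while the fleet is mixed.

import time


def rolling_api_upgrade(api_nodes, disable_in_lb, upgrade_packages,
                        health_check, enable_in_lb, settle_seconds=30):
    for node in api_nodes:
        # Take one node out of rotation; the rest keep serving requests,
        # so customers never see a downed API.
        disable_in_lb(node)
        upgrade_packages(node)
        # Don't put it back until it answers health checks on the new code.
        while not health_check(node):
            time.sleep(settle_seconds)
        enable_in_lb(node)
    # Conductors, computes, and the rest of the data plane are upgraded
    # afterwards, at a more careful pace, once the APIs are done.
```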
We're also going to be investigating read-only states for the API: looking at ways that we can take the database down into a locked mode so we can do some data migrations, but still service requests that are coming in from our customers. Either requests that are going to have to sit and wait for the database to come back up, and that's okay, we give them a transaction ID to look at, or requests that are going to be satisfied from a read-only cache of data that's flagged in a way that says, this is cached, if you want something real, try again in a little bit.

And we're also talking about individual service deployment pipelines. Can we update Glance outside of the same way that we update Nova? Do they have to be atomic, or can they really be disparate? Is there a way that we can do all of the validation of one project in a different way, at a different rate, than all the validation of a different project? Can we give Glance its own pipeline? Can we give all of these their own pipelines and deployment capabilities? Does that make anything better? But how do we combat the exponential growth of service version combinations? If we've got our Nova on one version, our Glance on a different version, and our cells on a third version, how do we know that all of those work together, and which parts do we move and test? Can we get those validated in a reasonable amount of time? And if we go down this path, does it actually make the whole pipeline move any faster? These are things that we have to explore.

And fully automated environments. One of our biggest impediments to getting change in place is having an environment to test it in, and getting environments to test in is hard because creating environments is hard. There's DevStack, but DevStack doesn't really get you anywhere other than a single machine with a bunch of services running on it. You can't upgrade it, you can't move it forward, and DevStack isn't our environment; we're not doing that. We have to replicate our environment and test things in our way of doing things in order to validate them. But creating an OpenStack environment is pretty darn difficult. One way we can accomplish this is through automation, but even automation is difficult because of all the different ways and pieces and things that have to work together. So there's been some really good work on creating standard ways of doing operations with OpenStack, and there's gonna be continued work on that. Setting environments up is hard; working together, we're gonna make it easier. And again, it's a developer and operator collaboration to make things better.

So we do a lot of things that are hard, but if it wasn't hard, it wouldn't be satisfying. That's what keeps us coming back, because these are hard but fun and challenging problems. Scaling is more than just tossing code on nodes; it's a lot more than just tossing code on nodes fast. It's looking at your entire operation and feeling out how all the different pieces move and grow and react to each other when you make changes. And the last thing is that the developer, operator, and tester communities have to work together. They're the three legs of the stool: without one or the other, the stool is gonna fall over and you're not gonna have a very good day. So we have to collaborate on where the painful parts are, particularly at scale, and work together on those solutions.

Great job. So we're gonna open it up for questions now. Please do use the mic if you can. Also... oh, excuse me.
We'll be around after the session. There's all kinds of parties happening. I'm like, who schedules the last session over the parties? Tomorrow at 1:30, I believe back in this room, I'm not exactly sure off the top of my head, but we're doing a continuous integration conversation and kind of a look at what we're doing to take from the upstream gate and actually get it all the way into our production environment and what that looks like. So I do invite you to join us for that tomorrow at 1:30, and now we'll open it up for any questions. And even though I'm talking, Jesse'll probably be the one to answer.

You'd mentioned that you're currently two months behind trunk. Can you give us any detail on what that delay is? Is that QA, your integrations, just manpower?

It's kind of a mixture of problems. Some of it is that it does take us time to get things validated downstream. We do have right now a large pile of changes that we make downstream that don't exist upstream yet, and so we're not able to take advantage of the continuous integration that happens upstream. We have to work on that downstream, and we do discover problems, we do discover issues. So that's one part of it. Another part of it is that there are external forces that are preventing us from going forward. There have been some pretty big changes that we've had to roll through, like Neutron, that require a whole lot of care, feeding, and coaxing to have happen, and we can't be making a bunch of other changes right now while we're going through that Neutron change. So some of those changes will take a lot longer to get done; it's not as simple as just taking a new code bit, validating it, and pushing it out. And we've also been hindered a little bit by how painful a deployment is. A deployment takes a lot of effort from a lot of people and it does disrupt our customer experience. The more we disrupt our customer experience, the less likely customers are to come use us as a cloud. And that means that we don't have a lot of stuff coming in to fund us to do the deployments that... Pesky economics. Yeah, it's an economics thing. So we have a limited number of times that we can do these deployments. Add that all up and what you end up having happen is a little bit of a snowball effect. It takes a little while to get something out, so a larger bit of change piles up. You try to consume that larger pile, and it takes longer to validate and longer to get it out. And because it took you longer, even more piled up. It's about finding a way to break that cycle, to get things into smaller, more manageable chunks and get them out without disrupting your customers, so we can do it more often.

Over here. The Glance bypass thing, do you guys have a blueprint or something that we can reference to start taking a look at that as well?

If you'll see me afterwards, and anybody else that's interested, I will help you get in touch with our Glance dev manager and find out what we've done and where that is. Because honestly, I don't know the answer right now. I'm assuming yes; I'm almost certain that there is something out there public to talk about it.

Yes, sir. I had a two-part question. What message queuing technology do you use, and have you run into any particular gotchas trying to scale that to your level?

So, I believe we're using Rabbit for our message queuing, and I don't know that we've really run into any scaling issues with that yet. That's one of the things our work with cells helps with: the Rabbit server lives down at the cell level.
So at the cell level, our Rabbit server is dealing with no more than, say, 600 compute nodes, and then the top-level Rabbit server is only dealing with the global stuff that's gonna go down to the cells and come back up from the cells to the global. By doing that we kind of scale it horizontally, and I think we're just using a single Rabbit server in each cell and it's able to keep up pretty well.

Even at the top level, or is it all just a single server? No, it's one at the top level and one in each cell. Yeah, that's what I meant, but for each one, you're not clustering in each of those? No. Oh, okay.

Yeah, that is one of the advantages of cells, and I know that there's conversation in the Nova design sessions about making cells the standard deployment model. So even if you're only going to have a 200-node cloud, you still just deploy one cell, and then if you ever do need to scale out, you've already got that infrastructure there to keep you from running into those queuing bottlenecks.

Based on your experience with all the upgrades and the deployments, I'm not sure if there's something already in the community, but what is your idea of having some kind of comprehensive orchestrated upgrade service that's built into OpenStack? So... kind of like packages and...

There's been a lot of discussion this week on upgrades and how we want to tackle them, and there are some differing opinions. There are opinions that an upgrade should be a carefully crafted thing that you spend a lot of time on to make sure you get all the parts to move into the right place at the right time. Then there are people who say, no, software should take care of that; you should be able to change all the things all at once and everything will sort itself out, and if it doesn't sort itself out, then we're writing bad software. I think we're somewhere in the middle of that. I don't think we're at the point where we can kill everything and let the computer sort it out. I think we do have to craft some of how we're doing it, some of where we're shielding our users from having impact. And I think as part of the growing operators community around OpenStack, one of our contributions is going to be our findings on the best way we've found to orchestrate an upgrade through a live cloud. And we're also collaborating with our private cloud area. So even though we're focused on the public cloud, a lot of the same problems exist on the private cloud side, which is much more pure OpenStack, I guess, if you can say it. So getting those two in alignment will help upstream as well. We are looking at that; we may not necessarily have it all finished in the Juno cycle, but certainly by the end of K.

So we had a Rackspace private cloud deployment, and what we were told is that in the new release, there are plans that if we want to upgrade to the next OpenStack release, instead of reinstalling the software, it'll automatically be done through the solution. Yeah. That would be nice. Maybe then come do it for us.

Logan, you mentioned Neutron. So are you guys running Neutron in your public cloud? Yes. Yes and no. Okay, it was my understanding that Neutron and cells didn't work well together. Is that the case, or? I certainly hope not, because we're doing it right now. Andy, actually, Andy, if you wanna come up to the mic and answer that. So tomorrow at three, there's a session on this? Okay, Neutron at scale, tomorrow at three, awesome. Thank you for making it so we don't have to answer that question.
Because you really need like an hour for that question.

Slightly off topic: what mode do you run Ansible in at your node count? Is it SSH push still, or do you...?

We do SSH push. We are using SSH pipelining. That allows us, if I do my math right, to do a single task across 6,000 machines in five minutes. So every task that we add on a 6,000-machine environment is an extra five minutes. And Jesse did a lot of contribution to the Ansible project, including some Rackspace Cloud modules, as well as to the core, getting Ansible to work better at scale. So get his card if you have Ansible questions.

All right, I think we're about done. Thank you all so very much for your time, and we will see you around the summit.