Hi, I'm Rainier Moser. I'm the software development manager for Rackspace's Compute Deploy infrastructure team. It's a new team that we just started up this month, after learning some really interesting lessons over the last year. I've been with Rackspace for just over a year; before that, I spent 10 years as a government contractor. Everybody's got to earn a living. And I'm here to tell you the story of our production deployment of OpenStack. Since we launched in August of 2012, we've learned a lot of things.

One of my favorite quotes is the one by Teddy Roosevelt called "The Man in the Arena." If you have a chance to look it up, I highly recommend it. It ends with the man "who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly." Yesterday, during the talk Rick Lopez gave about deploying from OpenStack trunk, why we do it and what we gain from it, somebody asked: are you comfortable deploying from trunk? As the director of QE, are you confident? And he said no. And I piped up, "But we do it anyway." Because everybody in this room, everybody in our community, needs somebody to be the first ones, and we are more than willing to be the first ones. And it's really, really hard. I'm here to say that while we have big plans, and we will get there, and we will do it, it is one of the hardest things I have ever seen. And I do believe in magic, because sometimes I think that's the only way we pull things off.

So as we go on, let's talk about what "at scale" means. For Rackspace, at scale means our global cloud, which is multi-country right now, US and UK, with more regions coming in the next year. At the global level, we will have hundreds of thousands of hypervisors. At the regional level, we're going to have tens of thousands of hypervisors: physical infrastructure that we have to manage, plus all of the virtual appliances and nodes on those physical machines. Then we come down into the cell; we use cells to manage our scaling and to make it easier to roll out new capacity when customers run out, and each of those cells has hundreds of hypervisors. So it all rolls up, and for us, that's what at scale means. We're not talking about a hundred or a thousand; we're talking about tens of thousands and hundreds of thousands. That's really the problem we're trying to solve. We're not at that scale yet, but we know that's where we're headed, and we're trying to make sure we have the infrastructure in place so that we can get there and bring everybody else with us.

So, our basic release strategy. It sounds so simple once you lay it out: we take code from OpenStack, we add some of our own modifications to make it work with our internal billing systems and the other things that are unique to our deployment, we package it up, we deploy it everywhere and execute the code, and then we verify that it works. It sounds so simple. It's not.
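Laid out as pseudo-tooling, those steps look roughly like the sketch below. Every name in it is made up for illustration; this is not our real tooling, just the shape of the process.

```python
import subprocess

# Hypothetical sketch of the release steps, not our real tooling: pull
# OpenStack code, layer on internal patches (billing integration and so
# on), package it up, push it out everywhere, then verify it works.

def run(*cmd):
    print("$", " ".join(cmd))
    subprocess.check_call(cmd)

def do_release(trunk_ref, internal_patches):
    run("git", "clone", "https://github.com/openstack/nova.git", "nova")
    run("git", "-C", "nova", "checkout", trunk_ref)    # the trunk commit we bless
    for patch in internal_patches:                     # our internal modifications
        run("git", "-C", "nova", "am", patch)
    run("./build_payload.sh", "nova")                  # package it up
    run("./deploy_payload.sh", "--all-regions")        # deploy and execute everywhere
    run("./verify_release.sh")                         # tests and builds

if __name__ == "__main__":
    do_release("v152", ["patches/internal-billing.patch"])
```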
So here is our first scaling hurdle, which I'm happy to say we actually conquered this year. We started in August 2012. The bar columns on this chart are internal code releases; this is not OpenStack trunk, which is a completely different pattern. Right after we launched, we did a whole bunch of small releases: fine tuning, tweaking configurations. Oh, it always works on my machine, but now there are customers using it, so of course it breaks. We did a lot of those.

Then we got into a really good pattern from September through November, where we were releasing about three times a month, almost every week with a week off, and it took about two hours. We were doing it after hours, which is not optimal, but various things in the architecture required it to minimize customer impact. In October, at the last design summit, we started talking about the need to build a better deploy mechanism, because we knew we were going to outstrip our current process. We outstripped it in about December, which is where on the diagram, where the line is just showing the growth trend of our capacity, it got to the point where it took four to six hours just to push new OpenStack code out. There were various reasons for that, and it wasn't necessarily the OpenStack code itself; it was the deploy architecture we had. We used Debian packages with a Puppet master in each region, and at some point there were too many nodes for the Puppet master to keep up with, and we would have to wait hours for it to drain.

The deploy team said, we're not doing this again. We did it twice in December, it took forever, we're not doing this again. They said they would not do it again, and then in January they did it one more time, because Rackspace is fanatical and that's how fanaticism works in production. It took more than six hours, and it was miserable. I couldn't even stay up for the whole night, and we had guys taking shifts just to verify that it was working. It wasn't that it was broken; we just couldn't verify that it was working. And we finally accepted as a company that we needed to stop doing code releases, focus on the deploy mechanism, and be able to tell a better story. So that's what we did. Up until this point we had been deploying from trunk every two weeks, and we said, we're just going to stop. We're going to stop pulling in new code and making it deployable, and work on the internal process we're sharing with you today.

We switched from Debian packaging to virtual environments. Each project (Nova, Quantum or Networks or whatever they're calling it now) has its own virtual environment, and those get packaged together into a tarball that we distribute to all of the nodes using BitTorrent; we use the libtorrent library currently. We still use PSSH for fact files, the individual passwords and all of the stuff that Puppet needs, because Puppet is still our configuration management system, and we use MCollective for actions, to tell a node "go do this." For execution, we moved from the centralized Puppet master to masterless Puppet running on each individual node. And it worked great. We learned some things along the way; we may have taken the cloud down for like two minutes at one point, but we fixed it. Just learning as we go, understanding scale, and moving on.
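To make that packaging story a little more concrete, here's a minimal sketch of the idea: one virtual environment per project, bundled into a single payload tarball for the seed node to hand out, with masterless Puppet applied locally afterwards. The paths, project list, and commands here are made up for illustration; this is not our actual tooling.

```python
import os
import subprocess
import sys
import tarfile

# Sketch only (made-up paths and names): build one virtual environment per
# project, then bundle them into a single payload tarball that the seed
# node can distribute to every host over BitTorrent.
PROJECTS = ["nova", "glance", "quantum"]
BUILD_DIR = "/tmp/payload"

def build_payload(version):
    os.makedirs(BUILD_DIR, exist_ok=True)
    for project in PROJECTS:
        venv = os.path.join(BUILD_DIR, project)
        subprocess.check_call([sys.executable, "-m", "venv", venv])
        subprocess.check_call([os.path.join(venv, "bin", "pip"),
                               "install", "./src/" + project])  # our patched source tree
    payload = "/tmp/payload-%s.tar.gz" % version
    with tarfile.open(payload, "w:gz") as tar:
        tar.add(BUILD_DIR, arcname="payload")
    return payload  # this tarball is what gets seeded out via the torrent

# On each node, once the payload has landed, configuration is applied
# locally with masterless Puppet (triggered via MCollective) rather than
# by checking in with a central Puppet master, e.g.:
#   puppet apply --modulepath=/etc/puppet/modules site.pp
if __name__ == "__main__":
    build_payload("v152")
```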
So, are there any questions right now about the deploy mechanism? Back in the back, did you have a question? Okay. Yes, sir. The question is: when it was taking four to six hours, was the cloud down? No.

When we do a deploy right now, because of the way OpenStack is architected, and specifically the way the services behave, there's an API dip whenever the API nodes have to be restarted. How long it lasts depends, but it's usually not more than about 60 seconds that the API will dip down. And it's a bad customer experience: the customer gets a 500 error if they try to hit the API during that restart period. The other issue a customer might run into, and why we do these in the middle of the night for most of the world, is that when you restart a service like Glance or Nova while a customer is in the middle of doing something, say a resize or a snapshot, the way those services work today, the service forgets about it when it comes back up. So the operation goes to error, and either the customer has to start over or somebody in support operations has to go in and clean it up. Again, it's just a bad experience. But the cloud was not down during that time.

Right now, no. I know that's planned, around migration, and those are talks going on in the design sessions, but right now that's just not the reality of the way things are architected, as far as I understand. It is the way we want to go, though. We would love to be able to say, you know what, we're going to update this compute node, we're going to move everything over here, and it'll be great. But right now you can't do that. Pardon? Yes, they potentially won't be able to make calls to the API, and they may have an instance or an action go to error. Those are the impacts. Not for the six hours; only for whenever that service restarts, for that 30-second service restart.

With all of this deploy work, we're talking about upgrading the infrastructure that runs the cloud: the control plane, the Nova services. The instances are still sitting there happily. There is a small chance, if you're in the middle of a resize or something like that, that it could go to error and you could have a problem. But if you've just got a customer hitting your web server, and your web server is on the public cloud, you're not even going to know; none of this is going to impact you. So it's really just if you're provisioning new stuff. Yes, sir. Right now, I believe we would need to use Swift for that, and we're not using it, although we have talked about doing that; Swift is looking at implementing some torrenting. Right now this is off of a local server that we call the payload server, which is our seed node for the BitTorrent.

All right, so we feel like we have a pretty good handle on our deploy mechanism. We just did a deployment to London last week before coming here, and we finished it, from seeding out the nodes to verifying through tests and builds, in under an hour, which was really great. That was just awesome. But we had another hurdle, and that was actually catching up to trunk. We had gone about two months without deploying trunk, and there were a lot of changes. The code we finally pushed out, which we tagged as v152 just to keep track of it internally, was from February 28th. So it was just past the Grizzly feature freeze, and a lot had changed in the two months since the beginning of January. The cells code was in. There were some really big migrations on the database in there. So we decided to take it slowly. We were still hoping to catch up to trunk, but we weren't going to rush this.
So we let it bake in our pre-prod environment for a couple of weeks. We got ready for the database migrations, and then we deployed the code at the beginning of April, and database traffic increased ten times. That's what the graph shows you: the increase in database throughput in one of our data centers, our smallest data center, actually. And there was some panic, I have to say. Nobody else had really deployed anything from Grizzly to a production-scale environment, so this hadn't been seen yet.

We have some really amazing engineers and software developers in the community from lots of different companies, so we had Rackers and IBMers and Red Hatters and HPers looking through the code, figuring out what on earth was going on. Their gut feeling was that it was the periodic tasks, the ones that run periodically to do things like auto-confirm resizes. If a customer doesn't confirm a resize within 24 hours, a periodic task goes ahead and confirms it: get rid of the source VM, keep the destination VM. Same thing with rescue, things like that. So we turned off three of those. I don't remember what the third one was, and I realize I meant to look that up in case somebody asked. That gave us the drop you see at number three on the graph. So that was part of it, but not all of it.

So they continued to work, "they" being the whole community: lots of traffic on the mailing list, lots of discussion in IRC, sharing code, 20 patches, reviews, patch sets on the code review. We redeployed 152 with the community fixes, and that's where we got down to number four. We knew there was going to be some database increase, but very minor; that's what we were expecting to see.

What caused all of this, in that particular case, was a change in how the instance type table was being managed. You say that your instance type is flavor one, and then six months later your product people want to change what a flavor one is. Let's say we're not going to have a 512 MB instance anymore; we want one gigabyte as our smallest, and that's going to be flavor one. If you were to change that in the instance type table, it would break the history of all of your instances. So the community decided it seemed like a really great idea to put all of those ten columns from the instance type table into the metadata table as key-value pairs: column name, value, column name, value. For every instance, that increased the number of rows ten times. Then, in those periodic tasks, there were some inner joins happening against that metadata table which, up until now, had been returning just one row for every instance. Suddenly they were returning ten times as many rows. Those periodic tasks run as frequently as every minute, and it was just killing the traffic between the database and the compute nodes, because every compute node is querying every minute to see, do I have anything to resize?

The code fix, again, sounds simple. They took out the inner joins and made it into two queries, and they created a code path so that you can choose between a smaller number of queries or a smaller data set returned; that's basically my understanding of it. So we deployed that right before we came here, and now I'm here telling you the story of what we learned.
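To show why that metadata change hurt, here's a tiny self-contained sketch with a toy schema, not the real Nova tables, of the difference between joining against a key-value metadata table and doing two separate queries. The column names and sizes are made up; only the shape of the problem is the point.

```python
import sqlite3

# Toy schema, not Nova's: each instance has one wide row plus ten
# metadata rows (the old instance-type columns stored as key/value pairs).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, state TEXT, wide_row TEXT)")
db.execute("CREATE TABLE instance_metadata (instance_id INTEGER, meta_key TEXT, meta_value TEXT)")

for i in range(1000):
    db.execute("INSERT INTO instances VALUES (?, 'resizing', ?)", (i, "x" * 500))
    for k in range(10):
        db.execute("INSERT INTO instance_metadata VALUES (?, ?, ?)",
                   (i, "flavor_attr_%d" % k, str(k)))

def approx_bytes(rows):
    return sum(len(str(col)) for row in rows for col in row)

# Old pattern: the periodic task's inner join repeats every wide instance
# row ten times, once per metadata key, every minute, from every node.
joined = db.execute("""
    SELECT i.*, m.meta_key, m.meta_value
      FROM instances i
      JOIN instance_metadata m ON m.instance_id = i.id
     WHERE i.state = 'resizing'
""").fetchall()

# Roughly the shape of the fix, as I understand it: fetch the instances
# once, fetch the metadata separately, and stitch them together in code.
instances = db.execute("SELECT * FROM instances WHERE state = 'resizing'").fetchall()
metadata = db.execute("SELECT instance_id, meta_key, meta_value FROM instance_metadata").fetchall()

print("old join:    %6d rows, ~%d bytes" % (len(joined), approx_bytes(joined)))
print("two queries: %6d rows, ~%d bytes" % (len(instances) + len(metadata),
                                            approx_bytes(instances) + approx_bytes(metadata)))
```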
At the last summit, at Folsom, we were able to come and say we were already on Grizzly, already pulling from the next version of trunk that had just opened, already pulling in those blueprints and looking at that code. This time we're not able to say that. We're not able to say we're on Havana right now, and this is why: we had to stop, we had to resolve this. The upside is that for anybody else who pulls down the Grizzly release candidate, these fixes have all been ported into it, so hopefully you will have fewer pain points when you actually go to use it. There are still some big database migrations, don't get me wrong, but hopefully they're improved and less impactful.

So, from my final slide here, I'd like to start a conversation with the community at large about how we can adapt for scale issues, because we're not the only large-scale deployer; HP has a large-scale deployment too. We've been talking with Monty Taylor from the infrastructure team and Robert Collins, who's working on the Nova bare metal project, about how we can take what they're doing, adapt it to our environment, and start to use it. We're figuring out how to contribute to those code bases.

We do know we need more testing and more environment options, better test coverage in general, and more tests moved upstream into OpenStack itself, so people can know: hey, by the way, this code change might break Rackspace, might break HP, might break a large-scale public cloud deployment. And if you get to that level, it will most likely break you too.

We're also working internally, and we don't have anything to show yet, it's still conceptual, on some personal dev environment options. The idea is to take some excess hypervisor capacity that's been set aside for development and QA and let people make reservations for hypervisors, hook them up to networks, hook them up to the auth system, and see how it works. Take the code you've been working on in DevStack, that passes the unit tests and all of those things, put it into something that looks like your production environment, and make sure it doesn't break anything. That's something we're working on; I can't wait to have something to show, but it's about a two-week-old project, so there's not much to show just yet.

And then the other one is to simulate compute node numbers on a limited amount of hardware. Right now we don't have a couple thousand hypervisors sitting around to recreate all of the compute nodes and everything else, so we're looking for ways to simulate that load, because this particular issue we didn't find until we had thousands of compute nodes talking to the databases, sending traffic back and forth and polling every minute. On a couple hundred nodes, it wasn't a problem. We even tried to look at the traffic, and we could not see it; there just wasn't enough of an incremental difference.
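One way to think about that simulation idea, just as a sketch with made-up numbers and no real database behind it, is to spin up lots of lightweight workers that each do the same once-a-minute poll a real compute node would do, so the aggregate query load shows up without needing thousands of hypervisors:

```python
import random
import threading
import time

# Sketch only: pretend to be N compute nodes, each polling the database
# once a minute the way the periodic tasks do, so the aggregate query
# load shows up without thousands of real hypervisors.
NODES = 1000            # scale this toward the real node count
PERIOD = 60.0           # seconds between polls per node

query_count = [0]
lock = threading.Lock()

def run_query(node_id):
    # In a real harness this would run the actual periodic-task query
    # against a real database; here it just counts calls.
    with lock:
        query_count[0] += 1

def fake_compute_node(node_id):
    time.sleep(random.uniform(0, PERIOD))   # stagger starts, like the jitter
    while True:
        run_query(node_id)                  # "anything for me to auto-confirm?"
        time.sleep(PERIOD)

for n in range(NODES):
    threading.Thread(target=fake_compute_node, args=(n,), daemon=True).start()

time.sleep(2 * PERIOD)
print("queries from %d fake nodes in two minutes: %d" % (NODES, query_count[0]))
```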
The other area is database and code management. The DB migration pattern is, I think, one of the biggest conversations getting started around here, and from what I understand they're going to start with Glance. Yay! Glance has already got some work done towards that, so they're actually going to look at implementing a common DB migration pattern so that it's not all or nothing; we'd have options. And it would be really nice to think about DB calls against a database with six million rows instead of sixty. I have a database background, so I know it's really easy to write something that works great for a few hundred rows and then forget that it doesn't scale when you get into production. And then a code optimization path for large data sets, so you can maybe choose between a smaller number of queries versus the smallest data set returned; there are times when you want one over the other, depending on the application, and it would be awesome to see stuff like that.

One thing we learned in the process, as a community, is that we have to do everything we can to stay close to trunk. It's hard, but over the last six months, getting off of trunk was harder. Not knowing what we were going to get after two months, after three months, was just scary. At the same time, it would be really, really great to have a continuously deployable trunk, to introduce the mindset that my commit to this blueprint could be consumed immediately, so leave it in a state that's usable. It may not be complete, but it shouldn't break anything that depends on it: things like feature flags and all those patterns that people use in the web world, that Facebook uses, that Etsy uses, bringing some of those patterns in here. I'll put up a tiny sketch of what I mean by that in a minute.

So I think I have time for questions if you want to stay around. I'm just really glad to have the opportunity to share this story with you all, and I can't wait to come back in a few months and tell you where we are. So, any questions? Yes.

We cannot do zero-downtime compute upgrades at this point. There was actually a conversation yesterday about how to do zero-downtime service upgrades. It's an architectural limitation of OpenStack right now that I think will be addressed, or start to be addressed, in this cycle. Yes.

There is an actual randomizer that runs, and even with the randomizer going, it was still overwhelming the database traffic. That was why we were puzzled, and then there was one more of those we found later that was running every 10 minutes, and we're like, what is that? Because there's randomization, randomization down to the minute; why is this still happening? And it's the power state sync, I think. They're fixing that right now, they're submitting patches, "they" being the community.

So I think you had a question, sir. This man right here, he's the operations manager, and we just started having that conversation, really. Now that the emergency fires are out and we've finally been able to bring this conversation to the community, that's something I hope he's already talking about, and we want it to be a community conversation. We don't want to just go fix it for ourselves; we want to fix it with everybody. So yeah, there was a question over here. Yes, sir. I don't do networks, I'm sorry. This keeps me really busy, and networks scare the crap out of me.

Yes, sir. That's what we do right now: we go first with our smallest region, so that if something like this happens, the blast radius is small. Even when we had this ten-times increase, we went to London first, so we could do it in their middle of the night, which is our daytime. Customers didn't know what was going on; it wasn't impacting them; their requests were still going through, their traffic was still going through. So we do do that, and we generally try to do one region at a time when we're doing a big upgrade like that.
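Coming back to the feature-flag idea for a second, this is the kind of pattern I mean. A minimal sketch, with made-up flag and function names, of how a half-finished blueprint can land on trunk but stay switched off until it's ready:

```python
# Minimal feature-flag sketch: the new code path lands on trunk but stays
# dark until the flag is flipped, so trunk stays continuously deployable.
FLAGS = {
    "use_new_resize_path": False,   # hypothetical half-finished blueprint
}

def new_resize(instance):
    return "resized %s the new way" % instance    # merged, not yet enabled anywhere

def legacy_resize(instance):
    return "resized %s the old way" % instance    # the path everyone still runs

def resize_instance(instance):
    if FLAGS["use_new_resize_path"]:
        return new_resize(instance)
    return legacy_resize(instance)

print(resize_instance("inst-1"))
```

And on that randomizer question, the idea is roughly the toy sketch below: every node offsets its periodic tasks by a random amount so they don't all hit the database in the same second. As we saw, though, spreading the load doesn't help much once the query itself is ten times too heavy. The numbers here are made up.

```python
import random

# Toy version of the periodic-task randomizer: each node picks a random
# offset inside the minute so its poll doesn't line up with everyone else's.
PERIOD = 60
NODES = 600                                   # say, one cell's worth of nodes

offsets = [random.uniform(0, PERIOD) for _ in range(NODES)]
per_second = [0] * PERIOD
for off in offsets:
    per_second[int(off)] += 1

print("busiest second of the minute sees %d of %d nodes" % (max(per_second), NODES))
# Even spread out perfectly, the database still answers NODES of these
# queries every minute; jitter smooths the spikes, it doesn't shrink the work.
```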
Any more? Yes, sir. Right now we don't have an explicit effort going on, so if you have ideas, put them out on the dev mailing list. We have really hard problems to solve, and we need as many brains as we can get solving them. So if you do have ideas, that would be great. There was another one back in that corner, another hand. Yes, sir. Yes. You're welcome, you're welcome. I wish we could have done it sooner, or at least found it sooner, closer to the actual change. And we'll keep doing it; right now we're planning on keeping on doing that. You can also remember, when we come out with blood and dust on our faces, to say, yeah, that was hard, now get back in there. That's always helpful. So, all right, are there any more questions? Yes, sir.

What's he talking about, Brian? I would assume something like that, yes. So, that could be a whole other 40-minute conversation, where we talk about our INOVA implementation, where we use a cloud to run our public cloud, which essentially makes our cloud a cloud application. We have a bunch of really amazing Linux engineers who come from a managed hosting background, so even while we're telling our customers, "think about it like the cloud, think about it like the cloud," we're also going through that learning curve ourselves in our operations and infrastructure teams. We're like, oh wait, we're a cloud; our control plane is on a cloud. And yeah, that's a whole other conversation. Alrighty, any more? Yes.

Pardon? All of our regions have over six cells, and a cell is between 200 and 600 nodes, so on its own that's a small-scale deployment. We anticipate, by the end of the year or very quickly, bringing on capacity for all of our regions to have more than a dozen cells. So even our small data center is well over 1,000 individual hypervisors, with several thousand individual nodes. Yes.

OpenStack doesn't really believe in rolling back. It makes our release managers really, really nervous, so our strategy is to bull on through until we figure it out. Yeah, and cry a lot, and tell support we're really sorry, because they're the ones that are going to get all the angry phone calls. That has never happened, just in case you're wondering. We've had six-hour nights, bleary-eyed, with not enough alcohol, but we've never actually had to do that. And no, when you're expecting it to take two hours, and you don't start until 11 p.m. because you're on the East Coast, yeah, you cry. There is no way to roll back easily, and you know you have to keep going forward.

Yeah, so those are the problems my team is working on, because right now we're the ones up in the middle of the night asking why the deploy didn't work, along with his team, the operations team. I call them "operations plus," because they're way more than what you normally think of. Any more? I've got 10 minutes left. Or y'all can go grab more coffee and see if there's candy left. All right, thank you all so much, y'all have a great rest of your trip. Maybe see some of you tonight.