All right, so thank you for coming. My name is Rick Lopez. I'm the director of quality engineering at Rackspace. I'm going to be presenting with Rainia Mosher, one of the development engineers on the compute team. She's running a little bit late, so I have Brian Lamar, one of the engineers on our team, to jump in and help out whenever I get stuck.

So why are we here? We're going to be talking about deploying from trunk. It's the strategy that we selected for OpenStack at Rackspace. The whole reason is that we want to be able to deploy from trunk, on demand, in a multi-cell environment, with a reasonable amount of downtime for our customers, and preferably no downtime. "Reasonable" is a bit of a moving target because we're still working on improving that process, but we'll talk more about that later.

This is our branching scheme right now. On a daily basis, we do a merge from the OpenStack trunk to our Rackspace development branch, and in that process we deal with any merge conflicts that we encounter. Once we're ready for a deployment, we cut a branch with a major release version. We do that so that we can do any bug fixes or patches necessary without stopping the daily merge into our development branch. Any changes that we need to make, we apply to the release branch. We tag that so that we can do our new build and deploy it. Once that's in place, we roll those changes back into our development branch and then we submit those patches up to OpenStack.

In the most recent release, which I think was February 28, we internally call it 152, we had over 50 minor patches that we had to apply. That's because of all the continuous integration and all the issues that we encounter while we're testing; we had to apply that many in order to be able to deploy. And that's on top of the roughly 40 patches that we apply to that branch once we bring it down, for any Rackspace-specific changes that we have. That's anything from the way we do authorization to our billing and a whole bunch of other things.

Now, I'll let you talk a little bit about this one because you're a little bit more familiar with it. So this is the strategy that we use for our packaging. We broke our packaging and distribution strategy into three steps. The first one is packaging. We used to use Debian packages for everything, so we had this giant process where we had a Jenkins server, and all of our code from the slide before, on the custom branches, went through the Jenkins server and created these Debian packages. We had a lot of problems with that. For one, it wasn't very portable if we ever wanted to move to a different operating system. Right now we're on Debian Squeeze, but we ended up working with virtual environments, and basically we just run Python virtual environments. We make a virtual environment for each project, like Nova and Quantum, and we tar it up, and that is basically our deploy package. It's fairly flexible: with a little bit of tweaking it works on a lot of different operating systems. We haven't fully tested the portability, but that was the main reason we did it, mostly portability. So we package everything up in these tarballs that have all the Python code. And then we actually use BitTorrent to distribute it. We have thousands of nodes, and we use BitTorrent to seed and distribute all of this code between them.
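To make that packaging step a bit more concrete, here is a minimal sketch of the general idea: build an isolated virtual environment per project, install the code into it, and tar the result up as the deploy artifact. The project list, source paths, output directory, and version string are illustrative assumptions, not the actual Rackspace build tooling.

#!/usr/bin/env python
"""Minimal sketch: build one virtualenv per project and tar it up as a deploy artifact.

Assumptions (not the actual Rackspace tooling): project names, source checkouts
under /opt/src/<project>, and artifacts written to /opt/artifacts.
"""
import subprocess
import sys
import tarfile
from pathlib import Path

PROJECTS = ["nova", "quantum"]      # hypothetical project list
SRC_ROOT = Path("/opt/src")         # hypothetical checkout location
OUT_DIR = Path("/opt/artifacts")    # hypothetical artifact directory


def build_artifact(project: str, version: str) -> Path:
    venv_dir = OUT_DIR / f"{project}-{version}"
    # Create an isolated virtual environment for the project.
    subprocess.check_call([sys.executable, "-m", "venv", str(venv_dir)])
    # Install the project and its dependencies into that virtualenv.
    subprocess.check_call([str(venv_dir / "bin" / "pip"), "install", str(SRC_ROOT / project)])
    # Tar up the whole virtualenv; the tarball is what gets seeded out over BitTorrent.
    tarball = OUT_DIR / f"{project}-{version}.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(venv_dir, arcname=venv_dir.name)
    return tarball


if __name__ == "__main__":
    for proj in PROJECTS:
        print(build_artifact(proj, "152"))    # "152" mirrors the internal release name mentioned above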
So that's kind of where the distributing comes in. The tool we use to distribute it is BitTorrent, along with MCollective to actually kick off the download of the torrent and the seeding of the torrent. MCollective is by Puppet Labs, I believe. We had some trouble with parallel SSH; it takes a long time to parallel SSH to thousands of nodes, and MCollective has been the answer to that so far.

After everything is distributed, we do a couple of verification steps, but then the main thing is to actually execute that code and deploy what we've just distributed out there. We have a symlink-based process, which is pretty simple. On all of our servers we have a directory with all of these different versions downloaded. So there might be, as Rick said, 50 different minor revisions that we're testing out; we have 50 different versions downloaded, and we just have a symlink, a pointer, pointed at the one that should be active on that node. The entire deploy process is: put everything out there, change the symlink to the right version, and then run Puppet. Puppet handles everything from there. Syncing the database is, I guess, another step, but really, getting everything out there and then running Puppet is the main deploy process.

Cool. This is our development and deployment pipeline. From the previous slide, steps two and three, distribute and execute, happen in every one of these environments, and I'll outline what we do in each one. Development, obviously, is where our developers do their work. Then there's our continuous integration environment, in which we run smoke tests and unit tests. If everything works well there, we promote that to the QE environment, where we do functional and integration tests, mostly between the components and between internal products that we have at Rackspace. If everything from that point on is ready to roll, we deploy to our pre-production environment, and there we mostly focus on regression: we just want to make sure that before we say we're ready for production, we've run through our entire suite of tests. And if everything's ready to go, we tag it for release.

Yes? How long does it take? It depends on how many patches we need to apply based on the bug fixes that we find. Really, the only limitations in the system are how fast you can deploy and how fast the tests run, right? So from environment to environment, if you can deploy fast and test it, we can probably go from our integration environment all the way to production within a few hours. It doesn't always go that smoothly, because, again, you run into issues. I'll talk about a few things that we're looking at going forward that will hopefully shorten that.

So why do we do it? I think this is really the gist of the whole conversation: issue resolution. One of the things that we want to be able to do is find issues faster, make the community aware of them so that we can make a decision on them, and shorten the feedback loop. Ideally, before we even pull code down into our internal branch, we would like to catch those issues upstream so that we don't have to spend the time patching the branch internally and then submitting the patches up. And with that, it gives us the ability to solve those issues faster.
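Going back to the activate-by-symlink step described above, here is a minimal sketch of what that swap can look like on a single node, assuming one directory per downloaded version plus a single "current" pointer. The paths, version string, and the Puppet invocation are illustrative assumptions, not the actual Rackspace deploy code.

#!/usr/bin/env python
"""Minimal sketch of the activate-by-symlink step: point a 'current' symlink at the
desired release directory, then hand off to Puppet. Paths and version are illustrative."""
import os
import subprocess
from pathlib import Path

RELEASE_ROOT = Path("/opt/releases/nova")   # hypothetical: one directory per downloaded version
CURRENT = RELEASE_ROOT / "current"          # pointer to the version that should be active


def activate(version: str) -> None:
    target = RELEASE_ROOT / version
    if not target.is_dir():
        raise RuntimeError(f"release {version} has not been distributed to this node")
    # Build the new link next to the old one, then atomically swap it into place,
    # so there is never a moment without a 'current' pointer.
    tmp_link = RELEASE_ROOT / "current.new"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(target)
    os.replace(tmp_link, CURRENT)
    # Puppet takes it from there (config templating, service restarts, and so on).
    # 'puppet agent --test' uses detailed exit codes: 0 = no changes, 2 = changes applied.
    rc = subprocess.call(["puppet", "agent", "--test"])
    if rc not in (0, 2):
        raise RuntimeError(f"puppet run failed with exit code {rc}")


if __name__ == "__main__":
    activate("152.0.1")    # hypothetical version tag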
Right now, we're fixing those issues internally, and if you remember from this slide, we then have to submit those patches all the way up to trunk, so that takes a little bit of time. It makes that feedback loop really long. The deployment takes a lot of time. And obviously, as I'm sure a lot of you doing continuous integration know, one of the goals is to be able to do smaller, incremental releases. Instead of having to wait until Grizzly or Havana is completed and then deploy that and find all the issues at that point, we want to be able to do that a lot faster, with smaller releases, so that we can provide that feedback a lot sooner. That ultimately doesn't only benefit us internally; it also benefits the overall release cycle, because we've found a lot of the issues up front, so if we can bring those up to trunk, a lot of those issues will already have been resolved by the time the release is completed. That's really one of the biggest benefits that we see.

But it is very hard. As Brian commented earlier, we go through several patches to our branches. It takes a lot of time, and a lot of that has to do with the merge conflicts. That's one of the issues we have, because we have to go through that big loop; someone has to spend a lot of time resolving those merge conflicts. Not ideal. And those merge conflicts are really a product of our own patches. As you said, the slide before said we have about 50 internal patches that we're actively working to get submitted, because we're trying not to have them. The fewer custom patches we have, the fewer merge conflicts we're going to have. Absolutely. We do have disruptive ones, yes.

Hi. So this is Rainia. So I'm Rainia Mosher, the Rackspace dev manager for deployment infrastructure. Of our 40 or 50 patches, there are a few that don't go upstream because we have them so that they work with our billing system and with the internal auth that we're using right now. As OpenStack continues to mature, as the various projects continue to improve, and we get to a spot where we can move off of those internal systems, that is our intent. But right now we do have some patches that you guys just don't really want.

The other one is that we have disruptive DB migrations. That's one of the biggest issues we encountered the last time we did a merge internally. And then service restarts. Obviously we're striving for no downtime for our customers, so having to restart a service is not ideal for us.

From the testing perspective, which is what I'm most familiar with, one of the challenges we have right now in OpenStack is that we rely a lot on DevStack. DevStack is great, it gives you a lot of value, but from my perspective, I need to be able to test a full-scale deployment, and we don't have that ability right now. We're working on it, and I know there are a lot of initiatives around that, but it's one of the biggest challenges right now. There are a lot of issues that don't manifest themselves until you actually do a deployment at scale.

The question is, what do you do about database migration conflicts: if you make a mistake, do you take backups and things like that? Yeah. We take many, many backups, very frequently. For the majority of migrations, we don't modify the community migrations.
Every once in a while we will submit patches to those migrations because they're inefficient for a database of our size, but we always try to push that up into the community and fix the previous migrations. For example, there was one recently for instance system metadata that wanted to insert 10 rows into instance_system_metadata for every single instance you've ever had, even including deleted ones. So we were going to insert millions and millions of rows for that migration. It's just one of those things where we work with the community. I mean, we do everything inside the community.

Yes? Right. Okay. We heard in the last keynote, the Grizzly keynote, that you can move trunk to production within a short span of time. And you also mentioned that scale testing is an issue for you. So how confident are you moving to production without that kind of scale testing? I'm not. So that's... but we do it anyway. Yes. Come tomorrow at 4:30, in this room again, and we'll talk about how we're learning to scale OpenStack, and we'll really dive into some more of this and how we're doing it, the methodology, the process that we use.

So what have we learned in the last year? It's interesting; it's a good story, I think. Touching a little bit on the rework: this, again, has to do with the fact that we don't have a lot of our tests upstream, plus we don't have the ability to do that full-scale deployment. So we're finding a lot of the issues once the code hits our environment. By the time it hits our environment, we've probably gone through several days' or, in some cases, several weeks' worth of testing. Anything that we find creates a situation in which development now has to go fix it, we have to patch our release branch, and we have to roll that all the way up to OpenStack. So it creates a lot of delays in the process. That's something we're working to resolve, but it's one of the issues we encounter.

Do you upgrade your whole infrastructure at once, or do you do a small subset and see how it goes? How long does it take to spread through your whole environment? Rainia? Well, that depends. It took us probably about three weeks to get our latest release off of Grizzly out to production, and to be honest, it's still not in all of our data centers because of some of the issues that we encountered with performance and with the database migrations, which we worked through. And I know a lot of the time we tend to say, Rackers tend to say, "the community, we work with the community." We do want to say that we are the community; our Nova people are working together with all of the Nova people from everywhere to make this better. Tomorrow we'll actually be able to show numbers and information: what was the impact, how did we fix it? So we pulled code at the beginning of March, and it's still not everywhere, right? But we have been able to do all three of our data centers in less than an hour, including validation and build testing. So, back to the first slide, when we talk about a reasonable amount of time, it really does depend right now; what we consider reasonable is very variable. Right now, an hour is amazing. It's still not true CI/CD, it's still not really continuous, but when you've had to wait six hours for a deploy to roll through an entire region of multiple cells with hundreds of hypervisors, an hour is pretty awesome. So yeah, that's another one, if you're able to come tomorrow.
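As a side note on the migration example above, here is a rough sketch of the shape such a fix can take: skip deleted instances and insert in batches rather than walking every instance ever created. This is a SQLAlchemy illustration only, not the actual Nova migration; the column names, metadata keys, and batch size are assumptions.

"""Rough sketch of a batched data migration (not the actual Nova migration)."""
import sqlalchemy as sa

BATCH_SIZE = 1000   # assumption: tune for the size of the deployment


def upgrade(migrate_engine):
    meta = sa.MetaData()
    instances = sa.Table("instances", meta, autoload_with=migrate_engine)
    sysmeta = sa.Table("instance_system_metadata", meta, autoload_with=migrate_engine)

    with migrate_engine.begin() as conn:
        # Only touch live instances; the inefficient form walked every row ever
        # written, deleted instances included.
        uuids = conn.execute(
            sa.select(instances.c.uuid).where(instances.c.deleted == 0)
        ).scalars().all()

        # Insert in manageable batches instead of one enormous statement.
        for start in range(0, len(uuids), BATCH_SIZE):
            rows = [
                {"instance_uuid": u, "key": "example_key", "value": "example_value"}
                for u in uuids[start:start + BATCH_SIZE]
            ]
            conn.execute(sysmeta.insert(), rows)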
And in that presentation tomorrow I'll have more about that.

Lastly, we talked a little bit about the process, and this is where I think we have the biggest differences. Internally we're trying to do continuous integration with the goal of a continuous deployment strategy. Well, that is in direct conflict with how the OpenStack releases are actually being managed. We go for big milestones, right? So when a developer upstream is doing work, they're not necessarily thinking about someone consuming it down the road within a few days. They're just working on their feature, doing the development that they need to do, and that's it. That difference creates a bit of a problem for us, and that's where we find some of the scalability issues, because we're being impacted by them right away, among other issues. And the last one is the time that it takes to merge all those patches. It's an investment; it takes a lot of time. So we want to be able to solve that and address those things a lot faster.

And this is what we're trying to accomplish. From a code management perspective, we want to minimize as many local patches as we can. We want to do all the work upstream and not have to deal with a lot of these issues that we have when we're doing merges. Having non-disruptive DB migrations, as Brian was talking about: we just need to figure out a way, and I know there have been some discussions already at the summit about how we achieve that, so that's encouraging. Zero downtime: that's important for us and our customers. We don't want to have to bring down the service on a regular basis for our customers, plus I think most of the team is getting tired of doing super late releases in the evening; because we're trying to minimize the impact, we have to do it late at night. We want to be able to do it whenever we want, so that we don't have to worry about it. And then API versioning and rolling upgrades; I know you have a little bit more insight into that. Then let me move on to testing.

I'm just curious: does everybody know what we mean by zero downtime service upgrades? I've been here about a year. Anybody not quite get what that is? So I'm just going to explain it and why it's important. We do our deploys at 10 o'clock at night Central time to minimize the number of active customers. We don't want to do that; that sucks. I want to be at home, sleeping and watching television at the same time. But whenever a service restarts, if a customer is in the middle of doing something on their instance, on their VM, they're at risk of having it error out on them when that service restarts, because the service doesn't know what it was doing once it comes back up. So that image resize is going to go to error, or that server resize is going to go to error; that image snapshot has a chance of being errored out. It's a bad experience for a customer trying to use the cloud, and it's a nightmare for our operations and admin teams that have to go clean it up. There are ways to clean it up, there are ways to recover it, but somebody calls, they're really upset, they didn't want to go through that trouble; they were just trying to do a resize or a migration and it got interrupted in the middle. There are ways to clean it up, but it's manual and it's messy. So that's really what we mean when we talk about zero downtime service upgrades, and why it's so important.
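To make that downtime measurable (one of the metrics mentioned next is whether the API drops below a threshold during a deploy), here is a minimal sketch of that kind of availability probe. The endpoint URL, threshold, probe interval, and duration are illustrative assumptions, not the actual Rackspace monitoring.

#!/usr/bin/env python
"""Minimal sketch of an API availability probe run alongside a deploy."""
import time
import urllib.error
import urllib.request

ENDPOINT = "http://nova-api.example.com:8774/"   # hypothetical API endpoint
THRESHOLD = 0.999                                # hypothetical availability target
INTERVAL = 1.0                                   # seconds between probes


def probe_once() -> bool:
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=2) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False


def monitor(duration_s: int = 300) -> float:
    ok = total = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        total += 1
        ok += probe_once()
        time.sleep(INTERVAL)
    availability = ok / total if total else 0.0
    status = "OK" if availability >= THRESHOLD else "BELOW THRESHOLD"
    print(f"availability={availability:.4%} ({ok}/{total}) {status}")
    return availability


if __name__ == "__main__":
    monitor()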
And there have been some Nova conversations in the design track really around those zero downtime upgrades. For the API versioning, one of our metrics is that we monitor where the API drops and whether it goes below this threshold or that threshold. We like it to be 100%. I mean, that's our goal: 100%, always there. When you upgrade the API node and you restart it, it drops. It's gone. And it can lead to errors; it leads to a bad experience. That's really what it's about: it's just a bad experience. So API versioning is a path ahead to help with that, so that we can roll through an upgrade and not ever have it go down completely. Questions on that?

All right. So moving on to testing. This is where we can address some of the issues that we face around testing and finding bugs later in the cycle. We have a lot of initiatives internally where we want to move all our tests upstream, all our tests minus anything that's Rackspace-specific, obviously, because that wouldn't benefit the community. Our goal is that if we can move all those tests that we're using internally, the ones that are finding the issues down the road, and have them upstream, those issues will be found before we even pull that code down into our system. That allows us to fix them much faster, so we don't have to do more local patches internally and then go through that longer feedback loop. And then in terms of process, we have a few initiatives to make those tests part of the OpenStack CI/CD pipeline so that we can provide feedback. If anyone checks anything into trunk that breaks our implementation, we know there's something wrong; we know that someone needs to go back and look at what was checked in, what was changed, and address it before it gets down to our system or any other system, because we're all using the same code base.

And lastly, one thing we've had a couple of conversations about internally is for the community to get used to the idea that trunk should always be deployable. At any point, if you pull down trunk, you should be able to deploy it without having to make any changes. That's not necessarily the case right now. Part of putting the tests upstream should give us a little better certainty of that, and having them as part of the OpenStack CI/CD pipeline will help with that as well. But I'm sure there's a whole bunch of other things that we can do to ensure that. It's just a mindset that I think we all have to get into.

These are some sessions that go into a few more details on some of the information that we covered here. Tomorrow, we're going to... actually, today you're doing... Next is the gating validation of OpenStack deployments, which I believe Darrell Wallach from our QE team is doing after this, over on the B side. And then this afternoon, the 521 is another one of those, going beyond the API into the instances and confirming, just more of the testing and how we actually test. That one is a good one, because internally we actually test more than just making sure that the resource is available. Once you spin up a server, it's great to know that the server is there, but we go beyond that: we validate that everything you defined in your request actually made it. So it's going beyond just that immediate response. If you're interested in that, please attend. Tomorrow is the session that Rainia was talking about earlier.
And also the one at 2:40, I believe, Sam, are you doing that one? That is actually our CloudCafe tool, which is open sourced and completely available, and is how we're currently performing our tests internally; it's what we would like to open up for consideration by the community. So tomorrow at 2:40 is a deep dive into that, along with learning to scale OpenStack and our story from the last year.

And the last one is Robert's? Yeah. On Thursday at 9 o'clock in the morning, hopefully Wednesday isn't too much fun, Robert Collins from HP, who works closely on the Nova bare metal project, is actually going to be presenting on continuous deployment for upstream OpenStack, really starting to address and talk to the topic of how to keep trunk continuously deployable so that anybody could just pull it down and it would work at any time. So that will be an interesting conversation.

So that is most of the material that we wanted to cover. Questions? Yes. Any thoughts on open sourcing parts of this? Parts of? All of it. All of it? We are. We definitely are doing that on the testing side. Again, tomorrow there's a conversation about the testing framework that we decided to use internally. We are making it available to the community for review and feedback; we want to see where it can be improved and how we can make use of it. That's available right now on StackForge; you can look for OpenCafe, CloudCafe, or OpenRoast. Those are the three components of the big testing framework that we use.

I heard an interesting tidbit from our images team that works on Glance. They're working on the scheduled images code and blueprints right now, they're in the testing phase doing integration testing, and they were running their smoke suite at 30 minutes. They turned on parallelization in CloudCafe and it went down to 6 minutes. I heard that today, and it was like, oh my goodness, that's amazing. And that's without even really trying hard.

One other thing real quick: there is a third piece, other than the testing. The actual deployment code that we're using to deploy things is planned to be open sourced as well. And actually, we would really love to work with Robert Collins and them to figure out how to do our deployments the exact same way as the OpenStack infrastructure team and the Jenkins project, so that we're all deploying using the same tools, as long as they scale correctly. Scale has been the biggest limiter for a lot of tools and deployments.

So, I don't know whether you can share it, but in terms of the network topology, do you guys use VLANs or tunnels? Yeah, I mean, our data center has all sorts of VLANs; I really couldn't tell you how and when we use them. Pardon? Oh, Chad, yeah, you talk. Let me let Chad answer; Chad is one of our network gurus back here in the corner. Thank you for speaking up. So it's a mix of both. We actually do a hybrid: we do a bridged network for basically the north-south traffic coming out of the VM, out to your public connection, and then we offer overlay networks within the data center.

And then our overarching architecture is the Nova-in-Nova, OpenStack-in-OpenStack model, where we use an OpenStack cloud to deploy our control plane and then manage our customer capacity, which lives in cells. It's kind of trippy; it took me a year to understand it. But it allows us to build four VMs or four hypervisors as a seed node.
We do that manually once, and then we can kick off everything that we need with scripts; it's automated. And we're actually looking at the Nova bare metal project and Heat, as they mature and improve, to be able to automate even those initial seed nodes so that we don't have to touch them at all. So this is why we're very excited to see what's coming. We have to continue working on this because we have a production cloud, but we are part of this and we're aligning with that. Any other questions? All right. Well, thank you for coming. If you have any other questions, you know, you can grab us on the side.