My name is Matt Fisher — you're probably tired of seeing me by now; this is my fourth talk. I really want to point out that Clayton O'Neill did a lot of the work on this, so I wanted to call him out here. I'm here today to talk about upgrading OpenStack. This is the story of Icehouse to Juno: all the things we tested, all the things we found in testing, and all the things we didn't find in testing. Last winter, after Christmas, we decided it was time to go to Juno, and some things went really well. A couple of things didn't. Before I get started, has anyone here done an OpenStack upgrade before? All right, some of you. The rest of you will probably learn something here.

First, our general philosophy. We don't like to get behind: plan on upgrading every six months, some services more often than that. I don't really trust jumping multiple releases. I know it's possible, but I think the single-step path is better tested — I to J, J to K. Keep up with it, or you're going to have problems. Automate everything — I'll go into that in more detail later. Test the process over and over; don't go to production and try it for the first time. Get the process down, and figure out how you're going to do rollbacks in case it goes horribly wrong. Also, this may impact your customers. It's probably going to impact the APIs, and it may have worse impacts; figure out what those are before you go to production. And try to do all your testing with production data if you can.

I want to talk more about automation, because this is a big thing for us. A lot of people have to do upgrades in the middle of the night during a typical outage window, and that's when you make mistakes. If you type something wrong — our team has rebooted a compute node in prod by accident. We've even installed the wrong version of Ceph on a box by accident. So be careful. Automation also makes testing easier: I said to test over and over, and if you have a 25-step checklist that you're doing by hand, that's a pain to test; if you have a script, it's a lot easier. It takes some pain to write these automated upgrades, but you're going to use them again — we've already started the process for Kilo, using the same tooling we used before with a few tweaks. Finally, our automation scripts are our documentation. I'm actually in the process of upgrading Keystone to Liberty, and I sat down and said, how exactly did this work again? I read the scripts, and the exact process is right there.

Okay, so prepping for Juno. You do have to do prep work before you can do an upgrade. The first big thing for us was the Puppet modules. We were on the Icehouse branch, and we moved to master using git-upstream, which is basically a way to import changes from upstream Git on a regular basis and automatically merge them in. What we found during this time is that config options move. OpenStack developers like to move config options around just for fun, and if an option was in the DEFAULT section before and is now in another section, services won't work — or, in the best case, you'll get deprecation warnings. A lot of the Puppet changes we made were to fix these errors. The other thing we did was set something called `ensure => latest`, which I'll explain on a later slide; it's a Puppetism that basically simplified this process.
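To make the config-option moves concrete, here's a minimal illustration using oslo.config (this is a hypothetical example, not a specific Juno change): when developers relocate an option, the old location usually keeps working only if they wire up a deprecated alias, and the service then logs a deprecation warning whenever you use it — otherwise the old section is simply ignored.

```python
from oslo_config import cfg

# Hypothetical example: an option that moved from [DEFAULT]/sql_connection
# to [database]/connection. The deprecated_* aliases are what keep the old
# location working (with a deprecation warning); if a project moves an
# option without them, configs still using the old section silently break.
opts = [
    cfg.StrOpt('connection',
               secret=True,
               deprecated_name='sql_connection',
               deprecated_group='DEFAULT',
               help='Database connection string'),
]

CONF = cfg.CONF
CONF.register_opts(opts, group='database')

if __name__ == '__main__':
    CONF(args=[], project='nova')   # picks up /etc/nova/nova.conf if present
    print(CONF.database.connection)
```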
Clayton and I got some unexpected benefits from fixing all these Puppet issues while moving to master: our number of contributions skyrocketed, we started participating in the meetings and on the mailing lists, and we eventually got voted in as core contributors for the Puppet modules. So that was a nice side effect.

The other thing we did was a lot of planning. We actually wrote a blueprint and reviewed it on our own internal Gerrit server. That lets the service experts weigh in — the Neutron expert and the Cinder expert can say, this is how I want you to handle this part of the process. We also dug deep into the database schema migrations, because that's what gives us our most significant outage issue. We have an active-active HA setup, so we have the problem of old code against a new database, or new code against an old database. When we had done Keystone before, we actually read through all the migrations, but it only takes one or two things in a migration path to cause a problem. So at this point we just decided we were going to take an API outage and keep it short. We also tested for deprecations: we did a basic upgrade without any automation, then dug through the log files to see what was deprecated and what was broken. We made a master list; four or five items we fixed, and the rest we decided we could fix later. Then there were doc bugs. The docs are great, but there were bugs: I saw incorrect default values, config values listed in the wrong sections, even things like the Nova compat flag having the wrong name in the docs. So check the docs, but you may need to double-check that the values are right. We submitted bugs for these and fixed them when we had time. One more quick thing: our Neutron database had never been stamped with Icehouse. Without a stamp, the database migrations weren't going to be run by Puppet, so we pushed out a stamp.

So how do you test upgrades over and over? Because once you've done an upgrade, what else is there to do? We use what we call virtual OpenStack environments: disposable, easy-to-build OpenStack environments where you can stand up a Keystone node, a control node, and a compute node, run your automated upgrade, see what happens, then tear it down and start over. Even if you do this over and over, one thing you can't forget is to test a clean build. A node that had Icehouse packages with Icehouse Ubuntu config files, and then got upgraded with the new Puppet code running on it, might not end up with the same config as a fresh node with Juno Ubuntu packages and Juno Ubuntu config files. We actually found a couple of problems this way: the Nova database connection setting wasn't being written properly by Puppet, and I think some Neutron options were missing too. Finally, dist-upgrade — that's an Ubuntu term, but the idea applies anywhere: do your upgrade, then run essentially a dist-upgrade to see what packages are missing and what was left behind (there's a small sketch of this check below). We found four or five things that were left behind, mostly Python libraries, so we ended up managing them with our own Puppet manifests.

You can't solve everything with testing, though, especially in a virtual environment. There's really no way to test the impact on the customer, and the lesson we learned is that we need to figure out a way to do that — a way to see what the impact is going to be on the API, and possibly on the network.
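For the leftover-package check mentioned above, a simulated dist-upgrade after Puppet finishes is enough to show what the upgrade didn't touch. A minimal sketch, assuming an Ubuntu/apt node (the output parsing is approximate):

```python
import subprocess

def leftover_packages():
    """Run a simulated dist-upgrade and report packages that would still change.

    After the automated upgrade has run, anything listed here was left behind
    (or held back) by the upgrade tooling and needs a closer look.
    """
    out = subprocess.check_output(
        ['apt-get', '--simulate', 'dist-upgrade'],
        universal_newlines=True)
    # Simulated runs print lines like: "Inst python-six [1.5.2-1] (1.9.0-1 ...)"
    return [line.split()[1] for line in out.splitlines() if line.startswith('Inst ')]

if __name__ == '__main__':
    for pkg in leftover_packages():
        print(pkg)
```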
There are a couple more things you can't test this way. We're big on using live migration for maintenance, and we have no way to exercise that in a virtual environment. You also can't simulate a hardware load balancer, a large Ceph cluster, or a large Swift cluster.

Just two quick background slides before I walk through the process, so I don't lose everybody. Earlier I talked about `ensure => latest`. If you know Puppet, this sounds crazy, because it would mean that any time a new package dropped, you'd get the latest version — upgrades would be uncontrolled and completely random. The way we make it safe is with a daily repo that's date-stamped, and our repo pointers have that date stamp in them. So if you boil the upgrade down, aside from all the automated scripts I'm going to talk about, we move the repo pointer — say from January 1st to May 1st — and then run essentially an apt-get upgrade and pull in all the new packages for the node.

Another quick overview of the node-type terms I'm going to use. Compute nodes are the hypervisors. Control nodes run the API services, the standard ones — and the control nodes are most likely going to be the hard part of the upgrade for you, or wherever you run Neutron. Our control nodes also host the database; we don't have separate database nodes for Cinder, Glance, and so on. Keystone and Horizon nodes we're not going to talk about today. We run them on a separate set of nodes backed by a globally replicated cluster. Keystone is already upgraded — we're already halfway to having it on Liberty — it's on its own cycle with its own process, and I don't have time to go through it today. Horizon is also special, and we'll cover it at the end.

So let's go through the initial plan. The starting point is a basic control cluster: three control nodes, everything running Icehouse. On these nodes are the virtual routers, and the nodes are part of a shared MySQL cluster that's also at Icehouse. All external API calls come in through a hardware load balancer. Now I'm going to talk you through how we got this diagram to be a Juno diagram. The first step is essentially to shut down two of the nodes — shut down the database, shut down the services — and get everything onto one node. Then migrate all the routers onto that first node, so that node is now your cluster. It's a cluster of one, which isn't great, but you're not in an outage yet. Now we're going to upgrade this cluster. When we do the upgrade, we shut down the A10, because I don't want to deal with customers booting instances while I'm in the middle of this. But because we do that, we need to remember to tell Puppet to run against the internal endpoints rather than the A10, which it tries to use by default — those external endpoints aren't working while the load balancer is down. We run Puppet via Ansible. Puppet moves the repo pointer, as I mentioned, downloads and installs packages, reconfigures services, runs the database migrations, and then restarts all the services. The other very important thing Puppet does for us here is set the Icehouse compatibility flag, because this diagram only shows one control node and I still have a lot of compute nodes on Icehouse — I basically need Nova to keep speaking Icehouse until I've upgraded everything. Once this completes — and it took about five minutes — we do a bunch of really quick validations, smoke tests.
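That compatibility flag is Nova's `upgrade_levels` setting. A minimal sketch of pinning it, assuming the stock /etc/nova/nova.conf location — in our deployment Puppet templates this, so the script is only to show which option is involved:

```python
import configparser

NOVA_CONF = '/etc/nova/nova.conf'

def pin_compute_rpc(version='icehouse'):
    """Pin nova-compute RPC compatibility so upgraded services can still talk
    to not-yet-upgraded compute nodes. Remove the pin once everything is on Juno.
    """
    cfg = configparser.RawConfigParser(strict=False)
    cfg.read(NOVA_CONF)
    if not cfg.has_section('upgrade_levels'):
        cfg.add_section('upgrade_levels')
    cfg.set('upgrade_levels', 'compute', version)
    with open(NOVA_CONF, 'w') as f:
        cfg.write(f)

# Resulting nova.conf fragment:
#   [upgrade_levels]
#   compute = icehouse
```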
We're in an outage at this point from a customer point of view — they can't get to the API — so we have to be very fast. We ran the smoke test and a couple of spot checks, which took about a minute, and in total the API outage was about seven minutes. Once the smoke tests pass, we turn the A10 back on, we're out of the outage, and everyone can use the cluster again. The next thing we do is get the Galera cluster back up: we start bringing nodes back into the cluster, the data replicates across, and the cluster is now essentially a Juno-level database cluster. Once that's done, we simply run Puppet on the remaining two nodes until they're also on Juno — that's just a package upgrade, about five minutes each. Finally, to get back to the original diagram, we have to move the routers. We don't like having them all on one box, so we have a router-balance script that spreads them evenly across all the nodes (there's a small sketch of the idea below). At that point we're done, and the upgrade should be great. That was the plan, and that's what we wrote the scripting to do.

Now you're probably wondering: I've only described control nodes — what about computes? Our original plan was to do the compute nodes the next day rather than in the middle of the night: live-migrate the instances off each box, then just use Puppet to run the upgrade. Live migration takes a while, and we didn't want to be doing it at three o'clock in the morning. Once that was done, the compute nodes would be good, and we'd go back and remove the compat flag so everything would be speaking Juno between all the Nova components.

So we sat down in a team room to do this against a hardware dev environment. Our plan was great, and we had tried it a whole bunch in these virtual environments — probably 30 or 40 times — so everything was going to work completely flawlessly. I'm going to lead into what happened next with a quote from a Prussian general, Helmuth von Moltke the Elder. He was clearly speaking about OpenStack when he said, "No plan survives contact with the enemy." There are no Prussians here today, so I can probably get away with saying that.

So what happened? The very first thing we did was take everything down to that single node and start running the package upgrades. Puppet dies. We dig through the logs and find out that MySQL has died — during the Nova database migrations. So we did what any sane programmer would do: we just ran it again. And of course it died again. MySQL kept crashing. The system was now half on Juno, half not, the database was in some weird state, and at that point it's probably time to dig in and finalize that rollback plan. Previously we had kind of talked through what a rollback plan would look like — maybe ten minutes of discussion. As it turns out, you really need a more detailed plan when this happens; you forget things, especially when you've just drawn stuff on a whiteboard. But we did get a chance to run through it in dev, which is better than in prod. So how do you get this system back to Icehouse? Pretend this is production, it's three o'clock in the morning, and you're in an outage. Here's the current state: we have all the routers on this node, which is running some kind of pseudo-Juno/Icehouse mishmash, and we need to get back to a functional cluster. It doesn't matter if it's one box or two. This is not good.
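A quick aside before the rollback: the router-balance script I mentioned a minute ago is nothing exotic. This is roughly the idea, sketched with python-neutronclient — it is not our actual script, and the credentials, error handling, and balancing rule are placeholders:

```python
from neutronclient.v2_0 import client

def rebalance_routers(neutron):
    """Spread routers roughly evenly across all alive L3 agents.

    Rough idea only: list routers per agent, then move routers off the most
    loaded agents onto the least loaded ones until the counts are close.
    """
    agents = [a for a in neutron.list_agents(agent_type='L3 agent')['agents']
              if a['alive'] and a['admin_state_up']]
    routers = {a['id']: [r['id'] for r in
                         neutron.list_routers_on_l3_agent(a['id'])['routers']]
               for a in agents}

    total = sum(len(r) for r in routers.values())
    target = total // len(agents)

    donors = [a for a in routers if len(routers[a]) > target + 1]
    receivers = [a for a in routers if len(routers[a]) < target]

    for src in donors:
        while len(routers[src]) > target + 1 and receivers:
            dst = min(receivers, key=lambda a: len(routers[a]))
            router_id = routers[src].pop()
            neutron.remove_router_from_l3_agent(src, router_id)
            neutron.add_router_to_l3_agent(dst, {'router_id': router_id})
            routers[dst].append(router_id)
            if len(routers[dst]) >= target:
                receivers.remove(dst)

if __name__ == '__main__':
    # Credentials and endpoint are placeholders.
    neutron = client.Client(username='admin', password='secret',
                            tenant_name='admin',
                            auth_url='http://keystone.example.com:5000/v2.0')
    rebalance_routers(neutron)
```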
So the first step of the rollback is to get everything on this box shut off: the OpenStack services, Rabbit, and MySQL especially, because you do not want this data replicating back out. And don't forget to turn Puppet off, because Puppet will come through and helpfully turn all of these back on for you while you're in the middle of this. While we're shutting down the first node, we quickly merge a revert of the changes that drove the upgrade back into our Git tree, get that code onto the Puppet master, and then bootstrap the Galera cluster. Because we had left Galera shut down on the other node, that node is still Icehouse from a database point of view. We do have database backups, but this data was still okay, because we had never rejoined it into the cluster. Once it's back in the cluster, we basically just let Puppet restart all the services for us — it handles that pretty well. Make sure everything's good, and then get all those routers off the bad node onto this one. In the background, we get the other node back up so we're not down to one node. The main goal was to get one node up, so that's a background task; having a second node gives us another target for routers, so we don't have everything on one box. The final part of the plan is to rebuild the bad box: dd over the drive, PXE boot it, and Puppet brings it up, automates everything, and gets it back into the cluster on Icehouse with no problems. We were in a panic that day, because we hadn't really written the plan down, so we immediately dd'd the hard drive and powered it off — and then ten minutes later we realized we probably wanted to get logs and things off the node first. So don't panic. Build this into your plan: if you hit this kind of weird problem, you're going to want to collect that data. You just want to be sure the node can't rejoin the cluster — take the Puppet agent off the box, even remove MySQL, whatever you need to do — you don't want it coming back up and corrupting your cluster. So we re-PXE-booted the box, and we're back to my first slide.

That took most of the afternoon, and then we sat down to think about what to do about these database migrations. We took the database backup I told you about, put it in a virtual environment, and ran the upgrades against it. We were able to track it down to one exact statement that caused the crash. So we worked with Percona on this: we sent them the statement, the config, the database dumps, the versions of MySQL, et cetera, and worked with them closely over the next couple of weeks to get a fixed version out, which has been generally available since February. Once we got the new package, we deployed it. So now we're good: MySQL's fixed, Galera's going to be great, and we're never going to have another problem.

We sat back down in the room, got everyone together, everyone's excited, and we go do the upgrade. This time the control nodes are fine — MySQL is fine, there are no database migration issues. But once we'd done the first set of control nodes, we went to run our test plan, and our test plan includes live migration. That's when we found out that when you're running with the compat flag set, Nova will not let you live migrate. It prints out a horrible, dire warning that it's going to blow everything up. I think this may have been changed recently, but it did affect our plans, because we were going to live-migrate and run the compute upgrade the next day. So: second change of plans.
Drop live migration. It's impactful, it takes a long time, and it wasn't going to work anyway, so we didn't really have a choice — and we'd just do the compute nodes that same night. With the new plan we upgraded dev, and it actually worked. We upgraded both staging environments with no issues, and now we're ready to go to prod. We sent out an announcement to our customers, told them about all the great features they were going to get and that there might be a slight blip during the maintenance window, and we were really ready to go. We got everyone together at eight or nine o'clock at night. Now, production was the first environment in all of this that actually had load on it. I'm sure that couldn't possibly be a problem.

What happened in production? Well, remember how we put all the routers on one node? When you restart Neutron and the OVS-related services, the flows get dropped, and they time out. And when the flows time out, customers can't get to their VMs, and our customers' customers — external people — can't get to those systems either. So alerts start blowing up, and this is every virtual router, which means every customer in this region. The solution was essentially to wait and watch the flows come back — I'll show a small sketch of what that looks like at the end of this section. It wasn't a great solution. Also, some routers didn't come back right; that can sometimes happen. The fix is simply to migrate them, but at this point, with one node, we had nowhere to put them. So the only thing you can do is try to get another control node back online as quickly as possible, and you're racing the clock, seeing how many customers are going to call you within the next six or seven minutes.

Also, remember how I said to test with production data? We actually did that. We pulled database dumps from staging and production, ran the database migrations against them, and we were sure they were right — except I had only pulled them from one region. When we ran the other region, the Neutron database migrations failed, something about a missing key. The table was empty, so we dropped it, and the migrations recreated it for us. That was a spur-of-the-moment decision, completely unplanned. So test with database backups from both regions if you can, or this will happen to you.

This all finally completed, and the upgrade was done. We did cause that network outage to customers with the OVS issue, but other than that it actually went pretty smoothly — all the other issues I've told you about, like the MySQL crash, happened during testing. It wasn't too terrible, but there are things we'd like to fix for the future, because we think there should be a better way to do these upgrades. Most of our problem actually comes down to package installs. Of all the downtime we had, package installs were about five minutes — almost all of our downtime was watching packages download and install. The other thing: we had originally thought, let's upgrade one service at a time and keep each change really minimal. But because of package dependencies and interdependencies, that's not doable for most packages. So you end up with this massive upgrade of everything at once, at least within a box, and I think that's just inherently riskier. We'd like to be able to upgrade, say, Nova and Neutron, or maybe just Heat, without causing that problem. So we've actually spent time looking into Python virtual environments.
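That sketch of the flow issue first: "wait and watch the flows come back" really is that crude — you can poll the flow count on the integration bridge and wait for it to recover. A rough sketch, assuming Open vSwitch's ovs-ofctl and the usual br-int bridge name (run on the node hosting the routers):

```python
import subprocess
import time

def flow_count(bridge='br-int'):
    """Count OpenFlow rules currently installed on a bridge."""
    out = subprocess.check_output(['ovs-ofctl', 'dump-flows', bridge],
                                  universal_newlines=True)
    # First line is a reply header; the rest are flow entries.
    return len(out.splitlines()) - 1

def wait_for_flows(baseline, bridge='br-int', timeout=600):
    """Poll until the flow count gets back near its pre-restart baseline."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        count = flow_count(bridge)
        print('%s: %d flows (baseline %d)' % (bridge, count, baseline))
        if count >= baseline * 0.9:
            return True
        time.sleep(5)
    return False
```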
We really like the advantages Python virtual environments give us here. You don't have to worry about conflicting versions of Python libraries. And this isn't just for upgrades: we run Horizon against master in a virtual environment all the time, and the rest of the software on that box isn't tracking master, with no interference between them. Also, if you do this, instead of one massive upgrade that the whole team has to be involved in, with a massive plan, the people who own each individual piece can be responsible for it. If the team running Glance is happy with Glance on Kilo, leave it. I'd like Heat to be on Liberty, so I can upgrade Heat and verify it isn't going to break everybody else. The downside is that the OpenStack Puppet modules don't support virtual environments. We've written a proof-of-concept patch for Designate, but it isn't really going to be accepted by the community in its current form; it needs a lot more discussion — it's about a thousand-line patch right now — so we're trying to figure out the plan for moving it forward. As I mentioned, we're also using virtual environments for Horizon; we deploy Horizon with Ansible, so we don't have the Puppet issue there.

So if you have virtual environments, how do you do an upgrade? It's extremely simple. Ahead of time, you stage the code you want on the box — you can have ten versions of the code sitting there if you want. Then, when you're ready, you move symlinks, maybe update a config file if you need to, and restart the service. It takes a couple of seconds. If it blows up, you undo it in a couple of seconds. No package downgrades, which I'm not a fan of. There's a tiny sketch of this at the end of this section. If you do this, though, you don't want to deploy directly from PyPI: things move underneath you, or PyPI is unreachable in the middle of your deploy, and that wouldn't be cool. So we build mirrors on a per-service basis — we take all the libraries Horizon wants and build a mirror just for that. We don't mirror all of PyPI; it's huge.

Another lesson: we need to have load in our dev and staging environments, and we've actually worked with our test teams to do this now. Now, any time we do a major upgrade, a kernel upgrade, anything like that, we can tell: is this going to cause someone to lose networking for five seconds? Is the API going to be down? What's the impact going to be? Not having this beforehand was probably one of our bigger failures. We also have instances that are permanently running, and we blast API calls against them constantly, so we can tell things like "we had a two-minute API outage against this box." When I say API calls, it's a very simple curl-and-response check. I've already mentioned this one: take database backups from both regions if you can, if the files aren't too big, and run the database migrations against them. And finally, one that was really interesting: we chose to do this in the late evening, mountain time. A lot of our customers are on the east coast, so we thought we'd be fine — we were looking for the lowest API usage time. But it turns out our customers care more about what their customers are doing. If you're running a website where people pay their cable bills, they come home from work and pay their cable bills in the evening — and that's one of your main customers — you don't want to be doing upgrades at that time. We've actually shifted our upgrades to early morning, mountain time. You may want to look at that too.
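Here's that symlink-flip sketch for the virtualenv upgrade flow — stage the new code, flip a link, restart. The paths, service name, and `current` symlink convention are illustrative rather than exactly what we run; the atomic rename is the part that matters:

```python
import os
import subprocess

VENV_ROOT = '/opt/openstack/horizon'      # staged trees live under here
CURRENT = os.path.join(VENV_ROOT, 'current')
SERVICE = 'apache2'                        # whatever runs the service

def activate(version):
    """Point the 'current' symlink at an already-staged virtualenv and restart.

    The staged tree must already exist; rolling back is just calling this
    again with the previous version.
    """
    target = os.path.join(VENV_ROOT, version)
    if not os.path.isdir(target):
        raise RuntimeError('%s is not staged on this box' % target)

    # Build the new link next to the old one, then rename over it so the
    # switch is atomic -- there is never a moment with no 'current'.
    tmp_link = CURRENT + '.new'
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.rename(tmp_link, CURRENT)

    subprocess.check_call(['service', SERVICE, 'restart'])

# activate('2015.1.1')   # upgrade
# activate('2014.2.3')   # ...and this is the rollback
```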
The right window will be different for every company. On to Kilo: despite the issues we had, we think our process was actually pretty good. It was completely automated, we learned a lot, and we've changed some of the process to avoid the problems we hit before. We've already started on this. My colleague Clayton has already gone through most of the items here: we've submitted Puppet fixes, he's refined the automation, and we're moving quicker. We want to be doing this upgrade three or four months after the release — we want to be doing it when we get back next week. Clayton just told me today that he ran through an upgrade, and it's not five minutes anymore; it's now two and a half minutes, so that's better. We do still need to do something to mitigate the possible network outage, but if we can keep it short, that might be acceptable.

Despite all the automation I've mentioned, fundamentally there are only a few steps: put new packages on the box, run the migrations, drop in new config options, restart the services. It's four steps — but you're going to need a way to do them in a certain order, with certain nodes going before other nodes. In my opinion this wasn't too horrible. I think it's a lot better than it was in the past, and as I said, Kilo is already faster and I think Liberty is going to be even better. The painful part, the pre-work, was figuring out the order — which nodes go first, how we might recover — and coming up with that plan. Once we had it, we could keep tweaking it. So hopefully, by sharing our fun and our pain today, you can think about your own upgrade process: how it might be better than ours, what it might have in common, or things you might want to take away from this. And I have some time for questions.

Surely you have a question, Dave? [Audience question.] Yeah, it's the Nova upgrade_levels flag, and that was the one that was wrong in the docs. The question was what the compat flag I mentioned is — it's the Nova upgrade_levels flag. [Audience question.] Yeah, that's an interesting topic we talked about earlier this week: the status of master. I think they're waiting on a couple of things in order to branch off what they call stable/kilo — one of those is some Keystone v3 stuff, and there was something else in there. We did our tests — Clayton's doing his tests right now — against master, so it'll just be missing a few things. We try to stay on master because once you're on a stable branch, if you want to land a fix it's two commits instead of one, and it's just slower. For the benefit of the recording, could we have the rest of the questions at the floor mic, if possible? Thank you. I can repeat it. [Audience question.] I can explain it to the limit of my knowledge. It's essentially a stamp: when the migrations run, if they don't see the stamp — and ours was unstamped — they don't know what version they're starting from, so they're not going to take you to Juno. Stamping is a simple operation; we just had an Ansible script go blast the stamp out everywhere. It didn't have any impact and didn't require a restart; it was just to ensure the migrations would run when we went to Juno. The databases are stamped with Juno now, because they've been migrated. They weren't stamped before, I'm guessing, because Icehouse was our first install and we had never upgraded. So the question is about the initial deployment — the initial db-sync must not have stamped it, because we didn't have a stamp. Now, maybe that's something that's been fixed.
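For reference, the stamp itself is a single neutron-db-manage call per database; what we pushed out with Ansible boiled down to roughly this (sketched in Python here — the config-file paths are the stock Ubuntu ones, and the revision label has to match one your Neutron migration tree actually knows about):

```python
import subprocess

# Hedged sketch: mark an existing, never-migrated Neutron schema so alembic
# knows where it is starting from. Paths and the 'icehouse' label are
# illustrative and must match your deployment.
subprocess.check_call([
    'neutron-db-manage',
    '--config-file', '/etc/neutron/neutron.conf',
    '--config-file', '/etc/neutron/plugins/ml2/ml2_conf.ini',
    'stamp', 'icehouse',
])
```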
You'll notice right away if this happens, because things aren't going to work if the migrations don't run. And maybe that was an Icehouse issue that has since been fixed — the comment from the audience was that the lack of a stamp, or the stamping, has likely been fixed. There are a couple of other things I probably didn't mention about how we start a cycle; this is how we started with Kilo. We dive into the release notes, and one of the decisions we have to make is: do we want to do this now, or wait for the .1 release? A lot of people wait for .1, because .0 is when people are trying to upgrade for the first time and finding these problems, rather than the gate checks finding them. We went with .1 for this upgrade — mainly, probably, because of the holidays and the break between the release and Christmas. The other thing is to go look at the bugs that have been filed since the release, to make sure you can live with whatever is out there. Those two things are critical before you decide to press the button. Do we have some time left? Oh, we're done.