Okay, so let's go ahead and get started, I guess. Today we're here to talk about upgrading OpenStack. In theory, we're not going to break everything. And in theory, we're not even going to break Neutron, which is a big feat.

Just a little bit of introductions. I'm Clayton O'Neill, principal engineer at Time Warner Cable. I focus on automation, CI/CD, deployment, that sort of stuff. And I'm Sean Lynn, also a principal engineer at Time Warner, and I focus on Neutron and Nova.

A little bit of background: our OpenStack team formed with about four people about two years ago. Not about four people -- exactly four people. We did a proof of concept implementation on Havana. After the Atlanta summit, we decided to quickly move to Icehouse and VXLAN-based networking before we went to production last summer. Since then, we've done upgrades to Juno and then Kilo. As you can see here, these are the versions that we're currently running in our production environments. This talk is going to focus on our last round of control node upgrades, which included Nova, Neutron, Glance, Cinder, and Heat. Since our Kilo upgrade, we've actually upgraded Heat to one of the Liberty betas by moving it into a Docker container. Horizon and Keystone we're not going to cover; we were already on Kilo for those.

So there are a few core tenets that we feel are important for doing OpenStack upgrades. One of the big ones is: don't fall behind. I'm not sure how these guys got on the treadmill, but if they fall off, they're never getting back on. We plan on upgrading every six months, and we think you probably should too. Part of this is just because that's the only tested path, but also, with the changes happening around rolling upgrades and lazy database migrations, skipping releases is sometimes just not possible anymore. A good example of this is the Nova flavor migrations that came along with Kilo; that's a step you have to do on Kilo before you can move to Liberty. You also want to automate everything, because if you don't automate everything, you're going to end up feeling kind of like this guy. You want to test this over and over and over again and get your process down. And one of the really important things is to figure out what the impact on your customers is going to be. That was a big focus for us on our Kilo upgrade.

So our team gave an upgrade talk at Vancouver; some of you may have been to that. If any of you wanted to hear us talk about upgrades twice in one year, then thank you. We're going to try not to cover too much of the same ground. We're going to talk about updates to that approach, things that are new and how we handled them with Kilo, and also some of the problems we ran into with Kilo.

So when we were deciding the timing for our Kilo upgrade, there was one major feature that we were looking for: AMQP heartbeats. Like most people using OpenStack, we're using RabbitMQ for all of our inter-service communication, and like most people doing that, we've had a lot of problems going down that path. The biggest remaining problem we had with Juno was that if anything went wrong with Rabbit, or network connectivity, or what have you, we would frequently see services thinking they were connected to RabbitMQ when they really weren't. Nova compute was a big problem with this: we would have instances get scheduled to a compute node, and it would never pick up the request.

So AMQP heartbeats are a protocol-level feature that allows RabbitMQ and the clients to talk to each other on a regular basis and check in, and if either of them goes missing, they can clean everything up, reconnect, and get back into a good state. This was an experimental feature added in Kilo, but we'd heard good things about it and we knew people that were using it.
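For reference, turning heartbeats on is just an oslo.messaging config change in each service. Here's a minimal sketch of the knobs, using crudini against nova.conf as an example; the tool, file path, and values are just illustrative (we actually manage this through Puppet), so check the oslo.messaging docs for your release:

```bash
# Enable AMQP heartbeats for nova (repeat for neutron, cinder, glance, heat, etc.).
# heartbeat_timeout_threshold=0 disables the feature; anything greater than 0 enables it.
crudini --set /etc/nova/nova.conf oslo_messaging_rabbit heartbeat_timeout_threshold 60
crudini --set /etc/nova/nova.conf oslo_messaging_rabbit heartbeat_rate 2

# Restart the service so new RabbitMQ connections pick up the setting.
service nova-conductor restart
```

Conductor is just one example; every service that talks to Rabbit gets the same treatment.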
As with any process, you have to know your requirements before you start down that path, and that's a balance between your technical abilities and your customers' needs. What we would like to do is forklift everything -- just replace the whole thing, which is super easy -- but there's a ton of downtime involved, so obviously customers don't like that. Another option is more like a pit stop: a shorter full outage where you change everything out really quickly. We also like that, because we're able to take advantage of that downtime to avoid having to play tricks with the system and to overcome a lot of technical liabilities. The problem is our customers don't want any downtime. They want something like this. There's a great YouTube video of this online; these guys actually change both of these tires in the course of five minutes. Our problem is actually more difficult: our problem is changing all four tires at the same time while the vehicle's rolling.

To that end, we had these basic requirements. They're the same ones we had for Juno, and they're the same tenets we have to follow. An API outage is okay -- we can take 10 or 15 minutes of customers not being able to hit the APIs -- but nothing else can go out: no loss of connectivity between instances, no storage outages, and no rebooting of customer workloads.

We did make some improvements based on lessons learned from our Juno upgrade. We changed the time of day that we upgrade. What was comfortable for us, which was later in the evening but not so late, was actually super uncomfortable for our users, so we had to move our upgrades to the middle of the night, around a 2 a.m. time period. We also improved by testing with production data from both systems. We had started running into a few issues here and there from dataset size differences, and the actual data needed to be tested, so we pulled that back into our development cycle for the upgrade. And one of our chief faux pas of the Juno upgrade was that we did have some network outages and customers were impacted, so we spent a lot of time studying why that happened and trying to prevent it for the Kilo upgrade. Back to the sizing issue I just talked about with production data: we also added size to our development workloads to try to simulate our production environment, and ran test cycles through that before we ever got to the full-size environment.

So we talked about how important upgrade automation is earlier, and I just want to touch on that briefly. All of our upgrade automation is done using Ansible to drive Puppet. Puppet ends up being responsible for package upgrades, config changes, service restarts, that sort of thing, and we use Ansible to handle pretty much everything else -- the orchestration and ordering during the process. This is something we covered in a fair amount of depth in the Vancouver talk, if you're interested in more detail. One of the advantages of doing it this way for Juno is that when we went to Kilo, we were able to basically take that, make a snapshot of it, and start from there.
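Conceptually the split is simple: Puppet knows how to converge a single node onto the new release, and Ansible just decides who goes when. Our real playbooks do a lot more (backups, agent juggling, validation gates), but the shape of it is roughly this; hostnames and the smoke-test script here are hypothetical:

```bash
#!/bin/bash
# Rough shape of the orchestration; the real thing is Ansible playbooks that
# also handle backups, router moves, and the API outage window.

FIRST="ctrl01.example.com"
REST="ctrl02.example.com ctrl03.example.com"

# puppet agent --test uses detailed exit codes: 2 means "changes applied OK",
# so treat 0 and 2 as success and anything else as failure.
run_puppet() {
    ssh "$1" 'puppet agent --test'; rc=$?
    [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]
}

# Converge the first control node onto the new release, then gate on a smoke
# test before touching anything else.
run_puppet "$FIRST" || exit 1
./smoke-test.sh "$FIRST" || exit 1

# Bring the other control nodes up to the same state, one at a time.
for node in $REST; do
    run_puppet "$node" || exit 1
done
```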
And we were actually able to reuse almost everything that we had done for Juno and then add on to it, and we're looking forward to reusing it again.

So with that, let's talk a little bit about what the actual upgrade process looks like. This is what our starting point looks like for our control cluster. We have three control nodes in each of our environments. Each node hosts all the services that we're going to be upgrading, plus a bunch of virtual routers. They're also part of a shared MySQL and RabbitMQ cluster, and external users talk to these nodes via a hardware load balancer. What's not shown in this diagram is that we also have an HAProxy-based load balancer that we use for all the internal communication; we'll talk about why that's important a little bit later.

So let's go through the steps of the actual upgrade, and keep in mind that this is all automated using Ansible. The goal of the first step is really to take two of the control nodes out of service and get that first node upgraded. We start by taking MySQL down on two of the nodes, and we do backups in case we have to do some sort of rollback.

Next we need to get all the routers off of that first control node. What we're trying to avoid here is that, until Liberty, there's a problem with the OVS agent where whenever it starts up, it flushes all of the network flows, which causes an outage. So our goal is to never run an upgrade on a node that has an OVS agent doing work we still need. To do that, we're using L3 agent failover. We shut down the L3 agent on the first control node, and after some period of time the second and third ones notice that, hey, the first one went away, and they start building out all the plumbing that needs to exist for those routers, and the traffic all moves over. This leaves us in a situation where we're functional: we have one node up, a cluster of one.

The next step is that we administratively disable the L3 agent on that first node via the Neutron API. The reason that's important is that when we actually finish the upgrade, that L3 agent is going to come back up, and we don't want all the routers to move back over automatically. And the last thing before we actually start our API outage -- this is one of the improvements we made for Kilo -- is we get a list of all the instances and the floating IPs associated with them. The reason we do this is that we want to monitor connectivity to those and be able to report on whether or not we're losing it. This was really instrumental while we were developing this process, for figuring out what the impact was going to be for our customers.

So we start the API outage by turning off the external load balancer. We did run into some issues here; we'll cover that later. Then we shut down OpenStack services on all of the control nodes. The goal here is to not have any Juno services running and trying to talk to a Kilo database during the upgrade. The routers continue to function because that's all kernel-level functionality, so there's no OpenStack software that has to be running to handle it. Then, to actually kick off the upgrade, we run Puppet on that first control node, and that goes through and upgrades all the packages, makes the config file changes, and restarts the services.
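To make that floating-IP connectivity monitoring concrete: it doesn't have to be anything fancy. Here's a minimal sketch of the idea; the file paths are made up, fping is assumed to be installed, and our real version is an Ansible task rather than a shell loop:

```bash
# Snapshot the floating IPs we expect to stay reachable through the window.
neutron floatingip-list -f value -c floating_ip_address > /tmp/fips.txt

# Every few seconds, log any floating IP that stops answering pings, with a
# timestamp so we can line outages up against the upgrade steps afterwards.
while true; do
    fping -u -f /tmp/fips.txt 2>/dev/null \
        | sed "s/^/$(date -Is) unreachable: /" >> /tmp/fip-outages.log
    sleep 5
done
```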
There are two things to note about that Puppet run. Because we have our external load balancer turned off, we set the OS_ENDPOINT_TYPE environment variable to force the clients to use the internal URL, so they go through HAProxy instead of the public endpoints. And we also set the Nova API compatibility flag so that our Juno compute nodes will still be able to talk to the Kilo control services.

Once we get to this point, we run a simple smoke test using the CLI clients to make sure that basic functionality is there before we move any further. Once that's successful, we want to start getting everything back to normal, because we're still in an outage. So at this point we can re-enable the L3 agent on this box; it's up and running, and we've validated base functionality. It starts trying to figure out who's still alive, and after some short period of time it realizes that control nodes two and three have disappeared and they have all these routers, so it starts plumbing everything out and all the routers get moved over. There's some complexity involved in this; we'll cover it a little bit later. Then we re-enable the load balancer. At this point we're out of the outage. We're back to a single-node cluster, but we're running Kilo on that node, so that's where we wanted to be. The length of this outage is basically tied to how long it takes to move those routers, how long it takes to actually install packages, and how long it takes to run database migrations; the rest is pretty minor in terms of time.

So we can relax a little bit at this point, but we still have two more control nodes to upgrade. The first step here is to get the MySQL cluster back up and running, so we bring MySQL back up on those other two nodes and they rejoin the Galera cluster. The nice thing here is that Galera will automatically make sure that any changes that occurred during the upgrade get replicated to those other two nodes, so we don't have to worry about running database migrations, or the time required to do that, on them. Then we run Puppet on each of those two nodes. It does the same thing it already did on the first one: brings them up, fixes the config files, gets us into a good state.

At this point we're nearly done. The problem we have is that all of our routers are still on one box, and that's not a good place to be. We already have tools for rebalancing those while trying to avoid high-profile tenants. There are situations where that can cause some impact, a little bit of missed traffic, so we rebalance them, and at that point we're pretty much done with our control nodes.

This is where we start doing a lot more testing. We have some canary instances running on compute nodes, and we make sure live migration still works. We have a pretty extensive regression test suite; we run that and make sure that volume attach and all the other complicated things still work. And we spend a little bit of time going through all the logs, just to make sure we don't have any services that are spinning or logging errors we didn't expect, that sort of thing.

So to finish the upgrade, we need to get the compute nodes upgraded. That's a lot easier than the control nodes, but it's still something we want to be careful about. We evacuate a couple of the compute nodes and bring up some instances on them for testing, and we validate base functionality again. So does live migration work? Do volume attach and detach and those sorts of things work?
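That validation is driven by our regression suite, but the flavor of it is just exercising the scary paths from the CLI. A rough sketch, with hypothetical names and a CirrOS image we happen to keep around -- this isn't our actual test suite:

```bash
# Boot a canary on one of the freshly upgraded compute nodes.
nova boot --flavor m1.small --image cirros-0.3.4 \
    --nic net-id="$TEST_NET_ID" smoke-canary

# Once it's ACTIVE, exercise live migration and check where it landed.
nova live-migration smoke-canary
nova show smoke-canary | grep -E 'status|OS-EXT-SRV-ATTR:host'

# Round-trip a volume attach and detach through Cinder and Nova.
VOL_ID=$(cinder create 1 | awk '/ id /{print $4}')
nova volume-attach smoke-canary "$VOL_ID"
nova volume-detach smoke-canary "$VOL_ID"
```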
Once we've verified that, we just go ahead and do a normal deploy to those compute nodes, and that upgrades the packages. This does cause a short outage, because the OVS agent is going to get restarted, but because we have so many fewer flows on a compute node, it's much briefer. We could live-migrate instances off first to avoid it, but it's really choosing which type of impact you want to have on customers. All told, the upgrade process took less than three hours per region, and a lot of that is validation. We did the two regions on separate nights -- unfortunately, as Sean mentioned, in the middle of the night. The last thing we have to do is merge a change to turn that Nova API compatibility flag back off, and we just rolled that out in our next normal deploy.

In designing the Kilo upgrade, we had to make a couple of changes to reduce the networking outage window. We're using VXLAN with OVS, if you can't tell from this slide. The first thing we realized is that after you take that layer 2 agent down, there's a timer on the OVS MAC-learning flows that starts counting down on active flows. You have about five minutes total, but flows start to trickle out and die within that five minutes, and customers start losing traffic. It's very confusing the first time you run into it, because it's just sporadic. So what we did is we went into OVS on all the control nodes and tweaked that timeout. You can see the hard timeout of 300 here; we bumped that five-minute timeout up to about half an hour. And then we had a wait period for those active flows to be re-learned, so we just waited a little over five minutes. That drastically reduces the amount of outage you actually experience.

There were still a few things that were unavoidable. As Clayton mentioned, when you start up the L2 agent, at least until Liberty, it flushes all those flows out -- not the shutdown, the startup flushes all those flows. So we changed our Ansible and Puppetry to avoid that at all costs, or to do it only at very deliberate points, so we could pinpoint that outage. The third thing we ran into is that when you have a massive number of legacy routers on a single control node -- Kilo and before -- transferring all those routers to a different control node, or restarting, is a very lengthy process. We have 50 to 60 routers on a control node. That's about 2,500 flows, and rebuilding those, plus all the namespaces and all the plumbing underneath the hood, takes about 10 to 15 minutes. That's a massive user outage.

The other thing we did, because we couldn't use HA routers -- the VRRP-based HA routers -- since we're using L2pop and there's an existing bug there that's only just been fixed: we decided to abuse L3 agent failover. It's not designed for this; it's really a DR mechanism. It's designed for when an L3 agent drops off the network: a timer counts down, and then the routers automatically get rescheduled and distributed across your other network nodes. It's not designed for what we were doing, but it pre-populates all the flows, sends out gratuitous ARPs, and you get very little user outage. The problem with it is that it assumes the original control node is down hard, so you end up in a situation where, for a lot of these routers, two copies of the router exist.

So here's how we overcame a couple of these things. Basically, I already talked about this: we upped the OVS flow timeouts and then did that wait process, and we do not restart the OVS agent.
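For the curious, that timeout tweak is aimed at the learn() rule the OVS agent programs on br-tun; the MAC entries it learns into table 20 inherit that rule's hard_timeout. This is only the shape of what we scripted -- check your own dump-flows output before trusting the sed, because the exact flow spec varies by release:

```bash
# Look at the learn() rule the OVS agent installed in table 10 on br-tun.
ovs-ofctl dump-flows br-tun table=10

# Re-add the same rule with a longer hard_timeout (1800s instead of 300s).
# add-flow on an identical match replaces the existing rule, and the MAC flows
# in table 20 pick up the longer timeout as traffic re-learns them.
ovs-ofctl dump-flows br-tun table=10 \
    | grep 'learn(' \
    | sed -e 's/.* \(priority=[0-9]* actions=\)/\1/' \
          -e 's/hard_timeout=300/hard_timeout=1800/' \
    | while read -r flow; do
        ovs-ofctl add-flow br-tun "table=10,$flow"
      done
```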
And I just made a note down at the bottom: these procedures were what we had to go through as of the Kilo upgrade. As of Liberty, that flow flushing doesn't happen anymore -- there's a proper transaction mechanism that minimizes it -- and we should soon have the HA router L2pop bug fix as well.

Where I wanted to get to is that abusing L3 agent HA required us to go onto that original control node that we turned down, the one all the routers were migrated off of, and hand-delete all the namespaces, all the OVS ports, all the flows related to those routers -- everything. And by manual, I mean we programmed Ansible to run a job to do that for us.

So how did all of this upgrade and the upgrade testing go? A good analogy, ironically, is Tropical Storm Kilo. It kind of wandered across the Pacific over three weeks, gradually losing steam, and didn't really go anywhere. The only place the analogy falls apart is that we wish we'd only had a three-week period to do all of our testing; ours took much longer.

So here are some of the problems we ran into during the upgrade. The Nova configuration for Cinder changed locations between Juno and Kilo. It's partly a documentation glitch that we didn't pick up on -- it was lightly documented, let's put it that way. The way we ran into it was that our first data center went great. In the second data center, we ran through our full production deploy and transition, and then we tried to create and mount a Cinder volume and everything failed. The problem was that Nova didn't know which data center it should be looking in, so it created the volume but then looked in the opposite data center for the actual volume. This is a release-note thing: it was actually deprecated in Juno, but we missed it, and it's not documented in the Kilo release notes either, so that was a doc miss.

Next, python-neutronclient. We have hundreds and hundreds of tenant networks, and when that list gets super long, one version of python-neutronclient would just return an error. There was actually a bug fix for it, but because of the Canonical packaging it wasn't in the release that we got, so we had to downgrade to avoid that bug. That's since definitely been fixed. As part of the python-neutronclient downgrade -- and this happened only in our development environment -- the requirements lists on the packages got a whole bunch of things uninstalled, and one of those was Nova. Again, this was development; we never made it to production with this. But as part of the Nova upgrade, flavors need to be migrated, and this happens lazily. It's actually a pretty interesting way to do it: as a flavor is accessed, the database translation and migration take place. It seems really cool until you've had some hiccup that partly migrates things and leaves bad data behind. Again, this was only in development, but we had to manually fix the database to move on.

One of the things that really bit us for quite a while after the upgrade: we had a whole bunch of tenant instances that started losing IPs. People were calling, going, oh, my instance, I can't get into my instance. What we found out is that a new feature, automatic DHCP agent failover (allow_automatic_dhcp_failover), defaulted to on and was super lightly documented -- meaning not mentioned at all in the release notes -- but it's pretty buggy, and for us, we don't need it. It manifested itself through a lot of false positives: Neutron would try to rebuild the DHCP agents on another network node, fail partway through, and then those tenant networks wouldn't have instance DHCP at all. It took us a couple of weeks to figure that one out and track it down; once we turned it off, everything calmed down.
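If you want to opt out of that behavior the way we did, it's a single neutron.conf option. A minimal sketch, assuming crudini and Ubuntu-style service names; in practice we roll this out through Puppet:

```bash
# Stop neutron-server from automatically rescheduling DHCP agents for networks
# whose agent looks dead; we'd rather handle that ourselves.
crudini --set /etc/neutron/neutron.conf DEFAULT allow_automatic_dhcp_failover False

# Pick up the change.
service neutron-server restart
```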
So, as I talked about before, we had a validation process in place using the CLI clients to make sure that base functionality was there for all the services, and our plan was to turn off the external load balancer and use the internal endpoint for that. We ran into a problem there. You can change which endpoint any of the CLI clients use by setting an environment variable or a command line option, and we were doing that in all the places we should have been. But the neutron and cinder clients just completely ignored it, or would use it only when they were talking to Keystone, which is not really very helpful. This broke our Puppet runs and it broke our smoke-test stuff. Unfortunately, we found this problem really late in the process, and we ended up deciding to leave the external load balancers on. That's something we're going to test better when we do future upgrades, so we don't have this issue again.

We also ran into some schema problems with Glance. In Kilo, Nova started using the v2 Glance API. The v2 Glance API does schema validation on all the objects it returns, which is a good idea, but the v1 API doesn't. What that meant is that you could create images via the v1 API that the v2 API couldn't actually manipulate. We originally saw this with a bunch of images where, in the database, the description of the image was null, and the v2 API thought descriptions couldn't be null -- it wanted them to be empty strings. So what would happen is you'd go to boot an instance off one of these images, and some instances would work and some wouldn't, and it was not at all obvious what was going on. There was also no way to tell Nova to go back to the v1 API. We ended up opening a bug upstream and worked with Flavio from the Glance team; I think we actually got a fix for this the same day, and the Canonical guys pulled in packages. It was a really quick turnaround, and those guys were all really great. We ran into another problem that was really similar later on. There are also some attributes that are handled via a schema JSON file, and this time the attributes were kernel ID and ramdisk ID. We had a lot of images where these were null, and the schema said they couldn't be null. The fix was just to update that config file, and that was the fix that had already been made upstream, so we just pulled it in.

We've been really good at finding MySQL bugs whenever we do our upgrades. This was an interesting one: in our first dev environment upgrade, we found a problem where the Nova migrations want to take a column and convert it from being nullable to NOT NULL. This was failing because in MySQL 5.6 there's a bug that prevents you from doing that if there's a foreign key constraint on the column, just across the board. The weird thing about this isn't that we ran into the bug; the weird thing is that it didn't happen in all of our environments. We do have a support contract with Percona, so we went to Percona, explained what was going on, and went back and forth a lot. Eventually they came back, and we had a bug opened up with Percona.
They got that fixed and it's been pushed upstream, so if you're using MySQL 5.6, you probably want to be running a version with that fix.

Another database problem: you see a really horrible error message like this, and it's pretty opaque. We ran into it in the Cinder database migrations, but it's kind of a general problem. It's important that the sort order -- the collation -- on the database and all the columns and tables matches whenever foreign key constraints are involved. The specific problem we ran into is that the Puppet modules we use had one UTF-8 sort order, and it changed when they standardized across all the modules at some point. So when we went to run these migrations, the database was set one way, all the old tables were set another way, and whenever a migration created a new table and tried to set up the relationship between them, it would just fail completely. This is a good thing to audit: if you see this weird error message and dig into it, you may find you've run into the same thing. And it can happen on any migration; it's not Kilo-specific, it's just a database issue.

Another issue we ran into is that the Keystone middleware was moved into its own package in Juno, but Juno still supported the old library names. In Kilo, those old library names were removed. That seems reasonable, but it wasn't mentioned in the release notes. The control nodes we had that were the oldest -- the ones we had upgraded from Icehouse -- still had the old names in place, and we had missed that after we finished our Juno upgrade. It was an easy fix once we found it, but issues like this are particularly hard to find, because the normal mechanisms in oslo.config for reporting deprecations don't work here: this is kind of a free-form field, so it can't check it.

Last but not least, we found this problem after our first prod upgrade, at two o'clock in the morning, after we had turned the API services back on. There's a new feature in the Nova scheduler called scheduler tracks instance changes, and what it's supposed to do is give scheduler filters more information for making scheduling decisions. This is the commit message for it. The way it works is that whenever the scheduler starts up, it starts polling all your compute nodes in batches of ten. In our experience, what that meant is that while it was doing that, it was chewing up 100% of a core and nothing else was going on in the Nova scheduler. Originally we saw this as RabbitMQ getting disconnected, and we think that was probably heartbeats getting starved out. But even after we turned off heartbeats to rule that out, we still saw instances just not getting scheduled. We don't use any scheduler filters that need this right now, so we turned it off. The biggest issue with this is -- you can see there's a DocImpact tag here that's supposed to open a bug upstream -- this never got into the release notes. We found out about this quote-unquote feature during our prod upgrade.

So after we ran into all those issues, this is about how we felt. For those of you that haven't seen Groundhog Day, you should. Just to summarize: a lot of the problems we ran into were really because we didn't look closely enough at the release notes. We read all the Kilo release notes and felt like we were good. We didn't read the Juno release notes before doing our Kilo upgrade, and that was a big part of our problem.
Things that were deprecated in Juno just weren't documented as removed in Kilo, so we didn't think about them. MySQL has bugs, and we're good at finding them. Part of the reason we do upgrades is that we want new features and bug fixes, but at least two of the problems we ran into were because of new features that didn't work well and were poorly documented. Buggy features are one thing, but if I don't even know about a new feature, it's not actually all that helpful to me. To give credit where credit's due, some projects are better than others about doing release notes, and it does vary from release to release. The Cinder guys did a really great job on the Kilo notes, and if you go look at the Liberty release notes, the Nova guys did a wonderful job there also.

So with that litany of issues we ran into, you might think, well, is this really worthwhile? Well, one, as we talked about before, we don't really have much of a choice; we have to upgrade to stay current. But after resolving all these issues, stability has improved significantly, and AMQP heartbeats have been a large chunk of that. Most of the issues we'd run into there in the past have been resolved. We do still see some intermittent problems with nova-compute getting disconnected during the same sorts of events as before, but really not anywhere else.

So just to wrap up, let's talk about upgrading to Liberty real quick. We've already started some of that work. We've been working on getting our Puppet modules up to date, and we're pretty close to master on all of our modules except for Keystone, where we've got a couple of things to work out. We don't really know what the timing for our Liberty upgrade is going to be yet, but I'm pretty sure it'll be before Austin. We also know that we're just going to run into weird problems. Kilo really taught us that. We thought we were experts after doing Juno, and Kilo kind of taught us that that's not a thing that exists.

We're also going to continue moving services into containers. We've had good luck with that with Heat and Designate; it's allowed us to upgrade those services -- or, in the case of Designate, not upgrade it -- independently. The issue there for us is really that we want to run multiple services on a node, and the Python dependencies make that really hard unless you have some way of isolating them from each other. Another big part of it is just that installing the packages takes a long time; with containers we can pre-stage that ahead of time and then just shut down the old container and bring up the new one, and that saves us some time during our upgrades. Also, a lot of the complexity in our upgrades has to do with working around those OVS agent issues where the flows get cleared, and that's fixed in Liberty, so we're really looking forward to that. And lastly, we're hoping to move to HA routers. Once we get onto Liberty, the L2pop bug that Sean was talking about is addressed there, and if we don't have to move routers around during our upgrades, we feel like that's going to make things go a lot more smoothly also.

So that's what we've got. We appreciate everybody coming. I know you're all probably hungry and want to go to lunch, but if anybody's got any questions, we'd be glad to answer them. We do have a mic up here, but I can repeat your question in the worst case. Yes, sir. So, I've followed DVR, and I mean, it's definitely an option sometime in the future; I just don't quite feel it's there yet for our use case.
There are still some bugs in it that need to be worked out, so we're taking the safe approach. I think somebody in the back had a question. He may be really hungry. Okay. Why aren't we using DVR? Yeah. No, we've looked at Kolla and we're following Kolla. There are some features we want; specifically with Designate, we have a custom Designate sync, and today that's kind of hard to get into the container. They have that on their roadmap, and it's something we've been talking to them about. So ideally we'd like to start using Kolla, but we're not there yet.

Okay, well, I guess everybody -- oh, we've got one more. Thank you very much. Oh, yes, we try to. So the question was, do we upstream the bugs that we find? A lot of times we actually find that there's already a bug for it; that's how we realize the problem we found isn't just us. So for a lot of these, we went and looked and it was already fixed. The Glance one that we found had not been reported yet, and I think one of the reasons we were able to get such a good result back so quickly is that it was blocking our upgrade and the Glance team was really interested in helping us.

I'm sorry, what was that? Dev discussions. So specifically on the DHCP failover issue, we did bring that up on the mailing lists. We didn't get a lot of feedback on it at the time. There have since been other people that have run into the same problem, and there are upstream bugs against it. One of the ways we actually found that that was the issue we were running into was going and looking on the master branch at the number of fixes that were in the DHCP failover code, and seeing that only a couple of those had actually been backported. So if we get stuck, that's kind of where we end up. Well, thank you, everybody. Thank you.