Okay. Hello everybody. My name's Alexander Dibbo, and I'm here today to talk about a four-version fast forward upgrade that we did earlier this year at STFC. Initially I'll introduce STFC and SCD, because this is the first time we've presented at an OpenStack Summit. I'll then give you a bit of a rundown of what the STFC cloud is, then go through the various stages we went through in planning and implementing our upgrade, and what we learned from the process.

So, what is STFC? STFC is one of the UK's research councils: the Science and Technology Facilities Council. It provides large-scale facilities for science, and it's also one of the funding bodies for research in the UK. We have around 1,700 scientists directly employed by STFC, and we provide access for around 7,500 scientists within the UK and internationally as well. We run science and innovation campuses across the UK with various facilities. The one I'm based at is the Rutherford Appleton Laboratory, which is just outside Oxford. There's also the Daresbury Laboratory in the north of England, and some other interesting facilities, like an observatory, and a research facility in a mine in the north of England, which is an interesting site.

On STFC's science programmes: we support a lot of particle physics, so we contribute to work on the Large Hadron Collider. We're in various astronomy experiments as well, so we're involved with the European Space Agency and the Square Kilometre Array, and we have involvement with nuclear physics too. As I said, one of the things we do is provide large-scale facilities, so we have neutron sources, high-power lasers and light sources across our various campuses.

That leads into the Scientific Computing Department, which is probably of more interest to people here. SCD is the scientific computing department for STFC. It's more than just IT: we don't provide corporate services or anything like that to STFC users; we're focused purely on what scientists need to deliver their research. We've got around 180 staff supporting all of STFC's users, with expertise in application development, hosting services and providing large infrastructure.

On to the infrastructures. For the STFC facilities specifically, we host large amounts of data, rising from about a petabyte in 2013 to approximately 14 petabytes this year, and we provide analysis infrastructure for that as well: an HPC facility and the cloud service, which I run and will go into in more detail later. We have various scientists and developers, about half scientists and half developers, so they know how research is done and how to get the most out of the software and the infrastructures we provide. There are various simulations we do in the computational biology and life sciences area, and various things in the engineering and environment areas. The diagrams you can see here show, in the top left, a nuclear reactor and how a heat load is distributed in it; the second is how air moves around in a turbine; the third is how blood moves through a blood pump; and the fourth is how water moves in a washing machine, so a lot of diverse use there. Then there's computational chemistry: the usual kinds of things you'd expect, from materials science through to how it applies to biology and physics as well. And then we've got the Large Hadron Collider and the computing we provide for that.
We're one of the Tier 1 sites, so we provide 24,000 CPU cores and 64 petabytes of storage, about a third of that being disk and the rest on tape. There's a 100-gigabit network, and we've got a 30-gigabit direct optical link to CERN as well. Then there's one of the other large infrastructure facilities we run, called JASMIN, which has 38 petabytes of high-performance storage. It's what we call a super data cluster, a cross between a supercomputer and an HTC farm, so it's really focused on putting high-performance data next to high-performance computing. Then there's a new project which started earlier this year, called DAFNI. It's a system for aggregating the various models we've got in the UK for physical infrastructure, so roadways, railways, all those kinds of things, and how they interact if you make a change in one. So, things like planning new railways and that kind of stuff.

On to what you probably all care more about: the STFC cloud. This initially started off, before I joined STFC, as an experiment done by a few graduates using a product called StratusLab. That proved somewhat successful, and certainly showed enough potential for us to pursue things. In 2014 we deployed the SCD cloud for the Scientific Computing Department, which was based on OpenNebula and primarily intended to support internal users. Because of the success of that, we got a lot more users and a lot more use cases coming on board, which we really couldn't meet without significant rearchitecture. At the same time as doing that rearchitecture, we moved to OpenStack, because of the much better support in the community and a much greater feature set.

So who uses the STFC cloud? There are internal users within SCD, everything from development and testing infrastructures through to people hosting services for European projects and for work within the UK, all kinds of things. There are our partners in the STFC facilities: the ISIS neutron source, the Central Laser Facility and the Diamond Light Source. There's the LHC Computing Grid, which I mentioned we're a Tier 1 site for; they also use the cloud for some of their capacity. There's the Ada Lovelace Centre, which again is a relatively new project; it aims at simplifying, end to end, the software and infrastructure required for the STFC facilities. There's the IRIS collaboration, which is a UK-wide collaboration to deliver computing infrastructure for STFC science. And then there's the DAFNI project, which I mentioned, which is probably going to be using the cloud in one capacity or another.

So, on to the cloud architecture. We're now running OpenStack Queens; prior to the upgrade process I'm going to talk about, we were running Mitaka. We use Ceph for RBD storage. We deploy using a configuration tool called Quattor, which originally came out of the high-energy physics community; it's one that has been used at STFC for a few years, and it's the tool we chose when we deployed. We use Scientific Linux, again because of the close ties to the high-energy physics community. For the OpenStack packages, we currently use the RDO packages provided through the CentOS repositories. In terms of the pure architecture, we run all of the OpenStack services you would generally expect us to run: Keystone, Nova, Neutron, Glance, Horizon, Cinder and Heat. And we use Rally for functional testing, to ensure we keep everything up and running. We run three instances of everything to provide high availability.
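As a concrete illustration of the kind of functional test being described (a sketch, not STFC's actual suite; the flavor and image names are placeholders), a minimal Rally task that repeatedly boots and deletes a server could look like this:

```shell
# Hypothetical minimal Rally smoke test: boot a VM, wait for it to go
# ACTIVE, then delete it. Flavor/image names are placeholders.
cat > boot-and-delete.yaml <<'EOF'
NovaServers.boot_and_delete_server:
  - args:
      flavor:
        name: "m1.small"
      image:
        name: "cirros-0.4.0"
    runner:
      type: "constant"
      times: 5
      concurrency: 1
EOF

rally task start boot-and-delete.yaml
```

Run against each intermediate version, a handful of scenarios like this catch most "the API is up but nothing actually works" failures.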
We put those behind load balancers, which use HAProxy and Keepalived to keep everything highly available; there's an illustrative fragment of that setup at the end of this section. Likewise, we run MariaDB with Galera replication to keep it HA, three RabbitMQ servers, and three MongoDB instances for Ceilometer. For the networking, we have a routed network, to give us the best performance we can possibly manage to get. It's a leaf-spine network; we run Cumulus Linux on our routers, and we use VXLAN and EVPN to do overlays for our tenant networks. We use KVM on the hypervisors and Open vSwitch for the hypervisor networking. So that's 38 service nodes, 166 hypervisors and 66 storage nodes, which delivers around 4,000 usable CPU cores at present and just under a petabyte of usable storage.

As I mentioned, we expose a number of different networks to our users. There's the actual physical network, which is exposed to our admin users for testing and debugging of things, but not to regular users. We have various EVPN VXLAN networks, which are a thing provided by Cumulus: a services network for internal OpenStack services to run on, an internal network for VMs that only need to be reachable from within STFC, and an external network for floating IPs for tenant networks. Then we have VXLAN tenant networks for the actual users.

So, now on to the real meat of things: the fast forward upgrade. Starting off with what a fast forward upgrade is. From the documentation, a fast forward upgrade is an offline upgrade which effectively runs the upgrade processes for all versions of the OpenStack components, from the originating version through to your desired final version. In short, it's a fully automated multi-stage upgrade.

In preparing for this, bear in mind that we run Quattor, which for most intents and purposes can be considered an in-house tool: there's a relatively small number of users of it, and we are the primary developers of the OpenStack part of it. So we started off studying the documentation; updated Quattor's template libraries and configuration as per the documentation; backed up our pre-production instance; updated the local configuration to use the updated Quattor configuration; scripted any necessary changes; tested using Rally to check that everything worked; then fixed any problems, pushing them back upstream if necessary, or into our configuration, whatever we needed to do; rolled the database back to before we started the upgrade; and then went through it again to see if it worked automated this time. If that didn't work, we went through the process again until we got it working. We did that with each of the upgrades, going from Mitaka through Newton, Ocata and Pike, and finally on to Queens. There's a sketch of that loop at the end of this section.

In preparation for the upgrade, we also wanted to minimise the downtime for our VMs and their workloads as much as possible. To prevent any interruption to those, we first did a rolling upgrade of all of our hypervisors to the latest versions of QEMU and Open vSwitch, the things which would cause network dropouts or cause VMs to lose connectivity, anything along those lines. We went with the versions from Red Hat Enterprise Linux for QEMU and libvirt, and we went straight to the Queens version of Open vSwitch, as provided in the Queens repo. We then blacklisted those packages in all of our repositories so that we wouldn't accidentally upgrade them again during the process.
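That blacklisting is just repository configuration; a sketch of the idea, assuming yum-based repos as on Scientific Linux (the package globs are illustrative):

```shell
# Hypothetical package pinning: stop QEMU, libvirt and Open vSwitch being
# dragged forward by routine updates once the rolling hypervisor upgrade
# is done. Assumes [main] is the last section in /etc/yum.conf.
cat >> /etc/yum.conf <<'EOF'
exclude=qemu-kvm* libvirt* openvswitch*
EOF

# A deliberate, one-off upgrade can still bypass the blacklist later:
yum update --disableexcludes=all openvswitch
```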
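Going back to the preparation loop described a moment ago, it can be summarised in shell-flavoured pseudocode like this (the helper names are invented; they stand in for the Quattor runs, database snapshots and Rally suite from the talk):

```shell
# Hypothetical outline of the per-version rehearsal loop on pre-production.
for version in newton ocata pike queens; do
    snapshot_databases                            # dump every service DB first
    until apply_quattor_templates "$version" \
          && rally task start functional.yaml; do
        fix_and_upstream_problems                 # patch templates/config, send fixes upstream
        restore_databases                         # roll back to the pre-upgrade snapshot
    done
done
```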
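And for the HAProxy and Keepalived setup mentioned at the start of this section, a minimal fragment for one API endpoint might look like the following (illustrative only; all addresses and hostnames are made up):

```shell
# Hypothetical HAProxy fragment: three Keystone backends behind one
# frontend on a virtual IP.
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
listen keystone_public
    bind 192.0.2.10:5000
    balance roundrobin
    server ctl1 10.0.0.11:5000 check
    server ctl2 10.0.0.12:5000 check
    server ctl3 10.0.0.13:5000 check
EOF

# Keepalived floats that VIP between the load balancer pair so either
# one can fail without taking the endpoint down.
cat >> /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.0.2.10
    }
}
EOF
```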
So, that allowed us, when we came to doing the actual upgrades, to upgrade the hypervisors without interrupting any VMs.

In terms of testing the upgrade, we had the Rally tests we'd built as part of preparing the upgrades, to ensure that all the functionality we knew about, and that users had told us they were using, worked. We then released the fully upgraded pre-production instance to our users to test against, and none of our users took us up on this, which they could have regretted later. It wasn't too bad, fortunately.

So, the actual upgrade process. We had a three-day maintenance window to do it in, at a time when there were going to be some interruptions in other areas for various reasons. One of the things happening at the same time as this was some power testing, which we were assured wouldn't interrupt anything.

After we'd gone through preparing the upgrade process, we weren't completely confident that we'd be able to do it fully automated. So we opted instead to go version by version, performing a few extra tests ourselves in between, just so that if we got a problem in the first upgrade, it wouldn't be compounded through the rest of the upgrades. The actual process was similar to the preparation: back up the databases; update the config on the load balancers, so any endpoint changes, anything like that, would be done when the new versions were up and running; first upgrade Keystone; then upgrade the rest of the OpenStack components, excluding the hypervisors and Horizon; then upgrade the hypervisors, and lastly Horizon, to expose everything to our users; test it; and fix any problems that arise. We also wrote a back-out plan: if there were any major problems, we'd restore the database backups, and if there was time left in our maintenance window, we'd try again. Once we hit the end of our maintenance window, whatever was the last version we could get to is where we'd call it a day.

I'm not going to go through the exact steps required for every individual OpenStack upgrade. They're pretty well documented, and the documentation was enough to get us most of the way there and more. So instead, I'm just going to cover the issues that came up as part of our upgrade process and what we did to get around those.

Starting with the first upgrade, Mitaka to Newton. A few issues came up with this one. An issue with Nova prevented it from being able to list instances. There was a schema change in the nova_api database's build_requests table, I believe it was: a new field was added which couldn't be null, and we had some instances that were stuck in a build state from prior to the upgrade because of an issue we'd had. So when the data migration happened, those rows didn't get updated. It was kind of annoying, but we deleted those rows from the database and then Nova was able to list instances again. We also had an issue with database connections in Neutron, which prevented listing of networks. After going through the logs and various bits of documentation, we discovered we needed to increase the QueuePool limit for the database connections. We increased that to the same as the number of agents we have, which resolved that issue. And unrelated to the upgrade, we lost a rack of hypervisors during it, because something went wrong in the power testing: we actually lost power to a switch, which in turn caused us to lose the hypervisors.
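For the stuck build_requests rows, the cleanup would have been something along these lines (a sketch; the talk doesn't name the exact column the Newton migration added, so the WHERE clause below is a placeholder to be filled in from the nova-api traceback):

```shell
# Hypothetical cleanup of build_requests rows the data migration couldn't
# fill in. <new_column> is a placeholder for the non-nullable field the
# failing rows were missing. Check before deleting.
mysql nova_api -e "SELECT id, instance_uuid FROM build_requests WHERE <new_column> IS NULL;"
mysql nova_api -e "DELETE FROM build_requests WHERE <new_column> IS NULL;"
```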
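The QueuePool message comes from SQLAlchemy's connection pool, and the limit maps onto oslo.db options in neutron.conf, roughly like this, assuming crudini is available (the values are illustrative, sized to the agent count as described):

```shell
# Hypothetical fix for "QueuePool limit ... reached" errors: raise the
# SQLAlchemy pool in Neutron's [database] section, then restart the server.
crudini --set /etc/neutron/neutron.conf database max_pool_size 200
crudini --set /etc/neutron/neutron.conf database max_overflow 200
systemctl restart neutron-server
```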
It didn't have any impact on the upgrade; once the power came back a few minutes later, we carried on, and everything went fine. That upgrade took about half a day, so at that point we called it a day for the first day. We'd gone far enough into the afternoon that we didn't think we'd be able to complete the next upgrade before the end of the day, and we didn't want to leave our users hanging overnight if we could avoid it.

Moving on to Newton to Ocata: this was the most time-consuming upgrade for us. Technically, there weren't any major problems with it, but when we came to adding the cells and doing the online data migrations, it took a lot longer than we were expecting. After doing a bit of digging, we discovered that was because we hadn't archived all of our old deleted instances before the upgrade. So we did that, and what had previously run for several hours without completing then finished in about 20 minutes, which was pretty good; there's a sketch of the relevant commands below. Because of the delays, and the digging we had to do to find that, it took all of the second day of the upgrades.

Then we went from Ocata to Pike. This was a lot quicker than we expected it to be, and we didn't hit any major issues. It took about two hours from start, to the Rally tests passing, to users being able to create VMs again, which was great. It was a real improvement, both in the tooling and in the documentation of how things happen.

Then the final upgrade, Pike to Queens. Again, we didn't actually have any problems with the upgrade as such: at the end of it, all of our testing was working, and it took about two hours. But then we had a couple of problems. One was that our hypervisors were now being automatically disabled after a number of build failures, which was problematic. That was made worse by one of the ways in which we'd exposed our networks to users: some of the projects had two ways of accessing the external network, both as an external network and as a shared network, so they could place VMs directly on the external network. That no longer seemed to work after the Pike to Queens upgrade, and it wasn't a thing we'd tested for, because we weren't completely aware that that was how users were using it. What we ended up doing was removing shared access to the external network. The reason users had that access was that there'd been an issue with some of the site firewalling for our internal network, which meant users needed another way of reaching their VMs, and shared access to the external network had been given as a quick fix to get them up and running again. Once we'd done that, everything was working, and we re-enabled the hypervisors that had been auto-disabled because of the issue with the external network.

Then, the following day, we discovered we weren't able to snapshot on the hypervisors, so we couldn't snapshot our VMs, which wasn't a thing we'd tested for; it's a thing we've since added to our suite of functional tests. After various bits of digging, we discovered this was due to the way image location URLs were handled internally by Glance, I believe. So we set show_multiple_locations, as per the most up-to-date Ceph documentation for OpenStack, and that resolved our problem.
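The Pike to Queens fixes just described correspond to operations roughly like these (a sketch: the network name and hostname are placeholders, and for reference the auto-disabling behaviour is governed by nova's consecutive_build_service_disable_threshold option):

```shell
# Hypothetical versions of the three Queens-era fixes. Names are invented.

# 1. Stop exposing the external network as a shared tenant network.
openstack network set --no-share external

# 2. Re-enable compute services that disabled themselves after repeated
#    build failures caused by the network problem.
openstack compute service set --enable hv042.example.org nova-compute

# 3. Let Glance expose image locations again so snapshots work.
crudini --set /etc/glance/glance-api.conf DEFAULT show_multiple_locations True
systemctl restart openstack-glance-api
```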
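And going back to the Newton to Ocata step, the archiving and migration work maps onto standard nova-manage commands, roughly as follows (the batch size is illustrative):

```shell
# Hypothetical run of the archiving discussed above: move soft-deleted rows
# out of the main tables first, so the online data migrations have far
# fewer rows to walk, then run the migrations themselves.
nova-manage db archive_deleted_rows --max_rows 10000 --until-complete
nova-manage db online_data_migrations
```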
So, what did we learn from this upgrade? The big thing we learned was that our pre-production instance is somewhat of an idealised version of our production instance. There are various quick fixes and things in production that we hadn't been able to reproduce on our development cluster, so we'd fixed them and gone straight into production. That's something we're working hard to minimise happening again. We also need to make sure we get our users more involved in the testing; they aren't at the moment, but it's something we're working on with them.

In conclusion, I'm glad we didn't commit to just doing a fully automated fast forward upgrade, largely because I feel that some of the issues we caught, if we hadn't caught them when we did, could have caused much bigger problems through the rest of the upgrade process. The staged approach was more time-consuming, but it gave us the flexibility to back out of a particular version if we needed to. Any questions?

[Inaudible question about the data migrations]

So, the only issue we ran into, as I said, was due to it taking a huge amount of time: we'd got four or five million records that hadn't been archived in the various instances tables. Once we'd got those archived out, the migration was much faster, but we didn't have any actual issues with anything else.

[Inaudible question about upgrading the database cluster]

What we did was stop the replicas in the cluster, do the upgrade on the primary, and then, once everything was done, bring the others back up. When we were doing our preparation, running the upgrade with the whole cluster up seemed to push the replica servers too hard and they started to have problems. We never quite got to the bottom of why, but it was easy enough for us to just upgrade the primary and then replicate the changes afterwards.

[Inaudible question about cells]

No, we weren't; we only brought in the cells for the Ocata version. And apart from it taking time to discover all of the instances, it all went pretty smoothly.

[Inaudible question about backups]

For backup, we use Percona's XtraBackup product to do a live backup of the databases. Restoring is a bit trickier, because you have to restore it cold, but you can back the database up while it's running and still get a consistent backup, which is a thing we like. That's what we use for our nightly backup process as well. Any other questions? Thank you, everyone.
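As an addendum to that last answer: the XtraBackup flow described is, in outline, something like this (a sketch; the target directory is a placeholder, and --galera-info is the option typically added on Galera nodes):

```shell
# Hypothetical nightly backup, and the "cold" restore it implies, using
# Percona XtraBackup. Paths are placeholders.
xtrabackup --backup --galera-info --target-dir=/backup/nightly
xtrabackup --prepare --target-dir=/backup/nightly    # make the copy consistent

# Restoring requires the server to be stopped, hence "restore it cold":
systemctl stop mariadb
xtrabackup --copy-back --target-dir=/backup/nightly
chown -R mysql:mysql /var/lib/mysql
systemctl start mariadb
```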