Thanks for joining us and welcome to OpenInfra Live. OpenInfra Live is the Open Infrastructure Foundation's weekly hour-long interactive show, sharing production case studies, open source demos, industry conversations, and the latest updates from the global open infrastructure community. This is our sixth episode, and we have some great content coming up in the next few weeks, so I hope you will join us every Thursday at 14:00 UTC. My name is Thierry Carrez and I will be your host for the day. We are streaming live and will have plenty of opportunities during the show for Q&A, so please feel free to drop questions into the chat throughout the show, and we'll answer as many of them as we can as we go or at the end.

So let's go ahead and get started. Today's episode is part of a series on large-scale OpenStack infrastructure, organized by the OpenStack Large Scale SIG. We invite operators of large-scale deployments to present how they solve a given operational challenge, and to discuss live between themselves their different approaches. For today's show, we'll discuss a classic OpenStack challenge: upgrades. OpenStack is released every six months, and keeping up to date with the latest version is often cited as the number one pain point, especially for larger deployments. So we invited operators of large-scale OpenStack deployments to present and discuss how they do it.

Joining us today, we have Joshua Slater, Senior Cloud Systems Engineer at Blizzard Entertainment; Arnaud Morin, Site Reliability Engineer at OVHcloud; Mohammed Naser, CEO at VEXXHOST; Imtiaz Chaudhury, Cloud Architect at Workday; Chris Morgan, Cloud Infrastructure Team Leader at Bloomberg; and Belmiro Moreira, Cloud Architect at CERN. So to kick this off, let's hear from Joshua on how they do upgrades at Blizzard Entertainment.

Hello, everyone. My name is Joshua Slater. I'm a Cloud Systems Engineer for Blizzard Entertainment, and today I'm here to talk about upgrading OpenStack, the platform that powers all of our games here at Blizzard. In order to talk about upgrades, we need to talk about how we deploy OpenStack. We follow a pretty standard pattern of pulling down the code from the upstream stable branch, building a runnable container out of that code for every OpenStack service that we use, and deploying that onto the control plane hosts via config management, in our case Puppet. Building a new version of a service is easy enough; there's a pretty standard pattern that I've spoken to most in the community about, on how to do it in a non-packaged way. But deploying it is maybe not so easy.

On the next slide, if you've played any Blizzard games, you're probably familiar with this screen. So it's Tuesday; I guess I just don't get to play my video games today. Too bad that was my only day off. That sucks. This kind of weekly scheduled maintenance used to be standard practice in the online gaming industry, but it's really not normal or acceptable anymore. We just don't have that weekly eight-hour window to take down APIs, perform upgrades, and do any kind of maintenance we want whenever we feel like it. The game servers that we support host long-lived player sessions and may depend on OpenStack APIs being available. And how long are those sessions? Well, just think of how long it might be possible for a person to stay awake in a day, then maybe add a few more hours of Monster Energy drinks to that, and that's about how long they last.
So really the only way we can deploy upgrades on a regular schedule is if we're able to deploy them with zero downtime. Did we solve for that? On the next slide, we'll see that, no, we did not. As you can see, since we deployed OpenStack in 2015, we've upgraded a total of two times, which is, I guess, twice as good as one and infinitely better than zero. So we are catching up, just about, or not. When we decide to upgrade, we'll pick the most current release, and usually by the time we're done deploying it to all of our 22 regions, it's not the most current release anymore. So it kind of feels bad. I was hoping to be here to talk about how we solved for a more reliable zero-downtime upgrade in this kind of environment, but our plans to put any more work into upgrades have been on hold for a while due to the removal of Neutron LBaaS, which brings us to the next slide: the removal of Neutron LBaaS.

This is a big one for us. It means thousands of load balancers that we're going to have to migrate or recreate in Octavia. These objects are part of the very important data plane, player-facing traffic on specialized load balancing appliances that we use because we're a high-value target for DDoS attacks. So we require a fully functional Octavia driver from the appliance vendor before we can even touch Octavia, which we don't quite have yet. And API-wise, while Octavia should be a superset of the Neutron LBaaS API, and while the database migration tool should migrate all of our objects from Neutron to Octavia, will that actually work as advertised with the vendor driver that we're going to get? I don't really know. What I do know is that none of it has been tested by anyone, because the driver doesn't exist yet.

Moving on to database upgrades. One of the biggest reasons we're unable to perform a zero-downtime upgrade is that database upgrades can sometimes be unpredictable, not backwards compatible with an older release, and they typically require some kind of API downtime, which means I'm going to want to do this as infrequently as possible. Unpredictable problems that aren't caught in testing tend to surface due to inconsistencies in the database: a fresh database might look different from a database that's been upgraded from Ocata, which also looks different from a database that's been upgraded from Kilo. Combine that with 22 different regions and data centers deployed over different years, and there are a lot of different versions of that database out there. So sometimes one region will upgrade without issue while another will fail, and we have to roll back and figure it out, all the while having OpenStack APIs down. Figuring it out sometimes means database surgery. Just as an example, one of the release upgrades tried to inject the RabbitMQ password into the database; we had some special characters in that password that somehow blew up the process and had to be surgically removed. Just weird and unpredictable behavior that is hard to test for when those databases aren't the same. So that about does it for me. Again, my name is Joshua Slater, Cloud Systems Engineer at Blizzard Entertainment, and thank you for watching.
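Joshua's point about schema drift, where a database upgraded from Kilo or Ocata no longer matches a fresh install, is the kind of thing a pre-upgrade check can surface before the maintenance window. Here is a minimal sketch of such a check using SQLAlchemy, with placeholder connection URLs; this is an illustration, not Blizzard's actual tooling.

```python
# Hypothetical pre-upgrade check: compare a region's live schema against a
# reference snapshot taken from a freshly installed database, to surface the
# kind of drift Joshua describes before the maintenance window starts.
import json
from sqlalchemy import create_engine, inspect

def snapshot_schema(db_url: str) -> dict:
    """Return {table: sorted column names} for one database."""
    inspector = inspect(create_engine(db_url))
    return {
        table: sorted(col["name"] for col in inspector.get_columns(table))
        for table in inspector.get_table_names()
    }

def diff_schemas(reference: dict, candidate: dict) -> list[str]:
    """List human-readable differences between two schema snapshots."""
    problems = []
    for table in sorted(set(reference) | set(candidate)):
        if table not in candidate:
            problems.append(f"missing table: {table}")
        elif table not in reference:
            problems.append(f"unexpected table: {table}")
        elif reference[table] != candidate[table]:
            delta = set(reference[table]) ^ set(candidate[table])
            problems.append(f"column mismatch in {table}: {delta}")
    return problems

if __name__ == "__main__":
    # The URLs are placeholders; point them at a fresh DB and a region's DB.
    reference = snapshot_schema("mysql+pymysql://nova:secret@fresh-db/nova")
    candidate = snapshot_schema("mysql+pymysql://nova:secret@region1-db/nova")
    print(json.dumps(diff_schemas(reference, candidate), indent=2))
```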
Thanks, Joshua. That's very interesting. I like that you pointed out that changes in the upstream software can sometimes affect how early you can make an upgrade, because you basically have to adapt to those changes first. Maybe, do we have any immediate question for Joshua about this presentation? Then it was probably crystal clear. We'll switch to Arnaud, who will explain to us how OVHcloud does it.

Hey, everybody. Yes, so thank you, Joshua. Thank you, Thierry. My name is Arnaud Morin. I've been working at OVH for a few years now. I've been involved in the OpenStack public cloud deployment for OVH as a software engineer, working on the deployment of the OpenStack infrastructure. Before going into the upgrade itself, I'd like to share some numbers with you, to give a sense of what a typical OpenStack deployment at OVH looks like. First, we run more than 35 OpenStack regions, with around 20,000 compute nodes in total. Just like Joshua explained, we use Puppet to deploy the control plane and everything for OpenStack. We deploy the control plane on two or more nodes for each OpenStack service: two nodes for the Neutron server, two nodes for Nova, et cetera. For each region, we also deploy one big RabbitMQ cluster and one big database based on a MariaDB cluster. Our typical region has between 500 and 1,000 compute nodes. Some of them have more than 1,000, but we consider those very big OpenStack regions; that's quite big for us.

When you manage such an infrastructure, you have to keep very strict rules in order to deploy OpenStack correctly, and we try to deploy everything with a lot of consistency. We try to keep the same OpenStack version everywhere, the same configuration, everything identical, in order to lower the number of issues or bugs we can hit. The result is that when we decide to do an upgrade, it's usually a big jump, because we may have to jump five or six OpenStack releases at once.

So let's go to the next slide. When we decide to upgrade, we first ask ourselves some questions. First, why do we need to upgrade? Usually we say that newer is better. That's not always true, but newer has more features, so if we want new OpenStack features, then usually we have to upgrade. That's one reason. The second big reason is about killing end-of-life releases: OpenStack releases, but also Python versions, operating systems, and even hardware. As soon as we decide to upgrade, what we want as a public cloud provider is to avoid any downtime on already running instances. We don't want our clients to be affected by the upgrade we are doing: no ping loss on running instances. We also want to avoid any data loss. Of course we don't want to lose any instance during the upgrade, but we also don't want to lose any data in the database itself. We don't want clients to come to us after the upgrade and say, hey, where is my instance metadata, why is it not visible anymore? So we have to keep an eye on that. And as a public cloud provider, we also sell our API uptime, so we want to make sure the APIs are available most of the time; we want to upgrade in a very short period of time to avoid any big downtime on the API side.

So our upgrade strategy is the following. First, of course, we tell our customers, our clients, that we will do an upgrade on a specific date.
And so everybody is aware that the API will be down for a specific number of hours. And we go region by region; we don't do all regions at once, of course, because our clients are used to switching from one region to another.

Before the D-day, we try to deploy a new control plane with the new OpenStack release. This control plane is just running but doing nothing, not even connected to the database; it's just there to make sure everything will be quicker on the D-day, with the control plane ready. Then we do some tweaks on the database itself: we try to clean it, remove obsolete data, and make sure no data will block the upgrade itself. We have scripts and tools to do this preparation.

Then on the D-day, our procedure is the following. We start by stopping all OpenStack services, on the control plane and on the computes as well, so everything on the OpenStack side is shut down. We do the database migrations, offline migrations and online migrations; we have tools for that. It's not very easy, as Joshua explained a little bit before; it's quite a big challenge, and depending on which OpenStack version you are going to, you can run into specific issues. So this is a challenge here. Then, if everything goes fine, we start OpenStack on the new control plane. We also upgrade OpenStack on the hypervisors. We do the upgrades directly on already running hypervisors with running instances; it's quite easy for us, because we use Python virtual environments to push the new OpenStack version. And then we switch the API load balancers from the old control plane to the new control plane. If everything went fine, the APIs were down for only a few hours for one region, and then we are done.

Next slide. I don't know if we have it. Yeah, we have this next slide. At the end we just wait for everything to come up and we monitor everything. We keep an eye on our dashboards, check the logs, and check that everything is running fine on OpenStack. We try to do our best to deliver the new OpenStack release in good shape. Thank you for watching.
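Arnaud mentions pushing new OpenStack versions onto running hypervisors with Python virtual environments. A minimal sketch of that pattern follows; the paths, version pins, and systemd unit names are invented for illustration, and OVHcloud's real tooling is certainly more elaborate.

```python
# Hypothetical sketch of a virtualenv-based compute upgrade, in the spirit of
# what Arnaud describes; paths, pins, and unit names are all made up.
import subprocess

AGENTS = ["nova-compute", "neutron-openvswitch-agent"]  # example unit names
NEW_VENV = "/opt/openstack/victoria"                    # target release venv

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Build the new virtualenv next to the old one while agents still run,
#    pinned to the target stable release (Victoria here, as an example).
run("python3", "-m", "venv", NEW_VENV)
run(f"{NEW_VENV}/bin/pip", "install", "nova==22.*", "neutron==17.*")

# 2. During the maintenance window: stop the agents, flip the 'current'
#    symlink the systemd units point at, and start everything back up.
for unit in AGENTS:
    run("systemctl", "stop", unit)
run("ln", "-sfn", NEW_VENV, "/opt/openstack/current")
for unit in AGENTS:
    run("systemctl", "start", unit)
```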
Thanks, Arnaud. That was very interesting. One immediate question I had is: how do you handle the workloads that are running in the region? Are you migrating them upfront to somewhere else before you upgrade, or are you transitioning them afterwards? How do you handle the running workloads for your customers?

You mean instances which are already running? Yes. So everything stays up. We don't touch anything; nothing is live migrated, nothing is moved. But nothing can be moved anyway, because the API is down. So the instances are running, they are still pinging, everything is accessible, but we cannot reboot an instance, we cannot live migrate an instance. We are just freezing the workload, actually.

Okay, that makes sense. Any other question for Arnaud? I don't see any questions in the chat yet. Quick reminder that this is a live show, so we can take your questions live; feel free to add them in the chat of your favorite streaming platform. There's a hand up from Sergei Galovletiak. Sergei, you should ask that question. Okay, while waiting for live questions, we'll... Oh, so we have a question. Can we get the question on the screen? A question from Salman: what will happen for the hypervisor failover during the upgrades? I don't get the question. What do you mean by hypervisor failover? Yeah, I'm not sure exactly what...

On the hypervisor side, basically, we just shut down every agent, Nova or Neutron, install a new package, a new version of OpenStack, and start everything back up when the API is up. That's basically the only thing we do; it's quite easy. The workload does not move, so we don't have any network downtime, and we don't have any failover mechanism on that part. Everything is frozen.

Okay, we have another question, from Hussein: without live migrating instances, how do you complete OS upgrades? Yeah, that's a good question. Actually, if we have to upgrade the operating system itself, then yes, we do live migration. We did that in the past, and it can take up to one year. For example, we upgraded from Ubuntu 14 to Ubuntu 16, and we did that over a year. We have to live migrate every workload to another compute node that has already been upgraded, but it's not correlated with the OpenStack upgrade itself; we can do that in a second step.

Okay, yeah. So you also have to account for OS upgrades, and in some cases it's actually more complicated to migrate the customer workloads off the underlying OS than to migrate the control plane. Yes and no, because live migration is a daily thing; we do that every day. Basically, it just takes more time, because we have to live migrate everybody. So it's not only a few instances in a day, it's all instances in a year. We have the same issue when we have to upgrade the kernel, for example. Yeah.

So we'll switch to Mohammed now. We've seen that in the first two cases, at Blizzard and OVHcloud, you upgrade only rarely and make big jumps in your upgrades. So I'm interested to see how Mohammed does it at VEXXHOST.

Okay. Thanks, Thierry. So, something we do at VEXXHOST is that we try to do upgrades pretty often. We try to stay current with the latest OpenStack release, and that helps us track as close to upstream as we can. Going to my first slide: we focus on running upstream as much as we can. That gives us a lot of the value of the hard work that we share with the community, and a tested set of code all together. One of the really great things we have in the OpenStack upstream community is jobs we call Grenade jobs, which pretty much install the previous release of OpenStack, then go through upgrading it to the next release, and ensure that there was no loss of connectivity, plus a bunch of other smoke tests to make sure we're not breaking upgrades entirely. So one of the things that helps us, by leveraging 100% upstream code, is that we can take advantage of that kind of basic functional testing provided by the community.

On top of that, it also makes the process of upgrading a lot easier, mainly because when we see problems that we need to fix, we go ahead and propose those fixes upstream to the community, and backport the bug fixes. So let's say, in theory, we're running Ussuri and we're trying to go from Ussuri to Victoria, and we ran into a bug in Ussuri: we backport that change to Victoria and to Ussuri, so when we upgrade to Victoria, we already have that fix in there.
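The headline check in those Grenade jobs, that a workload keeps answering while the cloud underneath it is upgraded, can be approximated with a very small probe loop. This is a toy illustration with a placeholder target address; the real upstream jobs do considerably more.

```python
# Toy connectivity monitor in the spirit of Grenade's smoke checks; the
# target address and the probe interval are placeholders.
import subprocess
import time

TARGET = "203.0.113.10"   # placeholder floating IP of a test instance
INTERVAL = 1.0            # seconds between probes

failures, probes = 0, 0
start = time.time()
try:
    while True:
        probes += 1
        # One ICMP echo with a 1-second timeout; count any losses.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", TARGET],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
        time.sleep(INTERVAL)
except KeyboardInterrupt:
    elapsed = time.time() - start
    print(f"{failures}/{probes} probes lost over {elapsed:.0f}s")
```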
So not only does the community benefit from this backporting, but we also avoid the situation of: do you remember that patch we had to carry in our infrastructure seven months ago, that nobody now remembers why or how we did? It's now part of the tested upstream code.

The next thing I want to talk about is that we heavily leverage Kubernetes in our deployments. Pretty much our entire control plane and data plane, everything, is running inside containers, working a lot with the OpenStack-Helm project. What we do is deploy our base hardware infrastructure, set up the Kubernetes infrastructure on top of that, and from that point on we don't touch the underlying systems; we purely interact with Kubernetes via the API. That lets us do things like GitOps, maintaining all of our changes in a Git repo that is then automatically applied to the Kubernetes cluster. It also goes back a little to what Blizzard does with building containers: we also build the containers that we use, which is an excellent way of keeping the container together with all the supporting utility packages, so you can be confident that this environment is going to be the exact same environment everywhere. Versus: the first controller that was deployed six months ago is running some version of some application, then you redeployed another controller a few months later, and now you've got this mishmash of different versions across the stack. Containers really give us a solid block that is reused across all of our systems.

And this goes into how we do things. We start by building containers, and those containers are what we use in production. The really nice thing is that because we use Kubernetes and these containers, we're able to build staging and testing environments that mimic our production environments pretty much exactly, other than maybe a few very minute differences. Because we treat Kubernetes as our base, we can deploy staging environments that replicate the upgrade as much as possible. That means the random one-off thing that someone ran at some point against the entire fleet of systems isn't a problem; no one's going to forget to run that script you ran a year ago, because we're rebuilding these environments from scratch all the time, or testing the upgrade on an existing environment. When we do the upgrade in the production environment, it really reflects exactly what we ran in the staging test where we rehearsed the upgrade. And that's been great, because a lot of the time, when we get clouds converted to our Kubernetes-based platform, it takes us literally hours to take a cloud across multiple releases, mainly because all that's involved is changing the image version; as long as it follows the hooks, it runs a DB sync and goes through all of that automatically, and everything is pretty much done. So it's been really great.
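Mohammed's point that an upgrade largely comes down to changing the image version and letting the chart hooks run the DB sync could look roughly like this in a GitOps-style workflow. The release name, chart path, and values-file layout here are invented, not VEXXHOST's actual repository structure.

```python
# Hypothetical image bump plus helm upgrade; release, chart, and values
# layout are invented. Requires PyYAML and the helm CLI on the PATH.
import subprocess
import yaml

VALUES = "clouds/region1/nova-values.yaml"   # values file tracked in Git
NEW_TAG = "victoria-20210506"                # freshly built image tag

# 1. Rewrite the image tag in the values file; this is the whole "upgrade".
with open(VALUES) as f:
    values = yaml.safe_load(f)
values["images"]["tags"]["nova_api"] = f"registry.example.com/nova:{NEW_TAG}"
with open(VALUES, "w") as f:
    yaml.safe_dump(values, f)

# 2. Apply it; chart hooks (e.g. a db-sync job) run the schema migration.
subprocess.run(
    ["helm", "upgrade", "nova", "./charts/nova", "--values", VALUES],
    check=True)
```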
It's been very nice to leverage Kubernetes as a base platform for deployments. It has allowed us to start deploying clouds at a much faster rate, because while we have our big public cloud that we run, we also have a large set of private clouds that we manage for customers all over the world. Having a way to quickly deploy these changes across all of these clouds, and making sure they're all consistent, Kubernetes has been really helpful for that. I'd say our biggest Swiss Army knife in the toolbox has been Kubernetes, letting us do all of those upgrades and tests and making sure all these environments work out.

Thanks, Mohammed, that was great. We have time for one question from Steve. Oh, no, you have one more slide, sorry. Yeah, well, that was mostly what I covered at the end: we build the images, we run them in the production environments, but those are the same ones that we run in staging, so there are no surprises when things get migrated. We do it over and over again. Unpredictable things in upgrades are going to happen anyway, but we're minimizing them.

Okay, so maybe we can bring up Steve's question: are you using OpenStack-Helm or Airship, since you've been mentioning Helm charts for deploying OpenStack? Yeah, so we actually leverage OpenStack-Helm at the moment, and it's really interesting: we have a combination of Terraform and OpenStack-Helm. Because OpenStack-Helm requires a bunch of values that you have to fill out, and these values are largely just calculated, predetermined values, we actually use Terraform to automatically generate the values files for those OpenStack-Helm deployments, and then we feed that into Helm to deploy into the cluster. We are looking at Airship 2.0, because it really parallels what we do currently. Airship is what we want to leverage and start using eventually, so we can use Airship to deploy OpenStack-Helm. Airship would be our delivery vehicle for running orchestration, upgrades, and things like that.

So Belmiro had a question as well. Hi, Mohammed. I'm curious how you build your images. Are you using OpenStack LOCI for that? Yeah, for the images, we are actually building them using OpenStack LOCI. The way we do it is we have our CI infrastructure, and any change pretty much goes ahead and builds the base image, the requirements image, and then the final image at the end. Primarily we use the profiles and everything that is created in the OpenStack-Helm images repository, where the profiles are all set up. But yeah, that's how we use it. We have done some experimenting in the past with multi-stage Python image builds, but that's not something we've gone deep into, because the LOCI images do a really good job at letting us build nice, repeatable images all the time.
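For reference, a LOCI-style image build can be driven straight from the LOCI repository, with Docker build arguments selecting the project and branch. The sketch below uses a placeholder registry and skips the base and requirements image stages Mohammed describes.

```python
# Minimal LOCI-style image build; the registry is a placeholder, and a real
# pipeline would first build the base and requirements images as described.
import subprocess

LOCI_REPO = "https://opendev.org/openstack/loci.git"

def build_service_image(project: str, release: str) -> str:
    """Build one service image from the LOCI repo and return its tag."""
    tag = f"registry.example.com/loci/{project}:{release}"
    subprocess.run(
        ["docker", "build", LOCI_REPO,
         "--build-arg", f"PROJECT={project}",
         "--build-arg", f"PROJECT_REF=stable/{release}",
         "--tag", tag],
        check=True)
    return tag

for project in ["keystone", "glance", "nova", "neutron"]:
    print("built", build_service_image(project, "victoria"))
```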
Okay, we'll switch to the next presentation so that everyone gets enough time to present; we might come back to some of the questions asked in the chat, depending on how much time we have at the end. So, switching to Imtiaz, presenting the OpenStack upgrade story at Workday.

Thanks. So yeah, I'll talk about how we did our upgrade at Workday. Next slide. We started our OpenStack journey a while ago, back in 2013, and we started with the Icehouse release. The next upgrade that we did, a few years ago, was to Mitaka. We are building our next deployment with Victoria; that's not shown here because we haven't done that upgrade yet, so I'll talk about what we did in the past.

For us, although we would have preferred to upgrade in place, and we looked at many different strategies, it proved quite difficult. One of the reasons was that the operating system version changed between Icehouse and Mitaka: we use CentOS, and it was CentOS 6 in Icehouse and a different version in Mitaka. We use Contrail as our SDN provider, our network provider, and the Contrail version also differed between what was supported on Icehouse and what was supported on Mitaka. The network provider actually added a lot of constraints on which version of OpenStack works with which version of Contrail, and Contrail also has a very strict kernel dependency, which adds an additional challenge. You cannot, say, update the hypervisors the way OVHcloud does and keep them on one release while upgrading the control plane; they had to be done at the same time. Our deployment architecture was also slightly different between the two: we started with a very simple, non-HA architecture with a single controller, and then we moved to a multi-controller design with multiple RabbitMQs, where everything was a highly available design. Doing that in a short time was another change we had to handle. And the deployment tools changed too: for configuration management we use Chef, and the deployment code itself changed. To deploy Icehouse it was one set of code, and for Mitaka the cookbooks we used were different; the version of Chef was different as well. So there were a lot of moving pieces, which made the upgrade from Icehouse to Mitaka challenging, and there was no way for us to do it in place, or even do parts of it at a time.

Next slide, please. Some of the constraints we had: the entire upgrade needed to be done within a maximum of two to three hours. We have a maintenance window, and in production it's even shorter. We have a few environments, production and non-production, and our typical cluster sizes were somewhere between 200 and 300 compute nodes. We have 45 different deployments and five different data centers, but we were doing one data center at a time, one installation at a time. Even then, trying to upgrade all these things proved challenging. The second requirement was that we had to be able to roll back if things failed during the upgrade. And it did fail, and we were able to roll back; that was a requirement we had to plan for when coming up with the strategy and solution. The third main requirement was that it had to be transparent to the people using our services. There are VMs running; how do we make the upgrade transparent to them, so that they're not impacted?

Our strategy, at a high level, was to deploy a Mitaka cluster side by side. We kept our Icehouse cluster, but we had an entire functional Mitaka cluster deployed at the same time. Then we made the migration. One of the things that we do during our maintenance is destroy all the VMs of a given service and recreate them.
So the version N from last week goes out, and we recreate version N+1, which gives us a little opportunity to move them from one cloud to another, from one version of OpenStack to another. So we basically destroy all the VMs for a service on Icehouse and redeploy them on Mitaka. To make it transparent, one of the things we did was to create the same projects in the two different clouds, and we kept the IP addresses the same. That was our trick to keep it sort of transparent to the end users, because a lot of the users also had firewall rules or ACLs associated with the IP addresses, and changing those would have meant bigger changes in the rest of the infrastructure. So that was our trick for keeping it transparent to the end users.

The third step was to destroy all the VMs in the old Icehouse cluster, and we used automated orchestration tools to redeploy them in the new Mitaka one. Finally, we validate the health of our cluster as well as all the services we are bringing up. If everything looks smooth, then we say, okay, the upgrade went through successfully, we are ready to move on, and we hand it back to the service users, whoever is using our OpenStack cloud. If it fails, then we just destroy the VMs on the Mitaka cloud again and put them back on Icehouse. So that's basically, at a high level, what our strategy was. It's not ideal, but that's what we did. This strategy obviously requires having extra hardware. Since we were growing at a very fast rate, we had extra hardware, and eventually it all got consumed into the Mitaka cluster. But I don't know if this strategy is feasible for everyone. In some cases we even borrowed hardware from other groups: we needed 200 hypervisors, and maybe some other team had some here and there, so we borrowed them, and after the migration was done we put them back into the pool. So that's basically how we did it.
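Imtiaz's trick of preserving IP addresses across the two clouds can be sketched with the openstacksdk: pre-create a port carrying the old fixed IP in the new cloud, then boot the replacement VM on that port. The cloud, network, image, and flavor names below are invented; this is not Workday's actual orchestration tooling.

```python
# Hypothetical sketch of recreating a VM in the new cloud with its old fixed
# IP, via openstacksdk; cloud/network/flavor/image names are placeholders.
import openstack

old = openstack.connect(cloud="icehouse-region")   # entries in clouds.yaml
new = openstack.connect(cloud="mitaka-region")

server = old.compute.find_server("app-frontend-01")
fixed_ip = server.addresses["app-net"][0]["addr"]  # address to preserve
# A real tool would wait for deletion and back up data first.
old.compute.delete_server(server, ignore_missing=False)

# Reserve the same address on the equivalent network in the new cloud.
network = new.network.find_network("app-net")
port = new.network.create_port(
    network_id=network.id,
    fixed_ips=[{"ip_address": fixed_ip}])

# Boot the replacement VM attached to the pre-reserved port.
new.compute.create_server(
    name="app-frontend-01",
    flavor_id=new.compute.find_flavor("m1.large").id,
    image_id=new.compute.find_image("app-frontend-gold").id,
    networks=[{"port": port.id}])
```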
Thanks, Imtiaz. Do we have any immediate question before we switch to Chris? Well, if not, I'll let Chris explain how he does upgrades at Bloomberg.

Thanks, Thierry. Apologies, no slides from me; it wasn't feasible to get approval for anything written in time for this, so sorry, there's nothing to look at. A couple of topics. First, operating system upgrades. We constrain our clouds to be 100% live-migrate compatible, in other words, all VMs can move to any host. So it's like a sliding block puzzle: you upgrade an empty host, live migrate VMs into it, that empties the next one, you upgrade that one, and so on and so forth. We moved, I think, 15,000 VMs over three weekends. So the operating system upgrades are actually doable.

OpenStack upgrades have been very, very problematic. We have nine old deployments, and they got kind of stuck at Mitaka. I want to talk about the old stuff just because it's sort of a safe place to say, yeah, it's difficult. Those are end of life, and we're actually making our users do the transition to new OpenStack deployments, which are at Rocky currently. We got our new architecture up with Rocky, and then an enormous tidal wave of load came, so we haven't actually done upgrades since then, but we have engineered them. So how do we do upgrades on our new stuff? We have an HA control plane, and basically we pause the minority of it and use Ansible playbooks to upgrade those nodes. We then push load to the smaller, minority part. If it looks good, you run some tests, then upgrade the rest of the control plane. So it's kind of an in-flight upgrade. It's not during peak load, but it is with the API up as much as possible. The only time there's any kind of relief for a financial data company is Saturday morning; that's when the fewest markets are open. So Saturdays. We're going to be doing this on our first cluster for real within the next few weeks.

The technology that we use is the opposite of Mohammed, who's all containerized: we just install OpenStack on machines and then upgrade in place. So it's pretty challenging. But we do have lots of hardware, as I said, so things like bringing up shadow RabbitMQ clusters are things we can do if the load of upgrading RabbitMQ in place is too much.

On the questions Thierry had suggested around why we want to upgrade: it's a mix. Definitely some new features, things like the NUMA-aware live migration we see in newer versions; we can't do that right now, so we don't use NUMA or anything, because we always have to be able to live migrate. We would also rather upgrade to get away from many CVEs than have to spot-patch. And recently, something more positive: we've actually been starting to upstream fixes, and we're far enough behind now that we can give our vendor a fix and they can't bring it back to the version that they fully support on our deployment. So we have to get out of that situation, so that when we send a fix back upstream, it actually lands back in the tree we deploy from.

On the question of frequency, how often we upgrade: clearly we are at a frequency of zero right now, but I think we're aiming for annual, which suggests that the twice-a-year pace is too fast for us. Frankly, we have never caught up with it; we never caught up in the old release series and we haven't caught up in the new one. But I understand that's kind of controversial; the OpenStack developers want to go more often, I think, right, Thierry? So if releases were less frequent, it would feel like meeting in the middle, and maybe we could attempt to meet them there.

One of the things that Joshua from Blizzard mentioned was version skew, and therefore problems unique to this cluster compared to that cluster. One of the things we have thankfully been able to achieve is that we've unified every single deployment in the current generation, which is Rocky with Neutron Calico, to be 100% the same, so that when we build a test cluster, upgrade it, and put load on it, it's a reasonable proxy for upgrading a production cluster, because it's all the exact same software on similar hardware on the same network. It's not as flexible as the full Kubernetes virtual cluster approach, but we're definitely seeing some good results. And I will say that with the upgrade success rate we see now with the database, the Alembic expansions and contractions and all that, the amount of testing the OpenStack community has put in more recently means that we kind of expect those things to work. We don't see failures like we used to five years ago, when the whole cluster came apart in our hands during an upgrade and we had to glue it back together with duct tape.
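A schematic version of the canary flow Chris describes, pause the minority, upgrade it, shift load, check, then finish the rest, might look like the following. The playbook names, inventory groups, and the single health probe are all invented for illustration.

```python
# Schematic canary-style control plane upgrade; playbook and group names
# are invented, and a real flow would gate on far richer health checks.
import subprocess
import sys
import urllib.request

def play(playbook: str, limit: str):
    subprocess.run(["ansible-playbook", playbook, "--limit", limit],
                   check=True)

def healthy(url: str = "https://cloud.example.com:8774/") -> bool:
    """Crude smoke test: does the compute API endpoint answer at all?"""
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except OSError:
        return False

# 1. Upgrade the minority of the HA control plane first.
play("upgrade-controllers.yml", limit="controllers_canary")

# 2. Shift API load onto the upgraded minority and verify behaviour.
play("switch-load-balancer.yml", limit="loadbalancers")
if not healthy():
    play("rollback-load-balancer.yml", limit="loadbalancers")
    sys.exit("canary failed, load shifted back to the old controllers")

# 3. Looks good: upgrade the remaining controllers.
play("upgrade-controllers.yml", limit="controllers_rest")
```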
That's some probably scattered thoughts about how we do upgrades at Bloomberg, but we are about to jump in with both feet, so wish us luck.

Thanks, Chris. We have lots of questions coming in in the chat, but not necessarily for the most recent speaker. So what I propose is that we go through the last presentation, with Belmiro, and line up a number of those questions at the end of the show. So Belmiro, you can take it away. And you're on mute. And I am. Okay. Can you hear me now? Yes.

Okay. So hello, everyone. I am Belmiro Moreira and I'm a computing engineer at CERN. CERN is the European Organization for Nuclear Research. So let's talk about the OpenStack upgrades at CERN. I believe that upgrade challenges are also related to cloud size, so I would like to give you a glimpse of the CERN cloud size for a little bit of context. As you can see in the slides, we have around 300,000 cores and more than 7,000 bare metal nodes, with a little fewer compute nodes, so you have a sense of the size of the cloud. You can also see that we removed a lot of compute nodes at the beginning of this month, because they needed to be decommissioned; they were very old. So size definitely represents a challenge when upgrading OpenStack.

Moving to the next slides: we have used OpenStack since 2013, and since then we have been upgrading OpenStack every release. But upgrade challenges are also related to the number of OpenStack projects available in the cloud. At CERN we run 15 OpenStack projects, and as you can see, they are not all on the same OpenStack release. We have projects that are still on Stein, like Nova and Neutron, and others that are on Victoria, one of the latest releases. What is interesting is that we have one cloud with all these projects at a mix of different releases. How is this possible? We don't consolidate the control plane onto a few physical nodes like many deployments do; instead, we run the control plane as virtual machines on top of the cloud that they actually manage. Each OpenStack project runs alone in multiple virtual machines. The slide I have here tries to illustrate this, more or less: the VMs dedicated to the OpenStack control plane run side by side with user instances. This allows us not only to have an isolated environment for each OpenStack project, and therefore to upgrade each project independently, but it's also highly distributed, because it is spread between multiple availability zones, Nova cells, and, in the end, different compute nodes.

So, I will not walk you through exactly how we do an upgrade, because that is very complex and depends on the project we are upgrading, but I will give you some considerations that we follow. As I told you, each OpenStack project runs in different virtual machines. We use CentOS as the operating system and the RDO OpenStack distribution, so a lot of work is done by the RDO project, which allows us to use those packages for our deployments. However, we still have some internal patches in some of the projects, like Nova, and for those cases we need to rebuild the packages; because of this, we need to keep internal repos for these rebuilt packages. Like many other deployments, we use the upstream Puppet modules. And as I said previously, we don't do any release jumps. To test all of this, we have a small infrastructure, which of course runs on top of the main cloud, for integration tests.
So we first upgrade this test infrastructure, which is only open to us, the cloud operators. We do our integration tests, and then we move the plan to the main cloud upgrade.

Now, upgrade challenges. Different projects have completely different challenges. Upgrading a project like Glance is completely different from upgrading Nova; it's a completely different game. So we try to adapt the procedure considering the project, the risk, and so on. Also, we don't upgrade all the projects at the same time. Usually we have different projects on different releases, so when we upgrade, we upgrade one project: we upgrade Glance, or we upgrade Keystone, or Neutron. We never mix two projects in the same upgrade; that reduces a lot of the risk and the planning work. Usually, for the upgrades, we stop the APIs for the users, but this is not always required. It is required for Nova, for example, but during the latest releases of Glance we didn't stop the API, so that upgrade was completely transparent for the users. Then, depending on the project, we can upgrade in place or replace. In place means we just upgrade the packages and configuration of the existing virtual machines. Or we replace those virtual machines completely: we create new virtual machines with the new release, move the traffic to them, and at the end remove the old ones. This happens with the Cinder project, for example, and Keystone sometimes uses this procedure as well. Projects like Nova and Neutron we usually do in place.

Then there are the Puppet module upgrades: because we are upgrading the project, we also need to consider the Puppet module. We usually do that after the OpenStack project upgrade. Why? The deprecation period for configuration options is usually two releases, which allows the old Puppet module to remain compatible with the new release. Chasing the changes in these configuration options is always challenging, especially when they only change configuration group and there is no other functionality change. If this is a challenge for five nodes, imagine changing these configuration options and understanding the impact across thousands of compute nodes.

DB schema updates: depending on the project, this can also be a challenge. In small projects, if there is a DB schema change, it's pretty fast; the databases are very small. In a big project, we sometimes have DB schema updates taking several minutes, so this needs to be taken into account. What we now do regularly is DB cleanups. For example, in Nova, soft-deleted instances are regularly archived and removed, to avoid these issues at upgrade time.
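Belmiro's point about cleaning soft-deleted rows maps onto standard nova-manage commands. A small periodic wrapper could look like this; the batch size and the decision to purge the shadow tables are illustrative choices, not CERN's actual policy.

```python
# Periodic DB cleanup in the spirit of what Belmiro describes, wrapping
# standard nova-manage commands; batch size and purge policy are examples.
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Move soft-deleted rows out of the main tables into the shadow tables,
# looping in batches until nothing is left to archive.
run("nova-manage", "db", "archive_deleted_rows",
    "--max_rows", "1000", "--until-complete")

# Then drop the archived rows from the shadow tables entirely.
run("nova-manage", "db", "purge", "--all")
```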
Then I mentioned the Nova control plane and the compute nodes, because as I said, we try to distribute everything, all the projects and also their components, between different virtual machines. This is awesome for the reliability of the service; however, it also causes an issue for the upgrade. For example, our Nova control plane includes the APIs, schedulers, and conductors, everything running in independent VMs, because there's a control plane for every cell and we have a lot of Nova cells. In the end, we have around 80 different virtual machines that need to be upgraded just for the Nova control plane upgrade. Upgrading that many virtual machines for Nova is a huge challenge. And then we have the compute nodes. Compute nodes can run with the N-1 release; we try to do them on the same day. Sometimes they run an older release without any issue, but we try to avoid that, because we hit some issues with it in the past.

Compute node upgrades, as was already mentioned by others: we see a strong dependency between the OS and the version of OpenStack that we run there. For example, using RDO, the Train release was built for CentOS 7 and CentOS 8, and Train was the latest version built for CentOS 7. This means that if we want to continue upgrading the compute nodes, we will need to upgrade the compute node OS. And that is a real challenge, because upgrading from CentOS 7 to 8 usually requires a reinstallation, and live migrating the workloads of thousands of compute nodes is extremely challenging.

And then we have all the dependencies, for example the RabbitMQ clusters and MySQL. For RabbitMQ, we try to have completely independent RabbitMQ clusters for each project, so when we need to upgrade, we upgrade them individually. The same goes for MySQL: we have hundreds of MySQL instances, because we have one for each project, plus all the Nova cells databases. In the end, when we need to upgrade, for example from MySQL 5.6 to 5.7 or 8, it means we have a huge number of database instances to upgrade. So I just wanted to bring up these challenges that we face every day when we need to upgrade. Because we have so many projects and we try to follow the releases, almost every day is an upgrade day for us, considering that the release cycle is six months. So thank you.

Every day is release day. I like it. So, we have a lot of questions in the chat; we'll try to answer as many of them as we can in the remaining time. First, we had a question from Ahmed to Belmiro, which is a common question these days: where will you go once CentOS is no longer released on its old cadence, but is more of a rolling release? How do you handle it? So, the CentOS 8 lifetime was dramatically reduced: it ends this year. Then we have Stream. What CERN is evaluating is what the next distribution for the organization should be; it's not clear yet. Currently, because we need to get off CentOS 8, all the projects that already migrated to CentOS 8 will move to CentOS Stream. For the future, it's still being evaluated. Yes.

So, next question: we have a question directed to Imtiaz, around upgrading from a non-HA architecture to an HA architecture, in terms of complexity and any good practices for that. Do you have any insights? Sure, I can comment on that. As I mentioned, while we changed the architecture, it was not during the upgrade; we did it side by side. We had our control plane built with the HA architecture before we did the actual migration. I would say HA was definitely a necessity. I mean, we got away with having a single control plane for years, and we had only one failure, and even that was a minor one from which we could recover. But we've seen multiple failures since going to the HA architecture, so I wouldn't recommend anyone run a non-HA architecture in production. Having HA also adds complexity to the upgrade story in general, but in many ways it makes minor updates and other things quite easy. For the control plane, you can actually shut down one controller, bring it up with a newer version, and do a sort of rolling update; not for a major OpenStack upgrade, but for any minor OS update or even bug fixes. So those are things we benefited from with the HA architecture.
So I think that's, I guess, what I would recommend. But yeah, it was not a live upgrade from non-HA to HA.

Thanks, Imtiaz. We also have a question for everyone, from Hussein: in a hybrid environment with different CPU families, or even within the same CPU vendor, do you use extra CPU flags in Nova to handle live migrating instances between those nodes?

Yes, yes we do. At OVH we are using, I don't remember exactly the name of the CPU model, it's either Broadwell or Haswell, one of the models defined in libvirt. And then we add extra flags, such as VMX, things like that. With this, we make sure that every CPU model can handle all instances, as long as they are in the same range, let's say the same group of flags.

Yeah, I can also comment quickly on that. What we do is let libvirt decide what the best model is, and in our experience, live migrating an instance from an older generation to a newer generation almost always works fine. The problem is usually trying to go from a new one to an old one. So generally, when we roll out new flavors, we just add the new compute nodes and disallow anything new from spawning on the old ones as they get recycled out, things like that.

I can speak on the libvirt side as well. As far as what we do at Blizzard, we enable NUMA pinning for our servers, and we just don't support live migration, due to the amount of problems there have been with NUMA pinning. So all of our migrations are cold, and they're typically done when a game is going to go down for maintenance or has a patch; they'll just get off that hypervisor at that time. It takes more time, but they do it at their convenience.

In the case of CERN, we organize the compute nodes by CPU family into different cells. When we get a delivery, all the compute nodes are the same, and they are integrated into one cell. Because all the nodes inside the cell have the same CPU, live migration is not an issue there. Initially, we let libvirt select the best CPU for the virtual machine. However, we found some issues when libvirt and the kernel get upgraded: sometimes that introduces a new CPU flag, and in that case we are not able to migrate the VMs from a compute node that was rebooted for some reason to the other compute nodes, because the older ones don't have that new CPU flag. So what we do now is, for each cell, define the CPU model that we want, the one most similar to the physical one, and also define all the CPU flags that libvirt supports for that CPU. This ensures that live migration will always work, even if the kernel introduces new features. And when those nodes need to be removed and the VMs need to be live migrated to a new generation of CPUs, that will continue to work, because we will configure the CPU in the same way. We expose a common subset of flags, a kind of synthetic CPU type, so that we can always live migrate anywhere within the deployment.
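What Arnaud and Belmiro describe corresponds to Nova's [libvirt] CPU configuration options. An illustrative nova.conf excerpt follows; the model name and extra flags are example values that would have to be chosen per hardware generation, not a recommendation.

```ini
[libvirt]
# Pin guests to an explicit baseline model instead of the host CPU, so VMs
# can live migrate across every CPU generation that supports this baseline.
cpu_mode = custom
cpu_model = Haswell-noTSX-IBRS
# Re-add flags the workloads need that the named model does not include;
# example values only, to be validated against the oldest hardware present.
cpu_model_extra_flags = vmx, pcid
```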
Well, thanks, everyone. We're getting short on time, so I'll just ask one last question to each of you, if you can give a short answer. My question is around the OpenStack release frequency, which Chris mentioned earlier. Would you say that less frequent releases would help you, make things harder for you, or not change anything? You all have very different approaches to upgrades, so I'm curious to hear whether some of you think less frequent releases would help, or whether it would basically not change anything for you. So maybe Joshua first.

Yeah, so for us, I would say that the frequency of the releases doesn't really matter. It's more about what is in a release and what kind of large changes it contains, such as the change to Nova cells; that was a really huge version and a lot more difficult for us to upgrade to. So I would say it doesn't matter, it just depends on the content of the release.

I don't know. Yes, I think having fewer releases would help us, because each time a new release is out, a database change almost always comes with it. So yeah, fewer releases means fewer database changes, so it could help.

Mohammed, your opinion on that? I know you have one. You're on mute. Well, yes, I do have an opinion on that. I feel, personally, that with upgrades, one of the hardest things is that the further you fall behind, the harder it becomes, because you have such a huge delta to catch up on. If releases start happening less often, that delta will naturally just be bigger. So all of a sudden, a release every six months might allow for an upgrade that's a lot less impactful than a release with a whole year of development put into it. Going back to Joshua's comment about the cells change being really painful, I agree, but I can imagine a scenario where we'd have both the cells change and some other painful upgrade all stuck into one release, and that would be even more painful.

Okay, Imtiaz, before dropping off, do you want to give your opinion on that? Sure. Yeah, we would prefer fewer releases and long-term support; I think that's more important than the release frequency, to some extent. The database migrations can hurt. Moving forward, we're trying to upgrade more often and stay as close to master as possible, but I think what we felt we would benefit from is something like three-year support for any given release; that would help.

Chris, you already mentioned your opinion, but you can summarize it now. Yeah, for us, any deployment will take a long time, because we're only going to change one location at a time, and it's going to be between half a dozen and a dozen over a year or two. So if we tried to do every six months, we would be starting the next one before we finished the current one, which is just infeasible for us. So our constraints suggest that maybe annual would work for us. I'm not saying that's true for everybody; I mean, I love hearing that Belmiro and Mohammed deploy daily or whatever it is, but we can't do that. I'm not changing more than one site at a time, for sure.

And Belmiro, what's your take on that? Quickly? Yeah, I'm not convinced that having fewer releases would help us, specifically for the reasons that Mohammed already mentioned. I think the key is to make upgrades as easy as possible, so that big deployments can do them frequently enough without big risk, and upgrading stops being an event.

Okay. So we have plenty of questions in the chat and we could not answer them all, but I'll invite our guest speakers to participate in the chat and maybe help answer those questions. And it's time for us to wrap up this amazing episode. Thanks to all of our awesome speakers today.
I appreciate you all joining us and having this lively conversation. It's always great to have users of OpenStack sharing information about their deployments. On that note, the foundation is running a user survey to collect data on OpenStack users and help drive technical decisions, so if you're an OpenStack user, I invite you to fill it out. Next week, we'll have another great OpenInfra Live episode lined up. Infrastructure at scale, like we've seen today, relies on quality software that is tested before it's deployed, and operators rely on open source CI systems like Zuul for gating, scaling across organizations, and handling cross-project dependencies. So in next week's episode, Mohammed Naser will be back with James Blair, Zuul maintainer and CEO of Acme Gating. Together they will provide an overview of Zuul, discuss the open source CI/CD business case, and present a demo showing what cross-project dependencies look like in production. So mark your calendars, and I hope you will all be able to join us next Thursday at 14:00 UTC. Thanks again to Belmiro, Joshua, Arnaud, Mohammed, Imtiaz, and Chris for participating in this episode, and see you all on the next OpenInfra Live.