Check, check. Good morning, everyone. My name is Jing, and I'm a senior software engineer at Bloomberg. My name is Tyler. I'm a cloud infrastructure engineer at Bloomberg.

There's a story we've all heard. One day, a hare was passing by a tortoise. Because the tortoise was slow, the hare stopped and started to make fun of it, asking, do you ever get anywhere? And the tortoise said, yes, and I get there sooner than you think. We all know how the story ends. The two run a race, and in the end, the tortoise beats the hare. When we upgraded the operating system of our cloud platform, we had the same experience, and this is what we'd like to share with all of you today.

Here's the agenda. First, we're going to give an overview of our private cloud infrastructure. Then we're going to present the different options we had for this OS upgrade. Next, we're going to dive into some of the technical details of the upgrade before we share the results and the conclusions we took away from this journey. At the end, we'd also like to leave a few minutes for questions.

So, overview. When we say "the cloud," we're referring to Bloomberg's private IaaS (infrastructure-as-a-service) platform. It's very similar to AWS EC2: you send an API request to our service, and in return you get a virtual machine with an operating system running on it. Like any other cloud service, our platform is always on, meaning there's no extended downtime or scheduled maintenance window. The most recent version of the platform was built on brand-new hardware with a new architecture, so this was the first time we had to upgrade the operating system of the physical machines for this iteration. Last but not least, our platform is highly scalable. We have eight clusters across different regions, with more than 4,500 physical machines across those clusters. On top of them, we run more than 43,000 virtual machines with more than half a million virtual CPUs, and we're constantly growing.

So why do we need to upgrade the operating system of the physical machines, and why now? The first reason concerns the operating system we used to build this platform and the vendor support we get for it. When we first built this platform, we used Ubuntu 18.04 LTS. LTS stands for long-term support: each of these Ubuntu versions gets ten years of support from Canonical, the company that publishes Ubuntu. However, the standard support only lasts for five years. In our case, Ubuntu 18.04 LTS, as the name indicates, was first released in April 2018, so by the end of May 2023 we would lose standard support for it. This was important for us because Canonical doesn't only provide support for the operating system; they also provide support for more than 25,000 open source packages that they publish along with each LTS release. That includes OpenStack, which is the core of our platform.

The second reason is really forward-thinking: Ubuntu 20.04 LTS unlocks new features and new hardware for us, including a new storage backend technology, a new hardware model, and OpenStack Yoga. So why OpenStack Yoga? This diagram shows the release cadence of the Ubuntu LTS releases and the OpenStack versions each Ubuntu LTS release supports. Before we ran this upgrade, we were running OpenStack Rocky on top of Ubuntu 18.04 LTS, which supports OpenStack Queens, Rocky, Stein, Train, and Ussuri.
So if we want to upgrade OpenStack, we could go to Ussuri safely while staying on Ubuntu 18.04 LTS. However, if we want to go to a version newer than Ussuri, we would first have to upgrade the operating system to Ubuntu 20.04 LTS. So that was the goal of the project. In the future, we plan to go to OpenStack Yoga and Ubuntu 22.04 LTS.

The third reason was that we wanted to push for more proactive maintenance, meaning we don't want to maintain machines only when things are broken. In the same sense, we don't want to upgrade components only when they are outdated. We want to stay on the latest and greatest of everything, and that's one of the goals of the team.

Before I present the different options for this OS upgrade, I need to talk about how we build our machines and how we run releases, so that you really understand what goes into the OS upgrade. We build our platform on top of bare-metal machines. When we get these servers, they only have basic network connectivity. We then use a tool called Foreman to install Ubuntu on these machines, and we also patch the firmware through a self-service firmware installer. Next, we have Ansible and Chef automations. With these automations, we deploy and configure a whole lot of software, including OpenStack and Ceph. This entire process is how we build the machines. The Ansible and Chef automation is our main code base, called chef-bcpc, which is open source on github.com. And when we say a release, it means we check out a new version of the automations and run them to redeploy and reconfigure all the software here.

So far, it sounds pretty straightforward, except for a catch: because this is a private platform in a private network, in this entire process we get zero internet connectivity. You may ask, then how do you get the packages? The answer is an internal apt repo. With our automations, we configure what are called sources.list files in Ubuntu, and these files contain URLs pointing to the internal apt repo. This is how apt, the package management tool on Ubuntu, is able to find the packages: by looking up the URLs in these sources.list files. If you're curious, you may then ask, how do you build the internal apt repo? The answer is that there are external apt repos on the internet, and we have a nightly procedure that mirrors the packages we need from the external apt repos into our internal apt repo, using another open source tool called aptly. This happens every day, and we assign a date to all the packages we mirror on that day. This date is what we refer to as a baseline.

And here's an overview of the topology. It's pretty standard: we have control plane nodes, compute nodes, or what you might call hypervisors, and storage nodes. The only thing I'd like to highlight here is that we also have stub nodes. These nodes run only the bare-minimum common services, such as Calico Felix for networking, internal metrics services, Python, and so on.

And now I can finally present the different options we had for this OS upgrade. The first option was an in-place upgrade. It works very similarly to how your cell phone gets updated automatically. Usually, you get a notification, you press a button on your phone, and the installer runs in the background. The next day, you wake up to new software, a new operating system, while your user data, like your photos, is still intact. It's quite nice. In Ubuntu, it works much the same way. Of course, we first need to prepare the machine for maintenance.
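(For illustration: "preparing a hypervisor for maintenance" boils down to disabling it in Nova and live-migrating its instances away. Below is a minimal sketch using the standard OpenStack CLI, assuming a hypothetical host name hv042 and a client recent enough to support --live-migration; the actual playbook steps are more involved.)

```
# Sketch only: stop Nova from scheduling new VMs onto the host (hypothetical name "hv042").
openstack compute service set --disable --disable-reason "OS upgrade" hv042 nova-compute

# Live-migrate every instance off the host; the disks live in Ceph, so no block migration is needed.
for vm in $(openstack server list --all-projects --host hv042 -f value -c ID); do
  openstack server migrate --live-migration "$vm"
done
```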
And then we would call a command called do-release-upgrade in Ubuntu. After do-release-upgrade does all its magic, we re-enable the services on the machine.

The other option we had was to rebuild the machines. With this method, we basically treat existing machines as if they were new machines and run everything we do when we build them, which is quite a few steps. And just to make things a bit more complicated, as we were planning for this OS upgrade, Foreman itself was being upgraded from v1 to v2. So in order to use Foreman to build these machines with Ubuntu 20.04 LTS, we also needed to move machines from Foreman v1 to Foreman v2. It sounds like a lot of steps with rebuild, and that's why, in my mind, in-place upgrade is the hare method and rebuild is the tortoise method. Of course, given the scale of our clusters, we wanted to go for the faster method, and that's what I tried out in the beginning.

My plan was simple: build an Ansible playbook that does the in-place upgrade and test that playbook on the stub nodes, because those were the easiest nodes. However, as the song "Wonderwall" goes, all the roads that lead you there were winding. The reality was bugs and risks. For example, there were issues where some old configuration files were no longer compatible with the new software after do-release-upgrade, because do-release-upgrade doesn't change your user data, and that includes the config files. There were also issues where do-release-upgrade choked on some of the URLs we set in the sources.list files, because Chef puts double quotes around those URLs. And there were some critical packages that didn't get mirrored correctly by aptly, due to a bug upstream. We fixed all of these issues in the playbook, but they were not the most concerning issue.

The most concerning issue was inconsistent apt baselines. Basically, I tested this playbook on our stub nodes and everything worked fine. But when I moved the playbook to our virtual environment, the upgrade just failed there. What happened was that our virtual environment was built with an older baseline, which contained older packages that broke the dependency chain of the release upgrade. This was especially concerning to us because it was very likely to happen on our cloud clusters too: we build our clusters over time with different baselines, so very likely this would also happen there. So the question became, is there a way for us to downgrade and roll back cleanly? Unfortunately, there was not. If you run do-release-upgrade to go to Ubuntu 20.04 LTS and it fails, the only thing we could do was rebuild the machine.

With all these issues in mind, I started to seriously reconsider the rebuild. Although it was a slower method, the risks were much, much lower, because we knew the procedure very well. At the same time, it became an opportunity to move to the new Foreman and a new kernel, to reboot everything, and of course to patch the firmware, and it also prevents the configurations from drifting further. And if we think forward again, it can also unlock new technologies that require a machine rebuild, such as Secure Boot. So in the end, we decided that the slower method had greater benefits. That's why the tortoise beats the hare, and we decided to go for rebuild.

However, rebuild is still much slower. If we're going to spend several months upgrading everything, how do we minimize disruptions to our end users and our coworkers?
So for the end users, we live-migrated all the VMs off their hypervisors before the upgrade, and we upgraded one hypervisor at a time, so that if something broke, we wouldn't impact too many clients or too many VMs at the same time. The procedure was pretty successful and went largely unnoticed by users, except for some VMs running older versions of Red Hat, which we'll talk about in more detail in the technical deep dive.

For our coworkers, this was just one of the many projects we had at that time, but this project required more changes to our code base. Of course, we could have just created a development branch for Ubuntu 20.04 LTS, but that would have imposed some operational challenges on our coworkers. For example, when you're checking in a change, you might need to commit the same change to two different branches. And when our release manager does a release, he or she would need to first figure out which OS version a certain machine is running and check out the right branch for it. We didn't want to do this to our coworkers, and that's how we decided to support both OS versions in the same Git branch of the Chef and Ansible automations. That sounded a bit scary, because we would need to add a lot of if statements. You may ask, wouldn't that turn your code base into spaghetti? Well, it didn't, because of something I didn't tell you just now. Before we ran this OS upgrade, we had already upgraded a lot of the software components, including a new baseline and a new kernel. For example, for OpenStack, we upgraded from Rocky to Ussuri. Why Ussuri? If we look at this diagram again, Ussuri is special in the sense that it's supported by both Ubuntu 18.04 LTS and 20.04 LTS. So by upgrading OpenStack beforehand, we didn't need to worry about the OS upgrade breaking OpenStack. The same logic applies to the other components here. Basically, we were already running 20.04-compatible versions while we were still on Ubuntu 18.04 LTS. From here, I'm going to hand over to Tyler to talk about the technical details of this OS upgrade. Thank you.

So Jing did an excellent job describing the problem statement and our architecture, so we're going to look a little bit into the actual playbooks. Before I get there, though, I do want to say that the team that developed and executed this workflow is a very small team, and we have a lot of other initiatives going on. It's not like we had eight people working on OS upgrades. So as a result, we had to make sure that the work was not only low-touch and automated, all the things you would expect, but also reusable. A lot of the components we developed for this workflow we still reuse today for other things: taking machines offline, putting machines back online, updating the firmware. We can even just rebuild a machine if we need to because one becomes problematic. So this was all reusable code.

But of course, the most important thing, and what you think about when you think about the cloud, is stability. You just ask the cloud for a VM, and it must give you one, and that VM must continue running. So stability was paramount for us; our businesses are entrusting our cloud with an increasingly critical amount of production workloads. So we took each piece of these playbooks, each step of our maintenance process, and ran it through again and again and again on our lab hosts. And in doing this, we found a lot of problems. One of our big problem areas was live migration.
We want to clear off hypervisors so that we can actually prepare them for maintenance, and when we did that, we found issues that spanned libvirt, Calico, and things we introduced ourselves. One of the biggest ways we shot ourselves in the foot was an MTU issue. As a parallel initiative, because again we're working on more than one thing at a time, we had scaled up our Neutron networks from an MTU of 1,500 to 9,000. One of the things we had anticipated in our live migration workflow was to maintain the VM's MTU across the migration, so when the tap gets created on the destination, we make sure that the MTU stays the same. One of the things we did not anticipate is that when you restart nova-compute, it will go and set all the taps to whatever the Neutron network currently has, which was 9,000. So some of the VMs were basically getting their MTU stepped up, and then we would live-migrate them, and the migration would push them back down to what it thought they should be, which was one of those hard-to-detect issues. So again, a lot of this was all prep work, making sure that we were really ready. And a lot of these issues we found by actually going through the source code, function by function, and building a kind of flow chart to understand exactly what was happening. This was a very hard problem to solve, and this is where we learned a lot of OpenStack's innards; we learned the difference between RPC cast and RPC call, and those kinds of things.

There's still a lot of room for improvement that we know about and that we want to contribute back upstream. One example is that we're big users of Calico, and one of the things Calico does is run a management coroutine, and that management coroutine is currently time-sliced with the actual Neutron API threads. One thing we'd really like to do is separate that out so that it's no longer part of the same piece.

And as much time as we spent with OpenStack making sure that the live migrations were smooth, there was a similar investment of effort in Ceph. All our VMs use Ceph for the underlying block storage, and anyone who uses Ceph knows that it's very hard to maintain a high quality of service and consistently low latency. So some of the tests involved just stopping and starting the OSDs. Again, thinking tortoise versus hare, we should just be able to, in Ceph parlance, set noout, tell it not to move the data, take the node offline, rebuild it, put it back, and life's good. However, if you do that and you measure the IO latency with fio, you'll see huge latency spikes when you stop the OSDs, or at least we did on Ceph Octopus. When we started digging into the code, what we observed was that Ceph Octopus introduced something called fast OSD shutdown, because there have historically been a lot of bugs in Ceph with clean, graceful shutdowns. These fast OSD shutdowns rely on other novel features of Octopus, such as timeouts for releasing locks. As a result, you're essentially using the failure-detection path to handle the stopping of an OSD. That's all fine, but when you're relying on timeouts to do these things, it's going to adversely impact your tail latency. So what we would see is that our tail latency would go from tens to hundreds of milliseconds to multiple seconds, because some clients were hitting these timeouts.
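(For reference, the "tortoise" flow being exercised in these tests is roughly the standard Ceph maintenance procedure sketched below; this is a simplified outline, not the actual playbook.)

```
# Tell Ceph not to rebalance data while OSDs are down.
ceph osd set noout

# Stop all OSD daemons on the node that is about to be rebuilt.
systemctl stop ceph-osd.target

# ... rebuild the node, then bring the OSDs back up ...
systemctl start ceph-osd.target

# Once the cluster has settled, allow rebalancing again.
ceph -s
ceph osd unset noout
```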
So one of the things we did is we just looked at the source code, and a fellow from Canonical had actually bumped into the same issue, and there's a flag you can turn on: you just tell the mons that the OSD is going down, so they have an opportunity to observe the fact that the OSD is going away. This prevents the cluster log from getting flooded with failure messages and slow-IO warnings and all kinds of other things. So there are big differences you can make with just small tweaks.

As far as the actual process goes, you'd think the first step is to rebuild the OS, but that's not what we do first. The first thing we actually do, in conjunction with our hardware engineering team, is redeploy the firmware on the host. This is something we designed in-house, and it gives us very tight control over the firmware deployment process. Our hypervisors hosting development VMs can get firmware that's a little more aggressive and newer than the hypervisors hosting production workloads. We have full telemetry on the rollout speeds. We can do multiple reboots, and if we need to do a cold reboot versus a warm reboot for some firmware, that's all captured in this process. And it's all reliant upon PXE boot: when we go to reboot a node, we use our DDI, or IPAM, solution to steer which way we want it to go. It's just like a train track. It goes down the firmware update path, goes through a few reboot cycles, and then we push it back to the OS path, which is where we proceed with the next step, the actual deployment of the OS.

As Jing mentioned, we use Foreman. So that's a whole bunch of templates and configuration that we check into Git and deploy, which builds all of our OS templating. The other thing worth mentioning is that when we totally tear down these machines for the OS rebuild, there are certain advantages you gain from doing a full rebuild. One of them involves the FTL: you can go and do a low-level format of your NVMes. If you're a user of Ceph, Ceph does not send TRIM to your SSDs, if you've heard of what TRIM is. So formatting them here gives you an opportunity to actually clear those out and get a little bit of extra performance. We haven't benchmarked it, but it just goes to show that doing things fully and cleanly can be advantageous.

And then the last step is the deployment of the software stack. This is what Jing referred to as the Ansible and Chef automations; we just go and run our automation. This is the easy step. I'm sure everyone has something similar somewhere, where they're just converging the machines. Other than it being a fairly standard process, the only thing worth mentioning is that we push the BIOS settings after we update the firmware, because we found that some firmware goes and changes the BIOS settings when it's applied. So at this point, we've given the machine the opportunity to reboot, we have all of our fresh configurations, and anything that changed in the kernel is going to be picked up through this final reboot cycle.

And then we tie all this together. We take all these components that we've developed, to migrate hypervisors, to drain Ceph OSDs, and they all know how to do their individual pieces. So we can go and write playbooks that each do just that one thing, and then each one calls the common three steps that you just saw in the previous slides.
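(Schematically, and with hypothetical playbook names since the real ones are internal, the composition looks something like the following: a per-role "take out of service" step, followed by the three common steps described above.)

```
# Sketch only: each node role supplies its own drain step...
ansible-playbook drain-hypervisor.yml -l hv042    # or drain-ceph-osds.yml, drain-rabbitmq.yml, ...

# ...followed by the common steps.
ansible-playbook update-firmware.yml  -l hv042    # PXE into the firmware path, reboot cycles
ansible-playbook reimage-os.yml       -l hv042    # Foreman-driven Ubuntu 20.04 install
ansible-playbook converge.yml         -l hv042    # Chef/Ansible software stack, BIOS settings, final reboot
```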
And so this way we've basically defined how to rebuild any node in our infrastructure. We can also go a step above this: today it's just used for rebuilding, but we can use parts of it for decommissioning. And the other thing we can do is increase the concurrency. It's a cattle farm; there are lots of hypervisors out there. If we want to rebuild many at once, as long as we're not taking too many cattle out of the field at the same time, we can rebuild multiple hypervisors at once, and the playbooks are perfectly capable of doing that because they treat every node as independent.

So, the timeline. In August 2021, we started to discuss how we should upgrade the operating system and decided on the roadmap for doing it. This was also when we started to upgrade a lot of the software components. In April 2022, we started to experiment with in-place upgrade versus rebuild and decided to go the rebuild route, so we built the automations Tyler just mentioned, and in October 2022 we were ready to run the automations on our prod clusters. It took about five months for us to upgrade all the physical machines except for the storage nodes. Unfortunately, for the storage nodes, we were blocked by a vendor, and we didn't get to upgrade the storage nodes in our clusters until May 2023, this year. But that's something we're going to be working on once we get back from this conference.

So, takeaways. The first takeaway, I think, is that investment in research and experimentation is always worthwhile. Without all the testing we did for live migration and Ceph, we would never have been able to find those issues and have a very smooth upgrade in the prod clusters. The second takeaway is that the fastest option may not always be the best, especially with infrastructure, because stability always takes high priority. As the saying goes, slow and steady wins the race. That's why the tortoise beats the hare, and that's the biggest lesson for us. From here, we're going to open the floor for questions. Go ahead.

I had a question about controllers. Did you also apply the same sort of rolling process for your OpenStack controller nodes? Did you have to do anything special about the distributed databases, message queues, and other things?

Yes, the same set of playbooks was used for the control plane nodes as well. And for services like RabbitMQ, I think we drain the queues. We drain them, yes. Do you want to add to that? No, it also works for the control plane nodes; that's the short answer. It's the same idea, where we're essentially defining how we take a node out of service. It might be this way for Ceph, it might be that way for OpenStack, it's this way for Rabbit, and then we can go and stitch it all together.

How long did it take to upgrade, sorry, rebuild a node, roughly? The rebuild, I think, took about 10 minutes per node. Did you... The convergence? Yes, when I tested it in my experimentation, that was how long it took. I think they meant the full rebuild, though. Yes, the full rebuild, where we delete and redeploy the OS. Do you remember how long that took in the end? Yeah, for the hypervisors, the full redeployment took anywhere from one and a half to two and a half hours. It depends on the vendor, how out of date the firmware was, how many times we needed to reboot it, and things like that. That's including the drain time? Yep. Okay, thank you. Yeah, because my experimentation happened on the stub nodes, and they don't have too much going on, so that's why it's much quicker there.
Why did it take five months? First of all, we have a lot of machines, and we also had holidays and what we call lockdowns. So that's why it took five months, and I honestly think that was a pretty short amount of time.

I was wondering if you have any observability dashboard to check whether all the compute nodes or the OpenStack nodes are running the latest version. How do you track that you are running the latest patches and latest upgrades? We do. We have an internal system that keeps track of the records of all the machines. Although I think for us, it really depends mostly on the version of the automations we use, because Chef mostly manages the configurations, right? So you can also check which version of the Chef recipes we were using in order to determine what's being run on a certain node. You have a question?

So the question was, how did we address the MTU misalignment? By the way, that was the Red Hat VMs problem I mentioned in my part. Yeah, so that was actually a fairly costly fix. We essentially inserted a supersede MTU statement into the guests. So we worked for a while to actually make an invasive change within the guests where we were basically forcing the MTU down. We checked to make sure that the machines rebooted, and we left them there for however long to make sure everyone had an opportunity to reboot their VMs. And then we went and patched the MTUs on the taps back down. At that point we know which MTU it is, because that's the one in libvirt, and when we live-migrate it, that's the one it keeps; it's always going to want to stay wherever libvirt is trying to push it to.

So the question was, did we do anything to make sure VMs land on nodes that were recently rebuilt? No, we didn't do anything special. That was still decided by Nova's scheduler.

The question was, did we do anything special in terms of networking in the virtual environment? It was completely different. We used Vagrant, and we set up the network with Vagrant and other services, so it's actually quite different from the prod build. We also use Calico, so we're fully layer 3 routed and we're using BGP. We can see, before we put a node back up, that BGP establishes, to make sure it's all okay. And once we drain a node, we don't have traditional Neutron routers; every machine is a router. It's a layer 3 cloud. So you don't use Neutron at all? We still use Neutron; it's just that we use the Calico driver for Neutron.

Okay, I've got a question. You mentioned that, sorry, it's too loud. You mentioned that you have to keep the IO steady when OSDs change, right? And I'm wondering how you keep it at an acceptable, usable level. We all know that once OSDs change, like some OSD goes down or you have too many instances open in that Ceph pool, the IO will drop fast. So how can you make sure that it won't drop that fast and keep it stable? So regularly, when we maintain storage nodes, we would move the data on that node to other storage nodes, and what we talked about just now was an attempt to make that process faster by just taking out a node without moving the data. Usually, that's how we maintain the storage nodes. Okay, got it. So in the production environment, would you prefer to use NVMe for the OSDs, or do you weigh that against HDDs and other kinds of disks to get acceptable IO at an acceptable cost? I mean for the Ceph stack. I didn't quite understand that question.
Although I think it's worth pointing out that we have a lot of machines, so if you take down one of them, it doesn't have a huge impact. Correct, yeah. Again, we're all layer 3 networking, so we have tons and tons of network bandwidth, enough for us to set noout and bring down 20 OSDs. It's a very small impact because we're not moving data; we're just basically saying these replicas essentially won't be healed for a while. We can bring them up slowly and let the whole cluster recover. And yeah, we use NVMes, and we have lots of them. Got it, thank you.

Yeah, it's kind of a follow-up to that. We're also looking at updating the OS for our Ceph nodes, and other groups I've talked to have fully drained the data off their nodes, rebuilt them, and then rebalanced them back in. Were you successful in keeping the OSD data in place during the OS rebuild? Well, we're still deciding. We ended up starting with the noout approach; we have both in our playbooks, and we're still debating which way to go. The noout approach has issues with it. Octopus, I think, has some bugs where it will corrupt the OSD, and then you try to bring it back up, and there's no fsck, and you're kind of juggling balls. So we also have the option of draining them, and again, because we have NVMes and because we have layer 3 routing, draining isn't that impactful to us. It's actually fairly decent. Mm-hmm, mm-hmm.

Hi, from your discussion, it sounds like you went from Rocky to Ussuri, correct? Correct. How did you maintain API compatibility over the five-month period, jumping three releases, which is outside the traditional N minus one? So that happened way before the OS upgrade. The idea was that we upgraded OpenStack before the OS upgrade, and during the OS upgrade we didn't change the version we used for OpenStack. But you stepped through the normal upgrade scheme, so you did have to go through Stein and Train and all the way through with your control plane, and then you did the OS upgrade? Yes. Great, thank you.

Thank you, everyone.