Hi. Sorry, everyone, when you've got ears as big as mine, you feel a little bit self-conscious with these things. My name's Cass Halcroft. I work at Bloomberg, on the cloud infrastructure team, and we run numerous OpenStack clusters there. The actual production code we use, which we'll look at a little bit, is available on GitHub, so if you go to github.com you can see exactly what we're running in production, soup to nuts. Today we're going to talk about how we went about upgrading all of these live production clusters. I've been asked this a couple of times: we're sponsoring the DevLounge, so if you want to see a Bloomberg terminal doing its thing, you can go down there and see one. And since I've been asked a few times, I put up a slide in case you haven't heard of Bloomberg or are a little unsure about what we do. If you go into almost any large financial institution, you'll see a bunch of very high-octane people, traders and risk people and so on, clustered around screens like this, poking away and making trades. That's the Bloomberg terminal, our core product, or more properly the Bloomberg Professional service, which gives you essentially any financial data you want and a lot of analysis on top. If you go down there and have a look at what we have, it's really quite amazing. In addition to and supplementing that, we have a lot of other products: Bloomberg News is probably in most of your hotel rooms, and there's Bloomberg Businessweek. We're also in a lot of other verticals, particularly law, sports, and government, so we're branching out into other areas. Just to give you an idea of the kind of numbers we're dealing with: we have something like 22 million instant messages a day and a quarter of a billion messages a day, we suck down about 10,000 feeds (I think that number is actually a little out of date now), and we deal with at least 50 billion ticks a day.

So what is the Bloomberg Clustered Private Cloud, or BCPC for short? It's part of a much bigger cultural shift at Bloomberg towards a DevOps culture. We all know what that means: reducing developer friction for product deployment, reducing turnaround time to market, all these good things, with machine provisioning as code an essential part of that. We're the private cloud component. When we were designing this, we had some very specific design goals in mind. We wanted a completely automated install of the entire cloud stack, not just OpenStack but everything, soup to nuts, right from provisioning the actual hardware boxes through to monitoring and alerting and what have you. So if you go and look at GitHub, that's exactly what you get: everything we run, the entire stack. Each cluster in itself should be highly available, active-active. We wanted to keep it as simple as possible; we don't want terribly complex interdependencies where things can go wrong, so it's a simple design. We also try to keep it as homogeneous as possible. What I mean by that is there aren't lots of special dedicated nodes running just MySQL or just Nova or something like that. We have one stack, and then we just slice that stack to add compute power; we'll have a look at that. And it was also designed to scale purely horizontally, from a single node up to many hundreds.
So you can run BCPC on one node on your laptop, or three nodes on your laptop, or you can run it across many hundreds of nodes, and it's exactly the same architecture; you don't change anything apart from provisioning more nodes. One thing that's often overlooked in environments such as ours is that it has to be completely deployable in the absence of any internet whatsoever, in a totally isolated environment. So we have recipes to set up mirrors with everything you need to install this thing in total isolation.

As I said, it's not just OpenStack, and it's not just Ceph plus OpenStack; it's the whole lot. This is a list of the technologies we use, and I think it's fairly exhaustive. Rather than go through them in great detail, you can just go and look at our code on GitHub and have a play with it. To represent it graphically, and I don't want to go through this in too much detail, but just to give an idea of what we're trying to upgrade: we have the hosts at the bottom running Ceph, our storage layer, which acts as both S3 object storage and block storage. We have a MySQL Galera cluster and a RabbitMQ cluster, and then we run the OpenStack services in a shared-nothing architecture, fronted by HAProxy and keepalived. I said we have one stack; obviously we don't run hundreds of Nova schedulers. For adding compute we have a reduced stack where we essentially just slice off the top layer of this and run only Nova compute and Nova network, and keep everything else the same. And obviously we don't run MySQL there either. The compute nodes not only contribute Nova compute, they also contribute their disks to the Ceph pool. So every hardware node, if it has N disks, will have one root partition for the OS and then contribute N minus one disks to the Ceph pools.

Okay, a little note about our development. We do a release cycle that tries to match OpenStack's release cycle; we name our releases after theirs. Icehouse is our current production release, we're working on Juno, and each one comes out, hopefully, a few months after the corresponding OpenStack release. One thing to note is that you can run an entire BCPC stack on your laptop; this laptop's running one right now, spun up inside VMs. It's a three-node cluster: one head node with the entire stack on it, two compute nodes, and a bootstrap node, which we'll talk about in a bit. So you can go to GitHub and just spin one of these up. That's how we do development.

As for what we actually run in production: we have lots and lots of individual, isolated clusters. Rather than one large OpenStack cloud, we have lots of smaller ones that live in different networks in our environment, and also, of course, in different data centers. Each network zone is mirrored in at least one other data center, and by policy every app that runs on us at the platform layer has to be able to run in more than one data center, so it can fail over to the other side, no problem.

So how do we actually go about deploying it? We have a deployment node, just a bootstrap node, from which we can push code into our Chef server, control Cobbler, which we use for provisioning hardware, push new packages into our apt mirrors, and also talk to our out-of-band management using IPMI tools.
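To give a flavour of how those pieces get addressed from the bootstrap node, here is a minimal sketch of what an Ansible inventory for one such cluster could look like. The host names and group names are made up for illustration; they are not taken from our production setup.

```yaml
# Hypothetical inventory for one BCPC cluster, driven from the bootstrap node.
# Host and group names are illustrative only.

# Chef server, Cobbler, apt mirrors, and IPMI access live here
[bootstrap]
bcpc-bootstrap01

# full OpenStack control plane, Galera, RabbitMQ, Ceph mons
[headnodes]
bcpc-head01
bcpc-head02
bcpc-head03

# nova-compute, nova-network, plus their disks as Ceph OSDs
[worknodes]
bcpc-work[01:12]
```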
So how do we go about upgrading this? Our policy is that we only ever upgrade one network segment in one data center at a time, so we always ensure that that network segment is still available in at least one other data center. And because of our policy with our apps, it means they can just fail over to that other data center and keep running. But we do try to keep any downtime to an absolute minimum; our target is roughly one hour to upgrade a cluster and have the control plane back up and running. One thing that's important to note is that we have a lot of data on our Ceph clusters, and we can't just take that off and put it on tape every time we want to do an upgrade. We have to keep all our Ceph data live on the cluster and keep it safe; we can't just blow everything away and start again, that's not really an option for us. We replicate three times within a cluster; we do not replicate across the river, to another data center, or anywhere else.

So we had various options for the upgrade. We could try a rolling upgrade; when we did Grizzly to Havana this wasn't really an option, so we didn't design our upgrade procedure around it. We could just do an apt-get upgrade, but we make so many architectural changes: when we went from our Grizzly stack to our Havana stack we changed a lot of other things, not just OpenStack. We made improvements to our own architecture, improvements to our Ceph crush maps, et cetera, so apt-get upgrade isn't going to get us there either. So rather than doing a lot of tedious mucking around with apt, uninstalling and reinstalling things, we just tend to wipe a host and reinstall it. But we've got a lot of clusters to do. Just out of curiosity, how many people here have been personally responsible for upgrading OpenStack clusters? Okay. Within a single upgrade cycle, how many independent clusters have you upgraded? Raise your hand if you've done more than two. More than five. More than ten. Okay. All right. So it has to be automated, right? You're not going to do this on the order of ten times by hand.

So we developed a workflow in Ansible that runs through a cluster one, two, three, up to five nodes at a time. The number of nodes you do at any particular moment depends on what they're doing: head nodes I'd probably do one at a time; worker nodes I'm a bit more confident about, so I'll do three at a time, and depending on your Ceph crush map, if you feel comfortable doing a whole rack at a time, you do a whole rack at a time. One important thing is that you have to run a bunch of functional checks before you start to upgrade a host and after you're done; the possibility of cascading failures is really quite scary. So if you've got all your checks in place and all your automation in place, you just sit back and relax and watch it roll through the cluster, right? Right? No. That's the long-periods-of-boredom part: 2 a.m., sitting there, watching your Ansible scripts running, running, running.
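To make that batching and those before-and-after gates a bit more concrete, here is a minimal sketch of the shape such a play could take, using the hypothetical inventory groups from earlier. The batch size, the health checks, and the thresholds are illustrative assumptions, not our actual production workflow.

```yaml
# Rough sketch of a batched reimage play with health gates.
# Group names, batch size, and check commands are illustrative.
- name: Reimage worker nodes a few at a time
  hosts: worknodes
  serial: 3                      # head nodes would run with serial: 1
  any_errors_fatal: true         # stop the whole run at the first failure
  tasks:
    - name: Wait for Ceph to report HEALTH_OK before touching this batch
      command: ceph health
      register: ceph_health
      until: "'HEALTH_OK' in ceph_health.stdout"
      retries: 120
      delay: 30
      delegate_to: "{{ groups['headnodes'] | first }}"
      run_once: true

    - name: Check that Galera is fully synced
      command: mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
      register: wsrep_state
      failed_when: "'Synced' not in wsrep_state.stdout"
      delegate_to: "{{ groups['headnodes'] | first }}"
      run_once: true

    - name: Check that RabbitMQ still reports a healthy cluster
      command: rabbitmqctl cluster_status
      delegate_to: "{{ groups['headnodes'] | first }}"
      run_once: true

    - name: Wipe and reinstall the host
      debug:
        msg: "IPMI off, re-enable PXE, chef-client -- see the later sketch"

    - name: Wait for Ceph to settle again before the next batch
      command: ceph health
      register: ceph_health_after
      until: "'HEALTH_OK' in ceph_health_after.stdout"
      retries: 120
      delay: 30
      delegate_to: "{{ groups['headnodes'] | first }}"
      run_once: true
```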
A little note on state: we're nuking individual hosts, but we're not going to wipe out the OSDs. The OSDs stay where they are, on separate disks, and they live through the upgrade. Our cookbooks are written so that when you bring back the root partition, they reincorporate those OSDs into the Ceph cluster, and that all works rather nicely. So Ceph maintains our images, our volumes, our S3 objects, all our OpenStack data, and MySQL Galera, of course, holds the OpenStack databases. As I said, Ceph generally does a pretty good job of reincorporating those OSDs without trouble. MySQL's databases, though, live on the root partition, so when you redeploy a head node you lose that MySQL install and have to do an SST back into the node you've just deployed, which adds a little time depending on how big your database is.

So, to detail the plan and what actually gets automated: for each step, this is the action, the impact it has on our tenants, and roughly how long it takes. First of all, at time zero, we take down the cluster VIP, the front layer. That basically means that from that point our tenants can no longer control their VMs; the VMs are still there, still running, and absolutely fine, but tenants can't spin new ones up, destroy them, et cetera. We also, obviously, disable Chef at that point. After that, we upload all our new code, whatever we're targeting, into Chef, deploy new apt mirrors if we need to, et cetera. Then we upgrade Ceph, if required; not all of our upgrades require a Ceph upgrade, but the Grizzly-to-Havana one did, for example. That's usually a pretty painless process: you upgrade Ceph, bounce all your mons and then all your OSDs, and you're pretty much okay. Then we upgrade MySQL, which usually goes pretty smoothly. This thing is clustered, of course, so you want to make sure that when you bring up the new host it's on the same version as the nodes that are keeping the data for you, and you have to do that before you actually start to reimage the head nodes. And then you start re-PXEing and re-Chefing your head nodes. Depending on how big the cluster is, how big your database is, and so on, you're probably about 60 minutes in before you get your first head node back. At that point you've got a control plane available to your tenants again. You don't have any compute yet, unless of course Nova network and Nova compute happen to be backwards compatible.

The details of what actually happens in that last stage: you IPMI the server off, re-enable PXE booting, and Chef in the new OpenStack control plane. The recipes, like I say, reincorporate the OSDs; the node rejoins the MySQL cluster, rejoins the Rabbit cluster, everything's great, and then you do your DB sync and the world is a fabulous place. The compute nodes are a little bit easier: once you've got a fully operational control plane, even if it has nothing to control yet, you can start to rip through your compute nodes and upgrade them in exactly the same way, and they'll just rejoin the cluster. At that point your tenants have new compute nodes to deploy onto.
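That IPMI-off / PXE / Chef sequence is the same shape for head nodes and compute nodes. As a rough illustration, here is a sketch of what the per-host tasks could look like when driven from the bootstrap node; the Cobbler system names, IPMI credentials, and timings are placeholders, not our production code.

```yaml
# Sketch of per-host reimage tasks, run from the bootstrap node.
# Host names, credentials, and timeouts are placeholders.
- name: Re-enable PXE boot for this host in Cobbler
  command: cobbler system edit --name {{ inventory_hostname }} --netboot-enabled=1
  delegate_to: bcpc-bootstrap01

- name: Power-cycle the host over IPMI so it PXE boots and reimages
  command: ipmitool -I lanplus -H {{ ipmi_address }} -U {{ ipmi_user }} -P {{ ipmi_password }} chassis power cycle
  delegate_to: bcpc-bootstrap01

- name: Wait for the freshly imaged host to come back on SSH
  wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    delay: 300
    timeout: 3600
  delegate_to: bcpc-bootstrap01

- name: Chef the node back in (reincorporates OSDs, rejoins Galera and Rabbit)
  command: chef-client
```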
Now, if tenants are running VMs on those compute nodes and haven't terminated them, those VMs will get shut down when we roll through their node. But of course we run all our stuff boot-from-volume, so when the compute node comes back up they can basically take that volume and restart their VM if they need to. That's essentially what we're saying here: VMs get shut down when we happen to roll through their node if the tenant doesn't terminate them first. We do tend to encourage people to terminate their VMs; it's just cleaner, and they can bring them up on the other side of the river; for our apps, it's not a particularly big deal.

So we're chasing through the cluster, doing one head node at a time and three worker nodes at a time, and before every iteration, before we move on, we've got to check that everything's okay. Some examples: has Ceph recovered, has MySQL fully synced, has Rabbit rejoined, are there any errors in the functional checks (the kind of gates sketched in the play earlier); only then do we move forward. And at no point are we risking any of our data.

As I said, this all works great. But if your machines have been up for a year or so, they're hot, and when you bring them down some of them don't come back: some disks fail, whatever. At that point you have a choice to make. You've got a node with, say, a root volume that isn't coming back, so you either bring that host down, mark it down and out in Ceph and let it migrate the data off, or you fix it in place and move on. You've just got to be a little bit aware of cascading failures: maybe there's something wrong in your new recipes, and if you kept on going you'd end up with a rack and a half of completely broken nodes, and now you've got data problems. So I generally stop when I hit the first or second error, go back and check I'm not doing anything silly, and then make a decision about what to do with that particular host, on a case-by-case basis. There are always problems.

On the Ceph side: as I said, we're not just upgrading OpenStack, we're upgrading Ceph, and we're also upgrading our architecture. We make improvements to our crush maps, improvements to how we lay out the data on the disks, et cetera, and that obviously gets Chefed in; when we bring up a new head node, it's going to apply those changes to the crush maps. There are some things you have to be really careful of here. For example, when we went from Dumpling to Firefly, there's this rather innocuous little tunable that we just put straight into our crush maps, and it says right there in the documentation that it causes every single byte on your cluster to move. Which is great when you've just done one head node, you've got a control plane back and no compute, and now your entire cluster is trying to rebalance. So you have to be quite careful about how you manage that. The way we do it now is we have switches in our recipes where you can basically say: I know this crush map is not how you ultimately want it to be, but just leave it alone for now and we'll come back and do it later. Let's get everything back up and running, and then we'll apply those crush map changes, those tunable changes, later.

The database: obviously you've got to back up your database, right? It's the first thing you do. But when you're doing ten or so clusters, at some point you're going to have a database drop when you're halfway through a big sync, and then you have to blow it all away, reload the database, and start again.
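As a trivial illustration of that first step, this is roughly what a pre-upgrade backup task could look like; the dump path, the host it runs on, and the credentials handling are placeholder assumptions, not our production tooling.

```yaml
# Sketch of a pre-upgrade database backup, taken from one head node
# before the control plane comes down. The dump location is a placeholder.
- name: Back up all OpenStack databases before starting the upgrade
  shell: mysqldump --all-databases --single-transaction | gzip > /backup/pre-upgrade-{{ ansible_date_time.date }}.sql.gz
  delegate_to: "{{ groups['headnodes'] | first }}"
  run_once: true
```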
So the database is another thing that has gone wrong a few times. Then there are the deprecation warnings you get, I don't know if you all got these going from Grizzly to Havana; those will come back and bite you in the butt later, so you'd better take care of them during the upgrade process. And there are also some BCPC-specific changes: for example, in the Grizzly-to-Havana upgrade we went from one large pool for block storage to SSDs and HDDs separated out, so you have to go into the database and make some changes related to our own architecture.

So that is essentially how we do it; we call it the rolling nuclear update. We just roll through the cluster, blow away one node at a time, and maintain the state of the data as we go. Because of the policy applied at the app layer, we can take that short outage at the control plane level in any one data center and let it fail over to one of the others. It actually works really well, and I think it will continue to work; hopefully in the future we'll have backwards compatibility, so that the VMs that are already up when we bring the control plane back will still respond to control calls and we can do some live migrations, let's hope. Doing it this way, we don't expose any of our data, any of the Ceph pools, to potential data loss, as long as we do all those functional checks. And of course we completely automated it, because there's no way I'm doing this ten times by hand. The only downside is that you take a short outage at the control plane level, and of course every single VM has to be restarted; if you're in an environment where you don't have this fail-over option, you'd essentially have to go back and restart those VMs as you go through the cluster. So that's the downside. That's it. Any questions?