I think this is going to be everybody. Thanks for making it to 9 a.m. after last night. I don't know what everybody else was up to last night, but I appreciate it. This is "Xen Security Advisories Are Full of VENOM," which is a talk about how we had to reboot the Rackspace Cloud. All of it, a couple of times now, unfortunately. So I'm Joel Priest. I'm an operations engineer at Rackspace. Been there about four years now. I'm Ben Burdick, also an operations engineer at Rackspace. Been there since about 2008. And this was the look on my face when I learned we had to reboot everything. Yeah. I mean, you'll see a little bit later, we'll have a little callback to this, but we pretty much orchestrated most of this remotely. It went on for a week, basically, all day, every day. Most people were working from home, a lot of video conferencing, so that's pretty much all I saw of Ben for about 108 hours. I think it's like one hour per XSA is our average now. So what are we going to talk about today? XSA-108: that was the first one we had to deal with. What was it, why did we have to patch it, and what were its potential threats — why we felt this was such an important thing that we basically sacrificed our lives for a week to get it out. Developing the patch. This was a new thing for us at Rackspace. We use XenServer, and most of the hypervisor development is what we have Citrix and our XenServer relationship to handle. Because this particular problem was in the open source Xen hypervisor itself, it was a bit of a different process, and we had to kind of figure out ourselves how to actually patch the issue and get it done before the embargo was lifted. Tracking with Galaxy — not that Galaxy. We use a lot of Ansible at Rackspace, and there is a Galaxy project for Ansible; we have a name collision, we just happened to pick the same name. Sorry about that if it's confusing. And patching the cloud: how did we actually do that with Ansible? And host triage, because you can make the greatest playbook in the world, but if a host doesn't come back, you still have to fix it. And lessons learned. We gave an original version of this talk back when we had only done one reboot. We've done a couple of them now, and I am happy to say that all the things we'll tell you we improved upon are things we said the first time. When we did it the first time, we were like, this is what we would do differently if we have to do this again. And we actually followed through on all of those things, which felt pretty good to me. So, the reboot apocalypse. That was my name for it. This was the actual email I sent out. Are there any Rackspace customers in the room? Then you got this email, or you saw this notification. It was basically just what we sent out saying, hey, this is going to happen, please prepare your instance for rebooting. Because we don't typically have guest access, we can't restart your services for you. If you're on the managed cloud, we can take care of that for you. So, impact. That was basically 3K or 12K of out-of-range memory. Basically, all the XSAs that we've had to deal with recently are more or less overflows — uninitialized variables that don't get reset. At first, we thought it was just going to be a potential host machine crash, which would have been bad. But then we found out that you can actually read memory you shouldn't have access to, which is worse. But it was very, very small pieces of memory, and it was not contiguous.
And it was constantly changing as other memory on the hypervisor was being written. So could you actually use it? Obviously, you shouldn't be reading memory that's not yours, but in practice we couldn't really get anything useful out of it. We were exploiting it in the sense that we could see memory we shouldn't; we weren't exploiting it in the sense that we could find anything feasible to use. But we still felt it was severe enough that we had to deal with it. And that's just the XKCD about the Heartbleed vulnerability, which is kind of similar in that you're basically just asking for memory and able to see things you shouldn't be able to. This is the entire diff that we had to patch. This is what made the Rackspace cloud have to be rebooted, as well as a big portion of lots of other public clouds, including a large portion of the AWS cloud. And if you're having trouble seeing the actual difference, it's those two characters right there. That's it. Two threes, and the cloud had to go down for about a week. Well, I mean, it wasn't down. So, pre-patching. First things first: proof of concept. If we can't prove the vulnerability exists, we can't prove we fixed it. Chris Barron, who was a developer at Rackspace at the time, helped us develop a way to replicate the issue, and we were able to see dom0 memory contents. By that, if I remember correctly, the first thing we noticed that we could confirm wasn't just junk data was a portion — and I say portion, I mean like three characters or something — of a UUID that belonged to dom0 but that guests shouldn't be able to see. So when we saw that, we were like, okay, confirmed: we're seeing things we shouldn't be able to see. So we had to develop a new process for this. We had to create the patch, package it, test it, test it, test it, and test it again to make sure we weren't going to break the entire cloud by applying our fix. And even though it was a very simple fix, again, this was a new workflow for us, not something we'd really done before, so we were very careful about how we went about it. And we were committed to fixing the entire cloud before the embargo was lifted. That's something Rackspace is huge on, and — sales pitch real quick — I don't think I've ever seen a company really commit to that as much as we have. I've been involved with other security vulnerabilities we've had in the past, and it was literally: if one host is still broken by the deadline, the cloud is still broken. So we had about five days' notice between when we learned that we had to do this and when the embargo was being lifted. We were notified somewhere in the middle of the week, confirmed that it was not just a crash but an actual memory disclosure by the end of the week — Thursday or Friday or something like that — and started patching on Sunday. So, a very aggressive timeline to make sure that we could get it fixed before it was disclosed to the public and more people could start exploiting it. So, we used Ansible to drive all of this. Why did we do that? Excellent performance. No overhead or dependencies, just Python and SSH, and you've got your deployment mechanism and your configuration management mechanism. Easy-to-read YAML files — pretty self-explanatory; if you can read this slide, you can read a YAML file, probably. And easy-to-read pass/fail tests, which, when you're patching tens of thousands of hypervisors, helps a lot.
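To give a flavor of what we mean by easy pass/fail tests, here's a minimal, purely illustrative sketch — the path, variable, and message below are made up for this example, not our actual playbook:

```yaml
# Illustrative only -- not the actual Rackspace playbook.
# Two tasks: record the state we care about, then fail loudly if it's wrong.
- name: Check for the xen.gz we expect to be booting from
  stat:
    path: /boot/xen.gz            # hypothetical path; varies by XenServer version
  register: xen_gz

- name: Fail this host early if the hypervisor image is missing
  fail:
    msg: "No xen.gz found on {{ inventory_hostname }}; needs manual triage"
  when: not xen_gz.stat.exists
```

Each host that trips a check like that shows up as a single, clearly named failed task in the run output, which is exactly what the summary at the end tallies up for you.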
And the Ansible summary you get at the end — we kind of got to learn that if you get that summary back and you have 27 okay messages and then one failed, you know exactly where it stopped. You didn't even have to scroll back up to the actual failure; you're like, oh, I have 27 okays and one failure, that was this part back here. And so that made it easy-ish to triage. You'll get why it's an "ish" in a minute from Ben. And it's adaptable. It was very easy to iterate on the playbook as we were going. The playbook we started with was basically just the inspiration for the playbook we ended with. It was continually changing. We were like, this isn't working, this timeout is too long, or we're not accounting for this. Commit, go, done. And it was easy to iterate on when you still have tens of thousands of nodes to patch that you can then test on as you go. So it was very easy and collaborative to keep going. People were constantly throwing in commits: oh, I ran into this, I ran into this, I ran into this. Fix, fix, fix, fix. It's extensible. Need to write a full Python module to handle something on the hypervisor? Or do things on our Nagios monitoring nodes to account for these kinds of changes? Easy to do. Dynamic inventory — I'll talk about Galaxy, which is what we used to drive our dynamic inventory, in a bit here. And by the way, Galaxy is a big thing for us, and this was its first trial by fire. Up until this, we were sort of wading in with it, and then it was like, no, we've got to use this new CMDB kind of thing we're working on to drive all of this. Idempotency: if you can put logic into your playbook so that if all of this stuff is already done, it just skips ahead, that helps you keep velocity when you do have failures in the playbook. And callback plugins, which we didn't really leverage for XSA-108, but in subsequent patching we were able to utilize them to interact with things outside the scope of the playbook, and you'll see how we used the callback plugins to improve on subsequent runs. So how big is our cloud? Six regions: Chicago, Dallas, Virginia, London, Hong Kong, Sydney. 100-plus cells. Hundreds of cabinets, thousands of hypervisors. Flat files are bad — that's how we ended up at Galaxy. We were tired of managing flat files for the inventory we needed. We had a dynamic inventory, but it wasn't dynamic enough, basically. So what is Galaxy? Galaxy is essentially our CMDB. CMDBs have been a big hot-button topic at a lot of the operators' meetups, especially recently. Galaxy starts with our sources of truth at Rackspace, which can be anything from directly talking to a switch to find out how it's configured, to our internal Rackspace asset tracking systems. I mean, there are systems at Rackspace that our hypervisors and such have to be entered in, and they're the source of truth, and those systems predate the Rackspace cloud by 10 years, so they aren't going away anytime soon. And OpenStack itself. We didn't want to go out to all these systems and pull all that information together ourselves. It was slow, it was annoying. I want to be able to say, I need to know everything about this hypervisor, and get a big JSON object back that has everything I need to know about that hypervisor. That's the kind of power Galaxy gave us. So we have all these sources of truth, and then within Galaxy is a service called Gravity.
Gravity is essentially a collection of pollers that reach out to all those different systems and pull all that information back into one place. And if you've ever played Katamari Damacy, it's kind of like that — and that part of the slide was in there before we came to Tokyo, by the way; it was in the original version back in San Antonio — it just kind of rolls around, picks up all that data, and aggregates it for us. It ends up being thrown into a MongoDB — scalable, fast, easy to work with, throw it all in there — and then we put an API in front of it, and then we can access all that data. So essentially it's a giant key-value store with pollers, so that we're not the ones bothering all those source systems; we just pull everything from Galaxy. We can also put data in there manually, which is something we do in this reboot process. We have a value in Galaxy like "patching failed" or something, and we'll kick the actual last failure it got from the Ansible playbook into that, and then you can look at those at a glance. Another side note: Galaxy is something we're looking to open source sometime in the semi-near future. Yeah, we were hoping to get at least something ready by this summit; it didn't quite work out, so hopefully if you all make it to Austin for the next summit we'll have some nice gifts for you when you come to our backyard there. So how exactly did we leverage this for the patching? Divide and conquer: pull the inventory using the Galaxy API, and it's very flexible in terms of how we do that. We can pull, say, every host in a cabinet, every host in a cell, every host from this and that. All of those go into a dynamic inventory, and that allowed us to break it up, because when we were really cooking there on the first run I think we might have had 10 to 15 people all running the playbook on their different sections and triaging at the same time. So we obviously didn't want multiple people patching the same thing at the same time. The patched hosts check in to Galaxy when they're done, and that gives us, as well as senior management, a look at our patching velocity. Especially senior management — they were getting hammered by people on this issue: when is this going to be done, when is this going to be done — especially at the region level, because if customers have instances in, say, our Chicago and our Dallas regions, they want to know when Dallas is done, because they might have their architecture spread out as an HA setup across the data centers. So when Dallas is done, they know they can make it the active side, because those instances aren't going to reboot anymore. And this is a look at the kind of data we were able to pull. We basically spun up an ad hoc Graphite node just for this, and that graph is just reaching out to Galaxy and getting the total number of hosts, which is obviously obfuscated in this slide, and then the percent that have been patched. The dramatic jumps up you can see are basically when a batch finished, because with Ansible it's a batch: we were operating on X number of hosts at a time, running it as a serial task, but everybody was starting at about the same time in their little pockets, so everybody's playbook finished at about the same time — jump up, jump up, jump up. Which, by the way, is something we're really looking forward to with Ansible 2.0: it has a new feature called execution strategies, where you don't have to do everything at the batch level; you can take one host start to finish.
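Roughly, the shape of each operator's run looked something like this — the group name, batch size, and role name here are placeholders, since the real inventory groups came out of Galaxy's dynamic inventory script:

```yaml
# Hypothetical play header -- names and numbers are placeholders, not Rackspace's real values.
- hosts: cell_dfw_c0001      # one cell's hypervisors, as grouped by the dynamic inventory
  serial: 20                 # Ansible batching: only this many hosts in flight at once
  gather_facts: false
  roles:
    - xsa108_patch           # placeholder role wrapping the patch/reboot/check-in tasks
```

Each person then ran something like `ansible-playbook -i galaxy.py patch.yml` against their own slice of cells (again, filenames are illustrative), which is why the graph shows those synchronized jumps as everyone's batches completed together.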
And this is just a quick look at our patch tracker. This is part of Galaxy. The real power of Galaxy is the API that gives us operators the ability to pull all this information and act on it intelligently, especially with Ansible and other automation tools, but this is a look at the front end that we have. It was very easy for us to give this to account managers, to management, and for us to look at it at a glance and say, okay, I have 50 percent of this cell done, I have 60 percent of this cell done, blah blah, it's 100 percent, everybody parties and then moves on to the next one. So, patching with Ansible — I can't remember, okay, whatever. All right, pre-flight checks. First thing to check: is it already patched? Obviously you don't need to patch it twice. Are we skipping it? There might be a host in the infrastructure that for whatever reason we didn't want to patch at that moment and would come back to later, so we put a capability in there to say, oh, this is on the do-not-touch list, we'll come back to you at the end or whatnot. OVS kmods — this was a big thing for us. We use Open vSwitch, and we ran into a very small subset of our hosts — but when a small percentage fails out of tens of thousands, that's a lot of hosts — that had a mismatch between the kernel version they would end up on post-reboot and the kernel version their installed OVS module was built for, which blacks out the networking when the host comes back up, because you don't have an OVS module for your kernel. So this was one of the things we iterated on quickly: add a check for that, fix it, move on. Verify that the host is going to reboot, and ensure that all guests will start automatically when it comes back up. This is more or less built into Nova now: when compute starts, it tries to return instances to the state they were in before the host went down, so that was mostly us making sure compute was going to start and could take care of that. Patch: check the version — you don't want to apply the 4.1.5 xen.gz to a 4.1.3 host, you're going to have a bad time. Back up and copy the patch. There's already inherent error checking in Ansible when you use the copy module or anything like that — it hashes and makes sure everything is going well — but we checked it ourselves anyway with separate tasks, just because we're paranoid. Measure once, cut twice, however that goes. Host state snapshot — and by that I mean we would take a VM list, dump it out to a file, and then we could reference it later and say, oh hey, when this host went down these instances were running, and now only this smaller list of instances is running; that's probably bad. And try to minimize downtime. And then stage three was to recover and check in. Self check-in. It wasn't a self check-in originally; before we fixed that, it was being done in the Ansible playbook, which means if you got a random timeout but the host actually patched successfully, you don't know, because it hasn't reported. So that was something we added on the go. Double-check our CBS volumes: instances can't start if a Cloud Block Storage volume they're expecting to be plugged in isn't plugged in. Those attach automatically for the most part, but we iterated through them anyway and tried to plug them in, just to make sure they were there when an instance that was going to use them tried to come up.
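To make the host state snapshot step we just mentioned concrete, it can be as simple as one task dumping the running guests to a file you can diff against after the reboot — the output path here is just an example:

```yaml
# Illustrative sketch of the pre-reboot snapshot on a XenServer host:
# record the running guests so we can compare after the host comes back.
- name: Snapshot the list of running guests before shutdown
  shell: xe vm-list power-state=running params=name-label --minimal > /var/run/pre-reboot-vms.txt
```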
If we forced the patching process through a host that was originally on the skip list, remove that skip file, because you have no need for it anymore. And hey, we won. Ideally. So yeah, one of the great benefits of Ansible is that you can put a lot of additional logic in there, and there was a lot of additional logic. We're running many different hardware types and several different versions of XenServer, and each version of XenServer needed a different patch. So with Ansible, we were able to have logic: if it's running this version of XenServer, use this patch; if it's running that version of XenServer, use that patch. Additional logic to say, if the compute node doesn't start by itself, make sure the hypervisor is configured so that it does, because we don't want to be manually going through making sure compute nodes have started after the fact. We wanted to back up the existing version of the xen.gz that we were patching, just in case there was some issue along the way, so we could easily revert. We verified the patch was copied correctly, checked it against an MD5 sum — and Ansible has built-in modules that make all this stuff very easy. We drop a kind of needs-reboot file on the host machine. Currently we do this within Galaxy — we tell Galaxy that the host needs a reboot — so we've gotten a little better since then. We only had five days to get this going, so at the time the easiest thing to do was just touch a file indicating that the host has been patched, needs a reboot, and hasn't been rebooted yet, and add a little init script that would remove that file when the host was rebooted. Then we could write a separate playbook that would quickly audit all hosts and tell us if there are any hosts out there that have been patched and haven't been rebooted yet. And we drop the file that says the host has indeed been patched. We'd save the current state of the guest instances. At any given time, you may have suspended instances on the host — customers who didn't pay, stuff like that — so we wanted to make sure we had something to compare against, to ensure compute wasn't starting up instances that shouldn't have been running. Then we'd reboot the server, and Ansible has a really nice wait_for module, so you can reboot a server and the playbook just stops and waits for the server to come back before it continues doing whatever else it needs to do, like finding and plugging in the Cloud Block Storage volumes and removing the skip file if it exists. We had some problems with the self check-in to Galaxy at the time, mostly due to timeouts waiting for the host to boot. That's the one downfall of the wait_for module: you do have to set a timeout. If you set that timeout value too low, the playbook moves on, doesn't finish what it's supposed to do, and counts the host as a failure even though the host successfully patched and rebooted, and so those post-reboot tasks did not run accordingly. That was confusing for people and wasted time. So we created a new one-time startup task and a new playbook that would check for that file and mark the host patched in Galaxy. We could easily run that playbook across the entire fleet and it would account for any of those errors that might have happened along the way.
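Condensed down, the middle of that flow looked something like the sketch below. This is a hypothetical reconstruction: the file names, timeout values, variable names, and the Galaxy URL are all placeholders, not the real ones.

```yaml
# Hypothetical condensation of the patch/reboot/check-in steps.
- name: Copy in the patched xen.gz for this XenServer version, backing up the original
  copy:
    src: "files/xen-{{ xenserver_version }}.gz"   # xenserver_version is an assumed variable
    dest: /boot/xen.gz                            # placeholder path
    backup: yes

- name: Mark the host as patched but not yet rebooted
  file:
    path: /var/run/xsa108.needs-reboot            # an init script removes this on boot
    state: touch

- name: Reboot the hypervisor
  shell: sleep 2 && shutdown -r now
  async: 1
  poll: 0

- name: Wait for SSH to come back before running post-reboot tasks
  wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    delay: 60
    timeout: 600      # too short and a slow-but-healthy host counts as a failure
  delegate_to: localhost

- name: Check the host back in to Galaxy (placeholder URL)
  uri:
    url: "https://galaxy.example.com/api/hosts/{{ inventory_hostname }}/patched"
    method: POST
  delegate_to: localhost
```

The tension in that `timeout` value is exactly what drove the iterations described next.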
As Joel said, we did have to make several iterations along the way just to make the process better, easier, and less confusing for people. And we didn't want to increase that timeout. Originally we had it very high, and it was killing our velocity because of the batch processing in Ansible: if you have one host that takes 20 minutes to come back from a reboot, the whole batch takes 20 minutes, because it won't move on to the next tasks. So we set a more aggressive timeout and put in logic for the host to do some accounting on itself, and then we could keep the playbook moving, moving, moving and deal with those outliers later. And in the newest version, on every reboot the hypervisor will actually check in with Galaxy and tell Galaxy what patches it currently has. So this is all tracked via Galaxy and includes checksums of all the files, so there are auditing processes that can go through, make sure all our hypervisors are up to date, make sure all the checksums match, and make sure there's nothing iffy going on. This is Joel's favorite. This is my favorite. I've got to give you this one — I've written a lot of Ansible, and this is my favorite task I've ever touched. I love it. So, some hosts hung during shutdown. A lot of these hosts had very high uptimes; a lot of them were just in a bad way. Again, with a huge fleet you're going to get a minority like that. And back to what we just said: if one of those hosts refuses to shut down, you're hitting your max timeout window every time, and that's really slowing us down. So what did we do? We had to force the host to shut down. We tried some things to get it to just shut down; we shut down the instances first as best we could. But if the host is just hung for whatever reason — zombie processes or something that just won't let it die — we need to get that host to reboot; there's nothing else we can do about it. So we came up with this. We thought about trying something like an IPMI command, but we started running into other errors there, and we decided on echoing to the sysrq trigger, because obviously we're already SSHed into the host — otherwise it would have failed far before it got to the point of actually sending the reboot command. And this proved to be very reliable for us in actually getting these problem hosts to finally reboot. The issue became: if you run this command in an Ansible playbook, at least in the version we had at the time, which I think was 1.7 — 1.6, 1.7, yeah — it kills the entire playbook. Not the thread for that host, the playbook. Ansible as a whole would stack trace and die, and your entire run would stop. That's bad. So we had to do some creative problem solving. Anybody who reads Ansible might be able to figure out what's going on here. Essentially what's happening is we're doing a background task that's a one-liner that sleeps for five seconds and then sends the echo command. The async stuff is telling Ansible, don't worry about this task, background it and leave. And the poll of zero is saying, don't actually check on it; I'll check on it myself later. So this is basically like Mission Impossible: this host will self-destruct in five seconds. But because it was backgrounded on the host itself, the Ansible SSH session was closing and getting off the hypervisor, and that allowed this step to finish without destroying the actual playbook run. And yeah. I kind of like movies and GIFs, so I'm not sure if you noticed.
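For the curious, the shape of that task was roughly the following — the exact values are illustrative, from memory:

```yaml
# The "UDP shutdown": fire-and-forget a forced reboot on a hung host.
# The sleep gives Ansible time to close its SSH session before the box drops;
# async + poll: 0 tells Ansible to background the command and not wait on it.
- name: Make sure magic sysrq is enabled
  shell: echo 1 > /proc/sys/kernel/sysrq

- name: Force an immediate reboot via the sysrq trigger
  shell: sleep 5 && echo b > /proc/sysrq-trigger
  async: 100
  poll: 0
```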
But yeah, that was a lot of fun. And actually the UDP shutdown thing — Andy Hill, one of our co-workers hiding in the back row there, is the one who coined it. I was like, what do you think about this? And he was like, it's a UDP shutdown. Man. So yeah, in the playbooks we also have special logic to make sure you're doing what you think you're doing. We wanted to be able to patch hosts that didn't have any customer instances before we patched the hosts with customer instances, just so we could increase our velocity on the actual customer-impacting stuff when the time came. So we added a special variable to the playbook, something like reboot-hosts-with-instances, set to true. If you didn't run it with that, we could run it across our entire fleet and it would only patch and reboot hypervisors that had no customers on them. Ansible makes this very easy, and it's always better to have to explicitly state that you do want to reboot hosts with customers than to have to explicitly state that you don't. So yeah, the general process: operations would run and monitor the playbooks. Once completed, unfortunately, we would place the failures in an Etherpad. We didn't have a very good tracking system at the time, and we were flying by the seat of our pants, so we'd paste failures into an Etherpad, and support would swarm like sharks around the Etherpad, pull out bad hosts, triage them, and remove them from the Etherpad. That had a lot of problems. It was really clumsy. People were pasting in the same dead hosts more than once. People were working on the same hosts at the same time. People would spend 20 minutes looking at a host before realizing, hey, this has no customers on it; I should be focusing my time on a host that someone actually cares about. We had a lot of compute nodes that just didn't come back up for whatever reason after the reboot. We actually have a XenServer bootstrapping playbook which takes a bare metal XenServer and configures it for everything we need — sets up the compute node, configures networking. It's a good 45 to 60 minute process that takes it from bare metal to customer-ready, and that does all the compute work already. So there was no sense in going in and writing a new playbook, figuring out little bits here and there while we're in the middle of the storm. We could just open the existing playbook we had, add a few tags to the parts that do the compute nodes, and run that playbook with the compute tag. And 10 minutes later your compute nodes are fresh and shiny and brand new, and everyone's happy. Yeah, for clarity, one thing Rackspace does a little differently from most implementations is we run a one-to-one ratio of compute nodes to hypervisors: every hypervisor has a small VM on it that is the compute node only for that hypervisor. So in this case, we're talking about the compute nodes that live on the hypervisor we just rebooted having issues. Instead of fighting with those when they weren't coming back up correctly, we just bulldozed them and rebuilt them from scratch, because the compute node itself is completely ephemeral, so it didn't particularly matter. Yeah, and at one point we even had an entire cell of hosts that did not come back online. When one of our guys told us that he had run it against the cell and the hosts didn't come back, we thought he was crazy. But it turned out he was right.
You know, it was a unique issue with that cell due to some firmware versions. Luckily, we had DC guys on site who were able to help jiggle the handles and get those back online for us relatively quickly. Unsung heroes, DC ops — we were running them all over the DC, and in this particular instance I think they had to reseat a few hundred hosts at the same time. So yeah, they did some curls that day. Thanks to those guys. And it was important to get the playbook just absolutely perfect. People were tired, people were hungry, people were getting kind of grumpy. And at that point in the game, it's easy to make human errors. One of my favorite DevOps Borat quotes is: to err is human; to deploy that error to thousands of servers in production is DevOps. So yeah, with so many people helping — we had support, we had ops, we had engineers, we had leadership — it was crucial that we tracked what we were working on. And that's kind of where Etherpad came in. It wasn't the best, but it's what we had at the time, and people would track who was doing what there. But you know what we learned that night? We hosted Etherpad in our public cloud, and we rebooted it. And for a little while there, no one knew what was going on. Our source of information had just vanished. I can't remember who said it first, but we're all sitting on a video chat, and somebody's like, I lost connection to Etherpad again. And then everyone else is like, yeah, me too, yeah, me too. And then somebody, real quiet, was like, where is it hosted? So yeah, hindsight's 20/20, right? But thankfully, the Etherpad admins, the people who'd set it up, had set it up so it would come back up after a reboot. So after 10, 15 minutes of panicking, it was restored. So we learned a lot of lessons from that. One of those is to choose your tool dependencies wisely, especially if you're rebooting the infrastructure the tools are hosted in. There are a lot of edge-case pre-checks, especially with all the variety of hardware we use, the variety of XenServer versions we use, and the variety of OVS versions we have in the fleet — you just need to check for everything. We learned a lot of best practices for rebooting and patching a large fleet at a time. It was our first time doing this, and since then we've had to do it a few more times and we've gotten a lot better. But it was clear that we needed better tracking for failures that needed triage, and that's where some of the callback stuff he was talking about came in. Now, if a server fails at a certain point in the process, the callback plugins in Ansible will actually grab that failure, create a ticket for it, if needed create customer communications for it and send them, and direct it to the appropriate teams. And it leverages the existing workflow those teams are already going to use, with Nagios and some in-house alerting tools we've built. People can claim an issue and say they're working on it. We can easily see at a glance what kinds of issues are still open and what kinds of issues have already been resolved. We have some new checks that will tell us if anything failed along the way. And I talked about this slide before I went to the slide.
But, you know, the alert will create a notification that feeds into our existing tools, which are actively being monitored 24/7 by an ops team. And this was a huge improvement to the workflow. I mean, obviously anything is an improvement over going into what was basically a shared word doc, but when we did the first reboots, XSA-108 was limited to only our next-gen infrastructure. The next one we had to do went back further, to an older version of Xen, and we had to reboot our legacy cloud as well. So the second time around we were rebooting dramatically more hypervisors and doing more work, and I can't even put into words how much easier it was. The difference was that the first time, everyone was running around like, what's happening? We need to do this. Who's looking at this host? Who needs this? And the second time, the video chat was more like, does anybody else need help? I'm done with all my stuff — because it just streamlined everything with all of those alerts coming in. Here's a screenshot, with some stuff omitted, of the alerts dashboard that our ops guys see. You can see what alerts are currently active, when they were created, what host they're for, and who they're currently assigned to. If an alert is not assigned to somebody, they'll be in there feeding out of that queue, assigning it to themselves, and triaging from there. We've also got a tool that we don't talk about much here, but it's kind of part of the Galaxy alerting, part of the whole shebang that we want to open source soon, called Resolver. Resolver will actually catch some of these alerts that don't really need human intervention, go work on them itself, resolve and close them, and a human never has to touch them. And the important thing from this slide that helped us a lot, which was kind of mentioned earlier: we wasted time on the first go-around looking at hosts that might not have had instances. One of the things we put into the alerts dashboard here — you can see two of those have instances and are marked high priority; the other one does not, and it was smart enough to weight the ones that didn't have customers down to a lower priority. So you could sort by: give me all the alerts where reboot patching failed, give me all the ones that are high priority, and knock those out. Those are hosts that failed patching and have customers. Then you could circle back and do the low-priority ones that needed attention but didn't necessarily have anybody affected. And for the low price of $9.95 a month, your alerts can become a high priority. Yeah. I'm kidding. I'm kidding. I'm not kidding. I'll charge for that. So now we're improving this process even more by introducing some live patching features. With a fleet of our size, having to reboot everything has an enormous cost. It takes a toll on us, takes a toll on ops, takes a toll on everyone involved in the process. It's a lot of wasted time, it's a lot of wasted money, and it's a huge customer satisfaction impact. Nobody wants their services rebooted. Nobody wants to hear on a Sunday that their cloud is going to be rebooted out from under them. And we can't control when these things hit. If one of these hits on Christmas, we don't want to have to tell everybody we have to reboot them on Christmas. So we're working on, and have successfully prototyped, a couple of live patching tools, and there are two types. When we did one round of reboots, we had a version of the tool we were able to launch onto some of the hypervisors, and those hypervisors are now live-patch capable.
So if something were to come up, they could be live patched today without any impact on the customer. Unfortunately, we weren't able to roll that out to all hypervisors, and in order to make a host live-patch capable with the code we had, you actually had to deploy the code that made the hypervisor live-patch capable and then reboot the host. So there's still another reboot involved before we get to the greatness. Basically, the way this works is we copy the new function into hypervisor memory, find the bad function that we want to get rid of, and stick a jump to the new location at the beginning of the original function. And all of this is done in memory. So the host still needs to be patched on disk with the previous process, only not rebooted — because since this is done in memory, once you reboot the machine it's still going to be running the old kernel, running the old code. So you've got to make sure that when and if the host does ever get rebooted, it is no longer vulnerable. To accommodate the hypervisors that didn't support live patching, they created a live-patching live patch, so that we can live patch the live patch code onto the hypervisor to allow live patching. Is that confusing? So basically they built a live patch that would enable live patching on the hypervisor, and they leverage hardware devices to copy that into hypervisor memory via direct memory access. Because on XenServer, even dom0 doesn't by default have full memory access, so we have to leverage DMA to get at these functions that it might not necessarily be able to access normally. And yeah, that's just a little bit about how the process is improving. Every time we go through something like this it gets a little better, and hopefully in the very, very near future, rebooting the cloud will be a thing of the past for us. When we hear of a vulnerability, we'll be able to roll live patches out within a couple of days without any customer interruption, without the need to make a big ordeal about it, without the need for people to stay up for 24 hours at a time. As he mentioned, we were all working from home, on a video chat, organizing this process. And Joel actually took a screenshot of the call's elapsed time the first night before he dropped off. He was on for a good — oh, I was still on for a while after that — 20 and a half hours. Yeah, hit the next slide. Yeah, that's one of the engineering managers there, Tony Evans, and his powerful Kiwi beard. He's from New Zealand. It's not real Kiwi. But that was one day, and that was basically every day for a week. So yeah, that was a good time. And that's it. Thank you for coming. Any questions? So, these specific patching processes — the live patching process anyway? Oh, yeah, yeah, absolutely. That's one of the great things about Ansible: it's kind of agnostic to the OS, it's agnostic to the hypervisor version. If it's Linux, it's going to work. So the alerting part is really the best thing to try and leverage out of this. Because, honestly, the overall process is simple: install an RPM or an XSA patch, reboot. Everyone in this room has probably done that or something very similar a million times. The problem is just dealing with the edge cases. If everything worked like it was supposed to, I could do it myself in an hour. But it doesn't work that way.
The real win was the logic we built around the alerting, the callbacks, and the tracking stuff. So if you have an existing workflow, just that idea of, hey, put in an Ansible task that creates a Nagios alert — then it goes into all your normal stuff so your triage team can grab it. I think that's the big win that we took away from all of this. That's one of the reasons why we ran it as a serial task, so only X number of hosts were going at the same time. The things we were concerned about were network throughput with the startup stuff, and mainly overloading a PSU or something if everything started up at the same time. That ended up being a non-issue, but it was something we were paranoid about, so that was where the original impetus for doing it as a serial task came in. And on top of that, so as not to overload the people doing the work, not to overload support, we did a subsection of hosts at a time. We didn't do an entire data center in one run, because if there were problems, it generates huge quantities of support volume. So we had to space that out for a variety of reasons. Actually, one thing we neglected to include in this presentation that we improved upon in the subsequent reboots after the first one: the first time, you had a giant reboot window. It was like, we're going to reboot you sometime between now and tomorrow — not quite that big, but it was multiple hours. The second go-around, each individual instance had a two-hour reboot window in which it might be rebooted. We actually populate that into the Nova metadata: you can do an instance list or a show on the instance, and it'll say reboot window between now and now, and we clear that or mark it complete — I forget which one — when it finishes. And that was actually integrated into Reach, the Rackspace control panel, where you could go and highlight your instance and it would say, this is going to be rebooted between now and now, or this was already successfully rebooted. And that was the kind of control and precision we gained between version one and version two, XSA-108 through 10-whatever. Any other questions, anyone? So, some of these compute nodes, especially in some of our first data centers, have been running for five years. So there's a variety of reasons: file system corruption, they haven't had the proper updates, just general software issues. But honestly, no, in the heat of the moment we were not concerned with triaging why; we just wanted to get them back online immediately. I see where you're coming from, though, because that's very interesting: if it's happening to us, it's happening to customers. The first go-around, I don't think we really had time to do anything — we were on such a short timeline. The second time, I don't even know how many we had to do; I think we kind of shook the cobwebs out between the first and second. Unfortunately, it was a fairly quick turnaround between 108 and 123, I think the following one was — it was about six months between. So that was another reason the second one was a little smoother: everything had no more than about six months of uptime, and there wasn't a large enough volume of that problem showing up to make it annoying. So I guess we're about out of time here; if there are any other questions, we'll be happy to talk out in the hall. Thanks again, everybody, for showing up to a 9 a.m. session.
Appreciate it.