How's it going, everybody? My name is Anthony Messerly. I'm a principal engineer at Rackspace, and today I'm going to be talking about stateless hypervisors at scale.

A little bit about myself: I've been with Rackspace for almost 14 years. I originally started out in hardware development for Rackspace, then transitioned over to work on the initial cloud servers product offering. We acquired Slicehost, so I moved over to that full time, and that eventually turned into heading up the engineering team that launched the next-gen OpenStack public cloud. Some of my passions are R&D, prototyping new products, and just playing around with things in general.

A little bit about Rackspace: our OpenStack public cloud has been in production since August of 2012. We have six regions across the world, tens of thousands of hypervisors, and over 10 different hardware platforms. Primarily, we use Citrix XenServer for our virtualization product today.

So let's talk a little bit about traditional hypervisors and what they're about. If you think about the components of a hypervisor in OpenStack, they typically run on bare metal. They have an operating system, you run some sort of configuration management on there, and you run nova-compute, of course. And then for the instances, there are usually the instance settings and the virtual disks on there. The mission of a hypervisor is that it needs to be stable, it needs to be secure, it needs to provision and run instances reliably, and it needs to be consistent with other servers. If you're running code against servers, you want to make sure they're predictable and look the same.

Some of the problems we had with hypervisors at scale: we had multiple versions of XenServer. We launched in 2012, so over time new versions came out with a lot of new performance optimizations and enhancements. With that comes a lot of different patch sets, a lot of different security fixes, a lot of kernels and Xen hypervisors that need to be patched. There were a lot of Xen vulnerabilities the last few years, and the more versions you have, the more versions you have to patch. The point is: the more variations, the more work. Then when you talk about server hardware, you sometimes get incorrect BIOS settings and firmware from the factory. Old servers have been running for a while and never got the updates, but new servers do, so there's a lot of inconsistency there. And when you think about operational issues, a lot of OpenStack and hypervisor bugs can leave things in an undesirable state, because there are constant bugs over time.

So how we solved some of these problems: we built a provisioning system that uses iPXE from the BIOS and then uses Ansible within a utility disk to do a factory-style provisioning process. Every server that comes through, or any server we reprovision, goes through this process to make sure the firmware and the BIOS settings are all up to date. Then we consolidated hypervisor versions to try to reduce the number of variations, and we used live migration where we could to move instances off, refresh that host, and add it back to the pool. We also automated a lot of operational tasks on the hypervisor to try to reduce inconsistencies. If you see a hypervisor in a weird state and there's something that's not right, we try to resolve it.
The tool is actually called Resolver. And if we can't, then we manually intervene. But ultimately, we still had the issue that we're running a traditional operating system: you have to install it to the machine, you have to update packages.

So our goals with this project were to rapidly deploy hypervisors. We wanted to take advantage of every single server reboot, whether from a maintenance or a hardware failure, to bring the system up to the latest spec. These systems usually run a long time without a reboot, so it's imperative to take that opportunity to bring them up to date when you can. We wanted a reproducible image of our build that we could pass around to developers, engineers, and quality engineering, so they could work on exactly what's in production. We also wanted to ensure consistency in the hardware platform and the operating system. These are cattle, not pets. They have one goal in life, and that's to run instances, so we want to treat them that way.

So the concept is live-booted hypervisors. If you're not familiar with a live OS, it's a bootable image that runs in system memory. The image you boot is predictable and portable, and lots of OS distributions use this today for installation or rescue. You usually boot them from a CD, the network, a USB key, and so on, and it lets you boot an operating system, look around, and use some functionality without actually modifying the running state of the system. So what if we applied the same concept to running our hypervisors? As Bill O'Reilly would say, we'll do it live.

We use a stateless live OS for our hypervisor that boots from the network to promote consistency across the board. Everything boots that same image, so everything's consistent. And we're using Ansible to build that operating system image from scratch. Every time a commit gets checked in, it kicks off our CI/CD process, which runs the build through to generate an image and then pushes it out to our deployment server. The idea is to separate the operating system from the actual customer data and basic configuration on the server. We take the thing that every server has in common and set it off to the side, and then the uniqueness of the server, along with the customer data, is kept separate. Updating to the latest image is done by either rebooting or doing a kexec. You can also catch up a running system by adding and updating packages online, and then the next reboot picks up the new image.

So where does the persistent data go if it's a stateless image? You can still tell the image where to look for its data. What we did was create a systemd unit file early in the boot process, which mounts the local disk and then creates symlinks to it. So as services fire up, they look in the location you symlinked, and it redirects them to the persistent store. In this case, your second partition is mounted at /data, so the script will just generate a /var/lib/nova symlink pointing to /data/var/lib/nova. You can create a symlink like that for every directory you want; usually there aren't a whole lot of directories you need to symlink out. Some common ones are /etc/nova, your networking configuration, and maybe some logging. You can either send your logging to a syslog server, or you can persist it to that local store.
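As a rough illustration, and not the exact unit we run, the early-boot hook described here amounts to something like the shell script below, invoked from a oneshot systemd unit before the services that need the data start. The device name, mount point, and directory list are assumptions for the example:

    #!/bin/bash
    # mount the persistent partition and redirect stateful directories to it
    mkdir -p /data
    mount /dev/sda2 /data                     # second partition holds everything stateful
    for dir in var/lib/nova etc/nova var/log/nova; do
        mkdir -p "/data/${dir}"
        rm -rf "/${dir}"                      # drop the copy baked into the live image
        ln -sfn "/data/${dir}" "/${dir}"      # services follow the symlink to the persistent store
    done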
So how is this possible? We're leveraging the dracut project. Some of you might be familiar with it. It runs in the initramfs during boot, and its main goal is to transition to the real root file system. It has a lot of functionality for retrieving the root filesystem over the network with HTTP or FTP, and we recently added torrent support to it, so you can actually point to a rootfs torrent and retrieve it that way, which is really useful when you boot a lot of servers at once. There are many tunable options that can be set via the kernel command line to control how it boots and how it behaves, and you can find more information on their wiki.

So why use a live OS? Everything boots from a single image. You can make changes without a reboot, but you should also update the image. Security updates can be rolled out to a live OS to avoid a reboot, but you can also add them to the image so the next reboot gets them. You can update to a new release of the OS and roll back to the previous one if you need to. It's portable and easy to test and develop on. You know exactly what's in the build; everything is version tracked, and you can tag tickets to it. If you use Jira, you can mark all that, so if you had a feature or something, you know who added it and what it was for. And memory is cheap. That wasn't always true: back in the day there were 16 and 32 gig boxes, but nowadays everything's 128, 256, and up, so it's really cheap to just run the file system in memory.

So let's talk a little bit about how the image build process works. We put together a tool internally at Rackspace called squashible. It's a combination of SquashFS and Ansible, and it's just a bunch of Ansible playbooks that automate the build process of creating the images. It supports multiple distributions; I think right now it supports CentOS, Debian, Fedora, openSUSE, and Ubuntu, but you can add additional OS support to it, as long as they have dracut or another way to live boot. When we started down this project, we originally started with Debian Live, and it worked really well. The problem was that Ubuntu used Casper and Fedora was using dracut, so there were a lot of different distributions using different things, and we wanted to find a way to unify that. So we started using dracut, and for right now it seems like most of the operating systems have a way to use that to create a live OS image. The one we have internally, we've been pulling bits out of, trying to make it really generic and support a lot more operating systems, and that's the one we put on GitHub. I'll get to that later.

But really, the bulk of our configuration management is done at image build time. Any packages we want, any optimizations and customizations, we can do at that point. Then once the image is online and running, you can deploy the configurations to it, and that's where they'll persist on the store. So if you reboot the image, it'll still pull the configs that you deployed to it. All changes in the build, like I said, live within a repo; they're fully tracked and very reproducible. So you can spit out an image of the latest build, or you can go back to a previous image, maybe the one you're running in production, and reproduce that.

So let's walk through the image build process. We start out with an initial bootstrap.
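At a high level, the walkthrough that follows boils down to steps roughly like these. This is a hypothetical sketch with made-up playbook names, paths, and URLs, not the actual squashible tooling, but it shows the shape of the pipeline Jenkins runs on every commit:

    # bootstrap a minimal chroot inside a container: package manager, init system,
    # and just enough to run Ansible against it afterwards
    docker run --rm -v "$PWD/build:/build" fedora:23 \
        dnf -y --installroot=/build/chroot --releasever=23 install dnf systemd python

    # run the build stages as Ansible playbooks against the chroot
    ansible-playbook -c chroot -i "$PWD/build/chroot," prepare.yml       # catch up packages, repo configs, build metadata
    ansible-playbook -c chroot -i "$PWD/build/chroot," common.yml        # company-wide auth, logging, auditing
    ansible-playbook -c chroot -i "$PWD/build/chroot," personality.yml   # e.g. KVM hypervisor: nova-compute, libvirt tuning

    # ship the artifacts: kernel and initramfs (built with dracut's live modules),
    # plus a tarball of the root filesystem, and a torrent for mass retrieval
    cp build/chroot/boot/vmlinuz-* build/chroot/boot/initramfs-* output/
    tar -C build/chroot -czf output/rootfs.tar.gz .
    mktorrent -a http://deploy.example.com:6969/announce -o output/rootfs.tar.gz.torrent output/rootfs.tar.gz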
So we use Docker so that we don't have to maintain a separate build server for each OS; we use it to abstract the OS we're building in, and we use it to create a minimal chroot. That minimal chroot initially consists of a package manager and an init system; really, it's just enough packages to run Ansible within a chroot. Once we're done with that, we tear down the container, copy the tarball to the Jenkins server, and go on to the next step.

The next step is preparing the chroot. We use Ansible's chroot connection to catch up the OS. We throw in some version-tracking metadata, basically information from the build server. We add any package manager configuration, where to pull your packages from, any pinning we might need, and then we update all the packages at that stage.

The next stage is the common configuration. This is just a bunch of settings or packages you might have within your company, where you might want one image that everybody can log into. It's really all the stuff you have within your company: your logging practices, your auditing practices, security, auth, all that. You'd want everything like that in this common playbook.

Next you have the personality you want to apply. That could be a KVM hypervisor, a Xen hypervisor, whatever you want, whatever makes it different from the common configuration or from other images. There you'd have all of your performance optimizations, maybe libvirt configs, and the different things that make it the actual image you want to boot.

At this point, the image is done. So you copy the kernel and the initramfs that were in that root out to the deployment server, and you tarball the file system you generated and copy that to the deployment server too. In our case, we use mktorrent to generate a torrent file of that tarball, and then we have the deployment server start seeding that torrent to create the initial seed. Then as servers come online, the first thing they'll do is retrieve the kernel and the initramfs from the deployment server, and then they'll hit the rootfs torrent and retrieve that. As more servers come online, more servers start seeding that torrent, so you actually add more to the swarm. Each server does a quick retrieval of the torrent initially and then stops seeding, but once the OS is booted, you can fire up another seed again to keep it going and speed up the whole process.

So let's touch briefly on the boot process and what it looks like. We built an image; now what are we going to do? Well, let's try and boot it. You can use iPXE or PXELINUX to boot it over the network. You can also boot it from the local disk using GRUB or extlinux, and you can boot into the image from a running server using kexec. It's very easy to fall back to the last booted image, too, if your network boot somehow fails. If you have the boot order in your BIOS set to network boot and then local boot, it can go to the local boot, hit GRUB, and boot the last known image you pulled previously. It's just a matter of scripting that out and making sure it's pointing at that image.
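As a sketch of what that scripting might look like, the boot-cache step could drop a GRUB entry alongside the cached kernel, initramfs, and root filesystem image on the first partition. The paths are made up and the root= arguments use stock dracut dmsquash-live options, so the real setup may differ:

    # write a GRUB entry pointing at the image cached on the first partition,
    # so a local boot comes up on the last image that was network-booted
    cat > /mnt/bootcache/grub.cfg <<'EOF'
    set default=0
    set timeout=5
    menuentry 'hypervisor live image (local cache)' {
        set root=(hd0,1)
        linux  /vmlinuz root=live:/dev/sda1 rd.live.dir=/ rd.live.squashimg=rootfs.img rd.live.ram=1
        initrd /initramfs.img
    }
    EOF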
There are, of course, lots of open source provisioning systems out there, so I'm not going to go into too many of them, and there are a lot of homegrown ones; I'll touch on our homegrown one in a second. But the iPXE config you'd have basically sets up some dracut arguments on the kernel command line, retrieves the kernel, retrieves the initrd, and then you can set it to pull the root filesystem from HTTP or torrent or whatever.

For booting from extlinux, like I said, what we're doing is creating a boot cache. Once the server retrieves the image the first time, it caches it on the first partition of the disk, and then we also generate the GRUB config. So if the server does have to reboot and local boot, it can pull that last known image and at least get the server back up and running. That helps out in the case of a boot storm where you've lost power and everything's trying to boot up at once: you at least want to get everything back up without having to worry about everything pulling from the network. You can also roll out images ahead of time and skip the network boot entirely if you want to.

And then you can boot with kexec, which is really useful if you're iterating on the image and want to see how it works. The big thing to work out is making sure your hardware drivers work with kexec. There are issues with some drivers where, if you kexec, the box will just hang because the driver doesn't get released properly, so that's one thing to watch out for.

Our primary boot method at Rackspace is called Terraform. In a nutshell, it makes a DHCP request and retrieves an iPXE kernel. The server identifies itself using LLDP, the Link Layer Discovery Protocol, if you're not familiar with it: it listens on the network for packets from the switch that say which switch and which switch port the server is connected to. We use that information to look up where that server is in the data center in our inventory management system, which we call Galaxy. Once it finds its server number and all its attributes, it pulls those down and throws them into an iPXE template. From there, it looks at what its boot status is and what operating system it's supposed to boot. If it's a new deployment, we'll typically run it through the utility disk that we have, which is just a utility live OS. It'll catch the firmware and BIOS settings up to date, it'll configure the storage if there's a RAID, it'll configure the OOB, the out-of-band management, and it'll capture all the inventory and push that to Galaxy. Then it'll kexec into the hypervisor OS, or at that point start the install of our older builds.

Some initial scale tests: we tested with about 200 servers on x86, running a Fedora 23-based live OS. The time to build the whole package and run it through the Jenkins system was about 10 minutes, so it's relatively quick. It's about 60 seconds to boot once POST completes. And going from servers already in a cluster to rebooting them, wiping the disks, redeploying the code, and all that, reprovisioning takes about 15 minutes. So in 15 minutes you can wipe an entire cluster of 200 servers and be back up and running, provisioning instances. It's really quick, and you could probably scale this up to a lot more servers.
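To make the iPXE piece from earlier concrete, the entry the deployment server hands out looks roughly like the sketch below. The URLs are invented, the root= option uses dracut's livenet-style syntax, and the torrent variant reflects the support mentioned earlier, so treat the exact arguments as an approximation:

    #!ipxe
    # fetch kernel and initramfs from the deployment server, then let dracut
    # pull the root filesystem over the network and run it from RAM
    kernel http://deploy.example.com/hypervisor/vmlinuz ip=dhcp rd.neednet=1 rd.live.ram=1 root=live:http://deploy.example.com/hypervisor/rootfs.img
    initrd http://deploy.example.com/hypervisor/initramfs.img
    boot
    # with the torrent support mentioned above, root= could instead point at something like
    # root=live:torrent://deploy.example.com/hypervisor/rootfs.img.torrent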
And because it's got torrent in place, it really speeds things up and gets everything onto the network pretty quickly.

Currently, we're testing on OpenPOWER, code-named Barreleye. It's basically a POWER8 system that we've been working on with IBM. We're testing OpenStack KVM with Fedora 23, and it's been working really well. We're still hammering out a lot of hardware issues with the unit, but we're making good progress. If you want to find out more about Barreleye, there's a lot of information on the Rackspace blog. We recently announced an initiative with Google to work on the next POWER9 platform, so I think we'll have some good stuff coming from that.

Future ideas: I'd like to see some embedded configuration management inside the image and see how that works. Essentially, I want everything to live in the image, so the image comes up and knows where to pull playbooks from to automate its configuration, and where to pull its inventory information from. Then you could reboot the server at any point and it would just rebuild itself automatically. I think that would be pretty useful, because if you can get to that point, you can maybe start getting rid of disks and getting rid of another failure point in your system. The other thing is you might be able to do stateless instances. If it works for hypervisors and bare metal, it works great for regular instances too. We could potentially do boot from config drive, where you boot from the config drive we already have, maybe using SYSLINUX or ISOLINUX since the config drive is already an ISO. So that might be a possibility. And if you had, say, 100 web servers, you could essentially reboot them all and bring up the latest build without having to have a disk. You could probably find some other place to store your data; if you have a web head, you could make that persistent, but web heads are usually just throwaway for the most part, and they're usually all the same, so there's a lot you can do with that. I actually even tested this on our OnMetal product, which is running Ironic, and it worked great. I just modified the GRUB bootloader, rebooted into the live OS, and it just worked.

So if you want to check it out, give it a try. I got the website up last night: it's squashible.com, and it's also on my GitHub. You can kick it around and give me some feedback, whether it sounds cool or sounds crazy; let me know. I think it's a really useful tool for a large-scale environment where you have lots and lots of the same kind of appliances, so to speak. It works pretty well. I also have sample boot menus; I'm using Travis CI to generate images for each of the distributions we support. They may or may not work, but you can use the iPXE boot menu to boot into them, log into the live OS, and kick it around.

So that is pretty much it. Make sure to check out the Rackspace Cantina; they have refreshments and drinks. Thank you very much for coming. If there are any questions, I think there are mics.

Q: Morning, Anthony. First of all, thank you for your presentation.

A: Thank you.

Q: I wanted to take a little different angle and get your thoughts on what Rackspace is seeing.
Q: Looking at the industry, we're seeing things like HP Moonshot going after complete ARM infrastructures, and also things like DigitalOcean providing ARM as a new platform. Is there any thought about Rackspace looking at ARM, whether in a consolidated environment or ultimately even a scale-out?

A: We're always looking at multiple technologies. I think right now one of our main focuses is OpenPOWER, just because we see a lot of good stuff there. A lot of the BMCs are open, everything is on GitHub, so you can actually contribute toward a lot of that. A lot of stuff in the past has typically been closed source, so it's really hard to contribute to it. The ARM architecture in itself can also lead to that.

Q: Definitely see what you guys are looking at, how you can have that multiplier effect by being open. I guess the other half I look for is, you know, everybody's carrying one around. So trying to play that forward, not only in the consolidated racks like the HPs or the DigitalOceans, but where is Rackspace looking at that further?

A: Yeah, we're definitely looking there. The funny thing is, we're on Xen primarily today. There is Xen on ARM support, but there's no Xen on Power, so we're actually having to look at KVM for that piece.

Q: Interesting. Thank you again.

A: Thanks for the question.

Q: Hi there. So what's the memory footprint on the server once you've booted the live OS?

A: You can set the size of the root FS; I've been setting it up as 2 to 3 gig or so. I think it actually takes about 4 to 8 gigs, depending on what's being loaded up, but it's pretty small. It depends on what you're putting in there. For just a base OS with KVM, libvirt, and an OpenStack virtual environment, I'm getting about a 500 gig compressed file, so it's not really that big.

Q: 500 gig, or 500 meg?

A: 500 meg, sorry. Yeah, it's really small.

Q: And how much are you using it in production so far?

A: We're not using it in production yet; it's mainly just R&D. Really, what it comes down to is that we run XenServer primarily today. We've looked at doing this on XenServer, and we have some examples of that in the GitHub repo as well, but obviously it wouldn't be supported. The current versions are CentOS 5 based, which doesn't really have good support for systemd and dracut and all that. Their next version is supposed to be CentOS 7 based, which has all that support, so I think we could do a lot more work with it then, and then we'll probably be able to start looking at production.

Q: Thanks.

A: No problem.

Q: Hey, do you have any guest testing after you reboot a hypervisor, to see if guest instances have any issues with the newly booted image? You know, a new KVM came up and now Windows doesn't work or something?

A: Yeah, we did scale testing on this 200-node cluster. We spun up about 8,000 VMs running burn-in and all that, power cycled the cluster, started it back up again, let libvirt start everything up, and burn-in started fine again. We didn't really try mass upgrades, like going across multiple versions of libvirt, but that's something we could probably test in CI. If we have a new version, we can make sure everything's solid on it and the same kinds of virtual machines still work, and we'd probably be OK.

Q: Thanks.

A: Any other questions?
Q: What kind of demand do you see for Power? I mean, it's a cool platform, but is there really enough demand to justify your R&D on it?

A: Yeah, there are a lot of applications that can take advantage of how many threads it has and how the performance is, so we see it as a really good place to start digging in. And we really like the open platform. We like OpenStack, and we like being able to contribute to our hardware as well. For a lot of years, we've had to work with Dell and HP, and if we have an issue, we're kind of at their mercy. So we want to take the reins ourselves, and we thought OpenPOWER was a really good place to start.

Anything else? All right, well, thank you.