Okay, so my talk today is going to be about how I do ZFS-powered magic upgrades: using ZFS boot environments to replace the operating system of servers in the field as quickly and as safely as possible. My name is Allan Jude. I'm on the FreeBSD core team and I'm an OpenZFS developer, and for my day job I work at Klara, which is a FreeBSD professional services and support company. A quick overview of what we'll talk about today: first, what a boot environment is, in case you haven't used them before, and how they work. Then how we actually create system images and golden images using boot environments, how we deploy those to various environments including remote machines and appliances, and then how we can improve the process even further than what I've come up with so far. Starting in the past, looking at how this was done on FreeBSD before the days of ZFS: this is how things like pfSense and FreeNAS worked until they switched to ZFS over the last couple of years. We had a system called NanoBSD, which was a build system for building an embedded version of FreeBSD for appliances and images like that. What it would do is take the hard drive and partition it into two halves. You would install the latest version of pfSense or whatever into one half and run off that; that part of the system was mostly read-only, and there was a small partition at the end where it kept the configuration files. To upgrade, you would write the new version into the second partition and set a flag in GPT saying: next time we boot, boot the second partition instead of the first. The boot loader would then erase that flag once it had used it once, so even if that image didn't boot, just by power cycling again you'd get back to the working image.
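That ping-pong scheme maps onto the GPT partition attributes that FreeBSD's gptboot understands (bootme/bootonce); a hedged sketch of what an upgrade step could look like with those — the device name and partition indexes here are assumptions for illustration, not the exact NanoBSD scripts:

```shell
# Write the new image into the inactive slice (assuming ada0p2 is the spare)
dd if=new-image.img of=/dev/ada0p2 bs=1m

# Ask gptboot to try partition 2 exactly once on the next boot;
# if that boot fails, the flag is cleared and it falls back to partition 1
gpart set -a bootonce -i 2 ada0

# After a successful boot from the new slice, make it the permanent default
gpart set -a bootme -i 2 ada0
gpart unset -a bootme -i 1 ada0
```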
So on a device like a pfSense box that might not even have a screen to connect to, you could just do the upgrade; if it doesn't work, power cycle it and it goes back to the working image. If it did work, you would stay running off the second image and set a flag to make that permanently the default boot, and in the future, when you need to upgrade again, you just overwrite the first partition, ping-ponging back and forth, always able to go back one revision if something went wrong. That worked quite well, but it has some limitations, including that you have to partition your storage in half; if your image gets a little bigger, you might not have enough room, especially on something like an SD card. Compare that to ZFS boot environments: ZFS takes all of your available storage, makes a pool out of it, and layers basically thin-provisioned file systems on top, where each file system only takes space from the pool as it needs it, meaning all of your free space is available to any one of the file systems, or all of them. So we can actually keep multiple versions of the root file system: we have a version of / with the current version of the software, we snapshot and clone that and keep it as the before image, then upgrade the system in place, and if the upgrade doesn't work, we just reboot onto the older version. Same idea as NanoBSD, except you're not limited to two images; you can keep the last ten, or however many you have space for. The other big difference is that you separate out the other file systems. You couldn't do this in NanoBSD, because you'd end up with so many partitions you'd run out of space, but with ZFS we can make sure that, say, your home directory lives in a different file system. For example, I updated the OS on my laptop, and then when I tried to give my talk earlier today, the HDMI didn't want to work.
So I rebooted, selected Thursday's boot environment on my laptop, and it worked again. But when I rolled back, I didn't lose the changes to my slides that I had made on Friday, because they were in my home directory, which didn't get rolled back. I only rolled back the operating system and the packages, not my log files, my home directory, the database on my server, or whatever. And since snapshots are instantaneous and don't actually take any space until you overwrite things, you have no reason not to keep lots of snapshots in ZFS, so it's easy to go backwards. This way you keep as many working images as you want, and no matter how long ago it turns out you introduced a bug, you can always go back to before it. So we basically have multiple versions of the root file system, and you can use the menu in the boot loader to select which one you want to start from. You can always choose to revert and go back, and again, you don't lose changes to the rest of the system, only the operating system. I have a slide coming up that shows how we lay out the file systems to separate the operating system and packages from the rest of the system, so that you can undo the operating system changes but not all the other changes on your system. Because of the flexibility you get from ZFS, you get to decide what lives as part of the system image and what stays persistent across upgrades. Any files that end up in the root file system, /, are considered part of the operating system. You'll see in a second how we actually create a dataset for /usr but don't mount it, so that all the files in /usr/bin end up in / and will be rolled back or forward as part of the boot environment. You do have the choice of whether you want to include /usr/local in the boot environment or have it be a separate file system.
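On a recent FreeBSD, this day-to-day boot environment workflow can be driven with bectl(8); a sketch under the assumption of a default ZFS-on-root install (the environment name here is made up for illustration):

```shell
# List existing boot environments and see which one is active
bectl list

# Snapshot and clone the current root before an upgrade
bectl create before-upgrade

# ...upgrade in place, reboot, discover something broke...

# Boot the saved environment once to confirm it still works,
# then make it the default again
bectl activate -t before-upgrade   # -t: temporary, next boot only
bectl activate before-upgrade      # permanent default
```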
On my laptop I always include it as part of the operating system, because it's more often that Xorg breaks than that FreeBSD breaks, and that's what I want to be able to undo. But on some of our servers we keep the software separate, so that we can update the OS without having to touch the software if we don't want to. So here's the layout of the default file systems you get if you use the ZFS installer on FreeBSD. We have the root of the pool, which we mount at /zroot. That's so that when you create a new dataset, say zroot/foo, it gets mounted at /zroot/foo, like the default in ZFS. Then we have ROOT, which is a container for all your boot environments, and we don't mount it at all. And default is the boot environment created by the installer; it's your / file system, and as you can see it contains the 1.6 GB of stuff that exists on my laptop as the operating system. /tmp is empty, but look at /usr here: it actually contains 12.3 GB of stuff, yet it shows as empty, because we're not actually mounting /usr, so all the files in /usr/bin, /usr/sbin, /usr/lib, and so on fall through into /. But we can create other datasets like /usr/obj: if I'm compiling stuff, I don't want to lose that progress when I rewind the version of the operating system. If I compile a new kernel and it crashes, I want to back up to a not-new kernel, but I want my next build for the fix to be incremental and only take a couple of minutes instead of longer. And again, /usr/home is separate, so that when I revert the version of my OS, I don't revert the version of my slides, because then there would be fewer slides and you would be upset. We do the same trick with /var, where again it's not mounted, so that /var/db/pkg and so on end up in the boot environment, but we keep the audit logs, the crash dumps, the regular logs, and my mail in separate file systems so they don't get reverted.
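That layout falls out of the canmount and mountpoint properties; a simplified sketch of roughly what the installer sets up, assuming a pool named zroot on a hypothetical device:

```shell
zpool create -O mountpoint=/zroot zroot ada0p3   # device is illustrative

# Container for boot environments; never mounted itself
zfs create -o mountpoint=none zroot/ROOT
# The boot environment itself becomes / via the bootfs pool property
zfs create -o mountpoint=/ -o canmount=noauto zroot/ROOT/default
zpool set bootfs=zroot/ROOT/default zroot

# /usr and /var exist as datasets but are NOT mounted, so their contents
# fall through into / and travel with the boot environment
zfs create -o canmount=off -o mountpoint=/usr zroot/usr
zfs create -o canmount=off -o mountpoint=/var zroot/var

# Children that SHOULD persist across rollbacks are mounted normally
zfs create zroot/usr/home
zfs create zroot/usr/obj
zfs create zroot/var/log
zfs create zroot/var/mail
```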
So we pick and choose what is part of the operating system and what isn't, and that way we control what we roll back and what persists. When we want to deploy our servers around the world and upgrade them constantly, we basically want to replace the entire operating system with a newer version and not have to deal with merging config files. You know, etcupdate or mergemaster or freebsd-update's merging made me pull out my hair, and I don't have very much. So our goal in designing the new setup with boot environments was: I'm just going to drop down a new version of FreeBSD, it's going to have all the new config files, and I'm not going to have to merge anything, I'll just overwrite. But there are some files in /etc that actually matter, like the machine needs to know what its IP address is; these machines aren't using DHCP or anything. So how do I persist the config files I care about, and not every file in /etc/rc.d where the version control header changed and freebsd-update wanted me to merge them all? I thought: we'll just make /etc its own file system and maybe only have to deal with that. Anybody have any idea why that won't work? Yes: fstab is in /etc, and rc is in /etc, so you can't find the other file systems to mount, or give the system the instructions to mount /etc, if there's no /etc directory yet. As it turns out, lots of things in the boot process depend on /etc being there and having its files in it. Without it you have no fstab, so you don't get swap or any non-ZFS file systems you might want to mount. No /etc/rc means nothing starts. No rc.conf means the machine doesn't have an IP address or know what's on a VLAN. And no ttys means your serial console probably doesn't work either. So good luck fixing it. That's a problem, but I still really, really didn't want to have to run mergemaster. So I stole an idea from NanoBSD.
Because NanoBSD system images are read-only, they have /etc actually mounted as an MFS or tmpfs, and at boot they copy the persistent files from /cfg over top of /etc. If you've ever used something based on NanoBSD, you might have changed rc.conf, rebooted, and wondered why your change went away. That's because you actually have to change the file over in /cfg, which gets written into the memory file system that gets blown away every time. But I didn't really want a memory file system. While digging into this, I accidentally came across a variable, init_script, in the FreeBSD loader. It was originally designed for a setup where you could boot the system in a chroot: in loader.conf you would set init_chroot to a directory, and init_script was a script that would run first to set up that chroot, making sure it had /dev and anything else you needed in there before booting it. I'm not using the chroot part, but I'm using, or abusing, this mechanism, which causes the system to run a shell script before it tries to do anything like run rc. The script I'm using is on the next slide. We set up this little script to mount /cfg very early, while we're still in single-user mode, so that when the system actually boots it'll be able to find fstab and all the other files. And since we have snapshots and so on, we don't have to keep it as a separate file system the way NanoBSD does to make sure you don't corrupt it. So we have /cfg with a couple of important config files in it, and then we replace those same files in /etc with symlinks pointing into /cfg. When we lay down the new image, it already has the symlinks in it, and it knows to read /cfg/rc.conf instead of /etc/rc.conf. The script is pretty simple. It runs the mount command with the -p flag, so it prints parseable, tab-delimited output instead of trying to line up columns for humans.
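Before the script itself, the hook that runs it is just a loader.conf setting; a minimal sketch, where the script path is an assumption for illustration (init_script and init_chroot are the actual loader/init variables):

```shell
# /boot/loader.conf
# Have init run this script before rc; init_chroot also exists but is
# left unset here, since we only want the early mount, not a chroot.
init_script="/boot/early_mount.sh"
```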
Then we just use a simple shell script to read the device, the mount point, the type, and the rest of each line. If the mount point is /, we keep going; otherwise we continue and read the next line. If the type is zfs, we extract the pool name, taking the variable up to the first slash, and then we just mount poolname/cfg. Now the cfg dataset is mounted before we get to the point where all your file systems other than / are normally mounted by the zfs rc service, which gets run from rc.conf, which obviously won't happen if you don't have the /cfg where your rc.conf lives. So we mount the one file system early, and the regular process takes care of the rest of the file systems just like a normal boot. We end up with about ten files that we care about in /cfg; for every other file we ship the version we want as part of the operating system, and it's never unique on any one server. All that we really care about is the network settings: our servers have a /29 of static IPs from each provider. We need certain sysctls that we set for the network and so on. The SSH host keys: the important thing was that when we upgrade the system, we didn't want our SSH keys to change, but we didn't want the same SSH key on every machine either. We keep fstab, because we have different swap configurations depending on how many disks the machine has; it might have more swap, or mirrored swap, or whatever. And the rest of /etc we can just overwrite with the stock files every time. We treat almost all of /etc as if it were the defaults directory. The nice thing about this is you never get leftover merge markers in your files causing them not to parse, and you never have to think about that again, or merge all the rc.d files because the revision control headers changed. You don't hand-maintain your configuration if it's based on the stock files; the files we care about go in /cfg and stay symlinked.
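The parsing logic described above can be sketched as a small POSIX shell function; this is a reconstruction from the talk, not the author's exact script, and the dataset name "cfg" is this setup's own convention:

```shell
#!/bin/sh
# Read `mount -p`-style lines on stdin (device, mountpoint, fstype, options)
# and print "<pool>/cfg" for the pool backing the root file system.
find_cfg_dataset() {
    while read -r device mntpoint fstype rest; do
        [ "$mntpoint" = "/" ] || continue    # only care about the root fs
        [ "$fstype" = "zfs" ] || continue    # and only if it is ZFS
        echo "${device%%/*}/cfg"             # pool name = up to first '/'
        return 0
    done
    return 1
}

# In the real init_script this would be followed by something like:
#   cfg=$(mount -p | find_cfg_dataset) && mount -t zfs "$cfg" /cfg
```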
Because we produce the image that we replace the OS with, it's got the ten files we care about changed to symlinks, and every other file gets upgraded to the vanilla one from whatever version of FreeBSD we're upgrading to. The way we used to do this originally was: spin up a VM, install FreeBSD, then delete the SSH keys, try to clean up, and so on. That never worked out very well; we'll get to that in a minute. So now that we've created this boot environment, how do we deploy it to many servers? On the virtual machine where we've created our image of the newer version of FreeBSD, we just take a snapshot of the boot environment and use zfs send to pipe it into a file, and we xz it as well just to make it smaller. Then on each of the hundreds of machines we want to upgrade, we run fetch with output to standard out on the HTTP path to the file we created, pipe that into unxz, pipe that into zfs receive, and give it the name of the new boot environment, and that file system suddenly appears. Then we use zfsbootcfg, which is how you do the one-boot thing: you say boot this dataset once, and once only. It writes the name of the dataset you selected into the ZFS label; the loader then reads that, immediately overwrites it with zeros, and boots the dataset you selected.
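Put together, the build-side and deploy-side steps look roughly like this; the snapshot name, URL, and pool layout are assumptions for illustration:

```shell
# On the build VM: snapshot the new BE and publish a compressed stream
zfs snapshot zroot/ROOT/12.1-new@deploy
zfs send zroot/ROOT/12.1-new@deploy | xz > /usr/local/www/12.1-new.zfs.xz

# On each server: receive it as a new boot environment...
fetch -o - http://images.example.com/12.1-new.zfs.xz \
    | unxz | zfs receive zroot/ROOT/12.1-new

# ...and try it exactly once; a power cycle falls back to the old BE
zfsbootcfg "zfs:zroot/ROOT/12.1-new:"

# If it comes up healthy, make it the permanent default
zpool set bootfs=zroot/ROOT/12.1-new zroot
```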
If it doesn't work, again, you power cycle; when it comes back up, that override is full of zeros, so it's ignored and the original boot environment from before the upgrade boots. So if something doesn't work, you get back your working system with a power cycle, and if it does work, you're booted into a newer version of FreeBSD. At this point in the process we still had to deal with actually upgrading the packages, because on our servers we keep /usr/local separate from the boot environment, though you could choose not to do that. We also had to upgrade each jail, again upgrading the packages in it. And building images was painful and manual: using that VM, I'd always forget something, whether it was fixing one of the symlinks, or re-erasing the SSH keys, or it had regenerated the hostid file again, or something. So I wanted a much better way to generate these images that I was going to push to all the servers, because I needed them to be right every time, not slightly wrong in a different way every time. The other problem we had is that when we wanted to install a server for the first time, we'd still have to do it the old-fashioned way and then upgrade it, and it was not something anyone other than me could really do, which wasn't useful to anybody. In our environment at ScaleEngine, we have over 100 servers spread across 38 data centers in 11 different countries; as you can see from the numbers, on average there are two or three servers in each data center, so we don't have the kind of setup where you can PXE boot the machines, because there are only a couple in each place, and you kind of have to manage them manually. The sysadmin team was really myself and one full-time sysadmin, and that was it, so anything labor-intensive was right out. And we weren't using the same version of FreeBSD on everything; we finally got rid of the last 11.1 machines, so that's good, but we still have a lot of 11.2 machines that need to get up to 12, and doing
freebsd-update was just too manual and too slow. Fetching 40,000 patches just doesn't make sense on a server with a gigabit connection when you could download the whole thing as one big tarball and be done. So again, we have the process where we just zfs receive, set the one-boot flag, reboot, and the machine is back up as fast as it can reboot. If it works, we stamp that boot environment as the default from now on; if it doesn't, one more power cycle and it's back to how it was, and you figure out what went wrong later. This was very nice, especially since some of the servers are in remote countries like Singapore where we don't even have console access, so we try really hard not to break them, and "just power cycle and it comes back" is really useful. In my investigation of ways to make the images nicer, I stumbled across poudriere. Poudriere, if you don't know it, is the tool used to build the binary packages for third-party programs on FreeBSD: it builds all the ports in the ports tree as packages you can download. It does this in an interesting way: it builds inside jails, one jail for each CPU you give it, because it turns out that keeps the CPUs busier than doing one multi-threaded build at a time; during ./configure and so on, a build isn't actually using all of your cores, so running more single-threaded builds at once turned out to be better. The jails mean you get a fresh environment with no pollution every time, and it can use ZFS to revert each jail back to pristine every time. And it does nice things: you can have multiple different ports trees to build from, you can build each of those ports trees for every version of FreeBSD you care about, on each architecture you care about, with multiple package sets for each, so you can end up with a lot of packages. While I was struggling with some of this, I happened to sit at a picnic table outside in the nice warm air
in Taiwan with Baptiste, and he told me about poudriere image, which he had worked on at Gandi to build VM images for their public cloud. It was specifically designed to build customized VM images, with things like the cloud packages pre-installed, or to build USB stick images. I looked at it and saw it could do most of what I wanted: build ISOs, MFS-based ISOs, USB sticks, raw disks, tar files, firmware images for things like NanoBSD, or embedded images for Raspberry Pis or Pine64s, because Manu helped with it. So my immediate reaction was: all right, I will add ZFS send as an output format, so one of the outputs you can get is the stream to recreate a ZFS pool. I did that, and you get two new options. One is zfs+send, which takes the entire pool it just created and sends it as a replication stream; that's what we would use to do a new install. Or you can do zfs+send+be, and you get only the boot environment part, ignoring the rest; that's the image you would use to do an upgrade. I modified the overlay handling: poudriere image has the ability to say, here's a directory full of files I want you to put over top of the image before you package it up, but it followed symlinks instead of copying them, so I added an option to copy them as-is, so that it would install all my symlinks in /etc pointing into /cfg. And I added support for defining the ZFS layout, instead of being stuck with how I like to do it: you write a text file with the list of datasets you want and the properties you want, and it lays out the pool it creates that way, so you can tweak it to have the datasets you want and pick and choose which files end up as part of the boot environment that you create. It uses the same format as bsdinstall, so if you've already written a customized script for scripted installs, you can just copy and paste it. It does datasets, but you can recursively do a dataset and all its children, and if you do that to the
top, then it's the whole pool. So in the one mode it sends the whole pool as a stream, and in the other it sends only the one dataset; you're basically controlling whether it uses -R or not. And yes, for example, you can control whether you want /usr/local, the packages, to be part of the boot environment or not, because the other thing poudriere image can do is pre-install all the packages you care about as part of the image. If you make /usr/local not its own file system, you can feed poudriere image the list of 100 packages and they'll already be installed, so when you do the upgrades, just zfs receiving the image also takes care of upgrading the packages, which can be really nice. Our other problem was dealing with fresh installs, so I used poudriere image's MFS ISO support to write my own installer: it basically prompts for the IP configuration, then creates a new pool and zfs receives the full image onto it automatically, and doesn't ask any other questions. My work-in-progress patches for poudriere are up on my GitHub; I'm still finishing some of the cleanup. I would like to fix some of the naming for the image formats, because there's one called ZFS raw disk, but it's not what you think it is; you can't just boot it that way. We need it to not go away, because Gandi uses it for the images they create for VMs, but it's too easy for someone to think it's the image they want, because the name isn't descriptive enough. So you can use this to create your own custom images and upgrade with them. The nice thing is that when you're building a new image, it uses the jails you're already building packages with, so if you've already compiled your custom version of FreeBSD to compile your packages against, it just takes the files from that jail. That also means you can tell it to create a jail from a release ISO and not have to compile anything. So if you just
want to customize FreeBSD by taking the release, adding an overlay of your own files and a list of packages you want pre-installed, you can do all that and never compile anything, which is quite nice. I'd like to look at adding more image output formats to it, like making our virtual machine images the way we do for the install images and the official FreeBSD builds, and supporting MBR and GPT and all the other combinations you might want to build your image with, because you could also use this to build hard drive images that you could just dd onto a hard drive to do your install. Some of the enhancements I'd like to look at: like I said, changing the image types so it's more obvious what each one does, maybe spelling them out better in the man page, and adding a lot more combinations; although maybe we actually want to limit the number of combinations so it's more obvious which one you probably want. There's another tool that I helped Warner Losh write, in the source tree under tools/boot, called rootgen, and it generates a QEMU image for every possible combination of images to test the boot loader. It fires each one up in QEMU with a telnet console and lets you run expect or whatever to make sure each one actually boots properly. It would be nice to use the poudriere image support to do this, so it would involve less compiling. It would also be nice to have a flag to automatically bundle the cloud stuff, like bsd-cloudinit, so that when you spin up an image you've created like this on a cloud service, it automatically configures everything for you, or supports the user-data stuff where you provide the commands you want run on first boot. And it would be interesting to look at updating some of the way we build releases on FreeBSD to use this poudriere image support, since it's so nice, and since we're already going to have to build
these images for packages and stuff anyway, we could save some effort and make it more easily automated. Another thing I would like to add is support for post-build scripts: after it does the overlay, I'd like it to chroot into the image and run a set of scripts to set up more complicated stuff, and then create the image from that, because eventually what I'll want to do to the image before I ship it is going to be a bit more complicated than just copying these files over top of those files. And I'd like to look at adding more features for building appliances. FreeBSD is very popular for building appliances, but each appliance vendor is kind of left on their own to come up with an upgrade mechanism; if we can make something a little more solid, then every appliance based on FreeBSD will do upgrading correctly, instead of in its own special way each time. If you're interested in the appliance stuff, definitely talk to me, and if you have other ideas I'm open to looking at them, but I don't know how much free time I'll have. So, questions. The question was about how zfsbootcfg writes the next boot environment it wants to boot into the ZFS label. It writes into one of the reserved areas, so it's not going to confuse Solaris or make the pool incompatible. The way Solaris does it, I think, uses a non-persistent nextboot configuration file, and the problem with that is that if something goes wrong early enough in the boot process, while the file system is still read-only, you can't reset it, and so the system can end up booting that way more than once, or permanently. Yeah, so the thing with the FreeBSD loader, well, it's not really a problem, but it purposely only has read-only support for ZFS; writing to ZFS is much too complicated to do in the loader. So zfsbootcfg actually writes in the stage before the loader, and it knows the specific offset in the label, which is at a fixed position on the disk or in the partition, that is safe to write to,
and it writes the string there. The boot loader then sees it, reads it, and overwrites it with zeros, along with the pre-calculated checksum of what's supposed to be there, so the label checksum still always matches. Yes, maybe not all four labels, I'd have to double-check, but yes, it uses the ZFS label code, it puts the string in one of the spaces that's safe for us to put that kind of thing in, and it updates the checksum correctly so the label stays valid. Delphix ended up doing the same thing in their illumos images, except instead of the boot environment name, they keep track of how many times they've tried to boot, and reset it to zero. What their loader does is, after it sees three failed boots, it purposely boots into a rescue system that phones home and says: I'm having trouble booting, come fix me. Their appliance runs in Amazon and has no console, so it's the only way they could fix them; they just keep rebooting it until it fails enough times to boot the rescue system. I've looked at wanting to extend what zfsbootcfg does: instead of just writing a string, write a packed nvlist or something, so we could actually support both behaviors, like "I want to boot this one next time only", or keep a fail counter and, if it fails, boot this other boot environment called rescue or whatever. I think the space is something like 16 kilobytes, so that's more than enough to store a packed nvlist of five or six features like that. The minimum ZFS file system size is 100 megabytes, so no, you're going to want something more like NanoBSD if you only have eight or 16 megabytes of space. Yes, I've done the same thing with a TP-Link that had 16 megabytes, and it's awfully hard to squish enough of FreeBSD down to have a useful router in only 16 megabytes. Depending what you're doing, you could actually make the root file system read-only: since we're using /cfg for the config files we actually expect to change, and we've moved the log files and so on off, you could decide to make
it read-only, yes. Yeah, I don't know that there's a good way to solve that, other than telling people: don't create just a directory, you have to create a new file system if you want that data to persist. Yeah, a quota would be a good way to catch the problem early, before you write a terabyte of data into the file system, having forgotten that you needed to create a file system first. On putting down a new FreeBSD version, upgrading the pool with new features, and then rolling back to an older version that doesn't support them: we don't do zpool upgrade until we're sure we're not going to want to go back, and usually I'm not after the newest ZFS feature anyway. But yes, if you're going back quite a few versions, or going back a major version, then features could be a problem. It depends on the feature. Some of them, like device removal, work as long as you don't use them: you can upgrade the pool, but as long as you haven't actually tried to remove a disk, you can still boot from it, even from a version that doesn't support device removal. Or even going back to an older FreeBSD after we've added support for the new checksum algorithms: as long as you've not created a dataset that uses one of them, it's fine; or even, as long as you've destroyed the last dataset that was still using the new algorithm, you can still import the pool on an older version that doesn't support the feature. That's the feature tracking in ZFS: a feature becomes enabled once you can use it, but it's safe as long as it's not active, meaning some file system has used it at some point and hasn't been destroyed yet. FreeBSD does not have a good union file system that doesn't crash, unionfs or... oh yeah, well, we're running FreeBSD 13. Yeah, so on FreeBSD we have support for booting from GELI-encrypted disks; I wrote that in the loader and I presented
it at FOSDEM a year or two ago. ZFS native encryption hasn't landed yet, and when it does, I don't expect the boot loader to support it right away. Yeah, it's not really full-disk encryption anyway; the point of it is that you encrypt specific datasets, so most likely it wouldn't be your boot environment, because that's just FreeBSD. If you want to know more, Michael Lucas and I wrote two books, FreeBSD Mastery: ZFS and FreeBSD Mastery: Advanced ZFS. They're not really that FreeBSD-specific, so if you're using ZFS on illumos or Linux or macOS or Windows, the ZFS commands are the same across all of them; some things like partitioning might be a little different on each operating system, but the content of those books is good for all of them. You can get them as DRM-free ebooks at zfsbook.com, or buy them in print on Amazon, or whatever the other big European one is. And every week I do the BSDNow.tv video podcast, where we answer questions about all the BSDs and ZFS, so if you have more questions you can write in to feedback@bsdnow.tv and we'll answer them in some future week on the podcast. Thank you.