Hello everybody, my name is Neal Gompa. I'm here with Josef Bacik. Say hello, Josef and Dusty. We're here to talk to you about Btrfs, and here we go.

So a little bit about all of us. I'm a contributor and package maintainer. I work on various systems management projects, and professionally I'm a DevOps engineer at Datto. You can reach me by Twitter, and my email's on there. Josef?

I'm one of the core Btrfs developers. I've been working on Btrfs since it was released to the public, so 14 years now, a really long time. I'm currently a software engineer at Facebook, and yeah, that's about it.

Hey everybody, my name's Dusty. I'm a big Fedora contributor, at least I like to think I am. I'm involved in the Fedora Cloud and Fedora CoreOS groups, and I'm employed by Red Hat to work on Red Hat CoreOS and OpenShift-y things. You can find me on Twitter or email or IRC or many different places that Dusty Mabe has posted up his name. That's a good way to put it.

All right, so for the starting point, I'm gonna hand this over to Josef to talk about Btrfs and introduce it to the wider world.

So cool. So what is Btrfs? Btrfs is new — we say new, but it's relatively old actually. It's a copy-on-write file system for Linux, and the idea behind it is bringing some of the more advanced, modern features to a Linux file system. Those features being snapshotting, send and receive, RAID support, checksumming, compression, encryption — all these kind of built-in things you would expect a file system to do. That was kind of the goal of Btrfs. Next slide. It did not change. No, there it goes. It's just a little laggy. Okay. There we go.

So, what is copy-on-write? This is kind of how we maintain data integrity. And by that I mean, whenever you crash the box, it comes back and you want your file system to still be in a stable, mountable situation. The way Btrfs does this is that every time you modify the file system, it allocates a new block, copies the old data into it, makes your modification, and writes out the new data to the new location. So there's always a consistent view of the file system on disk at any time.

Previously, one of the more popular approaches to data consistency was journaling. That's what ext3 and 4 and XFS do, which is essentially: you write metadata to this one section of the disk, then you write it to the original place, and once it's written to the original place you tell the journal you don't need that block anymore. So it's overwrite, basically. You write to the journal and then the new place, but every time you modify the metadata you write over the old location.

One of the benefits that copy-on-write gives us is really, really cheap snapshots, because we create new trees every time we modify things, which makes it really easy to snapshot. So you get snapshots for essentially free. Basically every snapshot operation is the same no matter how big the file system is: you just copy the root, update reference counts for all the children, and go. So it makes it really cheap and easy to create them and to keep track of them. Because of that, it's relatively easy to move back and forth in time. And because of the sharing, it means we can do things like send and receive really easily and really cheaply, instead of having to walk through the entire file system and compare mtimes and that sort of thing like rsync does.
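As a rough illustration of the "snapshots are basically free" point — this wasn't shown in the talk, and the mount point and subvolume names are hypothetical — creating a snapshot from the command line looks roughly like this:

```bash
# Assumes a Btrfs filesystem mounted at /mnt/data with an existing
# subvolume named "home" (hypothetical paths).

# Writable snapshot: shares all extents with the source until either side changes.
btrfs subvolume snapshot /mnt/data/home /mnt/data/home-snap

# Read-only snapshot, e.g. as a stable point for backups or btrfs send.
btrfs subvolume snapshot -r /mnt/data/home /mnt/data/home-snap-ro
```

Either command finishes in roughly constant time no matter how much data lives under the source, because only a new root is written and reference counts are updated.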
We can actually, in the metadata, go down and say, okay, these blocks aren't shared, copy the unshared blocks, and it's a lot more efficient.

That being said, copy-on-write is not some magical thing that just makes everything wonderful. It's just a new — or not new, but a different — way of maintaining data consistency. There are trade-offs for everything with file systems, and we get really nice, cheap, efficient snapshots and that sort of thing, but it still doesn't really protect you from bad hardware. So the data integrity features that we have inside Btrfs are not an excuse for you to ignore good data maintenance, which is to say backups: you should still be backing up your stuff. You still have the single point of failure of your disk, right? And there's only so much a file system is ever gonna be able to do to protect you from bad hardware.

So how big can Btrfs get? It's a kind of standard, boring 64-bit file system. XFS is like this, most modern file systems are like this: it will support a max volume size of 16 exabytes. Yeah, it's a lot of data, but hopefully you don't ever really have a file system this big, because it results in other interesting problems. But like I said, it's a modern file system; most file systems do this. The only exception really is ext4, and that's just because of the way they had to iterate on ext3: they have like 48 bits for their addressable size, because that's as much space as they had in the superblock at the time. I want to say that — I could be wrong about that. Yeah, they did recently — I say recently, it's been like three years now — an on-disk format change to switch to a 64-bit superblock, and that was a backwards-incompatible change. Right, and even before, I mean, 48 bits is still a ridiculously huge volume — maybe a problem in 30 or 40 years, it's not a problem now.

I can say from personal experience you probably don't want to approach even a quarter of this size on a single file system, because bad things tend to happen when you get that big. I don't care what file system you're using, bad things happen when you get that big. Yeah, it's a matter of testing and that sort of thing. So WhatsApp has recently started using Btrfs as the backing store for offline messages, and they RAID together four relatively large NVMe drives for a giant 16 terabyte file system, and they fill that thing up, and it's exposed a lot of interesting corner cases. As file system developers we test a lot of things, but stress testing is kind of where we lack, and so we can do these things, but your mileage may vary.

So this is sort of a list of the features of Btrfs that Josef has just touched on briefly throughout the earlier slides. No reason to really read off all of them — the 16 exabytes is the maximum size — but the most important thing here is that this demonstrates all the different capabilities it has across the board for a variety of use cases, including the subvolumes and snapshots, which I'll hand back to Josef to talk a little bit more about specifically.

Right, so every file system tree inside Btrfs is its own B-tree. That's what a subvolume is. Every subvolume is essentially its own inode namespace, everything. It's a completely discrete B-tree.
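To make the subvolume idea concrete, here is a hypothetical sketch — not something shown in the talk — of creating a couple of per-user subvolumes and listing them:

```bash
# Hypothetical layout: a Btrfs filesystem mounted at /home.
btrfs subvolume create /home/alice    # each user gets its own discrete tree
btrfs subvolume create /home/bob

# Each subvolume shows up with its own ID; each is an independent B-tree.
btrfs subvolume list /home
```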
And so subvolumes exist, from a practical standpoint, to be snapshot points, right? So for example, in your /home you would make each individual user their own subvolume so you could snapshot individual users, or you could snapshot all of /home to snapshot everybody's stuff, that sort of thing. Wherever you want to have discrete snapshot points is where you want to have subvolumes.

The other use case, because they are their own B-trees, is that sometimes for application-specific workloads it's handy to have different subvolumes just to spread out tasks. The example is this WhatsApp use case where they shard out messages based on user IDs. They have a different task managing a different section of the shard, and each of these shards has its own subvolume, and this cuts down on the amount of lock contention, because every time you have to modify the B-tree you have to take locks down the B-tree. And so splitting up your subvolumes like this is a nice way to spread out the load.

But how do subvolumes interact with the file system? They act just like a directory. The only kind of special thing is that users can't remove their own subvolumes or snapshots without a special mount option. They also look a little bit funky because inode numbers are unique to the subvolume itself. So if you're used to inode numbers being unique across a file system, that doesn't really happen in Btrfs. If you have a new subvolume and you create a file foo, that's gonna have inode number 257; if you create a new subvolume inside that subvolume and create foo in it, it's also gonna have inode number 257. It's one of those weird unique things. And then from a user's — like from rsync's — perspective, the thing that we did a long time ago is we make these subvolumes appear to be on a different device. This was one of those things that was implemented to help rsync tell the difference between a subvolume and a directory, so it would not walk into snapshots and back up millions of copies of the same thing. But generally speaking, subvolumes behave just like directories.

How big are the snapshots? Sorry, I'm looking at questions in here. How big are the snapshots — are they the same size as the data being snapshotted? No, because of the copy-on-write, if you create a snapshot it's just an extra block. It's an extra root. So it's whatever your block size is — it's like 16K. And so — go ahead, Neal. There was actually a question earlier that I realized I just missed. There's two of them. One was: is the number of disks limited? Is that tested? And there was another one about what seeding from other file systems means.

Okay, so there is a disk limit, because the way Btrfs handles multiple disks is it has a mapping tree, basically, and that's how it bootstraps itself. Everything inside Btrfs just uses a logical byte number, and that's just zero to the length of the disk, right? But then it has this mapping that says, okay, byte number one meg is actually on this disk at this real offset. Because of that, we have to have the actual disk mappings in the superblock itself, and we have a limited size in our superblock. So I think the limit's 128 disks — I want to say that's what it is. We had this weird case, like 2006, within months of Btrfs coming out, where somebody took this weird Dell thing that had 256 disks and tried to make a Btrfs file system using all of them, and it wouldn't let you.
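For reference, a multi-device filesystem like the ones being discussed here is created in one step; this is a hypothetical sketch with made-up device names, not a command from the talk:

```bash
# Three hypothetical devices: stripe data (raid0), mirror metadata (raid1).
mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc /dev/sdd

# Mounting any member device mounts the whole filesystem.
mount /dev/sdb /mnt/pool

# Inspect how logical space is laid out across the member devices.
btrfs filesystem show /mnt/pool
btrfs filesystem usage /mnt/pool
```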
I think we figured out the actual logical limit is 128 disks, and that's because of this mapping thing that you have to be able to bootstrap in order for us to be able to access the rest of the file system.

And then seed devices are like a special read-only device that you can use. So you create the file system and you mark it as a seed device, and this is no longer readable — no longer writeable, sorry. It's only readable. No longer readable would be very bad, Josef. Yes, that would be awful. So the use case for this is for secure enclaves. We actually use this at Facebook — we have these things called POPs, or, yeah, I think that's what they're called. Anyway, these are special machines that we ship to ISPs, places we don't trust, right? We don't control the infrastructure, they just sit in the ISP. Like, I've got like four sitting down the road from me in Raleigh. So we need to make sure that nobody can mess with these things. And so the thing will boot up and it has this encrypted seed device. And then from there we add onto it, because you can add devices to seed devices. You add a copy-on-write device to the seed device that's also encrypted. So any writes that happen go to this other device, but never to the actual seed device. So when you reboot, the scratch space is just deleted and the seed device is still in the same state. Another use case — we have used this for provisioning, where you have like a raw image, that's your seed device. You bring it up, you add in the root device, you delete the seed device, which then copies the information to the writable device, and then you can keep on going. So those are kind of how we use seed devices.

Cool. So I guess that's also like the fundamentals of how btrfs-convert works then, because that sounds very much like the same process for converting from one file system to Btrfs. Yeah, so convert is relatively similar. What we do with convert is basically create a Btrfs file system and create extents that point at the old file system, and just say, okay, those are special, don't ever remove those. And then when you do the remove or whatever, it goes and removes the extents for the metadata and leaves the data in place. Okay, yeah.

All right. Just to kind of somehow slide back into the slides: Btrfs is developed by a wide variety of people. Long ago in the far past, Red Hat was also part of this, but today it is principally developed by folks at Facebook, SUSE, Western Digital, and Oracle. And I think at this point it's sort of obvious why you would want to use Btrfs, but to make it clear: it's a great file system that's developed within the mainline kernel. It takes advantage of the facilities provided within the kernel to be more efficient at doing operations on devices. It's very straightforward to support in a Linux distribution, and it provides a lot of advanced facilities that can be used to do all kinds of interesting user-experience things with very minimal effort. And today it's used in production by, as Josef has mentioned, Facebook. It's also used in Synology, Thecus, Netgear, and Rockstor NAS devices, as well as being used in openSUSE and SUSE Linux Enterprise since 2014 as the default file system for operating system data, and since 2018 for all data in openSUSE. So yeah. Josef, do you want to talk about the Facebook production stuff? Yeah, so Facebook production has grown really organically.
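(Before the Facebook production story continues: the seed-device flow Josef just described boils down to roughly the following. This is a sketch with hypothetical device names and mount points, and the exact steps may vary by btrfs-progs version.)

```bash
# Mark a prepared, unmounted filesystem on /dev/sdb as a seed (read-only) device.
btrfstune -S 1 /dev/sdb

# Mount the seed, then add a writable scratch device and remount read-write.
mount /dev/sdb /mnt/image
btrfs device add /dev/sdc /mnt/image
mount -o remount,rw /mnt/image        # all new writes land on /dev/sdc only

# Provisioning variant: deleting the seed afterwards migrates its data
# onto the writable device.
btrfs device delete /dev/sdb /mnt/image
```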
I'm relatively conservative about Btrfs usage, and I've — sorry, I don't know why that guy's on. Anyway, Facebook's a big company, right? And so a lot of engineers kind of run around trying and doing different things. It originally started with our build servers, where we do this thing where every patch that's applied to the code base isn't actually applied until it builds and it passes all of its tests. The way we do that is: we check out a copy of the repository, apply the patch, make sure it builds, run the tests, and if it does, land it, and we delete the scratch space. And this used to be run on XFS on like RAM disks, and it was real, real slow. With as many developers as we have, you end up with queues of like two to three hours for every patch you tried to land, which kind of got unwieldy. So Btrfs was originally evaluated to solve this problem for one of our worst repos, which was the Android repo, if I remember correctly. And so instead of doing these shallow git clones, the idea was to snapshot — we only update the repo every 10 minutes, and instead of shallow cloning, we just snapshot the original, apply, build, test, remove. And that whole process took like seconds to run compared to minutes, especially for the deletion. The snapshot deletions appear to be instantaneous, because we'll just say, yeah, we've finished, and then we do it in the background. So we're kind of cheating. But that being said, it is significantly less heavyweight than rm -rf'ing a shallow clone, because there you have to remove links, and however many files you have is however many links you need to remove. Whereas a snapshot delete is literally just updating reference counts for any non-shared extents, which ends up being orders of magnitude smaller. So it ended up being pretty fast.

And then from there, we started to use it and evaluate it for our web tier and that sort of thing, because we could use compression and, you know, a variety of other neat tools and features — the snapshotting stuff was really handy. So we did that, and then the container guys got ahold of it and have done all sorts of horrifying things with it. And it's gotten to the point where our entire production environment relies on it. It's the only thing that works — well, I keep saying this, but it's not necessarily true. It's the only thing we test cgroup isolation with. Theoretically XFS will work now, but at the time we were developing all this, it didn't, and ext4 just can't because of how it's designed. I say it probably could, it just would be really hard. And so because of all of these other extenuating circumstances, it's become what we build everything on. And as it's become more ubiquitous in the fleet, people have found new and horrifying ways to abuse it.

The container thing was really interesting, because it first started out with, eh, we're not sure if this is the right thing to do, so we'll just ship loopback devices with Btrfs file systems on them around everywhere. So we had millions of machines with ext4, but every box had, you know, 10 to 20 containers, so they had 10 to 20 loopback devices with Btrfs on them. So that was super awesome. Nowadays, because we have Btrfs roots, we can send and receive images, which has cut down on our bandwidth usage a lot for sending container updates and that sort of stuff.
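A sketch of how that kind of snapshot-based image shipping can look with send and receive — hostnames and paths here are hypothetical, not Facebook's actual tooling:

```bash
# On the build host: keep read-only snapshots of the image subvolume.
btrfs subvolume snapshot -r /images/app /images/app@v1
# ... the image gets updated ...
btrfs subvolume snapshot -r /images/app /images/app@v2

# Ship only the blocks that changed between v1 and v2 to a host that
# already has the v1 snapshot (hypothetical host "node01").
btrfs send -p /images/app@v1 /images/app@v2 | ssh node01 btrfs receive /images
```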
And like I said, because our entire production environment revolves around it, the workloads that are running on Btrfs are very dynamic. You know, one of the arguments early on was that Btrfs usage at Facebook is not the same as how a user would use it, which is, you know, relatively fair. But the way we use it is way worse than any user would ever use it. And there's also the fact that we use it on all our dev VMs. So these are developers, they're just writing code, building things, running tests, which is how a Fedora user is going to use their file system, right? And all of our dev VMs are Btrfs file systems.

Yeah, so some of the big wins: compression obviously was huge. Again, this is another thing where we're trying to show that Facebook usage actually mirrors Fedora usage in a lot of ways. We buy probably the worst solid state drives you could possibly buy, you know, ones that you would find in consumer laptops, essentially. And compression was one of the things that really helped turn around the burn rates for these solid state drives — we were kind of burning through them pretty quickly. The snapshots, I've already mentioned, really dramatically improved build and test times for our build systems and a variety of other things. Send and receive is based on snapshotting, right? And so that's really helped our container story and how we ship things and ship updates.

One of the things more recently — Dennis, one of the guys who works for Facebook — one of the things we noticed with these crappy solid state drives plus cgroup isolation is that discard performance varies widely from drive to drive, from manufacturer to manufacturer. Oftentimes, solid state drives will go out, you know, stop responding for two to five seconds if they get the right discard pattern. So this is something we had to really think hard about. Async discard was the solution that we came up with, which was: move discard outside of any hot path and rate limit it. Because ext4, XFS, and Btrfs did this thing where it's like, okay, we have all of this free space, now we need to go discard all of the free space all at one time. And async discard says, okay, well, we're gonna make sure it's only of a certain size, and then we only do a certain amount over a given period of time, in order to not affect the overall workload. This was kind of the last part of our cgroup isolation work, where discards could drastically affect latencies if it went badly enough, and this kind of solved that for us.

It's not always been awesome. Not everything is great. You know, we're still not awesome for databases, kind of. We still use a lot of MySQL stuff, and a lot of it's moved on to MyRocks, which is the RocksDB-based backend for MySQL. And actually RocksDB works real well on Btrfs with its append-only write behavior — RocksDB is fantastic on Btrfs. The old-fashioned InnoDB overwrite sort of thing does not work awesome on Btrfs. Because of copy-on-write, any overwrite sort of behavior is gonna end up with a lot of fragmentation and ends up super, super sad. So again, this is why for virt images — I think, Neal, this is in a later slide — we recommend nodatacow for virt images. And this is actually two-fold. Nodatacow means you can overwrite: you get a nice big pre-allocated chunk and you just overwrite it, and you don't get the fragmentation.
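A minimal sketch of how that is typically set up for a directory of VM disk images (the libvirt default path is used as an example; note the attribute only applies to files created after it is set, and it disables checksumming and compression for those files):

```bash
# Make new files in the images directory nodatacow (No_COW).
mkdir -p /var/lib/libvirt/images
chattr +C /var/lib/libvirt/images

# Files created afterwards inherit the attribute; verify with lsattr.
touch /var/lib/libvirt/images/test.img
lsattr /var/lib/libvirt/images/test.img   # shows the 'C' flag
```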
The other thing is the way Btrfs does checksumming: you can't change the I/O in flight. We can prevent that in the kernel, but things like virtualization or databases like to use O_DIRECT, where the user controls the memory, and there's no way the kernel can keep the user from modifying data in flight. So you can often end up with checksum mismatches, because Windows, for example, doesn't maintain the page state as it's being written. So we calculate the checksum, we start to write it out, somebody changes the data before it gets written out, and now there's different data that doesn't match the checksum. So this is the sort of trade-off that you have with Btrfs.

In addition to that, checksumming and generally heavier metadata usage results in higher latencies for some workloads. When you create a file in ext4, it goes and updates one bitmap and writes the inode out, and it writes one entry to a little tree to say this name belongs to this inode. For Btrfs, we have two entries for the name to map back to the original inode, plus we have the inode reference to update on the inode so we can keep track of references. That's how we can say, hey, what's the name of this file — when you do scrub, for example, that's how we go and find out what the name of that file is, with the references. And with all this extra stuff, some workloads notice.

Btrfs is fantastic at finding bad hardware. Unfortunately, we had a Fedora user find this out firsthand. Poor guy had bad memory and it corrupted his file system. And he noticed because he got bad checksums, and if you get a bad checksum in the wrong place, you're gonna have a super bad time, which again kind of highlights the continued need for backups, right? Btrfs is really good at finding these problems, and with ext4 or XFS you can go on your merry way in ignorance of these issues. We actually had a pretty interesting issue early on in our Btrfs rollout where we had a RAID device that would write to the middle of the disk every time you rebooted the box, and this was corrupting AI training data. And XFS has no idea, right? So they'd just been using this corrupted AI training data for years, and Btrfs started throwing checksum errors immediately, and of course, it's 2014 and I'm like, nope, Btrfs is wrong, there's definitely a bug somewhere. No, it was this RAID device just writing to the middle of the disk every time it rebooted. It was super cool. And it's been relatively smooth sailing, but there are millions of machines that myself and Omar Sandoval and Chris Mason are responsible for, so it's a little stressful. Yeah, I mean, I wonder if you can measure your Mountain Dew consumption in gallons at this point. Yeah, I've got a trash can that's like desk height that's full of Mountain Dew bottles over there, and that's two weeks' worth of consumption. Whoa, that's not good, man.

So I guess this is where I kind of take over, talking a little bit about Btrfs and Fedora — what we're doing here and where we're going. So a little bit about the current state. With Fedora 33, Anaconda has been configured to install the non-server variants with Btrfs. The disk images of the desktop variants have already been configured to be built with Btrfs. We're still kind of waiting for the final validation for some of those, because of issues that we discovered through trying to build the ARM images.
Josef, myself, and Davide have worked through them, and Davide and I have made patches across the stack for fixing them. So we're just kind of waiting and seeing if everything worked. So crossing fingers, but we're basically ready to fix more of those as issues like that come up.

libvirt now will set nodatacow for VM disk images that it creates, so this will apply to GNOME Boxes. This will — yay, thanks, Kevin. Kevin just told me in the chat that those changes are in fact in production, so we will find out with the nightly compose. Sweet. So yeah, VMs created through libvirt will have nodatacow set automatically. This avoids the very painful double-CoW scenario that impacts performance, and it'll make it so that we can avoid most of the painful performance scenarios that people would see on a default Fedora setup, since we do ship GNOME Boxes on Fedora Workstation, and a lot of people use virt-manager and libvirt.

We do not have compression enabled currently, and this is pending some discussion with the Anaconda developers and tweaks to the image build tools. Chris Murphy and I have been talking about this, and there are some complexities related to how we actually produce images: we want to produce the Btrfs image with a one-time forced zstd compression at level 7, so that the compression is applied uniformly across the board, and then after the fact, in the mount options, we just want it to do zstd level 1, so that on an ongoing basis it's a cheap compression. This is not figured out yet. I don't know how we're gonna do it, and that's part of the reason why that's not there right now.

/boot is not Btrfs by default. This is also pending discussion with the bootloader team. It is technically possible to do this right now — Anaconda will happily let you do it and it does work — but for turning it on by default, I'm not comfortable with that until I figure out more of some of the other related feature enablement that I've got planned for this. And disk encryption currently will use LUKS. LUKS with Btrfs means only full-disk encryption is possible. That means you can't do per-subvolume encryption, and because it's not a partition — both home and root are on one volume — you can only encrypt the whole volume.

Now going forward, for Fedora in the future — so Fedora 34, Fedora 35 planning — I am very hopeful that we can get zstd compression by default. That is something that I think will be extremely valuable and extremely useful in virtually every use case: to have minimal zstd compression just across the board. /boot on Btrfs by default is something I do want to change for the Btrfs default setup sometime in the next year. This is essentially going to be a prerequisite for supporting online or live, full or partial disk encryption using Btrfs native encryption. Now, Josef has mentioned this to us before, and I don't know if you want to speak a little bit about the native encryption stuff, but the core thing is that it will at least require moving /boot to Btrfs, because we need a way to do full-disk encryption properly here. Do you want to talk about the pending upstream work that's going on here? Yeah, so Omar Sandoval is working on the per-subvolume encryption stuff for us. It'll look essentially like what fscrypt looks like for ext4. It uses the same infrastructure and everything.
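(As an aside, to make the compression plan described above concrete: a rough sketch of what the two settings could look like. The device name and mount point are hypothetical, and the exact options Fedora ends up with may differ.)

```bash
# At image-build time: force zstd level 7 on everything written into the image.
mount -o compress-force=zstd:7 /dev/loop0p3 /mnt/sysroot

# At runtime: cheap zstd level 1 via the mount options in /etc/fstab, e.g.
# UUID=...  /  btrfs  subvol=root,compress=zstd:1  0 0
```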
The last thing we want to do is roll our own encryption stuff — that always ends in tears with security stuff. So the way it'll work is it'll be per subvolume, and you will be able to do it per file system, but subvolume obviously is the bigger get, right? And the main thing that he's working on right now is that there are a lot of features inside Btrfs that need to be reworked in order to support this, namely send and receive, because again, for our use case, we want to be able to send and receive secure containers that might have user data on them, right? And today, for example with compression, send and receive will decompress and send the raw stuff — the actual data — over the wire. That's not nice, but it wasn't really a problem; with encryption, that is a problem. We want to be able to send the actual encrypted data, as well as compressed extents and that sort of thing. So he's working on that right now: being able to send and receive the raw encrypted data on the send and receive side. And once that's in place, then it's just a matter of getting the repair stuff working for scrub and the multi-disk stuff, because we'll automatically rewrite things in the background. Like, if you have a RAID setup, a mirrored setup, and one disk is going bad and the other disk is fine, we'll rewrite the bad copy to another location on that disk to repair it. And this again has to be a little bit sensitive with encrypted data. So there's stuff like that that needs to be figured out, and he's getting that work through right now. The idea is that by the end of the year we have that at least going upstream, and then we'll have per-subvolume encryption.

Cool. So yeah, and as I said earlier, that will require having /boot on Btrfs. The next thing that I'm hoping to have done within the next year — David and I have started strategizing about this, and he's made the initial work for it — is support for Btrfs in the osbuild image build tool. The initial work was already done during this cycle: it can produce a Btrfs file system, but it doesn't have a mechanism for creating subvolumes and setting up the flags and stuff that we need for a lot of what I've been talking about here. And so that's something that we need to go back and figure out how to implement. The reason why I put osbuild in here is that there have been some discussions about using osbuild more for building images, to replace some of the litany of tools — that's the nicest way I could put it — for building images in Fedora. I think at my last count there were like five, which is like four too many. It was very, very hard figuring out everything I needed to fix. So if everything is in fact going to move towards osbuild, we want to make sure that Btrfs is a first-class citizen there. And so we're working towards that.

And the last bit that I've been working towards and thinking about is a simpler setup for full-system snapshotting and boot-to-snapshot. This is pending some coordination with the bootloader team and the snapper developers. And the reason why I'm talking about this particularly as a separate point is because Red Hat and SUSE have very different philosophies on how this is going to work in their platforms.
Red Hat has been moving more towards the strategy of using configuration file snippets with the bootloader spec — the non-standardized, very extended version of the bootloader spec — using configuration snippets instead of having GRUB do auto-discovery. The SUSE style has been: if you structure the file system correctly, GRUB can actually just figure out all of your snapshots, populate the menus, set it up, and you're good to go. Red Hat is going for the more concrete — I guess in my opinion more strictly defined — model of how to do this. And we just simply don't have any infrastructure in place to do it that way just yet. We actually do support the SUSE-style way today, but I didn't want to go towards that for defaults in Fedora when the bootloader team and everyone else is really moving towards this bootloader spec thing. So this is more of an "I have to go back to the drawing board" and figure out how we want to implement this properly. And I want to make this something that we can expose at the desktop level for tools and other things to use and take advantage of. I don't really know when that's going to happen, but it's certainly on my roadmap for this.

And let's see, before I move on to the next section, is there any other — yes, Matthew Miller: people need to stop making new tools for building images. We're now, I think I last counted, at seven. We're at seven image building tools in Fedora and they're all in use somewhere. And that really hurts a lot. So yeah.

Now I think I'm going to hand this off to Dusty, who will show us the coolness that is Btrfs with system snapshots. Oh yeah, Kevin, yes, there is discussion about homed and Btrfs integration. Actually, Lennart was one of the first to suggest that we use Btrfs with homed. So this will definitely be a part of the overall strategy as we look towards this. But let me hand this off to Dusty so he can show us cool stuff. Well, we'll see. I actually have a slide in there, Neal, if you can — yeah, I mean, move on to it.

Okay, yeah, so I am going to demo today kind of my custom setup that I've been using for years. A little bit of history here: I used to work for a telecom company, and one of our big features was being able to upgrade and roll back. At the time, a very long time ago, we were using RPM repackaged packages — if you happen to know what those ever were — oh no — in order to do rollbacks. Part of my job when I was there was to actually advance our state of the art and not use RPM repackaged packages anymore, because that can be bad, and move to something a little more reliable. So we started moving over to LVM logical volume snapshots with thin pools. And then when I came to Red Hat, I was kind of monitoring Btrfs a little bit, and then also this rpm-ostree thing was just getting a start. So everything in this space has always kind of interested me. Anything that's like, oh, let me upgrade my system and also go back to a previous point in time, I've kind of dabbled in a little bit. So this is just an example of me playing around with the tools that exist and seeing what's possible.

So my setup, at least what I'll show you today, is a simple system with a single file system, a root file system. I probably shouldn't have said slash root — it should have just been slash. But it's just a root file system. In this case, I do have a LUKS setup on here, because the last time I tested it, that just happened to be how it worked.
So I'll go ahead and apologize for having to wait for GRUB to decrypt the device in order to get into it. I'll make Neal answer a question or tell a joke or give a fun fact during the 10 seconds it takes every time. So the way the thing's set up is we have a Btrfs snapshot set up to be taken each time DNF does a package update. And what we'll do today is demonstrate rolling back to a previously taken snapshot. All of this is documented in a series of blog posts that I do periodically that say how I set this up for this version of Fedora. The last time I did it was for Fedora 31. I usually skip releases, just because I wish I had ample time to go through and do this every time, but I usually wait until one EOLs and then I do it. The caveat for my current setup is that it does lump /boot into the root file system, so I don't handle UEFI. I basically wanted to be able to snapshot everything, and since UEFI requires FAT, that's just not something I really wanted to get into. And the other caveat is the state of the art might be better today. I mean, I implemented this a long time ago; I tweak it every once in a while, but I haven't spent a lot of time going back and trying to reinvestigate how things are.

So taking all that into consideration, let's see if I can share my screen. Yay. Okay, so everybody can see my screen and the font is big enough. Should I do anything different? Looks okay. Yep, looks good.

All right, so what I've got here is a system that's set up. I've got the serial console up here and I've got SSH down here. Basically what I have is a single disk in my system, a partition, LUKS on top of that, and then I actually do have LVM — don't ask me why, it's crazy. But the important thing to know is that it's Btrfs on the root file system that's on the root LV. So let me go through and actually show: the root logical volume is five gigabytes, and there's an actual Btrfs file system there — if I do blkid on that, it should show Btrfs. So that's all you really need to know at this point. This system I literally just installed, so it's been up 37 minutes, since right after the beginning of this talk, and it's brand new, fresh.

So what I'm gonna do right now is I'm going to enable quota on the file system. I don't know if this is still needed or not; I just know it used to be. This kind of allows us to keep track of how much usage is in each snapshot. So that's just a preparatory step. And the next thing I'm gonna do is install snapper and a Python — or a DNF — plugin that basically will hook into snapper every time we do a transaction. So this basically adds the glue that allows DNF to trigger a snapshot to happen. Okay, so that's installed. And the next thing I'm gonna do is tell snapper — oh my gosh, I just almost called it the wrong thing — to create a configuration for the root file system, or yeah, for the root snapshot. And then what we can see now is we have a .snapshots subvolume. And var/lib/portables is in there just because systemd creates it by default, so you can ignore that for now. The next thing I'm gonna do is set it up so that that .snapshots subvolume actually gets mounted on boot. I typically do that just so I can go back and look at what snapshots exist and diff files if I need to. This actually became a problem for me recently, because an SELinux update caused a relabel of all the files in the file system that it could find.
And so for my 80-some snapshots over the past eight months or so, it decided to go through each one of them and try to do that, which was not good. Yeah, that was not good.

Anyway, so what I can do now is look at the default subvolume for root, and it is the one with ID five. And then the next thing I'm gonna do is create a new snapshot and call it "big bang". So this just represents the first point in time at which there's a snapshot that exists for this system. So we can see basically we have the thing that I'm currently booted into, and then also the first snapshot that I created with the description "big bang". And you can see the amount of space that's exclusive to that snapshot.

Okay, and the next thing I'm gonna do, since I just created a snapshot, is go up here in the serial console and run something that's gonna take just a little bit of time. So now that I've got a snapshot, let me do something to change the system: I'm gonna update the kernel. If I look right now, my kernel is very old, because this is installed from the Fedora 31 Server DVD. Yeah, so this is 5.3.7. So I'm getting a much newer kernel up there at the top. But while we wait on that kernel to get installed, let's actually go through and look at the subvolume list that was created as a result of us creating this big bang snapshot. So you can see that a snapshots/1 snapshot now exists. And there's actually a snapshots/2 snapshot — that is because the start of that RPM transaction up there actually created another, ID number two, snapshot. So at the beginning of the RPM transaction it'll create a snapshot, and also at the end of the RPM transaction it'll create a snapshot, which is kind of cool.

So we're waiting for that to install. Once that gets done installing, I will reboot the system. And we will see — first of all, we'll wait a really long time for GRUB to decrypt my LUKS device, and during that amount of time, I'm gonna make Neal tell us something funny or interesting. Sure. I mean, while Dusty's computer goes through basically decryption hell: one of the things that differs between Dusty's setup and mine, actually — well, ignoring the fact that he's got LUKS and LVM underneath — is that I have /boot on Btrfs, split out as its own subvolume, rather than having it just integrated in the main file system. And the main reason I have it that way is because I want to be able to not have /boot snapshot on the same cadence as the operating system. And this is mostly because of quirks with configuring the bootloader: you don't want the bootloader configuration to get rolled back with the rest of the system sometimes, and I like the flexibility of that not happening when I don't want it to happen. And also, in my case, I have GRUB set up to auto-discover the snapshots and populate the menu. Whereas — I don't know, Dusty, is your setup set to auto-populate or no? The GRUB menu? Yeah. So I think those were extra patches that I needed to pull in. Okay. So I don't have the option to select different snapshots in GRUB right now. Yeah, I don't have that. Yeah. I have that by virtue of my weird setup, which is nice, but I don't think it'll fly in Fedora anytime soon.

So — I've got the system back up. And we can see we've got two different kernels, and you can also see that the newer one is the one that's booted right now.
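For anyone following along at home, the demo up to this point boils down to roughly these commands; package and config names are the standard snapper ones, but your exact invocations may differ:

```bash
# Enable quota so snapper can report per-snapshot space usage.
btrfs quota enable /

# Install snapper and the DNF plugin that snapshots around transactions.
dnf install -y snapper python3-dnf-plugin-snapper

# Create a snapper config for / (this creates the .snapshots subvolume).
snapper -c root create-config /

# Take the "big bang" snapshot and list what exists.
snapper create --description "big bang"
snapper list
```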
So we essentially have an updated system, but we also have those snapshots that were taken pre and post the package update. So what we're gonna do now is actually go all the way back to the original snapshot that I told you about, which was the very first one, the big bang. So I'm gonna roll back to one, and it tells you what it's doing. It's creating a read-only snapshot of the current system — so it created a brand new snapshot, snapshot four — and it's gonna create a read-write snapshot of snapshot one, and that's what we're going to boot into as a result. So it actually doesn't affect snapshot one; it just says, oh, make a copy of snapshot one and set that as the target. So if we look at what snapshots exist right now, we can see five is there, and that's what we're actually gonna boot into next time. It's also important that snapshot five takes up virtually no space. Yeah, that's right. Yeah, and it's funny, because as you go on, you can start to see things like the exclusive space that's used for each one start to increase. So if you start to run out of — let me type this — if you start to run out of space for whatever reason, you can go back in and choose which snapshot is a good candidate to get rid of based on how large it is and stuff like that.

Neal, any more fun facts for us? Sure. Well, no, I don't — what about you, Josef? I don't have anything. I got nothing, man. So yeah, I apologize for having to wait for this to decrypt every time. I'm guessing that GRUB's LUKS code is, I don't know, probably just a pure software implementation or something — I don't know, it takes a while. And I also apologize for not setting this up in a different way. I did try to do that right before this talk and it didn't work and I got scared, so I was like, no, just go back to exactly what I know works. So yeah, at least we have a demo. All right. So we should be back up here. Peter just answered: it has no way to use hardware acceleration or threading. Oh yeah. So with no threading it's super slow. Yes, it definitely is. But hey, it has LUKS support, it's kind of neat.

Okay, so now we are back in the system at the point in time at which I created that first big bang snapshot. So if I run rpm -q kernel, I only see one kernel that exists. If I look in the /boot directory, I don't see the new kernel that we installed — I don't see any of that. So this is just an example of: you made a change to your system, maybe it didn't work for whatever reason, and you can go completely back to the state of the system at which you took a snapshot in the past. This isn't gonna solve a problem with regards to your data, right? So if you don't want to take your data back to that point in the past, then you'll need to have another subvolume or something like that where your data is. So pretty much everything goes back, and then everything is under the .snapshots directory, and we can actually look in snapshots — was it three or four? And if we look under there, you can see, oh, that is where we had actually updated, and the new kernel existed on the file system, right? So this is just a quick and dirty example of what you can do with Btrfs snapshots. It interested me because rolling back was always fun. Yeah, I mean, just a small point to add to this: this concept doesn't have to be applied to a full system.
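(For reference, the rollback Dusty ran in the demo is roughly the following; the snapshot number is just the one from this particular demo.)

```bash
# Roll back to snapshot 1 ("big bang"): snapper takes a read-only snapshot of
# the running system and creates a read-write snapshot cloned from snapshot 1.
snapper rollback 1

# The new read-write snapshot becomes the default subvolume for the next boot.
btrfs subvolume get-default /
snapper list

reboot
```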
As Josef mentioned earlier, Facebook is using this with containers, and this is actually most of what I use this for. I mean, I do have the full system snapshot setup. One of the reasons why Dusty's setup is now much simpler in his blog posts is because I helped make it work with normal GRUB, because I used to have to patch my GRUB for this, and that was not fun. But — so I think I had a COPR or something where I had a patched GRUB. Yeah, that was not fun. But you can do this with containers, and I actually do this quite a lot to manage different operating system environments where I need to do fairly destructive things inside the environment and roll it back. And so that's very handy, and systemd-nspawn has integrated support for this.

So with that, we have just the questions and resources. So I know we're technically over time, but if anyone's got questions, I think we're happy to answer a few. Is it possible to boot from USB or DVD, mount the Btrfs file system, chroot into it, and do the rollback? Yes, that is totally possible. As long as your operating system environment actually supports mounting the Btrfs file system, you can do anything. And I've actually rescued one of my laptops, which had a faulty SATA SSD — which is how I found out I shouldn't buy certain brands of SSDs anymore — by using Fedora live media, mounting it up, and rolling it back.

David's asking: are we gonna have a blueprint for Btrfs-based images? I assume you're talking about with osbuild. Yeah, I mean, I'm happy to work with anyone who's interested in that to start making some examples of how to produce Fedora-based images using Btrfs. This is something that I am personally very excited about. So I'd love to — if anyone's interested, I'm happy to help them with that, and I'm sure David and I can help with making that sort of become a thing.

Jerry asks: any progress on booting Btrfs read-only on a failed system? Ah, Josef, I think this is actually more in your wheelhouse. You know, it's kind of on a bunch of us, right? So if things go wrong right now with any file system, you get dumped at an emergency prompt, which is not awesome. Not really a problem for XFS and ext4, because generally you can limp along and you won't notice problems; with Btrfs you will. So the idea is to change this, and this is more of a system-wide change. This involves systemd work and maybe some GNOME work, that sort of stuff, to say, okay, I couldn't mount the file system because of this, try some of the fallback options in order to get a read-only environment so we can boot up and at least try to fix things. And so I'm doing work on the Btrfs side to not only make us more resilient to really bad failures in general, but also allow us to limp along in better cases. And then there's work that needs to be done on the systemd side to handle a file system that's read-only and then provide the ability to mount with these different options if things go wrong.

So there's a question about supporting defragmentation of the file system without undoing deduplication. My understanding was that we already do this basically for free with the way that works. So, the defrag stuff is not snapshot-aware, because of how it was originally implemented. Oh. And so it was one of those things where I found it and it was like, oh God, that's fucking terrible, and I turned it off because I couldn't fix it at the time.
And then it just has not gotten fixed since then. It's something that we can address eventually — it's not that hard to do, it's just that somebody needs to sit down and do it, and my to-do list keeps getting longer and longer and preempted and preempted. So there are plans to do it; it just doesn't happen right now, and it kind of sucks. Yeah, I know that feeling.

So, anything else from anybody, or are we done here? Looks like we're done here. So thank you all for coming to our talk about Btrfs on Fedora. I hope you'll have a great time with Fedora 33 with Btrfs by default, and let us know what you think about it. Oh, and I forgot to mention, there will be upcoming test days and all kinds of fun stuff like that. And I'm crossing my fingers that we can get a badge for testing Btrfs in Fedora during this cycle. So if we can get that squared away, then if you help us test during test days and beta and stuff like that, you could get a badge for helping us make Btrfs even more buttery. So yeah, thank you all. Thanks everybody. Yeah, thanks everybody.