Hello, my name is Ariel Miculas and I'm from Cluj-Napoca. I have a master's degree in cybersecurity and I'm also a teaching assistant for x86 assembly language programming. I'm a software engineer at Cisco, and today I will talk about PuzzleFS as the next-generation container file system. So let's get started. First I will give a short introduction, then I will talk about the OCI drawbacks, so this will be the problem statement, moving on to design goals, status and a quick demo. Then the PuzzleFS data format, the results, the Linux kernel file system that I've implemented, and at the end I will be happy to answer your questions.

For some context, PuzzleFS was started by Tycho Andersen in 2021; he first presented atomfs in 2019. My colleague Scott Moser also presented atomfs earlier this year. The idea with atomfs was to replace the tar layers of the OCI image with SquashFS layers and to add dm-verity data for integrity. This has the advantage that you can directly mount the SquashFS layers and also verify their integrity. What is dm-verity? In short, it is integrity protection at the block device layer. If someone tampers with any of the blocks, instead of returning the tampered data, the Linux kernel will just return an error. You also cannot write to those blocks. Why am I mentioning atomfs? Because PuzzleFS aims to be its successor, and it is part of the larger "project machine" that we are working on at Cisco. This is, of course, also open source, and it is an OCI-based secure container Linux.

For some OCI format basics: OCI stands for Open Container Initiative, and the OCI image format is a standardization of the original Docker format. This is the directory layout. It all starts from index.json, where you will usually find a tag. For example, if you want the latest Ubuntu image, then "latest" is actually the name of the tag.
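To make the tag-to-manifest step concrete, here is a minimal index.json in the shape defined by the OCI image spec; the digest and size are placeholders, not taken from a real image:

```json
{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:<manifest-digest>",
      "size": 1234,
      "annotations": {
        "org.opencontainers.image.ref.name": "latest"
      }
    }
  ]
}
```

Resolving the tag "latest" means finding this annotation and following the digest into the blob directory.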
Using this tag, you will get a pointer to a manifest, and from the manifest you will have pointers to the image configuration, which stores things like environment variables. You will also have multiple pointers to the layers. These layers contain the actual bits of the image, the data on disk. They are usually tar layers, but compressed, so each will be a tar.gz file. If you look at the file names, they are hashes. This is because the layers are stored content-addressed: their SHA-256 checksum is also their name on the file system. When you hear me talk about the data store, what I'm actually referring to is this blobs/sha256 directory, where all the data blobs reside.

All the problems stem from this tar file representation in the first version of OCI. There's a really detailed blog post written by Aleksa Sarai in 2019, where he describes all the issues with this format, and I will present a few of them. As I've mentioned, the layers are usually in tar.gz format. First of all, tar is not a well-defined format. It's not a standard; it is rather a collection of different formats, each with its own extensions. It has no index: archive entries simply consist of a header plus content, so files are just concatenated together, and that is basically the tar archive format. You can imagine that if you have a very large archive, it will take some time to seek through it and find the file you are looking for. Another problem is that compression and encryption cannot happen underneath the tar archive. You cannot store compressed files inside the archive; you need to apply compression on top of the archive. And with the same large archive, you have to decompress everything before you can search for the files you are looking for. Another disadvantage is that there is no deduplication: any change you make to the tar file will lead to an entirely different checksum, an entirely different SHA-256 hash.
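The "no index" point can be made concrete with a toy lookup over the classic tar layout (512-byte headers, content padded to 512-byte blocks). This is a simplified sketch, not a full tar parser:

```rust
// Toy illustration of why tar lookups are linear: every 512-byte header
// only describes the entry that follows it, so finding one file means
// walking all the headers from the start of the archive.

/// Build a minimal tar-style header: NUL-padded name at offset 0,
/// 12-byte octal size field at offset 124.
fn make_header(name: &str, size: usize) -> [u8; 512] {
    let mut h = [0u8; 512];
    h[..name.len()].copy_from_slice(name.as_bytes());
    let octal = format!("{:011o}\0", size);
    h[124..136].copy_from_slice(octal.as_bytes());
    h
}

/// Scan the archive for `wanted`; returns (content offset, size).
fn find_entry(archive: &[u8], wanted: &str) -> Option<(usize, usize)> {
    let mut off = 0;
    while off + 512 <= archive.len() {
        let header = &archive[off..off + 512];
        if header.iter().all(|&b| b == 0) {
            return None; // an all-zero block marks the end of the archive
        }
        let name_len = header[..100].iter().position(|&b| b == 0).unwrap_or(100);
        let name = std::str::from_utf8(&header[..name_len]).ok()?;
        let size_str = std::str::from_utf8(&header[124..136]).ok()?;
        let size = usize::from_str_radix(
            size_str.trim_matches(|c: char| c == '\0' || c == ' '),
            8,
        )
        .ok()?;
        if name == wanted {
            return Some((off + 512, size)); // content follows the header
        }
        off += 512 + (size + 511) / 512 * 512; // skip content, padded to 512
    }
    None
}
```

Every lookup walks the archive from the beginning, which is exactly the seek cost described above.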
Because any change produces an entirely different hash, if you want to download another version of the same image, you have to download the entire tar archive; you cannot download just the parts that have changed between the images. One small mention here: if you have multiple images which share the same layer, then you will get some inter-layer sharing, but this is fragile because, as I mentioned, any small change leads to an entirely different layer. There is also no truly machine-independent representation. The directory entries and the extended attributes can be presented by the file system in different orders, and you can put the files in the tar archive in any order you like. It's the same file system being represented, but there is no canonical representation; you cannot say "this is the file system, and this is how its tar archive must look". This leads to a lack of reproducibility: you cannot reproduce the tar archives. On top of this, the multiple extensions add to the problem. For example, there are five different extensions for specifying extended attributes.

Our design goals for PuzzleFS are basically to solve the most pertinent OCI v1 problems. We would like to have reduced duplication, reproducible image builds, direct mounting support, data integrity, memory safety guarantees, and ideally the same implementation in user space and in the kernel. So how do we achieve reduced duplication? We can use content-defined chunking, which solves the boundary-shift problem that I will talk about in the next slide. We use the FastCDC algorithm to chunk an entire file system into variable-sized chunks. With FastCDC, you can specify a minimum, average and maximum chunk size, so basically a range for the sizes of your chunks.

This is the boundary-shift problem. On the top you can see file A, and file A' is almost the same as file A, but we have inserted one byte, this 0xFF byte, at the beginning.
With the traditional FSC approach, which stands for fixed-size chunking, you just split the file into equally sized blocks. In this case, inserting a byte at the beginning of the file has an unwanted avalanche effect: every byte shifts to the right, and no duplicates are detected. On the contrary, with the content-defined chunking approach, doing the same operation, most of the duplicates are detected. The first chunk changes because we have inserted the 0xFF byte, and all the other chunks stay the same. You can see here that every time we find a nine in this file, we declare a cut point. Of course, the algorithm is more complicated than this, but I will show it later.

For a real-world application, imagine we have an Ubuntu image at version N, which is 80 megabytes in size, and we want to apply a small patch to libssl.so, maybe a security patch. Now we get another version of Ubuntu, at version N+1. It's still around 80 megabytes in size, but the delta size, the size we have to ship to the end user, will also be 80 megabytes, because we need an entirely new tar layer. The solution here is to use CDC, content-defined chunking. If we chunk the entire file system into chunks and then apply a small patch to libssl.so, only one chunk changes in the entire list of chunks. The Ubuntu image at version N+1 will still be around 80 megabytes in size, but the size we need to ship for this small patch is around 80 kilobytes, based on the average chunk size that PuzzleFS uses.

Getting into the details of content-defined chunking: it uses the sliding-window technique and computes a hash of the window; this is called a rolling hash. We then take the last n bits of the hash, and if these are zero, we declare a cut point. The interesting thing about this is that the cut points only depend on the last window-size bytes, where the window size is usually 48 bytes.
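As a toy illustration of this cut-point rule, here is a simplified content-defined chunker. This is a sketch, not the real FastCDC (which uses a precomputed gear table and normalized chunking); all the constants are made up:

```rust
// Gear-style rolling hash: each step shifts the accumulator left and
// mixes in the next byte, so contributions older than 64 bytes fall off
// the top of the u64 and the cut decision effectively depends only on a
// sliding window of recent input.

const MIN_CHUNK: usize = 16;    // minimum chunk size
const MASK: u64 = (1 << 6) - 1; // 6 zero bits => ~64-byte average chunks

/// Deterministic byte -> u64 mixer (splitmix64), used as the gear table.
fn gear(b: u8) -> u64 {
    let mut z = (b as u64).wrapping_add(0x9e37_79b9_7f4a_7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

fn chunk(data: &[u8]) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let (mut start, mut hash) = (0usize, 0u64);
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear(b));
        // Cut when the low bits are zero and the minimum size is reached.
        if i + 1 - start >= MIN_CHUNK && hash & MASK == 0 {
            chunks.push(data[start..=i].to_vec());
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(data[start..].to_vec());
    }
    chunks
}
```

Because cut points depend only on recent bytes, inserting a byte at the front changes the first chunk but the chunker resynchronizes shortly afterwards, which is exactly the boundary-shift fix described above.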
So the process is: we start with the window and compute the hash. If the last n bits of the hash are zero, then we say we found a cut point. If not, we just move the window one position to the right and repeat the same process. You can imagine that if you insert multiple bytes at the beginning or in the middle, you still have a pretty good chance of identifying the same cut points afterwards.

Our next goal is to have reproducible image builds. We want a canonical representation of the file system. For this, we define a fixed traversal order of the file system: you can traverse a file system breadth-first or depth-first, and we make sure it's always the same order when we are building a PuzzleFS image. Then we make sure the directory entries and the extended attributes are sorted lexicographically. And finally, as an implementation detail, we use BTreeMaps, because they have a defined iteration order, unlike regular hash maps.

Another goal we have is to prevent tampering, and for this we would like direct mounting support. One other issue with the tar archive is that you need to extract it before making use of it. This is a problem because you then cannot make sure the data was not changed in the meantime; you cannot put your extracted data back into a tar archive, compare the two, and say "okay, nobody tampered with the data". What we want is to remove this extraction step and mount the file system directly, so we can be sure we are using the same thing we originally built. We also want the format to be simple enough that we can decode it in the kernel.

Speaking of data integrity: if you remember, I mentioned atomfs at the beginning of the talk, and atomfs was using dm-verity, but this doesn't fit our use case. Even though PuzzleFS is a read-only file system, we still want to write to the data store.
If we want to download a new image, a new PuzzleFS image, then we need to write the new blobs to the data store, so dm-verity doesn't fit our use case. But we have optional support for fs-verity. fs-verity is very similar to dm-verity, but instead of working on block devices, it works on individual files, and it must be supported by the underlying file system on which PuzzleFS resides. What we do is compute the fs-verity digest for each file and store it in the PuzzleFS image manifest. For the PuzzleFS image manifest itself, we also compute the fs-verity digest and pass that information in an out-of-band way; we do this by passing it on the command line of the PuzzleFS mount command.

For PuzzleFS we want memory safety guarantees, and this led to the decision to implement it in Rust, both the FUSE version and the in-kernel file system. Rust has many benefits. It eliminates entire classes of bugs caused by undefined behavior, for example dangling pointers, use-after-free and buffer overflows. It has a very strong type system and first-class support for writing unit and integration tests, so it makes it very easy to actually write the tests. For all of these reasons, in my personal experience, it leads to very painless iterative development.

And finally, we want to share the code between user space and kernel space. Luckily, Rust support for the kernel was merged in Linux 6.1, and we don't want to write the same code twice. But of course there are some differences. First of all, the kernel only allows fallible allocations: for everything you allocate, you must be able to deal with the fact that the allocation might fail. It's not like in user space, where you don't care and the out-of-memory killer jumps in and kills your process. We also cannot handle file operations in the same way; you can imagine the user-space API is quite different from the kernel's abstraction of files.
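In the kernel this fallible style means APIs like try_new and try_push; stable user-space Rust can mimic it with try_reserve. A small sketch (the function name is made up for illustration):

```rust
use std::collections::TryReserveError;

// User-space sketch of the kernel's fallible-allocation style: instead
// of aborting on allocation failure (or inviting the OOM killer), the
// allocation returns a Result that the caller must handle.
fn read_blob_fallibly(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf: Vec<u8> = Vec::new();
    buf.try_reserve_exact(len)?; // may fail, but never aborts the process
    buf.resize(len, 0);          // safe: capacity is already reserved
    Ok(buf)
}
```

The caller gets an Err for an impossible allocation instead of a crash, which is the behavior kernel code must have everywhere.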
We also must duplicate the code for practical purposes, because the kernel has its own build system: you cannot just download a crate from crates.io, and for that matter you cannot even use the cargo build system; you need to use Kbuild. But apart from these small differences, I was able to share a large amount of code between the two implementations.

Moving on to the status of the project: we can build, extract and FUSE-mount PuzzleFS file systems. We have support for fs-verity, but, as I mentioned, it requires support from the file system underlying the data store. We also have optional zstd compression for the data blobs. In the past months I've also implemented a proof-of-concept kernel file system driver written in Rust.

Now for a quick demo. You can see here at the top I have a root file system which contains two directories and two files. If we want to build a PuzzleFS image, we specify the build command and we pass the source of our root file system, then the destination where the PuzzleFS image will be stored, and we also need to specify a tag for this image. What we get back from this build command is the fs-verity digest of the PuzzleFS image, and we will use this in our subsequent commands. The next step is optional: if we want, we can enable fs-verity. We need to specify the path to the PuzzleFS image, the tag, and also this hash that I was talking about. Enabling fs-verity actually sends some ioctls to the kernel, and then the files in the data store are marked read-only, so you cannot write to them. If tampering is detected, the kernel will return an error code instead of giving you the tampered data. And lastly, we use the mount command, where we can again optionally specify the fs-verity digest, and then we pass the path to the PuzzleFS image, the tag, and a mount point. If all goes well, you can see the output in the journal, where it says it is mounting /tmp/puzzle.
In the output of mount you will see that we have /tmp/puzzle, which is of type fuse. If we do specify this digest on the command line of mount, then before opening any of the files in the data store, PuzzleFS will make sure, first, that fs-verity is enabled for those files, and second, that the fs-verity measurement matches what it has in its manifest.

Now let's move on to the PuzzleFS data format. We have an index.json file, and with the help of a tag we can find the manifest for that tag. The manifest contains a list of metadata layers, and these layers contain a list of inodes. Here I wanted to show the different ways in which the data blobs can be produced by the chunking algorithm. On the left, you see inode1 and inode2: there are two files, but their contents are stored in a single large data blob. On the right-hand side, inode2 has a list of chunks that it points to; it's probably a larger file, and it needs more chunks to represent its entire data. You can also see that these are hashes, because the data blobs are content-addressed.

For metadata serialization we use Cap'n Proto, a serialization protocol. It has many advantages. Basically, the on-disk format is the same as the in-memory format, so you can just memory-map the entire file; you then use accessor methods to get to the fields, because it has a custom representation in memory. As you've seen in the previous slide, we have two levels of indirection. The image manifest contains the list of metadata layers and also the fs-verity data for the blobs. Each metadata layer contains, first of all, the metadata for the files and directories, but also those links you've just seen to the data blobs. And as I have already mentioned, we store them content-addressed.
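Paraphrasing that structure, a simplified Cap'n Proto schema for the two levels might look like this (names and fields are illustrative, not the exact PuzzleFS schema):

```capnp
# Sketch of the two levels of indirection described above; the actual
# PuzzleFS schema has more fields.
struct BlobRef {
  digest @0 :Data;   # content address (SHA-256) of a blob
}

struct Rootfs {      # root struct of the image manifest
  metadatas       @0 :List(BlobRef);  # links to the metadata layers
  fsVerityData    @1 :Data;           # fs-verity digests for the blobs
  manifestVersion @2 :UInt64;
}

struct InodeVector { # root struct of a metadata layer
  inodes @0 :List(Inode);             # (Inode struct omitted here)
}
```

The real schema lives in the puzzlefs repository mentioned at the end of the talk.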
Here is the Cap'n Proto schema: the manifest on the left and the metadata on the right. In bold I have highlighted the root structure for each file; this is the structure you use when you want to decode the Cap'n Proto file. As you see, the root file system contains a list of metadata layers, the fs-verity data, and also the manifest version. On the right we have the inode vector, which is just a list of inodes. One thing that Cap'n Proto gives us is a compact inode representation: this entire file actually represents only two inodes. Cap'n Proto splits a structure into a data part and a pointer part, so if you have nested structures, the outer structure will be completely self-contained, the nested structures will be stored elsewhere, and you will have a pointer from the outer structure to the nested structure. Here, in red, you can see the inode number of each inode, and in white I have shown the pointer part for both of these inodes. Why is this important? Because we want to fit as many inodes as possible in the cache, and this is because we want a very fast binary search when we are looking up an inode in the file system.

Now let's look at the results. What I did was download 10 versions of the Ubuntu Jammy image from Docker Hub and put these versions into separate directories, so separate OCI repositories for each of the versions. It is also important that all these images have only one layer, in tar.gz format, so you will not get any inter-layer sharing in this case. Then I converted these images to PuzzleFS images and made some calculations. I wanted to know how much space all these versions take up if they are stored simply as tar files, without any compression; this is the baseline I will use for comparing the space savings. Then I took all these 10 OCI repositories and simply summed their sizes.
This gives us the total size, and I did this for both the OCI and the PuzzleFS images. What is really important is the space saving when all 10 versions are stored in a single OCI repository or a single PuzzleFS repository. I call this the unified size, because this is where we get all the data sharing with PuzzleFS. For computing the saved space, I took the tarball total size and subtracted the unified size from it.

In this table, at the top we have the OCI format in the uncompressed case, which you will not see very often; usually compression is also applied. We have a total size of 766 megabytes, which is just the sum of the sizes of the tarballs for each version, an average layer size of 77 megabytes, and a unified size which is also 766 megabytes. This is because with tar you don't get any sharing: even if you put all the versions in a single OCI repository, it will still be 10 different tarballs, each with its own unique hash. Then I did the same thing for PuzzleFS in the uncompressed case. We have a similar total size and average layer size, but the unified size is much smaller, only 130 megabytes, and the saved space is 83%. For the compressed case of OCI, we have a much smaller total size, only 282 megabytes, with an average layer size of 28 megabytes. The unified size is still the same as the total size, because we just applied compression, we didn't do anything special, and the saved space is 63%. So even in the uncompressed case, PuzzleFS still beats compressed OCI by around 20% in space savings. If we apply compression with PuzzleFS, we get an even smaller size: the unified size is only 53 megabytes and we have 93% space savings, 30% more than in the OCI case.

Now let's move on to the kernel file system driver. We at Cisco want to have PuzzleFS in the upstream kernel.
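Going back to the table for a moment, the saved-space percentages follow directly from the quoted sizes, with the 766 MB sum of the uncompressed tarballs as the baseline for every row:

```rust
// Double-checking the saved-space arithmetic from the results table:
// savings = (baseline - unified) / baseline.
fn saved_percent(baseline_mb: f64, unified_mb: f64) -> f64 {
    (baseline_mb - unified_mb) / baseline_mb * 100.0
}
```

For example, saved_percent(766.0, 130.0) is about 83, matching the uncompressed PuzzleFS row.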
We posted an RFC to the Linux kernel mailing list. This driver is written in Rust, and there are actually two versions of it; both are based on Wedson Almeida Filho's work. He originally had a set of file system abstractions, and more recently he made a set of abstractions only for read-only file systems. Neither of these is upstream yet. I've implemented the two versions on top of his abstractions. He is also working on tarfs, which is why he needs these read-only file system abstractions. What we need to do is add third-party crates to the Linux kernel: one for the metadata serialization and one to deal with the hex strings that appear in the digests.

There are some challenges in writing file system drivers in Rust. First of all, many Rust abstractions are missing, because the infrastructure is immature and still under development. If you want to integrate third-party crates, they first require no_std support, and we can only use fallible allocation APIs: try_new and dealing with the potential error instead of new, try_push instead of push, and so on. I've actually been to the Rust for Linux workshop this weekend, which happened in Asturias, and we had interesting discussions there about file systems.

Now for a quick demo of the file system driver. We can see in /proc/filesystems that we have a puzzlefs entry. If we want to mount it, we have to specify two mount options. The first is the OCI root directory; this is the path to the PuzzleFS image. Then we also have to specify the image manifest, which is this long digest. We have to do this because we don't want to read JSON files from inside the kernel, so we cannot simply use a tag and then get from the tag to the image manifest. You can see here where I got this digest from: it's in this index.json file. Once it is mounted, we can list the directory entries.
You can see I have four directories and two files, so we can read the files, count the number of words in the files, and so on.

Finally, about the Cap'n Proto Rust kernel integration: Cap'n Proto is an external crate, and I wanted to integrate it into the kernel, so I had to do some work to implement full no-alloc support. There was some existing work, but it was not finalized. The existing code used strings in error types, and we don't have String support in the kernel, so I had to replace those with enums. I also had to implement some other structures, because it was basically easier to implement full no-alloc support than to try to replace the infallible allocation API with the fallible allocation API. So I introduced NoAllocBufferSegments, a version of BufferSegments suitable for no-alloc environments. The reason for this is that I wanted to avoid parsing the Cap'n Proto message every time a field is accessed, so I needed a way to store the reader somewhere.

And finally, if you have any questions, I'm happy to answer them. You can find the project on GitHub under project-machine/puzzlefs, and you can contact me at amiculas@cisco.com. Thank you.