Let me just start by describing the use case we're coming from. At Red Hat we use image-based operating systems, but they're not necessarily your typical A/B physical block-device images; they're more like virtual images. We use a system called OSTree, but really it could be any similar system. So we have a content-addressed store, which is the OSTree repository. It contains a bunch of files: all the files in the image, or all the files in all your images if you have multiple images. When you want to use one, you check out a branch of it by creating a directory structure with hard links back into this content-addressed store. And if you want to use that to boot your system, the initrd finds this directory, bind-mounts it as the root, and runs from that. It's also something we want to use for container images and other kinds of images, because this whole scheme has certain advantages. It's very flexible. It's easy to do an update: you only need to add the new files and do a new checkout. And you can keep any number of images in one content-addressed store; it's not just A and B, you can have however many you want.

So this is working very well for us, except one thing is missing. We want some kind of tamper-proofing, the equivalent of what dm-verity would give you in, say, an Android system or some kind of embedded system. There are multiple reasons for this. The most common one is security: you want trusted boot, so you know what you are allowed to boot. But I work on the automotive side of the company, and there safety is also very important. If electrical interference or a cosmic ray or something flips a bit, you don't want the matrices in your self-driving car using data that has been modified.

There is this thing called fs-verity, and it actually matches this content-addressed storage model very well. You can apply it to all the files there, because they are fundamentally immutable; they're stored by their content checksum, basically. The problem is that fs-verity doesn't go far enough: it doesn't protect file metadata, and it doesn't protect the directory structure or anything like that. That's why we posted this proposal for composefs. I mean, it's been around for a bit, but we posted it in January, I believe. It's kind of a combination of a read-only image file system, like SquashFS or EROFS, and overlayfs: the image contains just the metadata for the directory structure, and for every regular file it only records a backing file name. You mount it with a lower directory, similar to how you mount overlayfs, and when you want to read a file, it opens the backing file in that lower directory. If you assume that all the backing files have fs-verity enabled, then, since we build the image at one point and know what all the files should be, we can take their expected fs-verity digests and encode them in the image. Then we enable fs-verity on the image itself, the file system image, and when you mount it you specify the expected digest. If that matches what you got, you have a Merkle tree of hashes that recursively verifies everything that goes into this.

However, during the sometimes heated discussions about this, it turned out that there are actually features in overlayfs that sort of make this possible already.
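As a rough, illustrative aside (not part of the talk, and not the actual composefs tooling): the per-file fs-verity digests mentioned above come from the kernel's fs-verity ioctls. A minimal sketch of enabling fs-verity on one backing object and reading back the digest that an image builder could then record might look like this; the file path is whatever you pass in, and error handling is deliberately minimal.

```c
/*
 * Minimal sketch: enable fs-verity on an immutable backing file and print
 * the digest that a composefs-style image could embed for that file.
 * Uses the kernel ioctls from <linux/fsverity.h>.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fsverity.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
		return 1;
	}

	/* fs-verity requires the file to be opened read-only. */
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct fsverity_enable_arg enable = {
		.version = 1,
		.hash_algorithm = FS_VERITY_HASH_ALG_SHA256,
		.block_size = 4096,
	};
	/* EEXIST just means verity was already enabled on this object. */
	if (ioctl(fd, FS_IOC_ENABLE_VERITY, &enable) < 0 && errno != EEXIST) {
		perror("FS_IOC_ENABLE_VERITY");
		return 1;
	}

	/* Read back the digest; this is what the image would record per file. */
	struct fsverity_digest *d = malloc(sizeof(*d) + FS_VERITY_MAX_DIGEST_SIZE);
	d->digest_size = FS_VERITY_MAX_DIGEST_SIZE;
	if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) < 0) {
		perror("FS_IOC_MEASURE_VERITY");
		return 1;
	}
	for (int i = 0; i < d->digest_size; i++)
		printf("%02x", d->digest[i]);
	printf("\n");
	return 0;
}
```

Enabling verity requires a file system that supports fs-verity (ext4, f2fs, btrfs) and a file with no open writers, which fits the immutable content-addressed objects described above.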
This metacopy extended attribute allows you to have a two-layer system where you split the metadata and the file contents across different layers. And there's also a redirect attribute, so the same file can have different names in the different layers. So the theory goes that you have two lower directories: the lowermost one is your content-addressed store, and the upper lower one, or whatever you want to call it, is a loopback-mounted, read-only image that carries these extended attributes pointing into the lower. The union of these creates the same effect as composefs.

There are some parts missing here. For the tamper-proofing we need a verity attribute. It's basically similar to the other attributes: you set it, and then, as soon as you open a file, overlayfs will verify that, yes, the backing file actually has the expected digest. Additionally, there's a mount option that says we require every metacopy file to have a specified digest; otherwise we fail, because we really want to trust that everything is verified.

Then there were some measurements of dubious quality, where we measured recursive listings. It turns out that some of the performance loss is basically due to the lookup amplification that happens in overlayfs: every time we look something up in the upper layer, we look it up in all the layers. In the composefs case we never looked up the backing file unless we actually wanted to open it, but overlayfs always does it, for no real reason. So, I mean, you posted some patches where we rectify this with a lazy lookup: if there's a redirect, we can delay looking it up until we actually need the file. Additionally, it's an overlay, actually a union file system, and we don't really need the content-addressed store to show up in the unioned result. If you control the layout of the content-addressed store you can use things like whiteouts and whatnot to hide it, but I added an easier way, the double-colon syntax that separates the layers, so only the uppermost layers are actually visible in the union file system and the lower ones are only used for resolving data. And in the composefs userspace code we have code that generates images matching this, in addition to the regular composefs format. So mkcomposefs can generate an EROFS file, and there's a composefs mount helper, so you can easily mount them. It's very simple to use: you point mkcomposefs at a directory, it generates a content-addressed store plus the image, and then you mount the image.

So I guess the basic question in this talk is that we wanted to resolve which approach we want to go with for upstreaming this. And this is where the talk gets kind of short, because I think most people agree that leaning towards using the existing code in overlayfs is the right thing to do, because it's one less piece of code to maintain. Also, the added stuff, the digest xattr and so on, is kind of useful on its own: you might want to ensure, in general, that your lower data wasn't modified externally, even if you're not doing this kind of tamper-proofing. There are some cons, but I think they're pretty minor. I don't really like loopback mounts: they're global, they don't namespace well.
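As an illustrative aside, here is a hedged sketch of roughly what the overlayfs-based mount described above could look like from userspace. The paths are made up; metacopy=on, redirect_dir=on, and the double-colon data-only lowerdir syntax are existing overlayfs options, while the verity option is the one proposed in this discussion, so treat its exact spelling here (verity=require) as an assumption.

```c
/*
 * Rough sketch of what a mount helper for the overlayfs-based approach
 * could do: mount a read-only overlay whose metadata comes from the
 * loop-mounted EROFS image and whose file data is resolved from the
 * content-addressed object store via metacopy/redirect. The "::" separator
 * marks /var/lib/objects as a data-only layer that does not appear in the
 * merged tree. "verity=require" stands in for the proposed option and may
 * not match the final name.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	const char *opts =
		"metacopy=on,redirect_dir=on,verity=require,"
		"lowerdir=/run/composefs/image::/var/lib/objects";

	/* /run/composefs/image is the loop-mounted EROFS metadata tree. */
	if (mount("overlay", "/mnt/root", "overlay", MS_RDONLY, opts) < 0) {
		perror("mount overlay");
		return 1;
	}
	return 0;
}
```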
And also, while the EROFS file system on its own performs quite similarly to the composefs one, the combination of the overlay and the EROFS does involve double the number of inodes and dentries and whatnot, so there's some overhead there. But when we measured it, although it's slower than the pure-image case, it's similar in performance to the recursive checkout with hard links, so it's not really a net loss compared to that anyway; it's probably fine.

So I guess the first question here is: does this make sense? Does anyone think we should go with the custom file system?

If you ask a room full of kernel developers whether to push in another file system, the answer is almost always going to be no. Yeah. Especially given the fact that you can accomplish the same thing with modifications to an existing file system. I understand what you're trying to get at, and there are definitely cons; I don't want to say that overlayfs is, of course, the right answer. But it appears to me that you've made our argument for us.

Yeah, and I agree. I mean, I would prefer not to maintain a kernel file system; I have enough code to maintain. Now, there are people here looking at fixing this problem, but if you want to run it, for instance, in a container, you have to have a loop device in there, and anything that can use loop devices can see all the loop devices on the entire system.

I had patches for this, namespacing loop devices properly. I sent them out, and it works fine, even including devtmpfs namespacing. And there are iSCSI people who are bugging me about this as well, so we will hopefully have namespaced loopback devices not too far in the future; I just need to generalize that work, and that should hopefully resolve it. It's really just a matter of time. Seth did the initial work for this years ago.

Yeah, I saw there was something like that, from you, I think. Yeah, I did it. Like loopback FS? Yeah, loopfs. And the reason I did it that way is that I could preserve backwards compatibility. The devices are global, and the biggest problem is that they show up in sysfs; they have sysfs entries. I didn't want to do a half-assed namespacing where I namespace the device nodes but not the sysfs entries. But if you namespace it without a file system, globally, the problem becomes that I need to start hiding the loopback device sysfs entries in containers, while currently they appear for all loopback devices on the system, so technically this would be a regression. The same applies to devtmpfs: all loopback devices created on the system also show up in devtmpfs. So my way around this was to make a tiny file system that you can mount. Each loopfs instance you mount is a separate world. You have a control device, like /dev/loop-control, from which you can allocate new loop device nodes; they get created only in this instance and show up in sysfs only in this instance. So you get namespaced loopback devices without backwards-compatibility issues. Christoph didn't like this approach. I understand that the easy solution is usually to make it a little file system thingy, but I think we don't need that: we can implement it in a way where they don't show up in the file system at all. Then it would just be file-descriptor based, which is probably a bit less nice, but it would work for these use cases; you would mostly interact with them programmatically rather than through the file system.
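To make the "loop devices are global" complaint concrete, here is a minimal sketch of how a loop device is set up today via /dev/loop-control: the allocated /dev/loopN and its sysfs entry are visible system-wide, which is exactly what loopfs or an fd-based API would avoid. The lack of cleanup and error handling is for brevity only.

```c
/*
 * Minimal sketch of today's loop device setup, to show why it namespaces
 * poorly: /dev/loop-control hands out numbers from a global pool, and the
 * resulting /dev/loopN appears in devtmpfs and sysfs for the whole system.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
		return 1;
	}

	int ctrl = open("/dev/loop-control", O_RDWR);
	if (ctrl < 0) { perror("open /dev/loop-control"); return 1; }

	/* Ask for a free slot; the number is allocated globally. */
	int nr = ioctl(ctrl, LOOP_CTL_GET_FREE);
	if (nr < 0) { perror("LOOP_CTL_GET_FREE"); return 1; }

	char dev[64];
	snprintf(dev, sizeof(dev), "/dev/loop%d", nr);

	int loopfd = open(dev, O_RDWR);
	int imgfd = open(argv[1], O_RDONLY);
	if (loopfd < 0 || imgfd < 0) { perror("open"); return 1; }

	/* Attach the backing file; the device is now visible everywhere. */
	if (ioctl(loopfd, LOOP_SET_FD, imgfd) < 0) {
		perror("LOOP_SET_FD");
		return 1;
	}
	printf("attached %s -> %s\n", argv[1], dev);
	return 0;
}
```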
I think that should be okay. So I wonder, what's the benefit of actually having this accessible as a loopfs, with multiple instances of it? Because I'd need the assumption that everybody uses user namespaces, and the loopback device alone doesn't give you much, because you can't mount anything off it. Now, my talk later is supposed to address some part of this, but at least in your use case it's not going to help you much, because you generate these upper layers locally, so you cannot sign them ahead of time, and then there's no way to authenticate that whatever you did there is actually trustworthy.

I think that's quite true. For my primary use case of OSTree we will be able to sign them, even though we generate them locally, because they are fully reproducible from the OSTree metadata. So we locally generate a 100% reproducible thing, and the OSTree commit carries the signature that you supply. But for other approaches, like container images and whatnot that you just pull from a registry, you would have problems.

Yeah, I was actually about to mention container images. For OCI container images at the moment it's all quite, what is a nice word to use, problematic. One of the things is a discussion I've had for a long time, which is that we want to fix this problem. I haven't followed exactly what the composefs system looks like, but the overlayfs approach described here would actually match pretty well what we would want for container images, because, like OSTree, we would want the structure and everything to be transparent, so that the content-addressed store actually contains the files, with some caveats about deduplication and stuff like that, but let's move on. With all of that, you would be able to reproducibly generate these hashes and everything, and apply them. That sounds like something we would actually be able to use. And finally, you could argue that signed container images become actually useful once we have this. So, at least from my perspective, the overlayfs approach seems better because it's more modular; we can actually set this up the way we want to set it up, is my impression. We wouldn't want to ship one giant blob that internally contains all this stuff. We would want it to be more transparent, because then we could deduplicate across images, and deduplicate a bunch of other things, which you couldn't do if it was just one thing that you mount and can't look inside. That's from that side of things.

Yeah, and I'll bring up one more thing here. This isn't actually something that composefs, the kernel piece plus the userspace, solves by itself; it's just another feature we're interested in in this space. We would want to have the same kind of solution also for non-root things, like rootless podman: I just want to run some random image from Docker Hub or something, something we don't really trust, maybe, but we still want to be able to run it as a user. And that's more complicated, because everything needs to... I don't know if you can hear me. I started working on a proof of concept for the container use case.
So currently there is an extension for OCI layers, the stargz-style efforts, which basically add some metadata to the tarball so you can index single files inside it. The idea I had with composefs is that we would replace that JSON table of contents with a composefs blob itself, so it would serve as the metadata for the tarball and, at the same time, be mountable by composefs directly.

Yeah, so the thing is, the stargz stuff, and I know the authors are around and everything, so yeah, it is very good, and it's very good in that it fixes many issues we have with the current format. The thing is, my view is that there is a better way of solving the container image problem, which this talk is not about, so I'll stop derailing it. But basically, I think that if we redesigned it so that we didn't use tarballs for everything, that would actually solve basically all of our problems. And I don't think that replacing one format that you cannot easily look inside from a random Go binary that only understands JSON and nothing else with another binary format, even a better one, is the best way of solving it. There are better ways, I think, that we can solve this problem.

Yeah, but the current approach will work with all the clients and registries that are out there. In my opinion that's its main benefit: it will work with existing clients. I think we need to add support for mounting, though; it wouldn't be transparent, you would need work on the clients anyway. Well, existing clients would just use it as a normal kind of layer, as they do right now, instead of having to learn the new thing, yeah. Yeah, but you can't mount the tar.gz, that's the issue. So you need to unpack this thing, and then you're in the same problem you always were in, which is that tar archives are not reproducible at all, and when you extract them you don't even know what you'll get. And then when you add file systems on top, it's the whole nine yards. Well, the stargz table of contents helps a bit, right? Oh no, it is better; I'm not saying it's not better, it definitely is better. I just think there's an approach we could take that would be even better than that, and I think we could then use the overlayfs stuff in the way that suits container images, and you could also use it for OSTree, or for whatever else.

Yeah, and I have other people requesting just random things. They have CAS-style directories for, you know, builds: they build their stuff, it generates a commit in some kind of content-addressed store, and they want to reproduce that. So there are many other similar use cases. So here are listed some possible solutions for the non-root case. I mean, your mount service work, having a trusted service that mounts these things for you, doesn't really work for an arbitrary image that I got from Docker Hub, but it works for a trusted image that is signed somehow, or that you otherwise trust. I also tried to do this using your EROFS FUSE implementation. It turns out that if you have to enable user xattrs, that's not compatible with metacopy. I'm not sure exactly why. I can sort of see that it might open a possibility to access files from the lower that you shouldn't, like if you have access to the upper there, you can create a metacopy. But for some reason it's just not possible, and we don't quite know why.
I do believe that in this particular use case, where we have a read-only lower, it shouldn't be a problem: there's no writable upper and there are no writable lowers. It should be safe here, but making it possible might be hard. Another approach is to just use FUSE and reimplement the redirects and everything ourselves. That will perform poorly unless we get the FUSE passthrough work that was discussed yesterday. So all of these things are kind of tied together, I guess.

Yeah, I think for me, generic Docker Hub images are untrusted, as it says on the tin, and you should not be able to mount them from inside a container, at least not anything with an on-disk format. So the mounting-into-namespaces work that he's going to talk about is an initial implementation of how we can do this today, but as we already discussed yesterday, ideally we'd extend the new mount API so that you can register yourself as the, let's say, mount manager for a particular namespace, in a similar way to how autofs works: you say, I want to supervise the mounts, or just a subset of the mounts, of this particular mount namespace. Then the mount manager, which in this case would probably be systemd-mountfsd or whatever, would get a notification when a mount is requested from inside the container. The outstanding pieces that are missing, which David pointed out yesterday, are a way to inspect the requested mount options on the filesystem context, for example, and then to be able to allow or deny based on that. That extension would make the work Lennart is going to present much nicer, because currently it works fine, it's just more involved; this would be a more direct, elegant way of supervising mounts. (A rough sketch of the mount-API flow such a manager would sit around appears at the end of this section.)

I'm not really sure any chain of trust can protect you from... delegate the trust to whom? I'm not sure I follow, sorry. It's like you have a privileged process on the host, systemd for example, which has an idea of which images it thinks are safe. And obviously trust needs, first and foremost, to be established through dm-verity in this case, or other mechanisms where you can ensure the integrity of the data. How does that help your case? So it doesn't cover the random-image-from-Docker-Hub use case, but it does cover this: you have a set of images that you produce yourself, that are signed by Microsoft basically, and you want the user to be able to run them as that user, and it works fine for that. And it could work with composefs that way. I'm not arguing that I can provide a way to let users mount random crap; I don't think that's ever going to be the case.

Is there any other disadvantage of the EROFS FUSE approach, apart from that technical issue? Well, there's also performance, I guess, but I think... So one thing, part of the stuff I'm presenting, is that we also want to provide an API for unprivileged containers and the like to create their own file systems that they can then mount, and an extension of that could be that you just implement your own little service that runs on the host that... no, no, no. You would run it on the host, at the host level: a tiny service that simply provides this for containers, and you mount its socket into them. No, it does help, because it could sign things locally, with dm-verity or, it doesn't really matter exactly, fs-verity if that's what it needs to be.
But basically what I'm saying is that you could have a local service that generates locally signed images for you, if you give it input that can be validated as somewhat reasonable, right? So that, once it actually gets passed to the kernel, it's known to be in a good state. The essential problem, the reason we're not letting users mount file systems, is that there's no means to verify that the image, the binary, is safe. There's no way to verify it; we'd need a BPF-style verifier for images. No, again, what I'm saying, and this is basically the stuff I'm going to talk about and am partly talking about right now, is that we somehow have to establish trust. The obvious way is dm-verity, right? And you can generate the dm-verity data locally. Let's say we had a system-level daemon that runs on the host and is accessible to containers. You talk to the daemon and give it the redirection information, the list of files and their hashes. It then generates an EROFS, automatically signs it with dm-verity, and returns you that finalized image. Then you suddenly have a service that generates trusted images for you: you can't fake them, and they will work only locally, in the local context, and otherwise they won't.

Sorry, my English is not good enough, but I want to clarify something. For your use case, and I think for most use cases, the FUSE, maybe FUSE passthrough, implementation is good enough. I want to clarify why I introduced EROFS. We know that the read-only use cases are quite limited compared with generic file systems, and among current in-kernel implementations I think we only really have SquashFS; the other file systems, such as romfs and cramfs, I'm not sure how many users they still have at any scale. We originally introduced EROFS because we wanted to make use of the page cache to do in-place I/O, so that we could reuse the page cache for the compressed data first and do time sharing; I will cover this in the following talk. So that is mainly about generic compression performance, and we want to boost that generic performance. For my own part, I want more people to join, no matter which file system it is, be it romfs or cramfs or SquashFS or EROFS; maybe more developers could join us to improve one more capable file system together, so we're less fragmented. That is my own opinion on this whole topic, because from a technical point of view I think these file systems are all quite similar, particularly for your use case; just a few of them is maybe enough.

It's a good opportunity to thank Alexander and invite you to talk about EROFS. It's your turn. Thank you.

What I'd like to say about EROFS is that I think it was really introduced to improve space and memory usage, but it mainly improves the data read case, and what composefs really needs from EROFS now is metadata only. No, I'm just saying that initially you introduced EROFS to make read-only data access more efficient, and now this is a metadata-only use case for EROFS. But it's a great opportunity, because, like you said, we know that overlayfs uses xattrs a lot in this use case, and xattrs are like the bastard child of file systems: they're not optimized, many file systems do not optimize them, and they have a very large overhead, especially for this kind of use case with lots of small inodes; xattrs can mess up your performance. But this is an opportunity for another thing to optimize.
If you have a compact format for small xattrs and the inode together, that's a major win for these sorts of workloads.
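Coming back to the mount-manager discussion earlier in the session: here is a hedged sketch of the new-mount-API sequence that such a supervising service would sit around. Every option goes through fsconfig() on a filesystem context, which is where the proposed inspection and allow/deny step would hook in. This assumes glibc 2.36+ for the syscall wrappers; the overlay options and paths are illustrative only, and the verity handling is omitted.

```c
/*
 * Sketch of the new mount API flow that a trusted host service could drive
 * on behalf of a container. Each fsconfig() call is an explicit,
 * inspectable step on the filesystem context.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Create a filesystem context for overlayfs. */
	int fsfd = fsopen("overlay", FSOPEN_CLOEXEC);
	if (fsfd < 0) { perror("fsopen"); return 1; }

	/* Configure the context option by option. */
	fsconfig(fsfd, FSCONFIG_SET_STRING, "metacopy", "on", 0);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "redirect_dir", "on", 0);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir",
		 "/run/composefs/image::/var/lib/objects", 0);

	/* Instantiate the superblock and get a detached mount. */
	if (fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		perror("FSCONFIG_CMD_CREATE");
		return 1;
	}
	int mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY);
	if (mntfd < 0) { perror("fsmount"); return 1; }

	/* Attach the detached mount; here we just attach it locally. */
	if (move_mount(mntfd, "", AT_FDCWD, "/mnt/root",
		       MOVE_MOUNT_F_EMPTY_PATH) < 0) {
		perror("move_mount");
		return 1;
	}
	return 0;
}
```

The detached mount file descriptor returned by fsmount() could instead be passed over a socket to the container and attached there with move_mount(), which is essentially the delegation model discussed in the session.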