Hi, my name is Xiang. I'm from Alibaba Cloud, who sponsored my travel, so I need to add the logo. EROFS is basically a block-based read-only file system, and its format is very simple. Why did we introduce it? Because the previous file systems have many limitations. For read-only use cases we mostly want a more compact format, but SquashFS was really the only choice before we introduced EROFS: the other file systems have limits on image size or don't support compression. Our internal use cases are mainly aimed at performance. We once tried SquashFS, but it eventually failed for us, so we wanted to build an efficient block-based file system.

EROFS stores block-aligned data, which is page-cache-friendly: we don't need an extra memory copy to fix data up on the fly, and with block-aligned data we can do direct I/O through the filesystem stack. Our metadata, including inodes and directory blocks, doesn't cross block boundaries. That makes it somewhat different from SquashFS. SquashFS has a block-size concept as well, but its block size is a compression unit: the uncompressed size equals the block size, while the compressed size is variable, down to byte granularity, so it is not block-aligned. Also, our directory data is random-access-friendly; the other file systems keep directories so simple that they can only do a linear search, which stops being effective once the image gets large enough. That is the basic EROFS concept.

Another goal is to provide a single file-system replacement for traditional cpio and tar. I've seen use cases that try to pass a tarball through directly to a guest: the tarball is addressed by its SHA digest, so they don't want to modify it, and they introduce a new in-kernel tarfs for it. I think in such use cases we can use EROFS instead: we keep the original tar data as-is and build EROFS metadata alongside that points into it, so the original tar data is never modified. We have a proof of concept of this: a benchmark of directly mounting EROFS over the tar after fetching it from the remote, versus unpacking the tar onto, say, ext4, and the whole startup is sped up. So this might be a way to pass container images through to guests, at least. But there are still problems with such use cases, for example page cache sharing, which probably needs to be discussed.

This reminds me of the btrfs/ext4 snapshot work. We should probably talk about this later, because it's probably going to be a very long discussion. I agree that container images have these problems. These are very, very big problems. The thing is that when it comes to page cache sharing and all these other problems, my view is that a lot of that, well, not all of it, that's too grand, a lot of it could be solved if we didn't have tar archives in here.
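[Editor's aside: to make the random-access-friendly directories mentioned above concrete, here is a minimal C sketch of a per-block binary search, assuming an EROFS-like layout where each directory block starts with a name-sorted array of fixed-size entries whose names are packed later in the same block. The struct and field names are illustrative, not the exact on-disk definitions.]

```c
/* Minimal sketch of one-block directory lookup, assuming an
 * EROFS-like layout: a sorted array of 12-byte entries at the
 * start of each block, names packed behind them, and nothing
 * crossing the block boundary. */
#include <stdint.h>
#include <string.h>

#define BLKSZ 4096

struct dirent_ondisk {
	uint64_t nid;        /* inode number of this entry */
	uint16_t nameoff;    /* name's byte offset inside the block */
	uint8_t  file_type;
	uint8_t  reserved;
} __attribute__((packed));

/* Binary-search one directory block for `name`; returns the inode
 * number, or 0 if the name is not in this block. */
static uint64_t dir_lookup_block(const uint8_t *blk,
				 const char *name, size_t namelen)
{
	const struct dirent_ondisk *de = (const void *)blk;
	/* entry count is implied by where the first name begins */
	int nr = de[0].nameoff / sizeof(*de);
	int lo = 0, hi = nr - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		const char *s = (const char *)blk + de[mid].nameoff;
		size_t end = mid + 1 < nr ? de[mid + 1].nameoff : BLKSZ;
		/* the last name may be NUL-padded to the block end */
		size_t len = strnlen(s, end - de[mid].nameoff);
		int cmp = memcmp(name, s, namelen < len ? namelen : len);

		if (!cmp)
			cmp = (int)namelen - (int)len;
		if (!cmp)
			return de[mid].nid;
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return 0;
}
```

[Because no entry or name crosses a block boundary, each block can be searched independently straight out of the page cache, which is the point of the block-aligned design.]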
If we had another format with per-file granularity, where shared files are the same inode on disk, then the kernel knows it's the exact same file, and we don't have to worry about which file system we're mounting. That's my view. I agree that if we change nothing about the image format and just work around its problems, this is probably better than what we're doing at the moment. The thing, though, is that in my view we need to change the image format. I've seen so many talks about people trying to figure out ways to optimize building images because things are based on tar, and in my mind this is insanity. We need to fix the format so that people are not spending hundreds of human years of life working around tar archives. My blood boils when I have to think about this, because it's so much wasted energy. My point is that we could definitely figure out better ways of doing it, but we should have a discussion later, because I'm sure you have lots of really good ideas about how to solve it. But yeah, sorry.

Right, but if it wasn't a tar archive, you would be able to have container images that actually share things. If you have an Ubuntu image that's 1.6 gigs in one layer, that's one thing. But if you look at OpenJDK, for instance, you have to share stuff between images, which you can't do with the layer format. If you have the same copy of bash in every single container image, and there are 5,000 container images on the system, you want to share the exact same inode across all of them. You can't do that with a tar-based system. You need to rethink it, that's my view. There is other sharing you can do, and this improves that, but if we could take ideas from this and come up with a better format, we could have something that's way, way better. Sorry, I didn't mean to derail it, you can keep going.

Yeah, I don't want to change the tar format anyway. Because it's already widely deployed, I think we need a compatible solution, at least.

That's my opinion, but I'm one of the maintainers of the spec, so we can change it. And if we take these graphs and point people at them and say we need to fix this, I think we have a very convincing argument.

But that way we have another problem with page cache sharing: if we want to mount two different EROFS images containing the same file, the different file system instances need to be aware of each other to share the page cache.

Okay, so I'll let you keep going, sorry.

Okay. So finally, we might need a generic page cache sharing approach. I asked about this previously, but I didn't get an answer, or at least a path forward. I have a hack to share page cache at the file level, but this is not enough if we want deduplication at finer granularity; EROFS supports chunk-based deduplication. The typical use case is together with Nydus. Nydus is Alibaba's project, initially not related to EROFS.
Compression is optional, and EROFS supports both LZ4 and LZMA, but there is a difference from other file systems: EROFS uses fixed-sized output compression. Why do we need that? Because the previous approaches, such as SquashFS, are not very efficient at compressing small blocks such as 4K. And why do we want small blocks? Because to serve an I/O you first have to load the compressed data: you allocate memory for the compressed data and then decompress it into the page cache, so memory use is amplified. For performance-sensitive use cases we need smaller compressed blocks, but the previous solutions are not effective there; I will explain later.

To avoid this, we also introduce in-place I/O, which avoids allocating compressed page buffers as much as possible: we reuse the page cache pages themselves to hold the compressed data first, and then decompress in place. We also support cached I/O, which caches the compressed data to save more memory, because sometimes we don't want to decompress all the data and caching the compressed form can be more effective. Since we are block-aligned, more of it is cacheable and we don't cache meaningless bytes. In-place decompression then avoids the extra copy, thanks to in-place I/O, and you can see a few percent of sequential read improvement. We also introduced rolling-hash-based global compressed data deduplication, which I'll come back to.

We also support block-based and file-based backends, because we do have use cases where a block device is not efficient; composefs may be one such use case as well. Based on the earlier incremental-fs discussion, we found that there is an in-kernel cache management framework called fscache, and we would like to reuse it to cache our images and do lazy loading, also called on-demand download, and maybe more use cases, such as fetching only some data through fscache. That is ongoing work.

On the compression side, the indexing story is different. For uncompressed files, each extent can be huge and the index is sparse. Compressed extents are dense: in principle you have to record a lot of information per extent. I looked at btrfs, for example, which records a lot in its extent index; but if the index is too big, read performance suffers, so a read-only file system should record minimal information. There are two ways to do it. One is fixed-sized input compression, as in SquashFS: you store a starting physical offset and then a list of compressed block sizes, one by one. The decompressed size of each block is fixed, say 4K or more, but to get any block's physical offset you have to sum up all the sizes before it; if you want to read a very large file, that's a lot of summing, so they introduce a cache to mitigate the problem. On the other hand, that index is quite small, so I think its main purpose is to decrease the image size. The other way is what we do: fixed-sized output compression.
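[Editor's aside: to make the fixed-sized output idea concrete, here is a minimal C sketch using liblz4's real LZ4_compress_destSize(), which consumes as much input as will compress into a fixed-size destination buffer, i.e. "fill one 4 KiB block". The helper name and the bookkeeping are illustrative, not mkfs.erofs code.]

```c
/* Minimal sketch: fixed-sized output packing with liblz4's
 * LZ4_compress_destSize(). Compile and link with -llz4. */
#include <stdio.h>
#include <lz4.h>

#define BLKSZ 4096

/* Pack `srclen` bytes of `src` into back-to-back 4 KiB physical
 * blocks: each call consumes as much input as compresses into
 * exactly one block, which is what "fixed-sized output" means. */
static void pack_fixed_output(const char *src, int srclen)
{
	char blk[BLKSZ];
	int extent = 0;

	while (srclen > 0) {
		int consumed = srclen;  /* in: available, out: eaten */
		int csize = LZ4_compress_destSize(src, blk,
						  &consumed, BLKSZ);

		if (csize <= 0 || !consumed)
			break;  /* a real tool would store this raw */
		printf("extent %d: %d raw bytes -> 1 block (%d/%d)\n",
		       extent++, consumed, csize, BLKSZ);
		src += consumed;
		srclen -= consumed;
	}
}
```

[Note the inversion versus SquashFS: the compressed unit, not the uncompressed one, has the fixed size, so compressed data stays block-aligned on disk.]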
Because we are a block-aligned file system, our compressed data exactly fills blocks, so we just store the starting physical block, the block count, and the starting logical offset. That gives us a very efficient index, and in turn higher read performance; you can use fio or something similar to benchmark it.

Our current use cases basically take three forms. The first is a full EROFS image: either an uncompressed image, for example with fsdax for guest sharing, or a compressed image for space saving as well as performance. The second is an EROFS metadata-only image plus external data, such as a tarball or another binary image. The third is EROFS plus overlayfs extended attributes, the composefs model, to refer to external files.

Here is a recent result from our global compressed data deduplication. We record block-granularity deduplication with a minimum extent length of four kilobytes, so one piece of compressed data can be referenced multiple times, matched as a prefix. I observed that this is much more friendly at least to text-like data with a small compression unit. I wanted to extend it to shared libraries, but the outcome is not good; rebuilt libraries are modified a lot, not just a simple patch, so I need to investigate more to get more ideas. What you see here are two Wikipedia database dumps with a two-day difference, using 4K compression: the deduplication saves about 5%. With just a one-day difference, it would probably be more.

I can suggest why this is. It's basically because most images are built on shared libraries, and the shared library itself represents effectively all of the stuff you can share between the binaries. Once you've done that, there's not much left in the binary data that's very compressible. You can compress it slightly, but it probably isn't worth the overhead, as long as the image is built correctly against shared libraries.

You mean the overhead, just the index overhead or something?

No, I mean once you've taken out all the shareable bits of the binaries using shared libraries, there's not much left in the binaries that will compress easily enough to give you a great data saving. They're already effectively compressed. They can compress, but you're showing they're not compressing very well.

Oh no, that is not a shared library. That is just a test file; the left side is just a test file, compressed in 4K units. Sorry, I didn't mention the original size: it's almost two gigabytes, so it already saves something.

Okay, forget I said it.

Sorry, I didn't label that line clearly. So, another thing we compared with fixed-size...

Just to clarify, this step is deduplication on the compressed blocks, right?

Yeah.

Right, so basically there's an initial deduplication, then a compression attempt, which may be successful or not depending on the data, and then you do another deduplication to see if any of the compressed blocks have the same hash?

We just keep rolling forward. But actually, the differences I observed between rebuilt shared libraries are small slices, not as large as four kilobytes, so we probably cannot use this method to deduplicate shared libraries.
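[Editor's aside: a toy C sketch of the rolling-hash matching just mentioned, in the spirit of the global deduplication described above but not mkfs.erofs's actual algorithm. It slides a 4 KiB window over the input, keeps a one-slot-per-bucket table of earlier window offsets, and only reports byte-identical matches; the constants and table shape are arbitrary.]

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WIN  4096u          /* minimum duplicated extent: 4 KiB */
#define BASE 257u           /* polynomial rolling-hash base */
#define TBL  (1u << 16)     /* one slot per bucket, toy-sized */

static struct { uint32_t h; uint64_t ofs1; } tbl[TBL]; /* ofs+1; 0 = empty */

static uint32_t hash_init(const uint8_t *p)
{
	uint32_t h = 0;

	for (uint32_t i = 0; i < WIN; i++)
		h = h * BASE + p[i];
	return h;
}

/* Slide the window one byte: drop p[0], append p[WIN].
 * `pow` is BASE^(WIN-1), precomputed once (mod 2^32). */
static uint32_t hash_roll(uint32_t h, const uint8_t *p, uint32_t pow)
{
	return (h - p[0] * pow) * BASE + p[WIN];
}

/* Report 4 KiB windows of `buf` that byte-for-byte match an earlier
 * window: the rolling hash nominates candidates, memcmp() confirms. */
static void scan(const uint8_t *buf, size_t len)
{
	uint32_t pow = 1, h;

	if (len < WIN)
		return;
	for (uint32_t i = 0; i < WIN - 1; i++)
		pow *= BASE;
	h = hash_init(buf);
	for (uint64_t i = 0; i + WIN <= len; i++) {
		uint32_t slot = h & (TBL - 1);

		if (tbl[slot].ofs1 && tbl[slot].h == h &&
		    !memcmp(buf + tbl[slot].ofs1 - 1, buf + i, WIN)) {
			printf("dup: %llu matches %llu\n",
			       (unsigned long long)i,
			       (unsigned long long)(tbl[slot].ofs1 - 1));
		} else {
			tbl[slot].h = h;
			tbl[slot].ofs1 = i + 1;
		}
		if (i + WIN < len)
			h = hash_roll(h, buf + i, pow);
	}
}
```

[This also illustrates the limitation mentioned above: a match must span a full 4 KiB window, so sub-4K "slices" of similarity between rebuilt shared libraries go undetected.]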
And another question I have is: so you are compressing fixed-size chunks?

No, no. We compress into a fixed size.

Okay, so you basically say: how much data can I compress into a 4K chunk?

Yeah.

Right. And then you move on to the next set of data and compress that into 4K too?

No, not exactly in such a straightforward way; it depends on the compression algorithm. I've been investigating this for many years, and many algorithms can be adapted to work this way.

Yeah, and you quoted LZ4 and LZMA, right?

Yeah, and I'm working on DEFLATE; that's the last one. But I should say it doesn't have to fill the block exactly: with a larger unit such as 8K or more you can leave some slack, and you can also just fall back to the traditional way.

And it wasn't quite clear to me what happens when you hit a block of uncompressible data. I assume you have a metadata bit that says this region is just not compressed, because I couldn't compress it?

Sorry?

Say you try to compress an encrypted file. It's not going to get any smaller, right?

Shared libraries can compress.

Yeah, shared libraries can compress, but you might have an encrypted file on your file system, right? And that won't compress.

That was just the comparison against the shared-library case.

Okay, we can talk about it later.

Yeah, there's more to look into. So sorry, I will go on. Another use case is AI datasets. Recently we have some AI datasets with too many files, maybe four million files in one directory. With compact inodes we get some real benefit compared with ext4, so that's quite a good use case, and quite a good benchmark for composefs as well. This is a benchmark with AI datasets; you can see the time that can be saved. If you use ext4, it takes double the time.

I have some ongoing development. One is DEFLATE decompression, because I'd like to use Intel's recent hardware accelerator, which is called IAA; to use it, adapting DEFLATE is my first step. The next step is to enhance the composefs case: I want to add a bloom filter for extended attributes, so that negative xattr lookups can be sped up. Then there is the fsverity-related stuff; I've already talked with Eric, so I will follow up by email offline. Sorry about my English; I don't have enough time for a discussion here, but this is what I wanted to discuss: if we do image-based distribution, shipping some golden image that we don't want to modify, we want some end-to-end verification, so we might need something like this, but I'm not sure how to accomplish it.

The last one is that I'd like to see overlayfs do partial copy-up using clone_file_range. If we loopback-mount an EROFS image that lives on XFS, the EROFS image is a file on XFS, so in principle we could use copy_file_range or reflink there directly. But overlayfs cannot directly use clone_file_range to do partial copy-up today, because the lower layer is EROFS, not XFS, even though the data really is on XFS. So I need to find a way: if we could punch through the loopback mount, we could do partial copy-up.
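[Editor's aside: for reference, the reflink primitive such a partial copy-up could lean on is the real FICLONERANGE ioctl, which shares a byte range between two files on the same reflink-capable file system, e.g. XFS with reflink enabled. A minimal sketch follows; the wrapper name is illustrative.]

```c
/* Minimal sketch: share one byte range between two files via the
 * FICLONERANGE ioctl (reflink). Offsets and lengths must be
 * block-aligned, except a tail that reaches the source EOF. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* struct file_clone_range, FICLONERANGE */

static int clone_partial(int src_fd, int dst_fd,
			 __u64 src_ofs, __u64 len, __u64 dst_ofs)
{
	struct file_clone_range fcr = {
		.src_fd      = src_fd,
		.src_offset  = src_ofs,
		.src_length  = len,      /* 0 would mean "to src EOF" */
		.dest_offset = dst_ofs,
	};

	/* Issued on the destination; fails with EXDEV if the two
	 * files are not on the same file system instance, which is
	 * exactly the overlayfs-over-EROFS problem discussed here. */
	if (ioctl(dst_fd, FICLONERANGE, &fcr)) {
		perror("FICLONERANGE");
		return -1;
	}
	return 0;
}
```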
There has been copy_file_range across different super blocks of the same file system type; this would be copy_file_range across different file system types. So you have a loopback-mounted EROFS image and you want to copy part of it through the loopback mount. It's actually backed by a file on an XFS file system: you could punch through it, find the original location on XFS, and then do the copy-up.

Right, the problem is that you don't know what you're on top of.

Could that be exposed to the upper layers in a safe way?

Yeah, but that might be tricky. Anyway, that might be tricky, but we'll eventually want some way to do partial copy-up anyway.

Maybe I'm missing something obvious, but can't you just return an error on clone_file_range or copy_file_range?

Yes, but then it will copy the whole file, wasting space and time.

Okay, I was thinking of the range; clone_file_range doesn't copy the whole file. I may be missing something.

You can use reflink. With clone_file_range you could just reflink the data and do a partial copy-up. It seems safe, I don't know.

Yeah, that is safe.

So it sounds like what you want to do is have overlayfs try a clone range between the two layers, just pass it down to the file system, and EROFS passes it down again and says: actually, use this inode.

Yeah.

There's also the use case where the storage knows the context, so you can tell one server to copy to the other without transferring any data. So basically, there is currently no support for copy_file_range between different file system types. If we allowed it, we could copy from EROFS to another file system. It's not trivial to define the semantics, but suppose EROFS deals with this: it gets a request to copy from EROFS to something else, and it knows what to do. So it can be done, but it's not trivial to define the semantics for the VFS API.

No, no. When we allowed copies between different super blocks, it was restricted to the same file system type.

It's actually not the same file system type, because I think CIFS and NFS have different file system types; I think the check is actually for the same copy_file_range method pointer.

Right, yes. Anyway, that's the tricky part, but we need EROFS partial copy-up. Some other way might be okay, but currently I don't know.

Do you just need partial copy-up, or does it need to be a reflink? Partial copy-up is useful, and it can be done between any two file systems. If that's the feature you're talking about needing from EROFS, then it's a generally useful feature; it just needs to be done. There is no overlayfs format for recording a partial copy-up today, but you don't need any support from the file system, or from reflink, for that.

Yeah, I know reflink doesn't work for files on non-reflink file systems. But we have some workloads where such files are quite huge, gigabyte files, so we really need this.

Yeah, I think it's quite a common workload, and overlayfs has a problem in this workload, but it can be solved. I mean, it doesn't need reflink to solve that.
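[Editor's aside: to close the loop on "it doesn't need reflink": a minimal sketch of a byte-granular partial copy using plain copy_file_range(2), which copies, or server-side offloads where supported, only the range being touched. The wrapper is illustrative, and as the discussion above notes, cross-file-system behavior of copy_file_range has varied between kernel versions.]

```c
/* Minimal sketch: partial copy-up without reflink, copying only
 * the range being modified with copy_file_range(2). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int copy_partial(int src_fd, int dst_fd, loff_t ofs, size_t len)
{
	loff_t in = ofs, out = ofs;   /* copy the range at the same offset */

	while (len > 0) {
		/* may copy less than asked, so loop until done */
		ssize_t n = copy_file_range(src_fd, &in,
					    dst_fd, &out, len, 0);

		if (n <= 0) {
			perror("copy_file_range");
			return -1;
		}
		len -= (size_t)n;
	}
	return 0;
}
```

[Unlike the FICLONERANGE sketch earlier, this actually moves the bytes, but it avoids the whole-file copy-up cost for the gigabyte-file workloads mentioned above.]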