So I was actually hoping some of the XFS folks would be here, because this grew out of a discussion I was having with Darrick, where we noted that we were seeing similar challenges with how people have been using resize, especially in the cloud. Let me first look at some of the ways that resize has historically been done, which I think most of our file systems were originally designed for. You take a disk, you add it to a RAID array, say MD RAID, and then you expand the file system to take advantage of that new disk. That new disk might be a four terabyte disk, or it might be an eight terabyte disk, but you were increasing the size of the file system by a fairly large percentage, and you were starting with a fairly large file system. The other common use case was only used by a few embedded NAS projects, so we could basically pound on them until they did the right thing: many of them wanted to use a tiny file system, which they would install on a hard drive using dd, and then expand it to whatever the size of the hard drive was. So they might start with a 100 megabyte file system and then blow it up until suddenly it was 10 terabytes.

That same pattern exists in the cloud, where many cloud VMs have a minimum virtual block device size, typically around 10 gigabytes; it can vary, but it's small. And then they'll expand that all of a sudden to a very, very large size, like tens of terabytes. This is very similar to the embedded NAS case, except there are lots of cloud customers doing this, and many of them are fairly naive. So they use the default mkfs options, which historically, at least in Linux land, have used different file system parameters if you're creating the file system for a USB thumb drive as opposed to a 10 terabyte hard drive. You use a smaller journal, you might use a different inode size, any number of tuning parameters which make sense if you're actually putting it on a USB thumb drive. But if that USB thumb drive suddenly turns into a dozen terabytes worth of storage space, those file system parameters might not be particularly good for it.

And then the final thing, and this tends to be driven by economics, is that many cloud providers charge by the size of the virtual block device; it might be 10 cents per gigabyte per month or something like that. So customers are incentivized to keep their block device as small as possible. They'll start the file system at 10 gigabytes because that's the smallest size they can procure, and then as the file system gets full, they wait until the very last minute, until it's 97% full, 99% full, and then they'll add a gigabyte. And then they'll continue to fill that file system; maybe they add five gigabytes. But the point is they do that over and over again, and that tends to result in worst-case file system fragmentation, because most file systems aren't designed to run well at 99% capacity, or you really, really badly stress the garbage collection for that file system. So what are some of the solutions to these things? I'm hoping that other people will have interesting ideas, because I don't know that we actually have great solutions. For all of these, we've just identified the problems.
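[Editor's note: to make the anti-pattern concrete, here is a minimal sketch of the kind of grow-at-the-last-minute automation described above. The `cloud-cli disks resize` command is a hypothetical stand-in for a provider's real CLI; `df` and `resize2fs` are the real tools. This is the anti-pattern, not a recommendation.]

```sh
#!/bin/sh
# Naive "grow by 1 GiB at 97% full" loop, as described in the talk.
# WARNING: this is the pathological behavior, not a recommendation.
MOUNTPOINT=/var/lib/mysql
DEVICE=/dev/sdb

while sleep 60; do
    # Percentage of the file system currently in use.
    pct=$(df --output=pcent "$MOUNTPOINT" | tail -1 | tr -dc '0-9')
    if [ "$pct" -ge 97 ]; then
        # Grow the virtual block device by the smallest possible
        # increment to minimize the monthly bill...
        cloud-cli disks resize "$DEVICE" --grow 1GiB   # hypothetical CLI
        # ...then grow the file system to fill it. Doing this over and
        # over at 97%+ full is what fragments the file system.
        resize2fs "$DEVICE"
    fi
done
```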
The first is that it would actually be kind of useful if there was some standardized way to install a file system image from some sort of standard non-sparse format, which you could then inflate onto the block device, so that you don't have people who feel that the state of the art for installing an image onto a block device is dd. In the ext4 space, we've been playing with the QCOW2 format: we have a program in e2fsprogs called e2image where you can create a QCOW2 file that contains only the actually used blocks of the file system. So I could create an eight terabyte file system, install 100 megs of system files on it, and the QCOW2 image would only hold the blocks that were actually allocated, but would inflate back to the full eight terabyte size. The XFS community has been looking at xfsdump, which is sort of the XFS version of dump/restore, for a similar thing; I guess they were looking at that because xfsdump also supports the interesting file system capabilities that XFS has. But it occurred to us when Darrick and I were talking about this that it might be interesting if there was some standardized format we could all agree on. QCOW2 is a possibility. The problem with QCOW2 is that it's not particularly well documented, and the authors of QEMU have discouraged people from using it as an interchange format. So whether it's that or something else, this might be an interesting area for the file system development communities to collaborate on.

The other area is that maybe mkfs could be changed so that it sometimes or always creates file systems that are suitable for being expanded to very large sizes, even though they're being created small. This could be as simple and stupid as just changing the defaults: it doesn't matter if you're creating it on a USB thumb drive, we always create it with a very, very large journal, because it might be expanded later. That can take half of the USB thumb drive's space; you can't just change the default like that. Yeah, so we could, but it's not clear that that's the optimal solution, right? The other possibility is maybe we can get some hints from the block device about whether or not it is definitely not resizable. If we know it's a USB thumb drive, and we know the USB thumb drive isn't going to magically become a 10 terabyte device, then we use the defaults for a USB thumb drive. But if it's being created on an Azure disk storage device or a Google Cloud persistent disk, which can be expanded, we use different defaults. The only question is, how do we figure out what the defaults are? The simple, stupid answer is we could encode heuristics where we look at the name of the device: if it looks like Azure or Google Cloud, we use one set of defaults, and if it doesn't, we use a different set. Or maybe we can get some kind of hint from the block device layer saying, this is a DM device, it's an LVM device, so of course it can be expanded, or this is a real physical device that can't be expanded. At that point we're essentially pushing the problem down into the device driver, but the device drivers might actually have more of a chance of knowing that capability. So again, none of these are perfect, and if somebody has better suggestions, that would be great.
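[Editor's note: for the ext4 case, the e2image workflow mentioned above looks roughly like this. The -Q (QCOW2 output) and -a (include file data, not just metadata) flags exist in modern e2fsprogs, but check your version's man page; device names are placeholders.]

```sh
# Sketch of shipping a compact file system image and inflating it at
# install time, instead of dd'ing a full-size raw image.

# On the build machine: capture the file system as a QCOW2 image that
# stores only allocated blocks (-Q = QCOW2 output, -a = also include
# file data rather than metadata only).
e2image -Q -a /dev/sdb1 golden.qcow2

# On the target: expand the image onto the (possibly much larger)
# block device. qemu-img is one way to convert QCOW2 back to raw.
qemu-img convert -O raw golden.qcow2 /dev/vdb

# Then grow the file system to fill whatever the device size is.
resize2fs /dev/vdb
```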
And a final thing I'll note before I throw this open for suggestions: neither of these addresses the problem of customers who insist on growing the file system one gigabyte at a time because they're trying to optimize cost, not realizing that it's completely trashing their performance, at which point they're paying more in VM costs because it's taking forever to run their workload. I don't know of a good solution other than customer education; it's just that the customers vastly outnumber the file system developers, so that's not an ideal solution either. So yeah, I was actually kind of hoping some of the XFS folks were here, but the question is: do people have good ideas for any of these? Otherwise I'm just pointing out that there's a problem.

So regarding the first issue, of defaults: wouldn't people who create images use mkfs on a file rather than on a loop device, and isn't that a good enough hint? So NAS developers tend to be educatable, because there's a relatively small number of them; they notice that it's problematic, they come to the file system developers, and we tell them to use the following mkfs parameters, so they override the defaults. No, what I'm asking is: you asked whether we can change the defaults, and I said, well, there are issues, because if you're really creating a file system for a USB drive you don't want to create a big journal, but when people create images for... maybe I'm not understanding the use case. So yes, if they're creating images, we could use a different set of defaults. That doesn't solve the cloud problem. Oh, because in the cloud, you set up the disk and then... If you're using Amazon EBS or Google Persistent Disk, it looks like a SCSI device, right? Or it might look like an NVMe device, but it looks like a physical device. It's just a physical device which happens to be resizable. And so we don't necessarily have a way of detecting that other than matching on the name of the device and saying, huh, that looks like a cloud device, maybe we should use different defaults. Maybe there is some magical SCSI attribute that no one is currently setting that we could use, but I'm not aware of one where the SCSI device gives a hint that it's resizable.

Yeah, so I just pulled this up: blkid doesn't give us anything useful, but there's the device model name; I've got a cloud image here and it shows up as "QEMU HARDDISK". It could be as simple as parsing the model string, because I've got to assume that all of the cloud vendors use the same mechanisms, or are at least consistent across their platforms, right? So you can say, okay, I know this is GCE, or I know this is an Azure thing. Yes, so we could do that. Maybe the right answer is that it's something we encode into blkid, so that we're only creating this heuristic once, instead of each file system community maintaining its own set of heuristics. Because I know what the block device is for Google, because I work there; I actually don't know what the device is for Azure, because I don't work at Microsoft, right? But presumably we could come together on some common heuristic, which may be the best that we can do.
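[Editor's note: as a rough illustration of the model-name heuristic being discussed, the device model for a SCSI or virtio-scsi disk is exposed in sysfs. The match patterns below ("QEMU HARDDISK" from the transcript, plus strings commonly reported by GCE and Hyper-V/Azure disks) are illustrative guesses, not an authoritative list; real detection logic would presumably live in blkid/util-linux.]

```sh
#!/bin/sh
# Sketch of the device-model heuristic. Takes a block device name
# (e.g. "sda"); virtio-blk devices (vda) lack device/model in sysfs.
dev=${1:-sda}
model=$(cat "/sys/block/$dev/device/model" 2>/dev/null)

case "$model" in
    *QEMU*|*PersistentDisk*|*"Virtual Disk"*)
        # Looks like a cloud/virtual disk: assume it may later grow
        # by orders of magnitude, so pick growth-friendly mkfs
        # defaults (bigger journal, 4K blocks, and so on).
        echo "probably-resizable" ;;
    *)
        # No evidence of resizability: keep size-based defaults.
        echo "assume-fixed-size" ;;
esac
```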
Yeah, I think the right thing is to encode this in blkid, because the more we can all move toward one source of truth for attributes, the better; then it's just a matter of every vendor putting in their entry so that we know, and the file system code can check it. Because btrfs, and I know ext4 has to do this too, pulls in blkid stuff anyway, as fast as they can make it. So if we just have another thing that we can check, then it's good enough.

So do we have statistics on how often people make cloud device file systems small and keep them small forever? I don't believe we do. I generally hear from the ones who are resizing in very tiny amounts and then filing angry tickets about why it's so terrible. So yeah, I don't have those figures. Because I'm worried that if we change the defaults for cloud file systems, we end up punishing the people who make the USB-disk-sized cloud file systems; we should have data on what people are really doing with it before we set the default. But I love the idea of being able to say, like, mkfs dash-something: my biggest possible file system is going to be 100 terabytes, versus my biggest possible will be 256 meg. Yeah, I think part of the problem is that we don't necessarily have good hints about even what the maximum size of a particular block device might be, right? That might be one of the other things we'd want to encode in there, because different cloud providers have different maximum sizes and different minimum sizes for their pseudo block devices. And for LVM, we might be able to get some kind of hint about what size this file system will ultimately be. Because that's the other way we could do it: we could provide command line options to mkfs to let the user supply that hint. But the problem is most users won't know about the hint, and most users will either be using the distro installer or simply installing the default cloud image, so we don't get the ability to get that hint from the user. So maybe the answer is the hard answer.

Can you online-resize the journal? And if not, why not? So some of those parameters we can resize, but some of the parameters are actually rather fundamental to the file system format. The journal is probably the easiest one; it can be resized for ext4, because we allow the journal to be discontiguous. I believe XFS requires that the journal be contiguous on disk, so if you resize it, and again, if you do the pathological resize of one gigabyte at a time, a thousand times, then it might actually be difficult to resize the journal without doing an online defrag of the file system. Yeah, and there are the allocation groups as well, which I think are fixed at mkfs time. I think that would be much harder. There's one journal, but that's not the only data structure that you want to size based on the size of the disk. That was my question, yeah. Well, I think with XFS you can have different-sized allocation groups. I'm not the XFS expert, but it depends on the file system; some file systems have these limits. Yeah, so Sandeen is on chat. He's saying XFS doesn't have online log resizing, though they've talked about it. The bigger problem is the granularity and count of the allocation groups. So you end up with, yeah, variable-size allocation groups.
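[Editor's note: for ext4 specifically, a mkfs-time hint along these lines already exists: mke2fs's -E resize= extended option reserves enough block group descriptor table space to allow online growth up to a stated maximum, expressed in file system blocks. A sketch, with a placeholder device name; consult the mke2fs man page for the exact semantics in your version.]

```sh
# Create the file system on a small (say 10 GiB) virtual disk, but
# reserve descriptor space so it can be grown online up to ~100 TiB.
# With 4 KiB blocks: 100 TiB / 4 KiB = 26843545600 blocks.
mkfs.ext4 -b 4096 -E resize=26843545600 /dev/vdb

# Later, after the provider has enlarged the virtual disk, grow the
# file system online to fill it.
resize2fs /dev/vdb
```

Of course, as noted above, this only helps users who know the option exists; it does nothing for the default cloud image case.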
Right now, we think it might be more straightforward to improve the efficiency of the allocator in the face of a bazillion AGs. And this is the btrfs problem too, right? We have fixed-size block groups, and if you get to terabytes or whatever, you've got a lot of block groups. Yep, yeah. So as I said, I would love further suggestions for potential solutions in this space, and if not, maybe I end 10 minutes early.

Yeah, honestly, everyone in this room is very aware that creating more options for users just creates more ways for things to go horribly wrong. The more intelligent we can make our tools by default, the better. I think the best way to accomplish this is that blkid gives us more information, and we then take that information and make the best choice possible. Is it going to be right all the time? No. And then perhaps we still provide the options for power users, and by power users I mean sysadmins or production engineers, to set the right things for their class of machines inside Azure or Google or whatever. But by default it uses blkid, with as much information as we can get, to give us the best answer that we can. Right. And it's probably very file system specific: for example, if XFS created a few very, very large allocation groups on a USB thumb drive, how bad would that be, versus making it more efficient to support a gazillion allocation groups, right? It may be that the answer will be different on a per file system basis, and all we can do is give better hints to each file system's mkfs.

Right, and for ext4, for mke2fs, you have this config file, the ini file, where you can specify different, what they call, file system types; you could have sections there per hardware type, with a different set of default configurations, maybe. Well, with ext4 we have a config file, which makes it easier to change the defaults, and maybe it could have sections per hardware type. Yeah, but we still have the problem that we need to know: is this a resizable device, and is it a resizable device that is going to grow by orders of magnitude or not? If we know that, then we can do different things. I think the main advantage of the config file is that someone can change the defaults without recompiling mkfs for ext4, but it doesn't change the problem that we need to know what the right defaults should be.

I presume, and this is probably a naive question, that one alternative, since we can't rely on the alternative of educating the users, is to copy everything from this partition to a bigger partition, which solves the fragmentation problem. Yeah, the problem is they're trying to do this online; absolutely, it has to stay online. Right, so one of the typical use cases might be that they're running a MySQL server, and the MySQL server is set up with some job that constantly checks the free space on the file system, and when the free space drops below a certain critical level, it adds five gigs to the file system, right? And the problem is their resize script is optimizing for the cost of the disk while ignoring performance considerations. And especially because we only have, like, four companies to ask, I would love to have data on how people use this most often: what's the normal size for creating it, what's the normal size it gets expanded to every month? There's a lot of data that could help us make good defaults here that we don't have. Yes, that's certainly true.
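[Editor's note: to make the mke2fs.conf idea concrete, here is a hypothetical sketch of such a section. The [fs_types] mechanism and the blocksize/inode_ratio knobs follow the existing mke2fs.conf format, but the "cloud" type and its particular values are invented for illustration.]

```
# Hypothetical mke2fs.conf fragment: a growth-friendly "cloud" type.
[fs_types]
	cloud = {
		blocksize = 4096	# never 1K blocks, even on a tiny device
		inode_ratio = 16384	# size the inode tables for eventual growth
	}
```

A distro image or cloud tooling could then run something like `mkfs.ext4 -T cloud /dev/vdb` (selecting a usage type with -T is existing mke2fs behavior), or blkid-driven detection could pick the type automatically.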
I suspect that for many cloud customers, it's not just the sheer number but the value of the customer, right? There are plenty of free tier customers that probably never use anything bigger than the 10 gigabyte disk, but since they're users in the free tier, nobody cares about them. The much more interesting question is, of the customers with a very high per-year spend, how many of them are using small devices? And so we'd be asking the cloud companies to do a bit more work, and as they do that work, they may or may not be comfortable revealing information at that level. But yes, it's certainly worth asking. Any other questions? If not, I can add one.

I wonder, if someone had designed this system from the beginning, you could provision a large disk in the cloud and then say: I'm not going to use all of it, I'm going to cap it, like a quota, at this much. You know, Dave Chinner had a proposal once for XFS to create a file system on a large disk but limit the disk usage. Yeah, it's not a quota, but it's similar; it's like a global quota. Another solution, which requires a change to how clouds charge for these virtual block devices, is to charge for the amount of space actually used as opposed to the amount of space that is provisioned. There are some tricky bits about that. For example, with Google Cloud, the number of IOPS you get is based on the size of the disk you provision, because that was deemed simpler for customers to understand. Also, customers do appreciate knowing how much they're going to have to pay each month; if you bill on actual capacity used or actual IOPS used, the amount charged to the customer becomes much more variable, and that becomes a product management question about whether or not that's acceptable. But certainly, cloud companies could change how they provision and bill for these virtual block devices in ways that would make our lives much easier.

If I could add something: you'd then have the reverse problem, in the sense that if you charge only for what's used, then even if I only want to use a gig of data, I'll start with a million-terabyte device, and you'll end up provisioning for a probably much bigger device. Yeah, well, there's the other problem, which is that the cloud company needs to be able to know how much space the customer might actually use, right? Provisioning for a zettabyte of space, even if the customer only uses 10 gigs, is not free. But again, these are product management questions, which those of us who work for cloud companies would need to take up with the product managers for the virtual block devices, and those conversations would be happening internally at each company.

So what I see here is that most of your approaches are proactive. I'm trying to understand if there is a reactive approach, in the sense that if you could change resize2fs or something of that sort, what kind of effort would it require to make the file system expandable after the fact? I know some things are pre-allocated and will be limited, but even with respect to, say, a smaller block size, can it be expanded later? I know it may be a lot of work, but have you explored that option? That ends up being very, very per file system, right? So for example, what we could do with ext4 is just always create file systems using a 4K block size, right?
So that we do not use a 1K block size on a USB thumb drive. Now, the reason why we use a 1K block size for a 128 meg USB thumb drive is that it's more efficient if you have small files to be using a smaller block size: less performant, but more space efficient. And maybe it's worthwhile to just not worry about that, because most thumb drives are a lot bigger than they were 10 years ago. So we could do that; we could simply change the default to be 4K. We could, for ext4, make the journal grow in size as we do an online resize; that's coding work on our side, but we could do that. There are probably some other things, but what might work for ext4 might not work for XFS. So when we were talking about this, it was like, okay, what are the things that would actually be useful for all file systems, as opposed to specific hacks that ext4 or btrfs might want to do? And I would certainly encourage those other file system communities to think about it. As Eric Sandeen said, maybe XFS just needs to make file systems with a large number of allocation groups more efficient as a partial workaround, but that's going to be a very, very XFS-specific solution. I don't know what the story would be for btrfs or F2FS, right? We'll all need to figure that out. Yeah, for F2FS, one of the guys said that they would have to do some GC and stuff; for btrfs, it doesn't matter. Yeah. All right, I think I'm at the top of the hour. You are.