I spent a lot of time on these slides, as you can tell. Readahead. We love it. It gives us better performance. This illustrates what goes on when you do readahead. If user space is reading one byte at a time, we do not, in fact, trouble the file system to read one byte at a time. Let's assume we're going through the page cache: we read, say, 64 kilobytes at a time. And this is for pipelining, right? We're assuming that there's a certain amount of latency between asking the file system to read a page and the page actually becoming up to date so you can give it back to user space. Some of you are network file system people and you think you're special. You're not. This is a problem for disk as well as network file systems. It's even a problem when you're on SSDs. Yes, they're really, really fast, but it's not instantaneous; it takes a little bit of time for the flash chips to give up their data. The problems are somewhat different, of course. We run on a whole wide variety of storage. On the networking side, everything from 10 megabit Ethernet all the way up to 400 gigabit. That's a wide variety of bandwidths, and of course latencies too: your server might be under your desk or your server might be halfway around the world, and the speed of light apparently is still finite, no matter how hard we try. And then on the block side, of course, you might be trying to do I/O to some crappy USB key picked up from a vendor at a trade show, or you might have a nice, shiny 5 gigabyte-a-second NVMe drive, or a RAID array full of them. So we've got a lot of different stuff to contend with here. But the network people seem more interested in making sure that this stuff works well for them than the block storage people do. I find that block people seem to just say, oh yeah, the page cache sucks, never mind, I will do all my tests using direct I/O. I mean, that's the part of the system that they can affect, right?
Block file system people tend not to venture into the page cache and say, hey, this sucks, this needs to get fixed. Do you want to use the mic for that? Or I can repeat it? When you say that, usually you're looking to use fio to measure things like the link speed or the block device speed, so they're looking to exclude effects that the page cache would have. Yes. I mean, if we do a performance measurement that includes readahead, everyone says it's not valid. But in another sense it's entirely valid, because that's what users are actually seeing. And if we have these effects where we can see performance cratering because we're not doing enough readahead, or because we're doing too much readahead, it would be nice if we noticed these things before the user does. Sometimes we're just responding to bug reports, right? And we say, oh yeah, that's what's going on. But I would like to hear more in the way of bug reports from people saying, hey, the page cache sucks, and here's why. I'm compiling a little list; I've started a wiki page on KernelNewbies for page cache problems, just so that nobody gets too complacent about saying the page cache is awesome, because it is awesome in some ways, and in other ways it's terrible. So the Android people have some problems that they've noticed, and being Android people, they have worked around them in the worst way possible. They set the... I think it's the maximum readahead; usually it's at 256K. And in order to get startup latency down by some fraction of a second, they wind it all the way up to megabytes, like hundreds of megabytes. And apparently that helps Google Maps start up like a tenth of a second quicker or something. It has other effects, but that's the only one they're measuring, so they do it. They've tried to upstream some patches relating to that and we've said no. So I would say they should be using some sort of fadvise() instead, something that says: I want to read all of that, so read it all ahead, please.
Okay, can we stop trying to pretend that user space knows what it's doing? I mean, come on, fadvise()? Seriously, I'm not taking you seriously if you say fadvise() again. fadvise(). Okay, so I'm going to take the microphone from Dave. But the Android use cases are very specialized, because they use a log-structured file system on the phone, so expanding the readahead does pull in most of the data they will eventually want for boot. That would be true if readahead were physically based, and it's not; it's virtually based. And it does that on F2FS, which is what they're still mostly using. Yeah. So I think we have a bunch of problems to solve with readahead. And it affects all of us, right? I mean, I'm going to say the F-word: folios play into this a bit, because the use of larger folios is driven by readahead right now, and only by readahead at this point. So the larger the readahead gets, the larger the folios we create, which is kind of fun. It was good for testing. I hope somebody says, hey Matthew, you've done this all wrong, and goes in and implements a better system. But it's been really good for testing, because if your file system supports large folios, they start getting used immediately. We start out by allocating order-2 folios. And so far that's AFS and XFS, and there are patches for AFS and 9P to do this. I gave them to Linus, but he didn't take them because it was late in the merge window, but it works. Oh, so AFS is not going to... It didn't get upstream, but I have the patch. It's not in 5.19; it will be in 5.20? Yeah. I hope CIFS will be in there as well. Sweet. Looking forward to it. That's great. So one of the problems that Steve French brought to my attention is that Windows does really, really, really large readaheads. I think four megabytes is what you said. Yeah, please.
So just to give some context here: when you have normal latency over a network, the cost of sending each frame kind of outweighs the benefit of sending a smaller I/O that's closer to the size you want. So I did a lot of performance checks against various servers, and by default, I think the maximum... I don't know why it's set to this. I think some servers could set it to 16 meg, but 8 meg is the largest I saw. I saw some performance degradation in my benchmarks against various servers, trying various clients, at 8 meg; it helped with some and hurt with others. But I saw large improvements from 256K to 512K, big improvements from 512K to 1 megabyte, slight improvements to 2 meg, and slight improvements to 4 meg. But in Azure, for example, we decreased it from 4 meg to 1, because other clients saw a slight degradation going to 4. I think NFS defaults to 1 meg, and I default to 4 meg, but there are servers, like Azure, where it's going to negotiate 1 meg, and clearly anything less than 1 meg will hurt performance unless you have a really fast network. And the problem I see, though, is: how do you throttle sanely? Because, yes, there are credits, and there are things that flow-control network block devices and network file systems. But this is a lot more than just simple flow control. There are cases where the page cache knows more than I do. And yes, sure, great, send me 1 meg or 4 meg, great. There are cases where that's idiotic from your perspective, and what I don't know is how to aggregate the information that you know from all the data points: the page cache, the app level, and then, of course, the network server. I can tell you when the server is throttling, great, but that's like 10% of the problem.
A bigger problem is that we want to handle this well, and so I kind of feel like some examples would help here, because what I'd like to do is advise the page cache that I prefer 1 meg or larger, give it a maximum level, but give it a lot of freedom to go to smaller I/Os. Thanks, Steve. That is incredibly useful. Thank you. So we do have a mechanism for communicating between the file system and the VFS, or the page cache, and that's the BDI, the backing device info. Badly named, because of course NFS actually has one and I believe CIFS has one as well. But that's where the VFS looks to find out the performance characteristics of the underlying storage. So if that gets updated, and maybe it doesn't have exactly the right information in it right now, but the way Linux is structured, that's clearly the right place to put it, because the readahead code does consult it to find out various things about the storage right now, and it can be extended, or maybe just used correctly. I'm not going to say we're using it incorrectly, maybe not, I don't know. So there's definitely... So what we do now is we mark... I swear there's a tool to let me highlight things on this. I'll just point. So when we do that first 64K read, remember the application we're talking about in this case is just reading one byte at a time, right? It's a very stupid application. It's theoretical; I'm just making it up. We set a marker, like, I don't know, 20 kilobytes into that 64 kilobyte block, and we say: when we get to that point, we should kick off the next readahead. So this application goes on, and after 20,000 calls to read() of one byte each, it gets to that point. The page cache sees: oh, you're trying to read the page that's marked as readahead.
Okay, I will now kick off the next readahead, and so it sends the read for the next 64K, in the hope that the 64K it's reading will have returned by the time the application has done another 40,000 calls and got to the end of the 64K block that's currently being read. Well, that's going to depend on the latency, right? I mean, how long does it take to get 64 kilobytes back? The MM really has no idea at that point. It's just chosen some random point within... It's not entirely random, it's making a decision, but it's not really based on very much: that readahead I did before was successful, it was useful, so I should do more readahead. It doesn't have any kind of estimator to say: oh, it took this many microseconds to come back, I should schedule that earlier. But does that matter? I mean, what you're saying is that we have no idea what the latency is, so sometimes it'll have returned way, way before we need it, and other times it'll be just in time. Does it benefit us if we tune it so that it's just in time? You're jumping ahead; let me get to that. So when it reads that second 64K block, it did it ahead of time, so the whole 64K block was actually optional, and it sets the "was that readahead useful" marker right at the very, very start. So as soon as you get to that second 64K chunk, it will say: ah, that was successful, and moreover, it wasn't I/O I had to do anyway, it was entirely speculative, and it was useful, so I will grow my readahead window. And it grows up to 128K, and it sets the readahead marker at the beginning of that 128K, and so the same thing happens again. It does 256K, and then it stops at 256K. Now, 256K is just the limit, and it's been the limit for 20 years, and it has hardly changed at all in that time, right?
So what we could do is grow that 256K, and from what Steve was saying, just growing it to one megabyte is probably the right thing to do, and we probably should have done it 10 years ago. And for cachefiles, at the moment it just caches pages, but I'm going to need to move to 256K blocks or something, which means I need to... Currently I'm using the readahead_expand() function to make this work, but that comes with another problem. Not so much with an application like this, but with one that's doing random reads. If you get two random reads on two separate threads in the same 256K block, it will end up not being able to cache the block, because the two readaheads will collide and stop somewhere in between, and I won't get the aligned blocks that I need, and so it will drop both of them. Is that actually a real problem? Is that a real problem, or is it something you've thought of? Well, it is a problem that can occur. I know it can occur, but I'm asking: is there a real application that does that that you're trying to fix, or are you just trying to avoid problems? Only if it's doing random reads with multiple threads on one file and they happen to collide. The linker, for example; the new linker seems to mmap files and read randomly from them. But it's single-threaded. True, at the moment. I guess if you're trying to link the same object file into two different binaries at the same time... This doesn't sound like something which is actually happening, right? The code hasn't gone upstream yet, so no one's encountering it yet. How about we see whether it's a problem before we fix it? I'm trying to avoid it being a problem. I'm trying to fix problems which actually exist right now. We can fix the problems that you have once they actually exist. Can we at least get it so we can tune the... give a minimum? Yes. And alignment comes into this as well, because of what we're dealing with; I don't know whether the readahead blocks are always aligned.
They're not; they're frequently misaligned, actually. Because it's possible, with a file system like Ceph where the objects are, say, two megabytes in size, that they're scattered all over the place. It would be really handy if a readahead didn't cross two objects on two different machines. So a two megabyte read is fine if it all comes from one object on one machine. I think we've solved that problem with readahead_expand(). We can say: oh, you're crossing a boundary, we'll round up to the next granule. But maybe this readahead happens to cross the granule boundary. Would it be preferable to actually tell the readahead code in advance where the boundaries are, even if it's just one number: align to this size? No. And let's also remember the way you... Let's take the simplest example, like copy. You're doing readahead with copy. This is really boring. Take a tool like rsync: rsync does small I/Os, and it's not as parallelized as you'd like. And then cp has some cool little tricks; cp can seek holes and things like that. But it's single-threaded as well. On other operating systems, the copy tools are multi-threaded, so you can turn off caching and do some cool things if you're trying to actually back up your system. But we don't have a lot of choices for copy tools, and I know that three years ago there was a great talk on all these tools nobody's ever heard of, mcp and others, that do better jobs. But we have to deal with dumb tools; they're going to read, read, read. But I wanted to give you guys some context at a very high level. And also to ask maybe the NVMe guys about their preferred I/O size: if I understand correctly, for reads their latency is not quite as good as you'd like, but their bandwidth is really, really high. So I would think that even for NVMe local devices there's some advantage to these larger I/Os. But to give you some context, we did some detailed stats on copy.
Seven milliseconds, roughly, per frame for the copy, sorry, for the read, because over the network a lot of these things are going to be encrypted or signed. Seven milliseconds is a lot if you don't have hardware offload in your... Some algorithms, remember the TLS discussion, may or may not be supported in the hardware. Seven milliseconds is a lot of wait on one I/O: dead time, nothing on the wire. Using hardware offload, SMB, many servers support that, we got it down to about one. But that gives you a rough idea: if you spend, on a typical VM, one millisecond of dead time with nothing on the wire waiting on that I/O, should we send two, should we send three? Because that one millisecond of dead time doesn't count the network latency. You guys know your network better, but you have a little bit of network latency dead time as well. So that dead time is avoided if we have, like that expand discussion you were starting, a sane way of doing that expand. Unfortunately, with Dave's patch, we looked at some examples and there were cases where we had 4,000 I/Os in flight, which is really, really, really bad. That's too many. That will do some bad damage: think of network buffers, think of server queues, think of client queues. 4,000 is bad. So that may just be a bug in something we're playing with, but getting the readahead expand right matters, because that one millisecond is a best case for a lot of remote things, because you're encrypting or signing, and in that one millisecond you could do a lot of other things. So I just want to ask you quickly, given the two bottom ones: which would you rather see? Would you rather see two 256K I/Os, or would you rather see one 512K I/O? And bear in mind these are both readaheads; the application hasn't asked for these yet, and that whole 512K of data may be fetched in vain. No, no, I'm saying: would you rather see two packets back to back hit the server, each asking for 256K, or would you rather see one packet for the whole 512K? Sure.
I understand; it's going to be coalesced at both ends of the spectrum. Is better? Okay. I have a different problem, maybe a design bug, that NFS and some others share. If I'm writing, stuff gets parallelized, great; it uses whatever thread came in, and Jeff Layton did great stuff launching async threads for that. On the read path, Windows and some other clients do a great job of forking lots of socket reads, so you have one socket, or perhaps, depending on multi-channel, multiple channels, multiple sockets to the same server. And that's the default that many clients use: when they mount, they automatically set up two sockets, so they don't have any serialization issues on read, because they're going to read from multiple sockets, and the server has no problem if the client opens multiple sockets. It was ironic to me that most workloads behave better with two sockets even if there's only one network adapter. Anyway, I thought that was kind of a curious, odd thing. But in general, yes, the two 256Ks would do better than the one 512K. Of course, in a perfect world you'd have four 512Ks and we're happy. Well, how about eight 256Ks? Okay. Okay, cool. So you want to jack up the readahead size. Actually, I don't. A lot of people are proposing that we jack up the readahead size.
What I want to do is jack up the number of outstanding 256K readaheads. So I want to do the top one: I want to send out like four or eight 256K requests, because that way we get so much more information back about the application's usage patterns. Okay, so instead of going to the 512K, you want to send out, say, five 256Ks in this case? Okay. And you want to do this partly for performance, so we're more likely to have a cache hit, which in this case, we've already had two cache hits in a row, so we know it's probably going to be okay; and the other reason is because you want to allocate larger folios? Is that what I heard you say in the beginning? That's kind of a side effect. That's just a side effect; it's not my primary motivation. It's kind of a cool thing that happens to also happen. Okay. As with all of this stuff, you know: benchmarks. You give me fio jobs, I put them in my nightly testing, and we'll see if it goes better or worse. Fantastic. And I can even just baseline and then apply your patches and then test, because it is sort of ridiculous that we haven't changed it in 20 years. I say screw it, let's increase it and see what happens, right? But the thing is, historically we don't keep track of this kernel over kernel, but our FS does, so you give me stuff and I'll run it and, hooray, we get the data. And we need to do that in general; for this it makes sense. I'm going to say: yes, go for it, see what happens. I do sort of worry about the larger... Like you say, it's a side effect, but that's a significant side effect: suddenly we have larger pages that may or may not be used, and what does reclaim look like? There are knock-on effects of increasing readahead size that might hurt some workloads. I'm not saying don't do that for those nebulous, undefined reasons, but I think we need more information. I agree with every word you've said, Joseph. I will say, once we find these workloads, there are things we can do to start addressing them: we can be way less aggressive about scaling
up the folio size, and I think we should be. It's only like that because I needed to do some testing, right? And so, again, I'm not saying we don't make changes because of nebulous things; I'm saying that when the nebulous things happen, we address them, we build tests for them. We just assume that what we have works, or you write tests to show where the improvement is, and then when we find counter-examples, we have tests for those and we make changes there. Again, I agree with every word you said. So, something to think about. I was just going to answer Willy's question from the standpoint of someone who works on network file systems, and that is: we would far prefer the larger I/O. The setup overhead for a read can be pretty substantial on a network file system, so if we could do fewer RPCs, in general it's much more cost-effective for us. It's a server-side overhead as well, that sort of thing. In general we would want to prefer larger I/Os when we can get away with it. Was that Jeff Layton speaking? Hey, Jeff. So you're actually arguing with Steve French there, because Steve says he'd rather see eight 256Ks, and you're saying you would rather see one 2MB. It depends on the file system. Yeah, and it's not just the file system. I think that for CIFS, because CIFS is pretty similar, it has pretty significant setup overhead for reads and writes as well, so I would think that we would want larger I/Os on CIFS too. We definitely saw advantages in every single example I did with the current page cache, against every single server we tried, up to 1MB. There was one example, Azure, where we saw some advantage to 2 but none to 4, but with every other server we saw advantages to 4, and then it was mixed at 8; that's megabytes. So if you were talking about a single I/O in flight, absolutely, absolutely no question: I can't imagine a network case where smaller than 1MB makes sense. I think the reality is it's always going to depend on the file system and the storage device. If we're talking about
really, really slow USB thumb drives, even a 256K readahead, never mind a 1MB readahead, may start getting painful. And if you're on a very small-memory device, doing large readahead has other knock-on effects. There's no one magic readahead formula that is going to be good for everyone; the only question is: where is the squeaky wheel? So to that end, I have a suggestion, which is, granted that the average user will never use this, we should think about allowing the readahead algorithm to be overridable by an eBPF program, because that will make it easier for people to experiment and then actually present the results. It would not surprise me if in the future the readahead algorithm ends up being: if you are on file system foo, it should be this default. Maybe the readahead algorithm should be monitoring read throughput and latency and auto-adjusting based on that. And I submit that it will be much easier to experiment with what the auto-adjustment algorithm should be in eBPF than by forcing people to modify kernel code, because if you make it an eBPF program, then we can throw an intern or a grad student at it as a research project to figure out ideal readahead algorithms. Because we're not going to get it right, and if we make it easy to experiment, then maybe we can change the readahead algorithm and tune it every three years instead of every 20. I was going to start arguing with you about that, and then, I'm glad you kept talking, because I think you've persuaded me that I can just make this somebody else's problem. My fear is that the Android people are just going to turn everything up to 11 and say: hey, my phone launches whatever app 0.1 seconds faster than our competitors', so you should buy our phone. And it will be like, whatever. So, just as a data point, Jan pointed out in chat that SUSE has had 512K set for their readahead size for years. Oh, perfect, great, let's set it to one meg: if it's been set to 512K for years, then we should clearly set it to one meg now. Hey, Chuck. I'm kind of
concerned about this conversation, because we're greatly oversimplifying the problem. The oversimple example that you started with was a program reading one byte at a time. There's no reason to increase the readahead window in that case: just keep reading 64K and you'll keep up with that program. The problem with reading 512K and 1 megabyte and 2 megabytes is that suddenly you've got a huge amount of data in your page cache, and that program may not get to it, and it may be reclaimed, and then it would have to be read again. There are other problems: the mix of I/O that's going over a network fabric. If you've got a bunch of small metadata operations, like GETATTRs and things like that, those are going to be held up behind a 1 megabyte read, substantially held up, and that will cause a lot of outlier latencies for other programs that are running on the system. So I'm kind of concerned about jacking up the size of readahead significantly. This is speculative data that you may or may not need, so I think we should be careful about exactly how this is going to be changed. So maybe I'm echoing Ted's concern: maybe we should be thinking about ways we can make it a lot easier to experiment. eBPF is an interesting idea; that's something the researcher him- or herself can do, and I don't think we need to put that in the mainstream kernel. But experimentation is number one. I think we also need to invest in having a wide variety of workloads running at the same time, to understand the systemic effects of increasing readahead. Thank you. So, to make it easier to experiment, I added a mount option a release or two ago, rasize, because we don't use the BDI. So there's rasize; it's controllable, so I can set it to whatever: 512K, 256K, 1 meg. In simple examples, the primary benefit I saw with rasize was only when I was running multiple channels, so I didn't have the serialization issues; so I was increasing rasize. So I just added a
mount option. It's not set by default, but your intern can try that. I assume other file systems have the same thing; we just use the BDI. And I'm cutting us off because it's lunchtime. For the remote people, I'd like to point out that the schedule has something for you to join after lunch, which is now in 50 minutes: it's to give you a chance to share your opinions about how the virtual aspect of this has gone. We'd also like to see you, I don't know how that's going to work, but turn on your cameras. That's happening after lunch, so in 50 minutes, if you want to come back for that. Thanks.