Thank you, Tom, for the kind introduction. And hello, everyone. If you don't already know, my name is Allan Jude, and I'm the CTO at Klara, a FreeBSD development and support company. So if you need help with ZFS or FreeBSD, please reach out to us and we'll be able to help you. But today we're going to talk about what we're willing to change in ZFS to make it scale better for NVMe and other kinds of very fast storage that are quite different from the hard drives ZFS was designed for. We like to talk about ZFS as the future of file systems, and a lot of us have done that, and after having used it a bit, I can't really imagine trying to live without it. But we're also starting to realize that ZFS was originally built in 2001. At that time, a hard drive was about 100 gigabytes if you were lucky, 15,000 RPM hard drives had just been invented, and flash wasn't really a thing yet. So ZFS spends a lot of its internal code and effort on optimizing your workload for what a hard drive can do. One of the biggest things in ZFS is that we batch up all the asynchronous writes into one big transaction. A hard drive, in the best case, is going to take about four milliseconds to do something in a different location than where the head is now. But with the latest NVMe, that can happen in single-digit microseconds, and that's completely outside what ZFS was originally designed for.

One of the other things that's changed is that traditional interfaces like SATA and SAS can basically only do one operation at a time. They have queuing, so you can queue up some commands and the firmware can sort them and decide what order to do them in, but generally the head can only seek to one spot on the hard drive at a time (we'll talk about dual-actuator drives some other day). So ZFS spends a lot of effort trying not to queue too much work to the drive, because if something high priority comes in, you don't want to put it at the end of a long queue, where the latency before you get to that high-priority read is going to be really high. But as we'll see with NVMe, you want to send as much work as you can, because the device is always going to be faster than you are. We've also observed that with modern hard drive firmware, when you queue a bunch of commands, the drive will sometimes prioritize whatever work is easiest, like a person who procrastinates on the hard task by doing something else first. And that could cause huge latency on ZFS scrubs and so on. Thankfully, Alexander did a bunch of work on that, and now ZFS, especially when scrubbing, won't queue so much scrub I/O that you get huge, four-second latencies on regular reads while a scrub is happening. But again, all ZFS is doing there is trying not to send so much work to the hard drive that it overwhelms it.

But now we have flash. To get more performance out of the SSDs we have, we need to queue more commands to avoid work starvation, where the SSD does its bit of work and then has nothing to do until you send it more. So the flash spends some of its time idle while you're wondering why it isn't going faster. And so we have to tune ZFS to avoid some of those hard drive optimizations if you have an all-flash pool.
But even so, the limitations of the SAS and SATA interfaces mean you can only queue so many commands, and you can't really control how those are distributed across the bits of flash in the device. And currently in ZFS there's no facility to adjust a lot of these things, like how many commands ZFS will queue to the disk, on a per-pool basis. It's a system-wide sysctl. That means if you have one pool of hard drives and one pool of flash, you kind of have to pick which one you optimize for, and the other one has to suffer with the less optimal settings. And that can really cause a lot of problems. ZFS also has a bunch of other bits, like an LBA bias where, when it's allocating space, it will try to allocate additional space near where it just did, to avoid making the hard drive seek more. But with flash-based storage, writing to different parts of the flash at once might actually give you more performance. Some of that is taken care of under the hood by the flash translation layer, but sometimes purposely writing to a bunch of different segments of the flash will give you more performance than trying to write it all sequentially. So again, the default tuning in ZFS is kind of working against you here: it makes great sense on hard drives, but not so much on flash.

But then flash got even better with NVMe. In this case, instead of having one command queue, we now have many, usually at least one per CPU, depending on your device. So instead of a single queue that maybe takes 32 commands, we have lots of queues that take lots of commands. And unlike with hard drives, we actually want to keep those queues full, because NVMe flash devices are so fast that if you don't queue up enough commands, by the time you get the completion notice and act on it, the drive has been sitting idle for enough cycles that it could have done a lot more work. The NVMe spec actually goes all the way up to 65,535 queues. I've never seen a device that does that many yet, but the point is that you could queue that much work from that many CPUs and really keep things going. But in order to get the advantage of this, we have to be able to send enough work to the NVMe to actually saturate it, and oftentimes the bottleneck ends up being in ZFS, not in the NVMe, with the way it's built right now.

So if we take a minute to compare a modern 12-terabyte hard drive, a modern data-center-grade SATA SSD, and a hot-swappable NVMe device from Western Digital, we can see some pretty stark numbers. The biggest one you normally notice on the label is how fast you can read and write. Your hard drive is doing about 200 megabytes per second if you're doing all sequential writes with no random seeks in between. The SATA SSD is about hitting the interface limits of SATA, doing 500-ish megabytes per second in each direction. But then you see the NVMe doing over three gigabytes per second. The IOPS is where the difference really starts to come in. On the hard drive, if you're lucky, you're getting 250 operations per second: if it takes four milliseconds to move to another random spot on the drive, and you have 1,000 milliseconds per second, then the best you're going to get is 250 of those operations in one second. With the SATA SSD, reading, you can maybe get all the way up to 100,000 per second, though for writing it's not as good, down to only about 10,000.
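To put the seek arithmetic in one place (the 10-microsecond figure is the NVMe latency mentioned a little later in the talk, and the single-queue number assumes only one command in flight at a time):

$$\text{HDD: } \frac{1000\ \text{ms/s}}{4\ \text{ms per seek}} = 250\ \text{IOPS} \qquad\qquad \text{NVMe: } \frac{1{,}000{,}000\ \mu\text{s/s}}{10\ \mu\text{s per op}} = 100{,}000\ \text{ops/s at queue depth one}$$

Which is exactly why you want lots of commands in flight on NVMe: at queue depth one, even a 10-microsecond device is capped far below what the flash can actually do.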
But with NVMe, we're very quickly approaching a million IOPS from a single device, without having to RAID a bunch of them together. And when you do put them together you can get huge numbers, but you won't actually see that performance currently, because we can't send enough work to the drive that quickly. The latency is really where everything changes on us. We saw that even in the best case, the hard drive was looking at 4,000 microseconds for each operation. Writing to a SATA SSD has quite a range, but that range is between 100 and 650 microseconds, which is still significantly less than 4,000 microseconds. And ZFS, and even the operating system, spend a bunch of time trying to sort and manage and coalesce all the work to deal with the fact that the hard drive can only do operations on that 4,000-microsecond timescale. So it makes sense to spend 200 microseconds optimizing that work if we can get two operations for the price of one. But when the device can do it in 10 microseconds, like this NVMe, then spending 200 microseconds sorting and coalescing the work first is actually working against us, not for us anymore.

So fundamentally, for some of this, we might have to rethink how we issue I/O to devices and how directly the file system talks to the hardware. Right now, ZFS is talking to GEOM, which is talking to CAM, which is talking to the drive, and at each of those layers we might be doing some work to try to optimize and get the most out of that single operation. But when the latencies get down into the single digits, that amount of work doesn't make sense anymore, and we might be better off taking a straighter shot to the hardware: send the operation, get the response back, and have the operating system and ZFS get out of the way instead of trying to help, because when they're trying to help, it's not always actually to our benefit. So yeah, like you were saying, when we had latencies in the milliseconds, it made sense to spend a bunch of CPU time trying to avoid unnecessary operations and to collapse a bunch of operations into one bigger one. But with solid state, there's no seeking penalty, so maybe that doesn't make as much sense anymore. And as we move to even faster, lower-latency devices, there's even less time where getting the CPU involved will make sense, and we basically have to get out of our own way.

Currently, ZFS creates four task queues for dispatching I/O to the disks, to keep all the asynchronous tasks from being able to delay synchronous tasks, and to keep those queue depths short, so you don't end up submitting an operation and not getting a response for 900 milliseconds. If the device can do the operation in 10 microseconds, you definitely don't want to be waiting that long. Even a hard drive is supposed to be able to do it in four to ten milliseconds, but if we had to wait 900 because there were a whole bunch of operations in line ahead of us, we don't want that. It's like the problem Britain is having this morning: please don't join the queue to see Queen Elizabeth, the queue is over 24 hours, and they're basically dropping operations instead of queuing them because the queue is too long. And because ZFS's original design was about spinning rust, we try to avoid interleaving reads and writes, because that would seek the head away from where it was trying to go. But with NVMe, there is no seek.
And maybe interleaving more reads and writes will actually get us more performance. Up to a point, and that's part of the problem. As we saw years ago with Warner Losh's work on I/O throttling, on a lot of flash devices, if you write too much too quickly, it will push the drive into garbage collection or other operations that cause your reads to start seeing latency spikes. So we might have to start considering things like that as well, but first we have to make sure we give the drive enough work that it isn't just sitting there spinning its wheels, doing nothing for some large portion of its time. So we want to get the CPU out of the way of a lot of these operations. As the latency keeps going down, the trade-off of spending time massaging the work before we do it inverts on us: it gets more and more expensive to try to help, and just throwing the work at the NVMe as fast as we can might be the right answer.

We actually started playing with this because we had a customer with a system full of really fast NVMe devices that, in aggregate, could do more than 10 gigabytes per second, but when they ran ZFS on it, they were not getting anywhere near that. ZFS was not living up to the potential of the hardware they had paid for, and they were upset about that. When we looked into it, some of the workload they were applying to ZFS was not well optimized yet, so we helped with that, but that only got us part of the way to the solution, and the bottlenecks were not where we thought they would be. So we'll walk through a bit of that now.

The customer system had beefy CPUs, dual Xeon Golds for a total of 112 CPU threads, 256 gigabytes of RAM, and two big pools, each made of twelve 15-terabyte NVMe-over-Fabrics devices. So each of those two pools should have been able to do about 10 gigabytes per second of reads or writes, and they had even split the connections over multiple fabric channels to make sure the link to the NVMe-over-Fabrics devices wasn't the bottleneck. We started with a benchmarking tool they had written in Go, which used 77 writer threads, 22 reader threads, and one delete thread, working on randomly sized files between one kilobyte and about 1,400 kilobytes. We saw that, writing to both pools concurrently, the most they could do was about two and a bit gigabytes per second of writes, and each write had a latency of about 25 milliseconds: half a millisecond of opening and creating the file, about 18 milliseconds of syncing the file after they had written all the data, when they called fsync, and then another six milliseconds syncing the directory, which is something they were doing that we'll talk about in a second. Concurrently, those 22 reader threads were able to pull about 3.6 gigabytes per second off the flash, with a time-to-first-byte latency of about 3.7 milliseconds.

So we started with our friend Greg's method of just asking why, over and over and over again, like a petulant five-year-old. The first thing was the workflow: what were they doing, why were they syncing the directory, and why did that matter? What their workload was doing was opening a temporary file, writing all the data that had just been sent to them into it, then syncing that file and closing it.
Then they'd rename it from the temporary file name to its final file name, but they found that if they didn't open the directory and fsync the directory after that, then if the system crashed, that rename wouldn't persist, and then the file somebody had uploaded wasn't there, and there was unhappiness. But their benchmark tool wasn't actually emulating their real-world workload. It would open the file directly, write the data to it, sync the file, sync the directory, and then close. So it wasn't actually doing a rename, and it wasn't replicating the real-world workload exactly. So we looked at fixing that.

Number one was: why do they write the file the way they do? Like I said, they found that if they didn't sync the directory, then the rename wouldn't always persist. And that's when I learned that rename is actually not a synchronous operation; it's asynchronous. So instead of closing the file, then renaming it, then syncing, we suggested they open the file with the temporary name, write all their data to it, then rename it, and then call fsync on the file before closing it (sketched in code below). That way the fsync on the file will, in ZFS, automatically force out the directory change as a dependency, and both of those go out with a single fsync instead of needing two, which reduces the overhead quite a bit. Because in ZFS, when you do an fsync, every other synchronous write has to finish first, in order, because ZFS is guaranteeing the order of all your writes. So when they called fsync on just the directory, they weren't waiting for just the directory: they had to wait for every queued sync write to finish before the fsync on the directory would return. And doing that second fsync right after the first one meant they were waiting for some other file to finish writing first, which wasn't really what they needed to be doing. This way they don't have to call open on the directory at all, let alone fsync and close it.

So with the improved workflow, like I said: open the file, write the data, rename, sync, then close, and they only had one fsync that way. Then it was: why does this make such a big difference? Like I said, ZFS is enforcing strict ordering. So when you call fsync, every other fsync that's already been called has to finish, in the right order, before yours does. And, for example, all the writes to the file we just created were asynchronous, but when you call fsync on the file, all those asynchronous writes that aren't done yet get upgraded to synchronous, and all of those have to finish too. With 77 threads all writing at once, you end up at the back of a very long line. So syncing out the directory, which you'd think is nothing, we changed the directory a little bit, we're talking maybe a hundred kilobytes of data at most, why is it taking so long to fsync? It's because of the strict ordering, and the fact that the ZFS intent log, whether it's on a separate slog device or embedded in your pool, is single-threaded in order to maintain that strict ordering. So you end up waiting until everybody else is done. Calling fsync half as many times led to much better performance, while still ensuring that when that fsync returns, that file will be there after a reboot, no matter what.
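For illustration, here is a minimal sketch in C of the improved "write, rename, then fsync" workflow described above. This is not the customer's actual code; the function and path names are made up, and error handling is abbreviated.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a file durably: write to a temporary name, rename it into place,
 * then issue a single fsync on the still-open file descriptor.  On ZFS,
 * that one fsync also forces out the directory change as a dependency,
 * so the separate open()+fsync() of the directory is no longer needed.
 */
int
write_file_durably(const char *tmp_path, const char *final_path,
    const void *buf, size_t len)
{
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* These writes are asynchronous until the fsync below. */
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return (-1);
	}

	/* rename() itself is asynchronous; do it before the sync. */
	if (rename(tmp_path, final_path) != 0) {
		close(fd);
		return (-1);
	}

	/* The single fsync that makes both the data and the rename persist. */
	if (fsync(fd) != 0) {
		close(fd);
		return (-1);
	}

	return (close(fd));
}
```

The key change from the original workflow is simply the ordering: rename before the one fsync, so there is no second fsync on the directory waiting behind everyone else's queued sync writes.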
So then, what else is causing things to take so long? It's flame graph time. The slide is a little small, but you can see the big chunks of pink on there: those two large purple areas toward the right side of the graph, the second- and third-last slots there. We were spending about 70% of the CPU time waiting for the vdev queue lock. And I was like, well, that doesn't seem right. We tried a couple of different things, including partitioning up the drives and using a bunch of separate vdevs to try to reduce the contention, but it didn't seem to make any difference. Then we dug a little further and found it was the pool-wide I/O stats. And then I remembered that Alexander had already removed that in newer ZFS, but the customer was still using 0.8 because of Ubuntu. So once we upgraded them to a newer ZFS, that big performance penalty went away and suddenly things were much faster.

At this point we still thought that a large part of the bottleneck was just that fsyncs take a long time, right? Drives take a while when you ask them to promise that the writes you sent are definitely going to persist. So we wondered how much of the latency we were seeing on writes was coming from the hardware. We tried profiling the vdev disk I/O flush command, and then we even went so far as to set the no-cache-flush flag to one so that ZFS wouldn't actually flush at all; it would just assume that it worked. But even when we completely skipped all the synchronization requests to the hardware, the performance didn't change. Well, that's odd. We looked at it, and it turns out the latency wasn't high there: the hardware is very fast, and it wasn't the bottleneck. So then we wondered how much of it is ZFS overhead. Doing what you should never do, we set sync=disabled on the dataset so ZFS would just not do that extra work, and suddenly the latency dropped from 25 milliseconds to 14 and we were seeing five gigabytes per second instead of two and a bit. It's like, oh, okay. So ZFS is taking a long time to do some of this work, and pretty much all of our bottlenecks right now are in the software, not the hardware, which at least we can fix, because we can't change the hardware they already bought.

So now, looking at what we went and did about it. First, we improved the way the work is dispatched to the disks. Then we looked at reducing the lock contention around that, so we're not spending a bunch of our time just waiting to be able to queue the work. And then we found there were some artificial delays happening in these transactions, and we were like, well, get those out of here. So what were the bottlenecks? We started by doing some off-CPU flame graphs, looking at when the CPU basically wasn't getting used; it was left sleeping because we were waiting for a lock or something. And we saw a lot of time being wasted getting the locks on the task queues that ZFS uses to actually dispatch the work to the disks. So we asked: are there idle task queues that could be doing this work and we're just picking the wrong ones, or how does that work? Currently, the task queues are split up into groups of processes that each have threads, but the locks are per process. So could we use more processes instead of threads, and have more work queues that lock independently, and maybe reduce the contention? We also noticed that the system seemed to be spending a lot of its time waiting for aggregation to finish before actually doing the work. That leaves the NVMe underutilized, and we're not actually saving any IOPS, because if we just issued all the work independently, the drive would still be able to keep up.
So every bit of time we spend trying to save work, we're just causing work that could be done not to get done. Could we just not do that? It turns out that, to some degree, yes. When you have multiple task queues for reads and writes, ZFS decides how to size them based on a mix: some task queues are just statically sized and some are dynamic. In the code, there's a bunch of macros that configure these task queues. The first one is just _N, which says make a single process that has this many threads. Then there's batch mode, which makes a single process with a number of threads based on a sysctl that sets what percentage of your CPU cores to use; so we'll spin up to that many of your CPU cores doing this work in ZFS. But Mav, again being our hero in the performance arena, added a scale mode, which tries to balance a mix of processes and threads based on the number of CPUs. It tries to always have six threads for each process, and then scales the number of processes such that there are always fewer processes than there are threads in total, and there is a tunable to let you override that default of about six if you want a different number. And then there's _P, where you can just manually say this many processes with this many threads. In ZFS, this is all defined statically in the code. There aren't tunables for it, partly because you'd need something like 30 new tunables to do that, and it would be very confusing to the user. So I don't think that's the right answer, but I don't think having it static, and trying to cover everything from a VM with one CPU to a machine with 112 threads, is ever going to get it exactly right either.

The queues you have are: the issue queue, where we go to actually do some work; a high-priority issue queue, so if the work you're trying to do is really important, you don't have to wait in line in the normal issue queue, a priority lane like at the airport; the interrupt queue, where the completion goes once the work is done and the device comes to tell you it's finished; and a high-priority queue for the interrupts as well. So there are those four queues for each type of I/O in ZFS: the null type, which doesn't do anything so we don't really need to talk about it, and then our reads and writes, frees, claiming space, ioctls, trim, and so on. The important thing is that for reads we have exactly eight threads that go and issue the reads to the disk. There's no high-priority thread for reads, because basically almost every read is synchronous anyway. And we use scale mode, the mix of processes and threads, to receive the interrupts once those reads are done. But for writes, we have this separate high-priority queue as well, and it has exactly five threads in one process, and that's it. And that's where we found that when you have 112 threads on your CPU and, in this case, 24 really fast NVMe devices, five threads just wasn't enough to get all that work done. So it was a bit of a curse of the magic number. That number made sense when it was put in, because there weren't any systems with 112 CPU threads and nobody had NVMe and so on, but now that number isn't enough to get all the performance you could get out of NVMe. And if we just blindly increase it too much, then we end up overwhelming hard drives, and that doesn't work either.
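For reference, this is roughly how that static table reads in module/zfs/spa.c around the OpenZFS 2.0/2.1 era that the talk is describing; the exact entries change between releases, so treat it as an illustrative excerpt rather than current source.

```c
/*
 * Approximate excerpt of the zio taskq configuration discussed above.
 * ZTI_N(n) is n threads in one process, ZTI_BATCH sizes one process from
 * the batch-percentage sysctl, ZTI_SCALE mixes processes and threads
 * based on CPU count, and ZTI_ONE / ZTI_NULL are one thread / no taskq.
 */
const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
	/* ISSUE	ISSUE_HIGH	INTR		INTR_HIGH */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* NULL */
	{ ZTI_N(8),	ZTI_NULL,	ZTI_SCALE,	ZTI_NULL },	/* READ */
	{ ZTI_BATCH,	ZTI_N(5),	ZTI_SCALE,	ZTI_N(5) },	/* WRITE */
	{ ZTI_SCALE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* FREE */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* CLAIM */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* IOCTL */
	{ ZTI_N(4),	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* TRIM */
};
```

The ZTI_N(5) in the WRITE row's ISSUE_HIGH column is the "exactly five threads in one process" being described here, and it stays five whether the machine has four CPU threads or 112.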
So how do we come up with a better system for that? We looked at switching to scale mode instead of just a static number, so that we make multiple processes, which we'll talk about in a second, and more threads based on however many CPUs you have, in order to get more of these synchronous writes sent to the NVMe as fast as possible. With those changes, suddenly, with sync turned back on and everything back to normal, our performance was now three and a half gigabytes per second of writes, about 70% better than when we started, and the latency was only 15 milliseconds instead of 25. Opening and creating the file took about 300 microseconds, the writes took about one millisecond, and then the fsync at the end was about 13 milliseconds. And our reads were actually faster now as well, getting another 10% or so of improvement there, partly because of the I/O stats removal and so on.

So now we have more task queues, but are we using them? And do we need to change the scaling of any of the other task queues? Our first suspect here was the ioctl task queue, because the flushes we're sending to the disks are ioctls, but it turns out that because the ZIL is single-threaded, we're never actually sending more than one at once, so scaling that task queue didn't make sense. The read issue queue, again, is currently statically sized; it might make sense in the future for that number to be bigger when you have NVMe, but we haven't had a chance to dig into that yet. We were mostly worried about writes in this case. And should there be a separate high-priority task queue for reads? Currently there isn't, because almost all the reads you're doing are high priority, but we do have prefetch and such, and maybe it would make sense to separate those. But yes, making these tunable by regular users seems awfully difficult, because it's this two-dimensional array: you'd need 30-plus new sysctls, and that seems a bit unwieldy for the end user as well.

The other thing we looked at is how work gets put into those task queues. Originally there was one process, so there's only one lock. So we looked at how it picks which task queue to put work in when there are multiple, and it basically takes the current high-resolution time modulo the number of task queues to just pick one. But that means two things that happen very close together are likely to try to hit the same task queue, and that didn't make sense. So we wrote a solution that basically try-locks the task queue, and if it can't get the lock immediately, it tries the next task queue over, until it's looped all the way around, and only then does it go to the normal sleep and wait for its turn to get into the task queue. With that, we saw a bunch of improvement in the performance.
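A minimal, self-contained sketch of that dispatch idea might look like the following. This is not the actual ZFS patch; the taskq type and field names here are illustrative rather than real OpenZFS symbols.

```c
#include <pthread.h>
#include <time.h>

typedef struct {
	pthread_mutex_t	tq_lock;
	/* ... queue of pending work would live here ... */
} taskq_t;

/*
 * Pick a task queue: start from the time-based choice (the original
 * behaviour), but prefer whichever queue is not currently contended,
 * and only block on the original choice after a full loop around.
 * The chosen queue is returned with its lock held.
 */
static taskq_t *
pick_taskq(taskq_t *tqs, int count)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);

	int start = (int)(ts.tv_nsec % count);

	for (int i = 0; i < count; i++) {
		taskq_t *tq = &tqs[(start + i) % count];
		if (pthread_mutex_trylock(&tq->tq_lock) == 0)
			return (tq);	/* caller enqueues work, then unlocks */
	}

	/* Every queue was busy: wait our turn on the original choice. */
	pthread_mutex_lock(&tqs[start].tq_lock);
	return (&tqs[start]);
}
```

The point is simply that an uncontended queue, if one exists, gets the work, instead of whichever queue the timestamp happened to hash to.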
But the big one was the commit delay. It turns out that in ZFS, when you ask for some stuff to be written to the slog, it will say all right, but it assumes there's going to be some latency there, so it waits 5% of the average latency of the previous commits, hoping that other things can come in and it can aggregate those into one flush. On hard drives, or probably even SATA SSDs, that's useful, but with NVMe, 5% of 10 microseconds is not a measurable amount of time, and the sleep just causes a context switch, so we should just not do that. (Mav gave me a better idea of how to handle this, but we'll talk about that later.) Going to sleep for a very, very short amount of time is useless here, and we should just skip it. And while the 5% is tunable, you can't set it to less than 1%, so we wrote a patch to fix that. Again, by trying to conserve IOPS, we're actually causing our NVMe to go to sleep instead.

The last problem we ran into is that all of this tuning is system-wide, which means that while the default is to use 70% of your CPU cores for some of these task queues, if you have seven pools on your system, you've now allocated 525% of your CPU cores to doing this work, and that probably doesn't make sense. So we're going to look at possibly having one set of task queues that ZFS uses for all the pools, or somehow auto-tuning some of this down, so that when you have multiple pools you don't end up committing more than all of your CPUs, because that can cause you all kinds of headaches: you end up with the pools fighting each other for CPU time and reducing your overall performance by a lot.

So with those patches we got another 5% or so, and we're at about 3.6 gigabytes per second of writes. From this graph you can see the sync time, in blue, went way down compared to before, because we were getting more of that work done and not spending a bunch of time trying to optimize it. Then, if we got rid of the reads and configured the benchmark to do just writes, we actually got to about 4.5 gigabytes per second to each of the two pools, but the limit is still in ZFS, not the hardware. If we used just one pool instead of two, we actually got to just shy of 10 gigabytes per second of writes, with a latency of only seven milliseconds.

Some of the tuning we looked at: there are the sysctls we talked about for tuning what percentage of your CPU cores is allocated for each of the different types of task queues, and the other big one is the dynamic task queue threads setting. A lot of these task queues, instead of making 112 threads or some fraction like that, will only spin up more threads if you're actually using them. That optimization is more for machines that aren't dedicated to ZFS. If you're using it on your desktop, you don't want 70% of your CPUs to always have threads that are each doing maybe a little bit of work, so it spins down a lot of those threads. Great on a laptop, but on a server dedicated to storage, you're better off just having them always be there instead of starting and stopping them all the time.

Our future investigation is mostly into write contention. We noticed that as soon as we stopped doing a little bit of reads along with our writes, our writes got a lot faster. Why is that, and can we fix it? We also saw that if we focused on just really big writes, we spent a lot of time fighting with memory instead of actually getting work done, while with very small writes we spent a lot of time on locking instead of getting work done. And a couple of other things: as the number of files in a directory got bigger, we noticed the creation time of a file kept getting bigger, and we're currently working on that. If you're interested, stay tuned to the OpenZFS leadership call videos on YouTube every month, and we'll report our progress there.

Okay, any questions? Yes, I'll need to repeat the questions. Right, so the question was about whether we could get rid of the commit aggregation delay, and whether that made any real impact. And yes, we changed the minimum from 1% to 0% and then skipped the sleep, but Matt pointed out that what might make more sense is to leave the 5% value at its default, but if 5% of the average latency is smaller than some threshold, then we know it's a fast enough device and we can just avoid the sleep altogether. And yeah, it does end up making a difference in the number of operations you can finish per second.
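As a rough illustration of that idea, here is a hedged sketch. All names and the threshold value are made up for the example; this is not a committed patch or actual OpenZFS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default: wait 5% of the average commit latency before flushing. */
#define COMMIT_TIMEOUT_PCT	5
/* Assumed cutoff below which sleeping costs more than it saves. */
#define MIN_USEFUL_DELAY_NS	10000	/* 10 microseconds */

/*
 * Decide whether a log commit should sleep briefly to aggregate more
 * writes, or be issued immediately because the log device is fast.
 */
static bool
should_delay_commit(uint64_t avg_commit_latency_ns, uint64_t *delay_ns)
{
	*delay_ns = avg_commit_latency_ns * COMMIT_TIMEOUT_PCT / 100;

	/*
	 * On an NVMe slog the average commit latency might be around 10us,
	 * so the computed delay is around 500ns: cheaper to issue right
	 * away than to take a context switch going to sleep and waking up.
	 */
	return (*delay_ns >= MIN_USEFUL_DELAY_NS);
}
```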
Dave? The customer did not... ah, but yes. So the question was whether the customer's application, which is also written in Go, behaves the same as the benchmark, and whether that was part of the issue. We didn't really look at that. The main thing we fixed in the benchmark was the way it was generating randomness, because that was using too much CPU. But it spins up all its threads at the beginning, so it didn't seem to be related to the benchmark or the application. As we found when we started switching things on and off between the hardware and ZFS, all of the delay seemed to be ZFS's fault.

Right, so Dave's question was about NUMA and CPU affinity. We didn't even get that far, because the low-hanging fruit is much lower than that. Currently ZFS has no awareness of NUMA, other than that you can tell the task queues to pin themselves to a CPU instead of being able to migrate. There's lots more that could be done there, but there are easier fixes before we get to that point. So, lots more future work.

Albert. Yeah, so the question was: currently the tunables are all system-wide, but rather than making them pool-wide, would you actually want them per vdev? And the answer to that is yes, and that is why I went and created the vdev properties feature, where you're able to set different settings on individual vdevs. I think the first ones going in there are about fault management currently, but yes, that's part of the idea, although we still haven't figured out whether we want inheritance on vdev properties or not. But yes, that was part of the idea, again because you might have the case of a mixed pool where we want to treat the hard drives one way, but the special vdev and the slog are SSDs and need to be treated differently.

So the question was: do you think there's a point where we want to switch from interrupt-driven to more polling? Possibly. I think there are enough wins we can get without having to re-architect things, but at some point it might be worth looking at that, or, in general, at ZFS being more aware of the fact that these disks now have multiple queues on multiple CPUs and we'd want to spread the load out, and that gets back into the NUMA stuff. Each of those NVMe queues is on a different CPU, and we'd want to try to use the one that's on the same CPU as where we want the data to end up when it's done.

So the question was about the fact that there currently is no high-priority read task queue, and whether it would make sense. I think right now the only non-high-priority reads we have are prefetches, so we just deal with those a little differently, but yeah, it might be interesting to have certain reads that are even higher priority than most reads: the application is waiting for the data, so let's get it to them as quickly as we can. Yeah, Eric pointed out that maybe you'd want certain datasets to have higher priority there. That's possible, although it might make more sense to solve that with some kind of QoS at the dataset layer instead.

Okay, so we have an hour and fifteen minutes after this, the big break, so you can come speak to Allan.