Okay, so hello, my name is Allan Jude. I'm a FreeBSD and OpenZFS developer, and today I'm going to explain to you, like you were five, how the ZFS ARC, the Adaptive Replacement Cache, works. Like I said, I'm a FreeBSD core team member and developer as well as an OpenZFS developer, and for my day job I work at Klara, which is a FreeBSD professional services and support company.

So today we're going to talk a bit about how the caching system in ZFS works. We'll start with a very basic overview of what ZFS is, just in case you haven't encountered it before, and explain why you should start using it. Then we'll talk a bit about how caching works in general, why the way ZFS does it is a bit different, and then get into more advanced topics like the compressed cache.

So, starting from the beginning: what is ZFS? The biggest difference between ZFS and most file systems is that it combines the concept of the volume manager, something like the Solaris Volume Manager, LVM, or mdadm, into the file system. You take a number of disks and create a pool of free space, and then you build file systems on top of that, and each file system only takes the free space it needs. Basically, each file system is thin provisioned, so all of your free space is available to any one of the file systems you create on top. You can create hundreds of file systems, they each only take the space they need from the pool, and you can expand the pool later. It can also create what are called ZVOLs, which are block devices that you can put other file systems on top of, or use with something like NBD or iSCSI.

Another big difference in ZFS over most file systems is that every block of data written to the disk is also checksummed. When you read the data back, we verify the checksum and make sure the disk hasn't accidentally scrambled the data, ensuring that when we give data to an application, it's always in exactly the same form it was in when it was written to the disk.
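To make that read-time verification concrete, here's a minimal Python sketch of the idea; the SHA-256 and the dict-based "disk" are stand-ins I've chosen for illustration, not ZFS's actual fletcher4/SHA-256 machinery or on-disk layout (in real ZFS the checksum lives in the parent block pointer, not next to the data).

```python
import hashlib

def checksum(data: bytes) -> str:
    # Stand-in for ZFS's block checksum (fletcher4 / SHA-256).
    return hashlib.sha256(data).hexdigest()

# A toy "disk": block address -> (stored checksum, stored data).
disk = {}

def write_block(addr: int, data: bytes) -> None:
    # The checksum is recorded when the block is written...
    disk[addr] = (checksum(data), data)

def read_block(addr: int) -> bytes:
    # ...and verified on every read, so silently corrupted data is
    # detected instead of being handed back to the application.
    stored_sum, data = disk[addr]
    if checksum(data) != stored_sum:
        raise IOError(f"checksum mismatch at block {addr}")
    return data

write_block(0, b"hello")
assert read_block(0) == b"hello"
```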
Another feature that will come into play during the talk is transparent compression. Optionally, you can enable compression in ZFS, and then all the data you write to disk will be compressed first. In the past this wasn't that useful because of the performance impact of things like gzip, but with newer algorithms like LZ4 and Zstandard the compression speed is so good that it's still faster than your disk, even if it's an SSD, so there's no performance penalty compared to not having the compression.

ZFS is also a copy-on-write file system, so it never overwrites data in place. It always writes the updated data to a different location on disk and then updates the pointers. This way, if the system crashes or the power goes out or something happens to the file system, when it boots back up it finds the newest version of the file system where all the checksums match and mounts that. You never have to run fsck or worry that the file system could be damaged. One of the reasons it can do this is precisely because it's copy-on-write: if you're in the middle of updating a file when the system crashes or the power goes out, you haven't overwritten half the file and left the other half stale, because you're always writing data to a new place, so the original version of the file still exists.

And then, to make administration easier, each file system has a set of properties that let you control various things about it, like whether it uses compression or not, and these are inherited as you create child file systems. So it's very easy to control all the parameters of the file systems.

But the main advantage of being copy-on-write is that you can have instantaneous snapshots. Unlike snapshots in LVM, there's no performance penalty to having them, because all a snapshot does is not free the old data when the new version is written. Taking a snapshot takes no time, and having a snapshot doesn't impact the performance of the file system. This allows you to freeze a file system at a point in time and go back and reference it later, or to serialize it and replicate it to another machine. It doesn't take any additional space until you actually overwrite a block, so if you take a snapshot and then don't modify the file system, it won't consume any space; when you do overwrite files, only the blocks you actually overwrote start consuming space in the snapshot.

Snapshots are read-only, but there's another concept called a clone of a file system, where you can make a writable version that still shares space the way a snapshot does. That effectively allows you to fork a file system, so you can take an image of a database or something before you upgrade it, fork it, and keep the other version. It's very useful for doing development on databases and so on, where you can snapshot a version of the database, test against it in development, and then throw it away, all the while never overwriting the production copy of the database, but sharing blocks with it, so you don't need twice as much space.
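To make the snapshot space accounting concrete, here's a tiny sketch under my own simplified model (reference-counted blocks in a dict, nothing like ZFS's real on-disk structures) of why taking a snapshot is free until blocks are overwritten.

```python
# Toy copy-on-write store: a "file system" is just block number -> block id,
# and blocks are shared by reference count.
blocks = {}          # block id -> data
refcount = {}        # block id -> number of pointers to it
next_id = 0

def put_block(data):
    global next_id
    blocks[next_id] = data
    refcount[next_id] = 1
    next_id += 1
    return next_id - 1

def snapshot(fs):
    # A snapshot is just a copy of the pointers; every shared block
    # gains a reference and no data is copied.
    for bid in fs.values():
        refcount[bid] += 1
    return dict(fs)

def overwrite(fs, blkno, data):
    # Copy-on-write: never modify the old block, allocate a new one.
    old = fs[blkno]
    refcount[old] -= 1           # the live file system lets go of the old block
    if refcount[old] == 0:
        del blocks[old]          # only freed if no snapshot still holds it
    fs[blkno] = put_block(data)

live = {0: put_block(b"v1"), 1: put_block(b"v1")}
snap = snapshot(live)            # instant, consumes no extra data space
overwrite(live, 0, b"v2")        # one old block is now kept only for the snapshot
```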
So, caching. Computers have multiple tiers of storage. You have very high-speed storage like the L1 cache that's built into the CPU, then slightly slower storage like RAM, and now we have things like non-volatile memory, which isn't as fast as your volatile memory but is still faster than your fastest SSD. Then you have disks, which can go all the way down to being really slow; think SD cards. Back in 1946, when describing computers, von Neumann said: "We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each having greater capacity than the preceding one, but being less quickly accessible."

The point of a cache is to use that smaller but faster memory to keep the most commonly accessed data, so that we spend less time going to slower storage. The cache is just a copy, so we can discard that data at any time, unlike on the regular hard drive, where we don't want to throw away data just because you're not using it. But as you go into faster and faster tiers, the amount of storage you have gets smaller and smaller, so you have to decide which data you want to keep in the cache, because that faster storage close to the CPU is precious. You need some kind of algorithm to decide which data to keep and which will be the most effective for you. We usually use the main memory of the system as the biggest cache, and this is why most operating systems have a buffer cache.

Almost all caches in use today are based on the LRU algorithm, which means "least recently used". The cache is basically a list of the commonly accessed things: every time you access something, it goes to the top of the list, and once the list is full, you delete items from the bottom to make room to put a new item at the top. That algorithm is from 1965 and has stood up very well, but it isn't necessarily the only way to do it.

So in this diagram, in the first step we have data block A, the rest of the cache is empty, and we've used that first block, so we've filled up one quarter of the cache. In the second step there's data block B, then C, and we keep adding to the cache until it's full. Once we've added D in step three, the cache is full, so when we want to add another item we have to find a victim to evict. We want the item that was used the longest ago, because it's the one we're least likely to use again; in this case we overwrite A with data block E. In the next step we access data block D a second time, so we refresh it and give it a newer generation number, and when it comes time to add data block F, we again overwrite the oldest item in the cache. We just keep doing this, always keeping the most recently used data in the cache, because it has the highest chance of being reused.

There are some pros and cons to this simplistic algorithm. It's usually implemented as a doubly linked list, which means it's very low overhead and not complicated to implement; some LRU caches are implemented purely in hardware, so it had to be very simple. And the advantage is that the locality principle says that if a process is visiting a location in memory, it's likely to revisit that same location or somewhere very near it; it's more likely to come back to the same place than to go to any other random place.
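Here's a minimal sketch of that LRU behaviour, just to make the list-and-evict mechanics concrete; real implementations are usually a doubly linked list plus a hash table, and this isn't how any particular OS or ZFS does it.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None              # miss: caller fetches from disk
        self.items.move_to_end(key)  # a hit refreshes recency
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        self.items[key] = value

cache = LRUCache(4)
for blk in "A B C D".split():
    cache.put(blk, "data-" + blk)
cache.put("E", "data-E")             # cache full: A, the oldest, is evicted
cache.get("D")                       # touching D makes it the newest again
cache.put("F", "data-F")             # so B, now the oldest, is evicted instead
```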
The downside to something like LRU is that it somewhat ignores frequency. Just because I access this data once every five minutes doesn't mean it's more precious than other data that I access many times a second; both will stay near the top of the list and not be purged, but the algorithm doesn't consider how frequently you access something, just how recently. The other problem is that it doesn't adapt over time; it doesn't look for patterns or anything like that. And the biggest downside to something like LRU is that it can be disrupted by a large scan. If you take a backup of your whole hard drive or something, your small LRU cache constantly fills up with files you're accessing one time only, and the cache basically becomes useless, because during a backup you read every file once and never read the same block twice. It doesn't consider recent history to try to improve the algorithm.

So there's a second algorithm that came around in 1971 called LFU, which is based on the frequency with which you use the data rather than the recency. It's the same idea, except instead of keeping the timestamp of the last time we accessed the data, we keep a counter of how many times we've accessed it. Each time we access a page or a block, we increase its counter, and the cache is basically a sorted list; when it's time to make room, we delete the item that has the lowest hit count. When the cache is full, we evict the least frequently used item.

Unlike an LRU cache, this cache is immune to the scanning problem: if you take a backup or read a whole database, you've only visited each of those blocks once, so they're not going to push blocks that have been accessed many times out of your cache. That gives it the advantage of being immune to large scans and providing very good performance if you're frequently accessing the same data, because the advanced locality principle says the probability of visiting a location increases the more times we've visited it. If we've visited this location a thousand times, it's more likely we'll visit it a thousand-and-first time than visit some other location. But because you now have to keep the list sorted, it's more complex to implement, and it has the opposite problem: it doesn't consider recency. Just because I accessed this file a million times last week and haven't touched it in a week, it still has more hits in its counter than any new data, so the cache doesn't keep up with what you're actually doing on your system very well. It can accumulate data that you used to use a lot but aren't using anymore, and then your cache hit ratio drops because the cache is full of data you're not even using anymore.
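And a matching LFU sketch; again purely illustrative, and a real LFU would keep entries in frequency buckets or a heap rather than scanning for the minimum on every eviction.

```python
class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}    # key -> value
        self.hits = {}    # key -> access counter

    def get(self, key):
        if key not in self.data:
            return None
        self.hits[key] += 1              # every hit bumps the frequency counter
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the entry with the lowest hit count.
            victim = min(self.hits, key=self.hits.get)
            del self.data[victim]
            del self.hits[victim]
        self.data[key] = value
        self.hits[key] = self.hits.get(key, 0) + 1

cache = LFUCache(2)
cache.put("hot", 1)
for _ in range(1000):
    cache.get("hot")                     # "hot" accumulates a huge counter
cache.put("scan-1", 2)                   # a big one-pass scan...
cache.put("scan-2", 3)                   # ...evicts other scan blocks, never "hot"
```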
So in 2003, two researchers gave a paper at the USENIX FAST conference called the Adaptive Replacement Cache. The concept is to combine the best of the LRU and LFU caches, along with some novel tweaks of their own, to make the most efficient cache possible.

You take your cache of whatever size (in the algorithm, the total size is the variable called c) and you partition it; by default you start by partitioning it in half. So however much memory you have to cache with, half of it will keep track of the most recently used blocks, because the LRU algorithm has proven to be pretty good and is what every OS buffer cache uses, and the other half will be the LFU, so that frequently accessed blocks can stay in the cache. In addition to those two, we also keep ghost lists, which I'll explain more in a second.

So we have these two caches and they're full, and when we have to remove an item from the cache to make room for something fresher, what we do is throw away the data but keep the hash key and put it on a ghost list. We keep track of items that were recently in the cache but whose data we've thrown away because the cache was full. So we have four lists: the LRU and the LFU, and each of those also has a ghost list, which doesn't contain any data, just the hashes of things we've recently evicted from the cache.

If we request a page and it happens to be in the LRU or LFU cache, that's a hit: we can pull the data out of RAM instead of having to go to the slow disk, and the cache has done exactly what its job is, providing faster access to that data. But once we get a hit, that means the block was already in the cache because we used it once, and we've now used it a second time, so it gets moved from the LRU half of the cache to the LFU side; it's now been accessed more than once, so it goes on the frequently used list instead of the recently used list.

If a page is not on either of those two lists, then we look at the ghost lists. If it's on a ghost list, it means it used to be in the cache, but our cache wasn't big enough, so we had to throw it away. That means if the cache had been just a little bit bigger, the thing on the ghost list would still have been in the cache and we would have had a hit. We use that information to modify how we split up the cache. By default we start with half recently used and half frequently used, but if we find out that the block we just tried to read used to be on the recently used list and was deleted to make room for something else, we make the MRU, the most recently used cache, slightly bigger, stealing some of the space from the frequently used cache. And we keep doing this: every time we get a miss that could have been a hit if the recently used list had been a little bit bigger, we steal more from the frequently used list and make the recently used list larger, sliding the slider from 50 percent towards 100 percent. But if the hit was on the frequently used ghost list, we slide that slider back the other way. This is what makes the cache adaptive: as your usage changes on your system, it decides which cache is more effective.
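Here's a deliberately simplified sketch of just the adaptive part: two data-holding lists, two key-only ghost lists, and a target p for the recency side that moves on ghost hits. It leaves out most of the real ARC (and all of the ZFS-specific changes discussed later), so treat it as an illustration of the idea rather than the algorithm from the paper.

```python
from collections import OrderedDict

class TinyARC:
    def __init__(self, c):
        self.c = c                     # total cache size, in entries
        self.p = c // 2                # target size of the recency (MRU) side
        self.mru, self.mfu = OrderedDict(), OrderedDict()              # hold data
        self.mru_ghost, self.mfu_ghost = OrderedDict(), OrderedDict()  # keys only

    def _evict(self):
        # Evict from whichever side exceeds its target, keeping only the key
        # (the "hash") on the matching ghost list; the data itself is dropped.
        # Ghost lists are left unbounded here; the real ones are size-limited.
        if self.mru and (len(self.mru) > self.p or not self.mfu):
            key, _ = self.mru.popitem(last=False)
            self.mru_ghost[key] = None
        elif self.mfu:
            key, _ = self.mfu.popitem(last=False)
            self.mfu_ghost[key] = None

    def access(self, key, fetch):
        if key in self.mru:                    # second hit: promote to frequency side
            self.mfu[key] = self.mru.pop(key)
            return self.mfu[key]
        if key in self.mfu:                    # already frequent: refresh its position
            self.mfu.move_to_end(key)
            return self.mfu[key]
        if key in self.mru_ghost:              # recency side was too small: grow p
            self.p = min(self.c, self.p + 1)
            del self.mru_ghost[key]
        elif key in self.mfu_ghost:            # frequency side was too small: shrink p
            self.p = max(0, self.p - 1)
            del self.mfu_ghost[key]
        value = fetch(key)                     # real miss: go to the slow disk
        if len(self.mru) + len(self.mfu) >= self.c:
            self._evict()
        self.mru[key] = value                  # (the paper would place ghost hits on the MFU side)
        return value
```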
If you have a database where you're using the same data frequently, the frequently used cache will get bigger and give you a better hit rate. But if you don't very often use the same data over again, and instead use data that you've recently accessed, then the recently used list will grow. This allows the ARC to give you a better hit rate than either an LRU or an LFU cache would have given you, by giving you a hybrid of both that best suits your current needs. Each time we find a block that could have been in the cache but isn't, we change the split of the cache, until we find the mix of the two types that gives the best hit rate.

But ZFS doesn't stop there. You only have so much memory in your system, and sometimes it's physically not possible to install more; there are only so many slots on the motherboard for RAM. So ZFS has a level-two adaptive replacement cache, the L2ARC. This allows you to use a fast storage device like an NVMe or SSD to cache data even when you're running out of memory. Rather than waiting until it's about to kick an item out of the cache to make room for a new one, it watches the bottom of the two lists, and as an item gets towards the end, where it will eventually get kicked out, we copy it and write it to that SSD. Then we can remove the data but keep the hash in the cache, along with a pointer saying this data is now on this SSD at this address. That lets us go from only having, you know, hundreds of gigabytes of RAM to having terabytes of L2ARC. It still takes a small amount of RAM, but not as much as buffering the actual data in memory.

To avoid the problem with large scans again, where if you're reading the entire data set you don't want to copy all the data you've only used once out to the SSD, there's a rate limit on how fast we write to the SSD. It won't take everything that's about to fall out of the cache and put it on the SSD; it only takes a fraction of it, to give us the best chance of keeping the data you're actually going to use again on that SSD. And your SSD has limited write cycles, it will only store so much data before it wears out, so we wanted a limit on that too. With this we've now made a very large cache that optimizes itself based on our workload, even as that changes over time, to get the best possible cache hit ratio.
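Here's a rough sketch of that rate-limited L2ARC feed under my own simplified model; the real feed thread is considerably more involved, though the tunables it honours really are called l2arc_write_max and l2arc_write_boost.

```python
WRITE_MAX = 8 * 1024 * 1024    # bytes per feed pass (cf. l2arc_write_max)
WRITE_BOOST = 8 * 1024 * 1024  # extra allowance until the device first fills

def l2arc_feed_pass(arc_tail, l2arc_device, device_has_filled):
    """Copy a rate-limited amount of soon-to-be-evicted data to the SSD.

    arc_tail: list of (key, data) entries near the eviction end of the ARC lists.
    l2arc_device: dict acting as the SSD, key -> data.
    """
    budget = WRITE_MAX + (0 if device_has_filled else WRITE_BOOST)
    written = 0
    for key, data in arc_tail:
        if written + len(data) > budget:
            break                      # rate limit hit: the rest simply isn't cached
        l2arc_device[key] = data       # the header kept in RAM now points at this copy
        written += len(data)
    return written
```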
But it turned out even that wasn't always good enough. One of the companies that works on ZFS, Delphix, had a customer whose database had grown to about a terabyte, and they found that as soon as the database exceeded the amount of RAM they had to cache that data, the performance got terrible. But it wasn't physically possible to put more than 768 gigs of RAM in the server; it only had so many slots.

So what they came up with was the compressed adaptive replacement cache. ZFS is already doing transparent compression: as we write data out to disk, we first compress it and then store it in compressed form. ZFS has an optimization where, if after compressing a block there isn't a saving of at least about 12 percent, we throw the compression away and store the block on disk uncompressed, so that we won't spend all of our time decompressing data that barely compressed. But this means the data on disk is almost always compressed, especially when it's a database, because if your database is full of text and numbers and so on, it can compress four or eight to one.

The way the ARC worked before this new invention, we would read the data off the disk, decompress it, and then cache the decompressed data, the original form of the data, and when the application needed it we would return it. With the compressed ARC optimization, we basically delay the decompression step: after we read the data from the disk, we store the compressed version in the cache, and when an application wants the data, we take the copy from the cache in memory, decompress it, and feed it to the application.

This means that when you access a file many times, even though it's coming from the cache every time, you're still decompressing it each of those times. But because the LZ4 compression that's used by default is so fast, and can decompress at two to ten gigabytes per second per core depending on the type and speed of your CPU, that's going to be faster than you could have read it off the disk anyway, so it doesn't actually introduce additional latency. And storing the data compressed can actually give you lower latency, because you're reading fewer bytes off the disk. With this step, we're now storing the data in the cache in compressed form, where it's commonly anywhere from half the size to many times smaller, so the cache can hold that much more data and we get that much higher a cache hit ratio.

With this change, that one-terabyte database could now fully fit in memory in only about 460 gigs of RAM, leaving them a couple hundred gigs to grow into in the future while still having a hundred percent cache hit ratio: every time they wanted to read any block in their database, it would come from the cache in memory.
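A minimal sketch of that "store compressed, decompress on every hit" idea, using zlib from the Python standard library purely as a stand-in (ZFS itself uses LZ4 or Zstandard, and its buffer, checksum, and dbuf layers are far more involved):

```python
import zlib

MIN_SAVINGS = 0.125   # roughly the "only keep it compressed if it's worth it" rule

arc = {}              # key -> (is_compressed, bytes); the ARC keeps the on-disk form

def cache_block(key, raw: bytes):
    comp = zlib.compress(raw)
    if len(comp) <= len(raw) * (1 - MIN_SAVINGS):
        arc[key] = (True, comp)      # cache (and store) the compressed form
    else:
        arc[key] = (False, raw)      # barely compresses: keep it uncompressed

def read_block(key) -> bytes:
    is_compressed, buf = arc[key]
    # Decompression is deferred to the moment the application asks for the data;
    # it happens on every hit, but LZ4-class decompression beats going to disk.
    return zlib.decompress(buf) if is_compressed else buf

cache_block("blk0", b"row data, row data, row data " * 100)
assert read_block("blk0").startswith(b"row data")
```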
There is a slight optimization on top of this as well: there's a second, small cache called the dbuf cache that keeps commonly used decompressed blocks. Basically, when you decompress a block, we keep the uncompressed version in a very small buffer in memory, so that we don't end up decompressing the same block many times per second. So while we can end up decompressing a block many times over its life, we won't do it many times per second, because that would be wasteful.

But even then the algorithm wasn't perfect. The other thing the ARC now considers is metadata. Especially in ZFS, where you need things like the checksum of every block and the tree of indirect blocks that makes copy-on-write possible, there's a lot of metadata about each block on disk, and you need that metadata to be able to read files quickly. At the same time, you want your cache to actually contain some data, so by default ZFS limits the amount of metadata it will keep in the cache to one quarter of the cache. You can tune that depending on your workload: if you have a very large number of files and you scan them frequently, you might actually want to keep more metadata and less actual data. But in most cases you want to limit the amount of metadata in the cache, to keep it from taking all of the cache space and leaving none for actual data.

Compared to the original ARC algorithm described in the academic paper, ZFS's implementation had to be a bit different, for a number of reasons. The first is that the size of the cache is not fixed: ZFS's ARC can be shrunk if the system needs memory. By default ZFS will take a large fraction of all the memory on the system, up to 95 percent, but depending on the machine, if it's your laptop, your browser is probably taking a good chunk of the available memory, so ZFS has the ability to shrink its cache and give memory back to the operating system for user applications. So unlike the original algorithm described in the paper, this cache is dynamically sized: in addition to changing the partition value, the p-value that decides how much is for MRU and how much is for MFU, the total size of the cache can grow and shrink over time.

And unlike in the original ARC paper, you can't evict just any data that's in the ARC. Sometimes the data in the ARC is currently being used: if someone is reading that data out of the ARC right now, or has the buffer locked, we can't free it just because it's the oldest on the list. So the algorithm in ZFS had to be a bit smarter, to deal with the fact that some blocks will be in use and can't be evicted, and we'll have to evict, you know, the third item from the end of the list because the first two are still in use.

Lastly, compared to the original algorithm, where all pages in the cache were four kilobytes, the ARC in ZFS is built out of disk blocks, which can be anywhere from 512 bytes, like a regular hard drive sector, up to the ZFS record size, which goes up to 16 megabytes. So you can have very differently sized objects in the cache, and the ARC algorithm had to be a bit more complicated than the original one described in the paper.
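As a small illustration of those last two differences, here's a hypothetical eviction pass that works in bytes rather than fixed-size pages and skips buffers that are still held; the refcount field is my own stand-in for ZFS's buffer reference counting and locking.

```python
from dataclasses import dataclass

@dataclass
class Buf:
    key: str
    size: int          # buffers are variable-sized, not fixed 4k pages
    refcount: int      # > 0 means someone is actively using this buffer

def evict_bytes(lru_list, bytes_needed):
    """Evict from the cold end of the list, skipping buffers still in use."""
    freed, survivors = 0, []
    for buf in lru_list:               # coldest buffers first
        if freed >= bytes_needed:
            survivors.append(buf)
        elif buf.refcount > 0:
            survivors.append(buf)      # can't evict: it's locked or being read
        else:
            freed += buf.size          # evict this one (drop the data)
    lru_list[:] = survivors
    return freed

bufs = [Buf("a", 128 * 1024, 1), Buf("b", 16 * 1024, 0), Buf("c", 1024 * 1024, 0)]
evict_bytes(bufs, 512 * 1024)          # "a" is in use, so "b" and "c" go instead
```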
When I talk about the compressed cache in ZFS, I often get asked about comparisons to things like compressed memory and swap cache, so I thought I'd talk a little bit about the differences there. Most of the memory compression schemes out there, like swap cache, are generally built around the idea that when the system is running low on memory, we find some data in memory that we're not using very much and compress it, to free up space to keep other stuff in memory. The problem with that is that when you're running low on memory, the last thing you want to do is need more memory to run the compression algorithm and to copy the larger data into a smaller buffer. The worst possible time to try to compress memory is when you're running out of memory, and that's a problem a lot of the memory compression schemes run into: reacting only when the system is already under stress just makes that stress worse, not better.

What the compressed ARC is doing, by contrast, is taking the data that was already compressed when you wrote it to disk and just deferring the decompression step until later. The compressed ARC only does compression as you write data to disk, rather than every time you modify memory. So compared to swap cache, the compressed ARC is just taking advantage of compression you've already done, and using decompression, which is much faster and cheaper than compression, to maximize the amount of memory available for caching. Because, again, free memory is wasted memory: if you have free memory, you might as well use it for cache, and if you end up needing that memory, you can always shrink the cache to free it up again for the user application. In the meantime, the larger the cache, the higher the chance that the data you're trying to read will be in memory instead of on the slow disk.

And the last thing I want to talk about a little bit is tuning in ZFS.
There are a lot of different knobs you can tweak, but depending on what you're using ZFS for, there are a couple of highlights I'd like to point out.

If you're building something that's specifically dedicated to being a file server, which is basically what the defaults in ZFS are for, then you want to use as large an ARC as possible. A file server is quite unlikely to be running a bunch of other applications; you're not going to have web browsers and databases running on that machine, so you'll be able to use almost all of the memory for the cache. But if you have very large numbers of files, then increasing the size of the metadata cache from 25 percent to, say, 50 percent means you can cache more of the metadata of those files, and things like directory searches or running rsync over large directories will be much faster, because the metadata will all be resident in memory. And with compression, metadata compresses extremely well; you'll see compression ratios of three or four to one, so you can fit a lot of metadata in not that much memory. Optionally, if you have a working set that's larger than the amount of memory you have for the cache, you can consider the L2ARC, using an SSD, NVMe, or NVDIMM device to keep commonly used files somewhere a little faster than the spinning hard drives you might be using; not as fast as RAM, but much cheaper than buying terabytes of it.

If you're doing block storage, if you're going to do NBD or iSCSI like the previous talk was discussing, again you probably want to use most of the memory on that machine for caching, especially if you're backing VMware or Xen or something like that. But one of the biggest considerations is the block size. Especially when you're exporting a block device, you have to match it up with the layout of your disks, to make sure you don't waste a lot of space on padding or cause read-modify-write cycles. If you use a very large virtual sector size on the block storage, but the guest OS inside the VM expects to be using, you know, 512-byte or 4k sectors, then you end up reading a 32k block, modifying the middle of it, and having to write that whole block out again; there's a rough worked example of that just below.
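Here's the worked example promised above, with illustrative numbers (a 32k zvol block and a 4k guest write), showing the read-modify-write amplification when the guest's writes are smaller than the blocks backing them.

```python
volblocksize = 32 * 1024   # size of each ZFS block backing the zvol (assumed)
guest_write = 4 * 1024     # size of the write the guest OS actually issued

# ZFS must read the whole 32k block (unless it's already cached), splice in
# the 4k change, and then write the whole 32k block back out (copy-on-write).
bytes_read = volblocksize
bytes_written = volblocksize

amplification = bytes_written / guest_write
print(f"{amplification:.0f}x write amplification")   # -> 8x
# Matching the zvol block size to the guest's sector/cluster size avoids this.
```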
For databases, there are actually two different approaches, depending on your workload. Previously, the recommended way was to keep the ARC quite small, or even, using the per-file-system properties, to cache only metadata and not cache the data at all, and then rely on the database's own buffer cache, as MariaDB or Postgres would, and let the database take care of the caching. The database has more knowledge about the rows in your tables and what's in use and what's not than the file system does, so it will understand your usage better. However, once the compressed ARC was invented, you have a new option: if the compression on the cache is going to be good enough that having a much larger cache is a win, you might instead decide to have a medium or large ARC and keep the database's buffer cache quite small, relying on ZFS to do the caching, because with compression it can cache two or four times as much data as the database's buffer cache and get a higher cache hit ratio.

And lastly, if you're going to run VMs on the machine co-resident with ZFS, for a hyperconverged type of setup, then you probably want a relatively small ARC, because you're likely giving most of the RAM on the hypervisor to the individual VMs. If you have ZFS trying to use as much RAM as it can, and then having to give some back for the guests, and they fight back and forth, that can create more contention, and that's not good. So in that case it probably makes sense to set a specific upper limit on the size of the ZFS cache and deal with it that way. The other problem you can run into there is double caching: you're caching the blocks in the ARC on the host, on the hypervisor, but then the OS inside the guest, whether it's running ZFS or not, is using its own buffer cache and ends up caching that same data, so now you're keeping that data in RAM in two places, which obviously isn't as efficient as keeping it in only one place.

So now I'm going to take some questions for a bit, and then we'll see what else we need to talk about. Anybody have questions?

How much would you recommend ZFS to desktop users? Sorry, so the question is whether ZFS is applicable at all to desktop users. Yes. I'm giving a talk that will make a lot more sense of that at five o'clock in the BSD devroom, but the reason I run ZFS on my laptop is for a feature we call boot environments: being able to snapshot the file system before I upgrade my packages or upgrade to a newer version of the OS means that if it goes wrong, I can just roll it back. For example, I updated the software on my laptop, and then I came to do my presentation and it didn't work. I could just reboot, select an older version of the OS from the FreeBSD boot menu, based on keeping those older versions as different ZFS file systems, and instantly revert to an older version of the OS. But my home directory is a separate file system, so it didn't go back in time; my slides were still on the laptop even after I reverted the OS. So yes, I find it extremely useful on desktops and laptops.

Next question: I perfectly understand how the compressed cache works when an application uses read() to read from a file, but how does the compressed cache deal with applications that memory-map huge ranges of files to access them directly? So when you mmap the file, it deals with the page faults individually. When you go to read a block, it will copy it, decompress it, and write that version into a different region of memory that the mmap is backed by. So just as individual pages fault the first time you try to read them through an mmap, it will get the data from the ARC, decompress it, and put it there so you can retrieve it. And mmap is actually a case where you can end up with double caching in ZFS even without the compression, because of the way mmap interoperates with ZFS. The data you have in the ARC needs to be read-only; if you write new data, it goes to anonymous buffers that then get compressed and stored in the ARC separately. So mmap and ZFS don't use the same buffer cache, and you do get some double caching there.
Yes Okay, so I think you refer to SSDs for a slot devices so for L2 arc Yes, we're out of and you said that they are varying out Which is obviously true but is there also some optimization for other memory technologies like 3dx point for example Which does not bear out that fast and you can actually write a lot of petabytes So there are two tunables for that the first one sets How many megabytes per second you might want to write to the device and then there's a second one called the boost which is Since starting up the device hasn't ever been full yet Then the boost amount of megabytes per second is added to the limit So by default if it's only gonna write say 20 megabytes per second out to the SSD But we're just freshly booted in the SSD has never been full yet Then we might add a second 20 megabytes a second to that so that we fill it up at 40 megabytes a second until it's full and Then switch into a lower speed and so For if you're using a 3d crosspoint or octane or something you might just decide to add a zero to each of those numbers or something Because at the same time you don't want to use up all the bandwidth just writing Data that maybe isn't going to be used. So having some rate limit is still useful Just as you want the device available for reading more than for writing Hi here So you talk please are you is that in a case with bake ups? You are it's not efficient because you're just filling your LRU But how do you avoid it with LRU and MFU because if you're doing backups You are just asking for more memories for your LRU and just yeah Being in a case where your LRU is way because I know is on your MFU So how do you deal with that you have some history to avoid? Too much allocation for the LRU in a specific case right the the each time we get the Ghost hit The p-value only moves by one byte so it takes a lot to actually move that counter many megabytes or gigabytes But in the case of a backup because you access each file only once You're not going to hit the ghost list very much So the the p-value won't actually change in the case of a full disk scan or a database scan And so it's not actually going to cause your MRU to get really big it will still cycle the MRU and make it less useful but at least some fraction of your caches MFU and Frequently used files are not going to suddenly drop in performance because of the scan Hi So how do you evict stuff from the ghost lists? So the ghost list is because it's just the keys. They don't take very much space And eventually there's just a size limit and they fall off just like a regular. It's basically a secondary LRU cash and so When it gets full you just delete the oldest entries. Hi, what's up with the set standard? Pardon what's up with your set standard work? What's that standard? 
Yes, so there was a little note on one of the slides about adding a new compression type, Zstandard. It's from the same author who did LZ4, but it's much more modern, and, kind of like how gzip has levels one through nine, Zstandard has levels from negative values all the way up into the twenties, which gives you a lot more control over the level of compression. I've been working on implementing that in ZFS, and it all works. The problem is that if you turn the compressed ARC feature off and you have an L2ARC, the copy of the data in memory is not compressed, so when you go to write it to the L2ARC you first have to recompress it, so that the checksum will match the copy that's on disk, which was compressed. And that works, because we store the compression level and can redo it. The problem comes when you go to read it back. If we upgrade the version of Zstandard in the future, for example (over the course of this development Yann released a newer version of Zstandard where compression levels 10 and above get something like 8 percent better compression), then when you take data that was written with the old version and recompress it with the newer version, the copy that ends up on the L2ARC has a different checksum, because it's actually smaller. So when we read it back, we see that the checksum is wrong and report that your L2ARC SSD is failing, when it isn't.

So we're trying to decide the best way to avoid that problem: whether it's to avoid the recompression, or to check for the checksum mismatch during the write phase and skip the block in that case, or to just pick some version of Zstandard and always use that version and never change. But giving up the performance gains that come out on a regular basis doesn't seem like the best option either. So the work is kind of hung up on solving that problem; that's why there hasn't been much progress in the last six months, but I'm hoping to figure that one out and get the feature upstream. I originally developed it on FreeBSD, but someone has ported it as a pull request to ZFS on Linux, so if you don't have an L2ARC, or at least if you leave compressed ARC on, it works and you can play with it today. Although, until it's committed, I probably wouldn't want to use it in production.
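To illustrate why that version skew looks like corruption, here's a contrived sketch using two zlib levels as a stand-in for two different Zstandard versions: ZFS checksums the compressed bytes, so any change in the compressor's output, even to an equivalent smaller encoding, no longer matches the recorded checksum.

```python
import hashlib
import zlib

data = b"database rows, database rows, database rows " * 200

# At write time: compress with "version A" and record a checksum of the
# compressed bytes, as ZFS does.
on_disk = zlib.compress(data, 1)                # stand-in for zstd version A
recorded_checksum = hashlib.sha256(on_disk).hexdigest()

# Later, the uncompressed copy in RAM is recompressed with "version B"
# before being written to the L2ARC.
rewritten = zlib.compress(data, 9)              # stand-in for zstd version B

assert zlib.decompress(rewritten) == data       # the data itself is identical...
print("sizes:", len(on_disk), "vs", len(rewritten))
print("checksums match:",
      hashlib.sha256(rewritten).hexdigest() == recorded_checksum)
# ...but the compressed bytes generally differ, so the checksum no longer
# matches and the read looks like a failing SSD.
```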
Right, next question: recently I had a Windows image running on a ZFS volume, and I had an L2ARC that was somewhat larger than the Windows partition, but every time I spun the Windows image up and back down, I kept getting hard disk hits. Would it be correct that I actually need an L2ARC that's at least twice the size of the image to not see any disk hits anymore? I think the problem you're running into there is likely the fill rate limit: by default you're only allowing, I think, 10 or 20 megabytes a second to go out to the L2ARC, but the data in your ARC was falling out at a faster rate than that. So if you just increase the fill rate, you should be able to fill up the L2ARC you have and get closer to a hundred percent cache hit rate. Interesting, thank you very much. Yeah, it's basically taking a fraction of the data that's falling out of the cache and writing it to the L2ARC, and if that fraction is only 10 percent or something, it would take a lot of cycles before you had all that data on the L2ARC. Separately, there's another feature coming in ZFS on Linux, and eventually to FreeBSD, called allocation classes, which lets you have an SSD that you specifically designate as being for metadata, or for blocks that are, say, 4k or smaller, so that you can force specific data to always be on the SSD and let the rest of your data live on the spinning disks.

Oh, hello. So I read that PostgreSQL wanted to implement ARC, about 15 years ago, and then they decided against it because of the patent situation; I think IBM has a patent on the ARC algorithm. Can you say a bit about the patent situation? I cannot comment on issues of patents; I'm not qualified and really, really don't want to.

There's a question up there. Yeah, over here. I was just wondering about the compressed ARC: obviously it pulls the data compressed off the disk, but if you've got a ZFS volume that you've enabled compression on, and a load of existing data on there that's uncompressed at the moment, does that then get loaded into the ARC uncompressed even though you've got compression enabled on your volume? Yes, that still gets loaded into the ARC uncompressed, because it's uncompressed on disk; with compressed ARC, the copy in the ARC is always exactly the same as the copy on disk, so that the checksum can be verified each time without having to store a second checksum. So if I wanted to take advantage of the compressed ARC, I'd have to basically rewrite the data to make sure it gets written compressed? Yes, or you could use ZFS replication to send it to a second zvol, and have it written compressed there. All right, okay, thank you. So, pro tip: turn compression on before you write any data.

Hey, three more minutes. Question: suppose NVMe is my main storage, so I don't have any hard drives. How would I configure ZFS in this case? NVMe has quite different performance characteristics to a hard disk, so you would rather store things in a different format than on a hard disk, wouldn't you? The biggest difference with NVMe is that you can actually be executing multiple reads or multiple writes at once. Most NVMe devices can run somewhere from 16 to 64 commands in parallel, whereas the SATA interface only does one at a time; you can queue a bunch of them up, but it's only actually going to run one at a time. So you don't need to do anything special in ZFS, but there can be a big performance gain from increasing the queue sizes.
By default, for spinning disks, you don't want to send more than a single-digit number of work items to the disk, because they form a queue, and if a really important request comes up, it goes at the end of the queue; you keep the queue short so that when an important job comes up, it gets to the front quickly. But if the device can be executing 64 different commands at once, having a queue of four doesn't take advantage of that. So you just need to raise the queue maximum to allow, say, 128 commands to be queued to the device, knowing that because it executes 64 in parallel, you're never going to have a queue that's effectively longer than two. So there's some tuning for NVMe, but you don't need a fundamentally different file system or anything.

Thank you for an excellent talk. And if you want to know more, you can check out the books, and I do a weekly podcast as well.