Tom here from Lawrence Systems, and I have a new ZFS shirt, so I want to talk about ZFS. Seems like the appropriate shirt to wear for this topic. Now, I did a video called "ZFS is a COW," but obviously there's more to the story. The copy-on-write semantics are really important to understand, but the next question might be: what about those specialized vdevs such as log or cache? Don't those also play a big role in how data gets to, or egresses from, that particular pool? And you are correct, but there's a little bit more to the story. It's actually a very complicated topic, one that you'll find lots of discussion about on many different forums. So I wanted to put some numbers together and show you how this works, not just talk from an academic standpoint. I'm going to have everything time indexed, because there's going to have to be a little bit of talking to explain how these things work, and I'll try to be as concise as possible. Then we'll get to the demo part, where we demonstrate how you functionally add or remove these on the fly inside of TrueNAS, and what happens when you do this while you have a VM doing reads and writes back and forth, showing some of the numbers and how things adjust. Ultimately, it's functional, to teach you how it can be done inside of TrueNAS, and academic, so we can understand what's going on behind the scenes. I'll also leave links to all the articles I talk about in a forum post I have that dives into a lot of different ZFS topics. I am a big ZFS fan, obviously, for those of you that don't get the joke on the shirt. I have been called a ZFS cult member because I'm preaching about the wonderfulness of ZFS, and it is a wonderful file system. So that's kind of the joke about the shirt. Yes, there's a link to those in case you're curious or care; if not, you can skip right over that. Nonetheless, before we get into the details of this video, if you'd like to learn more about me and my company, head over to LawrenceSystems.com. If you'd like to hire us for a project such as storage consulting, there's a Hire Us button right at the top. If you'd like to support this channel otherwise, there are affiliate links down below to get your deals and discounts on products and services we talk about on this channel.

Let's start by talking about how write caching works in ZFS. There are two types of writes: asynchronous and synchronous. An asynchronous write provides immediate confirmation when it receives a write request, even though the write is still pending and has not yet been committed to disk. This frees up the application waiting for the confirmation and can provide a performance boost for applications with non-critical writes. That's because it's lying to the application; essentially, it said the write happened even though it didn't yet. A synchronous write, on the other hand, guarantees integrity by not providing confirmation of the write until the data has actually been written to disk. This type of write is used by consistency-critical applications and protocols such as databases, virtual machines, and NFS. We'll do VM and NFS demos later in this video. All pending writes are stored in the ZFS write cache. This is located in RAM, which is volatile, meaning its contents are lost when the system reboots or loses power. To maintain data integrity for synchronous writes, each pool has its own ZFS intent log, or ZIL, residing on a small area of the storage disks.
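Just as a quick aside on where that behavior is controlled: sync handling is a per-dataset ZFS property. A minimal sketch, assuming a hypothetical dataset named tank/vms:

# sync=standard honors whatever the application asks for (the default)
zfs set sync=standard tank/vms
# sync=always treats every write as synchronous (safest, slowest)
zfs set sync=always tank/vms
# sync=disabled acknowledges writes immediately, even synchronous ones (fastest, riskiest)
zfs set sync=disabled tank/vms

We'll flip this same setting through the TrueNAS UI later in the demo.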
Pending synchronous writes are stored in RAM and also logged in the ZIL simultaneously; this applies only to synchronous writes. ZFS is a transactional file system. It writes to storage using transaction groups, which are committed at a set interval, every five seconds. These groups are atomic, meaning the entire contents of a transaction group must be committed to disk before it is considered a complete write. That is the copy-on-write mechanism I discussed in my video where I said ZFS is a COW, linked down below. So what happens if a failure such as a power loss or kernel panic occurs? Everything in RAM, including all pending transactions and asynchronous write requests, is gone. If there was an interruption in the transaction group performing a write, that transaction group is incomplete and the data on disk is now out of date by up to five seconds, which can be a pretty big deal on a busy server. But keep in mind that pending synchronous writes are still in the ZIL. On startup, ZFS will read the ZIL and replay any of those pending transactions in order to commit those writes to disk. It's actually a pretty solid system: pending synchronous writes still get written without any losses, and no more than five seconds of asynchronous writes would have been lost. Five seconds, as I said, is a big deal, but just so we're clear, it's only asynchronous writes that are lost, because ZFS replayed those synchronous transactions. But this comes at the cost of performance.

So let's talk about performance and where that ZIL data is stored. For applications using synchronous writes, having the ZIL reside on the storage disks can result in poor performance, because the ZIL writes and reads must compete with all the other disk activity. This can especially be a problem on a system with a lot of small random writes. The solution to this issue, if your workload requires synchronous writes, is to move that ZIL to a dedicated log vdev, and I'll show a quick command-line sketch of that in a moment. There are a few things to keep in mind if you want to do this. The log vdev requires at least one dedicated physical device that is only used for the ZIL. You should mirror these devices for safety and redundancy in case one of them were to fail. The devices should be really fast, but they do not have to be very big. The ZIL only needs enough capacity to hold the synchronous writes that have not yet been flushed from RAM to disk, and that flush occurs every five seconds. So how much data is that? Let's use a 10 gigabit connection to your TrueNAS server as an example. With a maxed-out 10 gigabit connection, the maximum possible throughput, ignoring overhead and assuming one direction in a perfect situation, would be 1.25 gigabytes per second. And because that flush happens every five seconds, the most you would see written to your log device is five times 1.25, totaling 6.25 gigs. That's it. You're hearing that correctly: 6.25 gigs is the absolute most you would write to that log device. So if you bought a one terabyte device, you're probably not going to use all of that log device. This is, as I said, why so many of them are so much smaller.

Now let's talk about ZFS read caches, of which there are two. The first one is ARC, which stands for adaptive replacement cache. It is a complex caching algorithm that tracks both cached blocks and blocks recently evicted from the cache to figure out what to cache. ARC is all in RAM, which leads people to think ZFS is memory hungry, but it's not; ZFS just does not let any unused memory go to waste.
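Going back to the log vdev for a second before we dig further into read caching: if you were adding one from the command line rather than through the TrueNAS UI, a minimal sketch would look something like this, with the pool and device names being placeholders rather than anything from my demo system:

# add a mirrored log vdev (SLOG) to an existing pool
zpool add tank log mirror nvd0 nvd1
# rough sizing math for a saturated 10 gigabit link:
# 1.25 GB/s x 5 second transaction group flush = about 6.25 GB maximum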
It also should be noted that ARC is a native part of how ZFS works, and the more memory it has to work with, the better it performs at caching. Now, the L2ARC, or cache vdev, is a storage device you attach to your pool. It is a much simpler system compared to the ARC, allowing for efficient write operations at the expense of hit ratios; just kind of a simple caching system. Of note, L2ARC will rarely have a cache hit ratio as high as the ARC. This is expected, not a bug. L2ARC is just a simpler, cheaper way of caching more data than the ARC can fit. The most demanded blocks are always going to be available in ARC, and the L2ARC is just a catch-all for some of the marginal blocks, hence the lower hit ratios. Now the real question comes: when should you use an L2ARC for read cache? For most users, the answer is simple: you shouldn't. That's it. The L2ARC needs system RAM to index it, which means that the L2ARC comes slightly at the expense of memory for ARC, and ARC is an order of magnitude or so faster than the L2ARC and uses a much better caching algorithm. You need a rather large, repetitive set of data to be requested, one that exceeds the ARC, for the L2ARC to become worth having. So if your goal is to have a fast cache for frequently accessed blocks, buy more memory for that system. It's the best investment you have. This is why high performance ZFS systems always have so much memory in them. It is also worth noting that if you have a write-intensive workload, or don't frequently request the same data, an L2ARC is also not very useful. So the goal is always to have as much memory as possible; that's where you're going to get the best read performance. It's really that simple; there's not anything more complicated than that. But if you do have a working set that exceeds your RAM, or you have maxed out the capacity of your RAM, then maybe it's time to consider an L2ARC cache device. Now, of note, you do not have to put these in pairs in the system, because they're only caching data that already exists on the storage vdevs. Therefore, if one were lost in a failure, it's inconvenient, but not catastrophic, because if that data is not there, ZFS just pulls it back from the drives like normal.

Now for the fun part: playing with this demo lab that we have set up, running TrueNAS 12.0-U8. Essentially, this works the same in TrueNAS Core as it does in Scale; our demo system here happens to be a Core system. Now, we have 15.9 gigs of RAM, but this is really not a high performance system overall. I want to bring that up because we're not going to be running a whole lot of benchmarks; we're just going to be talking about functional things and how to get them done in here. Looking at the ZFS cache: because I have a VM running on this attached over NFS, ZFS has decided to cache things with the ARC, as we talked about. It's not that it's RAM hungry, it'll just use as much RAM as is available in this system. With 15.9, or roughly 16, gigs of RAM available, the cache is currently able to use 11.7. And yes, in case you're wondering, it resizes dynamically based on the services running; if I add more services or install something like a jail on here, it automatically resizes and scales. It's only using what it perceives as free memory not used by other services running. So that's dynamic and nothing you really have to worry about, in case you're wondering. Now, the system itself is set up with NFS attached to XCP-ng, and I have a demo system right here.
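By the way, if you want to peek at those ARC numbers from a shell on TrueNAS Core yourself, the kernel exposes them; a minimal sketch, and the exact stat names can vary between versions:

# current ARC size and maximum target, in bytes (FreeBSD / TrueNAS Core)
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.c_max
# hits and misses, if you want to work out a rough hit ratio
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses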
And I have it up live and running; it's attached, as I said, with NFS, and we have sync disabled. So let me cover that setting real quick. We're going to go to Storage, then Pools, click the little button there, edit the options, and we can see Sync is set to Disabled. And we're going to run a simple test where I SSH into these. The top one is the Ubuntu VM, and down below, I should probably show the command because people will wonder: we're just running zpool iostat -v with the name of the pool, which is the lab pool, and 1, which means refresh every one second. It's really simple; it just keeps rereading and telling us what the stats are. We're going to make it a little bigger so you can watch this populate. And we're going to run some fio. Specifically, I have this set up to run fio with a write test; we're just going to write about a gig of data here and see what happens. So here we see it writing, and you can see down here the data as it's refreshing. All right, we're writing quite a bit of data here to the drives. You can see, by the way, these are also just in RAIDZ1, and we have three SATA SSDs in here; not a particularly fast system, but good enough. So, looking at the utilization and queue output, we're writing at about 93.1 megs a second. So there are our writes. And if we go back over to the system here and switch over, we're going to look at how it perceives this. There are the disk writes that we just did, and we can see the throughput as it goes through. So not bad, not great, but not bad either for what it's writing. Now, on the fly, you can change this: we're going to set Sync to Always. By the way, we do not currently have a SLOG device installed, but let's go ahead and run that same test again after we do this. So we're going from 93.1 megs; we're just going to rerun the same command and see how fast we can write to it with Sync set to Always. Looks like we're writing at about 13 megs instead of 70 megs. This is that forced commit. Now, when you have sync disabled, it's essentially lying to the XCP-ng system; it says, oh no, we have that write absolutely committed. With sync always, we're saying: no, don't tell XCP-ng, or whatever's writing to this particular NFS mount, that the write is done until we actually sync it; don't give the system the call to send more data, because we're not ready, we haven't synced yet. This just results in extremely low performance. And even with a lot of drives, this is where you run into the need for a dedicated log device, or to turn sync off. So currently, at 21 megs, 17 megs, it's going to take a little while for this to finish, and that's because it's writing so slowly. Let's go back over here to TrueNAS, and now we're going to talk about adding a log vdev to our pool. If you look at the disks, there are those three SATA SSDs that we're using for the drives in the lab pool, and here is the NVMe drive we put in; we happened to have a small one laying around, and it's slotted into the system. So we're going back over to the pool, and by the way, we can do this while it's reading and writing, we don't have to shut anything down. We want to go here, add a vdev, choose the type, which is log, find our NVMe drive, select it, and that's it. You're going to get a warning that a stripe log vdev may result in data loss if it fails combined with a power outage; that's it telling you you shouldn't have just one. So we're going to have to confirm with Force.
They really want you to know that if you do not commit these in pairs, there is the potential that you'll lose this device along with the pending writes that were still in RAM; if there's a catastrophic failure, you could have a problem. So they're reminding you that these should really be set up in pairs. We're going to add the vdev, and it notes that the disk we're adding will be erased as the pool is extended; yeah, we're just letting it know that's fine for this block device. Actually, let's check real quick: it did finish writing, and you can see it wrote rather slowly, but it did finish. We'll double check and show the screen, because we're going to hit Add Vdevs and then jump over here. And just like that, look, we've added it, and now it shows up right here as the log device. So here it is. It's got 13 gigs, but back to what I said earlier: it doesn't have to be that big, which is fine, and makes this a little more ideal. Now, there are small writes and transactions that the system is going to do with this VM, so there's going to be a little bit of data that you'll see here popping in and out. But let's go ahead and run that write test again and see if we can beat our 17 megs. So run fio and let's see if these little writes make a difference. Hey, look, we've got some data going over here. We're not quite up to the performance we were hitting in the 70s, but we're up closer to 60. So you can see the immediate effect of adding this: it definitely improved our write performance, not to where it was with sync disabled, but still reasonably good. So yes, absolutely, we improved, and it has now committed and flushed out some of that transactional data. It sits with about 362 megs left, even though the file we're testing is one gig. As I said, it's not exactly a one-to-one ratio between the size of the file you create and what ends up in the intent log, and as it expires out and commits those transactions, we're back to not really having anything left. All right, go back over here and let's remove it; you can add these or you can remove them. We go to Status by clicking on the pool, click on this, and we can just drop it. Confirm you want to drop this, absolutely, and it's gone. Well, almost. All right, I got a little ahead of myself there. That's it, now we've removed it. And if we go back over here, you can see it doesn't show up anymore. So switch back here, go to Pools, and it's the same process again to add it as a cache drive. Now, before we add it as a cache drive, let's go ahead and do this: we're going to run fio again. I don't care about a write test anymore, so we're going to swap that out; this is a random read test. fio is just a really simple utility I'm using, it's not in-depth, and this is not precise benchmarking. This is to give you some general idea of what happens. So if we run this and look down here, you're going to see much heavier writes, and we'll see where we get with this. So it's doing the writes first; it's going to take a few seconds to complete. But we're doing this at about 63,000 IOPS, and we should get a speed readout at the end. It should tell us about how fast this is with the random writes, or we can even go and look at this, because this is what's going to be in there. I'm sorry, random reads; this is a random read test. If we go over here, you'll actually see, which is kind of interesting, that even though we're writing at these slower speeds, it's able to get a substantially higher throughput.
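For reference, the fio jobs I keep running are roughly of this shape; the file name, sizes, and options here are illustrative placeholders rather than my exact test script:

# sequential write test, roughly a gig of data
fio --name=writetest --filename=/fio-test.dat --size=1G --rw=write --bs=1M --ioengine=libaio --direct=1
# random read test against a larger file, so not all of it fits comfortably in ARC
fio --name=readtest --filename=/fio-test.dat --size=10G --rw=randread --bs=4k --ioengine=libaio --direct=1

That higher read throughput, despite fairly modest disk activity, is the interesting part here.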
Now, the reason for this is really simple, and we'll go ahead and switch back and forth to show it. I even put a larger file in here, but it's also part of the way the adaptive replacement cache works. So if we go back in here and run fio, if you can read through where it's wrapping right here, it's actually sized to be a 10 gig file. But even though it's a bunch of random reads from a 10 gig file, remember that at the beginning I mentioned there are 16 gigs of RAM and 11.7 gigs of it are all for this cache. This is why you don't see a ton of reads and writes in the disk activity: you're seeing basically a massive amount of this reading being served right from the ARC. This is ideal. It's not necessarily making it easy for me to benchmark, because it seems like an unrealistic number. You're like, Tom, you're not measuring a drive speed. I'm like, well, do we have to? Don't we want to see the efficiency of what's going on in a VM? And isn't a running VM frequently asking for a lot of the same data? So the more memory you stuff in here, the better it is, because that's where what's referred to as hot data lives; the ARC is saying, yes, those are all the objects I put in there, this is the hot, in-demand data you keep requesting, so let me keep giving it to you. Let's go over here to Reporting, and if we look at what's going on and switch over to ZFS, this is how you end up with these ARC hit ratios that are really, really good. So here we go: hits versus misses, 512,000 hits versus 5,000 misses. Those are quite good numbers and what you're hoping for. Now, let's go ahead on the fly, and maybe you're noticing, hey, what are these over here where it says L2? That's from a demo I was doing before I hit record. So let's go back over to the storage pools, and now that I have record on for this demo, we're going to add a cache vdev: same process, add it over here. And these do not need to be in pairs or redundant. The reason why is really simple: it's just data read from your existing storage pool and pulled into the cache. So it's not a big deal if it's lost, it's just annoying, and it will just repopulate; if the system loses it, it can go, oh, I know where that was, it was just pulled from the pool to put in there. So now we've added this, and we can see the cache device is here. Now that we know what this random read test does, let's see what happens. Currently, a little bit of data is going in there, just from the frequent writes; and remember, the L2ARC is a more basic cache, and it's going, all right, I'll put some of this data in here, just from the running VM. But if we run the test, it's going to populate a lot faster. So we're going to start seeing the writes populate in, but we're probably not going to see a massive change at all in performance. As a matter of fact, the IOPS are roughly the same; if we scroll back in here, it's not substantially different. And the reason why is that it's still just pulling from the ARC. We have to fully exceed the ARC cache in order for the L2ARC to start being effective. But there's still at least some data being swapped out and populated in here, and that data will just stay there as part of the cache.
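And again, if you prefer the command line to the UI for this, a minimal sketch with placeholder pool and device names:

# add a single cache (L2ARC) device - no mirror needed, it only holds copies of pool data
zpool add tank cache nvd0
# and it can be removed on the fly the same way
zpool remove tank nvd0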
And this can be beneficial, as I said, in certain workloads. For example, you may need a cache if you have lots of large files that many people are requesting; if a file is larger than the ARC, it can end up in the L2ARC and is likely to be hit there as a successful cache. But overall, the reality is that the performance differences between RAM and a block storage device are massive. So if you can, put more memory in the system; it's better for your performance, if not necessarily for your budget, and it will allow for better caching. But the L2ARC caching still works: the system still puts some data there, though it can take a little while, because the working set has to exceed the memory; we have to exhaust the ARC. I don't really have a demo for that, but I guess I could try one more test where we just make this file big. So let's go ahead and do this and see where it goes. Instead of a 10 gig file, let's say 25 gigs; let's go ahead and run this again with a 25 gig test. Depending on how the random reads land across the larger file it creates, hopefully it will exceed the ARC cache and actually start pulling a little bit more, and we'll see some read operations, if I scroll this up a little, on this cache device. Because as you can see from these tests right now, it's just not hitting read operations on it. So this larger file did result in a change. We have 169 here, so a little bit slower than the 295 if we scroll up. And if we look at this under XCP-ng, we can see the test we did here, then here is the writing out of that large file, and back over here at the end it peaked out again, probably when it was pulling from the ARC. So the first part was very random, and that's probably more realistic. But once we got to the part where it was repetitive, and the ARC was going, nope, I have this, it was able to send all the data, and we see it ramping back up because it was asking for repetitive data. Now, as I said in the beginning, I'll leave links to many of the other topics I've talked about around ZFS and planning and storage pools and ZFS being a COW, the whole list of articles I have over in my forums, because there's still a lot more to understand. But I hope this video gave you a better understanding than you had before about how some of this works. It is a complicated topic, and that's one of the reasons it's misunderstood; just due to its complexity, understanding the performance of file systems is a little tricky. But it's still what makes ZFS such an awesome system, and I just love the fact that we can do this on the fly, not rebooting or anything, just going, hey, let's add these block devices, remove them, and understand a little better how they work. I really encourage playing with this in your own lab. This is how I gained a better understanding of it before I started doing this commercially, and it's one of the reasons I like spending time talking about it, because, well, I think we need more people in the industry that understand it, and it's just a fascinating topic. I imagine a lot of you, if you've made it to the end of this video, found it a very interesting topic as well. As always, if you want to have a more in-depth discussion about this, head over to my forums, and I'll see you in the next video. Thanks.
And thank you for making it all the way to the end of this video. If you've enjoyed the content, please give us a thumbs up. If you would like to see more content from this channel, hit the subscribe button and the bell icon. If you'd like to hire us for a project, head over to lawrencesystems.com and click the Hire Us button right at the top. To help this channel out in other ways, there's a Join button here for YouTube and a Patreon page where your support is greatly appreciated. For deals, discounts, and offers, check out our affiliate links in the description of all of our videos, including a link to our shirt store, where we have a wide variety of shirts that we sell, and designs come out, well, randomly, so check back frequently. And finally, our forums: forums.lawrencesystems.com is where you can have a more in-depth discussion about this video and other tech topics covered on this channel. Thanks again for watching, and I look forward to hearing from you.