Our next speaker is Mark Fasheh, on dedupe on Btrfs. — Hello, my name is Mark Fasheh. I work on file systems at SUSE. I'm the maintainer of the OCFS2 cluster file system, I work on Btrfs quite a bit, and I do general file system work for SUSE Labs. Today's talk is on deduplication, and in order to talk about deduplication we should define it. Quite literally, we're just removing duplicate copies of data — nothing super complicated there. This happens across files, across the whole file system, which differs from compression: think of making a zip file, or even file-based compression within the file system. A lot of people ask how this differs from compression, and that's basically it. Conceptually, we're leaving the data alone — well, not really, because we're blowing away half of it and pointing everything back at the one remaining copy — but conceptually the stream is intact; we're not compressing or decompressing anything. Generally, there are two forms of dedupe. Inline dedupe happens in the write path of the file system or the block device — in this talk I'll mostly discuss file systems, but there are block devices that dedupe at that level. When you dedupe inline, the write path has to calculate checksums and potentially maintain a table of hashes, so it's going to impact your write performance. The trade-off is that you dedupe right away: in theory, if you start with an empty device and dedupe as you write to it, you never write the same thing twice.
If you want to trade some write performance to always get the maximum deduplication, that's what you'd do. Out-of-band dedupe happens later: we let the data get written to disk, and at some later point the admin, or whoever is responsible for the data, decides what they want deduped. Because it's a deferred process, there's no impact on write performance. The downside, of course, is that temporarily you're using more space than you will after the dedupe runs. Does that make sense to everybody? Straightforward. Cool — and just raise your hand if you have a question; that's probably the easiest way to do it. So, we had customers asking about deduplication at SUSE, and Btrfs is a natural fit for something like this. The reason is that Btrfs already has to refcount its extents, so it already understands the concept of one extent being shared among multiple files. Usually you'd be creating a snapshot or cloning an extent, and dedupe is essentially the reverse of that process: instead of cloning an extent and pointing at it from another subvolume, we're rewriting the pointers. What I'm getting at is that at the end of the day it looks the same on disk — it's just another use of the extent pointers Btrfs already has. Does that make sense? Okay. The other good reason is that we support Btrfs at SUSE, so obviously that's where I'm going to look first. And I've been asked to tell you that we have four engineers working on Btrfs at SUSE — I'm one of them — which is actually a pretty good number; we have very good coverage of bugs and features.
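(Editor's sketch.) The out-of-band scan stage described above — chunk every file into fixed-size blocks, hash each block, and group the blocks whose hashes repeat — can be modeled in a few lines. This is a toy in Python, not duperemove's actual code; the 128k block size is the default the speaker mentions later, and the in-memory `{path: bytes}` input is purely for illustration:

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # duperemove's default dedupe block size

def scan_for_dupes(files):
    """Hash each fixed-size block of every file and group identical blocks.

    files: {path: bytes} (a stand-in for reading real files).
    Returns {digest: [(path, offset), ...]} for blocks seen more than once.
    """
    seen = defaultdict(list)
    for path, data in files.items():
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            seen[digest].append((path, off))
    # only hash values with two or more locations are dedupe candidates
    return {d: locs for d, locs in seen.items() if len(locs) > 1}
```

Everything past this point — verifying the candidates byte-for-byte and rewriting extent pointers — is the kernel's job, which is what the rest of the talk covers.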
So that describes a little more of my background. I started on Btrfs with that sort of bug fixing, adding features we needed, and then it turned into: okay, now that things are stable and we can ship it to customers, what are the next steps we want to take? That's the advantage we have at my organization — we've got a lot of people working on it, so we can do that sort of thing. All right. Before I get into the nuts and bolts of how we dedupe, I want to describe how extents are laid out in Btrfs. The actual data is laid out pretty much as you'd expect on any extent-based file system: there's no header or anything, it's just on disk at some offset, with some length. Btrfs maintains what's called an extent tree, which is global to the file system, and it keeps track of all of those extents on disk, including things like their refcounts. If an inode in Btrfs wants to point to some data, it has its own item — an extent data item — and that points directly to the extent. The reason for this split is that it keeps the extent tracking separate from, say, snapshotting: when you create a snapshot, all you're doing is creating new extent data items that point at the actual extents on disk. Does everyone follow? Yeah — I had limited time; a diagram might have been better. That gets us to the process of cloning an extent. If you ask Btrfs, "I have this extent in one file and I want it cloned into that other file," then whatever data it gets cloned over is simply thrown away.
But you're fine with that — that's what you asked for. So essentially what Btrfs does is rewrite the extent data item and point it at the extent you asked for. [Audience question.] Good question. You do it that way because you don't want the refcounts and all that global metadata spread out among multiple snapshots — you keep it in the extent tree. Does that answer the question? The two places also hold different information, which is another example of why you keep them separate: in the extent tree you're just saying, this extent lies at this offset on disk, with this length. When you describe it for a file, you have additional information, like the offset within the file where that extent sits. As you can imagine, it would get really messy to have all of that in one place. Okay, anyone else? All right. So now, with that background on Btrfs, I'll talk about how I implemented dedupe. We chose to do it out of band. Most of the people we talked to were not interested in sacrificing write performance. There are actually patches on the mailing list today for doing it in-band, which is pretty cool. But with in-band, your choice is using a lot of memory or maintaining a table on disk, and you're computing checksums while you're doing the writes. We chose to forgo all of that: one, because a lot of customers weren't happy about losing write performance; two, it's just a lot simpler. Hacking the write path can get complicated — it's very performance-sensitive — and it seemed easier and better to do it via an ioctl and just do it later.
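(Editor's sketch.) The split the speaker describes — per-file extent data items pointing into a global, refcounted extent tree — is what makes cloning (and therefore dedupe) a pure pointer rewrite. A minimal toy model in Python; the class and field names here are invented for illustration, not Btrfs's on-disk structures:

```python
class ExtentTree:
    """Toy stand-in for Btrfs's global extent tree: extent IDs -> refcounts."""
    def __init__(self):
        self.refs = {}

    def add_ref(self, ext):
        self.refs[ext] = self.refs.get(ext, 0) + 1

    def drop_ref(self, ext):
        self.refs[ext] -= 1
        if self.refs[ext] == 0:
            del self.refs[ext]  # last reference gone: the space can be freed

def clone_extent(tree, src_file, src_off, dst_file, dst_off):
    """Point dst's extent data item at src's extent.

    Files here are just {file_offset: extent_id} dicts. The data that was
    at dst_off is "thrown away": its extent loses a reference.
    """
    old = dst_file.get(dst_off)
    dst_file[dst_off] = src_file[src_off]
    tree.add_ref(src_file[src_off])
    if old is not None:
        tree.drop_ref(old)
```

Dedupe, as described later in the talk, is this same operation run after a byte-for-byte compare has proven the two extents hold identical data.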
Right — so that's the idea, and it's a good one; that's where I'm going with out-of-band dedupe. It gives you flexibility: the admin can say, look, we're not busy between midnight and 4 a.m., so that's when we'll run our dedupe. Or maybe you have some better intelligence monitoring the system that decides when the dedupe process should run. The reason we keep it out of the kernel at that point is that there's no longer any reason to have it in the kernel; putting it there would just add complexity and potential bugs. But yeah, that's ultimately where I'm going with this project: for it to run as a daemon when you want your dedupe. Okay, so that covers why we chose out of band. The other thing I realized while doing this is that no matter what checksum you use, no matter how strong it is, there can always be collisions, and it could be compromised. One thing I have learned in file systems: people are very unforgiving if you corrupt their data. You do not do that. So because of that, don't trust the checksums. [Audience question.] Yeah, I'd assume there's still some mathematical probability. I was using SHA-256, which is way overkill — I'll get to it later, but I wanted it to be way overkill for my purposes. My response would be: I could tell that probability to a customer, and they'd say, "so there's still a chance it might corrupt my data, right?" So you just don't do it.
[Audience discussion.] Okay, I see what you're getting at, and I generally agree. It's just that we can't accept any chance of someone getting their data corrupted. It does free you up to do some neat things, though, because you don't need really strong checksums anymore, which is nice. By the way, checksums at the file system level are not there for detecting duplicate data — they're there for detecting errors. I bring that up not because of anything you said specifically, but because a lot of people come to me and ask, "why aren't you using the Btrfs checksums for this?" That's one of the big reasons: they exist to detect errors, not to identify duplicates. So ultimately, what this means is that once we get into the kernel, we compare byte by byte. If you're asking to deduplicate this data, we make sure the pages are absolutely the same on either end before we do anything. That's the promise you can give: no, we're checking. [Audience agreement.] Exactly, yeah — you're absolutely right. So this is the ioctl I came up with, and it's pretty straightforward, actually. You give a target file with an offset and a length, and then you have a nice fat array with a bunch of fds in it, and you send up a request that says: I'd like all of these files deduped, at these offsets.
As I already described, the ioctl goes in and does a byte-by-byte compare — no hashing happens there, it's just a memcmp, basically. What that allows is for you to do the hashing in user space however you want: you want to store hashes in a file, store them in a file; you want to keep them in memory, keep them in memory; use this hash, use that one. We've been discussing which hash to use — I've had maybe four or five patches for different hashes, because everyone has their favorite hash or checksum algorithm. Like I said, it's pretty straightforward. It does everything under lock, and it returns whether you actually deduped or not, so user space can tell whether the compare failed or whether there was some other reason the dedupe couldn't happen — a permissions issue or whatnot. It's entirely possible for a file to change between when you call into the kernel and when the actual dedupe runs, so we have to handle all of those cases. Internally, we reuse most of the clone code. Clone is an existing Btrfs ioctl — I touched on it before — and it's the way Btrfs exposes to user space the ability to copy extents from one file to another. So all we're doing is the compare, then cloning one extent over the other and letting everything fall out of the clone code. The clone code already handles things like orphaning the extents that get overwritten, so it makes our life easy and shares bug fixes, which is nice. Any questions on that? I think that's a good sign. [Audience question about extent sizes.] They can be, yeah. For Btrfs? The minimum extent size is 4k.
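(Editor's sketch.) The contract of the dedupe ioctl described above — verify each target range byte-for-byte against the source, then report a per-target status so user space knows what happened — can be modeled like this. A toy in Python: the real interface passes file descriptors and a struct, not byte buffers, and the status names here are invented for illustration:

```python
def extent_same(src, src_off, length, targets):
    """Toy model of the dedupe ioctl's contract.

    src: bytes of the source file; targets: list of (bytes, offset) pairs.
    Only ranges byte-for-byte identical to the source range get deduped;
    anything else gets a status explaining why it was skipped.
    """
    chunk = src[src_off:src_off + length]
    statuses = []
    for data, off in targets:
        if data[off:off + length] == chunk:   # the in-kernel memcmp
            statuses.append("deduped")        # clone code takes over here
        else:
            statuses.append("data_differs")   # e.g. file changed since the scan
    return statuses
```

The key design point survives the simplification: hashing lives entirely in user space, and the kernel's only promise is the final byte-level compare under lock.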
No — that's actually a later slide. I dedupe at 128k, and I'd be happy to explain why. When we're doing all the IO and checksumming, duperemove — the user-space program — gives you the option: you can go from 4k up to one meg. The IO and the checksumming slow down a lot if you're chunking the data into 4k blocks, and additionally your risk of fragmentation gets higher. So it's a balancing act between how quickly we want the IO to happen and how much we'd like to avoid fragmenting the file system. I didn't have a great answer right off the bat, so I talked to my boss, talked to other engineers, and really just experimented a lot: did runs at different block sizes, and looked around at what other software did. 128k seemed to be pretty common as a default dedupe block size. So from user space, the block size you compare at is up to the user, defaulting to 128k. Internally, the file system can absolutely have extents from 4k up to — I want to say 256 megs would be the biggest extent. Does that answer your question? Okay. If you want to check this out, it's on GitHub. That's the user-space part; the kernel part is in the kernel, and if you want, I can show you the kernel code later. [Audience question about reading duplicated data twice.] That's absolutely correct — for your duplicated blocks, yes, the IO happens twice. I won't lie, we don't magically avoid it. What's that? No — the kernel uses the page cache, absolutely. So hopefully... right, the worst case is...
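(Editor's sketch.) The block-size trade-off is easy to see with a little arithmetic: halving the block size doubles the number of blocks that must be read, hashed, and tracked. A quick Python illustration of the scan cost at the three sizes the speaker mentions:

```python
def blocks_to_hash(scan_bytes, block_size):
    """Number of blocks that must be read, hashed, and tracked for a scan."""
    return -(-scan_bytes // block_size)  # ceiling division

TIB = 1 << 40
for bs_kib in (4, 128, 1024):
    n = blocks_to_hash(TIB, bs_kib << 10)
    print(f"{bs_kib:>5} KiB blocks -> {n:>11,} hashes per TiB")
```

At 4 KiB blocks a 1 TiB scan means roughly 268 million hash entries; at the 128 KiB default it's about 8.4 million — a 32x difference in table size and hash calls, which is the performance side of the balancing act (the other side being fragmentation).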
The worst case is that you do the IO twice. The average case depends on your data set: if it's much bigger than your memory, the reads will probably happen twice; if it's smaller than your memory, they'll come out of the cache. Okay, cool. Let's see. By "standard file-based interface" I just mean it's your standard Unix program — I tried to make it as close to cat or cp as I could. Please go ahead. [Audience question about submitting dedupes during the scan.] Right now, duperemove scans everything, builds extents from the scanned blocks, and then submits them. But if you don't mind, I'm going to write that down, because submitting right away is actually a pretty good idea. Right now we don't, because I want to cram as many dedupes into one ioctl as I can, and I don't know ahead of time how many duplicates I'll find. But there's a possibility of doing that in the future. I have a feature — again, we're getting ahead, but that's okay — where duperemove can write the hashes to disk in a hash file. That'll be nice later, when we re-scan the hash file: I have a feature coming up where I'll store transaction IDs in the hash file, so I can query Btrfs on a subvolume and ask, what's the last transaction ID on this subvolume? And Btrfs will then be able to tell me which files have changed. That's probably the last big feature I had in my head to complete the project. Any more questions? Straightforward? Okay, cool. Most of you deduced this already, but duperemove works as a three-step process. There's a file scan stage — and we went over why we use a 128k default. You can optionally use a temporary file; I recommend this.
One of the lessons I learned is that if you don't use the temporary file, you will eat an enormous amount of memory. About eighteen months ago I had to work something in, because I was finally hitting data sets where — oh wow, okay, this is more hashes than I have memory for. So now you can use a temporary file, and that expands the amount of space you can scan by a ton. We then take all of the duplicated blocks and make extents out of them, which goes back to wanting to reduce the number of ioctls I'm calling. So I have an enormous table of blocks that are duplicated, and basically what I do is find all the dupes that form a multi-block run, and submit those together. Two reasons for that. One is in the interest of calling into the kernel fewer times. The second is that I'm also trying not to fragment: I don't want to take a meg of duplicate data and chunk it out into 128k dedupes. I want to coalesce those 128k blocks into a meg and dedupe that. So that's the intermediate step, and the last step is essentially an enormous loop: call into the kernel, ask for a dedupe, get back my status. The hash file actually wound up being really useful for testing, too. I have instructions on the wiki for people who want to see how quick one of the stages is, or how their hardware handles it — they can isolate the three-step process: write everything into a hash file first, then dedupe from that hash file. And my great upcoming feature ties that together: you run it with the hash file from your last run, and it just updates it incrementally.
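(Editor's sketch.) The coalescing step described above — merging runs of adjacent 128k duplicate blocks into one larger extent before submitting — is simple to model. A toy version in Python, operating on block offsets within one file (duperemove's real version also has to keep the runs aligned across every file sharing the duplicates):

```python
def coalesce(block_offsets, block_size=128 * 1024):
    """Merge contiguous duplicate-block offsets into (start, length) runs.

    Fewer, larger runs mean fewer ioctl calls and less fragmentation.
    """
    runs = []
    for off in sorted(block_offsets):
        if runs and runs[-1][0] + runs[-1][1] == off:
            runs[-1][1] += block_size      # extend the current run
        else:
            runs.append([off, block_size])  # start a new run
    return [tuple(r) for r in runs]
```

So eight contiguous 128k duplicate blocks become a single one-meg dedupe request instead of eight separate ones.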
So hopefully by next year, when I do this talk again, I'll have that. I've also had some people use it for things other than dedupe, because it can just hash a lot of files very quickly, which has been pretty neat — just an interesting point. Any questions about the high-level view? Yeah, please. [Audience question about parallelism.] Oh no, it's parallelized — I thread the heck out of everything I can. In fact, the first and last steps — the file scan and the dedupe stages — are heavily threaded: as many CPUs as you have, or as many as you tell me I can use, that's how many threads I'll make. The extent-finding stage I haven't been able to thread yet, and it's actually very CPU-intensive, even though it's not allocating memory or going to disk — at that point it's just walking this immense data structure and putting extents together. That's led to discussions of maybe letting people optionally skip that step, if they're fine with just submitting a bunch of raw blocks to dedupe. But yeah, the first two steps are heavily threaded — that's where I got an enormous amount of performance. Switching to a better hash helped a lot, too. A bunch of things. Any more questions about this step? So, what most people want to know is: okay, how fast does it work, and how much dedupe do I get? My test is basically copying my home directory — /home — to a test machine and running duperemove against it. Right now, about 750 gigabytes takes about 45 minutes — 44 and change. So that's roughly a terabyte an hour, fully scanning it, categorizing it, and deduping it. Last year at this talk, that took two hours. And before that, it didn't finish. Part of this talk is just explaining the evolution of the project.
And that's definitely one of the evolutions — the very first thing anyone ever asked was how quick it is and how much it can dedupe. It also gives you a decent idea of what you might see on a home directory: out of 750 gigabytes, about 70 gigs deduplicated — roughly 10%. My home directory — obviously I'm not going to share the contents — is mostly photos, music, your usual stuff. I'm not going to share the hash file of it either. I'm desperately looking for a data set with the exact same kind of pattern that isn't my personal data, that I could put up somewhere. But yeah — I was actually surprised; 70 gigs is not bad for what I thought was pretty much non-deduplicatable data. I keep my source code on a separate partition, and source code doesn't deduplicate that well either, I found out. [Audience question about incremental scans.] Yeah, that's my big feature for the next revision — version 11 should have that. That's where I'm at now: everything else is fast enough that it can be on the horizon. I'll do that using transaction IDs from Btrfs. If anyone has used the btrfs find-new command, it's essentially the same thing: you give it a subvolume and it tells you what has changed. It's pretty sweet, actually — it's really cool. Yeah, please. [Audience question about running it as a service.] Right — I know someone on the Btrfs IRC channel who is doing something along those lines. I could do that. Right now my idea is for it to be something you run, and then maybe it'd be run by another daemon that keeps the log somewhere. But it's not a bad idea. I know at least one other person is doing that with their own custom daemon.
They have a really specific use case. I haven't actually gotten the code from them — I think they're keeping it to themselves, which is fine. But it's definitely a good idea; I'm not sure if I'll get to it or not. Cool. All right, so: use cases. I say "medium-sized" because I don't want to over-promise. You have a good idea of the numbers now — a terabyte in an hour — so pointing it at a petabyte? Maybe not yet. [Audience question about running it repeatedly.] Oh, just once. Running it again will make no difference, or sometimes you'll discover a little bit more — there was a good reason for that which I don't quite remember; I think it has to do with the extent search — but that's generally what you'd expect: you get it all out, and then that's basically it. Virtual machine images were actually one of the first use cases this was presented with, and I'd say people sometimes just need to be aware that this is fragmenting your virtual machine image. For a lot of people, if it's mostly read-only, that's great — they love it. But if you have a really busy disk, you probably don't want to do this. Make sense? Cool. I don't like angry users. [Audience question about SSDs.] I personally haven't seen an enormous impact yet. I'm waiting for it, but I haven't seen anything myself. I think with SSDs, taking the seeks out of the picture really helps. [Audience question about metadata.] Actually, I'm talking about both of those cases. Yeah, it will — absolutely. Now, "blow up" is a strong term I don't want to use, but you will be rewriting the extent pointers — the extent data items I talked about.
So presumably you could be splitting one up, and that might introduce some metadata overhead. Not for a single file, no, but it's something to be aware of if you're running it on a lot of files. Does that make sense? No, I don't have an actual number — it's more of a "just understand that this is what happens." If it happens once, I agree with you, it's not really a big deal. Leaves are about 16k in Btrfs, so there's plenty of room in a leaf for extra extent pointers. But it's one of those things where if you do it to one file, ten files, a thousand files, things could blow up, basically. Makes sense? Any more questions? [Audience question about data structures.] I'm a kernel hacker, so we just copied the RB-tree code from the kernel, and I store the hashes in an RB tree. So in duperemove, I have an RB tree that I keep the hashes in, and then I use a SQLite database on the back end to store them temporarily. Oh — RB tree, a red-black tree; it's a balanced binary search tree, basically. So there are two modes duperemove can run in. If you don't give it a file back end, it's just going to use all the memory it can — in that case you load up an enormous RB tree. I haven't had a problem so far, but I guess that's the thing. If you give it the file back end, we still use a tree, but it stores an enormous amount less: we basically keep just the blocks we know we've found duplicates for, to make subsequent searches faster.
Okay. Does that make sense? [Audience question about physical versus logical extents.] Right — so what duperemove passes up is not "this is what I know is on disk as an extent," because you're right, it doesn't have that. It can actually discover that information via FIEMAP, but it's not something that comes naturally. You're absolutely right: we're passing up the logical extents, not the physical extents. Duperemove works on logical extents, passes those into the kernel, and the kernel resolves them and figures out where they are on disk. [Audience follow-up.] Yeah, absolutely — that's all handled in the clone code, so it's the same as cloning one portion of a file into another: you're doing the exact same amount of work. For example, as you pointed out, if there's already an extent where you're cloning into, you're going to blow it away, and you might wind up splitting it into two extents if it's larger than the area you're cloning into. You can go from one extent to three extents for one portion of a file: the two extents left over from the old extent, plus your newly deduplicated third extent. That gets into the overhead we were talking about. It's basically the trade-off you make. Again, on a per-inode basis it's not really a big deal, but if you have a really busy virtual machine, maybe — I don't know. It's something to be aware of and understand. [Audience question about memory usage.] Well, now, with the SQLite back end, it's pretty small — oh, before that?
Before that, you could basically do the math: I don't remember the exact overhead, but say it was a couple hundred bytes for an RB-tree node. Then if I have a terabyte that I'm scanning, divide that by 128k and multiply by the per-node overhead. [Audience follow-up.] Yeah, it would be something like that — exactly. And that's if you run it without the temporary back end. It turns out — running on this system, with the nice SSD and whatnot — I haven't found a really big performance difference from using the SQLite back end, which is pretty nice; I was happy with that. I'd say it adds a few minutes, honestly, going from memory. Anyway, I have a lot of these numbers on the wiki as well. Every time I do a release, this is the test I run just to make sure nothing broke: bring in a patch, set this off, and then when I release I'll re-run it and put up the numbers, so I can track how we've improved or regressed. Cool. Any more questions about this slide, or anything in particular? All right — and I think that's getting near the end of the slides. This is essentially the bugs I found. Nothing's perfect, right? You ship something and you find bugs in anything. The first thing: kernel locking is complicated. When we're doing this, we're locking two inodes down, and that almost never happens in the kernel. The closest you get is a rename, where you can lock multiple inodes — and even then you don't lock down the data for the inodes.
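(Editor's sketch.) The back-of-the-envelope memory estimate the speaker walks through — blocks scanned times per-node overhead — in runnable form. The 200-byte node size is a guess standing in for his "a couple hundred bytes":

```python
def hash_table_overhead(scan_bytes, block_size=128 * 1024, node_bytes=200):
    """Rough in-memory cost of tracking one hash node per scanned block.

    node_bytes is an assumed RB-tree node size, not a measured figure.
    """
    return (scan_bytes // block_size) * node_bytes

tib = 1 << 40
gib = hash_table_overhead(tib) / (1 << 30)
print(f"~{gib:.2f} GiB of hash nodes per TiB scanned at 128 KiB blocks")
```

At these assumed numbers, a 1 TiB scan holds on the order of 1.5 GiB of hash nodes in memory — which is exactly why the optional SQLite-backed temporary file became necessary once data sets outgrew RAM.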
So this was an exercise in nesting an enormous number of locks — I'd say about four of them, including the inode mutex. It's probably not super surprising that we had one or two issues. At some point, clone was locking in the opposite order, which caused problems if you mixed a clone and a dedupe — that one I found just by reading the code. The one that actually showed up in the wild: extent-same, the ioctl, was taking locks in the opposite order from readpage, and that manifested as rsync hanging on some people's systems. So I'd get a bug report — "why is rsync hanging on the system I'm deduping?" — and I'd go look: oh, we're hung in readpage. Huh, I wonder why. That was probably the biggest bug I fixed that you might have encountered if you'd used this before. Another interesting one: we drove rsync crazy, because clone changes mtime and ctime on the target inodes, and rsync does not like that, because the files look changed. So the tool gets out there, people give me feedback, and one of the bits of feedback was: "this is great, it's deduping — why are my backups so slow now?" Okay — maybe we shouldn't do that. So I did a kernel patch where we simply skip that timestamp update for dedupe. Really easy, but important. The other big one: we weren't deduping the tails of files. If the tail of a file wasn't aligned, I was aligning the request down, and that turned out to be incredibly wasteful: all these files would have a tiny tail that was never deduplicated, so we could never free those extents. That was basically a straightforward bug-fix patch. And all of this is fixed in the latest kernels.
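(Editor's sketch.) The lock-ordering bugs above are the classic ABBA deadlock: two code paths take the same pair of locks in opposite orders. The standard cure — the discipline the clone and extent-same paths had to agree on — is to always acquire multi-lock sets in one global order. A toy illustration in Python (ordering by `id()` here; the kernel orders by inode, not Python object identity):

```python
import threading

def lock_two(a, b):
    """Acquire two locks in a globally consistent order to avoid ABBA deadlock.

    Whichever order the caller passes them in, acquisition order is the same.
    """
    first, second = sorted((a, b), key=id)
    first.acquire()
    second.acquire()
    return first, second

def unlock_two(first, second):
    """Release in reverse acquisition order."""
    second.release()
    first.release()
```

If one task calls `lock_two(a, b)` while another calls `lock_two(b, a)`, both still try `first` before `second`, so one simply waits instead of each holding the lock the other needs.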
Are there any questions about this? Okay. So, down to dedupe advice — I've touched on most of these already. The one I haven't touched on: don't dedupe your backups. When you do that, you're putting everything in one place to fail, right? You have a single point of failure. Understand that disks fail, and just one part of a disk might fail. So that's another reason you want to be careful what you dedupe: if it's something critical, it might be better that it's duplicated on your disk. [Audience question about rotational drives.] Yes, on those you could also get more seeking, because you're fragmenting the files intentionally to deduplicate them. [Audience point about RAID.] Well, okay. My only objection to deduping backups is that you're taking potentially critical data and making it a single point of failure. But you're right, if you have RAID or something underneath, it's maybe more of a guideline. And then you have more backups too, right? If you're replicating your backups, then sure, you can dedupe them. It's definitely case by case. I just don't want the poor person who, like me, just backs up to USB storage to dedupe that, because I don't trust it to be super reliable. [Audience mentions rsync and rsnapshot.] Fair enough, right. In fact, I use rsnapshot, so I'm kind of deduping my backups already in that sense — at the file level. Funny to say that. Any other questions on that part so far? Cool. Oh, yeah, please.
Yeah, so there's a defrag process in Btrfs. I haven't had enough time to look into it, but it's one of those things I need to look into, because it could be something I could tell users: hey, run your dedupe, and then afterwards run this defrag command to put everything nice again. But I'm not clear yet on exactly how that defrag works, because I haven't had the chance to look at it. I know they exist fine together, because I've had both turned on and nothing's blown up, and there's no reason why they shouldn't. But I don't know enough about it yet. That's my hope, though, that I could fix that with it; we can talk about it afterwards. Does that generally answer it? Cool. So the last thing that I found, and this was interesting, was bookend extents in Btrfs. Who knows what a bookended extent is? Yeah, okay. I didn't know either — I was like, oh, that's what you call this? Bookending? Okay, I get it. So Btrfs has copy-on-write files. Say you've got a large extent and someone writes into the middle of that extent. When Btrfs CoWs that extent — let me make sure I get this right — it might not copy the entire extent. If the only part that changed is in the middle, it writes the newly changed data in its own extent, and the old extent becomes what's called a bookended extent: only portions of it are referenced now. The extent data items I referred to earlier are rewritten so that as you read the file logically, you go: okay, jump into this extent, read this, jump to the new extent, then come back. Well, those extents cannot be split in Btrfs, so you lose that space for the entire time that extent is referenced. Now, this doesn't happen that often, it turns out.
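The accounting behind a bookended extent can be sketched with made-up numbers — this just models the arithmetic described above, not Btrfs's real on-disk format:

```python
# Model: a 1 MiB CoW extent, then a 4 KiB overwrite into its middle.
# Btrfs writes the new 4 KiB as its own extent and keeps the old 1 MiB
# extent whole; the file now references two slices ("bookends") of it.
EXTENT = 1024 * 1024
WRITE_OFF, WRITE_LEN = 512 * 1024, 4096

# Slices of the original extent still referenced by the file:
left = (0, WRITE_OFF)                                            # before the write
right = (WRITE_OFF + WRITE_LEN, EXTENT - WRITE_OFF - WRITE_LEN)  # after it

referenced = left[1] + right[1]
pinned_unused = EXTENT - referenced  # overwritten bytes still held on disk

print(f"{pinned_unused} bytes pinned by the bookended extent")
```

Here only 4 KiB is stranded, but since the extent can't be split, a large overwrite into a large extent can strand proportionally more, which is the space issue discussed next.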
One of the first things I learned doing file systems is that people almost never rewrite their files. 99.9% of the time a file is written once and that's it. But it's definitely something I found, so it'll be something I'll be looking at fixing, and it should be an exciting project. Again, it doesn't usually turn out to be a problem, because most of the time the extents are really nicely laid out, and the duplicates are nicely laid out too. But it is an issue, and for more than just deduplication, because it's a space issue: you can potentially have hundreds of megabytes, even gigabytes, of data in an extent that's no longer referenced but is pinned in the file system. It doesn't happen too often, but it's something to understand. Any questions about this stuff? Cool. How am I doing on time? Oh wow, okay, cool. All right, planned features. I've talked about most of this already. The incremental dedupe is, for me, really the feature that completes this; it'll make me the happiest. We originally didn't support deduping within a file, and I don't remember exactly why that wasn't possible. It's possible now, so the next version will have support for doing that. Right now that just means a file doesn't dedupe against itself, if that makes sense; it's possible, the code just needs fixing. OCFS2 also does copy-on-write files, so OCFS2 at some point will get a patch for this. [Audience question about hole handling.] Yeah, so that's turned off by default right now, and that's basically for other reasons. We fiemap the file, and that happens inside duperemove, so the kernel just understands the holes as they are. This is part of the clone code, right — it won't find the extent. Yeah, absolutely.
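On the hole-skipping point: duperemove itself uses fiemap, but as an illustration of how a userspace scanner can skip holes, Linux also exposes SEEK_DATA/SEEK_HOLE through lseek. This sketch uses that simpler interface — an alternative, not what duperemove actually does:

```python
import os

def data_segments(path):
    """Return (offset, length) pairs for the data (non-hole) regions of
    a file, via lseek(SEEK_DATA)/lseek(SEEK_HOLE). A dedupe scanner can
    hash only these regions and skip the holes entirely."""
    segs = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        off = 0
        while off < end:
            try:
                data = os.lseek(fd, off, os.SEEK_DATA)
            except OSError:          # ENXIO: no more data past off
                break
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segs.append((data, hole - data))
            off = hole
    finally:
        os.close(fd)
    return segs
```

On a fully dense file this returns a single segment covering the whole file (there is always an implicit hole at EOF); on a sparse file the holes simply drop out of the list.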
So we fiemap the file before we scan it, right. But yeah, it's turned off, and that has nothing to do with holes — it has a lot more to do with what Btrfs marks as shared. Btrfs marks an extent shared if it has a refcount greater than one. And I tried to be clever: I said, well, if an extent is shared, hey, it's probably deduped already, I don't need to do anything to it. It turns out, though, it's marked shared whenever the refcount is greater than one, which means you don't know who it's shared with, basically. It just says it's shared. This portion of file A might be shared with file B, but it might not also be in file C, right. So those two things are rolled together in the code, unfortunately, and I just have to unroll them. And then I'll use fiemap to always skip the holes. Another thing I found out is that sparse files, luckily, aren't super common either — though you still have to support them. [Audience comment.] Yeah, that's true, absolutely. I mean, if you have a database, that's a different story. But in the world of, you know, I just untarred some stuff, or I just edited a file — most of the time you don't see them. Good question, though. Thank you. Anything else? Any more questions? Cool. All right. That's my GitHub page for duperemove; there's a wiki there, and there are packages for most distros on the Open Build Service site. So, yeah, any questions in general, then? [Audience: for people who want to get into kernel work but find it frustrating — how do you get started, and what are the best tools you've used?] Right, that's a good question. So, I'd say for getting into kernel work: patches, obviously, right?
Put patches on the mailing list, right? Now, the Linux kernel mailing list has an enormous amount of traffic; I don't actually know that it's the best place to start, to be perfectly honest. I wouldn't say ignore it — definitely check it out — but it has an enormous amount of traffic. I would say find a project in the kernel that you're interested in. It doesn't have to be popular or cool, just something that's interesting to you, and learn it really well. The nice thing about the kernel — say you're interested in file systems, since that's what I can speak to — is that it's really well designed. There's a VFS inside the kernel, and when I first got into file systems, I was basically just filling in callbacks. But because it's all there, and because it's fairly well designed as a whole, once you've started in that small space it's easy to branch out. So I learned more about the VFS just doing file system stuff. Then, oh, look, OCFS2 initially didn't support cluster-aware mmap, so I got to learn about the memory manager by implementing that. Does that make sense? Does that help? Yeah. Pick something that you like that's in the kernel — a device driver, a file system — and just do small patches. [Audience question.] Okay, that's definitely out of my wheelhouse — is that the right term? But it sounds like you have a place to start, so that would be something to check out. Yeah, kernelnewbies is good. For IRC, there's a #kernel channel — it's IRC, just go on there. And let me see, the Btrfs IRC: LinuxNet is what the network is called, right? It's in my IRC client and I forgot. Yeah.
But yeah, definitely there's a kernel channel there, there's a Btrfs channel, there's an OCFS2 channel. I'd say check that out. And a lot of people send small patches and get started that way — really small, one-liner stuff. [Audience: what about printk?] Oh, printk. Yeah, it depends on the debugging. If I'm debugging the kernel on something I can run myself, printk is the easiest, obviously. Honestly, though, for something like a customer machine where they have a hang, you've got to start with SysRq. You look at stacks: you do a SysRq-T to get the stack of all the processes on the system, you go look at that, and you work backwards from there. For example, the rsync bug I was referring to — the way I figured that one out was I got a stack trace from the kernel and said, well, look at that: this guy is sitting in the acquire-lock subroutine, and this other guy already has a lock, because he's sitting in a different acquire-lock subroutine. And then your head goes, oh, okay — let me go look at the Btrfs readpage, let me go look at extent-same. And lo and behold, we're doing it in the opposite order. So, things like that. GDB for the kernel exists; I haven't used it recently. I make full use of GDB in user space, but for the kernel, not so much. [Audience: virtual machines?] Virtual machines are very useful, because they're fast — I have them on SSD. I've got maybe four, and I'll absolutely test kernel problems in those. They're great because it's very easy to iterate: they reboot very quickly, you don't take your own machine down, you don't take a test machine down.
You don't have to wait for physical hardware to boot. And presumably you'd be able to pause one, I guess — that'd be kind of neat, but I haven't had to do that, really. But yeah, those are the tools — you'd be surprised how useful printk can be. Any other questions? [Audience: I had some general Btrfs questions. I've heard a few things today about Btrfs. Facebook is pushing kernel code, saying hey, Btrfs is looking pretty good, it's almost there for us, and they've pushed up a lot of bug fixes. Something similar from the GlusterFS side — there were some positives there on the server side. But then there was a Btrfs story about someone out in the field who'd heard about data loss. Ouch. So where are the good use cases, or what level of sophistication does it take?] Yeah, so Btrfs moves very quickly. You're absolutely right — it's actually quieted down quite a bit, but it was under very heavy development for a while. And I think part of it is that you're always going to hear the bad things: people have bad experiences, and that's why you don't corrupt data — because they don't forget it, it comes up every time. Exactly. The other thing I'd say is that at SUSE, we curate the features. In fact, with Service Pack 1 we actually have a lot more enabled. But, for example, there were some problems with compression when we released SLES 12, so we didn't turn compression on.
We curated the features: we looked through them, we ran it ourselves — what breaks a lot, what do we not understand — and that's how we got around a lot of those sore points, where people might enable a feature and then find out that, while this portion of Btrfs is nice and stable and has had a lot of work on it, that other part is brand new and it's just trashing stuff. So I'd say that would be the biggest thing for our customers, which is mostly what I can speak to. So, like you're saying, on SUSE it's very solid because we don't let it stray. That doesn't mean you couldn't look and see anyway — it's not hidden, we don't hide it. All our code is up there, so you could easily find out what we turned off. Alrighty. I'll take any more questions afterwards, because I really shouldn't keep anyone here any longer. So thank you all very much for your time. I appreciate it. Thank you.