about this thing called the G-Union. It's a new geom utility in the FreeBSD kernel that I wrote because I had an itch that needed to be scratched, as you'll see. And for those of you that were in my class, the first couple of slides are going to look very familiar. For the rest of you, when you hear geom, you may not even know where the heck that is in the kernel, so I figured I should start out with the big picture: the whole kernel on one slide. The system calls come in at the top, and the hardware is down at the bottom. And as one Intel hardware engineer said to me, all that other stuff is the friction that makes our hardware look slow. But we like to think of it as FreeBSD. At any rate, you come in at the top and you go through file systems or devices or whatever, and geom is actually pretty far down in the kernel. You've gotten through your file systems, and at this point you are trying to actually get to the physical media. The geom layer is often referred to as the disk-management layer. So geom deals with things like RAID or striping, anything having to do with getting an I/O request ready to hand down to an actual device driver that's gonna take some action on it. Or potentially, if you're striping, it may have to split the request and send it to two different places. There are actually a whole lot of other little things that get plugged into the geom layer, and this G-Union is just another one of them. So you can think of it as another disk-layering piece. Just to get the gist of how the geom layer works, one of the things that it does is disk partitioning. When you talk about some particular slice of a disk, this is the place where we figure out where that is on the physical media and send the request down to the appropriate place.
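To make the partitioning idea concrete, here is a toy sketch of the offset translation a partitioning module performs. This is nothing like the real in-kernel geom API; the class, field names, and the example offsets are all made up for illustration.

```python
# Toy sketch: a partition module's job is mostly offset translation --
# remap an I/O request on a slice to the absolute offset on the
# underlying disk, refusing requests that run past the slice's end.

class Slice:
    def __init__(self, start, length):
        self.start = start    # first byte of the slice on the raw disk
        self.length = length  # size of the slice in bytes

    def translate(self, offset, size):
        """Map a slice-relative request to a disk-absolute offset."""
        if offset < 0 or offset + size > self.length:
            raise ValueError("request runs past the end of the slice")
        return self.start + offset

# A hypothetical slice starting 512 MiB into the disk, 1 GiB long.
s2 = Slice(start=512 * 2**20, length=2**30)
```

A request at slice offset 0 lands at byte 512 MiB of the raw disk; a request past the 1 GiB mark is rejected rather than spilling into the neighboring slice.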
And what actually happens is that this layering, or the labels, get set up as the system is identifying the disks and configuring them. At the beginning, the code goes down and looks to see what hardware is there. It comes across a disk, and this being the first one that it's found, it decides, all right, this must be disk da0. Then the geom layer announces that a new disk has shown up. Part of that is that identifying the disk da0 causes an entry to be put into /dev, which is /dev/da0. Now the geom layer is notified that it's there, and it hands the disk off to various different things that have indicated they're interested in disks. And they taste the disk, which really means they read the first few sectors of it and look to see if there's anything in there that they can identify. And sure enough, the GPT taster finds a GPT label there, and this particular GPT label has two slices. So it creates da0 slice one and da0 slice two and makes the corresponding entries in the /dev directory. Now those look like two more disks that just arrived, and the tasting code goes in again. This time, at the beginning of slice one, the tasting code finds a BSD label. And so the BSD label then produces, in this case, /dev/da0s1a through /dev/da0s1h, and they all now show up. Of course they look like disks too, so the tasting layer will go again, but at this point it doesn't find anything else. So the whole process stops, and we have all of that bit of the disk. Now, other things might be there too: for example, if we have two disks that are being mirrored, then there can be a label that says, well, I am part one of this mirror, or I am part two of this mirror, and when both of them have been found, that will cause the mirror's /dev entry to be created.
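The tasting loop just described can be mimicked with a toy sketch. The taster functions, the labels dictionary, and every name here are invented for illustration; the kernel's real tasting interface looks nothing like this.

```python
# Toy sketch of tasting: each new provider is offered to every taster;
# a taster that recognizes a label emits new providers, which get
# tasted in turn, until nothing recognizes anything.

def gpt_taster(name, labels):
    if labels.get(name, "") == "GPT":
        return [name + "s1", name + "s2"]      # two slices, as on the slide
    return []

def bsdlabel_taster(name, labels):
    if labels.get(name, "") == "BSD":
        return [name + p for p in "abcdefgh"]  # partitions a through h
    return []

def taste_all(root, labels):
    """labels maps a provider name to what's found at its start."""
    devs, queue = [], [root]
    while queue:
        name = queue.pop(0)
        devs.append(name)                      # its entry appears in /dev
        for taster in (gpt_taster, bsdlabel_taster):
            children = taster(name, labels)
            if children:
                queue.extend(children)         # new "disks" get tasted too
                break
    return devs
```

Running this with a GPT label on da0 and a BSD label on its first slice yields da0, da0s1, da0s2, and da0s1a through da0s1h, mirroring the walkthrough above.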
All right, so the geom layer can do things like disk partitioning, and it can do the aggregation of disks, as I just described, for mirroring or RAID or whatever. Also, since all the I/O is coming through this layer, it's an ideal place to collect I/O statistics. So we collect those, and those are the statistics that ultimately get propagated up; for example, when your process exits, if you ask for the statistics, you can see how much I/O it did. It's also an ideal place for doing I/O optimization, things like disk sorting, which of course is important for spinning rust, not so much for solid-state disks. But again, this is the layer in the system where that kind of thing happens. Here is an example of how mirroring would work. In this instance we've found two disks, da0 and da1, and they have GPT labels on them, so we get the slices. Then it turns out that at the beginning of da0 slice one and da1 slice one is a label that says, this is a mirror. So those two are put together as a mirror, and then again that looks like a disk, so tasting happens there, and in this particular instance it finds a BSD label. So finally, up at the top, we have /dev/mirror/da0s1b and so on. All right, so again, the mirroring of this underlying partition just happens at this layer in the system, and there are all kinds of ways you can configure your mirror; that's all in the label that comes at the beginning. Do you favor one disk over the other? Maybe one is flash and the other is spinning rust, and so you wanna try to do reads off of the flash, or whatever. At any rate, that level of stuff is all being done here at the geom layer. Now, how does geom actually operate? Well, when geom was first created we were still back in the FreeBSD 4, 5, 6 era, and those of you that were around in that era know that the 4-to-5 transition was the great move from uniprocessor to multiprocessor, and that meant a whole lot of locking had to happen.
It was a very painful transition, because you can't really do it in bits and pieces; you sort of just have to bite the bullet and do it. And the problem, of course, was that all of these underlying device drivers were not set up for multi-threading. They didn't have the locking added, and to have to put all that locking in at once was just a mind-boggling task. So to avoid that, the way geom worked was that it single-threaded the requests that were coming in to the devices. A request would come down, go through the geom layer, and then be queued to be run by what's called the g_down thread. So there was a single thread that would take these requests off the queue and run each one down through the rest of the stack and through the device driver, and then when the I/O completed, a g_up thread would bring it back up through the code. In this way, nothing below the geom layer needed to be multi-threaded; the g_up and g_down threads would serialize everything that was going through. Two things came from this. One is that modules couldn't go to sleep, because if they went to sleep they would put the g_down thread to sleep and everything would come to a screeching halt. So you really don't wanna let that happen. Similarly coming upwards, of course. And you don't want them to compute excessively. So you don't wanna have one of your modules do something like encryption, because everybody else is gonna wait while your block gets encrypted before anyone else gets to do more I/O. So again, you couldn't put things in there that needed a lot of CPU time. Now, of course, the worry was that this was gonna slow down the performance of the system dramatically, so we had to have a way of incrementally putting the locking in. What happens is that as a module gets SMP-locked, it can set a flag that says, all right, I can deal with multi-processing, and that means that we now do what's called direct dispatch.
So the thread coming in doesn't have to queue up on the way up or down; it just goes straight on through, down into the lower layers and the device driver and so on, and since it's all been SMP-locked, that's going to work. So if you look at, for example, version five of FreeBSD, almost nothing is direct dispatch, and if you look at the system today, I don't think there's anything that isn't direct dispatch. But that's sort of the history, and if you've been looking at it and wondering, what is this thread called g_up, and what's this other thread called g_down? Well, they're mostly sitting there being bored, because they almost never get used now. Okay, two other key things that geom is doing for us. One is that if a provider goes away, an error gets propagated up the stack. So if your disk goes away, say it's something on a USB stick and you pull it out, then it's not there anymore, and so geom will propagate an error back up: any I/O that was in progress will be finished with EIO, that EIO error will be propagated up, and the higher layers have to deal with it. And anything that was queued, if there was a stack for g_up or g_down, anything on those queues will also be errored out. The other thing is that when a provider changes, this is called spoiling, and that too gets propagated up the stack. Spoiling is something such as the tasting identifying the GPT label, so the fact that there are now these two new disk entries is gonna get propagated up, and it's that propagation going up where things like the entries in /dev get made and other operations may occur based on that. Well, now that you're all well set with geom, let me talk about a couple of other things, tools really, that you need to know about before I can finally get to G-Union. The first of these is the memory disk, and a memory disk is kinda what it sounds like.
It's something that looks like a disk and acts like a disk and quacks like a disk, but actually isn't a disk; it's really just memory. You create and operate on memory disks with the mdconfig command, and there are essentially three different kinds of memory disks that you can allocate. I'm gonna describe the details of these three on the next slide, but there's one that's essentially just malloc'ed kernel memory, solid memory in the kernel. There's virtual kernel memory, which is virtual memory, i.e. it's backed by swap, so the active stuff will be in memory and the stuff that's not active is gonna get pushed to the swap area. And then finally there's one that is virtual memory but is actually backed by a file. The one that's backed by swap just disappears when the disk goes away, but with the one that's backed by a file, the file keeps whatever the contents were, so you can recreate the disk by remapping that file. Okay, as far as geom and its consumers are concerned, it just looks like a disk. So what are the details on these memory disks? Well, with malloc, the changes are all held in kernel memory, and the size is limited to the amount of kernel memory that's available in a single kernel malloc. I don't remember exactly what that value is, but it's some relatively small number of megabytes; you're not gonna be generating any gigabyte-sized malloc-based disks. Honestly, I'm not quite sure why you'd want that. Maybe if you're doing some kind of a benchmark and you wanna know that it's in memory, that would guarantee that it is in memory, but other than that it's not a terribly useful one, because usually you want disks that are the size of other disks. Okay, in swap mode the changes are all held in the buffer cache, and pages get pushed out to the swap area when the system gets under memory pressure; otherwise they stay in memory.
So for the most part, when you first touch a sector or multiple sectors on one of these disks, pages get allocated to it, and they pretty much just stick around unless we start getting short of memory; if we do start getting short of memory, then the pages you haven't been using get pushed out to the swap area. All right, and then vnode mode: instead of just having an anonymous object in the kernel holding the data, it's a regular file. Typically, if I wanna save the changes over time, I create a large empty file. Let's say I need a one-gigabyte virtual disk. So I just say truncate -s 1g foo, where foo is whatever the name of the backing file is gonna be, and now I have a giant file that has no pages in it. Then I tell mdconfig, all right, map that file, and any time you write stuff, that part of the file will get allocated, so it will just be a sparsely populated file. Of course, the more of it you use, the more space it ends up consuming. But the difference from the swap-based one is that when pages are reclaimed, they get written out, so the contents are actually in the file system, and when you break down that disk, when you tear it down with mdconfig's destroy option, it will push any of the dirty pages out to that file. So later you can come back and reattach to that file, and it comes back looking just the way it looked before. Okay, so for swap and vnode, the space used by the memory disk is based on the amount of data that's written to it. Finally, we can get to G-Union. Okay, the G-Union module is used to track changes to a read-only disk on a writable disk. So it looks a little bit like the union file system, except that it's being implemented at the disk level instead of at the file-system level. And what we're gonna do here is take one disk, which is gonna be treated as read-only.
We're gonna have another disk on top of it which is gonna be read-write, and now what happens is that when write requests come in, they get intercepted and stored on the top disk. So we don't modify the lower disk. When a read request comes in, we first check to see if that block is in the upper layer; if it is, then you get the thing that's in the top layer, and if it's not there, then the read just falls through to whatever is at the lower layer. So it's pretty much exactly what you'd expect, except that it's being done at a block level rather than at a file level, which is the way the union file system does things. Okay, so in the picture we have here, we have two disks. In this case, I'm using a memory disk for the upper layer and a real disk for the lower layer. So we have da0 there, and it's got a couple of partitions, slice one and two. Now I take the memory-based disk and slap a label on it so that it'll have the same size partitions, and the union is gonna be made with the real disk on the bottom, treated as read-only, and the writable memory disk on top. And so when the union gets created, we'll end up with /dev/md0s1-da0s1.union, and that will be the name of a disk that looks just like the real one, but is being tracked, not mirrored, by the memory-based disk. All right, so what are the actual operations that we have with G-Union? It's actually pretty straightforward. We have create, which sets up the union provider on top of the two given devices; assuming that it succeeds, the new provider appears with the name you saw on the previous slide. And then we have the inverse of that, which is destroy, which disassembles the two pieces.
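A minimal sketch of this interception logic, together with the revert and commit operations the module provides, can be written against two byte buffers and a one-bit-per-block bitmap. This is a toy model, not the real provider-based implementation; all the names are made up.

```python
# Toy sketch of the G-Union idea: block-level copy-on-write over a
# read-only lower disk, with a per-block "written" bitmap as the only
# bookkeeping.

BLOCK = 4096

class Union:
    def __init__(self, lower):
        self.lower = bytearray(lower)              # treated as read-only
        self.upper = bytearray(len(lower))         # the writable top disk
        nblocks = len(lower) // BLOCK
        self.written = bytearray((nblocks + 7) // 8)   # one bit per block

    def _mark(self, b):
        self.written[b // 8] |= 1 << (b % 8)

    def _is_set(self, b):
        return self.written[b // 8] >> (b % 8) & 1

    def write(self, blkno, data):
        """Writes are intercepted and land only on the upper disk."""
        self.upper[blkno * BLOCK:(blkno + 1) * BLOCK] = data
        self._mark(blkno)

    def read(self, blkno):
        """Reads come from the upper disk if that block was written,
        otherwise they fall through to the lower disk."""
        src = self.upper if self._is_set(blkno) else self.lower
        return bytes(src[blkno * BLOCK:(blkno + 1) * BLOCK])

    def revert(self):
        """Discard all changes: just zero the bitmap."""
        self.written = bytearray(len(self.written))

    def commit(self):
        """Push every written block down to the lower disk."""
        for b in range(len(self.lower) // BLOCK):
            if self._is_set(b):
                self.lower[b * BLOCK:(b + 1) * BLOCK] = \
                    self.upper[b * BLOCK:(b + 1) * BLOCK]
```

Note how cheap revert is here: the upper data is still physically present, but once the bitmap is zeroed every read falls through to the lower disk again.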
So you go along and you create this thing and you do a bunch of stuff with it, reading and writing, whatever, and then you're done, and now you ask, all right, did that do what I wanted it to do? And the reason I actually created this is that periodically people send me disk images that cause fsck to choke. It tries to fix the file system, and either it fixes it wrong, or fsck blows up, or whatever. And they inevitably send me some disk image that's four terabytes, and so here I have this four-terabyte image of the disk, and I run fsck on it and fsck croaks. Okay, so I put some debugging in fsck, or I scratch my head a bit, and now I wanna try something else, and I have to make another copy of the four-terabyte disk, which takes 25 minutes, and then I run it again. After you've waited 25 minutes a few times, you get really bored with that. So I created this so that I could just put the union on top and run fsck on the union disk, and when it dies, I just say revert, and revert discards all the changes made in the top layer, reverting to the original state of the lower device. This takes less time than it takes to type the carriage return, because all it has to do is zero out an array of bits, one bit per block, and then nothing has been written to the top. So you just sit there with the gunion revert command in your history: you run your hacked fsck, it does bad things, you type bang-gunion, it throws all those changes away, and you do the next one, and the next one, et cetera.
Now, if you say, well, that's great, but who besides McKusick ever deals with fsck, you'd have a point there; it seems like no one else wants to take it on. But there are other times when it could be useful to you. For example, you have some disk and something bad has happened to it. One of the things you can say to fsck is fsck -y, which just answers yes to everything, and then it goes roaring through and does that. And one of the states you can end up in is that it can just decide that your root directory is bad, so it ends up creating a brand-new empty root for you, and all your files end up down in lost+found. And that might not be the optimal solution you were hoping for. So you might actually wanna go back and say, well, there were a couple of places where I said yes to fsck and I didn't really mean to say yes. Well, if you use G-Union, you can just revert that and say, okay, -y was a little bit over the top, and then go back and keep working on it until it's exactly the way you want it, and then you say, okay, finally it did it. Then you want the commit command, and the commit command writes all the changes that are in the top layer to the bottom layer: make the lower disk look like the union. And so when the commit is done, voilà, you've got the new disk image. Okay, so you see, it's not just me that's gonna use this. Now, the upper disk has to be at least the size of the disk that it covers; that should be kind of obvious. The other points, though, are that the union metadata exists only for the period of time that the union is instantiated. So it's important to commit the updates before you destroy the union, because otherwise they're just gone. And if the top disk is using 4K sectors, then the bitmap that keeps track of what's been written adds only about half a percent to the size of the disk that it covers.
So it's possible, although not currently implemented, to save the union metadata between instantiations of the union device. Out past what would have been the end of the disk, you could just write that bitmap, and of course if it's an MD disk that's mapped by a file, then you could just reload that file, and the union would be able to pull the bitmap back and would know what the changes look like. Okay, but since I didn't need that and nobody else has asked for it, I didn't bother putting that in yet. All right, so let's go through an example of this. What are the commands? The first one creates one of these providers: you say gunion create -v, where -v just says give me verbose output about what you're doing, and then md0 is the top and da0p1 is the bottom. That will then create the union disk, and then you can just mount that union disk on /mnt, assuming it has a file system on it. If it doesn't have a file system, you can newfs it and then you can mount it. Okay, then you proceed to make your changes in the mounted file system on /mnt, and if they're successful and you wanna keep them, then you just unmount /mnt, and then you can do a gunion commit of that union, and that makes all the changes get pushed to the lower layer. Yes, sorry? Still didn't hear you, sorry? Is there a reason you have to unmount it? Oh, is there a reason you have to unmount it? Yes, because when it's mounted you may have state in the cache, particularly dirty blocks, and you wanna make sure that all the blocks have been written to the disk before we do the commit. In fact, if you try to commit while it's still mounted, it will complain and refuse to do it. Okay, if on the other hand you're not happy with what ended up on /mnt, then you can unmount it and revert it, and that will just put everything back to the way it was.
And then when you're done, you can just eliminate it: you unmount it, gunion destroy, and all the uncommitted changes are discarded when that destroy command is done. All right, so what is this useful for? Well, I've already alluded to the reason I created it in the first place: when you're dealing with large disks with corrupted file systems and you're not quite sure about the repairs, you create the G-Union disk, you run fsck -y or whatever, and if that fails, you just revert the changes and try again, and if it's successful, you do the commit. In this case it's not mounted, so you don't have to do the unmounts. Another use, though, that I actually found: it's kind of a poor man's ZFS, if you will. You can place the upper disk over the one holding the file system that you wanna upgrade, mount that, and then run the upgrade on the top layer. If you're happy with the upgrade, nothing bad happened, then you can commit it, and if you're not so happy with it, then you can revert it, and that brings you back to where you were. Now, of course, if you've got ZFS, then you just make a snapshot and then a clone, update that, and push it across. So as a practical matter, if you're running ZFS you don't need this, but if you're on a system running UFS, for example in an embedded environment, this is actually a safer way of going about upgrades. I mean, of course the upgrade script always works, but just in case it doesn't, you're not sitting there hosed. And even on an embedded system you're likely to have enough virtual-memory capability that you'll be able to create the MD disk and use that as the upper layer, okay? Well, there's one other geom module that I'd like to talk about, since I've got everybody's attention here and I have like five minutes left, and that's the geom NOP module, gnop.
It's a geom module that does absolutely nothing, or at least that's what it used to be. It was originally written to provide the boilerplate needed to create a geom module. So when I wanted to create G-Union, I didn't have to start with a blank sheet of paper; I just made a copy of the NOP module and started adding the functionality for G-Union. Except when I first looked at it, it was like, wait a minute, this isn't a no-op, it does all kinds of stuff. And it turns out that the features that have been added to it are useful, particularly if you are into benchmarking things like disk performance. So you can do something where you export just a subset of the underlying provider, so it's the poor man's way of putting a disk label on it. But there are other things too. For example, when I was doing the work to deal with failing disks by unmounting the file system rather than panicking the system, you can add the gnop layer on top of your regular disk, mount that, and then say to the gnop layer, just die. And it's as if the disk just keeled over: any I/O in progress fails, and any attempt to do anything fails. So it's the dying disk without actually having to unplug disks, which they tend not to like very much. Another feature that's more for benchmarking is to apply a possibly variable delay to reading and writing through the layer, so you can say, randomly make a read take longer, randomly make a write take longer, and you can specify whether it's one in a hundred, or every single one, or every other one, or whatever. And you'd be surprised, because of course what this is gonna do is cause I/Os to be completed out of order, and there are a whole lot of things that kinda expect things to be done in order, and when they're not done in order, they kinda don't work very well, at least from a performance perspective.
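The delay-and-failure-injection idea can be sketched as a wrapper around a backing store. This is a hypothetical model for illustration, not gnop's actual implementation (gnop is configured from userland with the gnop(8) command); all the names here are invented.

```python
# Toy sketch of gnop-style fault injection: with configurable
# probabilities, delay or fail each request passing through the layer.

import random
import time

class NopWrapper:
    def __init__(self, backing, delay_prob=0.0, delay_s=0.0,
                 fail_prob=0.0, seed=None):
        self.backing = backing
        self.delay_prob = delay_prob   # chance a request is slowed down
        self.delay_s = delay_s         # how long to slow it down
        self.fail_prob = fail_prob     # chance a request fails outright
        self.rng = random.Random(seed) # seeded for reproducible tests

    def read(self, off, size):
        if self.rng.random() < self.fail_prob:
            raise OSError("EIO: simulated read failure")
        if self.rng.random() < self.delay_prob:
            time.sleep(self.delay_s)   # completions now finish out of order
        return self.backing[off:off + size]
```

Seeding the random generator makes a failure scenario reproducible, which is exactly what you want when testing error-recovery code.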
So again, you can do some really interesting tests with this variable delay. And then there's another one where you can specify a probability of just flat-out failing reads and/or writes. And again, this is one where you're essentially simulating a disk that's kinda working but not completely working, and there are a whole lot of programs that just don't check for EIO, so bad things start to happen. Okay, so it's great for testing error-recovery code, delay-handling code, and the correctness of out-of-order I/O. And by the way, if you actually wanna create another module, we should really have a true no-op geom module, but gnop is still pretty close to what you need to use as a template. All right, that is what I have to say, so I will entertain any questions that people have. Yeah, so the question is, how difficult would it be to write out the bitmap. Well, if you're using MD and you simply have it associated with a file, then that file will just be a sparse file. So it'll look like it's huge, but the amount of disk space it uses will be proportional to the amount of writing that you did. If you actually have other media, then of course the other media will just have that data on it. Right, so the question is about writing out the bitmap state. It's on my to-do list; it's like a day-or-two project, so it'll probably happen at some point. I'm more or less waiting for somebody to come along and say, you know, it'd be really nice if that happened. Yeah, well, it's a little too easy for that. Okay, so the question is, is it possible to write onto the lower layer. G-Union won't do that unless you do the commit command, and when you do the commit command, it will write the top changes to the bottom. Can someone else write to the lower layer? No, they can't, because any time a disk is open for writing, it has an exclusive-access bit set.
So in general, any disk that's open for writing, nobody else can open for writing. If you've ever tried to run fsck on a disk that's mounted, fsck starts out by saying NO WRITE, because it tried to open the disk for writing and wasn't able to. It'll still run through, but it won't try to make any changes, because it can't. Is it possible to downgrade the union mount to read-only? Yes, you could, and if you do that, then you can do the commit without unmounting it, because downgrading it to read-only flushes everything. Yes, when you have a memory disk that's backed by a file, when you tell the memory disk to go away, before it goes away it makes sure that all of the pages that are in memory and modified have been pushed to the backing store. There should probably be an mdconfig sync option, which would just push everything out without destroying the disk, but at the moment the only way it gets flushed is when you destroy it. Yes, actually, yeah, that's a good point: if you just do an fsync on the file, that'll get it out, because the pages are associated with the file like they would be with any other file, so an fsync will do it. Good point. Yeah, the question is, does the top one have to be a memory-based disk? No, it can be a real disk. If you just happen to have an extra four-terabyte disk lying around, you're free to use it. Thank you very much.