So, while we wait for a minute before we start, just a quick introduction about me. I work with Red Hat as an SSME, which means senior maintenance engineer or subject matter expert, depending on whom you ask. My talk today is on thin pools, and I'll be talking about LVM, thin LVM, and whatever else you would like to talk about if we have time after that. It should be about 30 minutes, maybe 35 with questions. I think we should start because we have something else at three, right? Right, so welcome and thanks for joining. This is about thinly provisioned LVM. Jared already spoke about it. You probably compressed 15 minutes of my talk into two minutes of yours, but that's all good.

So, thin provisioning, that's what we're going to talk about. And before that, we'll talk about LVM for two minutes. Is there anybody here who doesn't know what LVM is? Okay, cool. So like the slide says, LVM is the Logical Volume Manager. It lets you resize your partitions on the fly. Every operating system has its own volume manager: Solaris has SVM, the Solaris Volume Manager, and VxVM is another well-known one. Linux uses LVM, and if you want to see what the latest LVM is all about, just go to the first link on the slide. You'll get the sources there, as well as some documents explaining how things are done in the latest upstream LVM. So basically, it lets you resize your partitions on the fly, which means that if you have partitions which are getting full, you can expand them, or reduce them if you think they were created too large in the first place and you don't need that much storage.

I'm going to be talking mostly about thin LVM. The thin capabilities have been in RHEL since 6.4, and like Jared already explained, thin basically means overallocating storage, promising users more storage than there actually is. My talk will be about how this is done and what can happen if it goes bad. It's also important to know that everything you use in LVM is basically a wrapper over dmsetup, and all the real work is implemented in the device mapper, so the thin stuff is the thin target, for example. Just like there are different device mapper targets for, say, multipath, or LUKS has its own device mapper target, LVM has its device mapper targets, so all the work is in the kernel. You interact with that using reporting and management commands, which we refer to as the LVM2 set of commands, and which you use to create, report on, look at, and delete logical volumes, volume groups, and so on. If you have any questions, interrupt me, don't wait till the end.

This is what thickly provisioned storage looked like before LVM got the thin stuff. If some of you have not used LVM or have not created logical volumes: typically you have your physical drives and you create partitions on them, or you may not, it's up to you. What you start with in LVM is you create physical volumes, then use those physical volumes to create volume groups, carve out logical volumes, and then mount your file system on those logical volumes. It's these logical volumes that can be resized on the fly, so that's the biggest USP. At every level, there are separate commands for LVM admins to use: generally the ones that operate on physical volumes start with pv, the ones that operate on volume groups start with vg, and so on and so forth.
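(As a rough illustration of that naming convention, here is a minimal sketch; the disk /dev/sdb and the names myvg and mylv are made up just for the example.)

    # Physical volume commands: pv*
    pvcreate /dev/sdb              # stamp the disk with an LVM label
    pvs                            # report on physical volumes

    # Volume group commands: vg*
    vgcreate myvg /dev/sdb         # build a volume group from that PV
    vgs                            # report on volume groups

    # Logical volume commands: lv*
    lvcreate -L 10G -n mylv myvg   # carve out a 10 GB logical volume
    lvs                            # report on logical volumes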
And this is how storage allocation looked before thin-provisioned storage came into the picture. Notice that the logical volumes you created in such situations were storage that was actually allocated the moment the logical volume was created with a particular size. So if a user said "I need 50 TB" and ended up using just five TB, you had 45 TB of wasted storage which you couldn't use for anything else. That's the use case for thin provisioning, and that's exactly what it does: it promises much more storage than actually exists.

If you notice, in this case you have physical volumes just like before, created with exactly the same commands. Nothing changes at the PV level. You then use them to create a volume group, again exactly the same; nothing changes there either. And then you have a thin pool, and that's the new part for thin provisioning. You can think of thin pools as volume groups for your thin logical volumes, so there is another layer between the volume group and the logical volume, which we call the thin pool. You create a thin pool, and notice that it comprises a data LV and a metadata LV, where the "m" stands for metadata. The data LV and the metadata LV are hidden logical volumes; you will see them when you run the lvs -a command, where they appear in square brackets. I'll show you a screen output, I have a slide on that. Making up the thin pool are these data and metadata volumes.

Out of the thin pool you can then carve logical volumes, just like you carve volumes out of your volume groups in thickly provisioned storage. These are thin logical volumes, and you can generally give them any size you want. The thin pool itself, of course, cannot be given any size you want: the size of the thin pool is real space taken from the volume group, so the thin pool cannot be arbitrarily sized. That is actual storage, as much storage as you really have. The thin logical volumes can be virtually sized, and you can promise a user much more than what actually exists in the thin pool. Typically sysadmins overcommit by something like 80% more than what they actually have; that's generally what's done in thin provisioning.

This slide shows exactly how to do this. Like I said, for the PV part you run pvcreate. If you have a /dev/sde disk and run pvcreate on it, what pvcreate basically does is stamp the disk with the LVM label so that LVM knows it can use this device as a physical volume. Then you vgcreate your volume group using that PV, and the thin pool part comes at the layer you see there, where you do an lvcreate and give a size. In this case I'm giving a 10 gigabyte size, flagging it as a thin pool, giving it a name, and naming the VG in which the thin pool is to be created. The command on top does exactly the same thing, but I think the one below is the right way of doing it; the picture is a bit outdated, so I pasted what I found in man lvmthin, which is something you could take a look at if you want to learn more about thin provisioning. So what we are doing is just creating a logical volume, because remember the thin pool is a logical volume: you're creating a logical volume with the type thin-pool, giving it a size with -L, giving it a name with -n, and then telling it which volume group the thin pool lives in.
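(A minimal sketch of that sequence, assuming a free disk at /dev/sde and a volume group named myvg; the names and sizes are just examples, not taken from the slide.)

    pvcreate /dev/sde                                      # stamp the disk so LVM can use it
    vgcreate myvg /dev/sde                                 # same as for thick provisioning
    lvcreate --type thin-pool -L 10G -n mythinpool myvg    # the new part: a 10 GB thin pool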
After that, you start creating your logical volumes, and notice the virtual size given with -V; you can give any size there, right? Obviously you don't have that much storage, but whatever is used is what gets allocated. Multiple logical volumes are going to share the same thin pool, which brings us to the most interesting point about thin pools: once you create one with a certain size, you cannot shrink it. Because data may not be written linearly in the thin pool, because many logical volumes share the same thin pool, and LVM may spread the writes around just for performance reasons, you cannot assume it's all written linearly. So once the pool is created with a certain size, you cannot shrink it. The thin volumes, of course, can be shrunk and extended as you wish; you can do what you want with them.

These kinds of volumes are interesting even for us in support, because we get cases where a customer has huge data, hundreds of terabytes of storage, and we really don't have that much storage, but we need to run tests against that kind of configuration. We can create thin volumes of that size and use them in our tests, so right off you know one use case. Obviously the biggest use case is that administrators can promise users much more storage, based on the assumption that people generally ask for much more than they really need. So administrators can provision much more than they actually have and hope that everybody doesn't use everything at the same time, right? It's exactly like Gmail promising everybody five GB, or whatever they give now. If everybody used up that five GB all at the same time, it's not going to work.

Right, so we have pros, and that's not a typo, it's a con: there's exactly one con with this whole thing. The pros: this is like virtual memory for processes. No storage wastage, obviously. You can forecast and plan ahead, so admins can buy storage as it's required, and we're talking about forecasting for years. This is huge storage, where a user says, okay, I need 50 terabytes over the next three years, and you can provision 50 terabytes today and then keep adding real storage as and when he uses it, per user. Like I said, you can also reproduce issues without actually having that huge storage.

Snapshots: thin snapshots are very efficient, and in fact at Red Hat we ask people to use thin snapshots rather than the older thick snapshots. LVM1 snapshots were copy-on-write, thick, and also read-only, which means you could not mount the snapshot and write to it. LVM2 does have thick snapshots where you can mount and write to the snapshot as well. But copy-on-write snapshots, which are the non-thin ones, have an I/O degradation in the sense that every write requires a copy: a write to the origin first requires the data to be copied to the snapshot and then the origin to be overwritten. That's going to reduce your performance by maybe 70%, and if you take multiple snapshots, it drops drastically. Also, I don't think thick snapshots let you take snapshots of snapshots of snapshots and so on. Thin snapshots are efficient, you can take snapshots of snapshots as many times as you want, and they're a lot faster than thick because there is no copy-on-write.
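(A sketch of what that looks like, continuing with the hypothetical myvg/mythinpool from before; the 1 TB virtual size is deliberately far larger than the 10 GB pool to show the overcommit.)

    # Thin LV with a 1 TB virtual size, carved from the 10 GB pool
    lvcreate -V 1T --thinpool myvg/mythinpool -n mythinlv

    # The pool itself can only ever be grown, using real free space from the VG
    lvextend -L +5G myvg/mythinpool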
It works with redirect-on-write, so metadata pointers are shifted around even for writes to the origin. Thin snapshots are basically really fast, and a thin snapshot you can think of as just an additional thin logical volume in the pool; that's all it is, just another logical volume. What that means is that because a thin snapshot has no size associated with it, you can technically fill up the thin pool if the snapshot fills up. Thick snapshots never fill up the pool as such; well, they do fill up, because you tell them how large the thick snapshot is, and in fact if a thick snapshot fills up to the size you gave it, it's discarded, which is also a disaster. That's a serious sysadmin error. A thin snapshot never gets full, but the thin pool can, and if the VG doesn't have space and the thin pool doesn't have auto-extension on, that's a bigger disaster, because file system corruption is going to happen if your thin pool gets full.

So those are the pros; I probably jumped a bit ahead of myself, but there is one con, and the con is: do not let the pool or metadata get full. I've repeated that, again, because that's basically the only thing you have to watch out for. Everything else is okay. But if your thin pool gets full, or if your metadata gets full, you are in serious trouble. We often have customers asking, can we recover our data, and we have to say no.

Yeah, you have a question? In fact, two questions. First question: this is host-level volume management, right? If the backend storage is also thin provisioned, is it mandatory to have this thin pool configuration and to provision the volume groups and everything? Yes, so you can have backend storage which is thinly provisioned. For example, EMC storage can itself be thinly provisioned at the array level. On that, if you create a thick volume, not a thin volume, you still get the advantage of thin provisioning at the storage level. But you can create thin volumes on top to take advantage of the features LVM gives you for thin, like thin snapshots, for example. When you take a snapshot of a thick logical volume, there is copy-on-write, so the snapshot performance is very bad. We are talking about just the LVM snapshot here, not the storage array's snapshots. Because there is thick provisioning at the LVM layer, the snapshot will be copy-on-write: the origin will have to be copied to the snapshot before being overwritten. Thin snapshots don't do that. Thin snapshots just change metadata pointers, because thin snapshots, just like thin logical volumes, have metadata pointing to the blocks allocated in the thin pool. When a snapshot is taken, the metadata of the snapshot points to the origin's blocks, so they both share the same blocks, and when there are writes to the origin, the pointers of the origin are moved to the new allocation. So there is no copy-on-write, there is redirect-on-write: there is only one write, not a copy first and then an overwrite. That's why snapshots are, in fact, the biggest USP. That's why Gluster uses thin pools. I'm sure you've heard of Gluster, right? Gluster ends up using thin provisioning, in fact, without the sysadmin even being aware of it. Most of our production systems are using shared EMC block storage with RHEL hosts running on top; that's why I asked this question.
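(For reference, a thin snapshot of the hypothetical thin LV from earlier would be created roughly like this; note that no size is given, which is exactly why it's the pool, not the snapshot, that can fill up.)

    # Thin snapshot of a thin LV: no -L/-V size, so it is itself just a thin volume in the pool
    lvcreate -s --name mysnap myvg/mythinlv

    # Snapshots of snapshots are allowed with thin
    lvcreate -s --name mysnap2 myvg/mysnap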
And another question: if I want to have RAID 1 or RAID 5 or RAID 6, how do I do that with LVM? Is the question how's the performance? How to do it, rather. Okay. So, first of all, you can have software RAID on top of which you create a thin LV. If you have an mdadm device, which is software RAID, you can add that to a volume group and then create a thin logical volume from that volume group. That is straightforward: it just happens at two levels, where thin sits on top, below that there is RAID, and mdadm does the rest. It doesn't even know that the I/O is coming from a thin device.

Okay, and for the device filling up, do we need to have threshold limits? Sorry? Do we need to have threshold limits, as you said, if the pool becomes full or metadata gets full? Yes. The pool getting full or the metadata getting full is a serious sysadmin error. Usually LVM calculates the metadata and pool data sizes such that they fill at roughly the same rate, so by the time the metadata gets full, the data is also about to get full. The data is where the data is actually stored; the metadata is the pointers, i.e. the block allocations. If you run out of metadata, you may still have data space, but you can never address it, because you can never create allocations for that data. It's just like running out of inodes: even though you may have file system space left, if you can never create a new inode, you can never allocate it.

We have auto-extension turned off by default in lvm.conf, basically because we can never guess how much storage to borrow; that depends on the sysadmin, and we cannot steal space from the sysadmin. But we do have LVM warn people that auto-extension is turned off, please turn it on. And we have several issues with Gluster because Gluster does not do that. We have a lot of cases where Gluster has created thin pools and the sysadmin is unaware of it, he has not turned auto-extension on, so he gets thin-pool-full issues, and the volume group may not even have any space left even if auto-extension had been turned on, and there are corruptions and their bricks are, well, bricked.

To answer your question, there is also a RAID type in LVM now, so you can create RAID logical volumes as well. But I don't think you can create RAID logical volumes and thin at the same time: thin is one type, RAID is a different type. So if you want to use thin with RAID, you have to have software or hardware RAID below the LVM layer. Okay, got it. Thank you.

All right, one more question. Do we have any monitoring solution for checking the pool? Yes, there is monitoring. We have dmeventd, which does the monitoring for thin pools. And when I say monitoring, all it can do is log to the logs: it will keep logging 85% full, 90% full, 95% full. In fact, some versions of LVM have different behavior: some of them will unmount the file system when it gets to 80 or 85%, or in fact 95%; when it gets to 95%, it will unmount, on the reasoning that it's better to have an unwritable file system than a corrupt file system. But some versions of LVM will not do that; they'll just keep warning the sysadmin. And if auto-extension is turned off and nobody is monitoring the logs, there's nothing else that can be done. Can you go back one slide? You want to go back a slide?
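(The auto-extension knobs being referred to live in the activation section of /etc/lvm/lvm.conf; a sketch of what a sysadmin might set, where the 70 and 20 are just example figures, not recommendations.)

    # /etc/lvm/lvm.conf, activation section
    # 100 means auto-extension is effectively disabled; below 100, dmeventd
    # triggers an extension when the pool crosses that percentage...
    thin_pool_autoextend_threshold = 70
    # ...growing the pool by this percentage of its current size,
    # provided the VG still has free space to give.
    thin_pool_autoextend_percent = 20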
So here, the vgcreate command remains the same for thin as well as thick, right? Yeah, the volume group part doesn't change; you create the VG for a thin pool the same way you create it for any LV. The VG stuff never changes. Okay, but in the lvcreate alone, you are using the -T thin-pool flag. Yeah, but that's not the volume group we are creating there; we are creating the thin pool, and the pool is itself a thin logical volume. So: you've created the physical volume in the first command over here, you've created the volume group over here, and that remains the same whether it's thick or thin. And then after that, you have created a logical volume. If you did not give the thin-pool type and the -T flag, it would be a regular thick volume. In this case, you're creating a pool, so certain restrictions apply, in the sense that you cannot shrink it, and you can use it as a pool for further thin logical volumes. The thin logical volumes are what you make your file system on, not the thin pool. Okay, thanks. Any other questions? Okay, so no more questions, right?

Right, so we were here. Everything I said just now is probably on this slide. Think of thin pools as a VG for thin LVs. File systems are created on the thin LV, not the thin pool. Thin pools cannot be shrunk. Thin LVs are virtually sized. lvs -a will show you some useful details about the thin LVs; I'll show you a screenshot of what lvs -a looks like for thin pools. The auto-extension parameters are there in lvm.conf; sadly, they are not on by default, because you cannot steal space from the sysadmin without the sysadmin being explicitly aware of it. So they have to turn it on themselves, or we need some top-level scripts for third-party applications using thin pools which would warn customers in advance, at setup time, to please turn auto-extension on and say at what threshold it should kick in and how much it should extend by. That sort of thing you cannot guess, you have to ask.

Also, there is a chunk size associated with the pool. The chunk size is the I/O granularity: it's the size of the blocks managed by the thin pool. Generally, a higher chunk size means better performance, but not for snapshots, because snapshots are incremental changes, so if you have snapshots you're better off with a smaller chunk size. Sometimes customers try to change the chunk size; well, you cannot change the chunk size after you create a pool, but customers sometimes pass a chunk size to lvcreate while creating the pool, which is something LVM does allow. If you don't pass in a chunk size, LVM will calculate one based on the minimum or optimal I/O size, and it also picks a chunk size such that the metadata and data LVs will grow in more or less the same proportion. Whenever block allocations are made, when data is written to the file system on the thin LV, data is used up as well as metadata, because blocks need to be allocated for the data being written. Metadata and data generally grow at the same rate if a good chunk size is chosen. If customers pass in a bad chunk size, the metadata can sometimes run out before the data, or vice versa, and that can be a total disaster, like I will show you at the end.
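(A sketch of how you'd inspect this, using the hypothetical names from earlier; the bracketed hidden-LV names shown in the comment follow LVM's usual _tdata/_tmeta convention but will vary with your pool name.)

    # Show hidden volumes too: expect entries like [mythinpool_tdata],
    # [mythinpool_tmeta] and [lvol0_pmspare] in square brackets
    lvs -a myvg

    # Show the chunk size LVM chose (or was given) for the pool
    lvs -o +chunk_size myvg/mythinpool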
I have a slide showing a road accident, an LVM road accident, to satisfy your morbid curiosity, because you'll see how much trouble the guys actually got into. That's all about chunk size: like I said, it's calculated by default, and it can cause havoc if an unsuitable value is passed manually to lvcreate.

Right, so these are just the steps where you create a pool, like I already showed you on that slide. When you do an lvcreate with the type thin-pool, that's what LVM shows you. And then you create a thin logical volume with overprovisioning; in this case I'm creating a 1 GB volume in a 100 MB pool, and you see the warning from LVM saying you have also not turned on protection against thin pools running out of space. That is something that came in after the Sky case, for those of you in Red Hat. We also have LVM warning you to please set activation/thin_pool_autoextend_threshold below 100 so that you get auto-extension of thin pools. And then it says, okay, you have created your "mythin" LV, that's your thin logical volume. If you do lvs -a, you see the ones in square brackets; those are the hidden volumes, the data volume and the metadata volume. You can see a 4 MB metadata volume created for a 100 MB pool. And if you run lvs -o +chunk_size, you'll see the chunk size being used. Typically for small pools it'll be 64K, that's the minimum size; it can go up to 1 GB, I think, I'm not sure. Also, I just want to mention that the largest supportable metadata size is 16 GB. You cannot extend beyond that; 16 GB, as of now, is the largest size of the metadata LV. And honestly, 16 GB is enough to address huge storage, so you really don't need anything bigger than that. For now; I'm sure that's a bit like Bill Gates saying you don't need more than 640K of memory, but anyway.

We also have a spare LV created. This spare LV is used for repair, and it's generally the same size as the metadata. Sometimes customers extend the data but forget to extend the metadata, and that is when the metadata can get full long before the data. Sometimes customers extend both the data and the metadata but not the spare, in which case, if you try to repair, you may have a larger metadata being repaired into a smaller spare, which can also cause serious issues. We have open issues with LVM right now because when you extend the metadata it does not automatically extend the spare, which it should be doing. So, many issues, but still, in general I've found thin provisioning to be a very good solution to combat storage wastage.

We've already spoken about thin snaps, so I guess I'll send out the slides and you can go through this; I think I've already more or less covered everything. This is a slide on how Gluster uses thin pools. As you can see, I think Gluster uses thin pools just for the snapshots, because they keep taking snapshots and they want thin snapshots, they can't afford thick snapshots, so that's why they use thin pools, at the risk of users not having auto-extend on and running out of metadata or space. Snapshots are taken at the LVM layer over here, and then on LVM we have XFS with directory trees which are used as bricks and then exported as volumes by Gluster. You asked about RAID, right? It may or may not exist below; it doesn't matter. In this case it does.
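(A sketch of the kind of extension being talked about, again on the hypothetical myvg/mythinpool with arbitrary sizes. Whether the spare gets resized to match is exactly the gap mentioned above, so checking with lvs -a afterwards is a reasonable habit.)

    # Grow the pool's data area using free space in the VG
    lvextend -L +50G myvg/mythinpool

    # Grow the pool's metadata area as well, so it doesn't fill up long before the data
    lvextend --poolmetadatasize +1G myvg/mythinpool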
So we have five minutes and I want to talk about discards, and honestly I have to confess I don't understand discards very well, but from whatever I have read: when storage has a file system on it and you delete some data from the file system, with regular rotational disks, like in the 80s and 90s, there was no problem, because those blocks would just be decommissioned by the file system, the file system would send new writes to the same blocks, and the storage wouldn't care. But SSDs do care, because they want blocks to be erased before they can take new writes, and secondly, thin pools care, because they have overallocated space. So for thin LVs and for SSDs, you cannot simply remove data from the file system and expect the thin pool to magically shrink, or rather to have space available again in the thin pool. You have to send the TRIM commands, which are the discard commands. Usually this is achieved automatically by mounting the file system in discard mode, but if that's not done, you have to run fstrim in a cron job to periodically reclaim the space the file system has freed. Most people don't know this, so they create thin logical volumes, create file systems on them, do rm -rf on the file system, and expect to see the thin pool free up that space again, but it will not. And there's a lot of configuration you can do in lvm.conf for discards, so take a look at that.
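(A sketch of the two usual approaches, assuming a thin LV mounted at /data; the device path, mount point, and scheduling are illustrative rather than a recommendation.)

    # Option 1: online discard, so deletes are passed straight down to the thin pool
    mount -o discard /dev/myvg/mythinlv /data

    # Option 2: periodic discard instead, e.g. from a weekly cron job or the fstrim.timer unit
    fstrim -v /data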
So, what not to do if you're using thin pools: do not pass a manual chunk size to lvcreate, do not forget to turn auto-extension on, do not let the VG run out of space, do not do any of these things I have written down, and do not not monitor your logs. Okay, we have official LVM engineering saying thin provisioning requires responsible admins: if you're not willing to take care of your thin pools, don't use them, lots of kittens may die.

And this is, yeah, I have two minutes, and two minutes is enough to look at this gruesome scenario. If any of you are sysadmins, you're going to get a morbid pleasure out of this; this guy is in so much trouble. I've obviously removed the sensitive data, so host names and client names have been removed, but this guy opened a case with customer support at Red Hat saying, dear Red Hat, well, he didn't actually say dear Red Hat, he said we are in big trouble, something like that. And he said: we are facing issues with thin pool metadata space; LVM shows 48.7% data written but 100% metadata written on this one particular node; we have six other nodes in the same situation with metadata around 99.8%; how do we avoid blocking all the other nodes? And each one is 250 terabytes, so that's 1500 terabytes of data that they cannot afford to lose. In fact they couldn't even stop the I/O, because they were using this for some online streaming thing, and I/O was coming in faster than they could even delete it; it was that fast, and there was nothing we could do. The short story is that two weeks later everything exploded and they lost 1400 or 1500 terabytes of actual data. The file system went corrupt, everybody cried, and then they kicked Red Hat out because they blamed it on us, and that was a disaster. But this guy was in so much trouble right at the start; we looked at this and we said, there is nothing we can do, man. It took two weeks for us to actually give up, two weeks of sleepless nights, and we still couldn't save it. The metadata went full on all their other nodes, and the root cause was that they had passed a wrong chunk size to lvcreate. Once that's done, you can basically see that only half their data is usable, because the metadata became full before 50% of the data had been addressed.

And once the metadata gets full, file system corruption is going to happen, because when data is written on the file system, the file system promises the application, okay, your data is written, but by the time it gets to the thin pool, the thin pool starts rejecting that I/O because the metadata is full, so it cannot even allocate new blocks for the data. So the thin pool has to send a message back to the file system saying, you know what, we had told you the data was written, but actually it's not, and now I'm sure it's corrupt, so please, if you are journaled, go back to your previous consistent state of the file system and use that. Sometimes that's okay, but sometimes it gets corrupt, and in this case, 1500 terabytes gone. No backup, and no time to copy it out: if you think about copying out 750 terabytes, half of that 1500 terabytes, just calculate how much time it takes. It's more than four months, even in your fastest office with your best broadband or fibre channel. It's a huge figure, much more time than we had. The metadata went full, and disaster. All right, thank you guys; if there are any more questions, you can ask me.

Okay, there's one question. I think we have 10 minutes before the next talk starts, right? So I don't think we have a break planned. There's a question. So, good talk; you showed how the LVM stack gets complicated with so much metadata and how you can so easily corrupt the file system. So what's the journaling story there? If you crash and burn while your metadata is being updated, what is the chance that you'll even recover and come back? You're not running out of space, you just crash at that point, right? So let's talk about XFS, right, you said journaling, right? Do you mean what happens to XFS when the thin pool goes full? No, no, what happens to your thin pool metadata update when you crash right at that time: suppose there is a copy-on-write and you are changing the pointer, and at that time your system crashes. When the metadata gets corrupt, you cannot even mount the file system, that's the point, right? File systems do journaling so that at the time of a crash they can recover. Yeah, but you cannot activate the logical volume, so you need to thin-repair the metadata first. Okay, but the repair may not work, because it's not intelligent enough to be able to recreate the metadata every time; the corruption can be complex. Okay, so you have a concept of metadata repair on the next mount? Sorry? You have a concept of metadata repair, to try to recover the data? Yes, we have a command called thin_repair, so if you run lvconvert --repair, it tries to repair metadata which is corrupt, and once it successfully repairs the metadata, you can activate the logical volume and mount the file system. But it's not foolproof: it cannot repair metadata which has got corrupt in a complex way, where you have to actually look at the B-tree and recreate the nodes and the pointers, and that is something engineering can do manually, but not always. Still, we have a command called thin_repair which we can use for simple corruptions. Yeah, okay, thank you. Thanks.
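(The repair path being described is roughly the following, assuming the hypothetical myvg/mythinpool with the pool deactivated; whether it succeeds depends on how badly the metadata is damaged, as he says.)

    # Attempt metadata repair; repaired metadata is built on the spare LV and swapped in
    lvconvert --repair myvg/mythinpool

    # If that succeeds, the thin LVs can be activated and mounted again
    lvchange -ay myvg/mythinlv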
I have a follow-on question though. How do you manage atomicity of your block updates? Basically your I/O is 4K, but you said the minimum chunk size is 64K, or suppose we pick 256K as the chunk size: there is one 256K chunk you have to write, but the I/Os are still in 4K blocks. What if it fails in between? So you're saying the file system has, say, a larger minimum I/O size? The file system asks you to write only a 4K block, but because of thin provisioning you have to write a 256K block, because you are going to copy it over. Right: the file system will ask you to write 4K blocks, but if the chunk size, like you're saying, is larger than 4K, take the 64K example, then even if only 4K is asked to be written, that whole chunk will be provisioned. There will be internal fragmentation, which is reused later. So sometimes we see that only half the data has actually been written but the pool data has reached 100%, because of this internal fragmentation, where smaller writes cause larger chunks to be provisioned. In that case LVM is not so stupid that it's just going to say you have run out of space: once it's reached 100%, it starts looking for chunks it can reuse for partial writes. Okay, thank you. Did I answer your question? Not quite, but again, let's have a talk later. Okay. All right, thanks guys.