Thanks for making it to what I think is the last talk of the evening. I'm going to be talking about caching, in particular caching using LVM. I'm sure a lot of you know what LVM is; in fact, a lot of you probably know more about caching than I do, because I see a lot of familiar faces in the audience and a lot of people I have learned from. But for those of you who don't know what LVM is, it's basically a block-level volume manager, the Linux volume manager, and it allows you to aggregate physical devices and create logical devices. Cache logical volumes are one type of such device, and we're going to look at how to set these up, what caching is in general, and what benefits it can give your I/O applications.

My name is Nikhil and I work in backline support for storage. I have been with Red Hat for about five years now, and LVM is one of the most frequent areas of my work when it comes to support.

So that everybody knows what we are talking about when we talk about caching: it exists at every layer of the stack, but the fact is that the I/O subsystem has become the performance bottleneck, because CPU speeds have increased much faster than the backplane of the I/O subsystem for most organizations. The latency hit you take when you exit the RAM subsystem and do your I/O to hard disk drives, which are still pretty much the backbone of storage infrastructure for most organizations, is on the order of several magnitudes. In fact, if you want to think about it in layman's terms, RAM is as fast compared to a traditional hard disk drive as a supersonic fighter jet is compared to a snail. That is actually how slow I/O is once it exits the RAM subsystem and starts being written out to these slow rotational magnetic disks.

So there is a latency gap, and solid state drives have pretty much filled that gap. The problem is that when it comes to storage, you consider it in terms of its price, its size and its speed, and unless you have an infinite budget, and nobody does, you can only get two of those three things. So the idea is to use each layer efficiently: take as much as you want of your fast, expensive storage, take as much as you need of your slow, large-capacity storage, and put together a hybrid solution that can accelerate your I/O. The dream situation, of course, is that you pay for a slower device but get the performance of a faster device, and the way to do this is to cache as intelligently as possible.
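As a very rough refresher on what that aggregation looks like, here is a minimal sketch; the device names, volume group name, LV name and size are just illustrative assumptions:

  pvcreate /dev/sdb /dev/nvme0n1        # register the physical devices with LVM
  vgcreate vg0 /dev/sdb /dev/nvme0n1    # aggregate them into a single volume group
  lvcreate -n data_lv -L 100G vg0       # carve a plain logical volume out of that pool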
A lot of hardware manufacturers, if not all, have begun offering hard disks with caches built into them, so you don't only see this in standalone devices; you also see it at the storage array level, where your controller cache can be such a device. The idea is of course the same: the flash typically stores the most frequently accessed data, and while it does help, the basic problem is that you can't do much with it; you can neither decouple it nor increase the size of the cache. So what we're going to be talking about today is doing the same kind of thing, but at the software level, using LVM.

In LVM, a logical volume is composed of one or more storage devices; you aggregate physical devices and make them into logical volumes, like I said at the start. A cache logical volume is a combination of a slow device, which I will call the origin device, and a fast device, which I will refer to as the cache device. lvcreate, the command you use to create logical volumes, allows you to specify the actual disk or device you want to use for a logical volume, and typically you would use the fast disk to create your cache logical volume. The slow disk, we assume, already exists, in fact with a file system on it, and is currently serving I/O. The idea is that the fast device boosts speed, but it does not add to your storage capacity: the total capacity you have is the capacity of the slower disk. What the cache does is typically store frequently accessed blocks of the slower disk and service I/O requests; if those blocks are cached, the requests are serviced from the fast device without having to go to the slow device.

When we talk about caching, all data stored on the fast device eventually makes its way to the slow device. Depending on how you tune your hybrid volume it can make a difference how long that takes, but eventually you should never be in a situation where the loss of your fast device results in loss of data; at least that is the end goal. That's because whatever the fast device stores, it eventually flushes to the slower device for safety, so that you would never be in a situation of data loss if the fast device went away. I repeated that because I saw you looking a little puzzled about it. A couple of years ago I read a blog entry by one of our fellow Red Hatters, I don't remember his name, and there was a statement that if the SSD goes away, you lose some data. You would lose data if the device goes away while the writeback is in progress, but once the writeback finishes and everything has been flushed, then if you lose the fast device you lose only the cache layer, not your data. It depends at what stage you lose the fast device: if you lose it while a writeback is in progress, then yes, there would be data loss, but the goal is to get the data to the slower device eventually, depending on how you tune it.

There are two ways to cache I/O using LVM, depending on your workload, and the distinction is between read-write caching and write caching. The way we have been doing things so far is the dm-cache device-mapper target. dm-cache has been around for a long time; it made the mainline kernel in 3.9, I think, more than six years ago, and it is very often used by customers for their caching solutions.
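To make the origin/cache split concrete, here is a hedged sketch of the two building blocks; the volume group vg0, the LV names, the sizes and the device paths are assumptions for illustration:

  lvcreate -n slow_lv -L 500G vg0 /dev/sdb       # origin LV placed on the slow rotational disk
  mkfs.xfs /dev/vg0/slow_lv                      # in practice this file system already exists and is serving I/O
  lvcreate -n fast_lv -L 20G vg0 /dev/nvme0n1    # LV placed on the fast device, to be used as the cache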
The cache gets populated by both reads and writes, which means that if you access certain blocks frequently enough on the slower disk, they get promoted to the cache. dm-cache does these promotions based on a hit count: how many times a certain block has been accessed on the slower device. dm-writecache is the newer target; it was mainlined less than a year ago, I believe, and has recently been integrated into LVM upstream. It has not yet found its way into RHEL, but probably in the next RHEL version you will see the dm-writecache target integrated with LVM.

The other thing I wanted to mention is that caches can be added to logical volumes which have active file systems on them, so you don't really take a performance hit during the addition of the cache. Both of the dm targets that are integrated with LVM let you do that. There are other caching solutions where this is not trivial, solutions which wipe the origin device when you set up the cache, so with those it is difficult to set up caching on an active file system without unmounting it and actually wiping it clean. That is one of the main reasons why you would want to use the LVM-integrated caching methods.

So let's talk about dm-cache for a bit. Like I said, dm-cache does cache promotions and cache demotions: frequently accessed blocks on the origin are moved to the cache. Initially, when you set it up, it is a cold cache, so the first time you set up dm-cache you will see performance that is worse than if you had no caching at all, because the cache is cold and nothing has been promoted to it yet. After you use it for a certain amount of time the cache becomes warm, certain blocks get promoted, and then you start to see the performance boost that you expect from caching.

Fast devices in dm-cache are represented by a cache pool. A cache pool is a type of logical volume which has a metadata and a data logical volume; if you know anything about thin provisioning and have seen thin pools, it's the same kind of concept. What you do is create what starts out as just a linear logical volume and then attach it to your slow, existing logical volume, which creates a hybrid cached logical volume. There are ways to do this quickly and let LVM auto-create the metadata and data logical volumes, and in fact even auto-create the cache pool itself, and I'm going to show you both the quick way and the slow way in the slides.

The cache can also be detached online. When you detach the cache using the --splitcache flag to lvconvert, all the pending dirty blocks (a block is marked dirty if it has been changed in the cache but not on the origin) need to be flushed to disk before the cache is deactivated. So when you do a splitcache you get a flush, hopefully not a long one, because dm-cache typically writes data back to the origin device quite quickly; it doesn't leave data lying around only in the cache for too long.

There are three modes you can set up. There is the writeback mode, which means your I/O is signalled complete to the application as soon as it reaches the fast device, so you don't need to wait for the I/O to be written back to the slower device before the application considers it complete.
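As a hedged sketch of that online detach, assuming the hypothetical cached LV vg0/slow_lv from the earlier example:

  lvconvert --splitcache vg0/slow_lv   # flushes dirty blocks, then detaches the cache but keeps the cache pool LV
  lvconvert --uncache vg0/slow_lv      # alternative: flush, detach, and delete the cache pool entirely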
There is also the writethrough mode, which is the default. Writethrough means the I/O is not marked complete until it reaches both devices, so the writethrough mode is really only beneficial for read caching, not for write caching, but it is safety first: by default we don't want customers to be in a situation where, like you said, there could be any incident of data loss at all. And there is the passthrough mode, not used very often; it is generally used if the cache is not coherent and you still want to activate the device without actually using the cache, so everything bypasses the cache in that mode.

Here's an example of how to do this with a cache pool; it is basically just two commands that you have to remember. Assume that you already have a slower logical volume with a mounted file system on it and in-progress I/O; that's fine. You then create a fast logical volume: the first lvcreate creates a one-gigabyte linear logical volume, which you might name cpool, and you specify which device you want to use for it. The lvconvert step I've shown then creates the cache pool automatically, ties it to the origin, and creates the hybrid cached LV, all in one step. At the end of this process you see output where your origin keeps its name but the type has changed to a hybrid cached volume, which you can see in the attributes: it is no longer a linear logical volume, it is now a cache logical volume, a hybrid logical volume. You also get five hidden, auto-created logical volumes which LVM will not show you if you just run the lvs command; what I showed is the lvs -a output, and these hidden volumes are what it uses internally. They are appropriately named: the cache pool, which was auto-created and becomes hidden, the data and metadata volumes, a spare for repair if needed, and the original volume, which is renamed with a _corig suffix. If all this seems confusing, just remember that at the end of that step you have a logical volume with the same name as your origin, with I/O still in progress as before, and it will start being cached.

And this is the long way to do it. I wanted to have it here as a reference, so if anybody looks at these steps or at the slides, they get an idea of how to achieve maximum flexibility. Notice there are two lvconverts, because you manually create the cache pool after creating your data and metadata logical volumes separately. This is useful if you want, for example, different RAID levels for your data and metadata: you would probably want to mirror your metadata, and you might want to stripe or have a RAID 5 setup on your data device. You may also want to use different devices for the data and the metadata; suppose you have some super fast storage which you want to use for the metadata and some reasonably fast storage which you would use for the data. That kind of flexibility is available to you, and you can go ahead and look at these steps; the slides will be sent out. I don't want to spend too much time going through the entire stack, but it's pretty much what we saw in the last slide, with the additional flexibility of doing each step yourself and having separate data and metadata logical volumes.
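Since the slides themselves are not reproduced here, the following is a hedged reconstruction of roughly what those steps look like; the VG vg0, the names cpool, cdata and cmeta, the sizes and the devices (including a second fast device for the mirrored metadata) are all assumptions:

  lvcreate -n cpool -L 1G vg0 /dev/nvme0n1                     # quick way: linear LV on the fast device
  lvconvert --type cache --cachepool vg0/cpool vg0/slow_lv     # builds the cache pool and attaches it to the origin
  lvs -a vg0                                                   # reveals the hidden [cpool], [cpool_cdata], [cpool_cmeta], [slow_lv_corig] and [lvol0_pmspare] volumes

  # long way (shown as an alternative): separate data and metadata LVs, here with mirrored metadata
  lvcreate -n cdata -L 1G vg0 /dev/nvme0n1
  lvcreate --type raid1 -m 1 -n cmeta -L 8M vg0 /dev/nvme0n1 /dev/nvme1n1
  lvconvert --type cache-pool --poolmetadata vg0/cmeta vg0/cdata
  lvconvert --type cache --cachepool vg0/cdata vg0/slow_lv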
We also have cachevols now. A cachevol can be thought of as a cache pool without user-accessible data or metadata: you will never see separate metadata or data logical volumes, even if you run lvs -a like in the earlier slide. The idea is that metadata and data are combined into one internal logical volume used by the cachevol. The way to create it is pretty much the same; if you notice, I just gave --cachevol instead of --cachepool to the command.

So let's talk about dm-writecache. This is the new caching target, merged into 4.18 by Mikulas, who is right here, so you all know who to ask questions to, and recently integrated by David into an LVM version which will go into the next major release of RHEL, where it will be packaged. dm-writecache is writeback only, no writethrough, and that makes sense because we are trying to accelerate only writes here. It is useful for applications like databases which need low commit latencies for transactional consistency, which cannot afford to have data lying around in the page cache at all; they want to get it to persistent storage as fast as possible. It is optimized for PMEM devices in terms of how quickly and how often it flushes, but it is also very useful for NVMe-based SSDs. It only uses cachevols, you cannot use cache pools, and there is no flush without eviction: every time data is flushed to the slower device, it is evicted from the cache.

Only writes are cached. Reads will be serviced from the cache if the blocks happen to be there, but if you read from the slower device, no matter how often you read the same block, there will never be a promotion from the slower device to the faster device on the basis of that read. The idea is that reads are generally available in the RAM page cache, so you can get them from there. And if you really think about it, reads and writes are quite different when it comes to caching, right? If you send a write request, the operating system will typically cache it; unless you use direct I/O it will be cached in the page cache, and the application will be told: sure, consider it written, done. When you do a read request, the OS can't just say "consider it read"; it actually has to deliver the contents. So having two different targets is an attempt to isolate one for one kind of workload and the other for another kind of workload, and that's basically the idea behind having these two caching targets available for applications as of today.

Detaching the cache may require deactivation of the logical volume with dm-writecache; it is not like dm-cache, which lets you detach online too, but there is work being done which will allow online detach in the future. You get the same kind of output for dm-writecache: the type is writecache instead of cache, and you use a cachevol, since you cannot use a cache pool like I mentioned, and otherwise it is pretty much the same; your origin device retains its name but becomes a hybrid write-cached logical volume. Usage of the cache cannot be seen at the LVM layer right now, but give us a few months and that will come into LVM, so you can see how much of the cache is free. Today it can be seen using dmsetup status; the fields I have marked give you an idea of how much cache you have, how much is free, and how many writeback jobs are being used to flush to the origin, if a writeback is in progress.
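A hedged sketch of what the dm-writecache setup and the status check might look like, assuming a recent LVM with writecache support, the VG vg0 with origin slow_lv, and a persistent-memory or NVMe PV at /dev/pmem0:

  lvcreate -n fast_lv -L 10G vg0 /dev/pmem0                    # LV on the fast device, used directly as a cachevol
  lvconvert --type writecache --cachevol fast_lv vg0/slow_lv   # cachevol only; cache pools are not accepted here
  dmsetup status vg0-slow_lv                                   # reports used/free cache blocks and in-flight writeback jobs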
dm-writecache also has tunables. These can be set when you set up the cache using LVM, and they can also be changed dynamically using the dmsetup command today; again, there will be a patch going into LVM in the future which will let you make these tuning changes after setting up the cache. The high watermark is basically the level of cache usage after which a writeback automatically starts. The low watermark is the percentage down to which the writeback continues, so with the default of 45%, no more writeback happens once usage drops below that. The writeback jobs setting is the last figure I showed you in the dmsetup status output: the number of threads that will be spawned for flushing data. There are also other tunables for how quickly to flush: after a certain number of blocks have been committed a writeback will begin, and after a certain amount of time has passed a writeback will begin. As you can see, these are optimized more for persistent memory.
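To make the watermark discussion concrete, here is a hedged sketch of passing these settings at attach time; the option names mirror the kernel target's tunables, and the values and LV names are assumptions:

  lvconvert --type writecache --cachevol fast_lv \
            --cachesettings 'high_watermark=60 low_watermark=45 writeback_jobs=1024' \
            vg0/slow_lv
  # autocommit_blocks and autocommit_time are the count-based and time-based flush tunables mentioned above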
So that's basically the idea I was trying to get across. I think we have five minutes for questions; I just have a references slide left, but if you have any questions, there are a lot of experts in the room.

So the default caching mode for dm-cache is writethrough; do you have the same issues that this gentleman was talking about in that case, with the data getting back to the backing store? No, you would not, because the I/O is signalled complete only after it reaches the backing store, so it results in a performance hit for writes, but it is a safe way to cache; you would get a performance boost for reads only in that case.

How well does it scale? I see in your examples you use five gigabytes or something like that; I used a 3-terabyte or a 10-terabyte drive and I didn't really see much of the speedup I expected. Did you use dm-writecache?

Yes, I used dm-cache with writeback. If you use dm-cache with writeback, the number of chunks in the cache is kept to around one million, and with these huge multi-terabyte caches, keeping the chunk count within that limit means the chunk size has to grow; then you pay a big price on promotion, because you end up moving chunks of a couple of megabytes even when only a small part of them is hot. So at this moment I would not actually recommend using the cache that way; it is a question of whether you are benchmarking or using the cache for something else, but at the moment we don't have super optimal structures for handling multi-terabyte caches.

Yes, if you look at the lvm.conf file you will see a limit of a million chunks, and if you use a cache size which would need more than that, then instead of increasing the number of chunks, the chunk size increases, and you get an equal increase in the migration threshold, so you would be promoting and demoting in larger chunks, which could affect your performance. The suggested ratio is really about your working set, the size of your data. If you have, for example, a 5 GB cache and you are doing streaming writes to video files which are constantly going to be larger than the cache, you will eventually not get much of a performance boost from that cache. If you have a working set which does its reads or writes at sufficient intervals for the cache to write back and then wait for the next burst, that is when you get optimal performance. So the ratio, how big you want the cache to be, should depend on the working set of your data.

Sorry, can you repeat? dm-cache, but only for reads? So you would like to cache only reads with dm-cache. I don't know if you can do that; we don't have a read-only cache at the moment, only writethrough. But he means only promoting blocks that are read and never promoting for writes, and the answer is no.

Earlier you told us about cachevols and cache pools; I am not fully understanding the difference between those things. Cache pools give you access to the internal metadata and data logical volumes, if you want to separate them onto different devices or RAID either of them at a different level; cachevols use them internally, and the sysadmin or the creator of these logical volumes has no access to the internal metadata and data. Which one is the right one to use? Well, if you want greater flexibility, you would use cache pools; if you want ease of use and one command which does it all, you would use cachevols.

I think we are out of time, so thank you all for making it.