 All right welcome back from lunch everybody I'm glad you found your way back here and haven't quite hit food coma yet So hopefully we'll keep that going with some interesting presentations for you Our next speaker here on the sef day track as a part of open source days Well will eventually be three speakers, but Jason's gonna kick it off and the two Intel gentlemen will be joining us after another talk that they will be sprinting here from so they're back to back But in the meantime Jason from Red Hat is gonna talk about some of the the block storage stuff And give us a rundown on that world. So Jason take it away. All right. Thank you And it's actually surprising on the last day of the conference and the afternoon that there's still actually so many people Interested to hear about to sef and specifically a guest RBD What this talk is going to focus about it's gonna be mostly about 99.9 percent about RBD And we're really focusing about some of the work that Intel is doing right now and helping to develop and actually contribute to upstream To to reduce some of the latency issues that some people talk about with some of their workloads on top of sef RBD So just to start off and sorry just ask back up just building you already mentioned I'm currently the project technical lead of the RBD portion of sef So if you have any RBD related questions, I'd be more than happy to answer them But again moving on When we talk about sef when especially if people that work on the project I always like to start off with these slides about here's the the three towers of of sef components built on top of the The the rados library underneath and as I mentioned, we're gonna be talking just focusing in on the the RBD the block portion and For those that don't know what RBD is I'll give the 30 second intro So what RBD is it's a block device abstraction layer that's built on top of rados In the way it works is you have Hundreds thousands tens of thousands of small objects within the rados cluster and that's a strike. So and then your image Be it, you know terabytes petabytes, whatever you want That's broken up and striped over these very small by default four megabyte objects in the rados cluster Some of the highlights that we have with RBD. It's it's broad integration. I'm specifically talking about open stack It's broadly integrated into cinder nova and glance I've heard some rumblings about even pointed into ironic to offer Device driver RBD device block driver on bare metal It's thinly provisioned. So anything you boot up you're not instantly allocated that space out of your cluster If you want to boot up, you know a hundred gigabyte image You don't take up a hundred gigabytes in your cluster It takes up practically zero bytes in your cluster just a little metadata and it's only when you actually start filling it up Do you actually start using that data? Snapshot functionality if you're here earlier Greg did a very comprehensive detailed talk about the internal structures of how Snapshots work within stuff, but within RBD. We abstract all that away and you can just snapshot deletes rollback Snapshots of your images and even on top of that from your snapshots. You can then create clones Which are basically ways of I created an image I can create that image and call it my golden image Snapshot it and now I can clone off new images from that snapshot of the parent image and those clones are copy-on-write thinly provisioned clones and Since this is open stack I always like to point out. I mean Obviously, I think cinder and open stack they they widely use RBD and we love that and as you as the use case users of Of RBD inside of open stack if you have any problems questions or concerns or any improvement advice I mean, obviously we would love to hear from it because we want to make We want to keep this trend going of this is the latest from April 2017 of the user survey But you know you can go back to previous years and it's very similar results of stuff RBD dominating the usage and deployment on cinder So before I start talking about some of the problems we're trying to solve with With within RBD and latencies just wanted to go over a high level for those that don't quite understand how this RBD architecture works So we have two different ways to access RBD. We have the the KRBD driver That's just a kernel driver that provides a block device To your operating system. We also have lib RBD and that's in terms of open stack use cases That's pretty much always what you're interacting with the user space clients being either things like Management control plane things like a sender Nova glance or also just the actual consumer of the data in this case like KBM KMU that directly integrates with lib RBD to provide a backing device So in the case of KMU your operating system you know your VM has exposed a block device within it That is whatever size you've configured that image to be so but from the point of view of RBD lib RBD it gets requests from KMU saying please Read or write from this sector this many sectors and we translate that just internally just a quick We just use our striping logic to say well given that it's that that offset within the virtualized image space we know how to map that to a one of those tens of thousands of small four megabyte backing objects in the rados cluster and we can push that That read or write or whatever request to the appropriate OSD for that operation to take place And then diving down in about again how that control that IO flow works Within for IO requests. So in this case a client, let's say KMU It creates an AI over question asynchronous IO request via the lib RBD API and that request as I hinted at is Just a sector number of sectors of data that you want to read That comes off the KMU IO dispatch thread that actually gets then enqueued into The lib RBD IO dispatch thread. So once that IO gets enqueued its KMU gets control of its thread again to continue Doing additional work and eventually the lib RBD IO dispatch thread will take up from there pop the request off the queue It does that translation from that striping to top translate the image offset length to the backing objects Of that RBD image and the offsets and lengths within those backing objects and Then for each of those backing objects that you need to do work on it Creates in parallel IO requests via the lib rados API to to request that change to the image either read write Truncate discard for example Liberados it's the abstraction layer that actually handles talking to the OSD So liberados takes those requests against very specific objects in the backing cluster sends those off to the correct That you know the primary PG whatever OSD that is it just uses the quick crush map look up to locate the right place to go and Then finally once all the OSDs have done their work for all those operations you'll get liberados the OSDs Complete asynchronously back to liberados liberados Asynchronously completes back to lib RBD and lib RBD can bubble that completion backup to complete the AO request from Kenya But just visually because words sometimes don't express it just to visually show how that that works Just in terms of IO flow you got you're in this case a right the right request comes down asynchronously Cures it gets popped from the dispatch thread that can get translated into one or more Object operations, which then gets sent off in parallel to liberados liberados works Sends out in parallel to the cluster the cluster works out in parallel Sends back completions in parallel the completions bubble up to our internal lib RBD AO completion, which would then fire the completion back into Kim you The striping that I showed here I showed how an example of you know one IO operation Translated into three different object operations When you're doing normal, I guess small small IO what you're really going to get into lib RBD So you're going to get you know 4k reads, you know read one 4k sector, you know at this specific offset And what happens is? Realistically that's going to properly align within one backing object So it's not going to translate into anything more than just one operation that gets sent off at a time But if I had an operation that came through Kim you which was you know, right 12 megabytes, you know That 12 megabytes is going to span three objects. So I'll get three different Object operations going off in parallel to the backing OSDs to complete so one of the things with one of the design Features of of self is that you can only only the primary PG can modify The object that you want to that you want to manipulate so What happens is if you have a lot of IO that keeps hammering the same Object you're going to have a lot of operations that keep hammering the same Object within the same PG and all that work basically has to happen, you know sequentially on the OSD side so one of the ways that we had previously mitigated such Such workloads which especially when you consider like a sequential 4k sequential right workload what you can have is you can have a And what we have is a you know an in-memory right back cache and be right back to be right through But I'm talking about in this case the right back case And so what happens is especially in the sequential case if you have a bunch of 4k IOs all in a row that While those are in the cache and have been flushed back to the back to the OSDs yet Eventually when that right back process comes to evict those dirty blocks It can say oh, I have a lot of a lot of dirty extents for the same backing object Well, I can bundle those up in one request and send those to the OSD, you know as one operation as opposed to many small operations, so that's a way to actually reduce the workload on the OSDs and Eventually just get better performance So but that that in-memory cache it has some some constraints and caveats first and foremost it really only helps you for sequential workloads in terms of consolidating down those and merging those IO requests If you truly randomly go on all over all over your your backing image There's nothing you can do. It's those are basically going to be one for one Request back to the backing OSDs number two. It's constrained by physical memory. So you have RBD, you know being used in QMU that you're trying to stack multiple hypervisors on a on a multiple VMs on a on a host and You don't want to have to dedicate and you can't dedicate gigabytes of memory for for an in-memory cache to Help relieve some of this some of this pressure So right now it's by default unless you override at your configuration It's defaults to at a maximum 24 megabytes of right back cache, but it tries to keep it around 16 megabytes of in-use memory and finally because all the Rights are we don't once the rights go into the cache We don't track the initial order that they come in so any flesh requests that come in from your application That actually has to flush all the dirty objects out of the cache So it's not just the objects that have been Touched between a right barrier So just visually again what we end up with it's just a slightly modified IO flow Slightly modified IO flow with the in-memory cache and What happens is basically a stain all the way at the top, but instead of Breaking that up into per-object requests what we have is a per-object cache those Request populate the per-object cache and eventually Not in that if you're doing right back that completes your your whole IO operation back to your clients Eventually you have a right back thread worker that will pull dirty dirty extents out of the cache Tries to consolidate them up together as a single per-object Operation to send to liberators and it can send multiple operations, you know in-flight concurrently Do have to admit there is a I believe this is a bug My next slide does not match what just shows up here Let's try this. All right So here's where we are today and actually this is this is a graph that's on on sef.com are available from sef.com And this is from this is a benchmark that was performed by Intel and this is actually against an older releases against the hammer release of of Seth RBD specifically So this is 4k random read workloads and what you have here is The constraint on the bottom is basically constraining the client workload to only utilize certain number of threads on the client Core, but what this shows you is on average On on on this half of the graph on the left half of the graph You can see as you increase the cores our throughput goes up not changing number of clients So what that tells me, you know as a developer? RBD developers typically that tells us we have work to do in terms of how much we're hitting and utilizing the CPU like we want to We we have performance gains that we can still get out of libRBD To reduce the CPU impact and get us, you know back up to where you start seeing it like 40 cores where it's the CPU is no longer the bottleneck but you can see on the on the Alternate access the latency access. I mean on average it does pretty well even under these high ops high IOPS workloads, you know two millisecond average response times but the Really what we're trying to gain here and what we're eventually gonna start talking about Where we want to go forward in a libRBD is how to tackle the tail latencies of RBD so here you have an example of as IOPS increase Specifically, I think this is just increasing the Q depth The Q depth is increasing to increase the the IOPS count and what you can see is Yeah, again the the average latencies stay pretty negligibly low, but the Q depth of the tail latencies really start to climb and already you're here, you know, you're probably already at 20 50 milliseconds of tail latency And obviously is as the cluster is getting to the point where it's maxed out you're getting to four and a half seconds of of tail latency, which is pretty unacceptable. So the goal is how can we Reduce these latencies for for certain workloads and now that my my partner in crime has joined me He'll take over from here and start talking about what he's been doing or what they've been doing at Intel to help address this scenario Yeah, sorry guys. I I actually had a had an overlapping presentation that I just just ran out of so Just just finished talking about SPDK and I sell for those who you know wanted to hear it But it just happened so anyhow, so just to add add to what Jason said, you know The one thing I wanted to point out was So we we actually wanted to see the tail latency behavior with With the fastest Kind of back end that's available today, right? So this is this cluster is a five node cluster with Four, you know state of state of the art NVMEs. So you have an all NVMEs self-backend, right? And with the fastest we we see the the tail latency being You know what it is and and that's kind of serves as the as is one of the motivation, right for us to to to do some optimization so So so that's where client-side caching Comes in as as you know as an attractive option Basically, you know VMI O's which are so basically 10 lend themselves very well to you know locality based Optimization so expert IO locality to cash VMI O's on the node and reduce IO part dependence on the network are you know Are basically the two two key Reasons why why we are we're doing this right which which helps with the with the long tail latency problem because you I mean for the right use cases for the for the right performance You know you're basically Kind of either either maintaining a log Journal of some sort locally i.e. caching caching all the rights locally which and then you can and and also serving your reads out of local cash which which basically leaves on the network network for your You know network bandwidth available for for your rights and that that basically helps both You know both the RBD as well as the rate of Gateway workload so you could You could do a have a block cache or also a you know the same cash in the future can be shared by You know rados gateway for object caching so so I don't know if you covered the the DRAM base for No, I didn't cover the tail of the new cash design Okay, so I can't touch down though Gotcha, so this is kind of a recap of where we are today with the with the in memory cache I mean some of this Jason already talked to you know the DRAM base caching is You know kind of has has limits in terms of when you when you reboot the node you don't have There's no warm-up. I mean you need to wait for the cash to warm up to get any benefit and the cash ability is really You know limited by how much DRAM you have available And as the system ages it becomes less and less Available so There's no ordered write-back support. So you cannot do fmrl With what's F today, right? You you have the object cacher implementation today is is pretty complex Because it's actually shared by LibRBD and Ceph FS. So So basically that there are many bought single littered bottlenecks that that we are we're trying to you know get around By maybe we're basically rethinking that this cash as you know as basically a new cash and that's that's where This crash consistent write-back caching extension comes in. It's a it's a blueprint. We have in the works right now, which is extending the DRAM base caching to to use solid state media where you can have a larger SSD footprint Which is also persistent. So so you can get get the nice warm up kind of behavior across reboots You you have ordered write-back cache Which and again, this is actually verbatim from the blueprint. That's why so you know already But basically the the point is that you have an LBA base read cache and a an ordered write-back cache which With a which is basically a journal that you maintain on the on the compute node on that on a on an SSD and it it also has a an external, you know Caching plug-in interface. So if you if you wanted to have pluggable caching policy support Like for instance for let's say the Intel has a cache acceleration software or some other external projects can be used For as a as an intelligent caching plug-in So So we have been working on this blueprint for the last couple of months now Actually, Jason had a great starting point for us last year That that we started up started from so we so before I jump in and talk about what we have done so far Right. This is this is actually an a proof point for for this cache gets us right on and So we did an early, you know POC At Intel that the small team that we have We we actually were able to get get about 7x better random right right performance with the right back caching With a with a much better latency and again a lot of these results are actually with solid-state backed safe safe, so they're not your you know standard spinning media NVMe kind of combinations, so So in that sense that these are kind of the worst case results, right? So with with spinning back ends you're probably going to see a lot lot better throughput and latency and We also compared it to the current caching I mean that which which is being reworked of course But we we also had had a you know, I just want to add that as a reference point Guinness to say that on the cluster side we we have we still have work to do in addition to the client side and The cache will actually enable us to get better performance to bring more workloads to to save an rpd So I mean, I think we just class we mean I think speed through some of this especially for time, but The way that the design is shaping up right now is Be kind of actually like two separate caches a read-only cache That could be used for immutable data immutable metadata. So that's that's your RGB case our GW cases and your RBD snapshot Situations like a parent image. That's your golden image that you're pulling from and then also the crash consistent right back case Which was which would be implemented basically as just a append only Journal that just journals all the changes that you're doing so that your IO can hit the journal and instantly complete back to Kimu and then you have a background right back process. That's a thread that's pushing data as fast as it can Back to the cluster but avoiding all those spiky potential for tail latencies Oh, yeah, that's so yeah, that's that's the important point about the read cache is that how it has to be designed Is it actually has to be shared? Potentially between multiple clients you could have multiple RGWs on the same host So it doesn't make sense to cash the same object and number of times. It'd be great if you could reuse that cache data and have a consistent hot Object policy that's shared between all the RGW instances that are running and then similar with the RBD Parent image that's that's you know acting as the golden image for multiple RBD images They should be able to just write that image once as opposed to having to each have their own copy each instance of Kimu have its own cached copy of Of the golden image Okay, so talking about maybe we actually have a pull request up for if you already which We are refreshing as we go, but what we have up there is a it's it's a file-based cache It's actually a file-based caching framework Jason originally started it has today It has a you know read as in right through and right right back mode support For the right back case we have a you know, we basically have a time-based flush, so we maintain All the blocks that have returned out You know are basically appended Appended to log And you know you act act the client and we basically have a time-based RPO Which is recovery point objective, which is basically your tolerance on how much data you can afford to Let's say lose in case you the node or the cache cache device fails So so basically that that's the time-based flush that we have implemented we and for the eviction policy We have a very simple aliru-based policy for for now And we are working on you know implementing something like 2q type policies And again, you know that the two other bullets basically tell you you know Kind of the pretty much the details of the class names that that are part of the PR So we have a file image cache which implements You know the ordered right back journal we have and then also the read-only Object store which is a which is the read-only part of the cache as well as the policy So that's the in encapsulating class and and today we You know if you officially pull this down and try it you you know You're not gonna see performance today, but because we are working on it but this actually has configurable options for Right, but you know right back or read-only cache. You can actually size it size the cache statically The one thing to note is because this is sparse file-based implementation or rather The the reason the sparse file-based implementation was chosen so that you can easily do the You know one is to one are pretty volume size every volume to to cache ratios right and Then right now, I mean we are working on making all of these you know more things configurable with the with the cache So are there any questions so far on the on the design part at all? If you do have questions, please stand up and use the microphone. We are recording these for posterity All right, so so from the in-memory cache, you know the Okay, so a couple of great questions on the Blueprint that you shared for the caching stuff. So first you showed it for the RBD do other gateways like the POSIX one CFFS would be able to reap the benefits of that and the second one in your diagram You showed SSD being used on the client side for the caching Can some flavor of nv ram like a p-mem on the client side on be also used or is SSD absolutely mandatory? The answer to the second question is you can you know in the future the the goal is to actually be be able to use it You know not any any kind of non-linear memory. It doesn't doesn't have to be at SSD only but Anything anything to put a file system on right? So Yes, so you can you could actually have a Yeah, like a persistent memory Where I mean that there's certain flavors that'll let you carve a block device out of it Which will be plug-in. There's any kind of memory. That's gonna let you carve out a block device. That's gonna work Yeah, okay, and what about the first one can CFFS also benefit from it? Or is it RBD specific on in theory there? I don't think there's anything that would stop the CFFS call like user space client from Benefiting from that, but we're talking about like the kernel client the kernel client then this is a user space and Then yeah, no in that case, okay, thank you, okay, so so this So I actually missed the first part how Jason, you know where Jason may have explained the flow, but I'm assuming keep Basically explained the flow for the current in-memory cache So the difference from from the memory cache as you can see is it's basically all the eye all your aio rights are basically being being journaled as You know basically appended to a log and then you know like you basically act act the client to you and then there is a write-back thread which Asynchronously flushes those those blocks out out to the cluster and So again this kind of summarizes, you know how how The I of law is being implemented today, right? So you basically have a a a write-back thread that is that this flushing out those general entries on a On on basically when when there's a time past a max interval, which is configurable Which is basically how many how long of a data loss can you sustain without losing consistency? So the so the key here is that the ordered write-back basically lets you Let's you boot or migrate a VM to to another node you know with Guaranteeing consistency that that's the basic requirement, but but your data could be Little stale right so so that's why the that's where the the last bullet basically where it says backlog time past maximum interval That's that's configurable interval that you can sustain Or you are comfortable losing a few future data data bytes And then the sec in the second bullet to add to what the write-back thread can do You basically if you if you have some partial rights You know multiple rights to the same block you could you could basically consolidate those You know and before writing before writing the the blocks out to the to the OSD and and the other mode right which Which is read cache Which is basically we you know we we are actually working on a shared cache policy Sorry, basically a demon that implements a shared cache policy which is you know, which which basically serves as a Mechanism to you know to communicate between between multiple clients because that the cache allocation We you know we we wanted to be more cooperative in the sense that You know if you actually have a a cacher as a service or an Central entity that is doing all the cat block caching for you that could become a you know Basically a hotspot or single point of failure. This this is a demon that kind of serves as only You know takes care of the control path and kind of gets out of the way sort of a thing Let's see Anything else you want to add to that? Yeah, and like like the last bullet says here, right? There's we actually what we've been working on this design to be to have no No hotspot or bottlenecks. So that's why the all the IO Is basically directly to a client specific cache file You know only the control messages are basically being passed with between let's say an RBD client Which is a virtual machine and and the demon which which owns the you know sort of the cache policy and We do we do and into in you know, basically extend extend the same scheme to to share Should share the you know space for hot Hot object caching with our rados gateway, right? So that's that'll be an extension to this work. Yeah, so So what do we what do we have today as part of the PR for the request actually, you know works functionally we still have several performance improvements and Functionality to add so right call this thing is going to be a big big performance improvement The shared cache is not part of the PR yet. It's only a read read only cache per VM today Which you are working on and then the dynamic cache sizing is a is a big will be a big feature up with a with some sort of a Cooperative allocation scheme that is weighted. So so you could assign, you know way to VM and depending on that so so basically one one scheme that That we had thought about was you could have of you know fixed Slots as allocation units. So so so you basically you you deal cache I should say cash slots that are let's say 64 Megan size and Then depending on how important a VM is let's say how what is the weight of the VM? You you basically are saying it gets more slots and then the VMs cooperatively Manage the cache space in the sense that you know if a VM is not using it. It basically kind of Talks to the other VMs and figures out that it needs to give up some slots to the to the general pool or If there are more available it can reclaim some right so it's a it's a more peer-to-peer cooperative kind of kind of an allocation scheme So that that'll that's the we're calling that a metastore or the cache allocator Which is actually heavily work in progress right now. So and I believe That is it yeah, so yeah, we actually wanted to kind of Walk you guys over through the design real quick and provide a status update that you know this is actually work, you know, like actively in progress now and Yeah, you guys should see see more more peers going going up Any questions? Yes for existing RPD right cash. Is it's crash consistent? Is it's respects flushed? Yes So the in-memory RBD cash the the right back cash it is crash consistent So if you send a flush it flushes everything that's in memory Performing IO until it's correct. Okay. Thank you That's one of the big bottlenecks of it like with this though. You could potentially if you're writing it Persistently now to a journal you can selectively ignore those flush requests You just journal the flush requests and now on disc you have your boundaries to say anything between that flush request and that flush Request that I just read a chunk off disc I'm free to reorder free to you know coalesce and send off to the OSDs and still that thing Is controlled by air by RBD cash and self-configuration on the host. Yes, it will be yes Hi, I think you might have answered that already, but I just want to make sure This this new cash mechanism is that a user-level client thing or can I use kRBD? So that yeah, this would be this would be purely user space kRBD. You could always layer something like Dm cash on top of it. Is there a plan of migrating that feature to kRBD so it can use a CIF Fits for example to do that not not at this moment now Yeah current development and like easy space development is you're probably well aware. Yeah And it does have a level in cracking or luminous or what kind of So it won't hit luminous, but hopefully I'm forgetting the name of it now. It's mimic. Yes. Okay. Let's hopefully Yeah, I just wanted to understand a little better what what fsync means for the right back cash So is it flushing to the local client side? Journal or is it going to flush to the OSDs? And if it doesn't flush to the OSDs, is there a way that an application can make sure that data actually Kind of gets out there so that what do you think for you like a super flush? Yeah, so it would be to be a configurable policy But I think you're going to see your big if you're even like if you actually have like a database workload That's like right flush right flush right flush You know you're not going to see much gain out of something like this unless you can selectively ignore those flush barriers to Yeah, wait, and you're not abandoning those flushes and you're keeping everything ordered and consistent If you're familiar with RBD mirroring its features the same way how we keep everything consistent It might be delayed, but everything is respectful of the of the boundaries of the flush So no matter where you finish replaying. It's still consistent. It's might be missing data, but it's not inconsistent So again, I think maybe you answered this already But I just want to make sure I understand if maybe you could go back to the LibRBD IO flow And could you maybe like walk through for that for when things are off the cliff and a request takes four seconds What's going on? Yeah, so It's not necessarily anything in this flow here. That's that's happening It's really what happens is a request goes to the OSD's and those OSD's might be so hammered or the network is so backlogged or something like that You just it just takes longer for those operations to Take place So that's one of the nice things about doing it locally is you take all that out of the equation you you can basically Smooth out and get rid of all those spiky tail latencies that you just get on some random IO Maybe there's just no good answer to this, but it feels strange to me that with all SSD Ceph and an average or quest time of 20 milliseconds still you've got this four-second thing like what's going on with the ordering or Is there no easy answer? Yeah, I unfortunately not have an easy answer I was a the full view into the stack of what request was hung where and what process it could have just been that You know and again, this is totally presumptuous But you know if you had everything has to go through the primary PG But if that primary PG was getting hit so high, you know, it just had a way larger backlog than you know The 99% of the other cases that it just took longer than whatever. I mean, yeah, it's four and a half seconds I mean obviously when you're in those workloads, you were hitting the the extent of what the cluster was able to handle Yeah, plus there are several bottlenecks down the KV store. I mean this this was actually These were a research from blue store by the way So there are actually some other optimization points like down there. So, you know down under the OST The rocks TV layer and so on so I mean there are many latency adders all along the way and though the actually the the chart was Basically at at various Q depths So the the four second latency was actually at 64 So so at that that kind of Q depth your close effective clustered Q depth was probably Times 60 or so and so so that that basically puts a lot of pressure and that yeah Quite a few latency adders on the way Thanks a lot for the presentation presentation is really great stuff that will hopefully address many of the issues that we have with With RBD where we have users that suffer from two large latencies The question I have is like because it respects Right barriers and flushes there would be no issue with multi-attached volumes either right that should all work You're saying you had multiple RBD volumes and you're well Well, if I have a volume that's attached to multiple VMs on multiple hosts. Oh, there should be no issue because You wish that the cash will respect right barriers. Yeah, if you had like a clustered vial system on top of all Yeah, of course And and you wanted to actually get the best performance out of like the local cash You would have to basically not you the the flushes the bears are coming local application get recorded locally To the journal but not necessary. That's otherwise. You're stalling all your IO out near Really no better than you just taking whatever hit that the OSD's and the network are providing so in a workload like that I would say it this is not a good fit Okay, I'm a shared RBD So with everything presented today, how does this align or does it not align with what red heads doing with blue store? as was mentioned in San Francisco and probably last week as well So, I mean, they're they're they're separate problems that are solving so blue store is you know A great reimagining of how to how to store data on on the OSD's to take all the the double But the double journey double journaling and double write penalties off and basically allow sep to control its own destiny You know directly controlling the device as opposed to then having a layered file system on top Sure, but I mean this is all in the same general goal of trying to make sep go faster. Yes So we're at what point do all these efforts, you know kind of come together Eeking out more performance is a never-ending goal of any like system, right? So it's like we'll take our performance wins wherever we can get them as long as there's a use case that actually makes sense for Right. I mean everything happens at a cost to do something else So I mean the question is, you know, we're everyone's obviously trying to squeeze sep for as much as it's capable of doing I understand that What's what's the general end goal? What's the customer drivers behind this? Who's pushing the hardest for you know turning this into a performance platform? I Can't speak for what specific customers are asking for You know, you're just talking about use cases, right? I mean a ephemeral on sep is big use case Again, you know date databases you can't you know not not not just yet, right? Because you you have hard consistency Equalizer, so I would say ephemeral is the biggest one that we have heard right where They'll benefit from this. Thank you One question if you have external cash external from chemo protosleep Radis and there is some bad consequences for example Semi crash and host, you know when Ios stalls or system become unresponsible was still alive On the normal system with in-memory cash you terminate chemo You kind of get rid of off on written datas, but if you have Persistent cash on the host and separate process and you restart guests on the new host How you protect guests? Which start doing some work again from outdated rights coming from the cash Which is a great process well that the cash is on a different Just assume we have host with enabled persistent cash. There is some kind of Crash VM terminated Just goes right back up on the same house. Oh, because it Doesn't start with a new cat. It doesn't start with the new cash It like it then it would have to basically start a pending operations. I started with empty cash on other hosts So you have an evacuation Kind of yes, and then and the original host is not dead completely VM is dead But the caching system is not that how you protect the right back the right back was from the process that no longer lives Yes, yes, yes, no, it's not mean like the chemo process is dead on that house So it's nothing right nothing right back if it's a great process the ground flush process a separate process Kimu is dead this process is not no it's the flush though It's just another thread inside live Rbd. So it's dead. Yeah, it's not a process The only other thing would be a process would just be that demon that does policy so it doesn't do any IO All right, we're gonna have to end it there. We're actually five minutes over so thank you everybody for coming and thank you guys for a good presentation