Thanks, everyone. Thanks for joining. My name is Jeffrey, and I work on the OpenEBS project. As was already mentioned, I was getting a little bit ahead of myself with this talk: I was still working on something and hoping it would finally work when I submitted this title to the FOSDEM call for papers, so we'll see whether we can actually achieve it. Hopefully I can show it to you a little later today. In my defence, storage is not deterministic; sometimes it has its oddities.

So, a little bit of a recap on OpenEBS. I was here last year as well, where I actually forgot to turn on the mic, so I hope this year goes better. I'll touch briefly on how SAN and NAS infrastructure came to be, mostly to set the context for the concept we came up with, container attached storage: what it actually is and what its purpose is. Today we'll talk about the progress we've made and our maiden voyage with Rust, which was a very exciting journey. We'll go over the concepts we've implemented, why we implemented them, and why I personally believe they make sense. As I mentioned, there will be a quick demo, and if you hear something today that excites you and you'd like to join the project: it's open source, obviously, but we're also hiring.

A little bit about OpenEBS. It's an open source project started roughly two years ago, sponsored by my employer, MayaData, so I'm actually paid to work on this. The idea is that we want to provide a cloud native abstraction for cloud native workloads, in particular for persistent volumes. How do you do persistent volumes in a system that's orchestrated in the cloud, in Kubernetes in particular? As mentioned, it's built on Kubernetes, and one thing Kubernetes has proven over time is that, through its abstractions, it allows the developer to focus on deploying the application and not worry so much about the underlying infrastructure. So we aspire to do for storage, in a meaningful way, what Kubernetes does for apps.

To give a little bit of a vision of how we see this all happening: imagine you have several cloud vendors — it could be on-prem, it could be different types of vendors — and you run everything through Kubernetes. What we want to build and provide is a declarative data plane that lets you abstract away the differences between them, not just from a configuration aspect but also from a data plane aspect, and we'll go over how we do that, because it's obviously non-trivial. Based on that we capture a lot of information, and the idea is that on top of that information we provide various technologies that give you insight, visibility, and advice on how you should move your data, plus policies and so on.

The motivation this project started with is: why are we doing this? One of the things we saw is that applications have changed, and somebody forgot to tell storage. What I mean by that is that cloud native applications these days have the scalability and availability batteries included. If you look at these new types of applications, they are built with failure in mind.
They work across nodes, across DCs, across regions, and even across cloud providers. Similarly for performance: they all have this native scalability through HAProxy, Envoy, or what have you. That leaves some room, let's say, for rethinking how you do certain things in storage, because not everything has to be tied to one storage system where, if that storage system goes out, everybody has a problem.

The other aspect — a very important one — is the people: the DevOps persona, who need to deliver fast and frequently. I had a hilarious laugh last night because there was a very interesting video on Twitter about "fast and frequently", but that's beside the point. The point is that these applications are born in the cloud, so they adopt these cloud native patterns, and now people want to move them back on-prem, because they need to take them into production. That's when the problems start, because the abstractions that hold in the cloud are not necessarily true in their own environment.

Another important aspect is hardware: hardware trends force a change in the way we do things, and we'll go over those as well. You can see these trends propagate in all areas. In the languages we use, we have concurrency primitives built in — the particular example is Go, where you just write `go` whatever and you don't need to create pthreads; it's baked into the language. There are even languages higher up than that, cloud native languages like Metaparticle or Ballerina, where you basically describe the communication between the individual components. Then there's the hardware, and obviously Kubernetes as a universal control plane to deploy these things.

The interesting thing — and this has fuelled a lot of the work I've been doing over the last couple of months — is that if you're not going to the cloud, the cloud will come for you. What I mean by that is things like GKE On-Prem, which isn't publicly available yet, where you get the GKE look and feel on-prem, and the same goes for Amazon Outposts. People can buy point solutions from Google and Amazon to run these types of applications on-prem just as they did in the cloud. The other thing is that Kubernetes is starting to move towards also managing VMs — we can argue whether that's a good thing — and at the same time we see different types of VMs. There was a talk yesterday about rust-vmm, which is a perfect example, where the hypervisor is stripped down to the bare minimum just to get the isolation features, so that we can run containers in production safely, because security is still not a very well solved problem in this space.

So, PVs and PVCs in a nutshell. There was a talk before this one about CSI and all that, so I'll go over this quickly. I'd basically sum it up as: a PV is a mountable thing that you register with your cluster; a PVC is how you claim that mountable thing in your cluster, and you then reference the claim in your application. There are some other things around that, like dynamic provisioning, so you don't have to register all these things manually, because that would be a tedious task. That's a good thing, but it also has some implications, because your storage is like the mother of all snowflakes.
It's very special, it's very purpose-built, and it has all kinds of limitations in various dimensions that you need to consider when you're running Kubernetes on top of your storage system. But it is a lot better than before, and then you have CSI, the Container Storage Interface, which basically hides the vendor-specific implementation of actually mounting the volumes and propagating the mount points into your containers.

So how does this look in YAML? The first example is something you probably should not do, because what I'm doing here is mounting a local mount point of a node into the container spec. As I mentioned, you really shouldn't do this unless you really, really know what you're doing, because the problem is that when the workload moves away, your data is gone. That seems like a no-brainer — "well, just don't delete the data" — but the point is that your data needs to move with your application.

The canonical way of doing that is through PVs and PVCs, and unfortunately it's a little bit more YAML (there's a minimal sketch of these shapes at the end of this section). You first create the persistent volume — I won't go over what all the fields mean here; they're described in the Kubernetes documentation — then you create the claim, so now I have a claim on this volume, and the final piece is that I need to refer to that claim within my application. That's a whole lot of YAML for a simple mount command, but it is what it is; it's the cost of abstraction, I suppose. It does make sense when you run this at scale — for a single container you're like, "man, that's way too much", but at scale it actually makes sense.

I think a picture says more than a thousand words here. Imagine you have a two-node system, you have your PVC that lives on some storage system — it's not important which right now — and then you have your pod, and you need to mount this thing. This is what CSI helps with. It does it automatically, in the sense that it first connects the volume to the right host where your workload is — it knows this because it works together with Kubernetes to figure that out — and then the final piece, the blinking arrow, is the actual mount propagation, because with containers and namespaces we need to do some additional work. The trick, obviously, is that when the pod moves, we need to do the same dance again, and there's a certain order to this: you cannot start the app before the PV is there. It sounds trivial, but there's a little bit more to it than meets the eye.

The other thing — I don't know if you saw it blink, it should have blinked — is that the PVC becomes a very important thing here, because as I mentioned, it doesn't work without it, but you also need to manage that data. How do you do that?
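For readers of the transcript: the slides with the actual YAML aren't reproduced here, so below is a minimal sketch of the canonical PersistentVolume, PersistentVolumeClaim, and pod shapes described above. The names, sizes, and the NFS backend are placeholder choices for illustration, not what was on the slides.

```yaml
# A cluster-wide PersistentVolume (placeholder names and backend)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:                      # could be iSCSI, a CSI driver, etc.
    server: 10.0.0.10
    path: /exports/demo
---
# The claim on that volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
# And the pod that references the claim
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - mountPath: /data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc
```

With dynamic provisioning, the hand-written PersistentVolume is replaced by a StorageClass and the volume is provisioned on demand, but the claim-and-reference pattern in the pod stays the same.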
That's some of the things we've looked into. So is the problem solved with this PV/PVC abstraction — can I just plug in any storage system and be happy? We believe there's a bit more to it than that. First: how does a developer compose a volume specific to their needs? What I mean is that if you run MongoDB, for example, and you're using sharding, you don't need all the fancy features of your storage, and in fact they might actually get in the way. If your storage system has a fixed endpoint that it replicates to, it basically nails you down to that storage system, and thus to that network and everything around it. You want some flexibility in certain cases — maybe not all.

The other thing is: how do you unify the differences between the storage subsystems you might have? In DC1 you have vendor A, in DC2 you have vendor B. Now I want to have a PV and replicate this stuff — how do I actually do that? These vendors go out of their way to make that as incompatible as possible, so that's something you need to consider as well: what can we do to make that a little bit better?

And then, whether you like it or not, when you go into Amazon or GKE or whatever, what is it that makes it easy? You don't have to worry about anything. You just create a volume, it's there, you use it, you write to it, something crashes, and sure enough, there's your volume again. That expectation of how people use it and operate it is something we would like to bring to on-prem situations, or at least make cloud independent.

The gist of it: make your storage as agile as the applications it serves. Now, there is a snake in the grass, for lack of a better word, and that is that as you create all these PVs on your storage systems, what you end up with — depending on your storage system — is this thing called data gravity. Whatever your storage system is, this happens by definition; you can look up the term, there are a lot of articles about it. Everything revolves around your storage system, and the more data you store, the more applications become dependent on that data; it's a never-ending loop. So how can you break out of that data gravity, and why would you want to? One reason is that all these storage systems have limits — limits in terms of latency, throughput, the IO blender effect — and if the system collapses, the whole universe is gone, or at least that solar system.

The other thing is the dimensions I already mentioned: those volumes typically have a limit in terms of how many you can create. The typical solution is: let's replicate the sun to another dimension. You could argue that works — and it does, in a way, because we've been doing that for many, many years — but it might actually make the problem worse. And when you talk to Picard about that, he's not really happy, because he's busy fighting the Borg — and the Borg here is a reference to Kubernetes, which grew out of the original Borg paper, which I'd suggest you read. Hence the Star Trek in these slides.

Anyway: the OpenEBS approach. We thought about this, and we all have a storage background.
So we're not new to this, but we didn't feel like, "okay, let's build another distributed system that's even better than all the other distributed systems out there." We really tried to look at the problem we're trying to solve. One part of it is that instead of having dynamic distributed algorithms figure out or calculate where the data should go, we allow data placement to be defined in YAML. Again, we can do this because Kubernetes is there, with topology awareness and all those things.

The other thing is that we want it to be composable: just like you compose your containers, we would like you to be able to compose your volume and turn off certain features you don't need. As you decompose those monolithic apps, certain things become possible, because we're dealing with volumes that are 20 or 200 gigabytes in size, not petabyte volumes that need replicating, so the inertia decreases a lot.

It runs in containers, for containers, and therefore in user space. The primary reason for this is that if you look at the different cloud vendors, they all have their own operating systems. You cannot just load a kernel module into them, because (a) it's not available, and (b) if you did, they would not support you when you have an issue. So doing this in user space seemed the most logical approach, which requires a certain amount of tricks to make it perform — we'll go over those as well. So instead of having one big sun, we have a lot of stars, and Picard thinks that's a lot better.

To give you a feel for what this actually means: data availability. Let's assume you're a developer and you're tasked with building an application, you have two data centers, and your storage administrator says, "you're a smart engineer, you go figure out how to get the data there." That could be one of the things you'd need to do. Like I mentioned, replication between the vendors is not efficient, or maybe not even possible at all. You could spin up a container, configure rsync, and say where it needs to go, but that's all very static; it's not very dynamic, so that probably isn't a very good solution. What we allow you to do is abstract away the differences between them: we insert ourselves into the data plane and do that for you, and you just specify what you want, by declarative intent, through YAML.

How does this look? You have your PV, then you have your container attached storage system, and then you have box one, box two, and box three — and this could be any box. That's actually a reference to the comedy show Silicon Valley, if you've seen it; it's quite entertaining.
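Just to make "declarative intent through YAML" a bit more tangible before the fan-out picture: the sketch below is invented for this transcript — the field names are not the actual OpenEBS schema — but it shows the idea that placement, protocol, and replication mode live in the volume spec rather than in a storage admin's head.

```yaml
# Invented example — not the actual OpenEBS schema.
kind: CASVolume
metadata:
  name: mongo-data
spec:
  size: 20Gi
  replicas:
    - target: dc1-vendor-a      # box one: vendor A in data center 1
      protocol: iscsi
      mode: sync
    - target: dc2-vendor-b      # box two: vendor B in data center 2
      protocol: iscsi
      mode: sync
    - target: dc1-nbd-legacy    # box three: slower path, acknowledged later
      protocol: nbd
      mode: async
  features:
    compression: false          # turn off what the workload doesn't need
    encryption: true
```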
So as the IO comes in, we just fan it out to the three different subsystems. What's important to note here is that the CAS container does the logging in and the actual replication: it speaks all of these protocols to do the work for you, so you don't have to do all that work yourself. The OpenEBS operator, which is not shown here, obviously plays a crucial role in making this all possible; it consumes the specific YAML you impose on it when you create the PV and the PVC.

Another example: imagine a similar situation, but now you have different protocols, and these can be the underlying protocols as well as the protocols on top. What I mean is that you're running on host A, which has an NVMe card with RDMA, and now the workload moves to host B, which has no NVMe. How do I switch dynamically between NVMe and, let's say, iSCSI? At the other end — and I'll explain this a bit better later — you have the same problem: you might connect to a device that's physically local to the machine, or an iSCSI device, or a network block device. Abstracting away these differences is key here, or at least we think it is.

To give an example of asymmetrical back ends — that's where the term comes from, because all these different protocols have different performance envelopes — say you want to get rid of an NBD device because it's slower than iSCSI and therefore dictates performance, depending on the replication model. Async, I think, speaks for itself. Semi-sync is basically where you say: I want the two iSCSI LUNs to acknowledge, and the third replica can acknowledge a little later, within certain rules. Sync, I think, also speaks for itself. So imagine you want to attach a new iSCSI LUN and then rebuild — or "hydrate", as they call it in the cloud native world, which probably sounds a bit cooler than mirroring. Once the hydration or mirroring is complete, it becomes a full replica of the whole set, and then you can phase out the existing NBD volume, and there you go: you're on iSCSI. This is a somewhat contrived example, obviously, but there are real data migration paths you need to handle in data centers, because storage systems have a particular lifespan attached to them — they only operate for, say, three years — so you need to do this anyway, and it's always been a big problem.

A little bit about this rebuilding. As I mentioned, those volumes are relatively small, but even so, you don't want to rebuild things that don't require rebuilding, such as unused blocks. The Ceph talk actually mentioned this as well: keep a bitmap. These are not new things — people have been doing this for years.
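As a rough illustration of that idea — this is sketch code written for this transcript, not the actual OpenEBS rebuild logic — a dirty-segment bitmap over fixed-size regions could look like this:

```rust
/// Illustrative only: a dirty-segment bitmap over fixed-size regions of a
/// volume, so a rebuild only has to copy segments that were actually written.
struct DirtyMap {
    segment_size: u64, // bytes covered by one bit, e.g. 1 MiB
    bits: Vec<u64>,    // one bit per segment
}

impl DirtyMap {
    fn new(volume_size: u64, segment_size: u64) -> Self {
        let segments = (volume_size + segment_size - 1) / segment_size;
        let words = ((segments + 63) / 64) as usize;
        DirtyMap { segment_size, bits: vec![0; words] }
    }

    /// Mark every segment touched by a write as dirty.
    fn mark_dirty(&mut self, offset: u64, len: u64) {
        let first = offset / self.segment_size;
        let last = (offset + len - 1) / self.segment_size;
        for seg in first..=last {
            self.bits[(seg / 64) as usize] |= 1 << (seg % 64);
        }
    }

    /// During rebuild, only the dirty segments need to be copied; clearing
    /// the bit here means a write that lands mid-rebuild simply re-dirties
    /// its segment and gets picked up on the next pass.
    fn take_dirty(&mut self) -> Vec<u64> {
        let mut dirty = Vec::new();
        for (w, word) in self.bits.iter_mut().enumerate() {
            while *word != 0 {
                let bit = word.trailing_zeros() as u64;
                dirty.push(w as u64 * 64 + bit);
                *word &= *word - 1; // clear lowest set bit
            }
        }
        dirty
    }
}
```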
So I'm not going to pretend I invented the light bulb here; I'm just reapplying what's been there, and the cool thing is that I get to do it at a far smaller size. So: bitmaps that keep track of dirty segments, where a segment represents a region on the disk, and you only copy the dirty segments over to the other device. There are some things you need to handle, because while you're rebuilding, another dirty segment can get created, but again, this is not rocket science; we've been doing it for many years.

On "small": if you Google "Bonwick space maps", you'll see what I mean. Space maps are the way ZFS — which you've probably heard of — can handle petabytes, because if you keep a plain bitmap for petabytes, that bitmap is huge. In our case the bitmap is small, and that's key.

Thin provisioning, snapshots, clones: obviously you want to have those. I won't be talking about them now — maybe next year — but as I mentioned, this is nothing new, so it's not all that interesting.

Composable storage: okay, what does that mean? It sounds cool, but what can I do with it? In Kubernetes they don't really talk about input and output, they talk about ingress and egress — okay, whatever, we'll use that. So you have your input, and you have your output, which points to a device — whatever it is; we'll get to the types of devices later — and then you can add translation layers in between, based on what you need: for example compression, encryption, and mirroring, in this particular case. The order is important here, so don't encrypt and then compress.

In order to build this stuff, as I mentioned, we needed a language that allows us to accelerate a little bit, so we opted for Rust, which was an interesting journey. One thing there: `unsafe` actually disables certain optimizations for the compiler, and Rust doesn't have some of the constructor attributes you'd be used to from C. But that's a bit of a sidestep; it was quite interesting.

So let's look a little at the mirror device, at a very high level. I actually had to remove a lot of code, and you can't see the comments, which is unfortunate. Basically, this is the structure of a mirror device based on the YAML: we enumerate over the YAML, let's say, and then we open all those sub-devices for you, and every IO goes through the children, which you can see here. Again, I've removed a lot of stuff. At the very end we match on the IO type: if it's a read we do the read, if it's a write we do the write. The most important part is here — this is for OBD, which stands for online block devices: we basically iterate over all of them, and the policy we apply is determined before we actually do the IO. If there's an error, we mark the IO as having failed, and then in the completion handler, again based on the policy, we figure out whether the IO failure is critical, or whether we just say, "okay, two of my three replicas replied, my policy says two is enough, so I'll just continue."
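Since the comments on the slide aren't visible, here is a compressed reconstruction of the shape just described. It's illustrative only — the types and names are invented for this transcript, not the real source — but it shows the fan-out over children, the match on the IO type, and the completion policy:

```rust
// Illustrative reconstruction of the mirror-device shape described above.
enum IoType { Read, Write }

struct Io {
    kind: IoType,
    offset: u64,
    buf: Vec<u8>,
}

struct Child {
    name: String, // e.g. an iSCSI LUN, an NBD export, a local device
    online: bool,
}

struct Mirror {
    children: Vec<Child>, // opened from the devices enumerated in the YAML
    min_replicas: usize,  // the policy: how many children must succeed
}

impl Mirror {
    fn submit(&self, io: &Io) -> Result<(), String> {
        let mut ok = 0;
        // Every IO is fanned out to all online children.
        for child in self.children.iter().filter(|c| c.online) {
            let res = match io.kind {
                IoType::Read => Self::do_read(child, io),
                IoType::Write => Self::do_write(child, io),
            };
            if res.is_ok() {
                ok += 1;
            }
        }
        // The completion handler decides, based on the policy, whether a
        // partial failure is critical: e.g. two out of three is enough.
        if ok >= self.min_replicas {
            Ok(())
        } else {
            Err(format!("only {ok} of {} replicas completed", self.children.len()))
        }
    }

    // Stubs standing in for the per-protocol read/write paths.
    fn do_read(_child: &Child, _io: &Io) -> Result<(), String> { Ok(()) }
    fn do_write(_child: &Child, _io: &Io) -> Result<(), String> { Ok(()) }
}
```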
So, declarative: what does that mean? In YAML it kind of looks like this: you define what type of volumes you want, where they are, and whether you need thin provisioning — we can actually do thin provisioning on bigger volumes, and you only set this up once, so that's no problem either — plus the properties you want. So you can compose your storage the way you want it.

A little bit more about these protocols. I have my PV and I have my container; normally you talk to a storage system, so okay, how do I connect? This was really key, because one of the things we do not want to do is impose a cost that is unacceptable in terms of performance, because at the end of the day it's all nice, but if it doesn't perform, it doesn't really work. What we do is support a set of protocols on the input side. The virtio family is very important, again, for these new stripped-down hypervisors; they typically use virtio to do IO very efficiently, because those kernels know how to work with virtualized IO. So we have the input path, and then there is the output path, which can be — well, I won't say everything, but a lot. Gluster is one of them: we can just write to that, or to another iSCSI LUN, or to a local block device, which we access through AIO, and so forth. When you construct a volume with this YAML, it basically just fans out, and it can do a transformation along the way or not.

Okay, so let's talk a little bit about the performance aspect, because doing IO in user space in particular has some challenges, and what can we do to mitigate them? One of the things is that when you look at CPUs these days, they have these MMUs and IOMMUs, and these are very important in terms of the misses you can get and the impact those have on performance. A solution to that is to use huge pages — and they're not actually all that huge, they're only two megabytes. I don't know if you can read it, but with 4k pages we have 22 million misses, and in the lower row we have zero misses, so this is really beneficial for performance. (There's a small sketch of reserving huge pages at the end of this section.) The other thing is that once we start up, the allocation is static: we don't need more memory as we push data through, because we just pass the data along — unless you do compression — so we basically allocate these things up front.

And so, Lord of the Rings. As I mentioned, the hardware has changed, so what can we learn from these hardware changes, and how do we make sure we use them optimally in the software stack? It's interesting to see, with the lines I tried to put on top of that chart, that the number of logical cores keeps increasing while the frequency in hertz basically stagnated back in the 2000s and came to a dead stop. More cores means that if you want to utilize those cores, you need to change your software approach. This is exactly what the NVMe spec does: it uses all these rings, each ring refers to a particular device, and each core has its own ring, so each core can talk to any NVMe device without any locks in between whatsoever.
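On the huge-pages point above: reserving hugepage-backed memory from user space is, at its core, a single mmap call. Here's a minimal sketch using the `libc` crate — in practice the SPDK/DPDK environment layer does this allocation for you, so this is just to show the mechanism, not the project's code:

```rust
use libc::{mmap, MAP_ANONYMOUS, MAP_FAILED, MAP_HUGETLB, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;

/// Illustrative only: reserve one 2 MiB huge page up front, the way a
/// user-space data path pre-allocates its IO buffers at startup.
/// Fails unless huge pages have been reserved (vm.nr_hugepages).
fn alloc_huge_page() -> *mut u8 {
    const HUGE_PAGE: usize = 2 * 1024 * 1024;
    let p = unsafe {
        mmap(
            ptr::null_mut(),
            HUGE_PAGE,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
            -1,
            0,
        )
    };
    assert_ne!(p, MAP_FAILED, "no huge pages reserved? check vm.nr_hugepages");
    p as *mut u8
}
```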
And there are some other interesting things. This is Jens — you probably know him as the Linux kernel block maintainer; he's the author of the blk-mq layer, so a very smart guy — and he's been implementing this ring approach in the AIO subsystem. I've highlighted the word "ring", and this is just one of his bullets. You can really see this pattern of submission queues, completion queues, and rings come back throughout the whole IO stack. The reason he's working on this in particular is that he wants to mitigate the cost of the Spectre and Meltdown patches: with this code, when you do IO, you don't even go to kernel space at all. You write into a ring buffer, the kernel does the submission for you, and you just poll for completion.

We use the same thing in container attached storage; we're leveraging code from SPDK there. There is this poller: we have an iSCSI socket, we're listening for iSCSI, and we're constantly reading on the socket with a non-blocking read call. Most of the time we do the read, it says it would block, and we just continue to the next thing, spinning like a madman to see if there's work. If there is work, however — whatever the job is, say a read, a write, or an inquiry command — a message is placed into this ring, which then gets read by the reactor. Typically that's a function with argument one, argument two, but this is abstracted away in terms of IO channels, so you don't actually have to send the message yourself; you work with the IO channel to make it a little easier. (There's a small conceptual sketch of this loop at the end of this section.) Then there's also a management interface, obviously, which polls the same way, and there are the devices themselves that you need to poll as well — "hey, there's an interrupt", or "I'm ready", or whatever.

Efficiency. People ask: "hey, you're polling constantly, what does that do for performance?" This is the same machine. At the top you can see the traditional way of running this — CPU idle is the right-hand column, and the more idle the better — and the load is divided between kernel space and user space, as you would expect. The row below is where we run completely in user space. The thing you don't normally see is what the kernel does in the background when you actually do the IO. You would think, "hey, this polling is not efficient", but it turns out — and the numbers speak for themselves, although the difference is very small — that we're actually slightly more efficient. That's not really the goal here, though; the goal is to show that polling is not a problem, and in fact kernels dynamically start polling in the background anyway; you just don't know about it.

This gentleman, Avi, was the author of KVM, and he's now working on a database. They use similar constructs from DPDK in their software, and he sums it up pretty nicely: if there's no load, yes, we're probably a little bit less efficient, but if there's high load, we are far more efficient. The same holds true here.

So, NVMe over TCP was ratified somewhere in November. I was trying to make that work — it didn't go very well, as you could see, but I figured it out eventually. The point here is that protocols matter. I have this Micron 1.9 terabyte NVMe SSD which on spec can do 840k IOPS. That is just a huge amount of IOPS — I mean, come on, it's insane.
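To make the poller-and-reactor model from a moment ago a bit more concrete, here is a conceptual sketch — this is not SPDK's actual API, just the shape: a read that never blocks the core, work pushed onto a ring, and a reactor that drains it:

```rust
// Conceptual sketch of the poll-mode loop described above (not SPDK's API).
// The TcpStream is assumed to have been put into non-blocking mode with
// sock.set_nonblocking(true) before the loop starts spinning.
use std::collections::VecDeque;
use std::io::{ErrorKind, Read};
use std::net::TcpStream;

enum Work {
    Request(Vec<u8>), // e.g. an iSCSI PDU that still has to be parsed
    AdminPoll,        // management interface, polled the same way
}

struct Reactor {
    ring: VecDeque<Work>, // stand-in for the ring between poller and reactor
}

impl Reactor {
    /// One spin of the poller: try to read, never block.
    fn poll_socket(&mut self, sock: &mut TcpStream) {
        let mut buf = [0u8; 4096];
        match sock.read(&mut buf) {
            Ok(0) => {}                                       // peer closed
            Ok(n) => self.ring.push_back(Work::Request(buf[..n].to_vec())),
            Err(e) if e.kind() == ErrorKind::WouldBlock => {} // nothing to do, keep spinning
            Err(e) => eprintln!("socket error: {e}"),
        }
    }

    /// Drain whatever the pollers queued up.
    fn run_once(&mut self) {
        while let Some(work) = self.ring.pop_front() {
            match work {
                Work::Request(pdu) => { let _ = pdu; /* dispatch read/write/inquiry */ }
                Work::AdminPoll => { /* handle management traffic */ }
            }
        }
    }
}
```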
But anyway. We run the software using UIO: we basically map the PCI registers into user space and write directly into those registers, and we get close to 960,000 IOPS. It's very likely this performance will degrade over time, because that's the nature of SSDs, but it will probably reach a steady state around that 840k spec.

Then I did the same thing and exported that device through the network block device protocol, NBD, and it could only do a hundred thousand IOPS. The problem here is that the syscall rate dominates performance, and that's due to how NBD is designed: each IO is a syscall. iSCSI does a better job — you can log in multiple times — and then you get close to five hundred thousand IOPS, but that's where it basically starts to become a problem as well. The solution, then, is obviously not iSCSI over TCP, or iSCSI at all, but NVMe over TCP. RDMA goes a lot faster, but for RDMA you need special hardware and special switches, and that's probably a big tax on your infrastructure. So how can you leverage NVMe on your existing infra without throwing away all your hardware? NVMe over TCP is a solution for that. It's not rock solid stable yet — it's only available in the 5.0 RC kernels — but I did a, well, I wouldn't say scientific, but nonetheless a test on my laptop, and with the same device, same setup, same laptop, same VM, just switching the protocol gave me a 30% increase. The reason is not that NVMe over TCP is so much smarter; the reason is the things it does not have to do. Less is more, and you get that back as performance, because the whole block layer request path, the SCSI layer, the HBA, and all those things are completely gone in the NVMe path, and that allows for better performance.

So I want to show you real quick how that works — I hope, at least, that I can show it real quick. Okay. So I have this tool here... if I can type... I have this tool here, and I had these egress things, right? I'm skipping all of that and writing immediately into the underlying layer. The reason I do that is that I want to measure the incurred cost when you, for example, add encryption, or encryption with compression: what does that do to performance? You need to figure out how these things behave. To give you a little bit of an example — so, the configuration file here... shoot, let me see if I can... okay, that's unfortunate, you can't see that, right? Maybe I should have known better than to do demos. Okay: this is a null device. Basically, what it does is it grabs the IO and throws it on the floor. What good does that do? Well, the kernel actually has a similar thing called null_blk; it's a common thing to do. How do you measure overhead in the block stack? By having a block device that's infinitely fast. And how do you do that? You don't do the IO.
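The trick is the same one the kernel's null_blk uses. A sketch of such a backend — illustrative only, not the demo tool's code — makes clear why whatever you then measure is purely the overhead of the layers stacked on top of it:

```rust
// Illustrative only: a backend that "completes" every IO without touching
// any media, so it is infinitely fast and only the layers above it cost time.
use std::sync::atomic::{AtomicU64, Ordering};

struct NullDevice {
    completed: AtomicU64,
}

impl NullDevice {
    fn new() -> Self {
        NullDevice { completed: AtomicU64::new(0) }
    }

    /// Reads return zeroes, writes are dropped on the floor; both complete
    /// instantly.
    fn submit(&self, is_write: bool, buf: &mut [u8]) {
        if !is_write {
            buf.fill(0);
        }
        self.completed.fetch_add(1, Ordering::Relaxed);
    }
}
```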
So it's a little bit of cheating. Let me actually move my terminal back to my laptop so I can see what I'm doing for a second... Actually, I'll show you what I've done, because I want to set it up real quick so that you can see that it composes several devices out of multiple pieces — typing under pressure is never good, okay. There we are. What I've done — and I hope you'll bear with me here — is I'm going to create three of these devices, and then I'm going to add this layer of mirroring on top, where I grab those three devices and do IO to them. Not all that interesting, but the thing here is: okay, so how fast is it? Because if this is slow, it'll never go fast. Where's my mouse? Oh — I'm never going to do this again.

Okay, so I'm going to start the thing. I have to give it the configuration file, and I'm going to give it a queue depth of one, because that's what devices don't like — it shows how slow they really are. I need to specify a block size, and we'll set a time limit as well, because otherwise it will just run forever, and I think that's — oh yeah, we want to do some random writes. Whoa. So you see some of the huge-pages output, it constructs the mirror device based on this config, and... that's a little over three and a half million. So we're mirroring across three devices and doing three and a half million IOPS, so we actually broke that barrier. Just to show you — there was one last thing, but never mind.

Questions? Yes. Yes, sir? ... No, I have not — sorry, yes: I did not turn on multi-connection support for NBD, that is very true. But my point was not to find the absolute numbers for iSCSI versus NBD; what I wanted to show is that there is a difference in the performance envelope depending on which protocol you select. Any other questions? Okay, well, thank you very much, and see you next time. Thank you.