All right, we're starting to level off on attendees joining on time. Thank you for joining on time. We're going to go ahead and get started. I'd like to thank everyone for joining today's CNCF webinar. The topic du jour is persistent cloud native volumes at NVMe speed. I'm Lee Calcote, your host today, and I'll be helping moderate the Q&A. I am a CNCF ambassador and founder of layer5.io. If you are into service meshes, or thinking about adopting a service mesh, go to layer5.io; you'll find some resources there. We are fortunate today to be joined by Philipp Reisner, CEO of LINBIT. He's here to tell us how to run persistent volumes very quickly. With that, as we get going, please note that while you can't ask questions over audio, we highly encourage you to place them into the Q&A, so please don't be shy. Philipp said something about tossing out a challenge to see if people could ask him a question he couldn't answer, and I may or may not be putting words in Philipp's mouth. But with that, Philipp, hi, how are you? Come on in.

Hi, Lee. Thanks for the introduction. Okay, then let's get started here, right? Maybe you don't know LINBIT, the company behind all this, so I'll start with a few words about LINBIT and then we jump right into the technical stuff. The company was founded in 2001, so we have been around for quite some time now. At the moment we are about 30 people, and most of us are located in Vienna, Austria, so you might meet LINBIT folks with a German accent. About 10 or 11 of us, it's changing so quickly, are located in Portland, Oregon, and on the west coast of the United States. And we have a very strong partner in Japan who helps us serve our customers there; that obviously helps with the language barrier we would otherwise have. We care a lot about open-source software. We also have a few proprietary bits, but let's jump into it.

I will start by talking about some building blocks we have in the Linux kernel, building blocks we can use to assemble a complete storage solution. Let's start with LVM, the Linux Volume Manager, or logical volume manager. I hope most of you are familiar with it. We take some physical volumes, and we make them physical volumes by writing a label to them. These can be full disks or full SSDs, or partitions on disks or SSDs. They go into a volume group, which is a concept of LVM, and out of that we can create logical volumes. What's the advantage over using partitions? Well, these logical volumes can span multiple disks, and we can have many of them; partitions, you know, were limited to 15. On top of that, we can also create snapshots. This has been around forever, so it has certain limitations; for example, it becomes really, really inefficient if you create many snapshots of a single origin. A few years later, LVM got a new feature called LVM thin, or thin pools. With that, one of those fully allocated LVs becomes a thin pool, and out of it we can create thinly allocated logical volumes. A thin LV feels like a regular logical volume; the only difference is that the storage in it is only allocated the moment you write to it. That brings two things. One is that you can do over-provisioning: you can create thin LVs that offer more storage than you actually added with your physical volumes. The other is that snapshots in the thin LV world are very efficient.
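To make the LVM part concrete, here is a minimal sketch of that stack driven from Python. The device names (/dev/sdb, /dev/sdc), the volume group name, and the sizes are made-up examples; the lvcreate flags are the standard ones for thin pools, thin volumes, and thin snapshots.

```python
import subprocess

def run(*cmd):
    """Run an LVM command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Label the physical devices and build a volume group (hypothetical devices).
run("pvcreate", "/dev/sdb", "/dev/sdc")
run("vgcreate", "vg0", "/dev/sdb", "/dev/sdc")

# Turn 100 GiB of the volume group into a thin pool ...
run("lvcreate", "--size", "100G", "--thin", "vg0/thinpool")

# ... and carve an over-provisioned 500 GiB thin LV out of it.
run("lvcreate", "--virtualsize", "500G", "--thin", "vg0/thinpool", "--name", "data")

# Thin snapshots are cheap, so taking one every few minutes is feasible.
run("lvcreate", "--snapshot", "--name", "data_snap1", "vg0/data")
```

The snapshot-every-few-minutes policy described next would essentially be that last command on a timer, plus a cleanup job that removes snapshots older than your retention window.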
So that puts us in a position where we can take a snapshot of a block device, and a second one, and a third one, a snapshot every five minutes, and keep these snapshots for two days before we start throwing them away. That's a very powerful thing. What other building blocks are there in the Linux I/O stack? There is software RAID, right? All the RAID levels you dream of: striping, mirroring, RAID 5, RAID 6, RAID 10. It even has two front ends to the same back-end code in the kernel. What else is there? There are two implementations for using SSDs as caches for rotating media (bcache and dm-cache), available at your fingertips. Then there is de-duplication in the Linux kernel (VDO). That's a little less widespread than the other features; it became available with the release of RHEL 7.5 or CentOS 7.5. It's not yet in the upstream Linux kernel, but to the best of my knowledge the Red Hat people are working on that. It came from an acquisition: Red Hat got it from Permabit. And it is a full-blown, industry top-notch, inline data de-duplication implementation. Where would you use that? You would use it if you have LVs holding, let's say, images of virtual machines, and you expect that the same operating system is installed on all of these images; then de-duplication helps a lot. Then we have many targets and initiators on Linux, for iSCSI and all the related protocols. With recent Linux releases, there are also NVMe over Fabrics target and initiator implementations, in software, in the upstream kernel, and the newest cousin, NVMe over TCP. And if you look at another distribution, at Ubuntu, there is also ZFS. ZFS brings its own implementation of a few of these components: it has its own built-in volume manager (the ZFS people then speak of zvols), its own implementation of thin provisioning, its own RAID levels, and its own caching mechanism. So we have plenty of these building blocks.

And then we at LINBIT added another building block, and that's DRBD. DRBD is a block replication technology. If you're not familiar with it, this illustration can give you an idea: it feels like a RAID 1 between a local block device and an initiator, where the initiator goes to a target, and that is where the second copy is stored. But it's not implemented using those components; that was just an illustration to give you an idea. In DRBD speak, we call the side where your application is running the primary, and from there we replicate to a secondary. You can very easily switch these roles: by stopping the application and unmounting the file system, this node demotes to secondary, and a moment later, on the other node, you mount the file system, start the application, and it promotes to primary in the moment the mount command opens the block device. And then the replication direction is reversed.
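The role switch just described can be sketched as the command sequence below, wrapped in Python. The resource name r0, the device path /dev/drbd0, and the mount point are illustrative assumptions; drbdadm, mount, and umount are the real tools involved. With DRBD's auto-promote behaviour, the explicit drbdadm calls may not even be needed, since opening the device on mount promotes the node, exactly as the talk says.

```python
import subprocess

def sh(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

RES, DEV, MNT = "r0", "/dev/drbd0", "/mnt/data"   # hypothetical names

def demote_here():
    # Run on the old primary: stop the application first (not shown),
    # then release the device so the node can become secondary.
    sh("umount", MNT)
    sh("drbdadm", "secondary", RES)   # explicit demote; auto-promote does this on close

def promote_here():
    # Run on the new primary: take over the resource and restart the workload.
    sh("drbdadm", "primary", RES)     # explicit promote; with auto-promote, mount alone suffices
    sh("mount", DEV, MNT)
    # ... start the application here; replication now flows from this node.
```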
DRBD can also replicate multiple volumes in a consistent way. You would do that if you have volumes with very different characteristics, let's say a very fast RAID 1 of NVMe drives and a very slow RAID 6 of hard disks, and you want to use these two volume types for a single application, say a database, where you put the huge tablespaces on the slow device and the logs on the fast device. Then DRBD is able to replicate these two volumes together, so that at any point in time, whenever the replication link fails or your primary fails, the two volumes on the target side are at the same logical point in time. That allows the application on top, the database, to recover and continue to offer its service.

DRBD is not just a two-node thing. It can mirror to multiple nodes, and in virtualization use cases you can switch the roles at runtime. And not every node in a DRBD cluster needs to have a local replica of the dataset. You can think of such a diskless node like an iSCSI initiator: the iSCSI initiator gives you access to your data, right? You can mount a file system on the block device the iSCSI initiator gives you. A diskless DRBD node is just like that: it gives you access to your data, but it can be connected to two secondaries holding replicas concurrently. That means if one of those nodes fails, the application running here is completely shielded from the failure. Just imagine a read request going down the stack; this node sends the read request over to one replica, and before that replica sends the answer, it fails. Then the primary will reissue the read request to the other node, get the data, and hand it back to the upper layers. So you're completely shielded from node failures, device failures, network failures, whatever. And if a failed node comes back, it's automatically reintegrated: a resync process runs between the two, and once it's up to date, the primary can use it again for read requests. Write requests are simply sent to both nodes holding a disk concurrently.

Over the nearly 20 years we have been developing this, it got many bells and whistles. Some of the features are less relevant these days, others are more relevant; I'm too time-constrained to go into every detail now, so post your questions if you're interested in specific features. What have we been doing recently in the DRBD world? We added optimizations for the case where you have PMEM, persistent memory. Persistent memory, or NVDIMMs, have been available for quite some time. That is like a hybrid between memory, because it can be as fast as RAM, and a storage device, because it keeps its data even when the server loses power. So we have been doing optimizations so that we get more IOPS if our metadata is located on PMEM or NVDIMMs, and recently Intel has been promoting their PMEM products. We also improved locking, and these days, if your backend devices are fast enough, you can see up to 200,000 to 300,000 IOPS on a single block device that is replicated by DRBD. And I'm talking about a write-only workload, because reads are easy. What's coming down the road? Here in 2020 we will see an implementation of erasure coding, and more clever things for our long-distance replication story, where we can keep multiple copies in a far-away data center and send the data only once over the long-distance link.
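Before moving on from DRBD: as a rough illustration of the multi-node setup described above, here is a minimal sketch, assuming DRBD 9 configuration syntax, of a three-node resource where two nodes hold replicas and a third participates diskless. The host names, addresses, backing LV, and resource name are all made up for illustration; the DRBD user's guide is the authoritative reference for the syntax.

```python
import pathlib
import socket
import subprocess

# Hypothetical three-node DRBD 9 resource: node-a and node-b hold replicas,
# node-c is diskless and only accesses the data over the network.
RES_CONF = """
resource r0 {
    device    /dev/drbd0;
    meta-disk internal;

    on node-a { disk /dev/vg0/data; address 10.0.0.1:7789; node-id 0; }
    on node-b { disk /dev/vg0/data; address 10.0.0.2:7789; node-id 1; }
    on node-c { disk none;          address 10.0.0.3:7789; node-id 2; }

    connection-mesh { hosts node-a node-b node-c; }
}
"""

pathlib.Path("/etc/drbd.d/r0.res").write_text(RES_CONF)

diskful_nodes = {"node-a", "node-b"}
if socket.gethostname() in diskful_nodes:
    # Only nodes with a backing disk need DRBD metadata created.
    subprocess.run(["drbdadm", "create-md", "r0"], check=True)

# Every node, diskless or not, brings the resource up.
subprocess.run(["drbdadm", "up", "r0"], check=True)
```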
Okay, so far I was speaking only about these building blocks. These building blocks are great because we can combine them on the data plane. What I mean by that is that you can use the RAID 5 code below LVM, you can stack encryption on top of it, you can stack, let's say, de-duplication below it. But the problem so far was that all these tools, or many of them, bring their own management tool. So they're compatible on the data plane, but incompatible on the control plane.

And with LINSTOR, what we created is a distributed application designed to control all these building blocks for you, to build the storage stacks you need for your requirements on a bunch of nodes, and at the top it offers you a REST API where you can request the block storage devices you need for your current requirement. In this context, of course, our connector to Kubernetes is the most relevant one, but it also has connectors to OpenStack and other virtualization systems. Nearly all of what we do is open source; everything I have spoken about so far is open source, most of it under the GPL license, some parts under the Apache license.

Let me explain by example how this LINSTOR-DRBD combination can work, and let's look at a hyperconverged example. In the context of CNCF, the nodes here would be Kubernetes nodes, and what's labeled as a VM here is a container, right? We want to have persistent volumes for our containers, and as the name persistent implies, they should be resilient against node failures. Let's say we decided on a replication policy of two. Then we have here the orange container, and its persistent volume is located on the same machine and on some other machine. We have a second container, the black one; its persistent volume has one replica on the same machine and a second replica somewhere else. Obviously this hyperconverged architecture gives us a few advantages: we can fulfill read requests locally. If we get a read from this container, we can read it off the local replica; we don't need to send anything over the network, and we don't add network latency to fulfill the read request. So this is really made for scenarios where you have high-performance storage devices like SSDs, NVMe SSDs, or even persistent memory in your physical servers.

Now let's say a container is live-migrated, or offline-migrated, whatever; it's moved to another Kubernetes node. The storage might be big, you know, terabytes, so the storage stays where it was, and we're no longer in the optimal state: we now also have to ship read requests over the network. But it takes only a single LINSTOR command, or a time-triggered policy, and LINSTOR will allocate a third replica on the node where the workload now is. The DRBD component in the data path starts a full sync, copying over all the data, and when that is finished, LINSTOR looks at the policies, finds out it only needs two replicas of this dataset, and removes the now-superfluous one. But I want to stress that this storage-follows-workload migration is not automatic, because I believe it shouldn't be automatic: only you can decide when is the best time to put this stress on your network, and it shouldn't happen immediately after the workload has moved.
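A rough sketch of that workflow with the linstor client, driven from Python, is shown below. The node names, storage pool name, volume size, and resource name are made-up examples, and the exact client flags can vary between LINSTOR versions; this follows the general shape of the documented commands rather than quoting the talk.

```python
import subprocess

def linstor(*args):
    # Thin wrapper around the linstor client; names below are hypothetical.
    subprocess.run(("linstor",) + args, check=True)

# Register a thin-LVM-backed storage pool on each node (assumes vg0/thinpool exists).
for node in ("kube-1", "kube-2", "kube-3"):
    linstor("storage-pool", "create", "lvmthin", node, "pool1", "vg0/thinpool")

# Define a 50 GiB volume and let LINSTOR place two replicas for us.
linstor("resource-definition", "create", "pvc-demo")
linstor("volume-definition", "create", "pvc-demo", "50G")
linstor("resource", "create", "pvc-demo", "--auto-place", "2")

# Later, after the workload moved to kube-3: add a local replica there,
# let DRBD do its full sync, then drop one of the remote replicas.
linstor("resource", "create", "kube-3", "pvc-demo", "--storage-pool", "pool1")
linstor("resource", "delete", "kube-1", "pvc-demo")
```

The last two commands are exactly the "single LINSTOR command" the talk mentions; in Kubernetes the same request would normally come from the CSI driver or a policy rather than being typed by hand.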
Then let's look at the architecture of LINSTOR for a second. LINSTOR contains two main parts. One is the LINSTOR satellite, which you could also call a node agent. It's a stateless component that's installed on all the machines taking part, or, to express it in Kubernetes language, you just run the stateless LINSTOR satellite container on all the nodes taking part. The other part is the LINSTOR controller, which is stateful; I will come to more details in a second. It reaches out to all its satellites and distributes the configuration to them, only the parts each satellite needs. And it makes the central decisions, let's put it that way. In order to have it highly available, you usually have one or two standbys of it, and it gets switched over if the active one fails. Until now, the LINSTOR controller had an embedded SQL database, but we are just in the process of also offering an etcd back end, and with that it becomes more integrated into the Kubernetes world. On top of that there is also a client; the client uses the REST API, and our CSI driver, of course, also uses the REST API to interact with the system. Maybe what I should add at this point: the controller is only necessary for, let's say, control operations, like creating new volumes, taking snapshots, or resizing volumes. As soon as a volume is established, the controller no longer participates in its operation. That means we can stop the controller, we can upgrade the controller, and all of your workloads that depend on these persistent volumes simply continue to run, because the controller just establishes the data path, and then the data path is independent of it.

What type of problems can LINSTOR help you solve? Without reading the slide now, just follow the concept. Let's say you have many nodes, and let's say they are blades sitting in chassis, and these blade chassis are located in racks. Then we realize that a blade chassis is a failure domain: if a blade chassis loses its power supply, all the blades in it fail at the same time. So maybe one part of our placement policy is to say, okay, we want multiple replicas, but they shouldn't be in the same chassis. Then let's say we have multiple racks with powerful top-of-rack switches, but the bandwidth between the racks might become a bottleneck. So another part of our policy might be that the two replicas should be in different chassis, but in the same rack. Requirements like that you can express in LINSTOR, in its policies, or, to use the LINSTOR word, in its resource groups.
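A rough sketch of expressing that "different chassis, same rack" policy follows, under the assumption that nodes carry auxiliary properties describing where they sit and that a resource group constrains placement on those properties. The node names, property keys, group name, and sizes are illustrative, and the exact flags may differ between LINSTOR releases; treat this as the shape of the idea, not a copy of the documentation.

```python
import subprocess

def linstor(*args):
    subprocess.run(("linstor",) + args, check=True)

# Tag each node with where it physically lives (hypothetical aux properties).
topology = {
    "kube-1": ("chassis-a", "rack-1"),
    "kube-2": ("chassis-b", "rack-1"),
    "kube-3": ("chassis-c", "rack-2"),
}
for node, (chassis, rack) in topology.items():
    linstor("node", "set-property", node, "Aux/chassis", chassis)
    linstor("node", "set-property", node, "Aux/rack", rack)

# A resource group expressing "two replicas, different chassis, same rack".
linstor("resource-group", "create", "rg-rack-local",
        "--place-count", "2",
        "--replicas-on-different", "Aux/chassis",
        "--replicas-on-same", "Aux/rack")

# Volumes spawned from this group inherit the placement policy.
linstor("resource-group", "spawn-resources", "rg-rack-local", "pvc-demo2", "20G")
```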
LINSTOR can also help you select the right paths over the network, or use redundant paths, for the replication and for accessing the data; there's a lot to it. Then the connectors; in our context only the CSI driver is relevant. The CSI driver is targeted at the recent Kubernetes releases, I think 1.15, and we are currently working on getting it to also cover older Kubernetes releases, like the one in OpenShift 4 and the one in OpenShift 3.5, and all these details. I think I shouldn't touch the other connectors. The early feedback we got from users is that, when you compare this LINSTOR-DRBD solution to the other solutions we are now competing with, it's pretty hard to install; that comes from our background. So we started an effort to create an operator for LINSTOR, and the ultimate goal is to enable you to deploy a full LINSTOR solution using a single YAML file. We are doing that with a partner company, DaoCloud; follow our work either on the website piraeus.io or on the GitHub project page. When you go there, keep in mind this is really work in progress: if you know how to do it, it already works, but we know we still lack proper documentation, and we are working on properly documenting this part as we speak.

Then you might ask who is using all that. I have a bunch of customers, but they are medium-sized companies located in Europe or Japan; you wouldn't recognize the names. There is one name that really stands out, and that's Intel. As you might know, Intel is currently in the process of introducing their PMEM product into the market; it's called Intel Optane DC Persistent Memory. And the people at Intel understand that bringing a completely new storage type into the market is hard for the market, because not everybody understands how to use it. So they decided to build an integrated product that ships by the rack and sells by the rack. It has, I think, around 20 compute nodes and two or four storage nodes. The compute nodes are equipped with this PMEM, the storage nodes with QLC SSDs. They use Red Hat Enterprise Linux as the operating system and OpenShift as the Kubernetes distribution, and they use LINSTOR to orchestrate all this storage: the local PMEM and the SSDs in the storage nodes.

So, time for a short summary. LINSTOR uses existing storage building blocks. We are standing on the shoulders of giants, leveraging the many man-years of development that are in all the parts we're using, be it LVM, be it ZFS, be it the encryption layer, be it de-duplication, the caching layers, DRBD, NVMe over Fabrics. We're not reinventing the wheel, because all these technologies are great and should be reused. And LINSTOR itself is a reusable component, not tailor-made only for the Kubernetes world; it can also be used in the OpenStack world and other environments. That is what makes it stand out, and it's all open source. Okay, that was it for my regular slide deck. Now let's hope we can get through some interesting questions.

Very good. Philipp, that was a great presentation, that was fantastic. We've got some kudos coming through the chat already. So, as Philipp said, this is your opportunity to pop in your questions. It's also your opportunity to maybe razz Philipp like I do; I've got to say it's much fun. So bring your questions. We do have one or two lined up. Philipp, one of the questions that came through, from Mr. Agarwal, is whether you could also walk through an AWS example with a multi-availability-zone Kubernetes implementation. I think this probably goes back a few slides, to your other style of deployment.

Yeah, okay. No small ask. So I don't have a perfect slide for that in my deck right now, but from the point of view of the technologies we are developing: you would use a bunch of storage nodes in one availability zone and a bunch of storage nodes in another availability zone. Probably the storage you're using there is ephemeral storage, so that gives you the requirement to mirror synchronously within the availability zone, let's say two-way synchronous; you assume Amazon will only kill one of your instances and not two at the same time. And then, since the network connection to the other availability zone might not be perfect, maybe you do the long-distance replication to the second availability zone asynchronously. And all of this is part of the policy we would give to LINSTOR. In other words, we would have one LINSTOR cluster spanning all these nodes, and your policy says: my storage class, call it resilient-storage, has two synchronous copies in one availability zone and one asynchronous copy in another availability zone. I think that's the best I can do in giving a verbal description, and if I had known the question beforehand, I would have drawn a nice illustration; sorry for that.
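On the data plane, "synchronous within the AZ, asynchronous across AZs" maps onto DRBD's replication protocols: protocol C is synchronous, protocol A is asynchronous. The sketch below, assuming DRBD 9 per-connection settings and entirely made-up host names and addresses, shows the idea; in practice LINSTOR would generate and distribute a configuration like this (possibly together with DRBD Proxy for the long-distance link) rather than it being written by hand.

```python
# Illustrative DRBD 9 resource config: az1-a and az1-b replicate synchronously,
# az2-a in the other availability zone receives an asynchronous copy.
DRBD_MULTI_AZ = """
resource r1 {
    device    /dev/drbd1;
    disk      /dev/vg0/data;
    meta-disk internal;

    on az1-a { address 10.1.0.1:7790; node-id 0; }
    on az1-b { address 10.1.0.2:7790; node-id 1; }
    on az2-a { address 10.2.0.1:7790; node-id 2; }

    connection {            # inside the first availability zone: synchronous
        host az1-a; host az1-b;
        net { protocol C; }
    }
    connection {            # across availability zones: asynchronous
        host az1-a; host az2-a;
        net { protocol A; }
    }
    connection {
        host az1-b; host az2-a;
        net { protocol A; }
    }
}
"""
print(DRBD_MULTI_AZ)  # normally this would be written to /etc/drbd.d/ by LINSTOR
```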
Yeah, no, at least you didn't use your fingers. I tend to... Very good. That's a good question. So if you have questions for Philipp, please bring them through. We've got another question that just popped in: why would I use this solution instead of any other storage project that exists for Kubernetes?

Yeah, again, I don't have a slide prepared for that, but there are a few competitors that are open source. When I read the blogs on the internet, it seems that our open-source competitors create higher overhead in their software implementations than we do. And I read in those blogs that there is a closed-source competitor with performance comparable to ours, but that's proprietary. So I would say you should use our stuff because it's open source and it's the fastest among the open-source players.

Nice, you had me at open source. Very good, and thank you very much, Greg, for that question. All right, Philipp, we've got another one lined up here for you, from Mr. Gonzalez. The question is: can we use LINSTOR in GKE, for example? And what kind of volumes does LINSTOR need to work on different cloud providers?

Yes, using LINSTOR in GKE is definitely possible. With this Piraeus project I mentioned, we are now containerizing everything, and the most interesting part, of course, is the kernel driver, the DRBD kernel driver. For these cloud deployments we now have a kernel module loader that runs once on the cloud instances, finds out what kernel is running there, compiles the kernel driver for it, and then loads it into the kernel. And what kind of volumes do you need there? You just use the, let's say, ephemeral volume type, I don't have the Google name for it at the top of my head, because the resiliency of your persistent volumes comes from the LINSTOR-DRBD system. So the short answer is yes, you can use it on GKE.

Nice, very good. All right, we're hitting the bottom of our open questions; I think Philipp is knocking them out as fast as you bring them up. So I'll make another call for questions and give people a few minutes. Okay, Lee, if we run out of questions, I have a few appendix slides we can go into. That'll give people a minute or two to generate some questions, so yeah, we've got some time. Hello, audience. Okay, so come up with some questions, otherwise you have to see my appendix slides. Yeah, no, none just yet. It looks like people want more, they want more slides. Okay, here we go.

So in the standard slide deck I had the hyperconverged example. You can also use it disaggregated, but why do we show the hyperconverged version in the standard deck? The reason is the beauty of it, right? You have one type of node, and it does both functions: compute and storage. And I need to add that what we do, this LINSTOR-DRBD stack, is really very well suited for hyperconvergence, because its memory consumption is constant and deterministic. So you know upfront: let's say I have a hyperconverged node with 100 or 200 terabytes of storage. I can calculate upfront how much memory DRBD will use at maximum for that storage, and then you know how much memory is left for your workloads, for your containers.
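To illustrate why that upfront calculation is possible: DRBD's per-volume memory footprint is dominated by its resync bitmap, commonly quoted as roughly 32 MiB of RAM per TiB of replicated storage per peer (one bit per 4 KiB block). Treat that figure as an approximation from the DRBD documentation rather than a number stated in the talk; the node size and peer count below are just example values.

```python
# Rough upper bound on DRBD bitmap memory for a hyperconverged node.
BITS_PER_BYTE = 8
BLOCK_SIZE = 4 * 1024              # DRBD tracks dirty blocks at 4 KiB granularity
TIB = 1024 ** 4

def drbd_bitmap_bytes(storage_bytes: int, peers: int) -> int:
    """One bitmap bit per 4 KiB block, kept per peer."""
    return storage_bytes // BLOCK_SIZE // BITS_PER_BYTE * peers

storage = 200 * TIB                # example: a 200 TiB hyperconverged node
peers = 2                          # example: each volume replicated to two peers
mem = drbd_bitmap_bytes(storage, peers)
print(f"~{mem / (1024 ** 3):.1f} GiB of RAM reserved for DRBD bitmaps")
# -> ~12.5 GiB; whatever remains on the node can be budgeted for containers.
```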
So that makes it very well suited for being used hyperconverged. But if you're still not convinced, you can also use it disaggregated. Disaggregated means you have one type of node where you run your workloads and another type of node used only for storage. And, going through these example slides quickly, it's pretty obvious: we can mask the failure of a Kubernetes node, in which case Kubernetes will just restart the container somewhere else and it can access the same virtual volumes. If a storage node fails, well, no problem, we stored the data redundantly, so we mask the failure of the storage nodes too. That was easy. Let's jump over these; oh, we have the other ones. Okay.

Another topic that is interesting for some users is NVMe over Fabrics. NVMe over Fabrics is a transport protocol with interesting properties. NVMe started as a command set to access storage; in a way, it is replacing SCSI. The old SCSI standard had a physical part, you know, the old SCSI cable; the NVMe standard came with a physical part that was the PCI Express bus. Now, with NVMe over Fabrics, we are actually shipping the commands over a network transport. In certain use cases, or for certain requirements, it also makes sense to use NVMe over Fabrics instead of DRBD. I'm just pointing out that this is possible; it's interesting for certain use cases and requested by some of our bigger clients.
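For a sense of what NVMe over Fabrics looks like from the consuming side, here is a minimal sketch using the standard nvme-cli tool from Python. The target address, port, and NQN are placeholders, and setting up the target side (for example via the kernel's nvmet target with nvmetcli, or via LINSTOR) is not shown.

```python
import subprocess

TARGET_ADDR = "10.0.0.10"                          # placeholder storage-node address
TARGET_NQN = "nqn.2020-01.io.example:pvc-demo"     # placeholder subsystem NQN

# Discover subsystems exported by the target over TCP ...
subprocess.run(["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420"],
               check=True)

# ... and connect to one; the volume then shows up locally as /dev/nvmeXnY.
subprocess.run(["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420",
                "-n", TARGET_NQN], check=True)

# Disconnect again when the workload goes away.
subprocess.run(["nvme", "disconnect", "-n", TARGET_NQN], check=True)
```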
And just jumping topics quickly: the last appendix slides show you something not at all related to Kubernetes, and that's DRBD on Windows. We started an effort to port our DRBD kernel driver, which, remember, was written for the Linux kernel, to the Windows kernel. Again, it's driven by some users who want to have that, and there is a public beta available. So if you really run some Windows workloads, you can download it from our homepage and try it out. The interesting thing is that it's completely wire-protocol compatible with the Linux version. What these users are actually looking into is running their Windows workloads with the storage being served from their Linux storage nodes, and that works absolutely fine. It's not yet production-ready, but we welcome all users who provide feedback on it. Okay, so now I'm really through every slide I have.

Very good. This has been a great presentation, Philipp, and there's been a lot of engagement through the Q&A, though I think we dried the questions up earlier. Actually, last call for questions; this is your last chance to shoot one out, right, Philipp? Okay, I'll keep my eyes peeled. But I do want to say, as we go to wrap up: Philipp, thank you so much for spending the time with us today and sharing openly. I will note that this webinar recording and Philipp's slides will be online later today. And we did have another question come through, just in time. The question is: what about OpenShift? That is the question.

Yeah, perfect question. So all of this works with OpenShift, and I think I mentioned it earlier. OpenShift is a distribution of Kubernetes and a few bits around it; I'm not a huge expert on it. But the challenge for us is that OpenShift ships a specific Kubernetes version, and that is not always the latest and greatest version of Kubernetes. So the challenge for us is that our CSI driver and our operator then have to be compatible with exactly what is in those OpenShift releases. And, I mean, we are not all living in open-source heaven; we also have the requirement that our customers pay the bills so that we can pay the payroll of our developers. That means we support OpenShift all the way: the new one, the 4.0, I think, and the super new one, 4.1, and also the 3.5.1. But there we are still bringing the documentation into shape; I got the news from the developers that it works, we just need to write down how you can use it. So, summing all that up: OpenShift, yes.

Oh, very good. Okay, well, all right, I think we took out all the questions, and Philipp, we even made it through your extended slides. So thanks for hanging out with us as long as you have. Again, just a quick reminder: the recording and the slides will be up on the CNCF webinars page later today. And with that, we look forward to seeing all of you at a future CNCF webinar. Have a great day. See you all later. Thank you, Lee. Goodbye, everyone. Bye.