So yeah, welcome everyone. Today we will talk about how to improve Longhorn based on SPDK. I'm David Ko from SUSE. And I'm Keith Lucas from Oracle Labs. Yep, so let's just get started. Probably someone in this room has not used Longhorn before, so in the first part we will talk about what Longhorn is, how Longhorn currently works, and what challenges Longhorn faces right now. Next, based on those challenges, we will talk about how we leverage SPDK going forward, how it really works, and what benefits you will get from SPDK. And lastly, we will share a benchmark we have right now, and also some areas we want to improve in the future.

Okay, so Longhorn is a CNCF incubating project focused on persistent volume storage. It's highly available, software-defined persistent block storage based on Kubernetes and built for Kubernetes, so everything we do just leverages Kubernetes. It's quite lightweight, reliable, and easy to use. I say that based on the user experience, and because we don't have any external dependencies: for example, we don't need any external database; we just leverage Kubernetes API resources. So deploying Longhorn is quite simple. We support persistent volumes with the different access modes, ReadWriteMany and ReadWriteOnce, and both volume modes, block device or filesystem. Of course, if you use Longhorn you will ask when ReadWriteMany will become generally available; you can check tomorrow, we have another maintainer-track session about the Longhorn roadmap.

Longhorn is also storage agnostic, meaning it is easy to deploy on top of the host filesystem. Any filesystem that supports sparse files will do, because Longhorn uses sparse files for its thin-provisioning capability. I should mention here that ext4 and XFS are actually verified by the Longhorn team, so you can use those directly. And it's not just in-cluster: we have in-cluster snapshots, but we also support external backup and restore. The backup targets we support right now are NFS and S3-compatible targets. And as I said, we use Kubernetes resources, so the Kubernetes design patterns we adopt are basically the controller pattern and custom resource definitions. It's also open source.

How Longhorn works right now can be described in five parts I want to briefly introduce: the volume, the volume lifecycle, data placement, deployment, and the control plane. A volume is composed of the volume frontend and the volume itself, which we call the engine in Longhorn terminology, while data placement concerns the volume replicas. For the volume lifecycle, we rely 100% on the CSI protocol, so it is driven by the Kubernetes built-in resource, the PVC, and every subsequent operation just follows the CSI protocol. For data placement, all the data is located on Longhorn disks, and a Longhorn disk actually sits on top of the host filesystem, as I mentioned. You can create different Longhorn disks per node; you can even create a mount point for a different partition and register it as a Longhorn disk.
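Since the thin provisioning mentioned above rests entirely on the host filesystem's sparse-file support, here is a minimal C sketch of the idea (an illustration only, with a hypothetical file name, not Longhorn's actual code): the backing file is created at its full logical size, but the filesystem only allocates blocks where data is actually written.

```c
/* Minimal sketch: thin provisioning with a sparse file.
 * Illustrative only; file name and sizes are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("replica-volume.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Logical size: 10 GiB. No data blocks are allocated yet. */
    if (ftruncate(fd, 10LL * 1024 * 1024 * 1024) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* Writing at an offset allocates only the blocks that back it. */
    const char buf[4096] = "hello";
    if (pwrite(fd, buf, sizeof(buf), 4096) < 0) { perror("pwrite"); return 1; }

    /* On ext4/XFS, `du` now reports only the allocated blocks,
     * while `ls -l` reports the full 10 GiB logical size. */
    close(fd);
    return 0;
}
```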
For the deployment of a volume, you can see on the right-hand side of the diagram that the volume is composed of the engine and the replicas, and each one is a segregated component, independent from the others. If one volume has a problem, it will not impact the others; this is what we decided on, so your volumes are much safer and do not affect each other. As for the control plane: Longhorn has a control plane and a data plane, and the control plane is based on Kubernetes, which is very straightforward.

Okay, let's dive a little deeper into the Longhorn engine, the replicas, and the data path, because today we will talk about the data path. In the current model we use sparse files and build on the iSCSI protocol: the Longhorn volume is bound via open-iscsi, which we use as the client to make it work on the host side. The volume engine is like the volume controller; there is an engine process along with the tgt target server, so that data can pass through the tgt server and then down to the engine. The engine has two major parts: one is the TCP-based data server, and the other is the volume controller, a gRPC server that accepts the different commands from upstream. The last part is the replicas, which hold the volume data. The engine passes the commands, and even the data IO, downstream to the replicas, saving to the local replica and the remote replicas. All the operations happen at the replica level, like snapshot, rebuild, coalesce/merge, prune, purge, etc.

So this diagram shows what I mean by the data path. Simply put, you have a workload, the user application, which starts to do some IO. The IO goes into your Longhorn volume, which has a filesystem on top. Then it goes into the iSCSI block device exposed by Longhorn, through iscsid, the host-side daemon that makes it possible to write to the Longhorn volume. From there it runs through tgt, which uses the liblonghorn library to communicate with each engine. The engine, as I said, is the volume controller, and it makes sure the IO is passed downstream to the Longhorn replicas. This is how Longhorn currently works.

So what are the challenges Longhorn has right now? The first one is too many components, I mean too many components along the data path. You can see the user needs to set up iscsid, and Longhorn creates a tgt process and an engine process for each volume, so it is a little complicated from the data path point of view. The second one follows from that: with so many components, we pay an extra cost for the communication among them, especially on the data path, where the IO needs to travel through the different parts, like tgt and the engine here. The one I want to emphasize is the IO model limitation: right now we are based on sparse files with synchronous, blocking reads and writes, so there are real challenges from a performance perspective, and if you want to leverage asynchronous IO, that is quite difficult with the current architecture. And of course, different languages and different integrations, especially integrating with system-level components, each take their own effort. These are the primary challenges Longhorn has right now, so we started thinking about how to move forward and create a next-generation data plane for Longhorn.
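To make that IO model limitation concrete, here is a hedged C sketch of the kind of synchronous replica read described above (the real Longhorn engine and replica are written in Go; this is only an illustration, with a hypothetical file name): the calling thread blocks in pread until the kernel completes the request, so concurrency can only come from adding more threads or goroutines.

```c
/* Sketch of a blocking read path against a sparse replica file.
 * Hypothetical illustration of the synchronous IO model. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* The calling thread is stuck here until the disk responds;
 * to serve N requests at once, you need N threads. */
ssize_t replica_read(int fd, void *buf, size_t len, off_t offset) {
    return pread(fd, buf, len, offset); /* blocks */
}

int main(void) {
    int fd = open("replica-volume.img", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    if (replica_read(fd, buf, sizeof(buf), 0) < 0) {
        perror("pread");
        return 1;
    }
    close(fd);
    return 0;
}
```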
So what is SPDK? SPDK is the Storage Performance Development Kit. It's used in a lot of high-performance cloud applications, and one of the things it is based on is the Data Plane Development Kit, DPDK, a toolkit used by cloud service providers to run network device drivers in user space and get better performance than going through the kernel. SPDK includes DPDK in its source code and uses some of the same methodology to provide better performance for storage applications. One of the things it offers is a generic block device abstraction layer, with several different implementations and a relatively easy way to implement your own.

We had reached a dead end with the current implementation of Longhorn, where we have multiple Go routines each doing blocking IO, and that didn't give us any easy way to adopt an asynchronous methodology to improve performance. SPDK already has generic block devices that use frameworks like libaio or io_uring to provide better performance, which allows us to basically plug and play, try those out, and test them far more quickly than rewriting the Longhorn engine as it stands. And as we were discussing earlier, Longhorn currently provides iSCSI by using tgt, an open-source project that we modified to interface with the Longhorn engine. SPDK supports that, as well as a newer protocol: just as we no longer use only SATA drives but NVMe drives, NVMe over Fabrics is an improved network protocol for network-attached volumes.

SPDK also has a feature called logical volumes, which allows us to store data in a way that's equivalent to how Longhorn currently works with its sparse-file methodology, so we can have feature equivalence between the two. And SPDK is fundamentally designed for asynchronous programming, which for the most part performs better, especially under intense IO. I mean, Longhorn currently provides relatively good performance, but we want to make it better. One more thing: at each level of the current data path we allocate memory and then deallocate it. In tgt we malloc it and then free it, and in the Go portion we allocate it and then let the garbage collector free it at some random time. SPDK has a memory model that consolidates our memory usage and makes it more efficient.

So, I was talking about logical volumes and how they are equivalent to the sparse provisioning that Longhorn uses. Longhorn uses sparse files, and we have a hierarchy of sparse files to find the data we want in a particular volume, with a separate directory per volume. The logical volume feature in SPDK does the same job, but without using the filesystem's sparse-file functionality. It is roughly the equivalent of a filesystem: it stores all of its data within one huge disk and manages it as if it were a filesystem, but it can do so more efficiently, because it doesn't have to support all the features of a filesystem, and SPDK is its only user, so it doesn't have to deal with the contention a normal filesystem has to handle.
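As a concrete picture of that hierarchy-of-sparse-files lookup, here is a hedged C sketch (the file names and the three-level chain are hypothetical; Longhorn's real implementation is in Go, and the exact kernel interface it uses may differ): to read a block, we probe each snapshot file from newest to oldest with lseek(SEEK_DATA), one kernel call per layer, until we find one that actually has data at that offset. SPDK's logical volumes perform the equivalent lookup entirely in user space.

```c
/* Sketch: find which snapshot in the chain holds data for a block.
 * Newest snapshot first; fall through to older ones on holes.
 * File names and chain layout are hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if `fd` has allocated data at `offset`, 0 if it is a hole. */
static int has_data_at(int fd, off_t offset) {
    /* One kernel call per layer; SPDK lvols do this lookup in user space. */
    return lseek(fd, offset, SEEK_DATA) == offset;
}

int main(void) {
    const char *chain[] = { "volume-head.img", "snap-2.img", "snap-1.img" };
    off_t block = 2 * 4096; /* "block 2" from the diagram, 4 KiB blocks */

    for (int i = 0; i < 3; i++) {
        int fd = open(chain[i], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (has_data_at(fd, block)) {
            printf("read block from %s\n", chain[i]);
            close(fd);
            return 0;
        }
        close(fd);
    }
    printf("block is unallocated: return zeroes\n");
    return 0;
}
```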
So I took the picture of how Longhorn sparse provisioning works from the Longhorn website, and a picture of SPDK's logical volumes from SPDK's website, and I think both of these show the methodology the two systems use for determining how to read or write a block. We have a hierarchy of snapshots, and all of that data can be sparse. In both cases, if I wanted, for example, to read block 2, I have to check whether that data is available at the top level. So I'd query the file, for example with an ioctl, to see what's present at that particular block and whether it is sparse or not; if it's sparse, an empty location in the file, I go to the next snapshot, and so forth. In this diagram, for block 2 I have to go through three layers before I actually find the data, in the oldest snapshot. SPDK has the same functionality, except it implements the ability to query the sparse data internally, so we don't have to call into the kernel to get that sparseness information; we can just traverse the hierarchy and find our data without that overhead. And all of these snapshots in SPDK live inside one block device with no filesystem on it, whereas with Longhorn each snapshot is a separate file within the filesystem that we're traversing.

One of the things SPDK makes easy is programming asynchronously, which for the most part allows you to achieve better performance. With this methodology, IO functions don't block: if you perform a read or a write, it returns immediately; you register an event, a main loop continually checks its status, and you are notified via a callback when it's complete. This is basically the model used by libaio and io_uring, the more modern, more performant asynchronous ways of doing IO, and something we wanted to emulate and try out; SPDK lets you do that easily.

So we're going to redesign the data plane part of Longhorn to work with SPDK. The first point is that we're switching to NVMe over Fabrics for Longhorn's block device instead of iSCSI. Linux has mature support for NVMe over Fabrics, and the kernel has a better interface for it than for iSCSI. For example, iSCSI has a daemon process, iscsid, that handles some of the TCP connection portions of the protocol, whereas the kernel supports NVMe over Fabrics directly, with sysfs support, so there is no extra process running; the kernel itself speaks the NVMe over Fabrics protocol.

Then we're going to re-implement the Longhorn engine as a custom block device within SPDK. This will be equivalent to the Longhorn engine we showed earlier: each Longhorn volume will have a set of replicas, which will be either local, living within the same SPDK process, or remote, on other nodes in the Kubernetes cluster. It's equivalent, except that to communicate with the remote nodes we'll use NVMe over Fabrics again. Right now we use our own custom protocol to talk to our replicas; instead, we'll be using a protocol that we know is relatively efficient and probably better than the one we designed ourselves.
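Since the new engine will live inside SPDK's bdev layer, it will follow the submit-and-callback model just described. Here is a condensed, hedged sketch of an asynchronous read using SPDK's public bdev API (the function names are real SPDK interfaces; the surrounding application setup, bdev open, and IO channel acquisition are elided, so this is a fragment rather than a complete program):

```c
/* Condensed sketch of SPDK's asynchronous bdev read model.
 * Assumes it runs inside an SPDK application/reactor context
 * (spdk_app_start, bdev open, and io_channel setup elided). */
#include "spdk/bdev.h"
#include "spdk/log.h"

/* Completion callback: invoked by the reactor when the IO finishes. */
static void read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    if (!success) {
        SPDK_ERRLOG("read failed\n");
    }
    /* Always release the IO descriptor back to SPDK. */
    spdk_bdev_free_io(bdev_io);
}

static void submit_read(struct spdk_bdev_desc *desc,
                        struct spdk_io_channel *ch, void *buf)
{
    /* Non-blocking: returns immediately after queueing the IO;
     * read_done fires later from the poller loop. */
    int rc = spdk_bdev_read(desc, ch, buf,
                            0 /* offset */, 4096 /* length */,
                            read_done, NULL);
    if (rc != 0) {
        SPDK_ERRLOG("failed to submit read: %d\n", rc);
    }
}
```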
Each write operation will be pretty much equivalent to how Longhorn works currently: we distribute every write to all the replicas and verify that they're complete before returning write-complete to the application. For reads, since we have multiple replicas, we only need to use one, and we'll select it round-robin, going from one replica to the next for each read operation to spread the load across the nodes. We'll also support snapshots and rebuilding when a replica is added. Snapshots are the functionality built into Longhorn that gives an atomic point where we store the state of the volume, and rebuilding is what happens when you add a new replica while the volume is already up, so the existing data has to be copied to the node you are adding.

One of the departures we're making with SPDK is that we're going to have only one SPDK process per Kubernetes node, and each SPDK process will handle multiple block devices. With the current Longhorn, there is one Longhorn engine per process, and tgtd is the thing that handles all the volumes, so we're going to be generally reducing the number of processes, with SPDK handling more of the scheduling instead of relying on blocking IO.

Okay, so I have a diagram of what I just explained. This is an example of a volume with three replicas using SPDK. We have the same picture as before: the user application interfaces with the filesystem driver in the kernel, and that kernel filesystem driver uses the block device, which in this case is an NVMe over Fabrics block device. The kernel communicates directly with our SPDK process, which is called spdk_tgt, where we have our special Longhorn block device, or bdev. This one has one local replica and two remote replicas, and within each of those replicas we have a logical volume store for our particular volume.
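As a rough illustration of that write path, here is a minimal, self-contained C sketch (all names and types are hypothetical; in the real design the completions would arrive from SPDK callbacks rather than a loop in main): the write is fanned out to every replica, a counter tracks the replicas still in flight, and the upstream write is acknowledged only once all of them have completed.

```c
/* Sketch of fan-out writes: ack the application only after
 * every replica completes. Names and types are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_REPLICAS 3

struct write_ctx {
    int remaining;   /* replicas still in flight */
    bool failed;     /* any replica reported an error */
};

/* Called once per replica as its write completes (e.g. from an
 * SPDK completion callback in the real implementation). */
static void replica_write_done(struct write_ctx *ctx, bool success) {
    if (!success) ctx->failed = true;
    if (--ctx->remaining == 0) {
        /* All replicas answered: now we can complete upstream. */
        printf(ctx->failed ? "write failed\n" : "write complete\n");
    }
}

int main(void) {
    struct write_ctx ctx = { .remaining = NUM_REPLICAS, .failed = false };

    /* Fan the write out to all replicas (submission elided),
     * then completions arrive asynchronously: */
    for (int i = 0; i < NUM_REPLICAS; i++) {
        replica_write_done(&ctx, true); /* simulate completions */
    }
    return 0;
}
```

A read, by contrast, would pick just one replica, e.g. replicas[counter++ % NUM_REPLICAS], which is the round-robin selection mentioned above.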
So now we're going to discuss some of the preliminary performance results we got from this new implementation of Longhorn. First off, this is the environment we used: it was basically a bare-metal server from a cloud service provider, and I used one of its SSDs, and those SSDs were SATA.

Okay, so our test methodology: we used the kbench utility, a program that Longhorn developed to test the performance of volumes. We measure IO operations per second, bandwidth, and latency using the fio command, a utility that was written to benchmark the Linux kernel's block IO implementation. It was developed by Jens Axboe, the author of the kernel's block layer and also of the newest IO methodology in Linux, io_uring. We ran two tests. In the first test, we compared a raw disk, the existing Longhorn, and Longhorn with SPDK on a single node; the main motivation was to see the overall impact of either implementation of Longhorn versus just using the disk. In the second, we ran a three-node scenario with both the existing Longhorn and SPDK.

So here are our results. I think the most interesting thing is that Longhorn with SPDK performs very similarly to just using the raw disk, in terms of both bandwidth and IOPS. It's closest in bandwidth; in IO operations per second it's relatively close, just a little bit lower. The latency is also very similar: about 20 to 25 microseconds of added latency over just using the disk, so we're achieving relatively good performance. And in all cases it is better than the existing Longhorn: the latency is much improved, the write bandwidth is improved compared to the current Longhorn, the read bandwidth is roughly equivalent, and the IOPS are very much improved. So, as a summary of what I just said: the bandwidth improves in all categories and is very similar to just using the disk, and the overall latency is about 25 to 30 microseconds, versus around 100 microseconds with the current Longhorn.

This is the three-node performance comparison; there is no raw-disk equivalent here because that's not really possible. In this case it's also very much improved compared to the current Longhorn: we have much better IO operations per second, especially for writes, and the bandwidth is also improved. If you think about it, the existing Longhorn is already doing over a gigabyte of data per second, but we're doing even better with SPDK. The latency is improved too, and in general it performs better. And compared to the single-node scenario, the three-node scenario performs better for reads, since it is roughly equivalent to RAID 1: in this networked setup, read operations can be served faster than by a single disk on a single node.

And I think we haven't even scratched the surface of what you can do with SPDK. One of the things SPDK is noted for is user-space NVMe support: if the nodes in our Kubernetes cluster actually used NVMe drives, we might be able to access them directly instead of going through the Linux kernel's driver. This is one of SPDK's claims to fame, equivalent to how DPDK uses network drivers in user space. We didn't do this initially because we wanted to support a wide scope of scenarios, any sort of environment, not only NVMe environments, but as everyone moves to actual NVMe drives instead of other formats, we can potentially investigate that. We can also try io_uring and other new technologies.
When I tried to use io_uring in SPDK at one point in time, it didn't work, but since Longhorn isn't the only one working on this, we can test it again as changes land upstream in the SPDK project. Another thing that is very dependent on the cluster environment is a feature called RDMA, remote direct memory access, which is a way to transport data without involving the CPU. There's a possibility of extending what we did so that NVMe over Fabrics uses RDMA, with special drivers in the Linux kernel, to transport data over the network more efficiently than the TCP approach we're currently using. RDMA could improve performance even more.

So, what's next for Longhorn? We developed this Longhorn SPDK application outside of the rest of Longhorn, the part that makes it work with Kubernetes, so we need to make a lot of changes to how it's deployed. The main thing that controls how things work within Longhorn, how it deploys all the pods and how the Longhorn engine is deployed on a Kubernetes cluster, is the Longhorn manager, and we need to change how it deploys Longhorn, because we're moving from multiple Longhorn engine processes per node to one SPDK process per node that handles all the volumes. We have a component called the instance manager that manages all these processes, and we need to refactor it, basically remove it, and just have the one SPDK process. And as David was mentioning, we have a lot of CRDs, and those CRDs reflect the current model, so we need to update them to reflect the new process model going forward. We're also going to remove the iSCSI component and tgt and transition to NVMe over Fabrics. Another thing: the Longhorn manager currently communicates with each engine process over gRPC, which SPDK does not use, so we need to develop a new mechanism for communicating with these Longhorn engine processes in order to integrate with the whole of Longhorn. And we need to implement some of the things Longhorn has currently, like backup and restore, to achieve full feature parity with the existing Longhorn before we can move forward. We're currently working on all of this, including a technology preview that will show Longhorn more fully integrated with SPDK. It's a significant change, but we're working on it.
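On that last point about a new communication mechanism: an SPDK process is normally controlled through a JSON-RPC interface over a Unix domain socket, by default /var/tmp/spdk.sock, and bdev_get_bdevs is one of its standard methods. Here is a hedged C sketch of issuing such a call; the framing and error handling are minimal and the response is printed rather than parsed, so treat it as an outline of the approach, not what Longhorn will actually ship.

```c
/* Sketch: query an SPDK process over its JSON-RPC Unix socket.
 * The default socket path and the bdev_get_bdevs method are real
 * SPDK conventions; everything else is a minimal illustration. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/var/tmp/spdk.sock", sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* List all block devices known to the SPDK process. */
    const char *req =
        "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"bdev_get_bdevs\"}";
    if (write(fd, req, strlen(req)) < 0) { perror("write"); return 1; }

    char resp[4096];
    ssize_t n = read(fd, resp, sizeof(resp) - 1);
    if (n > 0) { resp[n] = '\0'; printf("%s\n", resp); }

    close(fd);
    return 0;
}
```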
Are there any questions?

[Audience] What Longhorn release is that going to be featured in?

[David] So right now we have just shown the volume part; what we are working on now is the control plane feature parity and integration. So I would say the better timing is next year. Longhorn has a fairly regular release cadence: each year we have two feature releases. The coming release, 1.4.0, is at the end of this year, but that release will not have the SPDK preview. It will happen in 1.5, where we'll have a preview version, though maybe not with full feature parity; the end goal is next year. So that's the plan for SPDK.

[Audience] Thank you, David and Keith. So, new volumes are created using SPDK; sorry, would the old volumes, the current volumes, need to be converted to SPDK?

[Keith] They're not compatible right now, but I think we'll have the ability to convert between the old format and the new format. I think we have an initial effort toward that, and we might develop a means to do it automatically when you deploy the new version.

[David] So I would say it's not non-destructive yet, but we are trying, because upstream SPDK already has a lot of application integrations, for example spdk_dd, so we are thinking about how to integrate with those to make sure the sparse files can be migrated to SPDK logical volumes. It would be better to have that migration path, but I would say it still needs investigation.

Okay, thank you. Thank you, everyone.