Welcome everyone. I think this is the first time we've had an in-person meeting after such a long pandemic. In fact, the most difficult thing I found about this time was just finding the room, so as you can see we're slightly in a hurry; sorry that we started five minutes late. All right, so today our topic is Longhorn, and I'll be joined by my colleague Joshua to talk with you about the Longhorn project, give you an update, and see what's upcoming. Since you're here, I think most of you probably know that Longhorn is a CNCF project, and last October Longhorn was promoted to the incubating stage; we're working towards the graduated stage as well. Longhorn's mission is in fact very simple: we want to provide your Kubernetes clusters with a reliable storage solution. Longhorn itself is highly available distributed storage software built on top of Kubernetes and built for Kubernetes. If you want to find more information, always go to longhorn.io, which contains our knowledge base, our documentation, all the information you can get, and how to get in touch. So how is Longhorn different? There were a few design principles for Longhorn when we started working on this project. We aim for this storage project to, of course, be reliable, which is the most important thing about any storage project. Longhorn is designed to be crash-consistent, and we have also designed multiple layers of protection against data loss. Longhorn itself has built-in snapshots: you can always create in-cluster snapshots. In addition to that, Longhorn also provides a built-in backup mechanism for you to back up your volume to S3, external to the cluster. That backup is very efficient, it's incremental, and it's the second layer of protection you get for your volume. And the third layer is in fact Longhorn's design itself. It's very simple.
Even in the worst-case scenario, if you lose all your nodes, as long as one of the replica hard drives is still available, you can very easily extract the data from it. So reliability is the first principle we focus on, and the second one is usability. As you know, much of the software in the Kubernetes space follows the operator pattern and has a Helm chart installation, but a lot of it still requires you to go through a lot of configuration to make it usable. Longhorn is in fact one of the first storage solutions out there to enable one-click installation: you can just install Longhorn from our Helm chart or the Rancher app catalog, and with one click you'll have highly available distributed storage available in your Kubernetes cluster. Longhorn also provides the Longhorn UI, a polished user interface for any user, covering all the advanced functions Longhorn provides, like disaster recovery, snapshots, backups, disk tags, and more. The third thing is that, unlike many storage projects out there, we designed Longhorn to be very maintainable. The concepts are easy to understand, as you'll see later, and it's even easy to recover in the worst-case scenario, because you understand how it works. Longhorn also gives you the ability to upgrade without interrupting the workload, which means you don't need to schedule a big downtime window. All right, so let me talk a little bit about how Longhorn works. This is what we call the Longhorn engine, which is part of the Longhorn data plane. As you can see, we have three nodes, two of them with SSDs marked in black, which means they are data disks. All three of them, of course, also have root disks, which are the SSDs marked in gray, and each one has CPU and memory available.
If you've heard the term hyperconverged, this is in fact the hyperconverged deployment of Longhorn. You can see there are processes, containers called instance managers (engine instance managers and replica instance managers), running on each node. Once a pod requests a volume on the Kubernetes side, Longhorn creates two replicas on different nodes, because we ensure high availability, and then creates an engine that connects to those two replicas and serves the data of those two replicas to the volume inside the pod. This is a very simple design, but it's very efficient and very reliable, because when you scale to more pods with more volumes, Longhorn creates new engines and replicas for you to satisfy your storage needs. There are two advantages to this design. The first is that, as you can see, the data paths of the volumes are not intertwined. For example, if you take out an engine, the worst case is that it only impacts the volume using that engine; and if you take out a replica, another replica will take over, so there's no impact at all. The second is that the engine is always collocated with the consumer, the workload. So if node one goes down and Kubernetes reschedules the pod, for example to node three, even though no local replica exists there, Longhorn can still find the blue replica, the one for the blue volume, which still exists. It will create another engine, which is stateless by itself, connect it to the replica on node two, and restore operation of your volume. So it's a very flexible and reliable design.
It satisfies a lot of scenarios. For example, if you have EKS running across different AZs, an EBS volume is tied to one AZ, but Longhorn is able to schedule across AZs and ensure your data availability in case one AZ goes down, which EBS itself cannot do. Okay, so that was the overview of the Longhorn data plane, and this is a quick rundown of the manager part, which we call the management plane. The Longhorn manager is built on top of the Kubernetes cluster, and whenever the Kubernetes cluster has a request to create a new volume, it goes through the CSI interface to talk to the Longhorn CSI plugin, which in turn talks to the Longhorn manager through the Longhorn API. The Longhorn manager writes the metadata into the Kubernetes API server, using the now commonly used Kubernetes controller pattern to operate on those objects, and starts creating engines and replicas to satisfy the request. All the engines and replicas you see here are represented by Kubernetes API objects, custom resources. Another way to interact with the Longhorn manager and explore all the features is, of course, through the Longhorn UI, which also talks to the Longhorn manager via the Longhorn API. So this is a very simple approach, and now we're also exploring some other ways. In the next release we're going to have more integration with Kubernetes, to enable you to manually change an object's spec and have that change reflected in the Longhorn system. Currently such changes are gated by the Longhorn API, but that's going to be updated in the future. All right. A quick update on the open source community side of Longhorn: we're currently a CNCF incubating project, with 3.8 thousand GitHub stars, and the worldwide node count is currently at 53,000 plus.
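As a concrete illustration of that CSI flow, here is a minimal sketch of the kind of PersistentVolumeClaim a workload would submit; Kubernetes hands it to the Longhorn CSI plugin, which calls the Longhorn manager to create the volume. The `longhorn` storage class name is the default the chart installs; the claim name is hypothetical:

```yaml
# Sketch: a PVC that flows through the CSI interface to the Longhorn
# CSI plugin, which asks the Longhorn manager to create the volume,
# engine, and replica custom resources described above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-vol                 # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn     # default StorageClass shipped with Longhorn
  resources:
    requests:
      storage: 10Gi
```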
In fact, you can take a look at metrics.longhorn.io to see the real-time numbers. The way we collect this data is just a GET call that notifies Longhorn users when a new version is available; as a side effect, since we have the server running, we know roughly how many nodes are making that call. We also have around 2,000 users across the Slack channel on the CNCF workspace and the legacy Rancher user Slack channel. If you have any questions or requests for Longhorn, feel free to talk with us on Slack, and also file issues on GitHub. Okay, so the upcoming 1.3 release: we already have 1.3 RC1 out, I think as of a couple of days ago, and by the end of Q2 we're aiming to have the official 1.3 GA release. In this release we're going to include CSI snapshot support for Longhorn snapshots. Previously we had CSI snapshot support for Longhorn backups, which means whenever you created a CSI snapshot, Longhorn created a backup for you in the backup store, which, as you might know, is off-cluster. Now CSI snapshot support for Longhorn's in-cluster snapshots has been added. We're also starting to enable you to modify the custom resources using kubectl, so kubectl basically becomes the Longhorn CLI. And we're adding a storage network: when you use that in combination with the Multus plugin, you can make the storage traffic used by Longhorn go through a separate network, where you can provision more bandwidth. We've added other features too, like backing images and volume download, and secure communication between the control plane and the data plane. So that's the immediate upcoming release, and in fact what really excites me is what's coming in the future.
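To give a rough idea of what the new in-cluster CSI snapshot support looks like, a VolumeSnapshotClass along these lines should select it; the `type: snap` parameter and the class name reflect my understanding of the 1.3 interface, so check the release documentation for the exact values:

```yaml
# Sketch: VolumeSnapshotClass selecting Longhorn's in-cluster snapshot
# behavior rather than the off-cluster backup behavior.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snap            # hypothetical name
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap    # in-cluster snapshot; "bak" would target the backup store
```

A VolumeSnapshot referencing this class would then create a Longhorn snapshot on the replicas instead of a backup in S3.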
So in the release after that, a couple of releases out, we are looking to add trim support to reduce the actual volume size. One question we in fact get a lot is: if you delete files, the file system shrinks but the Longhorn volume doesn't. The reason is that Longhorn is by design block storage, so it needs to understand more of the file system protocol for that to happen. We're aiming to add that in the release after 1.3. Also, Longhorn system backup and restore will give you a failsafe: for example, in an upgrade scenario, you have something to fall back on in case something goes wrong. And the volume group feature will be there for your stateful workloads. For example, if you have an application running as a StatefulSet, all of its volumes can be grouped together for you to snapshot or back up as a set. We are also aiming to add highly available NFS ReadWriteMany support, and highly available S3 support as well. Last on the list, we are in fact working on a very exciting new thing: a new high-performance engine based on SPDK. So I will leave it to my colleague Joshua to demo the current status of Longhorn and maybe give you a peek at what's in the future. I'll just quickly switch over. Do you need to press a button here? Test, test. I'm not sure, okay. Is this transmitting? I can hear. Oh yeah, okay. Thank you. So since we previously gave lots of intro demos, I figured I'd do a little bit of a different demo today, in which I talk about some cases we've seen when you run Longhorn in production. All right, Linux, let's go. Things we've seen when you run Longhorn in production: how do I get auto scaling to work? What kind of setup do I need? How can I get more control? Yeah, that's short, I think that works. Are we seeing something? You have full HD, right?
Oh yeah, full HD? Fine, then it should work at full HD, but okay, I'm going to keep it as is. Should I switch or keep it? Yeah, this is fine. All right, I'll keep it like this. It's a little smaller, but for the sake of time, let's do this. Okay, so like I said, we've previously done lots of Longhorn demos as intros, so I figured I'd talk about some setups you want to have for your production clusters in the enterprise space. Let's start with number one: how do I do auto scaling? What would my cluster setup look like? I'm using Rancher as a visualization tool here for the demonstration, but all of this can be done with operators, YAMLs, and Helm directly. So let's have a look at how I set up my Longhorn demo cluster and... let me find my mouse. You can see that I have three node pools. Generally, you want three node pools on your non-bare-metal clusters. Anytime you want to do auto scaling, you want to have your control planes separated from your worker nodes. We've seen that one: generally people end up putting control plane and worker nodes together, and then one thing we saw is that they scaled the cluster down and were left with effectively only control plane nodes, which is a problem in itself. So you want three node pools: one for the control plane; one I call the stable set, which is basically your minimum resource availability set. Anytime you guarantee a certain capacity, you can scale the stable set up, but you cannot really scale it down, because you need to be able to guarantee that capacity for the lifetime of the cluster; therefore it's your minimum availability. That's the "perm" pool here in this case. And lastly, your transient set, which is just the regular worker pool. The worker pool itself can be permanently auto scaled; you want to set it up so that it can scale up and down quickly depending on demand, based on your metrics. In my case, all the nodes are the same.
They use the same node template, but what you would want is for the minimum resource set to have bigger nodes with more CPU and more storage, so that you have less churn in that environment, and for the transient set to have lots of small nodes, because it's pretty cheap to recycle them. All right, so that's that. Now the other question is: what options do I have when setting up? How do I use Longhorn to define where my data actually gets stored? In an enterprise environment you have different kinds of nodes. Some have GPUs, say for machine learning workflows; some have NVMe; some have spinning disks. And you don't want the slowest web server using storage from the NVMe disk on the node that has the GPU, for example; that would be the worst case. To solve this, we provide a whole bunch of parameters in our storage classes that allow an administrator to define this. Let's have a look at... it's too small, the resolution is kind of bad. Better? Okay, so we have the concept of disk and node selectors. I'm going to zoom in now. There. Disk and node selectors allow us to tag our nodes, as well as the disks you have configured in Longhorn. I'll give an example and go to the Longhorn UI to visualize it very simply. What you as an administrator want to do is set up a collection of storage classes that define the parameters of the workload types you want to offer in your cluster to your developers or end users. In this case, I just created a disk selector called "fast", so any volume that gets provisioned with the Longhorn NVMe storage class ends up getting its replicas scheduled on a disk that fits that selector criterion.
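A sketch of such a storage class, assuming the NVMe disks have been tagged `fast` in Longhorn; `diskSelector` and `nodeSelector` are the relevant Longhorn parameters here, while the class name and tag values are illustrative:

```yaml
# Sketch: a Longhorn StorageClass that pins replicas to tagged disks/nodes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-nvme            # illustrative name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  diskSelector: "fast"           # replicas only land on disks tagged "fast"
  nodeSelector: "storage"        # ...and only on nodes tagged "storage" (example tag)
```

Developers then simply reference `longhorn-nvme` in their PVCs, and the scheduling constraints are enforced without them needing to know the disk layout.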
So this limits, or makes sure, that if you have certain workloads, say databases that need high IO, the replicas end up on the appropriate disks. Yeah, let me just show what that looks like. Sorry about the resolution, can you see that? It's kind of small now, my apologies for the technical difficulties. You can see I pre-tagged the nodes here. I ended up tagging my permanent nodes, which are my stable set nodes, as well as, as an example, a "gpu" tag for the node that has a GPU. For certain workloads, if you don't have affinity, or zones and regions, set up, you can also tag individual nodes as a workaround, for now, for the current version, until we have volume group support, to ensure that your volumes get scheduled appropriately. Say you have an application that does its own replication, and you want to ensure the application replicas get scheduled so that you always have at least two of them available, so you can afford to lose one. That doesn't make too much sense to express today, but once we have volume group support, we'll know that a volume is part of a group, a set of volumes, and we can apply scheduling rules to the replicas as a set. Until then, as a workaround, you can tag individual nodes and have, say, three storage classes, one for each application replica, and they would then ensure that the replicas get distributed evenly. Okay, the second thing is that you can add many disks. On your stable set it's perfectly fine, after evaluating your resource consumption, to say: I'm not actually using much CPU or memory, but I'm getting low on storage. You can attach additional disks via the cloud provider's block storage and expose them as part of the Longhorn node, which then allows you to tag them appropriately as well.
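The tagging shown in the UI is backed by Longhorn's Node custom resource, so it can also be managed declaratively. Roughly, a node with an extra cloud disk tagged for NVMe workloads would look like this; field names follow my reading of the v1beta1 CRD, and the node name, paths, and tags are examples:

```yaml
# Sketch: Longhorn Node CR with node-level tags and per-disk tags.
apiVersion: longhorn.io/v1beta1
kind: Node
metadata:
  name: worker-1                 # example node name
  namespace: longhorn-system
spec:
  tags: ["storage"]              # node-level tags matched by nodeSelector
  disks:
    default-disk:
      path: /var/lib/longhorn
      allowScheduling: true
      tags: []                   # untagged: general-purpose replicas land here
    nvme-disk:
      path: /mnt/nvme            # extra cloud-provider block device
      allowScheduling: true
      tags: ["fast"]             # matched by a storage class's diskSelector
```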
In this case this disk is tagged as an NVMe disk, so any replica provisioned with the previous storage class ends up on one of the NVMe disks, while the default disk is just a regular disk: no tags, so no specific scheduling rules. All right, that's that, and last but not least I want to talk a little bit about the upcoming SPDK engine. It's still in the works, and here are some quick numbers; this resolution is too small, sorry, I need to make it small like this. Yeah. Okay, so this is a quick developer benchmark, so take it with a grain of salt, it's not production ready yet. But in general you see that the IOPS we're getting are pretty close to the local disk. This includes a file system, right? So this is a host-mounted file system versus an SPDK-provided Longhorn volume mounted to a local path on the node. We have a little bit more latency: we go from 132 microseconds to 160 microseconds for the IO, but again, this is only for the local replica. Once you add remote replicas the latency will go up a little bit. But also, due to the architecture change with the SPDK implementation, we're going from basically one CPU per process, where previously each volume was an independent process, to one process for the whole node, which allows it to scale linearly with the number of CPUs as well as the number of disks. For large permanent storage sets like the one I demonstrated here, that's really beneficial; for a hyperconverged setup it's also beneficial, but less so, because you have fewer resources per node. Yeah, let me elaborate on that. What you're seeing right now on the left side is in fact local path, which is basically your bare-metal hard drive, and I think it's backed by NVMe... No, these are still SSDs, just regular SSDs. SSDs, okay.
On the right side you see the Longhorn SPDK label; that is our SPDK engine currently in development, nowhere near finished, very big caveats there, but it shows off the potential we can get. As you can see, there's basically no difference in IO per second, and basically no difference on the bandwidth side as well. On the latency side the percentage difference looks big, but looking deeper you'll see that we basically only add about 20 microseconds of latency on top of the native disk, which is in fact pretty fast. This is the new engine we are currently developing, based on SPDK, which is an open source storage framework developed by Intel. Also, this is strictly local right now. Our SPDK implementation does support replicas, and we've started adding support for snapshots, but this is at a very early stage and we're still working on it. But for the people looking for performance, this is something I can show you. Okay. Yeah, we've definitely heard the community's concerns about performance, and we're working very hard on it. Shout out to my colleague Keith Lucas, who is driving the SPDK initiative. As you can see from this quick dev benchmark, the comparison is doing pretty well. Just for reference, this is on a single CPU provided to SPDK, on a single disk. Now you can imagine how this will roughly scale once you add more CPUs and more disks. Okay, I think we have five minutes, right? Yeah, I think we're just on time, that's great. All right, any questions? Question from the audience: what happens when you add a node? Do the existing replicas rebalance onto the new devices or nodes? Yeah, I can take it.
Yeah, so currently, if the replica soft anti-affinity rule is enabled, which is in fact disabled by default, and replicas end up on the same node, we auto-balance them to another node. But rebalancing based on capacity is not done automatically. That's in fact one request we've heard from the community, and we're taking it into consideration for the roadmap. Yeah, we did add some auto-balancing features. They're not complete yet; we have some best-effort balancing, but not a 100% guarantee. Generally, what I'd like to have is different affinity rules, so you can set up zones for your Kubernetes nodes, which we will also respect as part of the affinity rules. We have disk affinity as well as node affinity. Follow-up: is it a weighted thing, using affinity rules? No, right now it's basically just anti-affinity: across nodes, or across zones and regions, which is how we can support EKS, as I mentioned. I think we have cross-disk too, but I don't think that's commonly used. No weights at this moment. Another question: one more thing regarding replication. It consumes the network; do you recommend a dedicated interface for the replication protocol? I don't know exactly how it works. Oh, so yes, in the upcoming 1.3 release, we're adding dedicated storage network support, so you can utilize Multus to have your storage traffic flow independently of the application traffic. Yeah, the data path is the same for the replication traffic, so you would definitely split them: say you have much higher bandwidth for your storage than your applications, or much higher for your applications than your storage, that setup would provide that. And one more question regarding upgrades.
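For reference, the storage network feature pairs a Multus NetworkAttachmentDefinition with a Longhorn setting that points at it. A sketch, assuming a Macvlan attachment on a second NIC; the attachment name, interface, IPAM range, and exact setting value format are assumptions to verify against the 1.3 documentation:

```yaml
# Sketch: Multus network attachment for replica traffic...
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-net              # hypothetical attachment name
  namespace: longhorn-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
# ...and the Longhorn setting that routes storage traffic through it.
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: storage-network
  namespace: longhorn-system
value: longhorn-system/storage-net   # <namespace>/<attachment-name>
```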
Question: I understand that the service is not affected, it's still available, but I guess the IOPS are affected? No, the IOPS are not going to be disrupted when you do the upgrade; we call it live upgrade. We used to require a break, like detaching the volume, when you did the upgrade, but since basically the 1.0 release, in fact since the 0.8 release, every version can be live upgraded, and this is something we keep working at to maintain in the future. We want to make sure there's no breaking upgrade. Follow-up: because from my experience, we have some IOPS-demanding applications, and when we upgrade, those applications fail because we lose performance during upgrades. That can happen if your network is very saturated; with the older 1.1 version you might hit this situation. I think after 1.2, or a certain version, we lowered the rebuild bandwidth limit, and that should help. Yes, we also changed the timeout-handling mechanism on the IO side. So say you previously had slower disks, which was one request we had; we improved that a little bit. It's still not officially supported, since if you want really good performance on slow disks you need dedicated optimizations, but with the changes in the newer versions you should be able to run on slower disks. I run one of my dev clusters on a set of Raspberry Pis with SD cards, so for testing that's sufficient; of course, in production I wouldn't run it there. Okay, we've got an online question from Luke B: how is ARM support coming along? ARM is supported. In fact, ARM support was contributed by a community member; we accepted that PR, polished it, and now ARM64 is officially supported on Longhorn. Yeah, so ARM64 is supported.
ARM 32-bit is not supported, so if you're using a Raspberry Pi, make sure you're installing a 64-bit compatible operating system. I think Raspberry Pi OS doesn't have it yet, or might have just gotten its 64-bit version out, I'm not sure, but Ubuntu definitely works. Question: about the ReadWriteMany support. I see in the roadmap that HA is going to be supported at 1.4, and I see it's experimental right now. Is it going to be similar to the current implementation, just made HA, or what's going on there? So right now the way we implement RWX support is by having what we call a share manager. It's basically a provisioner that ensures the underlying block device volume gets attached to a certain node, and then we export NFS shares via NFS Ganesha at the application level from there. The problem is, for highly available systems you need a mechanism that is either active-passive or active-active. You can't do active-active with an RWO block device, but we can do something like active-passive. Right now you can use it; it is marked as experimental, but it's usable. The only downside is that if the node where the NFS server runs goes down, we end up terminating your workloads; that depends on settings, depending on configuration, but we end up terminating your workloads, which generally is a service interruption. However, if you have them in a Deployment or a StatefulSet, replacement workloads will be spun up, and the replacement workloads will attach to the new RWX share manager. So the interruption of service is very small if you end up having an issue, but that's why we classify it as not highly available yet. Once we have an active-passive setup, the transition time would be much smaller, and there would be no downtime for the actual application service. There would be an IO stall during the handover, but that's handled on the kernel side, so we should be fine there. Yeah.
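From the user's side, the share-manager path described above is triggered simply by requesting ReadWriteMany from a Longhorn storage class; Longhorn then attaches the underlying RWO block volume to one node and exports it over NFS to all consumers. A minimal sketch, with a hypothetical claim name:

```yaml
# Sketch: an RWX claim that Longhorn serves via its NFS share manager.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # hypothetical name
spec:
  accessModes:
    - ReadWriteMany              # routes provisioning through the share manager
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi
```

Multiple pods can then mount this claim simultaneously, with the caveats about NFS-server node failure noted above.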
Yeah, so if you can live with a short interruption of service in the case of node failure or volume failure, then you can use it. But officially we don't have high availability yet; that's why we mark it as experimental. All right, thank you guys very much. That's all the time we have for this session. Big round of applause for the Longhorn guys. Thank you.