Hello everyone. I'm happy to be here, and today we will talk about Longhorn. I'm David Ko from SUSE. And I'm Joshua Moody, also from SUSE. I'm a software engineer there and one of the Longhorn maintainers.

So let's get started. First we will talk about Longhorn itself: what Longhorn is, the general feature list, the current community status, and the story and roadmap. We also recently had some release updates, so I will share that information with you. Then we will look inside Longhorn at how it works: the control plane, the data plane, and the major functionality like snapshots and backups. Of course people care about, and are curious about, disaster recovery, so that part will be covered as well, and non-disruptive live migration is also part of it. Lastly, we will talk about what's next, the mid-term and long-term plans for Longhorn.

Okay. Longhorn is basically highly available, software-defined storage for persistent volumes. It provides a block device, but on top of that you can have a file system as well, so there are different scenarios depending on your use case. It is lightweight and reliable. We say that because we don't rely on external components like databases or other services; we rely only on Kubernetes resources. The installation is also very easy, with no other dependencies to set up.

The major thing is persistent volumes. Longhorn provides different volume modes and access modes: block device and file system for the volume modes, and ReadWriteOnce and ReadWriteMany for the access modes. Maybe some of you who use Longhorn are curious when ReadWriteMany will become generally available. I would say it will be soon, because by the end of this year we have the 1.4.0 release, which will make ReadWriteMany GA. I will talk about it later.

Next, storage agnostic. We want to make Longhorn easy to use, so we don't make users worry about how to manage the storage underneath Longhorn. Any kind of host file system is supported; Longhorn just places sparse files on top of it. The ones verified by our team are ext4 and XFS. This means you can rely on whatever file system you already have and use Longhorn directly. And it's not just in-cluster: we also support out-of-cluster data protection. It's a built-in function, and you can back up your volumes to NFS or an S3-compatible server. I will also talk later about a new capability in 1.4, system backup.

And of course Longhorn is designed and implemented the Kubernetes way. We just focus on CRDs and the controller pattern, nothing else, so it's very straightforward, with no other dependencies. Longhorn is a CNCF open source project and became an incubating project this year.

For the features, I don't want to go over all the items, but I want to highlight some major ones. The first is volumes: Longhorn provides thin provisioning, which lets you use your storage efficiently. We also have in-cluster snapshots, out-of-cluster backup and restore, and clone and expansion based on the CSI protocol, so everything you do with your volumes through CSI works well with Longhorn. On the security side, we provide volume encryption in the current version, at rest and in transit, based on the current Longhorn architecture.
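To make the thin-provisioning and sparse-file ideas concrete, here is a minimal Go sketch for Linux (an illustration of the general technique, not Longhorn's actual replica code): the file is truncated to the full nominal volume size, but the host filesystem only allocates blocks that are actually written.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	const volumeSize = 10 << 30 // nominal volume size: 10 GiB

	f, err := os.Create("volume-head.img")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Truncate creates a sparse file: the size is 10 GiB,
	// but no disk blocks are allocated yet.
	if err := f.Truncate(volumeSize); err != nil {
		log.Fatal(err)
	}

	// Writing at an arbitrary offset allocates only the blocks
	// backing that write; the rest of the file stays a hole.
	if _, err := f.WriteAt([]byte("hello"), 4096); err != nil {
		log.Fatal(err)
	}

	var st syscall.Stat_t
	if err := syscall.Stat("volume-head.img", &st); err != nil {
		log.Fatal(err)
	}
	// st.Size is the nominal size; st.Blocks*512 is the actual usage.
	fmt.Printf("nominal: %d bytes, allocated: %d bytes\n", st.Size, st.Blocks*512)
}
```

This is why a "10 GiB" Longhorn volume only consumes as much host space as the data written into it.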
And if we talk about the replicas, which hold the volume data, we have different scheduling strategies. Simply put, we provide soft and hard anti-affinity at the zone and node level (there is a small sketch of the idea at the end of this part), and also automatic replica rebalancing. That one is for when a node in your cluster goes down and comes back up: we have a mechanism so the replicas can be rebalanced. Storage tags are another way to manage your nodes and disks, and they serve you well for matching your workloads and volume usage. Then there is disaster recovery and cross-cluster recovery, which is something we will talk about later, and live migration. For backup and restore, based on user feedback, we have policy-based recurring jobs, so you can define a recurring job for in-cluster snapshots or external backups.

One thing I just mentioned: ReadWriteMany will become GA by the end of this year, in the 1.4.0 release, together with ARM64. We already support ARM64 as an experimental feature; however, we were not very confident before, because we did not run the full regression on it. Now it is integrated into our testing pipeline, so the test coverage is the same as for AMD64. And the UI: the UI is a simple way to operate Longhorn, but we also support CRDs, admission webhooks, and so on, so the interface is more than just the UI.

Now let's talk about the community and momentum for Longhorn. Longhorn has a long history; it grew along with Kubernetes, and adoption keeps growing, especially enterprise usage. We have a kiosk this time, and we hear feedback from users there; users have many different usage scenarios, and we have some telecom users as well. From our metrics we know there are now more than 17,000 nodes running Longhorn, and that keeps growing: compared with three years ago, it has grown 23 times. These are the metrics we collected around the middle of October, and it keeps growing, roughly 92 percent year over year.
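Before moving on to the releases, here is the promised rough sketch of what node-level soft anti-affinity means for replica scheduling (my own simplification, not Longhorn's scheduler code): prefer nodes that don't already hold a replica of the volume, but reuse a node rather than fail when nodes run out; hard anti-affinity would fail instead.

```go
package main

import "fmt"

// scheduleReplicas picks a node for each of n replicas.
// Soft anti-affinity: prefer nodes without an existing replica,
// but reuse nodes instead of failing when nodes run out.
// Hard anti-affinity would return an error instead of reusing.
func scheduleReplicas(nodes []string, n int) []string {
	used := map[string]bool{}
	var placement []string
	for i := 0; i < n; i++ {
		picked := ""
		for _, node := range nodes {
			if !used[node] { // preferred: a node with no replica yet
				picked = node
				break
			}
		}
		if picked == "" { // soft fallback: all nodes taken, reuse one
			picked = nodes[i%len(nodes)]
		}
		used[picked] = true
		placement = append(placement, picked)
	}
	return placement
}

func main() {
	// Three replicas across two nodes: the third replica has to
	// land on a node that already holds one.
	fmt.Println(scheduleReplicas([]string{"node-1", "node-2"}, 3))
}
```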
Let's talk about the recent releases. We don't have a brand-new release right now, but we released 1.3 in June, and we did not highlight it much at the previous KubeCon. This page gives you an understanding of the major items in 1.3.

We introduced admission and conversion webhooks for the Longhorn resources at v1beta2. That means you don't need to worry if you are using v1beta1: we automatically do the conversion. This is different from some other solutions, where you need to prepare some migration work yourself when you move to the next version of a resource.

Storage network: some people from the community wanted the data plane traffic to be separated from the control plane. With the storage network feature, leveraging Multus, you can define your own network so that our data plane components route their traffic through a second interface.

Also, some users run Longhorn in the cloud, especially public cloud, and they wanted to know whether we could support things like cluster autoscaling and node pools. Previously we had some trouble there, because different public clouds have different strategies for their node pool management, and cluster autoscalers have different mechanisms that we did not respect. In 1.3 we support these scenarios. I think you can give it a try; we still mark it as an experimental feature because we want more feedback from the users.

Snapshots: snapshots are involved in the space usage, so we did more work on snapshot purging to make the space usage more efficient. On the right-hand side there is CSI snapshots. CSI snapshot support is an existing feature, but it only covered the external volume backups. Longhorn also has in-cluster snapshots, and we want users to have the same experience, so now a user can use a CSI VolumeSnapshot for an in-cluster snapshot or for an external backup.

Backing images: a volume is not just raw data; you can create your volume based on a backing image. You can also use backing images as a backup strategy, because you can export a volume as a backing image, download it, and reuse it later. This is quite efficient for some users, especially for virtualization.

Secure communication: we introduced TLS encryption as an option, so you can enable it to secure the traffic between the control plane and data plane, especially the control plane commands.

And upgrades: again, the feedback from the community was that upgrades take a lot of time. In 1.3 we did some fine-tuning on the client side to make upgrades much faster. If you are a user on 1.3 and you still encounter slow upgrades, let us know and we will see how to improve it if there are still issues. So that is 1.3.

1.4 is the upcoming release, by the end of this year, and it focuses on the data itself. Longhorn really respects the feedback from the community, so everything you see here is what the community voted up as well. The first item, on the left-hand side, is trimming volumes: because Longhorn is a block device, it could not really respect the file system's trim requests to free up space. We will support that in 1.4.0.
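To picture what trim support means for a sparse-file-backed block device, here is a hedged Go sketch for Linux (my illustration, not Longhorn's engine code): a trim/discard for a block range can be honored by punching a hole in the underlying sparse file, returning the space to the host filesystem.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// punchHole frees the host-filesystem blocks backing [off, off+length)
// of a replica's sparse file, the way a TRIM/discard from the guest
// filesystem could be honored. KEEP_SIZE preserves the nominal size.
func punchHole(f *os.File, off, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		off, length)
}

func main() {
	f, err := os.OpenFile("volume-head.img", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Free a 1 MiB range that the filesystem reported as unused.
	if err := punchHole(f, 4096, 1<<20); err != nil {
		log.Fatal(err)
	}
}
```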
And ReadWriteMany volumes become GA. Previously we did not have a recovery service for ReadWriteMany; our design uses a share manager, and if the share manager went down and came back, keeping the data consistent for the connected clients was a problem. Now we introduce a recovery backend for our share manager, so when the share manager comes back, the client connections are kept and data consistency is preserved.

Backup and restore: right now Longhorn has volume backup and restore, but it is per volume. In 1.4.0 we will have Longhorn system backup and restore. That means you can have a different strategy for your cluster backup, and you don't need to rely on Velero. We have documentation showing how to use Velero to do some of this for a Longhorn system backup, but going forward we don't want users to have to manage that themselves with a different tool; we want a one-stop service so users can do Longhorn system backup and restore directly.

Bit rot protection, together with snapshot checksums: right now we rely only on the revision counter mechanism in our volumes. That is not good enough, because there is some possibility of data inconsistency. In 1.4.0 we will have a checksum for each snapshot that keeps being calculated, and not just calculated: there is a setting you can enable for detection, so out-of-date or corrupted data will be removed and the replica rebuilt from a healthy one. So that is bit rot protection.

ARM64 I just talked about. Then Kubernetes 1.25: people asked about the Pod Security Policy (PSP) removal, and we are working on that. If you follow our GitHub, the PR is ready and the code is already merged, so it is coming sooner or later. We have some concerns about backporting to the current releases like 1.3 or 1.2, because for every release we define the minimum Kubernetes version Longhorn supports, so different factors have to be considered together. Kubernetes 1.25 support will happen in 1.4.0.

Support bundle: users can download a support bundle when they report issues, but we found the information was not good enough, especially since Longhorn also relies on the host. With the new support bundle integration this part will be much better, so when a user reports an issue, it can be identified by our engineers more easily.

Online volume expansion is likewise something very many people asked for. And local volumes: we have a data locality setting for your volume, but it is best effort. Many workloads today actually run distributed applications, like databases, so the requirement for local volumes is getting hot. We want to extend data locality with a strict, enforced mode to get better performance. So that is 1.4.0; please look forward to the upcoming release by the end of this year.

Also, recently we have patch releases, because a rarely occurring data corruption could happen. We fixed it in 1.3.2 and the upcoming 1.2.6, so if you are using 1.2, you can upgrade to the upcoming patch release.
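Data integrity came up twice here, in the bit rot protection feature and in that corruption fix. To illustrate the checksum idea behind the 1.4.0 bit rot protection (a minimal sketch under my own assumptions, not Longhorn's actual implementation): hash each immutable snapshot file, store the digest, and periodically re-hash to detect silent corruption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

// checksumFile hashes a snapshot file. Snapshot files are immutable
// once taken, so any change in the digest indicates bit rot.
func checksumFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	path := "snapshot-001.img" // hypothetical snapshot file
	recorded := "..."          // digest recorded when the snapshot was taken

	current, err := checksumFile(path)
	if err != nil {
		log.Fatal(err)
	}
	if current != recorded {
		// The corrupted copy would be discarded and rebuilt from a
		// replica whose snapshot checksum still matches.
		fmt.Println("bit rot detected: rebuild from a healthy replica")
	}
}
```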
Okay, so next we will talk about Longhorn internals. I will briefly introduce these slides, and later Joshua will talk in detail about the other functionality and how the data plane works. Basically, I summarize Longhorn in five parts: volume, lifecycle, data placement, deployment, and control plane.

The volume actually has three parts. The first is the volume frontend: right now we rely on the iSCSI protocol, but we have a new plan. If you joined yesterday's session on Longhorn with SPDK, that is the plan for the next-generation data plane; you can watch the recording if you did not join it. The second part is the volume engine, which I would call the volume controller, and the next is the volume data, the replicas. So those are the three parts.

How about the Longhorn volume lifecycle? It is triggered by the PVC and the related custom resources; the custom resources are built-in resources, and we have CSI sidecars bundled together with the Longhorn distribution, so they watch those resources and perform the corresponding operations. Of course, because Longhorn is a six-to-seven-year-old project, we have different interfaces, like a REST API, but we encourage users to start using the Kubernetes way, to have a consistent user experience.

Data placement: because Longhorn is storage agnostic, a replica is actually a file on top of the host file system. Depending on your strategy, you can define different mount points for different Longhorn disks and give them different tags for your replica scheduling strategy.

Deployment: I call it segregated microservices, because each volume is independent from the others. If one volume has some problem, take for example its engine or replicas, it will not impact the others; each volume is like its own self-contained group.

The control plane just relies on Kubernetes controllers, and Longhorn has its own custom resources. Right now we have webhooks for admission and conversion, and the goal for the future is that when we have a next version of the resources, we will do the same.

Okay, Joshua will talk about the data plane.

Cool. So right now our frontend is based on iSCSI, using tgt, a user-space iSCSI target. There are two components in our data plane: the volume engine, which you can think of as a volume controller, and the replica, which is just the volume data, like a dumb data store. The communication happens over TCP/IP. There are some improvements possible there as well; the SPDK talk goes more in depth on that, so I'm going to keep going to the next slide.

One very interesting aspect of distributed storage is that you want fault tolerance. In a traditional setup you would have pod A, which is using a volume, and it would traditionally just be writing to the direct host disk. That creates a problem: that pod is forever coupled to that node, because it relies on the storage of it. So you need a level of indirection in between, and we provide that with the engine and replica blocks in between. This allows the replication of the data across nodes; in our case we do this synchronously, so our replicas are equal at all points in time. This also allows us to move the workload arbitrarily: if the Kubernetes scheduler decides, hey, this node is resource-exhausted, for example, and I need to move this workload because it's low priority, then that work can move to an arbitrary other node in the cluster.
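A hedged sketch of what that synchronous replication means for the write path (my own simplification, not the Longhorn engine): the engine acknowledges a write only after every replica has acknowledged it, which is why the replicas stay equal at all points in time.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a stand-in for a TCP connection to a replica process.
type replica struct {
	name string
	data map[int64][]byte
}

func (r *replica) writeAt(off int64, p []byte) error {
	r.data[off] = append([]byte(nil), p...)
	return nil
}

// engineWrite fans a write out to all replicas and succeeds only if
// every replica acknowledges, keeping them identical at all times.
// A replica that fails the write would be marked faulty and rebuilt.
func engineWrite(replicas []*replica, off int64, p []byte) error {
	errCh := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r *replica) { errCh <- r.writeAt(off, p) }(r)
	}
	for range replicas {
		if err := <-errCh; err != nil {
			return errors.New("write failed: a replica must be rebuilt")
		}
	}
	return nil // acknowledged by every replica
}

func main() {
	rs := []*replica{
		{"replica-node1", map[int64][]byte{}},
		{"replica-node2", map[int64][]byte{}},
		{"replica-node3", map[int64][]byte{}},
	}
	if err := engineWrite(rs, 4096, []byte("block")); err == nil {
		fmt.Println("write acknowledged by all replicas")
	}
}
```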
It also helps us in the case where you lose a node. In this case node one would go down; of course, whatever was running on that node at that point in time would be down, and it would get rescheduled by the scheduler. We have some improvements there: if you have ever used a StatefulSet before, you will notice that the pods don't move by default, so we help clean that up and help the process along. So the Kubernetes scheduler would now schedule pod A on node three.

Node three in this case is purely a compute node. In Longhorn you don't necessarily need every node to be a storage node. You can have a dedicated storage pool, which is like the traditional centralized storage architecture, while if you have storage on each node, that's more like the hyperconverged storage architecture. If you want the benefit and cost overview of those, join us at the kiosk for an in-depth discussion later.

So you've seen that the node failed, but the data is saved because we have replication. Kubernetes was able to reschedule the pod to an arbitrary other node, in this case a compute node, and the engine continues processing and living on happily ever after.

You can see that there are many different volumes on each node. We have an instance manager for replicas and an instance manager for engines; that's one pod each per node, which manages the processes inside of it. And there are changes in that architecture coming as well, on the SPDK side.

You can also have different numbers of disks in Longhorn; each node can be different. Generally it's very nice to have a node template set up, but if you know your workload has a certain storage-to-compute resource relationship, then hyperconverged storage is the way to go, because as your compute scales up, your storage scales up equally. A good example use case would be video transcoding: you roughly know how many compute or memory units you need per storage unit, so that's a really good use case for hyperconverged storage.

Now a quick overview of how Longhorn actually works: how we integrate with Kubernetes, and what the control plane looks like. Kubernetes has the CSI specification, a general-purpose interface for storage integration. We provide the Longhorn CSI plugin, which implements the CSI spec, and it communicates with our longhorn-manager, a DaemonSet that runs on each of the nodes and provides a backend API as well as our Kubernetes controllers. So if a user, for example, tries to create a volume, the CSI calls for CreateVolume, ControllerPublishVolume, and so on all go against the backend API, and our controllers then create the CRD resources from that. If you want to use this in an API-based fashion, you could also just kubectl the custom resources directly; that would be possible too. The longhorn-manager then reacts to these calls and creates the appropriate resources, in this case an engine resource and replica resources, and schedules them across the nodes. So that's that; let me see where I am.

Now I'm just going to go over some situations where Longhorn is useful, or where it's useful to have a software-based storage backend. If you use, for example, an Amazon EBS volume directly, then that volume is tied to a certain zone. That can be okay for some use cases, and for other use cases you might want to be distributed across different zones, so you need a level of indirection, by having a software-defined storage stack.
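To make the control plane flow described above concrete, here is a heavily simplified sketch; the types and names are my own stand-ins, not the real CSI Go bindings or the longhorn-manager API.

```go
package main

import "fmt"

// Hypothetical stand-ins: the CSI plugin receives spec-defined
// requests and forwards them to the manager's backend API, whose
// controllers then create the custom resources.
type createVolumeRequest struct {
	Name      string
	SizeBytes int64
	Replicas  int
}

type customResource struct {
	Kind, Name, Node string
}

// backendAPI mimics the longhorn-manager endpoint the CSI plugin calls.
type backendAPI struct {
	objects []customResource // stands in for CRs stored in Kubernetes
}

// CreateVolume shows what a CSI CreateVolume call ultimately triggers:
// one engine resource plus one replica resource per desired replica,
// scheduled across nodes.
func (b *backendAPI) CreateVolume(req createVolumeRequest) {
	b.objects = append(b.objects,
		customResource{"Engine", req.Name + "-e-0", "node-1"})
	for i := 0; i < req.Replicas; i++ {
		node := fmt.Sprintf("node-%d", i+1) // placeholder scheduling
		b.objects = append(b.objects,
			customResource{"Replica", fmt.Sprintf("%s-r-%d", req.Name, i), node})
	}
}

func main() {
	api := &backendAPI{}
	api.CreateVolume(createVolumeRequest{"pvc-demo", 10 << 30, 3})
	for _, o := range api.objects {
		fmt.Printf("%s %s on %s\n", o.Kind, o.Name, o.Node)
	}
}
```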
On the snapshot side, we are introducing a bunch of stuff in 1.4 to make snapshots more performant as well as more reliable. We do the snapshot checksumming, which also allows us to do an even faster delta rebuild: if you lose a volume, we have to rebuild that volume's replica on a different node, and with per-snapshot checksums we can skip data we have already verified. Right now we already do verification of the actual snapshot data, but that process is tiresome, because we need to go through the whole snapshot chain; if you have a bunch of snapshots, it takes much longer. But if we can already identify that a snapshot is healthy, over time we can skip the rebuild for it and reuse some of the pre-existing data without additional verification.

How does the storage actually work? What is the representation of the volume? Well, it starts out with a single sparse file. We provide a file system on top of that, and any time you write anything onto that file system, a bunch of blocks go into a sparse file on each of the nodes, via the engine-to-replica communication. When you take a point-in-time snapshot, we create another sparse file as a layer on top of that one, and as long as the new layer has no data for a block, we just read through it to the layer below. We have a performant index to do that, so we don't have to scan the whole snapshot structure every time. Any time you then write to it, the prior data still exists in the layers below; it is no longer needed for the live reads once overwritten, but it might still be read for backups and restores, restoring to a certain point in time. So we actually build a whole snapshot graph per volume: you can have different starting points and different ending points, and on the right you see what the volume's live content actually looks like.

Great. How do backups actually work? We have some improvements on the backup side that we didn't mention; I think they came with 1.2: recurring job groups and volume backup policies. Previously, if you used Longhorn, we had the ability to run recurring jobs, but that was a per-volume setting. It's much nicer if you can define policies for your backups. Let's say you have your hourly, daily, and monthly policies, and then apply one to all volumes, like a default policy, or you have a specific sensitive volume group, so you have a policy for them, say every hour with a retain count, et cetera. That's very cool; if you want a demo of that one, come to the kiosk. Backups rely on our snapshot mechanism, and we do some delta there as well, trying to be very efficient. We also do checksum verification, because the worst thing that can happen to someone is having a backup that isn't actually usable for restores. So if you're doing backup setups, definitely make sure that once a month you test your backups for actual restorability.
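Before we get to rebuilding, here is a minimal sketch of the layered read path just described (my own illustration of the technique, not Longhorn's engine code): a read walks from the live head down the snapshot chain and returns the first layer that holds the block.

```go
package main

import "fmt"

// layer is one sparse file in the chain: the live head on top,
// then snapshots in newest-to-oldest order below it.
type layer struct {
	name   string
	blocks map[int64][]byte // only blocks actually written in this layer
}

// readBlock reads through the chain: the topmost layer that contains
// the block wins. A real engine keeps an index so it doesn't have to
// probe every layer on every read.
func readBlock(chain []layer, off int64) []byte {
	for _, l := range chain {
		if data, ok := l.blocks[off]; ok {
			return data
		}
	}
	return nil // hole: unwritten data reads as zeros
}

func main() {
	chain := []layer{
		{"live-head", map[int64][]byte{4096: []byte("new")}},
		{"snap-2", map[int64][]byte{8192: []byte("mid")}},
		{"snap-1", map[int64][]byte{4096: []byte("old"), 0: []byte("base")}},
	}
	// 4096 is overwritten in the live head; 8192 reads through to snap-2.
	fmt.Printf("%s %s\n", readBlock(chain, 4096), readBlock(chain, 8192))
}
```

The "old" data at offset 4096 stays in snap-1: no longer needed for live reads, but still available for restores to that point in time.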
Okay, rebuilding. Let's talk about how volume rebuilding actually works. This builds up in depth, so I'm going to go slowly. We have synchronous replicas, three by default, but this slide is going to show just one. What do we do? We have to very briefly pause the engine so that we can hook in a new replica; the user doesn't notice it. We take a snapshot on all the healthy replicas at that point in time, and since the I/O is paused, they all have the exact same snapshot content at that point in time. The new snapshot checksum mechanism will at that point also do an async verification in the background. This basically increases reliability, because we then know for sure that this snapshot is healthy: there was no bit rot, none of that happened in the past, so we're good to go.

Then we create a new write-only replica, which, once we unpause the engine, is going to get all the live I/O. Since the snapshot chain is like a stack, a snapshot graph, we can already start populating the live head while, in the background, we pick one of our healthy replicas at random and do an asynchronous transfer of its snapshot chain files. We do verification and validation so that all the checksums match, and that process is quite expensive: even if we don't have to transfer the data, right now we have to look at the whole snapshot and do a per-block checksum validation. With the new improvement we can just look at the snapshot file's record: okay, the last time we checked this file was three days ago, it had a healthy checksum, so we can skip it; it's already present on the new one. All right. Then, once the replica is synced up, we can flip it over from write-only mode into read-write mode, and it becomes a regular replica and part of the read set as well.

Okay. We have a live migration feature; actually, technically two. The first live migration feature is for Harvester: it was implemented particularly for virtual machine live migration. What's the challenge? If you're using a file system, the problem is you can't have ReadWriteMany access unless you have a distributed file system. What we can do, though, is have two parallel running engines and then do a handover between the engines. We also use this approach for live engine upgrades: any time you upgrade to a newer Longhorn version, we make improvements in the engine, that's the volume controller part, or the replica part, and we can create mirror processes from the original set of processes and do a live handover. You see this here: we switch over to our new mirror processes, and the old ones can go away.

Okay, I'm just going to do this one real quick. We talked about the improvement of the Longhorn system backup, right? Previously, if you had a disaster recovery setup, that also needed to be set up per volume. Now, with the new improvement, it will be easier to set up a mirrored disaster recovery cluster, in basically an active-passive configuration, so you can flip it over and all your volumes are ready to go. This relies on an external backup store that is shared between the clusters. This is nice for on-prem, if you already have an old NFS server, or in the cloud, for example Amazon S3 or an S3-compatible service on Azure. Both clusters can share the same backup store: any time a volume on the active cluster does a backup, it is populated to the backup store, and the volumes on the passive cluster pull that regularly and apply the newest backup. So you have a continuous set of data with a clearly defined period of possible loss, namely the delta between the last backup that you took and the live data.
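A hedged sketch of that active-passive idea (my own simplification; the names and the restoreIncrementally helper are hypothetical, not Longhorn's API): the passive cluster periodically lists backups in the shared store and applies anything newer than what it already has.

```go
package main

import (
	"fmt"
	"time"
)

// backup is a stand-in for an entry in the shared backup store
// (an NFS export or S3 bucket reachable from both clusters).
type backup struct {
	volume  string
	created time.Time
}

// latestBackup scans the store for the newest backup of a volume.
func latestBackup(store []backup, volume string) (backup, bool) {
	var newest backup
	found := false
	for _, b := range store {
		if b.volume == volume && (!found || b.created.After(newest.created)) {
			newest, found = b, true
		}
	}
	return newest, found
}

func main() {
	var store []backup // shared between active and passive clusters
	applied := map[string]time.Time{}

	// Passive cluster: poll and apply anything newer than what we have.
	// The possible data loss is bounded by the active cluster's backup
	// interval plus this poll interval.
	for range time.Tick(5 * time.Minute) {
		if b, ok := latestBackup(store, "pvc-db"); ok && b.created.After(applied["pvc-db"]) {
			// restoreIncrementally(b) // hypothetical: apply only the delta
			applied["pvc-db"] = b.created
			fmt.Println("standby volume pvc-db caught up to", b.created)
		}
	}
}
```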
Okay, I think that's it; we want to do some Q&A. But I want to talk really quickly about performance first. Ignore the numbers on this slide, but we already did some performance improvements for 1.2, and we're doing minor performance improvements on top of the 1.2 improvements for 1.3. We did that by improving our memory allocation and also switching the memory allocator on the C side, the tgt side. And there was the session yesterday on the SPDK side if you're interested in performance improvements; that's the future of the Longhorn data plane engine, while this year's work is just for the existing data engine, so that the people who are actively using it right now get some benefit out of it as well. Thank you very much. Are there any questions? Okay, four minutes.

Q: How does ReadWriteMany work?

A: Right. ReadWriteMany right now is set up on the file systems that we support; ReadWriteMany is for file system usage. As for how it's implemented: initially it was done via an external provisioner. We have now integrated that nicely, so that from the user's perspective there's no setup required. All you have to do is create a PVC with access mode ReadWriteMany, and we will internally create an NFS server, because you need a distributed file system, map the volumes, manage the lifecycle, handle the exports, et cetera, of that server. Any time the client workload does a CSI call for mounting or attaching, that will be wired via our controller to what we call the share-manager pod, which is basically running the NFS server and exports that volume to that client for use.

Q: A concern I keep running into with customers looking at this technology is what happens when you deal with large data sets, because of the migration costs, the time and effort. Is there a way to simply restitch an existing volume underneath Kubernetes? I have a storage array, it's already got the data on it, it's 25 disks or whatever it is; just stitch it back up under Kubernetes and let it be managed by the control plane. Is there a methodology or a concept to do that?

A: When you say stitching up, do you mean you already have, say, 25 Longhorn volumes, or block volumes?

Q: iSCSI volumes, currently exported and being consumed by a virtual machine; I re-instantiate the virtual machine as an image and deploy that.

A: But is this outside of Longhorn, or are we talking about Longhorn volumes?

Q: No, we're talking about a traditional storage volume, stitched back under the control of Longhorn. More like disk placement, how you manage them. Other vendors do this.

A: Longhorn actually doesn't have any particular strategy for that. We have disk tags, so you can manage what kind of disks you want to use, but beyond that, at, let's say, the rack level or the data-center level, we don't have that kind of design. You can give us feedback and we can see what we can do. I think I know where you're going, though. Let's assume you have this old appliance, say a RAID server with 25 disks. How this would work in Longhorn is: we would pick up the new disks and see that they're there; you might need to do some configuration initially to establish where they are. But if you had used that array for Longhorn before, and you just took it offline for a day or so to do maintenance on it, we would otherwise be doing the rebuilding of the existing replicas; instead we can see, oh, there are already replicas of these volumes here, and we can reuse them. It depends on settings like the replica replenishment wait interval: if you know you're going to take the array offline, you increase that interval, so we give it time to come back online, and then we can reuse the existing data based on the snapshots.
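A rough sketch of that reuse-versus-rebuild decision (my own simplification; Longhorn exposes a setting along these lines, replica-replenishment-wait-interval, but this is not its actual code):

```go
package main

import (
	"fmt"
	"time"
)

// shouldRebuild decides whether to replenish a replica from scratch or
// to keep waiting for the offline disk to return so its data can be
// reused. Raising waitInterval before planned maintenance gives the
// array time to come back before a full rebuild is triggered.
func shouldRebuild(offlineSince time.Time, waitInterval time.Duration) bool {
	return time.Since(offlineSince) > waitInterval
}

func main() {
	offlineSince := time.Now().Add(-10 * time.Minute)

	// Short interval: a 10-minute outage already triggers a rebuild.
	fmt.Println("rebuild now?", shouldRebuild(offlineSince, 5*time.Minute))

	// Interval raised for maintenance: keep waiting, then reuse the
	// existing replica data (verified via snapshot checksums) later.
	fmt.Println("rebuild now?", shouldRebuild(offlineSince, 24*time.Hour))
}
```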
Right, and this is where the improvement of the snapshots comes in. Let's say you have a one-terabyte virtual Longhorn volume there; you don't want to scan every single file in the snapshot chain. Instead: yep, I verified this one two weeks ago, this guy is good to go, we can skip it immediately, we already have the mirror copy here, it's fine. We only need to do the verification for the last, failed snapshot, basically the snapshot that was taken at the point in time when the device went offline, and the live head needs to be synchronized, of course.

Q: And my second question: speaking to customers, as we're stitching virtual machines under these Kubernetes virtualization frameworks, when things go bump in the night mid-migration, how do we help a customer? Are there any hooks that we could talk through? How does it detect when the volume comes over, and if there are four parts of the volume, does one have to go through a scan cycle, something like that?

A: I can only speak about our implementation for Harvester, because that's the use case I have. On the Harvester side you can do a trick: you can basically rely on the CSI calls, because you know when the new process goes online and you know when the old process goes offline. That can be used as validation to see whether the migration actually worked correctly or not, or whether you have to roll back. That's how I implemented the Harvester live migration work. If you're interested in that, come see us at the kiosk and I can give you some more in-depth material.

Q: Last one. I'm just wondering how you would recommend looking at the differences between using something like Longhorn, and its use cases, versus something like EFS, or NFS if you're using Google Cloud, or just the cloud providers generally.

A: Right. I can only speak from my perspective on this. If you're using something like EFS, which, if I'm not mistaken, is basically NFS underneath the hood, they just renamed it, then you are bound to their file system. That has benefits and advantages, because they're operating on a file system and might do some smart optimizations based on the file contents, et cetera, which you can't do when you're operating on the block level. But the problem is, if you ever need to move away from it, or you need more control of it, then that's a problem. I don't know their full feature set, whether they have I/O scheduling, whether they have different storage classes, et cetera; I don't know that off the top of my head. Then there's the cost factor, depending on your storage needs, and that can be a benefit as well. Basically, the conceptual difference is that Longhorn is storage software, a storage stack: you get the software that manages your storage, creates the storage, and deals with all of that, while with EFS the control of your whole storage is inside the cloud provider, which has benefits as well as downsides. So if you have a specific use case, come see us at the kiosk.

Great, thank you very much. And... time's up.