Okay, good morning everyone. Thanks for joining. Today the main agenda item is a presentation from Sheng, from Rancher Labs, who is going to talk to us about the Longhorn project, which he is submitting as a sandbox project. Sheng has filled in the information on our questionnaire. Alex, it's really hard to hear you. Oh, sorry, can you hear me better now? Yeah, much better. Sorry. So Sheng has put together the information for the Longhorn project, filled out the questionnaire, and put together a presentation, which we have linked in the agenda document, so they're available for background reading too. Unless there are any other immediate questions, I think, Sheng, please go ahead and start. Sure. All right, can you see my screen? Yeah, we can see the screen. Okay. Hi, thanks for having me. My name is Sheng Yang, and I work for Rancher Labs. For the last few years I've been working on this open source distributed block storage software from Rancher Labs called Longhorn. So today I'll be glad to present Longhorn and tell you more about it. Sorry, I probably caught a bit of a cold last night, so my voice is probably a little hoarse. Also, if you have any questions, feel free to interrupt me during the presentation. All right. So we started Longhorn in late 2014 at Rancher Labs, I think around September. The motivation was that we wanted an open source distributed block storage software for containers. But what makes this one different is that we wanted it to be simpler. We wanted it to be simpler than Ceph, which we know is basically the most popular open source storage software out there. We are not really Ceph experts, but we have seen many users using Ceph and finding it difficult to operate; it requires certain knowledge to really operate Ceph correctly. And that's why we started Longhorn.
Longhorn itself was adopted by OpenEBS as one of their storage backends back in March 2017, and I think that is one proof that Longhorn really is at least targeting enterprise-grade storage software: the technology has been adopted by other companies and used in their own products. It also demonstrates our embrace of the open source model. All of Longhorn's code is licensed under Apache 2.0, and if you want to know more about the licensing and external library dependencies, you can check the document or our PR to the CNCF TOC. So why build Longhorn? As I mentioned before, we believe that distributed storage software doesn't have to be that complex. The reasoning is this: if you consider modern high-speed, high-capacity disks, especially with the existence of SSDs, then one thing we can probably get away with is not doing striping across different disks, because high-capacity disks already have enough space for users to use, and striping is in fact one of the most complex parts of storage software like Ceph. If we can get away without doing it and still provide value to users, then the storage software we build should be much simpler. We also use proven Linux storage features like sparse files, and we're planning to do QoS with cgroups in the future. That makes it unnecessary for us to rebuild the full stack from the ground up: we utilize mature, existing technology for a lot of features rather than writing them all ourselves. In Longhorn's model, each volume is just a set of independent microservices, orchestrated by the Longhorn manager, which runs entirely on top of Kubernetes.
It follows the Kubernetes controller model: we write a bunch of controllers that orchestrate the flows of creating, deleting, and operating Longhorn volumes. Currently most of Longhorn's code is written in Go. The functional code, excluding the testing part, is about 30,000 lines of Go, and that includes the data plane, which is the Longhorn engine, and the management plane, which is the Longhorn manager. I will talk more about the architecture of the data plane and management plane later. So here is an overview of what the current Longhorn community looks like. We have submitted the sandbox PR to the CNCF TOC, and it is currently pending. Longhorn has about 600 GitHub stars, and we have made about 23 releases since we changed everything to run on Kubernetes. Currently we have 200-plus members in the longhorn-storage channel on the Rancher Slack. One thing I want to emphasize is that our 600-plus GitHub stars are purely organic. For the last few years, since Longhorn is still a product in the alpha stage, we haven't spent much marketing effort on it. Basically, we have an announcement once every month or two from the official Rancher Twitter account when we have a new release, a new demo, or a new master class coming, something like that. But other than that, we really haven't spent much effort on marketing Longhorn. Many of Longhorn's users really just come from people trying to find a storage solution: they compare all the other solutions out there, including the open source ones, and they find it very easy to get into Longhorn. That's how this community has basically grown organically to where it is right now.
So next we're going to announce a Longhorn beta, and we're targeting GA by the end of this year. After that we'll spend much more effort on marketing, because first we want to make sure that our product is ready and user friendly, that users are really going to like it, and that they can depend on us and trust their data to us. That's when we will launch our full marketing campaign. For now, you may have heard of the project alongside other Rancher projects like k3s and k3OS, but we don't really do much more marketing than that. Once we reach beta and GA we will do more, and I expect these numbers to grow substantially. Can I interrupt briefly with a question? Yeah, sure. I was just curious: is it correct that you've been working on this for five years now, roughly? Ah, yes. I was just curious; I mean, five years is a long time to still be in alpha. What was the reason for it taking so long? Okay, so the first thing is that we rewrote the whole thing. The first implementation was basically scrapped in early 2016, because we saw that it was way too complex; it was written in C and C++. So we basically just got rid of it, started from scratch, and rewrote everything in Go. That was 2016, and in 2017 we officially announced the project. But in fact, when we built Longhorn in 2016, we were really targeting Docker; sorry, not Rancher but Docker. So in 2017, when we announced the first version of Longhorn, we saw the community starting, but it didn't really ramp up very fast. At that time, in 2017, Kubernetes was, I think, on its way to becoming the universal application platform, but it hadn't reached there yet.
So at the time we were building on top of Docker, and we still required some external storage to store the state. Then around 2018 we basically rewrote the management plane again, because we saw that we could utilize Kubernetes for many of Longhorn's capabilities. So we rewrote the management plane to focus solely on Kubernetes, and that is the architecture you're going to see now: it's based entirely on Kubernetes. Since that full rewrite targeting Kubernetes, it has been about one year, and in fact those twenty-something releases all happened in this one to one-and-a-half-year period. So after that I think our progress is pretty decent. But the thing is, we really want to make sure that users can trust their data to us, because storage is really, really important. The worst thing is not that a volume is offline; that's bad, but the worst thing is that you somehow lose data. We have had many great pieces of user feedback, but we also want to make sure that many users have tried it without losing data. We did in fact have one case where a user accidentally deleted a replica which he thought was faulted, but which happened to contain the last valid copy of the data for his volume. That is the only known data loss so far, and we immediately patched it up: now, even if a replica is faulted, we don't allow the user to delete it if it contains the last copy of the known data.
So those are some of the efforts we put into usability and into making things stable, and that's why it's taking a really long time. Of course, we also had a few rewrites; we changed the front end as well, and that's another long story if you want to talk about it. Yeah, that answers my question, thanks very much. I don't want to derail the presentation. Okay. All right, so continuing. Currently Longhorn offers enterprise-grade distributed block storage, and Longhorn also offers volume snapshots and built-in volume backup and restore support. The difference between a snapshot and a backup here is that a snapshot is made in-cluster, for the in-cluster volume, so whatever snapshot you make stays in the cluster. But when you do a backup, we allow you to back up your volume to a third party, like S3, an S3-compatible object store like MinIO, or NFS. That way, even if the user loses their whole cluster, they still have access to their data. This is one part that differentiates us from many other solutions: we do the backup and restore ourselves, and we do it incrementally, because we think backup copies are important for the safety of the user's data, so we want to provide first-party support for that. So this is one key point of what makes Longhorn different. Another point is that Longhorn can currently do live upgrades without volume downtime, even on the data plane; I can explain more about how we did that later. And we support cross-cluster disaster recovery with defined RTO and RPO, which is also achieved with the help of our backup store, the location where you back up your volumes.
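To make the incremental backup idea concrete, here is a rough sketch in Go. This is not Longhorn's actual backup code; the `blockMap` type and the per-block checksums are hypothetical simplifications. The point is just that only blocks whose content changed since the last backed-up snapshot need to be uploaded to the backup store.

```go
package main

import "fmt"

// blockMap maps a block index to a checksum of that block's content.
// In a real system the checksums would come from hashing the snapshot's blocks.
type blockMap map[int64]string

// incrementalDiff returns the block indices that must be uploaded to bring
// the backup store from the previously backed-up snapshot to the current one:
// blocks that are new, plus blocks whose content changed.
func incrementalDiff(prev, curr blockMap) []int64 {
	var changed []int64
	for idx, sum := range curr {
		if prevSum, ok := prev[idx]; !ok || prevSum != sum {
			changed = append(changed, idx)
		}
	}
	return changed
}

func main() {
	prev := blockMap{0: "aaa", 1: "bbb"}
	curr := blockMap{0: "aaa", 1: "ccc", 2: "ddd"} // block 1 changed, block 2 is new
	// Only 2 of the 3 blocks need uploading; block 0 is unchanged.
	fmt.Println(len(incrementalDiff(prev, curr)))
}
```

A full restore would then replay the base backup plus each incremental diff in order; deletions and garbage collection in the backup store are a separate concern not shown here.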
Longhorn provides an intuitive UI. In fact, the first thing many users notice about Longhorn is that we provide a dashboard and a full-functionality UI that exposes all of Longhorn's capabilities. You can really operate it easily once you see the UI; it's really intuitive, and many users like it. That's probably one of the reasons why they get into Longhorn and like it even though Longhorn is still in the alpha stage. Longhorn is built as a Kubernetes-native application. That means we use the controller pattern on CRDs for the management plane, you can install Longhorn with just one line of kubectl apply or a Helm installation, and Longhorn runs on the existing Kubernetes cluster. One thing to note here: when we say you can do a one-line installation using kubectl apply -f, we really mean it. Because, you know, the devil is in the details. Many storage vendors and applications claim that you can just do a one-line install with kubectl apply -f, but then you have to choose all kinds of options: they ask what your Kubernetes version is, what the driver is, which options you want, you have to fill all of that in, and then they generate the YAML file for you. We spent a lot of effort making it easier for users to get started with Longhorn, so for most of those mechanisms we just built in automatic detection. Especially on the driver part, the CSI driver: we deploy a different CSI version depending on what your Kubernetes version is, and if your Kubernetes is too old, we deploy the Flexvolume driver instead. In each case, we detect the correct directory for Longhorn to install the driver into, so your Kubernetes can connect to the volume correctly.
So basically we detect the Kubernetes version, the distribution version, and anything else we can, to make sure we minimize the configuration burden on the user and that the user can easily start using Longhorn. All right. This is the Longhorn architecture for the data plane, which we call the engine. Assume you have two nodes here, and each node has a bunch of disks, RAM, and CPU. When one pod asks for one volume, Longhorn will receive the request and start two replicas, and each replica will be placed on a disk on two different nodes, because we want to make sure that if one node goes down, the volume still works as long as there is another running replica. We also start a Longhorn engine, which is the name for the per-volume microcontroller, and the Longhorn engine connects to the replicas. The engine also exposes a block device on the host, on the node, and that block device will be used by the Kubernetes CSI driver: the CSI driver formats it and mounts it to a directory on the node, and that directory is bind-mounted into the pod to be used as the volume for the pod. So as you can see here, the architecture of the Longhorn data plane is very simple. If we have multiple volumes, we just start multiple replicas and engines, and everything else stays the same. Another benefit of this architecture is that the data path is isolated between the different volumes, so if anything happens to the data path of one volume, the other volumes are not affected. But how can we operate and orchestrate all these engines and replicas? That is where Kubernetes comes into the picture: the engines and the replicas are all orchestrated by Kubernetes. I will explain more in the next slide. So, Sheng, quick question.
So each of those engine and replica instances, are they separate individual containers in this case? Yeah, so conceptually we designed them to be separate instances, and in the current implementation of Longhorn they are separate pods: basically we start one pod per engine and one pod per replica. But then, as you can realize, that quickly becomes a problem, because Kubernetes has a limitation of 110 pods per node, and we already have users hitting that on very large Kubernetes nodes. So in the next release we are rearchitecting how the engine and replicas are started, and we're going to start them as processes instead of pods. One pod will contain multiple processes; in fact, on each node there will be one pod containing the engines and another pod containing the replicas. Inside those pods, the replicas still run as independent, separate processes, and the engines are also separate. That is how we are going to solve the problem of exhausting the pod count on the node. I see, okay. And the replica is effectively the process that actually writes to the backend disk, whereas the engine provides the frontend that gets mounted? Yeah, that's correct. Yeah, that's absolutely correct. If I can provide some advice: GlusterFS had the same model, where they had a process per volume. In the end, when I was part of the project and we tried to containerize it and so on, we noticed that it was consuming a lot of memory for many thousands of volumes. So instead they did what they called multiplexing; in other words, a single process was able to handle many volumes. So, just something to think about as you pick up support for thousands and thousands of volumes.
Yeah, yeah, definitely. So basically, at first we went with pods because that seemed like the obvious choice, since everything else is very complicated. Then we hit the limit of 110, so we decided to run them as processes. In fact, we also thought about multiplexing, but we are not sure how complex that would be, because doing multiplexing into the same process would take much more effort than just running our existing framework with a single instance for each engine and replica. But yeah, I definitely think that's something we need to consider. If we were designing for, say, thousands of volumes per node, then of course we would consider how to do multiplexing and have one process handle more requests; right now we're talking about some hundreds per node, because we are block storage. Of course that approach saves memory and is more efficient, but at this moment we think one process per instance is good. We will take it into consideration if, in the future, we see much larger numbers of volumes used by the pods on each node. One additional related comment: even independently of how many of these volumes you have, essentially to provision a piece of block storage you're consuming RAM and CPU as well, and those are vastly more expensive than the actual storage. So that's the other motivation, irrespective of the storage cost, because there's so much cost tied up in the RAM and CPU. Yeah, so I think currently the RAM consumption of our engine and replicas is okay.
But the CPU, sometimes when we run pressure tests and benchmarks, the CPU utilization is something we need to deal with. So yeah, in fact we thought a lot about whether to keep single instances or have one instance handle multiple requests. In the end, for now, we just wanted it to be simpler, and we wanted it to be at least reliable. At the current stage, because we have spent much effort on it and users have tried it many times, it at least seems stable for now. In the future, of course, if it's really needed for larger scale, we definitely may have to change to that model. So yeah, I can see that's probably one thing we have to address in the future, but currently we think this model is sufficient for the current usage. Thanks, that answers that question. I'm also quite curious how your replication algorithm between the replicas works; are you going to cover that later? I can answer now. In fact, it's quite simple: everything is synchronous on the replica side. We do synchronous replication; each replica is supposed to be the same as every other replica. So when any write is sent by the engine to the replicas, the engine waits for the replicas to confirm that it is written before it responds back to the block layer and says that this block has been written. So that was sort of my question. If you have two replicas, for example, of a volume that you replicate, then you've essentially doubled, or multiplied by a very large amount, the failure probability, because if you can't get to either replica, or the network between them fails, you lose the volume, essentially.
Yeah, so in fact the engine itself has a detection mechanism for the replicas. If a replica doesn't respond, or doesn't confirm the write within a certain time limit, the engine will just cut it off, and the manager will start another replica and begin the rebuilding process. Yeah, we know that replication is definitely the data-intensive and correctness-critical part. But the thing is, at least for now, we want to keep it simple, and we cannot think of a better way, because Longhorn was designed to be crash-consistent storage. If we built in something like locality, there would definitely be differences between the local replicas and the remote replicas, and there would be many more things to deal with in that area. So I think for now we just have to live with this write amplification problem, and we can see if we can improve it in the future. A couple more questions, excuse me. Can the pods that consume a volume only be scheduled to one of the two nodes that the replicas exist on? Oh, no. The nodes that provide storage to Longhorn don't have to be the same nodes that use the storage. Basically we deploy across the whole cluster; the replicas don't need to be on the same node as the consumer, but the engine has to be on the same node. Got it. And second question: are you constrained by the size of the smallest disk when provisioning new volumes? Yes. Okay, and then a third question: how do you discover the local disks to use? Oh, the local disks are discovered by the user specifying which paths in the local file system the new disks are mounted on.
We do have some error detection built in, in case of double counting; of course we don't want double counting, say using the same file system for two different directories. But basically, what the user needs to do to add a new disk is format it, mount it on a path in the node's directory tree, and tell Longhorn about that path. And our last question is: do you support raw block, or only file on block? Oh, yeah, so we're working on that. Raw block isn't supported yet; currently we provide a file system on top of the block device through CSI. That's something we are going to get in. Sounds good, thank you so much. Thank you. I have a follow-up question. Yeah. You said that the engine is always with the volume? Yes, the engine is always on the same node where the pod is running. Okay, so that means the engine is the one doing the replication; in other words, Longhorn does client-side replication instead of server-side replication. Is that correct? Sorry, what do you mean by client-side? Client-side replication means that the client, meaning where the volume is being requested or used, is the one that copies the data to two different nodes when a write goes down through the IO path. Is that the way it works? Yeah, I think so: if you think of the engine as the client, the engine is going to write two copies of the data to the two replicas. Okay, because again, this is very close to Gluster, and Gluster has client-side replication. And one of the issues with client-side replication, specifically with replica two, is that you may get a lot of split brain.
So one of the things they wanted to do in Gluster is server-side replication, like Ceph does. That way the server can decide when to send the replicas, how to log the replicas, and so on; you may have a lot more power there. So it's just something else to think about when doing client-side replication. So, sorry, if I understand correctly, you said that client-side replication may lead to split brain? It may, yes. And actually, through many years, Gluster has been trying to deal with split brain, and one of the solutions is to use server-side replication. Yeah, I understand what you mean, but I think things are a little different here, because GlusterFS is a distributed file system: your client has to run on every node to provide service. For Kubernetes, you can have ReadWriteMany, but the block device that Longhorn provides is a ReadWriteOnce type of service. We are block storage, so we only provide the storage on one node, and in that sense the engine is the one on that node. There is only one engine on that node, and there are no other engines connecting to the replicas, so split brain is not a problem here. It could be true. Okay, yeah, thank you. Thank you. I actually had some similar thoughts on that. I think split brain is actually independent of whether it's client-side or server-side replication; you have similar problems in both cases. I don't want to go into too much detail now, but you can imagine many different cases where the network connection on one node is intermittently failing, and then you try to write to both replicas, but you don't know which replicas you actually wrote to.
Unless you have a fairly sophisticated protocol like Paxos or something to figure out which is the master replica and which one is considered to be the truth, etc. Okay. Yeah, it's very problematic. Yeah, I understand. So currently, the first point is that the single source of truth for replica state is the engine. And the second point is that currently we detect failures basically depending on whether the engine thinks a replica is bad. We also know which replica was the last one to receive a write command from the engine, since there is only one engine. So in that sense we somewhat mitigate the split brain problem. I think you still have the problem, because you can send a request, the replicas can get written, but you don't know whether they got written because the response gets lost; so the engine doesn't know which replica got written and which one didn't. Yeah, but in fact, in that case we currently just drop the replica that doesn't respond and continue with the other replicas. Yeah, but if the network errors are on the client side, like the pod's node has a bad network connection, then both replicas go bad from the point of view of the engine at the same time. For that case we have a salvage mechanism. Basically, if the engine thinks every replica is bad, the volume of course goes down. And we have another mechanism for when the engine goes down and the volume cannot be served: we can take a look into the replicas and try to figure out which one really received the most recent writes and contains the most written data. That one will be chosen as the source of truth and will represent the state of the data.
And then the engine can start with that replica and start rebuilding. Yeah, we perfectly understand this is a really complex problem, and we're trying our best to get this working, including the case where both of the two replicas fail. Unfortunately, if both replicas fail, the engine will probably shut down, the pod will lose access to the volume, and it will probably need to restart. Okay, thank you, that makes sense. And just a warning, I guess: in the whole process you're describing there, you basically end up implementing Paxos, deciding which one is the master, and building all that stuff. Yeah, well, Longhorn just treats every replica the same; we don't really have a master concept here. Okay. All right. Well, that's 40 minutes already. Gosh. Okay, I think I'm probably going to skip a few slides later. So that's about the Longhorn engine, and now let's talk about the management plane, the manager part. Of course, Longhorn runs on top of the Kubernetes cluster. When the Kubernetes cluster wants a persistent volume created and assigned to one of the pods, the cluster talks to CSI, and CSI talks to the Longhorn CSI plugin, which in turn calls the Longhorn API on the Longhorn manager. The Longhorn manager, as I said, is the one that orchestrates all the volumes. Whenever you create a new volume, the Longhorn manager, at the API level, will create a new volume object in the Kubernetes API server using a CRD. The creation of the new object will be picked up by the controllers in the Longhorn manager, and a controller will see: okay, this new volume was created, and it needs to be attached to some node.
So the controller will start the engine and the replica processes, deal with all of that, and connect this Longhorn volume to the node of the pod. If we have more volumes, we just create a few more sets of engines and replicas to serve those volumes. The way to access the Longhorn manager directly is, of course, through the Longhorn UI. The Longhorn UI covers functionality like create, delete, attach, detach, mount, and unmount; in fact, the current Longhorn UI can do basically everything Longhorn offers, and it provides the dashboard, snapshot and node management, backup and restore, and some more features like cross-cluster replication. We are also working on volume snapshot support and raw block device support for Kubernetes. Any questions? So, this is one example of how Longhorn uses the Kubernetes controller pattern to operate Longhorn volumes. For example, we have four nodes here, and nodes 1, 2, and 3 each have a replica running, with engine connections, and everything seems fine. Now, what if we somehow lose node 3? The engine will immediately detect that replica 3 has lost its connection, the engine will mark replica 3 as failed, and the manager will remove replica 3 from the engine's backend. As you can see on the right side, the volume is supposed to have three healthy replicas, but currently it only has two. So the manager will also see: okay, there's another node, node 4, which we can put a replica on. The manager will start a new pod with a new replica 4 instance (a new process, in later releases) and add it to the engine's backend. The engine will then see: okay, now I have two replicas, but I'm supposed to have three, and the missing one is replica 4. So the engine will connect to replica 4 and start the rebuilding process.
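The reconcile step just described, comparing the desired replica count against the observed healthy replicas and scheduling a new one on an available node, could be sketched roughly like this in Go. The `Volume` type and the node names here are hypothetical simplifications, not Longhorn's actual CRD types:

```go
package main

import "fmt"

// Volume is a simplified stand-in for a volume CRD: DesiredReplicas plays
// the role of the spec, Healthy the observed status.
type Volume struct {
	DesiredReplicas int
	Healthy         []string        // nodes currently hosting a healthy replica
	FailedNodes     map[string]bool // nodes known to be down
}

// reconcile returns the nodes on which new replicas should be scheduled so
// that observed state converges to desired state, skipping failed nodes and
// nodes that already host a replica of this volume.
func reconcile(v Volume, nodes []string) []string {
	missing := v.DesiredReplicas - len(v.Healthy)
	used := map[string]bool{}
	for _, n := range v.Healthy {
		used[n] = true
	}
	var schedule []string
	for _, n := range nodes {
		if missing <= 0 {
			break
		}
		if v.FailedNodes[n] || used[n] {
			continue
		}
		schedule = append(schedule, n)
		missing--
	}
	return schedule
}

func main() {
	// Node 3 just failed, so only two of the desired three replicas are healthy.
	v := Volume{
		DesiredReplicas: 3,
		Healthy:         []string{"node1", "node2"},
		FailedNodes:     map[string]bool{"node3": true},
	}
	fmt.Println(reconcile(v, []string{"node1", "node2", "node3", "node4"}))
}
```

In the real controller this decision would be re-evaluated on every change to the volume object, and the chosen node would get a new replica process that the engine then rebuilds, as described in the walkthrough.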
Once the rebuilding process is completed, replica 4 will change into the healthy state and everything will be recovered. At that point the volume status will show three currently healthy replicas, and of course that matches the desired number of replicas, so everything is back to normal. All right. Just a very short question on that: so effectively the state of the volume, in terms of which replicas are on which nodes and which ones are healthy and so on, is actually stored within the CRD in the Kubernetes API, right? So the CRD reflects the state observed from the engine; the engine is still the single point of truth in this case. But what we observe in the engine, we store that in the CRD, in the engine status and, as I said, in the replica list. Thanks. Thank you. So, I don't know if I have time to go over the engine under the hood, so let me just go through this part quickly. As I said, Longhorn in the end uses Linux sparse files to store the differencing disks. We currently have a 512-byte block size, and reads are resolved lazily; let me explain how that works. I think many of you probably already know how this works; it's the standard way of handling snapshots, handling the data on the basis of the snapshots. The live data always has the highest priority. So when we read block 1, we read it from the highest-priority layer, which is the live data. When we read block 0, because the live data has no data for block 0, we check whether that data is in the newest snapshot; we find the block there and read it from there. For block 2, we find that neither the live data nor the newer snapshot has that block, only the oldest snapshot.
So we read it from the oldest snapshot, and so on. The fourth block is in fact empty: we search everything and find that no one has this data, so it's just returned as zeros. Likewise, block 7 comes from the live data, block 3 from the newest snapshot, block 5 from the middle snapshot, and block 6 from the live data. And if we write a new block, say the user now writes a new block into the volume and that block is block 5: we update our index, remove it from the original position, and redirect it to the live data, so the next time the user wants to read block 5, it will be read from the live data. Just to clarify, which copy-on-write mechanism are you using? I presume this is just standard copy-on-write. Yeah, we're just using the sparse files. And where is your metadata stored for your indexes? So the indexes are stored implicitly in the Linux sparse files. There's a function call, FIEMAP, with which you can get the layout of a sparse file. That's why Longhorn requires the underlying file system to be ext4 or XFS, which support sparse files. If the underlying file system used by Longhorn cannot do sparse files, we have no way to know where the valid data is. But if I understand you correctly, a read in this example on your slide here might involve, like, four reads: a single logical read might involve four physical reads. Yes, but after you read the first time, the index will be updated, so the next time we read the same data we'll know which layer to read from. Oh, so the index keeps the pointer to the actual physical block representing the logical block. Yes. And does that mean the index is also kept cached in the engine?
Yeah, the live volume... so there's the index of all the snapshots on each replica, and then there's an index on the engine that has the live data? Okay, so, sorry: the cache is in fact on the replica. The cache is in memory, and every time the engine wants to read something, it just sends the read to one of the replicas, and that replica has the responsibility of knowing where the block is, in which snapshot, and which one it should read from. So that in-memory cache, the map, is in the replica, but we don't store it physically on disk, so if you reload the volume, the cache needs to be rebuilt again. So, just a quick question then: doesn't that imply quite a large memory overhead? Because if you had, you know, a volume a couple of hundred gigabytes in size, for example, doesn't that mean you end up with millions if not hundreds of millions of keys in the index that need to be in memory? Yeah, we store that using one byte per block; I'd need to redo the calculation to see how much that is. Yeah, you're going to have some memory overhead here. Okay, all right. Yeah, if your blocks are 512 bytes, then I guess you've got somewhere around about 0.2% of your disk size. Yeah. Yes, you may want to increase that to at least 64K or something, and then you'll reduce the memory pressure. Yeah, so the thing is, this 512-byte block size kind of coincides with the qcow2 block size, because we also support using a qcow2 file as the base image for your volume. When you use a qcow2 file you have to align with it, so we basically use 512 because qcow2 aligns to 512.
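The snapshot read chain and the per-replica in-memory index discussed above can be sketched as follows. This is a toy model, not Longhorn's actual format: in Longhorn the layout lives implicitly in the sparse files (recovered via FIEMAP), and the index costs roughly one byte per 512-byte block, i.e. about 1/512 ≈ 0.2% of the volume size in memory:

```python
# Toy model of the snapshot chain: each layer maps block number -> data,
# searched newest-first with the live (writable) layer on top. Missing
# blocks are sparse holes and read back as zeros. After the first lookup
# the block's owning layer is cached in an in-memory index, so later
# reads of the same block need only one lookup.

class Volume:
    def __init__(self, snapshots):
        self.live = {}               # live, writable layer
        self.snapshots = snapshots   # list of layers, newest first
        self.index = {}              # block -> layer holding its data

    def read(self, block):
        if block in self.index:      # index hit: single lookup
            return self.index[block].get(block, b"\x00")
        for layer in [self.live] + self.snapshots:
            if block in layer:       # walk the chain, newest first
                self.index[block] = layer
                return layer[block]
        return b"\x00"               # sparse hole: return zeros

    def write(self, block, data):
        self.live[block] = data      # redirect-on-write to live data
        self.index[block] = self.live

oldest = {2: b"old2"}
newest = {0: b"new0"}
vol = Volume([newest, oldest])
vol.live[1] = b"live1"
print(vol.read(1))   # served from live data
print(vol.read(0))   # served from the newest snapshot
print(vol.read(2))   # falls through to the oldest snapshot
print(vol.read(4))   # hole: zeros
vol.write(2, b"live2")
print(vol.read(2))   # now redirected to live data
```

As in the discussion, the trade-off is visible here: smaller blocks mean a larger index; larger blocks shrink the index but amplify copy-on-write.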
But I think, yeah, I think it's a good idea to upsize this. Originally we had 4K, and we can probably go even bigger, but we need to measure the overhead and compare it with the memory usage to decide what the optimal block size is. The thing is, the block size has to be fixed for one volume; otherwise we're not going to have a very good time trying to figure out where the location of the data is. But I guess you have to; there's a compromise there, right? Yeah, it's a compromise. If the block size is small, you have large indexes, and presumably those indexes grow with the number of snapshots you have. But if you have a large block size, the index is smaller but you have a higher probability of copy-on-write, so you're going to waste more space as well. Okay. And I was just wondering, since this seems like a lot of great discussion: do we want to do this again, or how are we going to end today? Let me see; I think I'm almost done. Okay. Yeah. So this next one is in fact the same concept applied to how we do backups. We do backups in an incremental way, as you see here: the right-hand side shows the changed blocks in the different snapshots, and the left-hand side shows how we store those blocks in the backup. In the backup, our block size is two megabytes. Of course, when you do that, you have to convert from the in-cluster block size to the backup block size and calculate the new layout. But the advantage is that the stored backup is basically only pointers to the backup blocks. For example, if you look at the green blocks coming from snapshot 2: the snapshot 2 backup only has references to three blocks, one orange block from snapshot 1 and two green blocks from snapshot 2.
And when we do a backup for snapshot 3, we see how snapshot 3 differs from snapshot 2, which we've already backed up. Snapshot 3 only has two changed blocks. So what really happens is that the snapshot 3 backup just copies snapshot 2's metadata, plus those two changed blocks, and updates the references for the first and second blocks to the blocks we copied from snapshot 3. That's how we implement incremental backup. Also, with the disaster recovery volume feature, our backups are applied incrementally as well. So that's basically it on how we do snapshot, backup, and restore. Does that mean that your backup feature uses the index metadata to determine this information, or...? Well, if you mean the read cache, the read index, we're not using that part. In the backup mechanism we look at the real layout of each snapshot. And of course this only works if the previous snapshot was backed up and still exists; if the previous snapshot doesn't exist, it won't work. So I'm trying to understand where that metadata is kept. So we're still looking at the Linux sparse file, using that FIEMAP call, and getting the layout from the snapshot in real time; because snapshots don't change, there's no race condition or anything. So we get the layout, we calculate which two-megabyte blocks we need to copy, and we just back those up and update the references in the new snapshot backup. So maybe, I mean, not in this context of the CNCF, but maybe a Longhorn-specific talk on just how you guys do snapshots; I'm very curious. Okay, all right. Thank you. Thank you. Yeah. All right.
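The incremental backup scheme above can be sketched as a toy model in which a "backup" is just a list of references (content hashes here) into a shared block store, and unchanged blocks reuse the previous backup's references. The helper names and the use of hashing as the reference are illustrative assumptions; Longhorn's real backup blocks are 2 MB and the changed blocks are found from the FIEMAP layout:

```python
import hashlib

# Toy model of incremental backup: each backup stores only references
# into a shared block store; blocks unchanged since the previous backup
# are reused from its metadata rather than uploaded again.

block_store = {}   # reference (hash) -> block data, shared by all backups

def backup(snapshot_blocks, prev_backup=None):
    """Back up a snapshot given its full list of block contents.

    prev_backup is the previous backup's reference list (or None for
    the first, full backup). Returns this backup's reference list.
    """
    refs = []
    for i, data in enumerate(snapshot_blocks):
        h = hashlib.sha256(data).hexdigest()
        if prev_backup is not None and i < len(prev_backup) and prev_backup[i] == h:
            refs.append(h)                    # unchanged: reuse reference
        else:
            block_store.setdefault(h, data)   # changed: upload the block
            refs.append(h)
    return refs

snap1 = [b"A", b"B", b"C"]
b1 = backup(snap1)                  # full backup: uploads 3 blocks
snap2 = [b"A", b"X", b"C"]          # only the middle block changed
b2 = backup(snap2, prev_backup=b1)  # uploads just 1 new block
print(len(block_store))             # 4 unique blocks stored in total
print(b1[0] == b2[0], b1[2] == b2[2])  # unchanged blocks share references
```

As in the talk, restoring or deleting a backup then only needs the reference lists, and the chain property holds: the incremental step is only meaningful while the previous backup's references still exist.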
So the next one is just how the backup store organizes this. It's very simple: we have a configuration per volume, and this volume has two snapshot backups and five blocks, and basically the backups just store references to the blocks. All right. So this is the last slide, I think: how we do the live upgrade. What we do, because we have a Unix domain socket connecting the frontend with the engine, is this: when you want to upgrade the data plane, we start another set of engine and replicas, using the same disks, basically the same locations for the data, and make the new replicas point to them. Then we wait for the outstanding reads and writes to complete and immediately switch over to the new engine. After the switch is done, the old engine can be gotten rid of. Yeah. Sorry, this is a good picture, but we didn't talk about how pods that are not on the nodes with the replicas communicate, which is actually a continuation of the earlier question. Is that all based on iSCSI and connection, attachment, detachment? Okay, so for the communication: TGT, which we use as the framework for our frontend, exposes an iSCSI target, and that target internally connects to our Longhorn engine through a Unix domain socket. We know that's not efficient at all, so we're going to change that in the future, but currently it works this way. So the TGT frontend exposes an iSCSI target, and we use the iSCSI daemon on the host to talk to that iSCSI target, which is in fact on the same host, and that exposes the block device there. This frontend is one of the parts we think the most overhead comes from, but the current performance of Longhorn is not bad.
So we're more focused on stability at the moment, but we definitely think a lot can be done to improve this frontend. We had another frontend before, called TCMU, which uses the in-kernel Linux SCSI target, what's called LIO. Through TCMU you can expose a block device directly. I also contributed a few kernel patches to TCMU to make it much faster, because previously it did reads and writes in a synchronous way. So that wasn't really ready for production use, and in the end we decided to go with TGT, because with patches in the kernel, it takes years for them to reach downstream, to land in the distributions, and we don't want to create a barrier to entry for users. Also, any bug you find in the kernel will take many months at least to reach the downstream distributions. So we decided, okay, we'll just go with the user-space solution here and make sure more users can access Longhorn. Completely agree. Excellent. Thank you. Thank you. Actually, I have to drop off, but I look forward to more of this. Yeah, in fact, that was basically the last technical slide, so no worries. Yeah, I think we're coming to the end of the time. Thanks, Sheng, for this presentation; this has been great. I think maybe we should do a short follow-on call to cover what we didn't get to today. Okay, yeah. But thanks again, everyone, and obviously feel free to ask questions on the Slack channel as well. Take care, guys. All right, thank you. Bye. Thank you.