Good evening, everyone. I'm Hari. I work on the Gluster project at Red Hat, and in this session you will see how to set up a Gluster volume and keep a geo-replicated copy of it in another place. I'll explain how that works, along with where geo-replication is headed.

To start with, let me explain Gluster so that you get an understanding of how things are actually going to work. Gluster is a distributed file system which provides easy scale-out when necessary: you can add as much storage as you want on the go, without taking Gluster down. The other benefit is that Gluster doesn't have a metadata server, so there is no bottleneck there and no single point of failure either. And you can run it on pretty much any hardware; you don't have to buy a dedicated box to make use of Gluster. It also provides the basic file system features, like replication, erasure coding, bitrot detection, and a few more.

The terminology you should be aware of for this presentation: what is a brick, what is a server, and so on. I'll explain them one by one. A brick is basically a hard disk mounted at some path on a machine; that mount directory is called a brick, and it runs as a separate process on that node. A server is the node where your bricks reside; a server can have a single brick or any number of bricks. A volume is a collection of these bricks across various nodes, and it is the logical entity we use to perform the various operations in Gluster. Then comes the client: once a Gluster volume has been created, you can mount it on a particular node, and that node becomes a client. All the operations on that client are transferred back and forth to the servers, and Gluster takes care of that. Finally there is the trusted storage pool, which is the collection of servers within which you can create volumes; all the operations you perform happen within the trusted storage pool.

As I said, this diagram will help you understand it better. There can be a number of servers which form a trusted storage pool, and within the servers you can have any number of bricks, as per each server's capacity. A volume is a collection of bricks from various servers; it can be within one server or span servers, and the number of bricks and everything else is configured when you create the volume. Once the volume is created, you mount it on a separate machine, which will be the client where you do the I/O.

And now, geo-replication. Why did we come to geo-replication? Basically, it's for disaster recovery. With normal replication you can keep a copy at a different place, but the I/O would take a long time to be synced between distant sites, so we go for geo-replication instead. The requirements for geo-replication are: you have to copy the files from one cluster to another, and you want to know up to what point the data has been copied. That is achieved through checkpoints; we make a checkpoint at regular intervals, so you know that whatever data existed up to that point is available in the replicated copy. And then comes making the copying process efficient, which I'll get into later in the talk.
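To make the terminology a little more concrete, here is a minimal sketch of how you would look at these pieces from the command line; the volume name "demovol", the server names and the paths are made up for illustration:

    # list the servers that form the trusted storage pool
    gluster pool list

    # show how a volume maps to bricks on different servers
    gluster volume info demovol

    # each brick runs as its own process on its server; status shows them
    gluster volume status demovol

    # on a client machine, the whole volume is mounted as a single file system
    mount -t glusterfs server1:/demovol /mnt/demovol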
So, how did we achieve geo-replication with Gluster? The first thing is that you have a volume which is supposed to be copied from one data center to another. The main volume is called the master volume, and it is copied to a volume in a different data center, which is called the slave volume. On the master side, every file operation that goes to the master volume gets recorded, and these records are called change logs. The change log lives on the brick side, which I'll explain with another diagram: everything you write on the mount goes to the various bricks as per the volume configuration, and when the I/O reaches a brick, the brick records the change log, so it knows which operations were done on that particular brick. That's basically a change log. Based on how the change logs are generated, there are three ways in which we replicate the volume from the master to the slave. The benefit of change logs is that if you're trying to copy a particular file operation from the master to the slave and it errors out for some reason, you can retry that particular change instead of re-syncing the whole file. That's also how we create checkpoints: we can say that everything up to the tenth change log has been synced to the slave volume, so you have a checkpoint at that point.

Basically, as I said, in the trusted storage pool you will have a number of servers on which you have created your master volume, and that master volume is mounted somewhere. Everything written on the master volume has to be copied to a slave volume in a different data center; let's assume the master is in Bangalore and the slave is in the US. So you create another volume over there in the US using Gluster, make it the slave volume, and establish a geo-replication session to it, which will copy your data to the slave volume.

How we copy the data is done in basically three ways. Sorry, one minute, I'll show this diagram. Yeah. Before you create a geo-replication session, you may already have a lot of data stored in your bricks, and since geo-replication wasn't enabled yet, you won't have the change logs I was talking about on the brick side to tell you what changes happened. At this point you have no change logs to sync, so we do something called a hybrid crawl, where we crawl the brick and generate pseudo change logs, and those change logs can later be consumed by geo-rep to sync from the master to the slave. That covers the initial part of the sync, and from there you move to the change log crawl: once all the data up to that point is copied, whatever you write gets recorded in the change log, and those changes are synced from the master to the slave using the change log crawl. The third scenario is that you have a number of change logs already available, but you had stopped geo-replication for some reason, so there is a huge backlog of change logs on the master that are yet to be synced. In that case you use the history crawl, which consumes all the change logs produced so far on the master and syncs them to the slave. So those are the three different ways we sync data from the master to the slave.
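As a rough illustration of the checkpoint idea mentioned earlier, a checkpoint on an existing geo-replication session is set through the session's configuration; the volume and host names below are placeholders:

    # mark the current point in time as a checkpoint for the session
    gluster volume geo-replication mastervol slavehost::slavevol config checkpoint now

    # the detailed status then reports whether the checkpoint has been completed on the slave
    gluster volume geo-replication mastervol slavehost::slavevol status detail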
Then come the various components involved in geo-replication. As I said, the brick side has a translator, a translator being a basic unit in Gluster that performs a certain operation. This change log translator sees every operation performed on that volume and records it into a flat file. Whatever is recorded in the flat file is later consumed by a process called the agent, which reads the flat file and understands what is written in it, like which file was changed or created and so on. The agent uses libgfapi to interpret what the flat file actually says, because those are binary files. From that, the agent gives its output to the worker, and the worker is the one that actually copies your data from the master to the slave. The worker and the agent processes are watched by the monitor, so if one goes down, the monitor is responsible for starting it back up.

Those are the basic components, the internals of geo-replication. As I said, let's assume we have a mount at the top. From the mount, all the I/O that you do comes down to the brick to be stored at some point. Let's say you did a write; that write comes down here (a read would come the same way), and on its way it has to cross the change log translator. The change log translator records everything to the flat file, and at some point the agent reads that flat file and hands it to the worker, for the worker to copy from the master to the slave volume. That is how geo-rep basically syncs your data from the master to the slave.

The disadvantage of the current approach is this: we have something called a GFID in Gluster, which is similar to an inode number in a file system. The GFID is how Gluster identifies each and every file in the whole volume; it is a unique identifier for each file. As of now, the whole structure of your file system is linked through GFIDs, and from the master to the slave we replicate this whole structure, so we try to have the same GFID on the slave as on the master, and we perform all the operations based on GFIDs. This was the easier approach, and it's the one we are using currently. But it does have drawbacks. The slave has to be a Gluster volume, because only then can it understand what a GFID is; so you can't manually copy whatever you want from the master to the slave, because the GFIDs won't be there. And there are other Gluster problems, like GFID conflicts and so on. Apart from that, consider a scenario where you create a file, then delete it, and then create it again. With the current approach we reproduce all three operations on the slave, but the create and the delete are actually not necessary, because you are creating the file again anyway; you only have to create it once instead of doing it three times. So those drawbacks exist in the current approach.
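If you want to see what a GFID actually looks like, it is stored as an extended attribute on each file on the brick backend; a quick sketch, assuming a brick at /bricks/brick1 and a file called file1:

    # run on the server, against the brick's backend path (not the client mount)
    getfattr -n trusted.gfid -e hex /bricks/brick1/file1
    # prints something like: trusted.gfid=0x<16-byte identifier>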
The other problem is that there can be races between creating and deleting a particular directory on the master volume, and the order in which those operations get forwarded can result in various situations, which is not a good thing when syncing data from the master to the slave. The way we got around this is through something called path-based geo-replication. Instead of syncing the GFID of each file from the master to the slave, we created a mechanism where the parent's GFID is stored on the file itself. That way, for every file you know its parent, and you can construct the path of that file. Using this path, you just tell the slave volume to create the file at that particular path. The structure, which used to be linked through GFIDs, is now broken down into paths, which is easier. So when you create a file, we store this parent information on the file so that the path can be constructed, and we use a tool which is capable of generating the whole path from the file. The slave now just gets the path, instead of getting the GFID and reconstructing the whole structure up there. This simplifies the process, and the dependency on GFIDs is removed.

The advantages of this approach: as you can see, the GFIDs no longer need to be the same on the master and the slave. The slave doesn't even need GFIDs, so it doesn't need to be a Gluster volume at all. With this approach you can sync a Gluster volume to any other kind of volume, be it your normal file system, CIFS, or whatever you want; that's an advantage. And if you see a failure during the sync of a particular file, you can manually copy that file from the master to the slave, which is another thing we can't do right now. And if you remember the hybrid crawl, where I said we have to crawl the whole file system and then create the change logs, that was quite expensive; instead, here we can come up with tools which just walk the file system and sync it to the slave. That reduces some of the overhead. Those are the advantages of basic path-based geo-replication.

And now the demo. Yeah, so I'll show how to set up a basic geo-replicated setup. Yes, a little more. Is this fine? Yeah. I'll explain how to create a master volume and sync it to a slave with this demo. I'm now getting inside a machine. The first step is to create the trusted storage pool, and to do that you peer probe from one particular node. Let's assume we are on 134; from 134 I'm peer probing 136, and now the trusted storage pool has two servers, as you can see. These two will be the servers, and within this pool I'm going to create the master volume. You can see there are two bricks, one on each of the servers, and using these I have created a Gluster volume called the master volume. And then I am starting the master volume.
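For reference, this part of the demo roughly corresponds to commands like the following; the IP addresses and brick paths are placeholders for the two demo machines:

    # from the first node, add the second node to the trusted storage pool
    gluster peer probe 192.168.1.136

    # create the master volume with one brick on each server, then start it
    gluster volume create mastervol 192.168.1.134:/bricks/brick1 192.168.1.136:/bricks/brick1
    gluster volume start mastervol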
After this I'm creating another volume called shared storage, which is something you can ignore right now; it's only necessary in certain setups, for example with a replicated configuration, and I can explain that later if you want. Then I'm starting that volume as well. So far I have set up the master; let's go back a line. Yeah. With this command I'm mounting the master volume on a particular directory of this machine, so now you have a mount where you can write files and they will be stored in the bricks. And then I'm doing the same for the other volume as well.

Once that's done, I create a session between the two volumes. There is a command you could use here to create the second volume; since I have already created it, as I'm showing now, it says that it is not necessary. Then comes the first command that is actually required for geo-replication. What happens here is that with this command, each of your servers gets an SSH key generated, and all those SSH keys are collected onto the initiator node, which is 136, sorry, 134. Yeah, 134. After that I use this command to push these SSH keys to the slave server, which is going to be 70. Between server 70 and 134 I have already set up passwordless SSH, so it can move things there. So I'm pushing the SSH keys, and these will be used for actually syncing your files; the syncing of files is basically done using rsync or tar over SSH, and that's why we are sharing these keys. And then this is to use the volume that I created.

So I'm starting the geo-rep session here with the start command, and after this you can see that the status is Initializing. Then I'll log on to the master machine, but before that I'll show the master volume: if you see, I'm doing an ls on the master's mount, and we don't have any files created as of now. Now I'm SSHing into the slave cluster, where you can see I've done the same procedure: I've created a trusted storage pool by peer probing and then created a volume called slave. On this slave volume's mount I do an ls to see what's inside, and you can see that it is empty right now. Now I go back to the master cluster, where I create ten files on the master's mount, and you can see the ten files being created. I SSH back to the slave, do an ls on the slave mount, and you can see the ten files have been copied from the master to the slave.

So with this minimal number of steps it's easy to set up two Gluster volumes, establish geo-replication between them, and get your data synced from one data center to the other. Gluster itself is quite simple; you can do this even with a single node, but just to give an understanding of peer probing and so on, I showed it this way. The size of the volumes and so on differs based on how you create them. That's it for the demo. Any questions?

Yes? What about the scalability of the replicas, how many replicas can I sanely configure? Yeah, so your master volume and your slave volume are totally independent. It's just that when you create the session, there's a command before the geo-replication start, the push-pem step, which makes sure that your slave is bigger in size than the master. If that's the case, geo-replication will start fine.
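The session setup from the demo, including that push-pem step, roughly corresponds to commands like these; the volume names and the slave host are placeholders:

    # generate SSH keys on all master nodes and collect them on the initiator node
    gluster system:: execute gsec_create

    # create the session and push the collected keys to the slave (this also verifies the slave volume)
    gluster volume geo-replication mastervol slavehost::slavevol create push-pem

    # start the session and watch the status move on from Initializing
    gluster volume geo-replication mastervol slavehost::slavevol start
    gluster volume geo-replication mastervol slavehost::slavevol status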
So when you expand your volume, you expand the master separately, and similarly, when you are running out of storage on the slave, you expand the slave separately. Those are done with separate commands, like gluster volume add-brick and so on; that way you add more bricks, which means more disks and a bigger storage.

That's nice, but I was thinking more about making multiple replicas. Yes, you can do that. You're talking about multiple slaves, right? You can create multiple slave volumes and establish... I mean more than two sites, that's what I'm talking about: master to two or three or n. Yes, that's possible right now; you have to create a session per slave. What would you recommend as a maximum, or what do you support? So far I have seen setups where we sync from one master to around three slaves, and that works fine, but I don't know where the limit is; that's something I'd have to look into.

Does the transfer mechanism you mentioned use rsync? Yes. Say I'm replicating large files and it gets interrupted at 90 percent, what does it do? So I talked about change logs, right? Let's assume you are creating a file of 100 GB, 99 GB is copied, and the last 1 GB is yet to be copied. What happens is that the change log does not record it as one whole chunk of 100 GB; it records it as a series of smaller writes, a few MB or KB at a time. So if it fails at a particular point, whatever was synced until that point is fine, and whatever failed is retried; that's what I mentioned earlier about retrying a particular operation alone. Let's assume that after the 99 GB there is a write going on and it fails; we retry that write until it passes, and then it continues. That's how we handle it. And this is tracked on the master.

Sorry, I didn't get that part: how do you track it on the master side, how does the change log tell you that you have, for example, 10 percent left? It's not quite that way. Let's assume you have a file and you're writing 10 MB at a time: the first 10 MB, then the next 10 MB, and so on. One minute, I'll show you this diagram. From the mount, the write comes down to the brick where it gets stored, so the first 10 MB you write crosses the change log translator on its way to the brick, and the change log makes a record of that first 10 MB in the flat file. Then comes the next 10 MB, and the next, and so on. This way the flat file ends up with a list of all the writes that came in, one after the other, with their timestamps. The agent is the one which knows that it has synced the first 10 MB and has to sync the next; this is done using the timestamps. Say the first 10 MB was written at some time N and your next 10 MB comes in at N plus one; the agent reads the N plus one entry because it knows the checkpoint is at N, so the N plus one write is taken from the flat file, read from the master, and then written onto the slave. Does that explain it? Yeah. And on the slave side, you just send the writes, and based on the output from rsync or whatever is doing the copy, you get to know whether that particular write was a success or a failure. That's the way you know whether it has actually been written down on the slave, and if it has, you mark it as synced up to that checkpoint.
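Coming back to the earlier point about growing the master and the slave independently, expanding a volume is done with add-brick, roughly like this; the server name, brick path and volume name are placeholders, and a plain distributed volume is assumed:

    # add one more brick, i.e. one more disk's worth of storage, to the volume
    gluster volume add-brick mastervol server3:/bricks/brick1

    # spread the existing data across the old and new bricks
    gluster volume rebalance mastervol start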
Any other questions? So, yeah. Path-based geo-replication is being worked on by Aravinda, and he's also working on another project called Kadalu, which integrates Gluster with containers. The Gluster team from Red Hat, apart from me, is Kotresh, Sunny, and Shweta. If you have any questions you can reach out through any of these channels, and we also have Gluster community meetings where you can attend and ask your questions directly if you have any issues. All the links are mentioned in the presentation, and you can use those. Thank you.