Everyone, let's welcome Ravi Shankar, with Thin Arbiter for GlusterFS replication.

Thank you, guys. So my name is Ravi Shankar. I'm a senior software engineer with Red Hat. I've been working with GlusterFS for about seven years, mostly on the replication component, but I've also worked on other areas of Gluster like the CLI, glusterd and the posix translator. This talk is mainly about thin arbiter for GlusterFS replication.

The agenda for today: I'll spend the first few minutes on the first three bullet points, where I talk about what the GlusterFS architecture is and how it achieves replication using the Automatic File Replication, or AFR, translator. Then we will discuss how quorum logic is important in preventing split-brains when writing to files from multiple clients. Once we have an idea of these three things, we can go to the actual topic, which is thin arbiter for GlusterFS replication.

All right, so this is the architecture of Gluster. On one side you see many green boxes; they are all servers, server 1 to server n. Each server hosts a GlusterFS brick process, which is composed of many translators, starting from the protocol/server translator at the top and ending with posix at the bottom. All of these servers are connected together to form a trusted storage pool, and this is what the volume is composed of. On the other side you have the client, which accesses the volume via different mechanisms like FUSE or NFS-Ganesha, or via libgfapi, where you can write your own application using those bindings to access the volume.

Most of the logic in GlusterFS is done by translators, and each translator has a specific job. The replication translator, AFR, sits on the client side and has many children: depending on the replication factor, the client talks to the respective bricks and does the replication. Synchronous replication in Gluster is mainly client driven, meaning the client connects to all the bricks, the updates are sent synchronously to all the servers, and we wait for responses from all the servers before sending the response back to the application. So it follows a strong consistency model, unlike geo-replication, where the consistency is eventual. Here, the moment you do the write, because we propagate it to all the servers, the data is on all the disks immediately. The writes follow a transaction model, because multiple clients can access the same file and you need a transaction to prevent stale data, or partial writes going from one client to one brick and another client to the other brick. Reads are served from one of the replicas, and the slowest brick dominates write performance because we send the write to all the bricks and wait for all of them.

There is also the self-healing feature: when an update from a client does not reach all the bricks, the self-heal daemon keeps track of which files need healing, and when the brick comes back up, it automatically does the healing. To that effect, there are CLI commands to monitor the status of the pending heals, and there are also commands to resolve split-brains in the case of replica 2.
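For example, monitoring and resolving heals from the CLI looks roughly like this; the volume name, brick and file path are placeholders, and the available split-brain resolution policies can vary a bit by release:

    # list files pending heal, and any entries that are in split-brain
    gluster volume heal myvol info
    gluster volume heal myvol info split-brain

    # resolve a replica-2 split-brain by picking a winner, either by file size
    # or by declaring one brick the source for that file
    gluster volume heal myvol split-brain bigger-file /dir/file.txt
    gluster volume heal myvol split-brain source-brick server1:/bricks/b1 /dir/file.txt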
So I was telling you that writes follow a transaction model. There are basically five steps when a client does a write. The first is the lock: the client takes a lock on all the participating replicas. You need this because multiple clients can access the same file, and while doing the write you need a lock to prevent out-of-order writes. Once you get the lock, you do something called the pre-op. It's basically a setxattr call that goes over the wire: you mark an xattr on the file saying, hey, I'm about to do the write, and set something called a dirty bit. After that you do the actual write operation. If the write succeeds on all bricks, you clear the dirty bit in the post-op; if the write fails on some of the bricks, the good bricks mark another extended attribute blaming the bad bricks, saying there is something pending, so that when the bad brick comes back online the file can be healed. And then finally we do the unlock.

Reads are very simple: you just serve the read from one of the good bricks. AFR uses the extended attributes to know which brick is good and which is bad, so reads are always served from a good brick. Which brick the read gets served from is configurable; there are various policies. The default policy is a hash of the GFID of the file, which means that even if there are multiple clients, if they are accessing the same file they will go to the same brick. But you can also load-balance using other strategies, like mixing the hash of the GFID with the client PID, which is unique to each client, so you can distribute the reads too.

The self-heal daemon, as I was saying, is responsible for ensuring that missed writes are actually healed onto the bricks when they come back up. It runs on every node of the cluster and heals data, metadata and entries that were missed while one of the bricks was down. There are two ways to do the heal. One is to crawl the entire file system, which is a really naive and expensive way of doing it. So what AFR does instead is maintain the list of failures in a special directory on the brick, .glusterfs/indices. Whenever a write transaction fails on some of the bricks, the good bricks record the GFIDs of those files inside this directory, and when the bricks come back up, the self-heal daemon crawls this folder, gets the list of files that need to be healed, and does the heal. The self-heal daemon does the healing under locks, because clients can be writing to the same file while the healing is going on, so you have to take locks for exclusion from client I/O.

The traditional way of replicating had been replica 2, but the problem, as you might already know, is that replica 2 is prone to split-brains. There can be two types of split-brain: split-brain in time and split-brain in space. Split-brain in time is when a write from the same client succeeds on one brick and fails on the other, and the next write succeeds on the opposite brick: write 1 succeeds on brick 1 and fails on brick 2, write 2 fails on brick 1 and succeeds on brick 2. When both bricks come back up, the client doesn't know which is the good copy and you cannot resolve it.
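To make that xattr bookkeeping concrete, this is roughly what AFR's extended attributes look like when you inspect a file directly on a brick; the volume name myvol, the brick path and the hex values here are illustrative:

    # on brick 1 of a 1 x 2 volume, after a write that missed brick 2
    getfattr -d -m . -e hex /bricks/b1/file1
    # trusted.afr.dirty=0x000000000000000000000000           # dirty bit, cleared again in the post-op
    # trusted.afr.myvol-client-1=0x000000010000000000000000  # brick 1 blames brick 2: one data heal pending

    # the GFIDs of files needing heal are linked under the index directory
    # that the self-heal daemon crawls
    ls /bricks/b1/.glusterfs/indices/xattrop/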
The other one is split-brain in space, where clients can each see the bricks only partially. Say there are two clients, client 1 and client 2, and each of them can see only one brick. You still allow the writes, because there is no concept of quorum in replica 2, so you can end up in a split-brain in that state as well.

So how do you avoid split-brains? You have to have a notion of quorum, which means you need to go to at least replica 3, or in general an odd number of replicas. The general rule is that with 2n+1 replicas you can tolerate the failure of at most n nodes. That means with replica 3 you can tolerate one node going down. The thing to note is that just because you have a quorum number of bricks online, you are not always guaranteed to be able to serve the I/O. The problem is that if the only good brick, the one that witnessed all the writes, goes down, you still have to fail the I/O. Let's look at that with a diagram. Here a client is writing to all three bricks. The first write did not succeed on the third brick, and the second write did not succeed on brick 2. So now brick 1 is the only brick that has witnessed all the writes. When the third write comes, even though the client is connected to bricks 2 and 3, we cannot allow the write, because the only good brick that witnessed all the previous writes, brick 1, is down. That is true of any replication system: even with a quorum number of bricks up, if the good bricks are down you still cannot serve the I/O.

You must also have used or heard about the arbiter feature. Arbiter is basically replica 2 plus a third brick used as the arbiter, which stores only the namespace. It doesn't store the contents of the files, so on the arbiter they are all zero-byte files. I was saying that AFR uses extended attributes to figure out which brick is good and which is bad, right? In the case of replica 2, because there are only two copies, you have only two copies of those xattrs and cannot always break a tie. The arbiter overcomes that problem by storing the file names, so only the namespace is captured, and the xattrs are stored on those files. Now, since we have three copies of the metadata information, we can prevent ending up in a split-brain state. So that was the arbiter.

So why did we go for the thin arbiter, and what is it? Thin arbiter is essentially a replica 2 volume plus a lightweight thin-arbiter process. If you look at the normal arbiter, it is a full-blown brick, in the sense that there is one arbiter brick for every replica subvolume and it stores all the file names of that particular volume. Thin arbiter is not like that. It actually lies outside the trusted storage pool, which means it is not part of the cluster at all. You can host it in a cloud environment somewhere, where the management daemon, glusterd, is not running at all, and the node is not managed by glusterd. If you look at the volume info, you will still see the volume depicted as 1 x 2, that is a replica 2 volume, but you will also see an extra line saying it is a thin-arbiter volume. That's how you identify a thin-arbiter volume. And the advantage of thin arbiter is that you can host multiple replica 2 volumes with the same thin-arbiter node.
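Just for comparison with the thin-arbiter syntax we will see later, a classic full arbiter volume is created along these lines; the volume name, hostnames and brick paths are placeholders, and the exact syntax can differ slightly between releases:

    # replica 3 where the third brick stores only file names and metadata
    gluster volume create arbvol replica 3 arbiter 1 \
        server1:/bricks/b1 server2:/bricks/b2 server3:/bricks/arb
    gluster volume start arbvol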
If you look at the diagram, you see different trusted storage pools, TSP 1 and TSP 2, and they host different volumes. Some of them are thin-arbiter volumes, some are normal volumes, and all the thin-arbiter volumes use the same thin-arbiter node, which can be hosted separately in the cloud. All the clients that access the respective volumes also talk to the thin arbiter. One caveat here is that you must use volume names that are unique across the different trusted storage pools. The reason is that the thin arbiter has an ID file, which we will see in the next slide, and it uses the name of the volume to identify which replica subvolume it belongs to. If you have the same volume name across multiple trusted storage pools, the uniqueness of the thin-arbiter ID file is lost. So as long as the volume names are unique, you can use the same thin arbiter for multiple storage pools.

So what exactly is the thin-arbiter process? It is essentially a lightweight brick process. You have all the standard translators you see in a GlusterFS brick process, starting from protocol/server and ending with posix, but there is one additional translator, the thin-arbiter translator, sitting just above posix. As I was saying, the thin-arbiter brick contains only one file per replica subvolume, and that file is used for quorum, to determine which data brick is good and which is bad. The only operations that come to the thin arbiter are, first, creating the ID file, which happens only once during the lifecycle of a volume, and then the setxattr calls that AFR uses to track which brick is good and which is bad. Any other operation has to be blocked, and that is the job of the thin-arbiter translator: it allows only the create and the xattr ops to go through; anything else is failed.

The other thing is that, as I mentioned, you can run the thin-arbiter process on a node that does not have glusterd. If you know glusterd, it is the management daemon that spawns all the brick processes, the self-heal daemons and so on; if you restart a node, it is glusterd that ensures the brick processes come back up. So without glusterd, how does it actually work? If you have mounted a Gluster volume, you know how the mount logic works: when you issue the mount -t glusterfs command with the server name and the volume name, the client first talks to glusterd, gets the volfile, and then connects to each of the bricks on its particular port. But on the thin-arbiter node we are not hosting any glusterd process, so we need a well-known port number. We currently use 24007, because that is the port glusterd normally uses; since glusterd is not running there, when you mount the volume the client connects directly to this port. If you want to change it to some other port, there are volume options available to configure a different one.

So let's look at how thin arbiter works for writes and reads. Let's assume the application is writing to file 1, you have brick 1, brick 2 and the thin arbiter, and say the write succeeded on the first brick and failed on the second. What AFR does, before replying to the application, is mark on the first brick that there is a pending operation on the second brick.
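To make that mount flow concrete, it is something like the following; the server name, volume name and mount point are placeholders:

    # on the client: a normal GlusterFS mount; the client fetches the volfile
    # from glusterd on server1 and then connects to the data bricks and to the
    # thin arbiter
    mount -t glusterfs server1:/tavol /mnt/tavol

    # on the thin-arbiter node there is no glusterd, so the thin-arbiter brick
    # process itself listens on the well-known glusterd port
    ss -tlnp | grep 24007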
So it essentially marks that brick 2 is bad, and that information is recorded both on the first brick and on the thin arbiter. After marking it, the client also stores in memory which brick is good and which is bad. The reason is that the thin arbiter does not have to sit within, say, a 5-millisecond latency range of the clients; it can be hosted far away, so contacting it for every file operation would be very expensive. To avoid that, we maintain the bad-brick information in memory on the client. So the client notes in memory that brick 2 is bad and then responds with success to the application.

Now when write 2 comes on the same file, as long as the write succeeds on the previously known good copy, it is a success to the application. If you look at the diagram, write 2 comes on file 1 and it succeeds on the first brick but fails on the second. Because we already know from the previous write that brick 1 is the good copy and brick 2 is bad, we can return success to the application without contacting the thin arbiter at all. That is how the thin arbiter stays out of the I/O path.

Now let's see what happens when a write fails on the opposite replica. Write 3 comes on the same file and this time it fails on the first brick but succeeds on the second. If we allowed this as a success, we could end up in a split-brain state, so we do not return success; we fail the FOP, the file operation. This is the essence of thin arbiter. To summarize: if the write fails on both data bricks, then obviously the write fails. If the write fails on one brick and that brick is already the known bad brick, you can return success to the application. But if the write fails on the brick that was known to be good, you have to fail the operation.

All right, let's look at how reads work, case by case. First, say we are already in a state where brick 1 is marking brick 2 as bad, so brick 1 is good, brick 2 is bad, and the client is connected to both bricks. In this case we do not have to query the thin arbiter, because the AFR xattrs on the two bricks already tell us which brick is good and which is bad; you can trust them and serve the read. Case 2 is when the client is connected to the good brick and disconnected from the bad brick. If this good brick already blames the second one with an xattr, we can be sure we don't have to contact the thin-arbiter node, because the xattr state is known and it already blames the other brick, so you can directly allow the read to go through. But say the client is connected only to the brick that is bad, and not to the first one which is good. If this brick does not blame anybody, we cannot blindly serve the read from it, because it carries no xattrs telling us it is good; you have to query the thin arbiter.
So that is the case where the client actually has to query the thin arbiter: if the thin arbiter does not blame the brick the client is connected to, you can serve the read, otherwise you cannot. To summarize the previous slide: if both data bricks are up, you serve the read from a good copy, and both can be good. If one of them is down, you query the brick which is up, and if it blames the brick which is down, you can safely serve the read from it; otherwise you have to contact the thin arbiter and use its information to decide which brick is good and which is bad.

OK, the next two slides are a bit of an implementation detail. I was telling you that the client maintains in memory which brick is good and which is bad, right? But the self-heal daemon also heals the files when the bricks come back up. So how does the client invalidate its in-memory information when the self-heal daemon heals the file? For that, it makes use of upcalls. The locks translator in GlusterFS provides a notion of an upcall for locks: when there is a conflicting lock from another client, the locks translator sends a notification to the client currently holding the lock, and it is up to that client to release it so the conflicting client can take the lock. The locks translator also supports taking locks on the same file from the same client in multiple domains. For example, client 1 can take a lock on file 1, say from offset 0 to 10, in domain 1, and it will be granted; if it then asks for a lock on the same file and range in a different domain, that will also be granted. So the locks translator has this notion of domains, whereby locks with the same offset and range on the same file are allowed as long as the domain is different. AFR uses these two features to invalidate the in-memory information, and we will see how that works.

When the first failure happens while writing from a client, during the post-op phase AFR takes two locks on the thin arbiter, one in a notify domain and one in a modify domain. It then marks on the thin arbiter which brick is good and which is bad, and after the marking it releases only the lock in the modify domain. The notify-domain lock stays held, so for every client connected to the thin arbiter there is a notify-domain lock residing in that brick process. How is that used? When the self-heal daemon starts to heal the files in the volume, it attempts to take both the notify and the modify locks, and because of the lock-contention feature of the locks translator, an upcall notification is sent back to the client. If the client has ongoing writes, it completes them and then releases the notify lock, so the self-heal daemon can get the lock and proceed with healing the file. The thing to note is that if I/O fails again during the heal, the client will again mark the bad brick on the thin arbiter, and its in-memory information is invalidated and re-established. So this is how the locks translator's upcall infrastructure and multi-domain locks are used to keep the in-memory information correct.
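Before we move on, just to pull the write and read rules together, here is a rough sketch in shell-style pseudocode. It is purely illustrative: the real logic lives in the AFR translator, which is written in C, and the function and parameter names here (ta_write_outcome, known_bad and so on) are ones I am making up for the sketch:

    # $1 = write ok on brick1 (yes/no), $2 = write ok on brick2, $3 = known bad brick (brick1|brick2|none)
    ta_write_outcome() {
        if [ "$1" = yes ] && [ "$2" = yes ]; then echo success; return; fi
        if [ "$1" = no ]  && [ "$2" = no ];  then echo fail; return; fi
        failed=brick2; [ "$1" = no ] && failed=brick1
        if [ "$3" = "$failed" ] || [ "$3" = none ]; then
            echo success   # the failed brick is (or now becomes) the bad one; a new failure is first marked on the TA
        else
            echo fail      # the write missed the only good copy: fail the FOP to avoid split-brain
        fi
    }

    # $1 = both data bricks up, $2 = the reachable brick blames its peer, $3 = the TA blames the reachable brick
    ta_read_outcome() {
        if [ "$1" = yes ]; then echo "serve from the good copy (no TA query)"; return; fi
        if [ "$2" = yes ]; then echo "serve from the surviving brick (no TA query)"; return; fi
        if [ "$3" = no ];  then echo "query the TA, then serve from the surviving brick"
        else echo "fail the read (EIO)"; fi
    }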
So installation and usage is pretty simple. On the thin-arbiter node you install the server RPMs and run a script to start the thin-arbiter process. Once that is done, you can create what are otherwise normal replica 2 volumes using the gluster volume create syntax with the thin-arbiter keyword; I will show this in the demo. If you are using it in standalone mode, you can use this method. If you are using a storage provisioner for containers, there is something called kadalu.io, which has also added support for thin arbiter recently; you can try that out.

As for things still to do: we currently do not have add-brick and remove-brick CLI support, so if you want to replace a brick, or convert an existing replica 2 or replica 3 volume into a thin-arbiter volume, that is not possible yet. Those are things we need to work on. Also, for reads: writes maintain the in-memory information about which brick is good and which is bad, but reads do not use that information yet, so whenever a read needs it, the client queries the thin arbiter. We need to optimize reads to use the in-memory information as well. And of course, if you try it out and report bugs, we will be happy to fix them.

I will show you a demo now; I have recorded it already, so I will just play it. I hope the font is visible. We have four VMs here, ravi1 through ravi4: the first two will host the data bricks, ravi3 I am going to use for hosting the thin-arbiter process, and the fourth machine will be the client.

Let's first start with installing the thin arbiter on VM 3. You run the script called setup-thin-arbiter.sh with the -s flag. It asks you for the brick path, you enter where you want the brick to be hosted, say brick-ta, and the thin arbiter is started. If you check whether the process is running, you can see it. And as I was saying, there is no glusterd on the thin-arbiter node, which means process management has to be done by systemd. We have integrated this with systemd, so even when the thin-arbiter node gets rebooted, or the process crashes, it is automatically started again; the unit file takes care of that. Let's try to kill the process and see: you can see that it is spawned again with a different PID, so you really don't need glusterd on this node.

Having started the thin arbiter, let's create a replica 2 volume using the first two VMs. I'll just export some environment variables with the IP addresses of the VMs, and then create a thin-arbiter volume. The syntax is gluster volume create, the volume name, replica 2, thin-arbiter 1, then the list of bricks which form the data bricks of the volume, and at the end you mention the thin-arbiter brick. VM 3 is the thin arbiter here, so we give VM 3's brick-ta path, and that's it. Then we start the volume. Now let's check on the second node whether the bricks are up and running; they are, so we are good to mount the volume and start doing I/O on it. We will go to node 4 now.
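For reference, the demo steps boil down to commands roughly like these. The IP addresses, the volume name tavol and the brick paths are just the ones I use in my VMs, and the exact location of the setup script can vary with how the packages are installed:

    # on the thin-arbiter node (ravi3): start the thin-arbiter process
    ./setup-thin-arbiter.sh -s          # prompts for the brick path, e.g. /bricks/brick-ta

    # on one of the data-brick nodes: create and start the thin-arbiter volume
    VM1=10.0.0.1; VM2=10.0.0.2; VM3=10.0.0.3
    gluster volume create tavol replica 2 thin-arbiter 1 \
        $VM1:/bricks/brick1 $VM2:/bricks/brick2 $VM3:/bricks/brick-ta
    gluster volume start tavol
    gluster volume info tavol           # shows 1 x 2 plus the thin-arbiter line

    # on the client (ravi4): mount and start doing I/O
    mount -t glusterfs $VM1:/tavol /mnt/tavol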
Before mounting the volume, let me show you the ID file. If you look at the thin-arbiter brick, it is currently empty; there is nothing here. As I was saying, the ID file is created when you first mount the volume. So if you issue the mount command and then come back and check, there is this ID file, a zero-byte file, and it is used for capturing the good and bad brick information for all the files in this replica subvolume.

If you write to a file, you see that the file contents are replicated to both data bricks. Now let's kill the second data brick and try to write something. The write succeeds and you are also able to read, because the killed brick is now the only bad brick. If you now look at the extended attributes on the thin arbiter with getfattr, you will see that it contains extended attributes which blame client-1. Client-0 is the first brick and client-1 is the second, so because we killed the brick on VM 2, it is saying there is a pending data heal on the second brick. The thin arbiter captures this information, and the client also has it in memory: the second brick is bad, and it should not allow any write that fails on the first brick.

So let's kill the first brick and see what happens. Now we kill the brick which is good and bring the second VM back up. The first brick, which witnessed the write, is killed, and the second brick is up, but when you access the volume from the client, you will see that both reads and writes fail: ls fails with an input/output error and writes also fail with an input/output error. Because the only good brick is down, we are not allowing the I/O anymore.

Now we bring the brick back up, by restarting glusterd on the first node so that the brick process comes back. The self-heal daemon will have automatically healed the file by now, and you can see that the file is accessible again from the mount. If you look at the contents of the file from both bricks, they are the same, and if you do a getfattr now and look at the extended attribute that AFR maintains, it has been reset to all zeros. Earlier it was blaming the second brick; now, because the self-heal has happened, it has been reset and you can continue with the I/O as usual.
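At that point in the demo, the getfattr output on the thin-arbiter brick looks something like this; the id-file name is a stand-in (it is derived from the volume name) and the hex values are illustrative:

    # on the thin-arbiter node, while the second data brick (client-1) is down and blamed
    getfattr -d -m . -e hex /bricks/brick-ta/tavol.id
    # trusted.afr.tavol-client-0=0x000000000000000000000000   # first brick: clean
    # trusted.afr.tavol-client-1=0x000000010000000000000000   # second brick: pending data heal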
So that's pretty much it. Do you guys have any questions? All right then, thank you.