Hi, everyone. My name is Vaisharao. I work on the GlusterFS project at Red Hat. Previously, I interned on the Linux kernel nftables project as an Outreachy intern. Today, we'll look into how GlusterFS achieves high availability.

First of all, what is GlusterFS? GlusterFS is basically a distributed file system which aggregates storage servers and provides a single global namespace.

There are some Gluster terms which we will have to look at before we proceed further. There is something called a trusted storage pool, which is basically a network of storage servers, and only these storage servers are allowed to create volumes; we'll look into what a volume is in a moment. There is something called a brick: the basic storage unit in GlusterFS, represented by an exported directory on a server. A volume is basically a collection of these bricks, and as I said, only the servers in the trusted storage pool are allowed to create volumes. We also have subvolumes, which are subsets of these bricks that together provide a function, such as data replication in the replication feature. And we have something called translators: sets of modules which together provide a feature, where each module has a defined function. GlusterD is the GlusterFS management service daemon, which is present on each of these servers. Finally, metadata is data that provides information about other data, and we make use of xattrs (extended attributes), which allow a program or a user to associate metadata with a file or directory; Gluster basically stores its metadata in these xattrs.

So let's look at the architecture in brief. There are different types of volumes in GlusterFS. First there is the distributed volume, in which the files are basically distributed, in what looks like a random order, across different bricks.
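Conceptually, placement in a distributed volume works like hashing file names onto bricks. The following is only a simplified Python sketch of that idea, not Gluster's actual DHT translator (which assigns hash ranges to bricks per directory); the brick names are invented for the example:

```python
import hashlib

def pick_brick(filename, bricks):
    """Map a file name to exactly one brick via a stable hash (simplified stand-in for DHT)."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["server1:/exp1", "server2:/exp2"]
placement = {name: pick_brick(name, bricks) for name in ["file1", "file2", "file3"]}
# each file lives on exactly one brick; if that server dies, that file
# becomes unreachable -- the single point of failure discussed next
```

The key property is that the hash is deterministic: any client computes the same brick for the same name without consulting a central metadata server.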
As you can see here, brick 1 is on server 1; as I said, a brick is represented by an exported directory on that server. Brick 2 is on server 2, and the files are distributed between them. But this has a basic disadvantage: a single point of failure. So we need something better. For that, we have the replicated volume, in which the files are replicated onto each of the bricks at the same time. But we also need something for scaling along with high reliability, and for that we have the distributed replicated volume, where files are distributed as well as replicated across different bricks. So that is the architecture of GlusterFS in brief.

Now let's proceed to our main topic: high availability, and how it is achieved in GlusterFS. What is high availability, basically? High availability means the service stays available even after some interruption: our applications run smoothly and we keep access to the data. Why is it important, or why is it required? Today we rely so heavily on data that we cannot lose access to it even for a moment; otherwise we incur huge losses, and the users are inconvenienced.

So how is it implemented in GlusterFS? There is a feature called Automatic File Replication (AFR), which is basically synchronous, client-side replication. What it does is copy the file onto each of the bricks, as I showed you. Then we use a GFID: an internal identifier in Gluster which is allocated to each file, so that Gluster can tell which copy of a file is new and which is old during replication. The number of bricks for this particular feature to work should be equal to the replication factor. Here I have shown you an example: test-volume is the example volume we are considering, with a replica count of two, and as you can see, there are two bricks.
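As a rough illustration of the distributed replicated layout (plain Python, not Gluster code; the brick names and the grouping into replica sets are invented for the example), bricks are grouped into replica sets, a file is hashed to one set, and it is then written to every brick in that set:

```python
import hashlib

def replica_set(filename, replica_sets):
    """Hash a file name to one replica set of bricks (simplified)."""
    digest = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return replica_sets[digest % len(replica_sets)]

# replica count 2: each set holds two bricks, and a file is written to both
replica_sets = [
    ["server1:/exp1", "server2:/exp2"],
    ["server3:/exp3", "server4:/exp4"],
]
targets = replica_set("file1", replica_sets)
# the client writes synchronously to every brick in `targets`, so losing
# one server in the set still leaves a complete copy of the file
```

This is why the total brick count must be a multiple of the replica count: the bricks have to divide evenly into sets.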
One is on server 1 and the other on server 2, with the exported directories exp1 and exp2. Such a volume can be created very easily with one command, along the lines of `gluster volume create test-volume replica 2 server1:/exp1 server2:/exp2`.

But to achieve write consistency in replication, we want all the bricks to have the same data; that is the most important property of replication. For that, we have the lock and unlock phases: we acquire locks on each of these files so that no other I/O operations are going on. Holding these locks, we proceed to the pre-op and post-op phases. In these phases there is an extended-attribute operation, in which a counter is incremented or decremented. In the pre-op we increment it, so that the daemon knows which file has an operation pending and, later, which copy is the newest; this counter is present on each of the bricks. Then there is the op phase, in which the actual I/O operation takes place.

So: we have the lock phase; then the pre-op phase, in which the extended-attribute operation increments the counter and records that an operation will be going on for this file; then the op phase; and if it is successful, in the post-op phase we decrement that extended attribute and release the locks in the unlock phase.

Then there is something called the self-heal daemon. I said all the bricks should be online; suppose one of the bricks fails during file creation. Since the GFID of each file is the same across the bricks, the self-heal daemon crawls through a particular index directory present on each of these bricks. It goes through the GFIDs recorded in this directory and gets to know which files have operations pending. From the good brick it will continue the healing: when the bad brick that went down comes up again, the daemon will heal it by creating the file on that brick.

But with the replication feature we can have a split-brain problem.
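The five phases above can be sketched as a toy Python simulation. This is not Gluster's AFR code: the brick dictionaries and the single `pending` counter are invented stand-ins for the real per-brick changelog, which lives in trusted.afr.* extended attributes:

```python
def afr_write(bricks, filename, data):
    """Toy model of one AFR transaction: lock -> pre-op -> op -> post-op -> unlock."""
    up = [b for b in bricks if b["up"]]
    # lock phase: take locks so no other I/O races with this file (omitted here)
    # pre-op phase: increment the pending counter (an xattr in real Gluster)
    for b in up:
        b["pending"] += 1
    # op phase: the actual write, performed on every reachable brick
    for b in up:
        b["files"][filename] = data
    # post-op phase: decrement only if every brick saw the write; a surviving
    # nonzero counter tells the self-heal daemon that some brick is stale
    if len(up) == len(bricks):
        for b in up:
            b["pending"] -= 1
    # unlock phase: release the locks

bricks = [
    {"name": "server1:/exp1", "up": True,  "pending": 0, "files": {}},
    {"name": "server2:/exp2", "up": False, "pending": 0, "files": {}},  # went down
]
afr_write(bricks, "f1", b"hello")
# server1 now holds f1 with pending == 1; server2 has neither, so the
# self-heal daemon later copies f1 from server1 to server2
```

The lingering counter is exactly what the self-heal daemon's crawl looks for: it marks both that healing is needed and which copy is the good one.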
What is the split-brain problem? Basically, it arises with replica 2. We'll consider brick 1 and brick 2 again; I'll call them node 1 (N1) and node 2 (N2). Suppose that during file creation on a replica 2 volume, both N1 and N2 are online. Later N1 goes offline while N2 stays online, and a file is created with the name F1; its GFID is, for example, "abc", the sample GFID we are considering here. For the self-heal daemon to work, the N2 brick should be online; only then can it crawl the GFIDs and create the file on N1. But before the self-heal daemon can do anything, N2 goes down, N1 comes back up, and the user creates a file with the same name F1 in the same directory; this time it gets a different GFID. Now suppose N2 also comes back up, so both are online. For the self-heal daemon to work properly, the GFIDs should be the same on both bricks, so this causes a problem: it cannot tell which file is the latest one and which one is stale. That is the split-brain problem in the replication feature.

So how do we resolve it? For resolving it, we have the replica 3 volume: instead of two bricks, let's take three. And we have client-side quorum: a certain number of bricks should be up for a write operation to proceed. In a replica 3 volume, quorum allows write operations only if at least two bricks are up at a time. But replica 3 consumes a lot of space, so we also have something called the arbiter volume. It also makes use of client-side quorum, but on every third brick we store only metadata; there is no actual file data on that brick. That way, even if one of the other two bricks goes down, the metadata on the arbiter is enough to decide which copy is good and to heal correctly.

These are the resources: you can visit gluster.org for the documentation, and docs.gluster.org is kept updated.
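The client-side quorum rule mentioned above can be stated in a few lines. This is a minimal sketch assuming a simple majority rule; in real Gluster this is enforced inside the client-side AFR translator and tuned via the `cluster.quorum-type` and `cluster.quorum-count` volume options:

```python
def writes_allowed(bricks_up, replica_count):
    """Client-side quorum: permit writes only while a majority of replicas is reachable."""
    return bricks_up > replica_count // 2

# replica 3: two bricks up is enough for writes, one is not
assert writes_allowed(3, 3) and writes_allowed(2, 3)
assert not writes_allowed(1, 3)
# under a majority rule, replica 2 cannot tolerate losing any brick
# (1 of 2 is not a majority), which is why replica 3 or arbiter is preferred
assert not writes_allowed(1, 2)
```

Blocking writes below quorum trades availability for consistency: a lone surviving brick can still serve reads, but it is never allowed to diverge from the others.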
And there are different kinds of volumes which you can try out and play with. And yeah, thank you so much.

[Audience question, partly inaudible, about whether the documentation is kept up to date.]

Yes, the documentation is up to date. We also take pull requests and so on, so please do push those.

With replica 3, as far as I know, the quorum count should be two. So only if two bricks are up will the I/O operations proceed, after the lock phase and the pre-op phase and all; otherwise it will not allow any write operations to be done. That is the basic idea of the whole thing.

And this is how it basically works, in a simple way. I talked about the extended attributes: there is a string, say 0, 0, 0, and in the pre-op it gets incremented by 1, so it becomes 0, 0, 1. That counter is present on each of these bricks. So when one of the bricks goes down, the self-heal daemon crawls all of them: on the brick that went down it would still be 0, 0, 0, while the others show 0, 0, 1, and when the brick comes back up, the daemon compares the counters for the same GFID.

That is really up to the user and what they feel, according to their use case, because you are the best one to choose which one works for you.

[Inaudible audience comment.] OK, then. Any other comments?

There are certain workloads where Gluster performs better than Ceph, so you should really set one of these up and try your workload on it, especially if performance is an important criterion, or data durability or data reliability. Could you say ten or fifteen words about how read HA works?
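The counter comparison described in that answer can be sketched like this. It is an illustrative toy, not Gluster's actual heal algorithm (which reads the per-peer trusted.afr.* xattrs on every brick); the brick names are made up:

```python
def pick_heal_source(counters):
    """Given each brick's pending counter for one GFID, pick the heal source and sinks."""
    # the brick with the highest pending count witnessed writes the others missed,
    # so it holds the good copy; bricks with lower counts are stale and get healed
    source = max(counters, key=counters.get)
    sinks = [b for b in counters if counters[b] < counters[source]]
    return source, sinks

# brick2 went down during the write: brick1 still shows a pending op (0,0,1),
# while brick2 never incremented past 0,0,0
source, sinks = pick_heal_source({"brick1": 1, "brick2": 0})
# → source "brick1", sinks ["brick2"]: heal copies the file brick1 -> brick2
```

In a true split-brain, each brick's counters blame the other and no clear source exists; that is the case replica 3 and arbiter volumes are designed to avoid.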
Where, let's say, node 1 and node 2 have written the same file, and it is the same on both, but now node 2 goes down and I want to read. Can you talk briefly about what happens there?

You are saying there are two nodes, or any replica count, and node 2 goes down, and then you want to read from that file. The self-heal daemon would heal it, if a brick is down.

The real answer is that the client fires off reads to the remaining replicas in the cluster.

Yes, yes; there are other bricks which are available. The self-heal daemon, that is for the writes, basically; but the read will still be allowed. OK, thank you. Thank you very much. Thank you.