Good afternoon everyone. My name is Athin. I work at Red Hat as a senior software engineer, on a product called Gluster, which is an open source, scalable, distributed file system that can serve up to petabytes of data. Before I proceed further, I just wanted to know how many of you are aware of Gluster. Quite a few, interesting. Okay. So my topic is about distributed systems, specifically distributed file systems, and how we can scale a distributed system. This is the agenda I plan to cover: a few terminologies we need to understand first (those of you involved in distributed systems probably already know terms like consensus and the CAP theorem); then the major part of designing a distributed system, which is the different challenges we face and the design approaches we can take; then I will cover a little bit on an algorithm called Raft, and how this algorithm is used in technologies that provide a consistent distributed store, like etcd, Consul, and ZooKeeper; and then I will be happy to take your questions. So, consensus is a very key aspect of a distributed system. As I mentioned, consensus is an agreement, but what we need to understand is: this agreement is for what, and between whom? The answer is pretty simple. The "for what" is the operation or transaction coming into my distributed system as a request: whether I can perform that operation, whether I can commit it. That is the decision the distributed system has to make. And the "between whom" is the nodes which form your cluster: whether they agree to go ahead and commit the transaction. These are the two things we need to be careful about before concluding whether my distributed system is in consensus or not.
Another term which comes to mind when we talk about distributed systems is quorum. Normally the quorum we use is n/2 + 1, where n is the number of nodes. So consider a cluster which has 5 nodes: if 3 of the nodes are up, I can say my cluster will still be operable even when a couple of nodes are down. Moving to the CAP theorem: this theorem tells you that out of the following three properties — consistency, availability, and partition tolerance — a distributed system can guarantee at most two at the same time, so you pick the two that let your system work at its best. Consistency means all the nodes see the same data at a given point of time. Availability is the guarantee that whenever a node sends a request to the other nodes in the cluster, it gets a response back for that request. And the third point is really crucial: given that your nodes are on a network, a network breakage can put your cluster into a partition, where a small subset of nodes keeps operating and the rest are cut off from the cluster. The moment the network comes back, you need to provide a mechanism such that all the nodes which went away and came back are brought in sync with the operations that went ahead in the meantime. That is the thing we need to be careful about when we say a system is partition tolerant. Okay. Now, any system has its own set of metadata. When I consider a cluster, my metadata is all my configuration changes and all my state changes, whatever happens in my distributed system.
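The majority-quorum rule above is simple arithmetic; here is a small sketch of it in Python (my own illustration, not Gluster code):

```python
def quorum_size(n: int) -> int:
    """Majority quorum for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def has_quorum(n: int, nodes_up: int) -> bool:
    """The cluster stays operable while a majority of nodes are up."""
    return nodes_up >= quorum_size(n)

# A 5-node cluster needs 3 nodes up, so it tolerates 2 failures.
print(quorum_size(5))     # → 3
print(has_quorum(5, 3))   # → True
print(has_quorum(5, 2))   # → False
```

Note that an even-sized cluster buys you nothing extra: 6 nodes still need 4 up, which is why odd cluster sizes are the usual recommendation.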
So, there are two different ways of thinking about designing this: one with no metadata server at all, and one with a dedicated metadata server. The question we need to ask ourselves is which option is better: should we go with no metadata server, or with a metadata server? How many of you here are actually involved in a distributed system, and how many think we should have a metadata server, and how many think we should not? Looks like 50-50. So I will try to cover both of them. This is what my experience is with using no metadata server: the product we have currently does not have any dedicated metadata server, and these are the challenges we are facing while moving towards scalability. First of all is the n×n exchange of network messages. Given a five-node cluster, if you say you have no dedicated metadata servers to store the configuration details, then every node in the cluster has to hold the exact same set of configuration information as every other node, which requires an n×n mesh of communication between the nodes. When n is pretty small, say less than 100, you can still achieve it; but think about a case where you want to scale this system to 1000 nodes, or even 10000 nodes. Can you really keep exchanging information in a full-mesh way? We have seen cases where, beyond a hundred-odd nodes, it becomes really challenging to exchange the information and keep it in sync across all the nodes. The initialization time can also be pretty high, because when you bring up a node,
it has to fetch all the information from the other nodes and store it in its local configuration, which requires n×n handshaking. So as n goes up, the initialization time grows very steeply. Consider 1000 nodes forming a cluster: the initialization time would be really high, which we cannot afford in a real distributed system. Another key aspect here is the typical split-brain situation, where you could end up seeing one node with some configuration data and another node with similar but subtly different configuration data, and you do not know whom to believe, because the way we sync our data is based on versioning. There can be cases where the version numbers are the same but the underlying data differs, and then you do not know how to resolve that split-brain if you go with this n×n handshaking approach. And another key challenge we see in a distributed system is how you roll back a transaction when something fails on one of the nodes — how you get back to the previous state from which you initiated the transaction.
So these are the things a designer normally faces if he or she chooses the no-metadata-server way of designing the system. And with an MDS: if you choose to have a single MDS, it is definitely a single point of failure. That is not the real problem, though, because we can still achieve high availability, probably in the form of replica 2 or replica 3. But it can incur an additional network hop: the clients that talk to the servers do not know whom to send a request to, so they can send it to any node, and for each and every request that node has to talk to the dedicated MDS server, fetch the information, and then process it. So it definitely incurs an additional network hop. From a personal point of view, I think this is also not something you should go for while trying to solve scalability in a distributed file system. So the answer to all these problems actually starts from here. This is an algorithm which most of the recent technologies use these days. It is called Raft, and it is a consensus algorithm. I will not be going into much detail about the Raft algorithm itself, because the major focus of this talk is to make you aware of how we can utilize this algorithm and the technologies built on it to come up with a good, scalable distributed system. These are the key features the Raft algorithm offers. It is asymmetric, that is, a leader-based algorithm.
In a given cluster, one particular node always acts as the leader: it takes requests from the client and instructs the followers in the cluster to take the necessary action. The algorithm has the intelligence to choose a leader at a given point of time and then proceed with normal operation, where normal operation is simply what a follower does when it receives a request from the leader. It is based on a log-based replication model: whenever the leader issues an RPC, it first records that transaction in its log, which is part of a replicated state machine, and the entry is then applied to the state machine so the operation can be performed. Another aspect of the Raft algorithm is that it provides safety and consistency across leader changes: in the case where the node acting as leader goes down, you need to elect a new leader such that your data is still safe and consistent enough for the cluster to continue operating. And neutralizing old leaders means that the moment the old leader — the node which went down — comes back, it must not act as a leader any more, because during that time Raft chose a different leader, and Raft does not allow multiple leaders at a given point of time. Then client interaction is also a key aspect: the client should be able to know whether a transaction has been committed or not.
There is a case where the client sends a request to the Raft engine and somehow does not get back a response in time, and then sends a retry RPC. The Raft engine should recognize that this is a duplicate request: it should understand that the client did not get a response to the earlier request, and it should have enough intelligence to reply back with the result the client was expecting rather than executing the request twice. Raft also defines how cluster configuration changes are done, like how you set up your cluster initially. As I said, I will not go into much detail about all of these, only a few of them, especially leader election and how exactly that works. Another important piece of terminology: Raft uses a notion called a term to divide up time. A term is a kind of time interval that Raft defines, and it is divided into two parts: it begins with an election, where a leader has to be elected, and the rest of the term is spent on normal operation, that is, RPC processing. A few characteristics you need to understand: there is at most one leader per term; a term must never have two different leaders at the same time. There can also be a failed election, where the followers did not all vote for one candidate, or the quorum was not met — for example, the votes split evenly between two candidates — which means no candidate can choose to become leader.
What you do in that case is simply abandon that term, increment your term number, and move forward to the next election. Each server maintains a current-term value, and this is the kind of data which is persisted on disk. This term value is crucial when a leader goes down and comes back, so that you do not get into a state where you are acting on stale information. Consider a case where one node tries to say, "I am the leader, I can play the leadership role in the cluster now", but it has an old term value compared to the others. When Raft sees that the term number sent along with its RPC request is old, it rejects that node's claim to be leader. This is a very crucial aspect of Raft, and that is why I have mentioned that identifying obsolete information is crucial for achieving consistency and safety in Raft. These are the server states Raft maintains: as you see from the diagram, we have follower, candidate, and leader. I have already discussed leader and follower; the candidate is a kind of interim state: when a follower sees that there is no leader available, it moves to the candidate state so that it can vote for itself and participate in the leader election. And this is a kind of 1000-feet overview of the way Raft works.
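The stale-leader check described above can be sketched roughly like this — a heavy simplification of Raft's rule, with hypothetical names (real implementations carry the sender's term on every RPC and persist `current_term` to disk):

```python
from dataclasses import dataclass

@dataclass
class Server:
    current_term: int  # persisted to disk in real Raft

    def handle_rpc(self, sender_term: int) -> bool:
        """Reject any RPC carrying a term older than ours; adopt
        the sender's term if it is newer."""
        if sender_term < self.current_term:
            return False  # obsolete leader: request rejected
        self.current_term = max(self.current_term, sender_term)
        return True

s = Server(current_term=7)
print(s.handle_rpc(5))   # an old leader comes back → False
print(s.handle_rpc(8))   # a newer term is accepted  → True
print(s.current_term)    # → 8
```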
As you see, in the first step the client sends a request to the server's consensus module, and that consensus module sends RPC requests across all the nodes, where the entry is recorded in each node's replicated log. The moment it is safely logged, it goes to the state machine on the leader, and the moment the leader commits that transaction it sends a reply back to the client. The thing to remember here is that this is sequentially consistent, which means Raft guarantees that transactions are committed in the same order in which the requests arrived. These are the different RPCs Raft uses. One is the RequestVote RPC, which is used purely for leader election. Then AppendEntries RPCs with a message carry the actual data which has to be communicated to the followers for normal operation. And there is another form of AppendEntries RPC with no message, which is used as a heartbeat, so that the leader can tell all the followers that it is alive and they need not start another election. The moment a follower misses the heartbeat within a given time interval, it says, "I have not got any message from the leader, so I should go to the candidate state and start a leader election."
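The in-order commit guarantee described above can be pictured with a toy replicated log — my own illustration of the idea, not etcd's or Raft's actual code. The leader appends entries in arrival order and can only advance its commit index strictly in order, even if a later entry gathers majority acknowledgements first:

```python
class ToyLog:
    """Toy leader-side log: entries commit in order once a majority
    of the cluster has acknowledged them."""

    def __init__(self, cluster_size: int):
        self.entries = []                       # each: {"cmd", "acks"}
        self.commit_index = -1                  # index of last committed entry
        self.majority = cluster_size // 2 + 1

    def append(self, command: str) -> int:
        self.entries.append({"cmd": command, "acks": 1})  # leader's own copy
        return len(self.entries) - 1

    def ack(self, index: int) -> None:
        self.entries[index]["acks"] += 1
        # Advance the commit index strictly in order: entry i+1 can
        # never commit before entry i.
        while (self.commit_index + 1 < len(self.entries)
               and self.entries[self.commit_index + 1]["acks"] >= self.majority):
            self.commit_index += 1

log = ToyLog(cluster_size=3)
i = log.append("set x=1")
j = log.append("set y=2")
log.ack(j)                 # y is acknowledged first, but cannot jump the queue
print(log.commit_index)    # → -1 (nothing committed yet)
log.ack(i)
print(log.commit_index)    # → 1  (both committed, in arrival order)
```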
This is exactly how the leader election works. As I mentioned, the moment the heartbeat timeout elapses, the current term which is persisted on disk is incremented, the state moves from follower to candidate, the node votes for itself, and then it sends RequestVote RPCs to all other servers. It retries until one of three things happens: it receives votes from a majority of servers — that is, it achieves quorum and becomes leader; or it receives an RPC from a valid leader and steps back to follower; or the election times out (a failed vote), in which case it increments the term and moves forward to the next election. Another thing to notice here is that there are two properties Raft guarantees of an election: safety and liveness. Safety means you have at most one winner per term; as I mentioned on my previous slide, you cannot end up in a situation where, at a given point of time, the algorithm chooses two different leaders. Liveness means some candidate must eventually win and become leader. This is achieved by making the election timeout not static: it varies randomly by some milliseconds between different nodes. If the election timeout were static, all the followers would become candidates at the same moment, each vote for itself, and split the vote between themselves again and again, so that no one ever gathers a majority. With randomized timeouts, usually one candidate times out first, wins the election, and the others stay followers. So yeah, that was a very basic overview of how Raft works. Now, moving to consistent distributed stores.
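The randomized-timeout trick for liveness is a one-liner; this small sketch uses the 150–300 ms range the Raft paper suggests (my own illustration, not any particular implementation):

```python
import random

def election_timeout_ms(lo: int = 150, hi: int = 300) -> int:
    """Each node independently picks its election timeout uniformly
    at random, so one candidate usually times out first and wins
    before the others even start — avoiding repeated split votes."""
    return random.randint(lo, hi)

timeouts = [election_timeout_ms() for _ in range(5)]
print(timeouts)
assert all(150 <= t <= 300 for t in timeouts)
```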
Before I talk a little about consistent distributed stores, I would like to tell you that in a given distributed system or distributed file system there are three key aspects you need to focus on: how to store your configuration data, that is, any state changes going through your nodes; how to sync that configuration; and how to perform your transactions. These are the three things you need to focus on while designing a distributed system. Now, a consistent distributed store is one where the store is distributed across your cluster rather than being local to a node, and you can keep the configuration data in it. It uses key-value pairs for ease of use, and believe me, there are several such consistent distributed stores available among the latest technologies now. I have explored etcd and Consul, and a little bit of ZooKeeper as well. The current product we have is written in C, and our community is more comfortable in C, so we chose not to explore ZooKeeper further. This particular work is still in a kind of POC mode: we have not implemented our distributed system using etcd as of now, but we are planning to do that in the next release. As the name says, etcd is a kind of distributed "/etc". It is a CoreOS open source project, and it is based on Raft, which is why I chose to spend a few of my slides on Raft. It is highly available and reliable; it is sequentially consistent, since Raft guarantees that transactions are committed in the order in which the requests come in; and it is watchable.
That last point is a really crucial requirement we have while designing our system, because we need a mechanism where I can constantly watch a particular key in a given namespace and see whether anything changes for that key; etcd provides this facility. It is exposed via HTTP. It is runtime configurable, which is a kind of selling feature; I heard that ZooKeeper was not doing that, though I am not sure whether it does at this point of time. It is durable as well: you can take a backup or snapshot and then restore it. And you can set up your keys such that they stay for a certain amount of time and then expire — a TTL — which is also a very cool feature to work with. Why did I choose to go for etcd? It has a really vibrant community: if you look at their GitHub page there are around 500-plus applications, like Kubernetes and Cloud Foundry, using etcd; there are around 150-plus developers; and they follow stable releases — I mentioned etcd 2.0, which was released. So this is the slide where I will spend a bit more time. Whatever we have learnt up to this point is theory; now, if you put Raft and etcd together, what you do is: given n nodes in a cluster, you set up a sub-cluster inside your big cluster using etcd. Say you have 1000 nodes, or 100 nodes: you can set up a cluster of 5 nodes which run etcd, and use that sub-cluster to store your configuration. Which means, if a request comes into any of your nodes, you have a cluster which can guarantee that whatever configuration you need to store is reliable and consistent.
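Among the etcd features listed above, the TTL expiry is easy to picture with a tiny in-memory store. This is purely my own simulation to show the behaviour — it is not etcd's API, and real etcd of course persists and replicates the keys:

```python
import time

class TTLStore:
    """Miniature key-value store where keys can expire after a TTL,
    imitating the etcd feature mentioned above."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # lazily drop the expired key
            return None
        return value

store = TTLStore()
store.put("cluster/leader", "node-3", ttl=0.05)  # expires in 50 ms
print(store.get("cluster/leader"))   # → node-3
time.sleep(0.06)
print(store.get("cluster/leader"))   # → None (expired)
```

In etcd this pattern — a key that must be refreshed before it expires — is a common way to implement leader announcement and liveness.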
So you would not need to send 1000-odd RPC requests to all the nodes in the cluster for each node to store its own copy of the configuration: you reduce the number of RPC requests drastically, and whatever guarantees etcd provides you also get, if you use etcd or any other consistent distributed store in your application. And, as I said, there is no burden on the application to maintain consistency. We have been facing a lot of challenges maintaining consistency of the data while syncing and storing it across different nodes, which you would no longer need to bother about, because that is already taken care of by etcd, and it is a proven thing. The only thing you need to do is plug etcd into your application and form the cluster. The way you can do it is: the moment you bring up the first node, it should have etcd running on it; the moment you bring up your second node, it should also bring up etcd; and so on. If you choose to have 5 nodes in your etcd cluster, then once you reach 5 nodes, from then onwards you do not launch etcd any more: the further nodes act as normal nodes, while the initial 5 nodes carry all the etcd intelligence and play the key role of storing all the configuration data. And I think that is the main point of how we can make our system scalable. These are the references I have put here, which you can go through later on; the first one is a very good visualization web UI which explains how exactly the Raft algorithm works. And now I am open to take questions.

You said you explored Consul as well, right? What was your thought process in choosing etcd over Consul, given that Consul also provides health checks on top of the whole infrastructure?
If you have to integrate any third-party application into your project, you need to look at how its community is working at this point of time, and whether they are coming out with proper stable releases, because that plays a very crucial role. I did have a slide where I mentioned etcd has around 500-plus applications using it. That was one of the major reasons we went for etcd over Consul, because honestly speaking we do not see much difference in the underlying architecture or the actual consistency guarantees that etcd, ZooKeeper, or Consul provide. It is the ease of use and the guarantees the community provides that you need to look for. So the main reason for going for etcd was the community itself and the kind of usage that is happening.

Any drawbacks that you saw when comparing the others? What was the main reason you said ZooKeeper was not the right choice for you?

Because our project is not in Java, and our community also does not like to use Java. And ZooKeeper uses a Paxos-family algorithm; Raft was designed as a more understandable alternative to Paxos, which is pretty difficult to understand, and if you look at the Raft algorithm as implemented inside etcd, it is around 700-odd lines of code in Go. That was one of the reasons we did not want to go with Paxos.

Does it mean that you are moving from no metadata store to a metadata store by using etcd?

It is a kind of metadata store, but at the same time, as I said, it is not a dedicated metadata server: the node itself has its own compute engine as well. It works like a regular node participating in the cluster, but it also carries the etcd knowledge, just to store the additional configuration details.
In that case, if we are talking about a 100-node cluster, does it mean that etcd will be running on all 100 nodes?

No. If you look at the etcd or Consul documentation, they recommend that you set up 3 to 5 nodes in the store's cluster, just to achieve high availability. So our plan is to have 5 nodes with etcd running, and the rest of the nodes will only run the cluster daemon.

So you are going with the Elasticsearch-style concept of master nodes and data nodes. My other question is about Ceph: there is the Ceph storage file system as well, and that is also from Red Hat now. I would like to understand why Ceph and GlusterFS both exist as of now, if you can explain the thought process.

Honestly, that is off topic, but it depends on the kind of workload you are looking at. Gluster is really a file system which works well over file workloads compared to Ceph, but for workloads where you need block and object interfaces, Ceph does better than Gluster at this point of time. If you look at the OpenStack area, Ceph is doing a really good job there; but if you talk about backup or archival kinds of workloads, then Gluster is definitely the choice over Ceph. We actually productize these two together and call them Red Hat Storage, and the customer is free to choose the product depending on their workload. We do not tell the customer "you should go for Gluster" or "you should go for Ceph"; it depends on the kind of workload they have, and based on that we suggest which product they can go with.

I would like to hear your thoughts on a multi-master metadata-server kind of model.

Why do you think that — or rather, what advantages would you get if you had a multi-master?
They could replicate the state information and the other metadata among themselves. They could leverage etcd, and on the network I could use DRBD with that.

So, even with a single master, even if the rest of the nodes are followers, you still replicate all the information across the cluster. The concept only says that your client talks to one particular node, which is the master, and the master then sends the request to all the followers to sync that configuration data across all the nodes. So at the end of the day, all of those nodes are crucial to the overall cluster. If you visualize a cluster of 100 nodes where only 5 nodes form an etcd cluster, then even if only one of them is currently acting as leader, from a 1000-feet overview those 5 nodes are all "masters" to me, because without that information I cannot operate the cluster. So for the kind of requirement I have, I can satisfy it even with a single master. And exactly — your client would not know where to send the request and where not to, and as I said, Raft guarantees that there is only one leader in a given term; having multiple masters would definitely lead to different issues, unless you change the underlying algorithm. Yeah, absolutely.

No, it is actually for the complete cluster, but all the nodes in the cluster are aware of who is running etcd. So if there are 5 nodes running etcd, the rest of the nodes have the knowledge that these are the 5 nodes which are running etcd.
So if at any given point of time I have to fetch the configuration data, I talk to one of those particular nodes in the cluster, and Raft already provides the implementation that tells me which of them is the leader at that point of time.

We are talking about 100 terabytes of data. If there is one master, meaning a leader, and every write goes through that leader, there will be a problem, right?

No — when you talk about data, are you talking about the application data or the metadata?

I am talking about the application data along with the metadata, because with huge data sets the metadata is also heavily exercised by reads and writes. It means, I am assuming, that reads and writes go through the leader, not directly to the storage nodes.

Okay, this particular thing is only for managing the cluster. The way our cluster works, for the file operations we do not need to talk to any particular leader or follower. The clients which connect — the mount point where you are running the Gluster client — have the intelligence.

Okay, so it is a smart client that holds the metadata.

Yes, because we have an algorithm called DHT which is aware of what bricks — the data storage units — you have. You do not need a lookup to know where to forward your request, because your client is intelligent enough to guarantee that.

How frequently will the client pull the data from the nodes of the etcd cluster?

I did not get your question correctly.

You said the client has the metadata information from the metadata servers, because that is how it knows which bricks reside where.

The client I was talking about there is the CLI client, where you issue the cluster commands.

Got it, but still: suppose a particular brick is deployed on a particular data node.
The client has to know that this is the data node it has to talk to and write the data to.

That is the file-operation client. The one I spoke about is the CLI client, the command-line interface client which talks to glusterd.

If you can explain the read and write operations — what the flow would be if somebody writes from the client to the storage cluster and reads it back — that would be great.

Our management operations and file operations are separated altogether; management operations do not interfere with file operations. For the actual read and write, what happens is: your client is sitting on one side, and you have a server stack as well. The client has the intelligence that whenever a request comes in for a write or read, it has a file name, or inode number, or whatever form it is in; it performs distributed hashing on it, and from that it knows to which particular data storage unit it has to send the RPC request. So there I do not need the etcd cluster at all.

So it means the consistent hashing happens when you start writing into a file. Does the client use the consistent hashing to get the node information?

Yes. Any more questions?

Hello, can you hear me? Can you speak up a little bit? Is it ok now? Yeah. Actually, we are using Docker containers, we have our nodes in these containers, and we are using HAProxy, which has its own round-robin algorithm. Now, can we use etcd on these application container nodes for configuration and metadata information, along with Docker and HAProxy? How would it work?

You can use it, because you can make your containers run etcd, which in my opinion you can definitely do.
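The client-side placement idea described in that answer can be sketched as a toy hash-to-brick mapping. This is my own simplified stand-in, not Gluster's actual DHT (which assigns hash ranges per directory and uses its own hash function); it only shows why no metadata lookup is needed:

```python
import hashlib

def pick_brick(filename: str, bricks: list) -> str:
    """Hash the file name and map it deterministically onto one of
    the bricks, so every client computes the same placement locally."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["server1:/brick1", "server2:/brick1", "server3:/brick1"]
print(pick_brick("report.pdf", bricks))

# The same name always maps to the same brick, on every client:
assert pick_brick("report.pdf", bricks) == pick_brick("report.pdf", bricks)
```

The trade-off of this naive modulo scheme is that adding a brick remaps most files, which is exactly why real systems use consistent hashing with ranges rather than a plain modulo.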
So in that case, if you can form a cluster with all these Docker containers and run etcd in them, you can still leverage a distributed consistent store in the form of etcd and keep the data in any of those containers. Say, for example, you are running 10 containers and you want to set up an etcd cluster on three of them: you definitely need some tweaking in your containers so that they understand which nodes are part of the etcd cluster and which are not.

But we need some tweaking, right?

Yeah, you need to do that. Any more questions?

I want to ask you one thing. We are going with a configuration management approach, where we maintain our configurations as infrastructure as code. But again, when we talk about etcd, it has to provide the configuration to the whole cluster, right? So if I have configuration management software, will it be possible, or will there be a conflict?

In my opinion, etcd is not meant for configuration management. There are several other tools which you can explore for configuration management, different orchestration tools like Chef and Puppet.

Correct. So what I am saying is, with configuration management software, if you run a client then you get all the configurations on the host. With the etcd approach you have your local configurations and you fetch them from the local cluster; there is no need to contact a master somewhere outside the cluster. So now the issue is, if I have configuration management software, I cannot have five machines dedicated to a configuration management store.

You do not need to do that.
My requirement was that I do not want to set up etcd on all the nodes in my cluster.

But if you do not want that kind of segregation, you can still go ahead and choose to run etcd on all of your nodes, because etcd does not restrict you from doing that. And etcd will provide the guarantee that you only need to send your request to one particular node to get back the configuration data. If you really want to use etcd as a data store, you can always write API hooks into your Chef or Puppet setup and fetch the data from there. Data bags, for example: in Chef you have data bags, which are nothing but a JSON store. So you can always put the data as key-value pairs into etcd, but obviously you have to write a wrapper to fetch the data from there; there is no out-of-the-box solution here.

Yes, yes. Okay. Thank you. Thank you.