Hi everybody, my name is Suresh Srinivas. I have been working on HDFS for more than three years now. I worked on HDFS at Yahoo, where we worked both on stability-related features, such as the scalability of HDFS, and on new features such as federation. I then moved over to Hortonworks; I am one of the co-founders there. I am also a committer and PMC member for Apache Hadoop.

In today's talk I will cover the background of HDFS availability: how it achieves fault tolerance, what issues still exist and how we are solving them, mainly the name node being a single point of failure, the high availability use cases related to that, and the design of the solution. For folks who want to try it out, I will go over what is currently available and what is still under development. We will also cover some future work, and I will leave plenty of time for questions and answers. You can ask me questions about HDFS in general; they need not be specific to this topic.

One of the key things with HDFS is that early on at Yahoo, when we were trying to solve problems related to big data, HDFS was of course influenced by GFS, and we carried forward some of the key decisions that GFS made. We also made our own simplifying design decisions, and that really helped in getting HDFS stable, scalable, and production ready. We did that in a year or so, and I think some of the other talks touch upon that history.

One of the key simplifying decisions we made for HDFS was not to write our own storage layer or use the raw block device. Instead we decided to rely on a local file system such as ext3 or ext4. This was a key decision because we did not have to spend a lot of time stabilizing the system and dealing with the data corruption issues that come with storing data on a raw block device; instead we could rely on the stability of the file systems themselves.

The other key thing with HDFS, and this is true of GFS also, is something people miss when they look at the architecture: earlier systems were designed with RAID as the form of fault tolerance. You choose an appropriate RAID level, and if a disk fails you immediately add another disk and rebuild the array. Here the choice was different: we decided to go with multiple replicas. The whole design is based on the assumption that nodes fail, disks fail, and racks can go away in a cluster. When you take that into your design, multiple replicas help: if a node goes away, there is an active monitoring component in HDFS that notices the replicas that have been lost and re-replicates them, versus having to run out, put another disk into the array, and rebuild the RAID. Also, HDFS is typically configured with maybe 10% excess capacity, so you can accumulate these kinds of failures over a period of many months and then go address them, whereas with RAID two disk failures might lose data. So that was another key decision.

The other thing a lot of people comment about is that there is a single name node, a single master, in HDFS. But that was a key decision for developing HDFS quickly and making it production ready.
And what we ended up doing, just like GFS, is to have a single name node, a single master, and that master holds all of its metadata in memory. So you don't hit the disk at all and you can be more performant, and for larger cluster sizes we vertically scale the name node. For example, the biggest cluster that I know of is at Facebook, and we also have lots of very large clusters at Yahoo holding tens of petabytes of data. So that was a key decision in getting HDFS ready quickly.

The other thing, to tie it back to availability, is that keeping things simple also makes the system really robust. If we had built really complicated software, we would have had failures to deal with, and we could not have gone to production with those kinds of failures. So the simplicity was also key to robustness.

With all these decisions made, how well did it all work? In one of the studies we did at Yahoo, we lost roughly 19 blocks out of 329 million blocks, across 20,000 nodes in 10 different clusters. That study was done in 2009, when HDFS was still in its infancy. If you look at it, 19 blocks lost out of 329 million is seven nines of reliability. Of course, for a file system you never want to lose any data, and we have since fixed, in releases 20 and 21, the bugs behind those 19 lost blocks.

The other key thing, which is relevant to the topic I'm covering here, is how stable is the name node? Why did we not do name node high availability earlier? Why have we waited this long? One of the reasons is that in a study covering an 18-month period, we saw 22 failures across 25 different HDFS clusters. That is 0.58 failures per cluster per year, which is not very significant, especially for Yahoo's use case, which is batch processing. And out of those 22 failures, only eight could have benefited from having name node HA; the rest were failures that would have been triggered even if we had failed over to the other node. So given this, we considered HA a lower priority.

However, things are now changing in the Hadoop ecosystem. There are solutions built around Hadoop that have real-time requirements, and in those cases a name node failure affects production serving of data in real time. Also, Hadoop is now graduating from web companies to enterprises, and high availability is one of the key features enterprises expect.

So let me cover some of the related work and its relevance to this talk. There were earlier attempts at high availability for the name node, and from those efforts we learned quite a bit that we have applied to the solution we are now building. The earliest work was the backup name node, done in release 21. If you have run an HDFS cluster, you know that there is an active name node and a secondary name node; the idea of the secondary name node is that the active name node cannot do checkpointing itself, so the secondary name node acts as the checkpointer. The backup name node was an early implementation of a component we intended to use as a standby in the future: the edit log, the journal the name node keeps of every change happening in the file system, gets streamed to another node in real time, instead of the secondary name node just picking up the files and doing the checkpointing.
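To make the journaling idea concrete, here is a minimal sketch of what a standby or backup node does with the streamed edits; the types (EditLogRecord, Namespace, EditStream) are hypothetical stand-ins for illustration, not the actual Hadoop classes. The point is simply that every change is applied, in transaction order, to the standby's own in-memory namespace.

```java
import java.io.IOException;
import java.util.Iterator;

// Hypothetical stand-ins for illustration only.
interface Namespace { /* in-memory file system metadata */ }

interface EditLogRecord {
    long txId();                    // monotonically increasing transaction id
    void applyTo(Namespace ns);     // e.g. "add file", "delete file", "rename"
}

interface EditStream extends Iterator<EditLogRecord> { }

class EditLogTailer {
    private final Namespace namespace;
    private long lastAppliedTxId = -1;

    EditLogTailer(Namespace namespace) {
        this.namespace = namespace;
    }

    /** Apply every edit the active name node has journaled since we last caught up. */
    void catchUp(EditStream edits) throws IOException {
        while (edits.hasNext()) {
            EditLogRecord edit = edits.next();
            if (edit.txId() <= lastAppliedTxId) {
                continue;                    // already applied, skip duplicates
            }
            edit.applyTo(namespace);         // mutate the in-memory namespace
            lastAppliedTxId = edit.txId();
        }
    }
}
```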
With the backup name node, the edits are streamed in real time and the backup node is always in sync with the active node, so you could choose to use the backup node for some kinds of read loads if you wanted to. That was the early work, and we plan to use some of the components added there in the final HA solution. There is also the AvatarNode, done at Facebook. The main reason for unavailability in HDFS is upgrades and maintenance, and at Facebook they wanted a standby they could manually fail over to during upgrades and the like, so that their service is not affected. There are learnings from that which we have applied to the solution we are currently building. There is also a prototype we built long back using Linux-HA, and some of that has been incorporated as well, in terms of designing the interfaces. There is also a prototype done at eBay that uses some custom components. These are all the solutions we have looked at, and they have influenced our current design.

So let me quickly go over the terminology. The active name node in the solution is the name node that provides read and write access to the clients. It is also the name node that sends commands to the data nodes, to delete a block and things like that; only the active name node does that. The standby name node is a name node that is waiting to become active. Currently we don't use the standby to provide read operations to clients; in the future we could allow reads at the standby and thus scale the total read throughput.

There are also different kinds of standby. The current setup, if you look at release 20 and if you want to call it HA, is a cold standby. What the name node promises is that it persists its state without corruption; it ensures that the state is persisted in secondary storage. You could choose to bring up another name node using that persisted state, so essentially you have a cold standby solution: another machine that you bring up as a name node from the persisted state. That is what we have right now. You could also have a warm standby. A cold standby has no state from the active name node at all; it has zero state and has to load everything. A warm standby has already loaded the file system metadata and is slowly keeping pace with the active; it doesn't have all of the active's state, but it has some of it, so compared to a cold standby the failover is much faster and the warm standby becomes active much more quickly. And finally there is the hot standby, which is what we are building in the current HA solution, where the standby keeps pace with the active, tracks all the state changes, and can quickly become active during a failover.

So let's look at some of the high-level use cases for name node high availability. In one of the earlier slides I covered how HDFS is really robust and fault tolerant through replication. The current problem is that there is a single name node, and if that name node fails the entire HDFS cluster is unusable. One of the reasons a name node would be down is planned downtime, which is the main cause of service unavailability. This is the case where you make a configuration change and restart the name node.
You want to upgrade, and this downtime is longer for larger clusters. For smaller clusters you can restart the name node in one, two, three minutes, but for the large clusters at Yahoo it would take 30 minutes to restart a name node, so your unavailability is 30 minutes for planned downtime. This is one of the reasons the AvatarNode came into existence, and one reason why name node HA is critical. The other kind of use case is unplanned downtime. We have had cases where we were running clusters and the memory would go bad, or some other hardware error forces you to switch from one node to another. There are cases where the server becomes unresponsive. There are also software failures, where the name node process itself fails, or JVM-related issues. Those kinds of downtime are very infrequent, based on our observations at Yahoo.

Given these high-level use cases, what kinds of failures are we planning to support in the first cut? As is typical in HA, double failures are not supported: we have one active and one standby, and if both of them fail, you don't have a service. But we plan to support a single hardware failure; that is, if the active name node dies, the other one picks up the active role and continues to provide service. For software failures, some can be handled, but software failures that occur on both the active and the standby, the kind that recur on the new active after you fail over, cannot be handled.

Let's look at some of the deployment models. Today HDFS is deployed with a single name node and a secondary name node, and we want to continue to support that. Even with the high availability feature becoming available in 2.0, you can still run your previous configuration; it is backward compatible and you don't have to make any changes. You can also run a single name node configuration if you are doing a proof of concept or some beta testing, so using the HA configuration is not mandatory. We will also support two kinds of HA deployments. There are deployments where the administrator wants manual failover and does not want automatic failover; some administrators want to be in control of failover, so that mode needs support. In this mode the main cause of downtime is still handled, because during an upgrade the administrator is involved and can keep the service going. The other mode is active and standby with automatic failover, which is what we are currently working on: you have a hot standby, the system detects a problem with the active, and it automatically fails over.

So let's look at the high-level design. The key thing for HA is that there are two key pieces of information in the name node: the first is the namespace, that is, the file system namespace, and the second is the block locations. In order to have a warm or hot standby, the standby needs to have the same state as the active name node.
For the namespace, what we do is that when the standby starts up, it loads the same file system state that the active has loaded, and then it keeps up with the active through the edit log, the journaling the name node does for every change happening in the file system. Those changes are sent over to the standby, and the standby applies them to its own in-memory namespace; that is how it keeps its namespace hot. For block locations, the data nodes today register with one name node and send block reports saying "I have all these blocks"; they also periodically report "I received this block", so the name node knows which data node holds which block. That communication now happens with both the active and the standby: the data nodes register with both of them and send block reports, block-received messages, and all the other communication that establishes block locations at the name node.

The failover controller is a daemon that we decided to keep outside the name node. Just as in many other HA frameworks, it is a daemon that monitors the name node process and any other resources the name node requires in order to be active. It can then make a name node active or standby, and campaign for the active role, things like that. That is one of the other big components in this design.

With an active and a standby comes the problem of fencing. There is a condition called split brain where, if the two name nodes cannot communicate with each other, they might both end up thinking they are active, and then they might both try to perform activity that only an active should perform, such as writing to the edit log. You don't want two nodes writing to the edit log and corrupting it. Also, a name node can send block deletions to data nodes, and two name nodes sending deletions to different data nodes could end up removing all the replicas of a block. Those kinds of problems can happen, so there is some level of fencing, and also STONITH, to ensure there is only one active in the cluster.

Then there is client failover. Now that there are two name nodes and either one could be active at any point in time, a client should be able to figure out who is active and talk only to the active to get HDFS service. That is client-side failover; an alternative is to have a virtual IP that moves between the active and standby during a failover. That is the other component of the design. Some of this design is available in HDFS-1623; I have a few slides on where to get the information. We have posted the design keeping a lot of things open, so that people can tweak it and customize it to fit their deployment and environment.

So let's look at the failover controller itself. It is a separate daemon from the name node. The reason it is a separate process outside the name node is that it is a simple process, and it is very easy to get right; because of its simplicity it can be a lot more robust than a name node, which runs a much larger JVM and so on. Also, if the name node goes into GC pauses and things like that, those should not affect the failover controller, because it is the one making the key decisions in HDFS.
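As a rough illustration of that idea, here is a minimal sketch of what such a monitoring loop could look like; the interfaces (HealthCheck, FailoverCoordinator) are hypothetical stand-ins for the real health-check command and the coordination described next, not the actual Hadoop implementation.

```java
import java.util.concurrent.*;

// Hypothetical interfaces for illustration only.
interface HealthCheck {
    /** For example, run a lightweight RPC or admin command against the local name node. */
    boolean nameNodeIsHealthy();
}

interface FailoverCoordinator {
    /** For example, give up leadership so the other controller's name node becomes active. */
    void requestFailover(String reason);
}

class FailoverControllerSketch implements Runnable {
    private final HealthCheck check;
    private final FailoverCoordinator coordinator;
    private final long timeoutMillis;

    FailoverControllerSketch(HealthCheck check, FailoverCoordinator coordinator, long timeoutMillis) {
        this.check = check;
        this.coordinator = coordinator;
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public void run() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Run the check with a timeout so a hung name node (for example one stuck
            // in a long GC pause) is treated the same way as an unhealthy one.
            Future<Boolean> result = executor.submit(check::nameNodeIsHealthy);
            if (!result.get(timeoutMillis, TimeUnit.MILLISECONDS)) {
                coordinator.requestFailover("name node reported unhealthy");
            }
        } catch (TimeoutException e) {
            coordinator.requestFailover("health check timed out");
        } catch (Exception e) {
            coordinator.requestFailover("health check failed: " + e.getMessage());
        } finally {
            executor.shutdownNow();
        }
    }
}

// A real daemon would schedule this periodically, for example:
// Executors.newSingleThreadScheduledExecutor()
//     .scheduleAtFixedRate(controller, 0, 5, TimeUnit.SECONDS);
```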
What this failover controller does follows the same approach as some of the HA frameworks, where daemons like this monitor resources. For example, one controller could decide that its name node has lost IP connectivity and therefore cannot be active anymore, so it resigns, and the other controller can say the other node has lost its network connectivity, so I have to become active. These failover controllers model things as resources, and one of the resources is the name node: the controller periodically runs a name node command and makes sure the name node is up and responding in a timely manner; otherwise it declares the name node unhealthy and fails over. That is one of the functions of the failover controller. The failover controller also uses ZooKeeper.

So the question, looking at the split-brain picture, is: what if the standby is partitioned from the data nodes and does not receive block reports, and then comes back into contact with the data nodes; is there any consideration for that? I have a slide about that, but since you have asked: there are two kinds of shared resources for us. One is the location where we are writing the edit log; the other is the data nodes themselves, which are shared by the two name nodes. We have a fencing mechanism at the data node. Because the data nodes are constantly communicating with both name nodes, they know who the latest name node is through a transaction-ID mechanism, and they accept commands only from the active name node. There is arbitration built in: in a split-brain condition where two name nodes think they are active, it ensures that only one of them can truly be active, and the data node detects the other and does not accept any commands from the old active that is no longer actually active in the HA cluster. As for what happens when that node goes down and later comes back online with stale state, we can talk about that; those cases are handled in the design, but it is probably a longer discussion.

The next question is how the system adapts to Hadoop 0.23, which changes the notion of the name node because of federation. In our case HA is done per name node, and federation brings multiple name nodes, so for each name node you will have a standby; that is the only change. The only touch point between federation and HA is that in federation the same set of data nodes is used by multiple name nodes. Unlike before, where there was a single name node and if it was down the entire cluster was unusable, with federation if one name node is down only that part of the namespace is down; the rest of the name nodes are still available and you can continue to use the cluster, so the entire cluster is not down. Then, is HA needed at all, given that federation itself would give high availability? Federation is not high availability. Federation is multiple name nodes; it is about scalability, multiple name nodes sharing the same underlying infrastructure. One name node going away does not make all of that infrastructure useless, but the part of the namespace it serves is still unavailable, so you still need HA. And we have added fencing for the split-brain case, and that fencing is driven by the failover controller.
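Here is a simplified sketch of the data-node-side arbitration just described, again with hypothetical names rather than the actual Hadoop classes: each command carries the sending name node's latest transaction ID, the data node remembers the most current name node claiming to be active, and it ignores commands from a stale active left over from a split-brain situation.

```java
// Hypothetical command envelope from a name node to a data node.
class NameNodeCommand {
    final String senderId;            // which name node sent this command
    final long senderTxId;            // the sender's latest edit-log transaction id
    final boolean senderClaimsActive; // whether the sender believes it is the active

    NameNodeCommand(String senderId, long senderTxId, boolean senderClaimsActive) {
        this.senderId = senderId;
        this.senderTxId = senderTxId;
        this.senderClaimsActive = senderClaimsActive;
    }
}

// Data-node-side arbitration: only the most current "active" is obeyed.
class DataNodeFencing {
    private long highestActiveTxId = -1;
    private String currentActive = null;

    /** Returns true if the data node should act on this command. */
    synchronized boolean accept(NameNodeCommand cmd) {
        if (!cmd.senderClaimsActive) {
            return false;                       // standbys never issue mutating commands
        }
        if (cmd.senderTxId >= highestActiveTxId) {
            highestActiveTxId = cmd.senderTxId; // this sender is at least as current
            currentActive = cmd.senderId;
            return true;
        }
        // A name node that claims to be active but is behind the one we have
        // already heard from is treated as stale and its commands are ignored.
        return false;
    }
}
```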
So, to recap what we have discussed: there is an active name node and a standby name node, and they share two pieces of state. One is the namespace. Today the namespace is loaded by the standby, and the standby keeps up with the active name node through shared storage where the journaling is done; that is the current state of affairs. As new files are added or deleted, the active name node writes those changes to the edit log, the standby reads them and applies them to its own in-memory state, and that is how it keeps up with the active. The data nodes register with both the active and the standby and send block reports and other messages, so you also have block locations in both places. The failover controller is the component that chooses an active in the cluster; it monitors the name node, and if the name node is not responding it is the one that triggers a failover. How does a failover controller decide to make a name node active? It is done by leader election using ZooKeeper: a failover controller gets elected as the leader (there can be only one leader in the entire cluster), and that failover controller makes its local name node active. That is the overall shape of the design.

Now let us look at what work has been completed on name node high availability and what you can use today in release 2.0, in case you want to play around with it. We have added the notion of active and standby states to the name node. The current solution supports one active name node and a single standby name node; we do not have support for multiple standbys. We modified the standby name node to perform checkpointing, so you don't need an active, a standby, and a separate checkpointer or secondary name node; the standby does the checkpointing that used to be done by the secondary name node. The current solution uses shared NFS storage: the active writes its edits to the NFS storage and the standby reads them from it, and that is how the namespace modifications are shared. Given that there is shared storage, we also need fencing, and the data nodes are shared resources too, so we have fencing of the data nodes, and we have STONITH based on a plugin model, so you can supply different kinds of scripts, including one for a network-connected power switch; there is a tool available that the failover controller can execute to fence the other node if you want to make a name node active. We also have support for client-side failover, which is based on configuration: instead of using a specific name node host, you use a logical URI to access HDFS, that logical URI is mapped to the two name node addresses, and the client side determines who is active and connects to it; if the active fails, the client fails over to the other name node. That is how client-side failover works.

So what is pending and still under development? The current solution, as I said, supports only manual failover; the operator is the one who fails over, choosing one of the name nodes to be active. We are adding support for automatic failover, going back to the earlier picture: you have the failover controller working with ZooKeeper, it chooses an active name node, and it performs a failover if a name node becomes unhealthy.
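To make the client-side failover concrete, here is roughly what that configuration looks like from a Java client, using the configuration keys as they shipped in Hadoop 2.x (the exact key names were still settling at the time of this talk); the logical name "mycluster" and the host names are just examples.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical name for the HA pair; clients never name a specific host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");

        // The two name nodes behind the logical name (example hosts and ports).
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Client-side logic that finds the active name node and retries against
        // the other one when a failover happens.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client only ever refers to the logical URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("root exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}
```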
The other thing we are doing comes from feedback we have been receiving. We have built a solution that depends on NFS for shared storage, and that shared storage itself becomes a single point of failure: if it is not available, the standby cannot even keep up, and in the current solution, if the shared storage dies, the active name node shuts down. Essentially you have moved your single point of failure from the name node to the shared storage. To avoid that, we are adding a capability where we run a set of daemons; there is a protocol that was introduced for the backup node where a name node can stream edits to another server, so we will have multiple of those daemons running, the name node will stream multiple copies of its edit log to these different daemons, and you keep multiple copies for reliability. You eliminate the need for NFS; instead it becomes an internal HDFS component, so there are no external dependencies anymore, and because we run multiple of them there is no single point of failure. There is also some extra work needed for management and monitoring: now that you have an active and a standby, you need a complete view of how the HA cluster is doing. There is also a need for more tests that do fault injection; these HA kinds of solutions are very hard to test, and the only way to test them is to create boundary conditions through fault injection, so we need fault-injection tests and a lot more testing in general. The plan is that HA will be available in one of the 2.x releases, along with other exciting features that are coming out, such as wire compatibility and WebHDFS.

What might we do in the future related to high availability? Today we support a single active and a single standby. Some deployments may want multiple standbys; that way, if you bring down a standby for maintenance, you still have another standby to take over, and you can handle multiple failures. The other thing we are exploring is that currently the failover is all built on the client side, where the client is the one that fails over. We want to also do it using a virtual IP, where an IP address fails over along with the active name node, so you just connect to one IP address and that is where the active is running. That has the advantage of not only simplifying the failover; it also simplifies protocols such as HTTP, where you just go to a URL and you don't need client-side intelligence about where to connect, or a proxy front end. The detailed design of this feature is in HDFS-1623. We developed manual failover in a separate branch and merged it when it was completed, but the automatic failover and journal daemon work is still going on; to track that work we have an umbrella JIRA, so please take a look at it to see the progress and understand what is pending. Any other questions related to HA or HDFS?

So the question is what daemons will be running for HA. The way it works is that there are two nodes on which you want to run a name node process, and each of those nodes runs a name node process and a failover controller: on one node there is a name node process and a failover controller, and on the standby node you likewise have a failover controller and a name node process. One name node process is in the
active state and the other is in standby.

So the next question is not related to this topic: what was the bottleneck, why was HDFS not able to overwrite data? Random writes are a lot harder when you have a pipeline, because a pipeline results in very strange boundary conditions. If you want to appreciate the complexity involved in the pipeline, you should look at the new append design, which talks about how, when you are writing in a pipeline, the acknowledgements need to flow back, how the generation stamp needs to be updated, things like that. Append itself added a lot more complexity. Interestingly, with append, if one of the data nodes falls off the pipeline, you at least have the length, which keeps changing and which differs between the data nodes that stayed in the pipeline and the one that fell off; the generation stamp is also updated, but the key thing is that there is a length mismatch you can reason about. With random writes, when you overwrite parts of a file, you don't even have a length that is changing, so there are some interesting issues when you do random writes in a pipeline. That is why you may have seen discussion around whether we should turn off append, because there is a lot of complexity; I would think random writes are at least two times more complex than append, and even append was only introduced recently.

So what was the bottleneck that was preventing append? Append was introduced in release 19, and just to give you the background, it was done for HBase. HBase writes its own journal to HDFS, and the only way you could ensure that data was made durable on HDFS was by closing the file, so the only way it would work before append and flush was to open a file, write some things, close it, and end up creating lots and lots of files. In release 19 the main requirement for HBase was this flush: the ability to say, I have flushed the data onto HDFS, it is no longer in my process's memory space, it has gone to HDFS, and HDFS needs to make it durable. Along with that we also added append, although the append feature is not used, as far as I know, by any application, including HBase. So in release 19 a first cut of append was done, and initially some failure scenarios and boundary conditions were not thought through, which resulted in data corruption, and we turned append off. Append was then redone in a separate branch and has become really stable; though it is called the append feature, it is actually hflush that HBase uses. The feature was rewritten in release 21, considering all the failures we had observed. The interesting thing is that append was first done in release 19, release 21 re-implemented it, and some of that design went back into release 20, where, as you know, we fixed more bugs after that.

The next question: you mentioned that the standby name node is doing the checkpointing job that the secondary name node used to do; the secondary name node was only doing checkpointing, but now the standby is handling block reports as well as checkpointing, so have you seen any performance implications? So the question is that the standby is now doing both checkpointing and the duties of a standby, that is, keeping up with the active name node, handling block reports and so on, and whether we have seen any performance implications because of this. When you are doing checkpointing, essentially what you are doing on the standby is that you have accumulated edit logs up to a certain finalized point,
and at that point you just need to write what is in your memory onto the disk. Typically, on the largest clusters I have seen an fsimage of around 10 GB, and it doesn't take a lot of time to write that. While you are writing it you have stopped keeping in sync with the active and with the block reports, but it is very easy to catch up afterwards, because we are not serving any write operations at all on the standby, and currently we are not serving reads out of the standby either; in the future we might serve reads out of the standby just to scale reads. So I don't think it should be a problem. The only complexity is that if the active fails while you are checkpointing and you have to take over its duty, you have to abandon the checkpoint and become active, which is not a big deal. Any other questions?

The next question is about the daemon we are writing: how similar or different is it from popular daemons like keepalived or Linux-HA? I mentioned Linux-HA some time back in one of the slides; we did consider using Linux-HA as a proof of concept earlier. If you look at how we did the development, we first built a protocol into the name node. Typically what HA frameworks do is model a resource, and for a resource they need commands such as start, stop, monitor, become active, and become standby, typically five or six commands. It is a nice way to model a resource and enable an HA framework, so we built similar commands for the name node, and nothing should prevent someone from using Linux-HA, with the Linux-HA framework itself driving these same interfaces to provide high availability. The problems I have seen with that are, first, that in the case of Red Hat, Linux-HA is not free; second, it is an external component, so the packaging, installation, and management would all be different, and if you look at the failover controller itself, it is not rocket science, it is not a complicated piece of code. The other thing with Linux-HA is that it has policy engines where you can configure various things, such as the weight for which name node should become active, fail-back behavior, and a whole bunch of rules; we don't want that level of sophistication to start with, so we want all native components right now.

On comparisons with other file systems: there are a whole bunch of websites that compare file systems against HDFS, and I'm sure you can find something there. If you have a specific question about how this compares with what was already out there, I can address it. The comment is that a lot of those functionalities are already available elsewhere, performing at whatever level they perform, so what is being addressed here differently? I think the architectures are completely different. All we are doing here is this: most of the components in HDFS are designed on the assumption that they will fail and that those failures will be handled by HDFS, except for the name node. So all we are doing is building a high availability solution for that single master, and it may not apply to a file system that does not have a single master. In our case there is a single master, the master is not highly available, and that is what we are trying to fix. MapR also provides a distributed namespace, so how is that different? I don't know enough about their architecture, because I
cannot read about it the way I can about HDFS; our code is out there and anyone can look into it. From what I know, it is not truly distributed: there are volumes, the volumes are made up of chunks, and each chunk is managed by a name node. The understanding I have is that the file system namespace, along with its data, is handled so that three or four nodes manage the chunks, and from what I understand there is a component called the CLDB that pieces all of this together: you go to the CLDB and ask where a volume is managed, and then you go to that place. If you make that comparison, federation has something called a client-side mount table instead of a CLDB: all of that directory information is in one configuration, a client can just load it and know where to go, and it is done transparently. Now, if you take federation and make every node a name node along with its own data node, I think that is sort of what they are doing, but I don't know their architecture very well. Any other questions? OK, thanks for listening to me.