HDFS, that is, Hadoop Distributed File System: architecture. This is Dr. Neeta Pooja, Professor in the Computer Science and Engineering Department at Wolchen Institute of Technology, Solaapur. At the end of this session, learners will be familiar with the HDFS architecture and its components. The prerequisite is that the learner should have knowledge of distributed file systems, that is, data storage and data processing in distributed file systems.

Now, let us start with the key points of the Hadoop Distributed File System, HDFS. First of all, it is the storage component of Hadoop, so it uses a distributed storage concept. Second, it is a distributed file system modeled after the Google File System, and it is optimized for high throughput. It is designed so that maximum throughput is achieved, that is, the maximum number of user requests are satisfied or user tasks are executed. It provides a replication strategy: we can replicate a file on a configured number of nodes, which makes it tolerant of both software and hardware failures. The data blocks that were held on failed nodes are re-replicated automatically on other nodes, so that data availability is not hampered.

The power of HDFS mainly lies in read and write operations on large files. As I have already told you, Hadoop is not suitable for executing small applications; its power is realized when it is used for executing large applications. HDFS sits on top of a native file system such as ext3 or ext4, that is, the operating system's file system. Now let us see this figure: here is HDFS, which sits on top of the OS file system (ext3, ext4), and below that is the disk storage. These are the multiple data blocks that are stored on it.
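The replication idea above can be sketched in a few lines. This is a conceptual illustration only: real HDFS placement is rack-aware and far more involved, and the function and node names here are invented for the example.

```python
import random

def place_replicas(block_id, data_nodes, replication=3):
    """Choose `replication` distinct data nodes to hold copies of a block.
    Real HDFS is rack-aware; this sketch simply picks distinct nodes at
    random to show that each block lives on several independent machines."""
    return random.sample(data_nodes, replication)

# Hypothetical cluster of four data nodes.
nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
placement = place_replicas("blk_0001", nodes)
print(placement)  # three distinct node names from the cluster
```

Because the three replicas sit on different machines, the loss of any one node (software or hardware failure) still leaves two live copies from which the block can be re-replicated.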
So, the key points of the Hadoop Distributed File System: first of all, it uses the concept of a block-structured file. The file data is divided into blocks, so it is considered block storage. The default replication factor is 3, that is, every block is replicated on at least 3 nodes, and the default block size is 64 MB.

Now let us go to the HDFS daemons, that is, the different processes or components of HDFS. The first one is the name node. If we consider a master-slave architecture, the name node is the master node. The name node breaks a large file into smaller pieces. As I have already told you, every node in Hadoop consists of at least two components, HDFS and MapReduce. So HDFS on the name node breaks a large file into small chunks, which are called blocks, and it uses a rack ID, because all the slave nodes, or you can say data nodes, are organized into racks. A rack is a collection of data nodes within the cluster, and the name node uses a rack ID to identify the data nodes in a rack.

Now, the name node has to keep track of the blocks of a file and the location of those blocks, that is, on which data nodes they are placed. This is called the metadata, and it is maintained by the name node. The name node manages file-related operations such as read, write, create, and delete. Its main job is to manage the file system namespace, that is, the complete directory structure of the file system. A file system namespace is the collection of files in the cluster, so the name node stores the HDFS namespace. The file system namespace includes the mapping of blocks to files, that is, which blocks belong to which file, as well as which blocks are placed on which data nodes. File properties are also maintained, and all this information is stored in a file called the FSImage. So the FSImage is part of the metadata.
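The way a file is broken into fixed-size blocks can be sketched as below. This is a minimal illustration, assuming only the 64 MB default block size mentioned above; the function name is invented and the real name node does not work byte-by-byte like this.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size: 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs for a file of `file_size` bytes.
    Every block is full-size except possibly the last (the tail)."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 150 MB file becomes two full 64 MB blocks plus one 22 MB tail block.
blocks = split_into_blocks(150 * 1024 * 1024)
print(len(blocks))  # 3
```

With the default replication factor of 3, each of these blocks would then be stored on three different data nodes.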
The name node uses an edit log to record every transaction that happens to the file system metadata. Whenever some change is made to the metadata, a record of that change is stored as a transaction in the edit log.

Now, this is the HDFS architecture. This is the client application; the name node, also called the master node; and these are the slave nodes, also called data nodes. It is named the HDFS client because the client submits the application, that is, the task, to the name node. For example, suppose there is a file sample.txt that the client wants to access, and that file has three blocks: block A, block B, and block C. Block A is stored on node A, node B, and node C. Similarly, block B is stored on node A, node B, and node C, and block C is also stored on node A, node B, and node C. That means block A has its replicas maintained on nodes A, B, and C, and likewise for blocks B and C. So if you look at these data nodes separately, each data node holds all three blocks of the file sample.txt. The client application interacts directly with the data nodes as well as the name node. So this is how the HDFS architecture looks.

To summarize, the main job of the name node is to manage all file-related operations: create, read, insert, delete. It maintains metadata called the FSImage. Now you know what the FSImage consists of: the file system namespace, the mapping of blocks to data nodes, the mapping of blocks to files, and the file properties. All this data is maintained in the FSImage, so the entire file system is represented in the FSImage.
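The metadata the name node holds for sample.txt can be pictured as two mappings: file to blocks, and block to data nodes. The dictionary below is a hypothetical in-memory view for the example above; the key names are illustrative and do not correspond to real FSImage structures.

```python
# Illustrative model of name-node metadata for the sample.txt example.
fsimage = {
    "files": {
        "/sample.txt": ["blockA", "blockB", "blockC"],  # file -> its blocks
    },
    "block_locations": {  # block -> data nodes holding a replica
        "blockA": ["nodeA", "nodeB", "nodeC"],
        "blockB": ["nodeA", "nodeB", "nodeC"],
        "blockC": ["nodeA", "nodeB", "nodeC"],
    },
}

def locate(path):
    """Resolve a file path to the data nodes holding each of its blocks,
    exactly the lookup a client needs before it contacts the data nodes."""
    return [(b, fsimage["block_locations"][b]) for b in fsimage["files"][path]]

print(locate("/sample.txt")[0])  # ('blockA', ['nodeA', 'nodeB', 'nodeC'])
```

This is why the client talks to the name node first (to learn where the blocks are) and then reads the block data directly from the data nodes.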
Then it maintains one more piece of metadata, called the edit log, which records every transaction that occurs to the file system metadata. So these are the two important pieces of metadata present at the name node.

Now let us pause the video for a while; think and write the answer to this question. Does the FSImage consist of the file system namespace, the location of blocks on the data nodes, the file properties, or all three? As we have just discussed, the FSImage consists of the complete file system namespace, the mapping of blocks to the various data nodes, and also the file properties. So the fourth option, all of the above, is the correct answer to this question.

Now moving on to the data nodes. We have just seen what metadata the name node generates and what functions it performs. Next come the data nodes, that is, the slave nodes. There are multiple data nodes in the cluster: in this simple architecture, one cluster has one master node and multiple data nodes. During a pipelined read or write, data nodes may need to communicate with each other, because the output of one data node may be used as input by another data node; so there is communication between the data nodes. Each data node also continuously sends a heartbeat message to the name node to show that the connectivity between the name node and the data node is up, that is, the data node is in running or working status. If no heartbeat arrives from a data node, the name node concludes that the data node has gone down. It then replicates the data that was on that data node to other nodes and keeps running as if nothing has happened. The data node and name node communication is shown in this diagram: all the data nodes communicate with the name node through heartbeat messages.
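The heartbeat check can be sketched as a simple timeout rule: a node whose last heartbeat is too old is presumed dead. The timeout value and function name below are illustrative assumptions, not the real HDFS defaults.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not the real HDFS setting

def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Data nodes whose last heartbeat is older than `timeout` are presumed
    dead; the name node would then re-replicate their blocks elsewhere."""
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

# nodeC last reported 15 s ago, past the timeout, so it is presumed dead.
heartbeats = {"nodeA": 100.0, "nodeB": 95.0, "nodeC": 85.0}
print(find_dead_nodes(heartbeats, now=100.0))  # ['nodeC']
```

Once a node appears in this list, the replication strategy guarantees other copies of its blocks still exist, so the name node can restore the replication factor without any downtime.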
If there is no heartbeat message from a particular data node, the name node replicates whatever data was present on that data node onto other available data nodes. This happens very smoothly, without bringing the system down.

The third important component of HDFS is the secondary name node. It acts as a backup for the name node, that is, the master node. It takes a snapshot of the HDFS metadata on the name node at a fixed time interval. Since the memory requirement of the secondary name node is the same as that of the primary name node, it is always better to keep these two nodes on different machines; they should not be on the same machine. At every fixed time interval, the status of the primary name node is copied to the secondary name node. In case of failure of the name node, the secondary name node can be configured manually to bring the cluster back into a working state. However, the secondary name node does not record the real-time changes that happen to the HDFS metadata.

Now, the special features of HDFS. First, data replication: there is absolutely no need for a client application to track all the blocks. HDFS directs the client to the nearest replica to ensure high performance. You know that HDFS uses a replication strategy, but the client application is never burdened with tracking it, or with seeing that the data is replicated on other nodes; that is done automatically by HDFS.

Next comes the data pipeline. A client application writes a block to the first data node in the pipeline; that data node then takes over and forwards the data to the next node in the pipeline. This process continues until all the data blocks, and subsequently all the replicas, are written to disk.

These are some of the references that I used to prepare this video. Thank you.
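The pipelined write described above can be sketched as a chain: each node stores the block locally and then forwards it to the next node. This is a simplified model with invented names; real HDFS data nodes stream packets concurrently and acknowledge back up the pipeline.

```python
def pipeline_write(block, pipeline):
    """Sketch of a pipelined write: the client hands the block to the first
    data node, which stores it and forwards it down the chain. Returns the
    nodes that stored a replica, in pipeline order."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    head["disk"].append(block)                           # write locally
    return [head["name"]] + pipeline_write(block, rest)  # forward downstream

# Hypothetical pipeline of three data nodes for one block.
nodes = [{"name": n, "disk": []} for n in ("nodeA", "nodeB", "nodeC")]
print(pipeline_write("blockA", nodes))  # ['nodeA', 'nodeB', 'nodeC']
```

The key point the sketch captures is that the client sends the block only once; the data nodes themselves propagate it until all replicas are on disk.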