Welcome back to a session on Hadoop architecture and its special features. This is Dr. Anita Poojar, professor in the Computer Science and Engineering department, Institute of Technology, Solapur. At the end of this session, learners will be able to explain Hadoop architecture and its components.

Now, let us start with the Hadoop overview. What is Hadoop? As we have seen in the previous session, it is a free and open-source software framework for storing and processing huge amounts of data, that is, big data, as well as for the distributed processing of this huge data on large clusters of commodity hardware. Commodity hardware is low-cost, cheap hardware. Basically, Hadoop accomplishes two tasks: first, massive data storage, that is, storage of big data, and second, faster data processing, that is, faster processing of the big data.

Now, let us see some of the key aspects of Hadoop. First, it is free and open-source software: it is free to download and use, and anyone can contribute to it, so students or learners can add additional features to it. It is a framework, which means it provides everything that is needed to develop and execute user applications: libraries, APIs, software tools, utility programs, support for clusters of computers, scalability features, fault tolerance, and so on. Then, it is a distributed architecture: it divides and stores data on multiple computers, and it processes the data with the help of multiple computers. It provides massive storage of data, because by using clusters of computers we can provide a huge amount of storage for the big data. Finally, it uses large numbers of commodity machines to process the data in parallel.
Users can therefore be given a quick response, because the big data is processed in parallel with the help of multiple computers.

Let us move to the Hadoop components. The Hadoop architecture consists of four main components. The first is MapReduce, a programming model used for distributed processing. Then comes HDFS, the Hadoop Distributed File System, which is modeled after the Google File System; it implements the distributed file storage concept, so it mainly provides distributed storage across multiple computers. Then comes YARN, Yet Another Resource Negotiator, a very important component in Hadoop because it acts as the resource manager: it takes the client's application, divides it into sub-tasks, and allocates them to separate computers, the slave nodes, for execution. Finally comes Hadoop Common, the component that provides the Java libraries and utilities required to develop and execute applications. These are the four important components of the Hadoop architecture.

Now we will pause the video; think about the given question and try to answer it. The question is: YARN architecture basically separates the resource management layer from the processing layer, true or false? Before I explain this question, let me tell you that Hadoop initially came in two versions, Hadoop 1.0 followed by Hadoop 2.0. In Hadoop 1.0 there were only two components: MapReduce for distributed processing and HDFS for distributed storage. MapReduce was therefore allocated the task of resource management as well as distributed processing, so MapReduce was overburdened in Hadoop 1.0 with both tasks. Since the whole of Hadoop 1.0 was tightly coupled around MapReduce, this created many problems. First of all, MapReduce was written purely in Java, so we needed very good Java experts to write MapReduce programs and applications.
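The MapReduce programming model described above can be illustrated with a tiny word-count sketch in plain Python. This is a single-process simulation of the map, shuffle, and reduce phases, not the actual Hadoop Java API; the documents and function names are made up for the example:

```python
from collections import defaultdict

# Toy word count illustrating the MapReduce programming model.
# In real Hadoop, mappers and reducers run on different slave nodes.

def map_phase(document):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate pairs by key (the framework does this in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "hadoop processes big data"]
intermediate = []
for doc in documents:          # mappers would run in parallel, one per split
    intermediate.extend(map_phase(doc))
result = reduce_phase(shuffle(intermediate))
print(result)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

The point of the model is that the map and reduce steps are independent per key, which is what lets Hadoop spread them across many slave nodes.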
Second, both functions, resource management and data processing, were done by MapReduce alone. In Hadoop 2.0, this problem was overcome by introducing one more component, YARN, which took over the resource management functions. We were thus able to separate the resource management layer from the processing layer, that is, from MapReduce. So the answer to this question is true: since YARN was introduced into the architecture, it separated the resource management layer from the processing layer.

Then we come to the high-level architecture of Hadoop, which we see as two layers: a data storage layer, where HDFS is used for distributed storage of the huge amount of data, and a data processing layer, which consists of MapReduce for processing the big data. Hadoop is a distributed master-slave architecture: it consists of a single master node per cluster and multiple slave nodes. In Hadoop terminology, the master node is known as the name node and the slave nodes are known as data nodes. Actually, every node in Hadoop contains both HDFS and MapReduce components, HDFS for data storage and MapReduce for data processing, and we find these two components on the master node as well. The master's HDFS component is mainly responsible for partitioning the data and distributing the data chunks across the multiple slave nodes. Along with that, it maintains metadata that keeps track of the locations of data on the different data nodes, that is, which data blocks are stored on which data nodes. The master's MapReduce component decides and schedules the computation tasks on the slave nodes: it creates the sub-tasks and allocates them to multiple slave nodes.
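The name node's metadata role described above can be sketched as a mapping from blocks to data nodes. This is only an illustration of the idea: the block size, node names, and round-robin placement below are made up for the demo (real HDFS uses 128 MB blocks, replication, and smarter placement):

```python
# Illustrative sketch of name-node metadata: which block of which file
# lives on which data node. All names and sizes here are hypothetical.

BLOCK_SIZE = 4          # bytes; tiny on purpose for the demo
DATA_NODES = ["datanode1", "datanode2", "datanode3"]

def store_file(name, content, metadata):
    # Partition the file into fixed-size blocks, assign each block to a
    # data node round-robin, and record the locations in the metadata.
    blocks = [content[i:i + BLOCK_SIZE]
              for i in range(0, len(content), BLOCK_SIZE)]
    metadata[name] = []
    for i, _block in enumerate(blocks):
        node = DATA_NODES[i % len(DATA_NODES)]
        metadata[name].append((f"{name}-blk{i}", node))
    return metadata

meta = store_file("log.txt", "abcdefghij", {})
print(meta["log.txt"])
# [('log.txt-blk0', 'datanode1'), ('log.txt-blk1', 'datanode2'),
#  ('log.txt-blk2', 'datanode3')]
```

A client asking for `log.txt` would consult this metadata on the name node, then fetch the blocks directly from the listed data nodes.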
So if you look at the architecture, there is a master node which consists of two components, one for computation (MapReduce) and one for storage (HDFS), and the same is true of all the slave nodes.

Now, let us move to some special features of Hadoop. First, Hadoop is suitable for big data analysis. Since big data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for the analysis of such data. Because the data is unstructured, an RDBMS cannot be used, but NoSQL databases can be used to store it; we will be studying NoSQL databases further as well. And remember, Hadoop is not suitable for small applications, because executing a small application incurs a lot of overhead: initiating the name node, initiating the data nodes, and the protocol exchanges between them all take a considerable amount of time, which is not justified for small jobs. So Hadoop is most preferable for big applications, that is, for big data analysis.

The second very good feature of Hadoop is that we move the code to the data; we do not move the data to the code, because the code that processes the data is very small compared to the data itself. It is therefore always more efficient to move the code to the computing nodes, and less network bandwidth is consumed because the network traffic is much lower. This concept of moving the code to the data is called data locality, and it helps increase the efficiency of Hadoop-based applications.

The next special feature of Hadoop is scalability. Hadoop clusters can scale to any extent, as required by the size of the big data or of the application. The best thing to note here is that the system is never down during scaling, and scaling does not require any modifications to the application logic.
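The data-locality idea described above, running the task where the data already is, can be sketched as a tiny scheduling function. The block and node names are hypothetical, and real YARN scheduling is far more involved; this only shows the preference for local replicas:

```python
# Sketch of data locality: schedule a task on a node that already holds
# the needed block, instead of shipping the block over the network.
# Block and node names below are made up for the illustration.

block_locations = {
    "blk0": {"nodeA", "nodeB"},   # nodes holding replicas of blk0
    "blk1": {"nodeB", "nodeC"},
}

def schedule(block, free_nodes):
    # Prefer a free node that has a local replica of the block; fall back
    # to a remote node (which would need a network transfer) otherwise.
    local = block_locations[block] & free_nodes
    if local:
        return sorted(local)[0], "local"
    return sorted(free_nodes)[0], "remote"

print(schedule("blk0", {"nodeB", "nodeC"}))  # nodeB holds blk0 locally
print(schedule("blk1", {"nodeA"}))           # no free node holds blk1
```

Only the small task code travels in the first case; the large data block never crosses the network, which is exactly why data locality saves bandwidth.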
The third feature is fault tolerance. The Hadoop ecosystem has a provision to replicate the data on multiple nodes within a cluster or across clusters. Due to this replication strategy, even if one node goes down, the data is still available on the other nodes, so data availability is never interrupted: client applications or client requests can be served using the replicas on the other data nodes. Likewise, if any one system goes down, the tasks allocated to it can be reallocated to another available node, so the execution of the applications does not stop either. These are the three best features of Hadoop.

These are some of the references used to prepare this video session. Thank you.
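As a closing illustration, the replication strategy behind fault tolerance can be sketched as follows. The node and block names are made up; the replication factor of 3 matches HDFS's default, but the failure-detection and re-replication machinery of real HDFS is omitted:

```python
# Sketch of replication-based fault tolerance: each block is stored on
# several nodes, so a read succeeds as long as one replica's node is up.

replicas = {"blk0": ["node1", "node2", "node3"]}   # replication factor 3
alive = {"node1": False, "node2": True, "node3": True}  # node1 has failed

def read_block(block):
    # Try each replica holder in turn; fail only if every one is down.
    for node in replicas[block]:
        if alive[node]:
            return f"read {block} from {node}"
    raise IOError(f"all replicas of {block} unavailable")

print(read_block("blk0"))  # skips failed node1, reads from node2
```

The same fallback idea applies to computation: a task assigned to the failed node is simply rescheduled on another node that holds a replica of its data.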