on an introduction to Hadoop. This is Dr. Neeta Puja, professor in the Computer Science and Engineering department at Vulture Institute of Technology, Solar Power. At the end of this session, learners will be able to comprehend why Hadoop is required, what exactly Hadoop is, and what the different features of Hadoop are. The prerequisite is that learners should have knowledge of RDBMS and, most importantly, of distributed file systems.

Now, what is Hadoop? It is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a very simple programming model. It is an open-source project of the Apache Software Foundation. It has the ability to store and analyze data kept on different machines at different locations very quickly and in a very cost-effective manner; it is cost effective precisely because it uses commodity hardware. It uses the concept of MapReduce, which enables it to divide a query into smaller tasks and process them in parallel (a small code sketch of this idea follows a little later in this part).

Now, if you look at the history of Hadoop, it started with Doug Cutting and Mike Cafarella in the year 2002, when they both began working on the Apache Nutch project. This project was designing a search engine able to index billions of web pages, which is a huge data set. The cost of such a system was approximately half a million dollars in hardware, along with a monthly running cost of approximately 30,000 dollars. Also, the project architecture was not capable of storing and processing such large data sets. So, in 2005, Cutting found that Nutch was limited to running on clusters of 20 to 40 nodes. He joined Yahoo along with the Nutch project, because Yahoo had a large team of engineers interested in working on it, and he separated the distributed computing part from Nutch and made it a new project named Hadoop. Yahoo successfully tested Hadoop on a 1000-node cluster and started using it in 2007. In 2008, the Apache Software Foundation tested Hadoop on a 4000-node cluster, and it ran successfully. In 2011, the Apache Software Foundation released Apache Hadoop version 1.0, and today we have Apache Hadoop version 3.0, which was released in December 2017.

Now, let us see: why Hadoop, and what are its different features? Today, big data has become a buzzword. Enterprises all over the world are beginning to realize that a huge volume of data is being produced through various sources, and this information, which is in the form of structured or unstructured data, needs to be stored and analyzed. Now, let us see how this big data is generated; we have some facts. Every day, the New York Stock Exchange generates about 1.5 million shares and trade data, Facebook stores 2.7 million comments and likes, and Google processes about 24 petabytes of data. Every minute, Facebook users share nearly 2.5 million pieces of content, Twitter users post nearly 3 lakh (300,000) tweets, Instagram users post nearly 22,000 new photos, nearly 50,000 applications are downloaded, and banking applications process more than 10,000 credit card transactions.
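To make the MapReduce idea mentioned above a little more concrete before we move on, here is a minimal word-count sketch in Java. It is only an illustrative example and not part of the original lecture material; the class names and the input and output paths taken from the command line are assumptions. Each mapper works on one split of the input in parallel and emits (word, 1) pairs, and the reducers then sum up the counts for each word.

    // Minimal word-count sketch illustrating the MapReduce model described above.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: each mapper processes one split of the input in parallel
      // and emits a (word, 1) pair for every token it sees.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: all counts for the same word are grouped and summed.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Such a job would typically be packaged as a jar and submitted with something like "hadoop jar wordcount.jar WordCount /input /output", and the framework then schedules the map and reduce tasks across the nodes of the cluster.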
Now, why Hadoop? Why has it become popular? There are various factors due to which Hadoop has become popular, and all of them are shown here. The first factor is low cost. Hadoop is an open-source framework, and because it uses commodity hardware to store and process huge quantities of data, it becomes cheaper. Then comes computing power. Hadoop is based on a distributed computing model; that is, it uses the processing power of multiple nodes to process a huge data set. The more computing nodes there are, the more processing power is available at hand. Next is scalability: you simply add nodes as the system grows, and it requires much less administration. Then storage flexibility: unlike a traditional RDBMS or data warehouse, no pre-processing of data is required here. Hadoop can store as much data as one needs and provides the flexibility of deciding later how to use the stored data; that is, it first stores the data and later lets you decide how to use it. It can store unstructured data like images, videos and free-form text. Finally, it provides inherent data protection; that is, Hadoop protects data and executing applications against hardware failure. Even if one node fails, it automatically redirects the job to other functional and available nodes. It also uses a replication strategy to increase the availability of data: it stores multiple copies of the data on various nodes within the cluster as well as across clusters (a small code sketch of this follows at the end of this part).

Now let us see some of the features of Hadoop. As you already know, it can handle large quantities of heterogeneous data using commodity hardware. It uses a shared-nothing architecture; that is, all the nodes in the system are loosely coupled. They have their own resources, so they do not depend on other systems for anything. Replication is used across multiple data nodes to increase the availability of data. Hadoop focuses on high throughput rather than low latency. It complements OLTP and OLAP systems, though Hadoop itself is neither OLTP nor purely OLAP. OLTP systems are online transaction processing systems, which handle a large number of short transactions, while OLAP, online analytical processing, systems are mainly used for the analysis of stored data. Hadoop is not good when there are dependencies within the data; if all the data are truly independent, then Hadoop works very fast. It is not good for processing small data sets; it works best with large data sets. The response time is longer, of course, because it handles large data sets in batch operations.

Now, why not RDBMS? An RDBMS is not suitable for storing and processing large files, images and videos, and especially it cannot deal with heterogeneous data. It is also not a good choice when it comes to advanced analytics involving machine learning. If you look at this diagram, you can see that as the size of the data grows and grows, an RDBMS has to scale up, and the whole system becomes very costly.
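Before we come to the detailed comparison with RDBMS, here is a small sketch to make the storage flexibility and replication ideas mentioned above a little more concrete. It is not part of the original lecture material, and the file path, its content and the replication factor of 3 are only illustrative assumptions. The client writes raw, schema-free bytes into HDFS and then asks the NameNode to keep three copies of each block of that file.

    // Minimal sketch: store raw data in HDFS and request extra replicas of it.
    // The path, the content and the replication factor are illustrative only.
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsStoreAndReplicate {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath; that is
        // where the cluster-wide default replication (dfs.replication) lives.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Store the data as-is; no schema or pre-processing is needed,
        // and how to interpret it can be decided later, at read time.
        Path file = new Path("/user/demo/raw-events.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("free-form text, logs, JSON, anything"
              .getBytes(StandardCharsets.UTF_8));
        }

        // Ask the NameNode to keep 3 copies of every block of this file, so the
        // data stays available even if a DataNode holding one copy fails.
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
      }
    }

In practice the cluster-wide default of three replicas (the dfs.replication property) already gives this behaviour; the explicit setReplication call is shown here only to make the idea visible in code.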
Now we will see the comparison of RDBMS with Hadoop with respect to some parameters. The very first parameter taken here is the system. An RDBMS is a relational database management system, whereas Hadoop has a node-based flat structure; that means all the nodes are viewed as a single system image and are grouped to form clusters. The second parameter is data. RDBMS is suitable for structured data, while Hadoop is suitable for unstructured and semi-structured data as well as structured data; it supports a variety of data formats in real time, such as XML, JSON and text-based flat-file formats. The third parameter is processing. RDBMS systems are OLTP systems, online transaction processing systems, which handle a large number of short transactions, whereas Hadoop is used mainly for analytical and big data processing. Next is choice: an RDBMS is a good choice when the data needs consistent relationships, that is, when high consistency is required, whereas Hadoop is used for big data processing, which does not require consistent relationships between the data. As far as the processor is concerned, an RDBMS needs expensive hardware or high-end processors to store huge volumes of data, because it follows a vertical scalability strategy, whereas in a Hadoop cluster a node requires only a processor, a network card and a few hard drives. Basically, Hadoop uses commodity hardware, so the processors are of moderate or low configuration compared with RDBMS systems. Finally, if we talk in terms of cost, the cost in the case of RDBMS is around $10,000 to $40,000 per terabyte of storage, whereas for Hadoop it is about $4,000 per terabyte of storage. So you can see that there is a very large difference in cost between RDBMS and Hadoop.

Now you can pause the video, think and try to answer this question. It says: Hadoop is neither OLTP nor purely OLAP. The answer to this question is true. Hadoop complements OLTP and OLAP systems, but it is itself neither OLTP nor OLAP. These are some of the references that I used to prepare this video. Thank you.