The outline is: the causes of big data, what NoSQL is, then MapReduce and Hadoop, why we need MapReduce, the issues with Hadoop, and then my references. So, now let us understand how big the problem is. Can you tell me what big data is? Yeah, what is a database? Data in the form of relational tables and all. That means you are only talking about structured data. Actually, the biggest constraint with SQL, with our relational databases, is that the type of data the DBMS deals with is structured data. Structured data means identifiable data that is stored in the database in the form of rows and columns. But in this talk I am not talking about that data, the data stored in typical rows and columns. My data is the images, the sound, the videos. How is Facebook handling that much data, exabytes of data? How are Facebook, YouTube, Google dealing with that much data? They are not using a normal RDBMS; they are not running SQL queries. So what is the need for MapReduce, Hadoop, Hive, Pig, Bigtable and all of these? According to the IDC paper, till 2006 we had created just 161 exabytes of data, but in 2010 we created 988 exabytes. An exabyte is 2 raised to 60 bytes, so you can imagine how big the problem is. After exa comes zetta: a zettabyte is 2 raised to 70 bytes, an exabyte is 2 raised to 60, and a petabyte is 2 raised to 50. So this is according to the IDC paper. Eric Schmidt, the Google CEO, also said that till 2003 we had created just 5 exabytes of data, and now we create that much data every two days. So you can imagine the problem of big data in the real world. This slide shows the worldwide growth of emails from 1998 to 2010. 1 exabyte is equal to 1000 petabytes.
1 exabyte is equal to 1000 petabytes. What did you find, 10 raised to 6 gigabytes? No: a petabyte is 10 raised to 6 gigabytes; an exabyte is about 10 raised to 9 gigabytes. The sequence is megabyte, gigabyte, terabyte, petabyte, exabyte, zettabyte and then yottabyte. The yottabyte is the biggest standard unit: 2 raised to 80 bytes. Now, the needs of enterprises are increasing. The needs are the velocity, the variety and the volume of this data. Companies are not bothered about what type of data it is; the company just says: store this data. Look at Google. When you are searching on Google, do you think that Google is not taking your information? Google is storing every aspect of your information, around eight aspects: your location, your age group, what type of information you are searching, everything. Because there are a lot of enterprises that are hungry for information, information related to a certain set of people, for their corporate exploitation. So the first driver is the needs of enterprises. The second is that the cost of storage has decreased significantly: you can go to the market and get a 1 terabyte hard disk for just 4000 or 6000 rupees, I guess. The third is the improvement in processor designs. The problem is not with structured data, because relational databases can handle structured data easily. We are bothered about unstructured data: data that cannot be stored in rows and columns. So we have to deal with that unstructured data.
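The unit arithmetic above is easy to check directly. A minimal Python sketch, using the binary definitions (each unit is 2 raised to 10, i.e. 1024, times the previous one):

```python
# Binary storage units: each step up is a factor of 2**10 = 1024.
KB, MB, GB, TB, PB, EB, ZB, YB = (2 ** (10 * i) for i in range(1, 9))

print(EB == 2 ** 60)   # exabyte is 2 raised to 60 bytes -> True
print(EB // PB)        # petabytes per exabyte -> 1024 (roughly 1000)
print(EB // GB)        # gigabytes per exabyte -> 1073741824 (roughly 10**9)
print(YB == 2 ** 80)   # yottabyte, the biggest standard unit -> True
```

So "1 exabyte = 1000 petabytes" is right (1024 in binary terms), but in gigabytes an exabyte is about 10 raised to 9, not 10 raised to 6.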
So, over 95% of the digital universe is unstructured data, according to the IDC paper, and within organizations unstructured data accounts for more than 80% of the information. We have to deal with this problem. Now, what do we have for managing data? There are two kinds of techniques: relational database systems and NoSQL. NoSQL means "not only SQL"; we have to take something new. The relational database works well in the case of structured data: data is stored in definite structures and SQL is used as the query language. But what is NoSQL? You must have read that SQL, the relational database, works on a principle: the ACID properties, atomicity, consistency, isolation and durability. Every transaction has to follow these properties. But if I am talking about a distributed environment, I have to relax something so that my flexibility can increase and the processing can be faster. I may have to relax the atomicity property, or the consistency property, or the isolation property; I have to relax something. That relaxation is not provided by the relational database, so this is its biggest constraint. The second constraint is that there is no flexibility: the schema is there, a definite structure is there. So NoSQL stands for "not only SQL": a class of non-relational storage systems that usually do not require any fixed table structure. Instead of the ACID properties, NoSQL works with the CAP theorem: consistency, availability and partition tolerance. The CAP theorem was given by Brewer. Brewer's theorem states that a distributed database cannot achieve all three of these simultaneously: consistency, availability and partition tolerance. What is consistency? Can anybody tell me, in the normal sense, in the real world? Exactly. But in the distributed environment, consistency means the following. Let us say I have two nodes in a distributed environment.
Do you understand what a distributed environment is? A set of computers: the data cannot be stored on a single computer, so it is distributed across the computers, like 100 computers in my cluster. In that environment, consistency means this: if I store some data X on one machine and a user comes and updates X, and after some time T another user wants to read that data from another machine, will he read the updated version of X or the previous version of X? That is what consistency is all about. If I am using a relational database, the answer is yes, he reads the updated version, because a relational database is always consistent. But if I am using NoSQL, the answer is maybe: maybe we get the updated record, maybe we get the previous record. Fine. So that is what consistency is about. The CAP theorem classifies databases on the basis of which two of the three properties they guarantee: databases concerned about consistency and availability, databases concerned about availability and partition tolerance, and databases concerned about consistency and partition tolerance. For example, it places Vertica in the first class. Vertica happens to be a column-oriented database, but I am not talking about column orientation here; just forget about it. Do all of these have ACID properties? No, Vertica does not have ACID properties; Vertica is based on the CAP theorem, not ACID. Actually, Vertica is a distributed database. Sir, Hadoop is also a column-oriented database. It may be using a column-oriented layout, but that is not the point; the point is that it is distributed. Actually, sir, column-oriented databases, key-value databases and document databases —
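The read-after-update scenario above can be simulated with two replica dictionaries and a delayed copy step. This is only an illustrative sketch of eventual consistency; the names `write` and `replicate` are invented for the example, not any real database API:

```python
# Two replicas of the same record X. In an eventually consistent store,
# a write lands on one node first and is copied to the other node later.
node_a = {"X": "v1"}
node_b = {"X": "v1"}

def write(node, key, value):
    node[key] = value          # the update is visible on this replica only

def replicate(src, dst):
    dst.update(src)            # the delayed copy to the other replica

write(node_a, "X", "v2")       # user 1 updates X on node A

# Before replication runs, user 2 reading from node B sees stale data.
print(node_b["X"])             # -> "v1" (the previous version)

replicate(node_a, node_b)      # after some time T the replicas converge
print(node_b["X"])             # -> "v2" (now consistent)
```

A strictly consistent (relational) system would block or coordinate the write so the stale read in the middle could never happen; the NoSQL answer is "maybe", as described above.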
— sir, these three kinds of databases are classified on the basis of how they store data. Some store on the basis of columns; some store on the basis of rows, just like your RDBMS. A column-oriented database stores on the basis of columns, so that particular queries become faster. But how does the CAP theorem apply here? The CAP theorem does not say whether Vertica is column oriented or row oriented; it just classifies Vertica here, meaning Vertica mainly focuses on consistency and availability. Actually, I know just four or five of these databases: Bigtable, PNUTS, HBase; these focus on consistency and partition tolerance; and then Amazon Dynamo and Cassandra by Facebook. Bigtable is used by Google presently. Which one? MongoDB? BerkeleyDB is a database used at Berkeley, an old one. Actually, the focus of this presentation is on Hadoop. The main motive of this presentation is just to explore what is going on in the field of databases: the research prospects, the flaws in Hadoop, the areas where you can explore; just to make you aware of these. This one is Redis. The main issue with Redis is that its scalability is poor. What is scalability? The system should perform the same when the load is increased; in layman's terms, it means I should be able to add more machines to my system to sustain extra load. I am not talking about the exact definition. Redis lacks this scalability factor. The way Redis works is that it loads everything into RAM, the primary memory, and then performs its operations there. So the physical memory is the bottleneck: the processing memory is limited.
So this is the lacking factor with Redis. Similarly, there is Cassandra. The issue with Cassandra is that only horizontal scalability is possible. Actually, there are two types of scaling: horizontal scaling and vertical scaling. If you add five more machines to increase your computational power, that is horizontal scaling. If you increase the computational power of an existing machine, like turning your 2 GB of RAM into 4 GB, that is vertical scaling. In Cassandra only horizontal scaling is possible, not vertical scaling. Facebook is using Cassandra. Now, what are the challenges in the distributed environment? The first challenge is that distributed environments use very cheap nodes. If I am the CEO of some company and I am using 1000 nodes for my cluster, I cannot afford all 1000 to be very expensive; all the commodity nodes will be cheap. Now, the mean time to failure of one node may be 3 years, but across 1000 nodes a failure happens roughly every day. That means every day you have to face failures, so your system must be fault tolerant. But in an RDBMS, fault tolerance is very limited; the system is fault tolerant only up to a certain extent, because it was not designed for a distributed environment. So we have to build a system that is fault tolerant. The second thing is that we have to take care of network bandwidth. Tell me one thing: which is bigger, your data or your code? Say you write 500 lines of code to run some queries against 500 GB of data. Which is better: fetching the data from the database machine to the machine holding the code, or shipping the code to the machine where the data lives? An RDBMS does the first thing: the code and the database sit on separate machines, and it fetches the data from the database.
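The failure arithmetic above is a simple division: assuming independent failures, the expected time between failures somewhere in the cluster shrinks in proportion to the node count. A back-of-the-envelope sketch:

```python
# Rough estimate: if one commodity node fails on average every 3 years,
# how often does a 1000-node cluster see some failure?
mtbf_node_days = 3 * 365                     # about 1095 days per node
nodes = 1000
mtbf_cluster_days = mtbf_node_days / nodes   # cluster-wide expected gap
print(round(mtbf_cluster_days, 1))           # -> 1.1 days: daily failures
```

This is why fault tolerance has to be designed in from the start rather than treated as a rare exception.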
But what if we build a technique where, instead of pulling the data to the code, we push our code to the data? That is why we have to take care of network bandwidth, and this functionality is there in Hadoop. The next thing is that programming a distributed environment is hard. We have to take care of pipelining, of parallelization, of fault tolerance; everything we have to take care of ourselves. Programming in such an environment is very difficult, so we have to make a system that is user friendly, where the user does not have to bother about the parallelization details, pipelining and all. Now, the Hadoop I have been talking about for so long: Hadoop is Java-based software that works on the MapReduce programming methodology. MapReduce is a programming methodology in which the user writes every program in the form of two functions, map and reduce; every problem has to be expressed through these two functions. I will show you with an example how. The runtime system takes care of partitioning, scheduling, parallelization and fault tolerance. So Hadoop is the open-source Java implementation of MapReduce. Every database system needs a computational part and a storage part: how the data is actually stored. In an RDBMS the computational part is SQL; you write every query in SQL. Here, I write every query as MapReduce, and my storage is HDFS, the Hadoop Distributed File System. Fine. How does HDFS work? Let us say I have a file of 300 MB. How will it get stored? HDFS splits this file into data blocks, and the block size is user defined. Then it makes copies of every single block: every block is triplicated. This factor is called the replication factor; it is also user defined, but by default the replication factor is three.
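The splitting-and-triplication just described can be sketched in a few lines. This is a simplified illustration, not real HDFS internals: the 64 MB block size, the node names, and the round-robin placement policy are all assumptions made for the example (real HDFS placement is rack-aware):

```python
# Sketch of HDFS-style storage: split a file into fixed-size blocks,
# then place 3 copies of every block on distinct data nodes.
BLOCK_MB = 64            # block size is configurable; 64 MB is assumed here
REPLICATION = 3          # default replication factor in HDFS

def split_blocks(file_mb, block_mb=BLOCK_MB):
    blocks, remaining, i = [], file_mb, 0
    while remaining > 0:
        blocks.append((f"block-{i}", min(block_mb, remaining)))
        remaining -= block_mb
        i += 1
    return blocks

def place(blocks, nodes, replication=REPLICATION):
    # Round-robin placement: the copies of one block land on distinct nodes.
    return {name: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i, (name, _) in enumerate(blocks)}

blocks = split_blocks(300)                  # the 300 MB file from the example
print(blocks)                               # four 64 MB blocks plus one 44 MB block
nodes = [f"datanode-{n}" for n in range(5)]
print(place(blocks, nodes)["block-0"])      # 3 copies on 3 distinct data nodes
```

The mapping returned by `place` is exactly the bookkeeping that the name node has to track, which is discussed next.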
So all the data blocks are triplicated and then, following some placement algorithm, distributed across my cluster. There is a name node and there are data nodes. The name node, the monitor node, tracks all the blocks: where each block is stored, like the green one, the blue one and the brown one on the slide. Now, fault tolerance: how is it achieved in Hadoop? If a data node goes down, the name node knows which blocks were on it and restores the replication factor by redistributing copies of those blocks to other nodes. But there is a problem with this mechanism; can you figure it out? The block-to-node mapping is saved on the name node itself. If the name node goes down, then everything is gone. This is a current problem with Hadoop, and research is going on into how we can put up some secondary machine for the name node. So it is still open; for now we just assume that the name node is very reliable: high-quality RAM, high-quality hard disk, everything high quality. So let us take an example to understand how Hadoop works. I just said there are two types of functions. When you write your first program in C or Java, what is it? The expected hello world program. In the same way, in Hadoop we write the word frequency program: you are given some file, that 300 MB file, and you have to calculate the frequency of each word. That is like the hello world program of Hadoop. Fine. So how will this word count problem be solved by the map and reduce functions? What actually are map and reduce? Let me explain on a physical basis.
Let us say I have a cluster of 100 machines and some monitor node. That monitor node will deploy some nodes as mappers and some as reducers. The mapper nodes do the map function and the reducer nodes do the reduce function. So what are map and reduce, and how do they work? The mapper needs some key and some value. Say this is some arbitrary sentence, and I have to calculate the frequency of each word in it. The key I give is an offset: offset 0 means this is the 0th line of the file. The value is the content itself, the line. I give this key-value pair to the mapper, and a small script is written in the map function: first it splits the line, and then it emits (word, 1); it just attaches a one to every word in the file. That is what the mapper does. On every data node this map function is going on. Can I print instead of emit? Then it will show an error: printf needs a console, but here there is none; emit is how the mapper hands its pairs onward. So on all the data nodes this map function runs and completes. That means I have attached the integer 1 to every word of the file. Then I need to count the number of ones for each particular word; that is how the problem is solved. So on every machine a file like this will appear: "a" comma 1, "score" comma 1; a one attached to every word. A shuffle function on every node then groups these pairs, because a word can appear multiple times in a file, multiple times on a data node. Now the reducers come into the picture: the monitor node redeploys the mappers as reducers and sends the reduce code to them.
What the reducer does is merge the lists of ones from all the data nodes, summing them per word. That is how the word frequency program is achieved. Fine. It is very simple; of course, Facebook is not doing such simple things, hello world is just the simple case. Now, what are the positives of MapReduce? Fault tolerance is achieved through runtime scheduling. By runtime scheduling I mean this: in an RDBMS, you know that after you send a query, a query plan is generated, and everything is done according to that plan. Everything is decided statically. There is no such query plan in MapReduce; everything is done at runtime, so if a machine goes down, the work of that machine can be handed to some other machine. The same runtime scheduling also handles the slower nodes, the stragglers. The query plan tree that is generated first in an RDBMS is the lacking factor, the reason we do not use it in the distributed setting. Load balancing is also achieved by that runtime scheduling. And it is very simple: only two functions are there, map and reduce. It is independent of storage, because HDFS is there and everything is loaded into HDFS. Scalability is there. Flexibility is there; flexibility means there is no need for a schema. No schema? Then how do you know what the data is? Actually, the data it works on does not fit a schema: how will you fit a video into a schema? It is just data. Sir, in the word count program you explained, there must be some schema generation somewhere. Let me give you a real-world example: if you want to calculate the popularity of a website, only the link and the number of clicks are there. Is that a schema? You could say it is a log file; it is not a schema.
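The whole word-count flow above — map emits (word, 1), shuffle groups the ones by word, reduce sums them — can be simulated in plain Python. This is only a single-machine sketch of the programming model, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_fn(offset, line):
    # Mapper: split the line and emit (word, 1) for every word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all the ones emitted for the same word together.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reduce_fn(word, ones):
    # Reducer: merge the list of ones into a single count.
    return (word, sum(ones))

# Keys are line offsets in the file, values are the line contents.
lines = {0: "the quick brown fox", 1: "the lazy dog the end"}
mapped = [pair for off, line in lines.items() for pair in map_fn(off, line)]
counts = dict(reduce_fn(w, ones) for w, ones in shuffle(mapped).items())
print(counts["the"])   # -> 3
```

On a real cluster, `map_fn` runs in parallel on every data node, the shuffle moves pairs across the network so each reducer sees all the ones for its words, and `reduce_fn` runs in parallel on the reducer nodes; the logic per function is exactly this small.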
It is a log file; it can vary. If you impose a schema, the structure becomes fixed and you cannot vary it. A website is accessed by millions and millions of people, and you want to see the traces: where it is being called from, where it is accessed. That itself looks like a schema, but it is not explicitly stated; you can say it is semi-structured data. And even then we can put SQL on it; well, not exactly SQL, it is HiveQL. I will come to that. So what are the negatives, the open problems? Open problems means you can do research on them. The first: it focuses on scalability but misses performance. Recall that example: every data block is replicated, so for storing one data block I need three machines, just to ensure availability. Don't you think that is a wastage of the capacity of three machines? This is the current problem with scalability; there are solutions, but those solutions are also not enough. So there is a trade-off between efficiency and fault tolerance. Second, there is no high-level language. It seems very easy to write the map and reduce functions, but that is only because this was a very simple program. If you take on bigger problems, it becomes very cumbersome: you have to tokenize everything and write everything down by hand. If you are an SQL person, it will be very difficult to write map and reduce functions, so you need some SQL-like language. Fine. And this is the important point: it is not exploiting pipelining.
Pipelining means this: in an RDBMS, you make a query plan, and the intermediate results on the way to the final result are pipelined into the next stage; we do not store the intermediate results. The cost is that if a machine goes down, the query has to be restarted from scratch. Hadoop is currently not using pipelining. As a technique we could use it, depending on the application, but we also have to achieve fault tolerance, and if I use pipelining, that fault tolerance property will go; we have to take care of every factor. Sir, how can you achieve both fault tolerance and pipelining? Actually, there are two senses of pipelining, and I am explaining the first one. Suppose that after the mappers complete their job, I redeploy the same machines as reducers. Why can't the reducers start consuming the map output as it is produced? Hadoop is not doing this: Hadoop assigns everything as mappers, and only once all the map tasks are completed do the reducers start. The reason is that one reducer has to take the data for its keys from all the mappers. These are the differences between Hadoop and a distributed DBMS, which I think I have almost covered, including cost. Okay, I am going a bit fast. Actually, MapReduce stands to Hadoop the way assembly language stands to C: you do not write assembly load and store instructions for every program, and likewise we do not want to write map and reduce functions for every program. So we need some language on top of the Hadoop layer.
On top of that layer, one such language is Pig. Pig is an SQL-sort-of language, but not exactly SQL; you can see the sample query. Actually, it is a procedural language: it explicitly says load this file, tokenize these lines, group them, and for each group count the number of words. The other option is Hive. Hive is almost like SQL; a query reads almost like a select over customer last names and order items. Both languages convert their queries into MapReduce functions internally. And these are the differences between Hive and Pig: one is procedural, the other is similar to SQL; Hive provides a Thrift server, Pig does not; and Hive has partitions, which give speed. Thank you. These are my references.