So I am going to talk about the different components of the Hadoop system. But before that I will talk about how MapReduce works, and then walk through each of these components. It's a fair amount of material, but if I have time at hand I will also cover a little bit about the Pig and Hive interfaces.

A little bit about me. I work for InMobi. It's one of the largest independent mobile ad networks out there. We built a lot of this system ourselves because we face big data challenges like other firms such as Twitter, Facebook and Google. I will illustrate some of the findings and evaluations from when we actually used some of the Hadoop software. We started using Hadoop when it was fairly new, about two to three years back, and we made some design decisions then; some of them do not hold anymore. At the time Hadoop was not very mature, but it has become very mature by now. So let's start. But before that, how many of you have used Hadoop in some form? Just a few. How many are aware of what MapReduce is? Still a few who don't know. Thank you.

So we are really in a world where almost everything we do moves online, and there are systems which keep capturing what you are doing. For example, central entities like the social networks, Twitter and Facebook: whenever you check in, whenever you like something anywhere on the web, all of it gets captured. And if you look at the number of interactions we have online, it's a humongous amount of data. There is also a lot of machine-generated data, which has been around for quite a long time: astronomical data, weather-related data, and so on and so forth. Also financial data: whenever there is a trade on the stock market, there are various other parameters associated with it. It's a huge amount of data. Lately a trend we have seen more and more is that governments are opening up whatever data they collect, for example sensor data and economic indicators; I think there are a few talks on that in this conference as well. So it is possible to take all of these data sets, correlate them, and find out things that were not possible before. This is actually bringing in a lot of transparency. Then there are the various apps, for example apps that use machine learning to process audio data and news data, and we have high-resolution videos. All of these add up to a huge amount of data. So we are truly in a world where the amount of data that we store, consume, and see is just exploding.

Hadoop is one way of making sense of all of this data and using it for analysis; it provides a framework where you can take all of this data and process it. So when we talk about the big data stack, what are we talking about? Mainly these four layers, each in a fairly different state of maturity. Let me talk about each of these layers, starting from the bottom. One of the big challenges is that we capture all of this data in many places, so how do we get it to a central entity? Take the case of Facebook: they have servers all around the world. You capture the data, but you probably have to get it to a central place where it can be processed, and then store it.
So the challenges are: how do you get all of the data in, and how do you store it so that when individual machines fail, you can get around the failure and your system still works? The second layer on top is: once you have got all of the data, how do you predictably run jobs on it? How do you run the different kinds of analysis that you have coded? How do you schedule jobs? How do you do some amount of validation as well? For example, we know that postal PIN codes, at least in India, have six digits. So whenever data comes in, you want some kind of validation step (see the sketch after this section) so that when the upper layers start using the data, they do not run into errors. Basically the role of this layer is to add predictability to the whole process of taking in data and processing it, so that the data analysis layer can rely on all the different aspects of the data.

So what is the difference between data analysis and data processing? Data processing is, in some sense, a slightly dumber layer: it does not have context, and the main focus is on infrastructure. In the data analysis layer you have a lot of context. Another distinctive aspect of the data analysis layer is that you can bring in third-party data. For example, you have data related to a region and you want to overlay some economics-related data: in a neighborhood, you know the geographical locations of certain sets of people in certain localities of, say, Bangalore, in Karnataka, in India. These are disparate sources, so you take all of this data, join it, and get insights which you could not otherwise get if the two data sources stayed separate. There are challenges in doing the retrieval, doing the validation, and combining this data in ways that give you insights.

And finally, when we have taken all of this data and analyzed it, we have to visualize it. This is a fairly nascent field right now; there is not much out there, and it is a very active research area. Take the case of InMobi: whenever we serve an ad, there are three critical entities associated with it: a publisher, an advertiser and a user. All of these bring attributes. The user in the mobile world is represented by the handset, and the handset has different attributes: the size of the screen, whether it supports video, what kinds of video are in place. On the publisher side: what kind of site is it? Is it a sports site, an entertainment site, or some other site? When you take all of this data, it is very difficult to plot such multidimensional data. So this is an active area of research: how to represent the relationships between different kinds of data, because data in isolation, without context, does not have meaning. After you have done all the processing, you should be able to figure out the relationships between different kinds of data and the correlations between different aspects of the data. So that is data visualization. If you look at the Hadoop stack, it is very strong in data storage and data processing, and to some extent it is getting better at the data analysis part as well. But there is not much in Hadoop for data visualization.
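Going back to the validation point for a moment: here is a minimal sketch of the kind of rule you would run as data lands, before the upper layers see it. The record layout and field names are made up for illustration.

```python
import re

# Rule from the talk: an Indian postal PIN code has exactly six digits.
PIN_RE = re.compile(r"^\d{6}$")

def validate(line):
    """Return (ok, reason) for one tab-separated record: user_id, city, pincode."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return False, "wrong field count"
    user_id, city, pincode = fields
    if not PIN_RE.match(pincode):
        return False, "bad PIN code: %r" % pincode
    return True, ""

print(validate("u42\tBangalore\t560001"))  # (True, '')
print(validate("u43\tBangalore\t5600"))    # (False, "bad PIN code: '5600'")
```

Running checks like this as data arrives is exactly what keeps the analysis layer from choking on malformed feeds later.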
So for visualization people use tools like R, and we use JavaScript-based toolkits that render in the browser; I will talk a little bit more about those in the visualization part, on top of the aggregated data that comes out of all of these processes.

So let me talk about Hadoop. I wanted to figure out what people actually associate with it, so I generated the kind of visualization known as a word cloud, using a site that anybody can access and use freely. I pointed it at a blog about Hadoop, and those are the top words you can see there. The first keyword that came up was Hadoop, obviously. The other was data. Hadoop is an Apache project, right? The components are MapReduce and HDFS, and people use it for machine learning. Also, just because of the time when this snapshot was taken, you can see a lot of focus on education: there is "education" itself, and there is "students". So this simple infographic can tell you which words are correlated and which components are in play in the ecosystem.

A little bit about the basics of HDFS. HDFS is a file system that is designed for storing large amounts of data: it stores hundreds of terabytes across huge clusters, and Yahoo has clusters of more than 4,000 machines. It scales really well. It also allows for streaming data access. What I mean by the streaming data access pattern is that you can use any language, and as long as you can read from standard input and write to standard output, you can stream data off HDFS (a sketch follows at the end of this section). So it provides a generic framework where you can write MapReduce jobs in any language and then use those jobs to analyze data. It also assumes that it runs on a large amount of commodity hardware, and that has affected the design of HDFS: it has a fairly good failure model and is fairly resilient.

A little bit about the HDFS architecture. In HDFS there are different kinds of nodes; by a node I mean a machine. There is a name node. What the name node does is store all the metadata about a file: when the file was created, what the name of the file is, how it is laid out across the cluster, and so on. The data nodes are the workhorses of HDFS; these are the ones that actually store the data that is on HDFS. Whenever you copy a file onto HDFS, it is broken into chunks, and depending on the replication factor, each chunk is put in three different places. You can also make HDFS rack-aware. What that means is that you have a large number of machines, you put those machines in different racks, and the latency and bandwidth between two machines depend on whether they share a rack; with rack awareness, HDFS can still work even if a whole rack goes down. So typically, when you ask HDFS to store a file, it reads the file from the local disk and stores a chunk on that server. Then, with a replication factor of three, that chunk is replicated two more times: one copy goes to another machine in the same rack, and one copy goes to a machine in a different rack.
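On the streaming access pattern mentioned above: this is roughly what a Hadoop Streaming word count looks like in Python. It is a minimal sketch; Hadoop runs the mapper over the input chunks, sorts the output by key, and feeds the reducer lines grouped by key.

```python
#!/usr/bin/env python
import sys

# ---- mapper: raw text on stdin, "word<TAB>1" per word on stdout ----
def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

# ---- reducer: stdin arrives sorted by key, so equal words are adjacent ----
def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

In practice the mapper and reducer live in two small scripts that you pass to the hadoop-streaming jar with -mapper and -reducer; the exact invocation depends on your distribution.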
So suppose a rack which holds two of the replicas goes down. You still have a copy, and the name node knows that; it figures out that those chunks are under-replicated, and it will copy them again automatically.

So MapReduce is the paradigm that is the basis of the programming model for Hadoop. That is slowly changing, because there is something new, YARN, which allows different kinds of models, for example MPI, to run on the cluster. MapReduce works well for a certain class of problems, but if you go into social network analysis, or you want to figure out the relationships between the members of a group of people or a group of assets, or you have a hierarchical data set, it doesn't work really well, because of the way it is designed.

I'll give a quick demo of how it works in Python. I define three variables: a equal to "abc", b equal to "xyz" and c equal to "pqr". Now I am going to use map. What map does is take a function you define, here len, and apply it to each part of the data set, so it computes the length of each of the three variables. So now l contains the length of each of those. Next I am going to do a reduce. What reduce does is take all the values in l and just add them up, so what I get is the total length of all of these strings: I have nine here. For those who cannot read the screen: I defined three variables, map applied a function to each piece of the data, and reduce took all the partial results and added them up. And this is exactly what MapReduce is; the snippet after this section reconstructs the demo.

The way it works on Hadoop is that you define a map function and a reduce function and fire a job. When you fire the job, it gets distributed across all of the data nodes; the data nodes are where the data is actually stored. I talked about chunks, right? Suppose you have a 2 GB file and a chunk size of 128 MB; you will have 16 different chunks. And suppose you have 20 data nodes, so those chunks sit on 16 different machines. Hadoop takes your map function and runs it on each chunk, on the machine where that chunk lives, so it runs locally: that is the data-locality property. Now you have all the partial results. It takes all of these and shuffles them across from the different machines; that is the shuffle phase. And again, coming back to failure handling: if any of these machines fails, the name node, or rather the job tracker, will know about it and will restart the task on another node, using the replicas that we have. Once all the maps are completed, or sometimes even before all the maps are completed, depending on how your reducer is written, all of that data is taken and run on a reducer node. So in the first step I have run the map function; in the second stage the outputs of the individual maps are collected and shuffled; and in the third step it is reduced.
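The interpreter demo above, reconstructed as runnable code (in Python 3, reduce lives in functools):

```python
from functools import reduce  # a builtin in Python 2
from operator import add

a, b, c = "abc", "xyz", "pqr"

# "map" step: apply a function to each piece of the data set independently.
l = list(map(len, [a, b, c]))   # [3, 3, 3]

# "reduce" step: fold the partial results into a single answer.
total = reduce(add, l)          # 9

print(l, total)
```

Hadoop does the same two steps, except the list is terabytes of chunks and the map calls run on the machines that already hold the data.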
So it's fairly intuitive, and this works for a large category of problems. Now let's talk about the data storage layer. I talked about HDFS, but Hadoop can work equally well with other kinds of file systems: you can use them as a store and access them, so you are not tied to HDFS as such, though it's a great file system to use. If you are running in the cloud (I think there was a workshop yesterday on how to use MapReduce in the cloud) you can use S3, both in a native mode and in a block-based mode, and it works beautifully. I have used it myself, and it's very easy to have your own Hadoop cluster process the data and push it to S3, or pull data from S3, write it on the cluster, and send it back; it works seamlessly from the Hadoop command line as well. FTP is another option, and HTTP is a very popular protocol, so HDFS works beautifully over HTTP as well. There is also KFS, the Kosmos file system, built by a company called Kosmix that at some point was competing with Google; from what I have read on the internet it works with Hadoop, though I have not used it myself. And there is another file system called Ceph it can work with. So Hadoop supports a lot of data stores and is not tied to any one of them, and because of the way it's structured you can pull data from disparate sources and use them together; there are even tools for pulling data from databases.

When you take in all of this data you want to serialize it, because it can arrive in complicated formats, and for the simple reason that a good serialization format is more efficient to store. For data serialization there is Avro. Avro is again a Hadoop project, and it works beautifully. I have also used Protocol Buffers: there is a talk tomorrow by one of my colleagues on a system called Deewda, which is built to do analysis on a network, and in that we used Protocol Buffers and it worked very nicely. Thrift is another library, provided by Facebook, and it enables something very nice here. Hadoop provides a Java API through which you can access the HDFS file system, but you need not use it directly: you can run a Thrift service as a front end over HDFS, and because Thrift provides code generation for many languages, you can use Python or C or any of the other languages, link it with Thrift, and use HDFS really nicely.

RCFile. RCFile is something very interesting. How many of you do analytics on Hadoop? I would highly recommend that you look at RCFile. The concept of RCFile is actually very simple. Whenever you have a row, Hadoop, like most conventional systems, typically stores it row by row. But if you look at analytics workloads, you are usually more interested in columns. For example: what is the average temperature of this city over a period of time? In the database you may have the geolocation, the lat-long, the temperature, maybe rainfall and other things, but typically you run queries like the average temperature over a period of time, or the max or the min. So you typically run functions over individual columns. RCFile therefore stores data in a columnar format; the sketch below shows the idea.
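A toy illustration of the columnar idea (this is not RCFile's actual on-disk format): storing the same records column-wise groups similar values together, which matches column-oriented queries and also compresses much better.

```python
import zlib, random

random.seed(0)
# Hypothetical weather records: (city_id, temperature_c, rainfall_mm).
rows = [(i % 50, random.randint(18, 32), random.randint(0, 40))
        for i in range(10000)]

# Row-oriented layout: values of different kinds interleaved.
row_major = ",".join("%d;%d;%d" % r for r in rows).encode()

# Column-oriented layout: each column stored contiguously.
cols = list(zip(*rows))
col_major = "|".join(",".join(map(str, c)) for c in cols).encode()

print(len(zlib.compress(row_major)), len(zlib.compress(col_major)))
# The column-wise layout compresses better here because each column's
# values fall in a narrow range; real columnar formats add run-length
# and delta encodings on top, which is where the big ratios come from.
```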
And in terms of performance this works really well, because you are accessing data in a columnar format. The second thing that storing data column-wise does is provide extremely good compression. Take numerical values, for example: they are typically going to be within a range. The amount of rain that a certain city gets over a period of time is going to be within a range, and once you know that range, you can compress really, really well. You can possibly get up to 30x or 40x better compression with RCFile than if you just store the data row by row. So I would suggest that you look at RCFile if you are doing analytics.

Hadoop RPC is another piece; it is used by the name node and the other daemons to talk to each other, and it can also be used by other people for their own RPC purposes, so Hadoop natively supports RPC. HCatalog is not in the RPC space; it is a kind of metastore. We will talk later about Pig and Hive; each of those has its own semantics and its own way of accessing data. What HCatalog does is provide an abstraction over that: no matter whether you are using Pig, or Hive, or raw Hadoop with your own custom Java MapReduce, it provides an interface where all three of them can work on the same data simultaneously, which is really nice. It is a new project and I don't know how stable it is, but it is something to look at, especially for people who are evaluating different kinds of tools for analytics.

So one part is storage, but how does data get in from multiple machines or even multiple data centers? There is a project called Sqoop. The premise of Sqoop is very simple: it allows you to take a dump of a database. In a typical system, all of the metadata about the data is stored in a relational database, and you don't want that database loaded down. With Sqoop you define a job saying what data to retrieve, how to retrieve it, and where it goes, and Sqoop will keep pulling the data across so your jobs can keep running the analysis they want (a hand-rolled sketch of the idea appears after this section). It works really well, and it is fairly stable now.

The other three in that category are Flume, Scribe and Kafka. Flume is an Apache project, Scribe is a project developed by Facebook, and Kafka is a LinkedIn project. Each of them provides different trade-offs. Although the documentation for Flume is really good, I can tell you: if you are looking at data transfer, please do not pick Flume as it is now. The reason is that, though the architecture is really nice, in the sense that it defines sources and sinks and lets you apply transformations all along the way in addition to the transport it provides, it just doesn't work well; we had a lot of errors with it. This space was bleak for a long time, and then Scribe came along. Scribe works really well and we use it in our network. Just to give you an idea, we have four different data centers around the world, we keep transferring data across those data centers, and it works really well. But the thing is, you can't use it just out of the box; you have to do some customization. We have written some customizations on top of Scribe to provide better interfaces and so on and so forth.
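Going back to Sqoop for a second: Sqoop is driven from the command line, and what it automates is roughly this hand-rolled version. A toy sketch, using sqlite3 as a stand-in for your real database; the table and paths are made up.

```python
import sqlite3

def dump_table(db_path, table, out_path):
    """Dump one table as tab-separated text, the kind of file you would
    then copy onto HDFS for MapReduce, Pig or Hive jobs to work on."""
    conn = sqlite3.connect(db_path)
    try:
        with open(out_path, "w") as out:
            for row in conn.execute("SELECT * FROM %s" % table):
                out.write("\t".join(str(v) for v in row) + "\n")
    finally:
        conn.close()

# dump_table("metadata.db", "campaigns", "campaigns.tsv")
# ...then, for example: hadoop fs -put campaigns.tsv /data/campaigns/
```

Sqoop does this with parallel mappers, writes straight into HDFS, and keeps the schema definition with the job, which is why it is worth using instead of scripts like this.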
We will probably open-source those customizations at some point. So for data transfer, if you are looking at something, look at Scribe. Kafka I have looked at, read about, and used a little bit, but I don't have too much experience with it. Kafka works really well when you want to go back and replay. What Kafka does is, in addition to providing the transport, handle failure: machines keep going down, and say you are transferring data from one data center to another and the connectivity is flaky; the link goes down, or it drops packets due to the huge amount of data that you're transferring. What Kafka will do is spool the data onto disk. Scribe also does that to some extent, but Kafka has really good persistence, so you can go back and replay, and it provides the kind of interface you can use in case of failures (see the toy sketch at the end of this section); it works beautifully in case of failures. My friends at LinkedIn have said that what they run internally is also what is released externally, unlike Scribe, which is a snapshot of code released some time back. And it works beautifully at LinkedIn, at their scale as well.

A lot of people who look at this transport aspect also look at message queues. From my experience, I can tell you that we looked at ActiveMQ, ZeroMQ and RabbitMQ, and they just don't work well, because when there is a large amount of data to transfer, especially across data centers, the throughput and the elasticity that you want are just not there for the number of messages you are going to send. One more thing about Scribe and Kafka: Kafka also provides a publish-subscribe kind of mechanism, which lets you subscribe by topic; especially if you are consuming social streams, you can filter based on the data as it comes in.

Zebra is a columnar storage layer; I don't know how many of you know about it. We looked at it about six months back and it wasn't very stable, so you can try it out and see if it works for you if you are looking at the analytics space, but I will still recommend RCFile. And there is HBase, a key-value store that is built on top of HDFS; if I have time at the end, I will talk a little bit more about how HBase works. So that is the data storage part: getting data into the system and storing it. Now we move on to the next part: now that you have all the data in the system, how do you actually run jobs on it predictably? What frameworks are available that you can use?
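Before the processing layer, here is the spool-and-replay idea from the Kafka discussion as a self-contained toy. This is not Kafka's API, just the failure-handling pattern: deliveries that fail are kept durably on disk and replayed once the link is healthy.

```python
import os

class SpoolingSender(object):
    """Try to deliver each message; on failure, spool it to local disk
    and replay the backlog the next time the link is healthy."""

    def __init__(self, send, spool_path="spool.log"):
        self.send = send              # callable that raises IOError on failure
        self.spool_path = spool_path

    def deliver(self, msg):
        try:
            self._drain()             # replay any backlog first, in order
            self.send(msg)
        except IOError:
            with open(self.spool_path, "a") as f:
                f.write(msg + "\n")   # durable until the link comes back

    def _drain(self):
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path) as f:
            backlog = f.read().splitlines()
        for old in backlog:
            self.send(old)            # a failure here re-spools the new msg
        os.remove(self.spool_path)
```

Note that a crash mid-replay can resend messages: this is at-least-once delivery, which is also the guarantee these transport systems typically give you, so downstream jobs must tolerate duplicates.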
For running jobs predictably, Azkaban is a good framework; I think there is a typo on the slide, but you can search for it, it is a project by LinkedIn. It provides a way to specify a process as a DAG of jobs and then schedule it. Oozie is somewhat similar, but it is a little painful to use, at least it was when we used it, though I think over time it has become more stable. Oozie also provides scheduling, and it provides recovery from failures, so if a job fails you can rerun it, you can do some kind of validation, and you can put triggers on jobs. Azkaban and Oozie are extremely useful when you do not know when the data will arrive on the cluster: you can have a job trigger once all its preconditions are satisfied (a tiny sketch of the idea appears after this section). So these two frameworks are really good.

At InMobi we have built something called Ivory. Ivory does a lot more than scheduling: it provides a feed-based framework, and it looks at data coming in as feeds. What I mean by that is, suppose you have three or four different feeds. In our case we have ad-server log data coming in, click log data coming in, and, once somebody has clicked, they might go on to buy the product. So you have three disparate feeds, and a customer touches each of these aspects, and you may want to combine all of them. For example, you want to figure out the drop-off funnel: how many people saw the ad, how many people actually clicked on the ad, and how many people actually bought the product. The only way to do that is to go across feeds and do a join on those feeds, and at our scale that is extremely difficult, because of the amount of traffic we handle: at this point we are serving billions of ads per day, and the amount of data coming into the network is about two terabytes. The scale is very large, so we had no option but to build Ivory, because nothing of that kind existed. It is an open-source framework; we have open-sourced it and it is on GitHub, so if you search for ivory, inmobi and github you should be able to find it. It does a lot of things around feeds: you can do cross-feed validations, and you can write your own validation rules, for example that data should be in a certain format or match certain sub-schemas, and those rules run as the data lands in the system.

Another area, once you maintain a Hadoop cluster, and we again struggled with this for some time, is cluster monitoring, and we use Ganglia. We use Ganglia for getting all of the metrics from our cluster: how many JVM instances there are, how much CPU you are using across the cluster, what the network is doing. For example, when a shuffle happens, there is a huge amount of activity on the network. When you are processing at large scale, anything could be the bottleneck: disk could be the bottleneck, CPU could be the bottleneck, memory could be the bottleneck, or even the network could be the bottleneck, and depending on the kind of jobs that you have, each of these can be the bottleneck. So Ganglia shows you how the cluster is doing at any point, which provides huge value, and you can watch very, very granular low-level metrics: for example, how many spills are happening, or how many times a certain JVM was instantiated. You can get metrics related to the JVM, related to the machine, and more.
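A tiny sketch of the "trigger once preconditions are satisfied" idea that Azkaban, Oozie and Ivory implement for real. The feed paths and the polling loop are made up for illustration; the _SUCCESS marker files are the standard Hadoop convention for a completed output.

```python
import os, time

def wait_for(paths, check_interval=60):
    """Block until every input feed for the job has landed.
    Real schedulers also handle retries, SLAs and backfill."""
    while not all(os.path.exists(p) for p in paths):
        time.sleep(check_interval)

def run_funnel_job(day="2012-09-29"):
    # Hypothetical feeds for the cross-feed funnel join described above.
    wait_for(["/data/impressions/%s/_SUCCESS" % day,
              "/data/clicks/%s/_SUCCESS" % day,
              "/data/conversions/%s/_SUCCESS" % day])
    # ...now kick off the join (a Pig script, a MapReduce job, and so on).
```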
So Ganglia is a fairly good framework for cluster monitoring; it is working beautifully in production for us. We use Nagios for alerting, and that has also been working fairly fine. There is also Chukwa; the last time we looked at it, about a year back, Nagios already worked for us, and Chukwa seemed fairly heavy for our needs, so we did not pursue it. But Chukwa is another option for the combination of alerting and cluster monitoring.

Finally, ZooKeeper. ZooKeeper is a distributed coordination service. What that means is: you have processes running across different machines, so how do they figure out, for example, who should be the leader? So you have algorithms for leader election. Or you want to see whether a machine is up and running, because you might have one or two critical services: if you are running a Thrift server and everybody is using that Thrift server to access data on the cluster, the Thrift service could die. With ZooKeeper you can monitor and restart any of those services when they die, and it works really well out of the box, with quite sophisticated algorithms underneath.

Finally, data analysis. Hadoop is not very strong here. The project we have is Mahout, which provides good classification algorithms and good clustering algorithms, but it is probably best at recommendations; that is what most people use it for, including us. And from talking to people at other companies such as LinkedIn or Twitter or Facebook, they typically write their own MapReduce jobs on top of HDFS; that is done because there is no generic library of this kind that works for everyone. Piggybank is a set of UDFs, user-defined functions, for Pig; it works really well if you need to load and store data, and it provides a whole bunch of user-defined functions. Then there is Hive: what Hive does is provide an SQL-like interface on top of Hadoop; again, that is something we will cover later. Pegasus and Giraph: Giraph works on top of Hadoop; Pegasus, I am not too sure how well it is integrated with Hadoop. Giraph works really well if you want to do graph analysis, for example social analysis: who are the people with the most connections, who are the influencers in a certain kind of network based on a certain algorithm. For that you have to traverse the graph and look at the connections between different kinds of nodes, and for that there is Giraph, though most people again roll their own stuff for this. There is also a library called AllReduce that works really well, and someone suggested another tool is really good, but I cannot speak to it.

Pig provides a high-level abstraction; it is a data flow language. Before Pig, you typically had to go and write MapReduce code in Java for every piece of processing, and that takes a huge amount of time. So if you want to do quick prototypes, you can use Pig, or you can use Hive, which is very, very close to SQL in syntax: if you know SQL, you pretty much know Hive as well. Pig is more of a data flow language, somewhere between a programming language like Python and SQL. I will give an example right after this.
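A minimal stand-in for that example, written as plain Python generators to show the data-flow flavor; Pig Latin expresses the same pipeline as a handful of declarative steps (LOAD, FILTER, GROUP, FOREACH ... GENERATE) and compiles them to MapReduce. The file name and columns are made up.

```python
# A Pig-style data flow: each step consumes the previous one.
records = (line.rstrip("\n").split("\t")
           for line in open("clicks.tsv"))          # LOAD
mobile = (r for r in records if r[2] == "mobile")   # FILTER by device
by_site = {}
for site, user, device in mobile:                   # GROUP BY site
    by_site.setdefault(site, []).append(user)
counts = {s: len(u) for s, u in by_site.items()}    # COUNT per group

for site, n in sorted(counts.items()):
    print("%s\t%d" % (site, n))
```

The point of Pig is that you write only these logical steps and the planner decides how to turn them into map and reduce stages over the chunks.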
So Pig works really well: it takes whatever you write, translates it into MapReduce jobs, and gets you the result. It is a very, very good abstraction. Pretty much everybody I know uses Pig, and we also use Pig for pretty much everything we do. The only place where we actually write raw MapReduce is when we know a job is going to run for a long period of time and we can optimize it; once we have figured everything out, only at that point do we go to MapReduce. But the gains are in some cases not that much, maybe a 20% gain, and Pig is getting better and better. It is at a state where, in some cases, if you write a MapReduce job by hand, the normal program might actually be slower than Pig, because Pig has become very smart at figuring out the correlations between different types of data and how to do the data access. It is pretty smart right now. And if you are just getting your feet wet, Pig provides a very easy learning curve, as does something like Hive, though the Hive setup might take some time.

Again, there is not much in the data visualization space. There is Ambrose: you usually do not run just one job, you run a series of jobs, so one job starts and then there are dependencies, and how do you know which job runs when and how it is going? Ambrose provides a good visualization of the dependency graph between jobs and how they depend on each other, and it is open source. There is also Gephi. Gephi is extremely good if you are doing social network analysis. I don't know how many of you have looked at InMaps, from LinkedIn: InMaps takes your LinkedIn data and clusters the people you are connected to, maybe some school friends, the people from your first, second, and third company, and it does a very good job of that. So if you want to look at a graph and its connections, Gephi is extremely good.

[Audience member] Not a question so much as an addition to the data analysis piece. I come from a company where we do recommendations for entry-level job seekers, so we explore a lot of tools in data analysis. One thing that was not mentioned: there is a new database, a graph database called Titan. It is a distributed graph database which supports HBase and Cassandra as back ends, so you can work with all your graph data with the power of a graph database. If you know basic traversal using Gremlin or any of the similar query languages, you can pick up Titan and use the power of HBase and still traverse your graph.

[Speaker] Yeah, I think there is a little time left, and there is a slide about what is next. Today we are just using the MapReduce framework, but the next generation, which is not in production yet but is being tested at Yahoo and Facebook and a couple of other places, provides other ways of running jobs. It is called YARN; look it up. Iterative algorithms do not work really well with MapReduce, and graph workloads, as you said, do not work really well with MapReduce either, so I want you to look at it.

So this is how you can reach me. I will put up the slides; there is a lot of supporting material that I have for this, but unfortunately there is not much time, so I will put it up on GitHub, maybe in a couple of places. And do reach out; I would love to hear from other people about what they are using.
Every time I talk to different people I find out something new. For example, I had not heard about Titan, and yesterday I found out about a very awesome interactive visualization tool. So this space, statistical analysis and visualization, is apparently changing really fast. Thank you.