Hello, good afternoon, and thank you everyone for coming. My name is Akshay. I am an engineer at LinkedIn, on the Hadoop development team. In this presentation I will talk about how to optimize your Hadoop cluster by tuning the jobs that run on it. I have divided this session into three phases. To begin with, I will talk about performance in Hadoop and some of the challenges users face when tuning jobs. In the second phase, I will share some of the things we learned and did to address those challenges. And in the third phase, I will talk about a tool called Dr. Elephant, a self-serve tool that helps users easily optimize and tune their Hadoop and Spark jobs.

Before we begin, I would like to talk a little about Hadoop at LinkedIn. This will give you better context for the rest of the session, and for how scaling works in Hadoop. In 2008, when we started with Hadoop, we started out with a single cluster of about 20 nodes. There were barely 10 users and about 10 workflows running in production, written predominantly in Java MapReduce and Pig. Fast-forward to 2016: we now have more than 10 clusters, more than 10,000 nodes, more than 10,000 users, thousands of workflows running in development and hundreds in production. And people now use a variety of frameworks, including MapReduce, Pig, Hive, Spark, Scalding and others.

So how do you scale Hadoop? Simple: you add machines to the cluster to improve its performance. You just saw how we scaled from a single cluster with 20 nodes to more than 10 clusters with more than 10,000 nodes. But there is one issue. Though Hadoop is scalable, meaning you can add machines to improve performance, it does not guarantee an optimally working cluster. Even if you add machines, the cluster may not have the right configuration, and even if you run a job, Hadoop does not guarantee that the job will run optimally on the cluster. In addition, you cannot keep throwing machines at the problem forever; adding machines is a huge investment. So once we know we cannot keep adding machines, we need to try to get the same performance by tuning the resources we already have. And once you decide to tune, you need to start measuring the performance of the cluster. Measuring performance highlights hardware failures and poorly performing components in the cluster, and it shows you where there is scope to improve or upgrade: maybe the version of Hadoop you are running, maybe the disks or the CPUs.

Once we start measuring performance, we can think about tuning specific things on the cluster. I have divided Hadoop tuning into two broad categories: cluster-level performance tuning and job-level performance tuning. By cluster-level performance tuning, I mean tuning your network bandwidth, your disk I/O, your CPU cores and other cluster metrics. We already have a lot of cluster monitoring tools, like Ganglia, which help you tune your clusters, and you can even get help from a Hadoop expert for all of this. But what I am more interested in, and the other important part of tuning your Hadoop cluster, is tuning the jobs that you run on it.
Why is it important to tune jobs? Because we usually schedule our jobs to run periodically. We run them daily or hourly, and they keep running indefinitely until they actually break. So if your job is not tuned, then over time it is going to consume and waste a lot of resources. So how do you end up with a cluster where all your jobs run optimally? Here is what we tried at LinkedIn. We created a production cluster and made sure we let workflows onto it only after verifying they were tuned. The Hadoop dev team would act as a kind of production gatekeeper and make sure users could run their workflows in production only after the workflows were verified to be tuned. But there is one problem: this is a restriction on the users. Their workflows already run successfully, yet we are telling them to tune first, and only then can they run in production. So we started receiving hundreds of emails from users asking how to tune a Hadoop job, and we started helping them out.

But how difficult is it for a user to tune a job, and to what extent can one help them? Hadoop itself is designed to let users tune their jobs, but there are a few challenges. First, you cannot optimize the Hadoop jobs you write without understanding the internals of the framework. Second, even if you decide to tune your job, the information required to do so is scattered. In most cases people use high-level tools like Pig or Hive to run their Hadoop jobs, and they are only exposed to a console or terminal where they can see some client logs. To tune a job, it is not enough to look at these client logs; you also need to look at your resource manager logs and your application master logs, and there are job configurations, job counters and task counters, and each task has a different set of counters. That is a huge amount of information, and that is just for one MapReduce job. A Pig job or a Hive job might trigger, say, 100 other MapReduce jobs, and you can imagine how complex tuning becomes. Third, Hadoop has a huge set of parameters, and it is unfair to expect someone new to know the significance of each of them. The parameters are also interdependent, so tuning one parameter can affect others. In short, you cannot tune what you do not know, and you cannot improve what you cannot measure.

So how do we help the users? All these challenges put an extra burden on them and lower their productivity. How can we improve it? We started out by training users. We would hold regular training sessions and teach users how to tune a job: which parameters you can change, what the effect of changing a parameter is, what split size is, how to change it, what the min and max split sizes are. We taught them all of this, and gradually they picked up how to tune jobs and started doing it on their own. But there is one problem: this does not scale. As more people joined LinkedIn and onboarded onto Hadoop, we had to keep up with these trainings, and the people coming in were from very different Hadoop backgrounds.
Some did not know anything about Hadoop; some were Hadoop experts. And people use different frameworks to run on Hadoop: some use Pig, some use Hive. So our trainings had to be specific to the audience as well as to the framework they used. At some point we realized we were spending far too much time training users and being pulled away from other important work.

So how do we solve this? What we tried next was this: rather than training users and asking them to tune, why not have experts review the workflows on their behalf? So we had experts from the Hadoop dev team review workflows on behalf of users and give them specific suggestions on what to change to improve their workflow. Sadly, this again has a scaling problem. With more people on Hadoop there are more workflows in development and more workflows likely to go to production, so review requests keep coming, but the number of experts is limited. There are a couple of other issues here too. Typically, a user would set up an appointment with the Hadoop team, talk to someone from the team, and get suggestions. Then they would go back, make changes to the workflow, and set up another appointment to have it reviewed again. By that time, the Hadoop expert would have reviewed a hundred other workflows, and it is difficult even for them to keep track of what suggestions they gave earlier. And there is a lot of back and forth: the user gets a review, goes back, the expert suggests more changes, the user goes back again. Sometimes it takes months for a workflow to go into production. Another issue is that we do not really know whether changing a specific parameter actually improved the performance of the workflow, because you cannot easily compare the execution after the change with an execution from before the change. These reviews are also subject to human error: an expert might overlook certain aspects, and different experts might have different opinions on the same matter, so it is sometimes difficult to reach a consensus.

So both training and expert reviews stopped working as we grew. At that point we did not have a clear picture of how to solve the issue, we could not simply back out, and the help requests kept coming, so we had to spend more and more of our resources helping users. Here is the lesson we learned: scaling Hadoop infrastructure is hard, but scaling user productivity is much harder. This is when we decided to try an exploratory intern project to fix the problem: we would build a tool to automate all of this. But what is the idea? How would we do it? While doing these expert reviews, we came across several very common, repetitive optimization patterns. We thought we would gather all of these patterns and feed each of them to the tool as a rule, and the rule would then automatically analyze all the jobs running on the cluster for the presence of that pattern. This is what gave birth to the tool called Dr. Elephant.

So what is Dr. Elephant? It is a self-serve performance monitoring and tuning tool for the jobs you run on the cluster. What does it do? It helps every user get the best performance out of their jobs. It is a self-service tool for the users of Hadoop.
People who run Pig jobs, Hive jobs or Spark jobs can go to Dr. Elephant, search for their job, and get insights into how the job performed on the cluster. Dr. Elephant then makes suggestions: how you can improve your workflow, which parameters you can change to improve the job's efficiency, how you can speed up your job, and so on. You can also compare and analyze your historical executions. It is also a platform that other performance-related tools can build on to come up with new metrics.

Here is a high-level architecture of the tool. At the top you can see a job generator. It fetches the list of applications from the resource manager periodically, say every minute. Once it has the list, then depending on the type of the application, MapReduce or Spark, there is a separate fetcher: a MapReduce fetcher and a Spark fetcher. The MapReduce fetcher gathers all the information it can get for that application; for example, it queries the job history server to get all the information for the job. Similarly, the Spark fetcher gathers the Spark-related information from the Spark history logs. Once we have all the information for a given application, the rules we have fed into Dr. Elephant come in. There are MapReduce-specific rules and Spark-specific rules. They run over this data and classify the job as critical or safe. The results are saved in a database, and there is a UI where you can go and look at how your job performed on the cluster.

I would like to discuss a couple of the rules. One is Mapper Data Skew. What is this rule? When you run a MapReduce job, it runs a set of mappers and a set of reducers. If there is skew in the input to the mappers, we say there is skew on the mapper side. For example, in this diagram there are yellow boxes and blue boxes: the yellow boxes are the inputs to the mappers, and the blue boxes are the mappers, where the height of a mapper represents its run time. Suppose mappers 1 to K are reading a larger split of data and mappers K+1 to N are reading a smaller split. The mappers reading the smaller splits will finish relatively sooner, and once they complete, they have to wait for all the other mappers to finish. Ideally, we want every mapper to get about the same amount of data, so that no mapper has to wait for the others. So how do you fix this? You can combine all these smaller splits of data and feed them to a single mapper. If you go to Dr. Elephant, it will tell you the specific parameter to use; I believe it is the combined split size. If you set that, the small splits get combined and go to a single mapper. But what causes this? Why does it happen on the cluster? One reason: suppose you have a file that is a little larger than your HDFS block size. The file then gets split into one full block plus a smaller chunk. If you have a large number of such files, a large number of small chunks get created, and the framework allocates as many mappers as there are splits. So how can you fix it? You set a combined split size, and all these smaller chunks get combined and go to a single mapper.
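As a rough illustration of that fix for a plain Java MapReduce job, here is a minimal sketch; the 512 MB ceiling is an illustrative value, and for Pig jobs the equivalent knob is, to my knowledge, the pig.maxCombinedSplitSize property rather than code like this.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Upper bound on how much data may be packed into one combined split;
        // small leftover chunks are merged together up to roughly this size.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512L * 1024 * 1024);

        Job job = Job.getInstance(conf, "combine-small-splits-demo");

        // CombineTextInputFormat packs many small files or chunks into one split,
        // so one mapper reads several of them instead of one tiny chunk each.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```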
Another rule is Mapper Memory. When you run a MapReduce job, you request a certain container size from the resource manager, and your task then runs inside that container. Now suppose the user requests a 4 GB container, but the task that runs inside it only uses, say, 500 MB. The problem is that you are requesting more resources than you need, which means you will wait relatively longer for that 4 GB container to be allocated, and even once it is allocated you are not going to use the entire 4 GB; you only need 500 MB of it, yet you are blocking 4 GB of resources on the cluster. What you can do is request less memory by setting mapreduce.map.memory.mb for the mapper container and mapreduce.reduce.memory.mb for the reducer container.

Dr. Elephant gives all of these suggestions to the users. These are just two of the rules; there are a bunch of other MapReduce-specific rules and a bunch of Spark-specific rules. The Spark-specific rules are not as extensive as the MapReduce ones yet; they are still in the early stages.

If you go to Dr. Elephant and search for your job, you will see a result like this. The highlighted rectangle is one specific MapReduce job. You can see a username, which is the user who ran the job, and the Pig script that triggered that MapReduce job, and then there is a job ID. Below that you see colored rectangles; these are the rules. At the leftmost there is Mapper Data Skew, then Mapper Time, Mapper Speed, Mapper Spill and the others. The color of each box represents how severe it is. If you click on that specific job, you get a report on it. Here is a small snapshot of some of the heuristics. At the top you can see Mapper Data Skew, and it says the job is severe. Why? Because there is a group A with 197 tasks, each reading an average of 9 MB, and a group B with just three tasks, each reading 512 MB. You can see the skew: all 197 tasks reading 9 MB will complete soon and then have to wait for the three map tasks reading the larger chunks of data. There is also an explain link; if you click it, it tells you how to fix the problem. Similarly, below that there is the Mapper Memory heuristic I explained, which is also severe here. You can see the average physical memory, which is the memory the task actually used in the container: 658 MB. At the bottom of the same box you can see the requested container size, which is 4 GB. So the user has requested 4 GB but is using just 658 MB, and it would be better to request a lower container memory.

You can also look at history, both job history and flow history. At the top there is a performance graph of how your job or workflow performed over time, and at the bottom a table showing all the executions, with the latest at the top, then the previous one, and so on. Each colored bubble represents a rule, and its color represents the severity.
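Returning to the mapper memory heuristic for a moment: here is a minimal sketch of what acting on that suggestion could look like in the job configuration. The 1 GB container and 800 MB heap are illustrative values chosen to comfortably cover the roughly 658 MB the tasks actually used; the right numbers depend on your own job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RightSizeContainers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Container sizes requested from the resource manager. Tasks that only
        // use ~658 MB do not need the 4 GB containers that were requested.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 1024);

        // Keep the task JVM heap a bit below the container size so the process
        // stays within the container limit.
        conf.set("mapreduce.map.java.opts", "-Xmx800m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx800m");

        Job job = Job.getInstance(conf, "right-sized-containers-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```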
Now say, for example, you have scheduled a workflow and it is running on the cluster, and today it suddenly failed, or it took much longer to finish. You can come to Dr. Elephant and look at the executions. What you will probably see is that a rule was green in one execution and red in the current execution that took longer. You can identify which specific rule that is, which specific dot is red, hover over that dot, and get information such as the requested memory, the delay and the runtime. You can then actually debug: was it your job that took that long, or was it waiting for resources to be allocated? You can do that kind of analysis. Or suppose you want to change a specific line of code or a specific parameter: after changing it and running again, you can see how the change is reflected in the performance of your job. Has the performance improved? Has the wait time gone down? Has the runtime gone up? You can do all this kind of analysis by looking at your job's history.

So how do you define a rule? A rule has an input, some logic and an output. The input is the task counters, the task data, everything fetched from MapReduce. You run some logic over these counters and compute a value, and then you say: if the value is greater than a threshold, classify the job as critical; if it is below the threshold, mark it green. In Dr. Elephant we have five different output levels: red indicates critical, orange is severe, yellow is moderate, and low and none are both shown as green. Ideally, you want everything to look green for your job.

What can you customize in Dr. Elephant? You can write your own rules. If you have experience tuning MapReduce or Spark jobs and you know how to tune a specific kind of job, you just convert that knowledge into a rule. You can look at how the other rules are written, and just by looking at them it is quite easy to write one. Once you write a rule, you test it out, and then you write a help page explaining what the rule is and which parameters to set to fix it. Once you have the rule and the help page, you go to the heuristic configuration; heuristic is Dr. Elephant's terminology for a rule. In the HeuristicConf.xml file you add a new heuristic entry, giving the path to your class, the path to your help page, and a name for your rule. Then you just restart Dr. Elephant, and it starts analyzing all the jobs with the new rule you have written. What else can you customize? You can change rules, disable rules, and change their threshold levels. You can add other schedulers, other fetchers, other application types and job types. Basically almost everything in Dr. Elephant is pluggable.
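To make the input, logic and output idea concrete, here is a simplified sketch of what such a rule could look like. The class name, method signature and thresholds are made up for the example and are not Dr. Elephant's actual heuristic API; in Dr. Elephant the input comes from the fetched counters and the thresholds are configurable.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative rule following the input -> logic -> output shape described above.
 * Not Dr. Elephant's real heuristic interface; names and thresholds are invented.
 */
public class MapperMemoryRuleSketch {

    // The five output levels mentioned in the talk.
    enum Severity { NONE, LOW, MODERATE, SEVERE, CRITICAL }

    /** Input: physical memory used by each map task (MB) and the requested container size (MB). */
    static Severity apply(List<Long> usedMemoryMbPerTask, long requestedContainerMb) {
        double avgUsedMb = usedMemoryMbPerTask.stream()
                .mapToLong(Long::longValue).average().orElse(0);
        if (avgUsedMb == 0) {
            return Severity.NONE;
        }
        // Logic: how much larger is the requested container than what tasks actually use?
        double wasteRatio = requestedContainerMb / avgUsedMb;

        // Output: map the computed value onto severity levels (illustrative thresholds).
        if (wasteRatio > 8) return Severity.CRITICAL;
        if (wasteRatio > 4) return Severity.SEVERE;
        if (wasteRatio > 2) return Severity.MODERATE;
        if (wasteRatio > 1.5) return Severity.LOW;
        return Severity.NONE;
    }

    public static void main(String[] args) {
        // Tasks using ~658 MB inside 4096 MB containers, as in the report shown earlier.
        System.out.println(apply(Arrays.asList(640L, 658L, 676L), 4096));  // prints SEVERE
    }
}
```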
Earlier I talked about something called the production gatekeeper: we maintained a separate cluster for production, and the Hadoop dev team acted as the gatekeeper, making sure workflows went into production only after they were verified to be tuned. Now that we have Dr. Elephant, the Hadoop dev team can just refer to the Dr. Elephant report, and if everything is green, the user can go ahead and run the workflow in production. If there is something wrong, we ask the user to go back, fix it, and come again. But this is again a manual process, so we created a tool to do it: a Jira bot. It is a bot integrated into Jira, and the user just has to create a Jira ticket with a link to an execution of their flow. The bot automatically fetches that execution and gets the reports from Dr. Elephant. If everything looks green, it comments back to the user saying everything is good; if there are issues, it comments back asking the user to fix them. And if everything looks good, it takes care of going to production: it creates the project on the production cluster, and if there is setup to be done, such as giving permissions to the user, it does all of that, and the user can go and run the workflow in production. So this entire process is now handled by the tool itself. About 80% of workflows now go to production through this tool, without intervention from the Hadoop dev team.

What else can you do? You can use these reports for other kinds of performance analysis and reporting. For example, we have a tool that generates daily reports on the top 25 workflows to tune today and on the workflows that kept failing on the cluster today. Other tools use these reports for cost-to-serve analysis, for comparing different revisions of a workflow, and so on.

Good news: we open sourced Dr. Elephant this year, in April. Since then we have received an overwhelming response: more than 60 pull requests, more than 10 contributors, and more than 50 topics under discussion. I welcome you all to contribute and to share suggestions and feedback. One thing that would be great is if we could all collectively gather the good practices and optimization techniques we have, feed them into Dr. Elephant, and make it a single hub for completely analyzing any job you run on Hadoop. Since we open sourced it, a lot of companies have shown interest and started using Dr. Elephant, and a couple of them are running it in production. Folks from Airbnb added support for the Airflow scheduler, there is a pull request for Oozie, there are a couple of pull requests working on scaling the fetchers, and a bunch of other pull requests adding new features and bug fixes.

As for the roadmap: we are soon going to add real-time support for Hadoop jobs, show analytics for failed jobs, visualize workflows and jobs through dependency graphs, and expand to more rules, more schedulers and more frameworks. A couple of references: you can check out LinkedIn's engineering blog, where there is an article on Dr. Elephant. The source code is on GitHub, and there is also a mailing list and a Gitter chat, so feel free to discuss and share your ideas. It was also presented at the Hadoop Summit last year. Thank you.

Can we monitor Spark SQL? Spark SQL runs on YARN; can we monitor it via Dr. Elephant too?
Yeah, as long as you run your framework on YARN, it can analyze any job you run there. And by default it uses MySQL; what if I need to change it to PostgreSQL as the metastore? Yes, that is a Play configuration, so you can change the backend as you wish. Which specific configuration file? Because when I tried changing from MySQL to PostgreSQL, I was not successful. There is an app-conf folder and an elephant.conf file; that is where we would need to change it to PostgreSQL. OK, maybe you can bring this up on the mailing list and we can discuss it there.

Hello. So it will not work with Spark on Cassandra, or anything other than Hadoop? Right now we have support for MapReduce and Spark. And if Spark is running on Mesos? Right, for Mesos we do not have support right now, but you are welcome to contribute and add that. Someone has already raised adding Mesos support in the forum, so that is something we look forward to. OK, thank you.

So like you said, one Pig job can launch multiple MapReduce jobs. Is there a place where you can see RAM, CPU, and the average, max and min for mappers and reducers, all in one place, for all the MapReduce jobs that the Pig script launched? That is what I mentioned at the end: we will visualize these things through a graph. You will be able to click on your Pig job, see the MapReduce jobs it triggered, look at your entire workflow and drill down into these, and see the RAM and CPU taken by each job, each application, and look into all the Hadoop-specific parameters. By RAM, I am not sure; that may be more of a cluster metric, whereas this is more of a metric for users to tune their jobs. You may not see RAM as such, because the counters are not that user-friendly. OK, that is fine, thank you.

You mean profiling information for a job? Yeah, that is something we are looking forward to doing in Dr. Elephant. I am afraid we are out of time for more questions, so if you could take the questions offline, that would be great. I will be here throughout the day, so you can catch me and ask questions if you have any. OK, over to the next speaker.