Good afternoon everyone, my name is Samat Akash and I work at Intuit, in the data engineering and analytics group. I have spent quite some time on the Hadoop performance side, so today I am going to talk about a tool we developed internally for analyzing Hadoop job performance. The tool is targeted mainly at Hadoop beginners: people who come from an RDBMS background, or who are just learning Hadoop, make very common mistakes in their MapReduce jobs, Hive queries, and other ecosystem software.

So let's start with the pain points for Hadoop beginners. They come to the system and run some kind of MapReduce job, Hive query, or Pig script, and then they see that the job is not moving: it sits at 0%, 0%, then 1%. It is moving really slowly, with no exception and no error, and they have no idea why. So they go to the Hadoop JobTracker web page (by the way, to set expectations for this talk, I am assuming you have worked with Hadoop a little bit). There they see a lot of information about their job: a lot of counters, a lot of configuration. But as beginners they cannot make any sense of it; the counters hardly tell them why the job is slow. This leaves them frustrated, wondering what is wrong, and they go to their operations team and other teams to figure out what is actually happening.

So the real pain point is that newcomers and beginners, when they write MapReduce jobs, Hive queries, and so on, make very common mistakes. Having worked on the performance engineering team, I have seen them repeatedly, for example not enabling compression, which is really important, and people have no clue why their job is slow because they do not know the best practices for writing MapReduce jobs. Beyond that, these unoptimized or suboptimal jobs create extra pressure on the cluster when you are working in a multi-tenant Hadoop deployment. And people never learn the available optimizations or the Hadoop counters; they really do not know what information Hadoop already gives them and how to apply it.

So we developed this tool, Dr. Hadoop. What is it? It is basically a rule-based analyzer: it makes sense of the configuration, the counters that Hadoop provides, and the logs. It finds the common mistakes against a set of predefined rules, gives a recommendation for each rule that fires, shows the severity of each recommendation for the different tasks, and leaves the user happy.

Since I do not have much time (the talk is 15 minutes), I will not go deep into the architecture. The gist is that we collect all sorts of data from Hadoop, such as the job history logs, JMX metrics, and all the metadata we have; we collect it using a component called Jpasser, throw it into a repository, and on top of that sits a web app, Dr. Hadoop, which hosts the pluggable rules.
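To make the pluggable-rule idea concrete, here is a minimal Java sketch of what one such rule could look like. This is an illustration, not the actual Dr. Hadoop code: the names Rule, JobData, Recommendation, and Severity, and the 1 GiB threshold, are assumptions; only the counter HDFS_BYTES_WRITTEN and the property mapred.output.compress are standard Hadoop (MRv1) identifiers.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical severity levels for a recommendation.
enum Severity { LOW, MODERATE, SEVERE, CRITICAL }

// The facts collected for one job: configuration plus counters.
record JobData(Map<String, String> conf, Map<String, Long> counters) {}

// What a rule produces when it fires.
record Recommendation(String ruleName, Severity severity, String advice) {}

// Each heuristic is one implementation of this interface, so new
// rules can be plugged in without touching the web app itself.
interface Rule {
    Optional<Recommendation> evaluate(JobData job);
}

// Example rule: flag jobs that write a lot of uncompressed output.
class OutputCompressionRule implements Rule {
    private static final long THRESHOLD_BYTES = 1L << 30; // 1 GiB, illustrative

    @Override
    public Optional<Recommendation> evaluate(JobData job) {
        long written = job.counters().getOrDefault("HDFS_BYTES_WRITTEN", 0L);
        boolean compressed = Boolean.parseBoolean(
            job.conf().getOrDefault("mapred.output.compress", "false"));
        if (written > THRESHOLD_BYTES && !compressed) {
            return Optional.of(new Recommendation(
                "OutputCompression", Severity.SEVERE,
                "Job wrote " + written + " bytes uncompressed; "
                + "set mapred.output.compress=true"));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A fabricated job that wrote ~5 GiB without output compression.
        JobData job = new JobData(
            Map.of("mapred.output.compress", "false"),
            Map.of("HDFS_BYTES_WRITTEN", 5L << 30));
        new OutputCompressionRule().evaluate(job)
            .ifPresent(r -> System.out.println(r.severity() + ": " + r.advice()));
    }
}
```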
So here is the secret sauce of Dr. Hadoop: the rules, and they are pluggable. Let me go through them one by one.

The first is the map spill rule. In a MapReduce job, every mapper has a sort buffer where it stores its intermediate data, the map output; it then sorts and partitions that data and spills it to disk. Those are the steps, and the more spills your mapper creates, the more disk reads and writes it performs, which reduces the performance of the job. This rule has all the information about your job, and what it does is calculate the average number of spills your job is creating; if that average goes beyond a threshold, it is really bad and you need to improve it. It also considers the maximum heap allocated to each mapper, and uses the counters to see how much heap the mapper actually utilized and how much went unused. From all of that it computes a recommended value for the sort-buffer configuration, io.sort.mb, and the related settings such as io.sort.record.percent; in other words, it tells you what the ideal value would be for your job.

The next rule is output compression. This is checked for every job: it looks at how much data your job is writing to HDFS, and if that goes beyond a certain limit, you should be enabling output compression; the job performs much better when it compresses the data it is writing. The rule reads the setting from your job's configuration, and if it is false, it tells you to make it true.

Then there is the small-files problem. This is a typical problem from the NameNode's perspective, but also from the MapReduce perspective: when you are working on very small files you end up creating thousands of mappers, and most of the time goes into per-task overhead rather than real work. So this rule considers how much data your job is working on and how many mappers it created, determines that your job really is working on small files, and recommends using CombineFileInputFormat, along with the accompanying parameters.

There is also the average mapper time rule. The recommendation for a MapReduce job is that your average mapper time should be close to one minute or more for a good mapper distribution. This rule checks whether your average mapper time is very low, even less than ten seconds, and if so tells you to either increase the split size each mapper works on, or combine the splits with CombineFileInputFormat.

Next is reducer skewness. This is more about the key pattern your job is working on: one reducer might get a lot of data while the others do not. This rule calculates whether there is a skew in the key distribution. It does not give you the exact solution, but it tells you that there is skew in your data.

Finally in this group there is data locality. It calculates how many of the mappers in your job are data-local, computes the percentage, and if the percentage falls below the limit it raises an alert with a severity, telling you to improve your job's data locality.
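As a concrete illustration of the kind of arithmetic behind the map spill rule described at the start of this section, here is a rough, self-contained sketch. SPILLED_RECORDS and MAP_OUTPUT_RECORDS are standard MapReduce counters, but the ratio heuristic, the class, and the sample numbers are assumptions, not the actual Dr. Hadoop implementation.

```java
public class SpillRatioCheck {

    /**
     * Ratio of records spilled to disk versus records the mappers produced.
     * A ratio near 1.0 means each record hit disk once (the minimum);
     * values well above 1.0 mean repeated spill-and-merge passes.
     */
    public static double spillRatio(long spilledRecords, long mapOutputRecords) {
        if (mapOutputRecords == 0) return 0.0;
        return (double) spilledRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        // Counter values as they would be read from the job history.
        long spilled = 30_000_000L;   // SPILLED_RECORDS
        long produced = 10_000_000L;  // MAP_OUTPUT_RECORDS

        double ratio = spillRatio(spilled, produced);
        if (ratio > 1.0) {
            System.out.printf(
                "Spill ratio %.1f: each record hit disk %.1f times on average;%n"
                + "consider raising io.sort.mb (within the mapper heap) or%n"
                + "tuning io.sort.record.percent.%n", ratio, ratio);
        }
    }
}
```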
The next rule is speculative execution. In some cases it can have a negative effect: when your tasks are talking to an external system, for example, you really do not need speculative execution, and you just end up wasting cluster resources. So this rule checks whether you have speculative execution enabled.

Another rule is intermediate output compression; it just checks the one parameter and the compression codec. For intermediate job compression there are other parameters too: it checks Pig's temporary-file compression as well.

For Hive: if your Hive table is a bucketed table, then you had better use a bucketed join, so the rule tells you to set the configuration for Hive to use the bucketed join. Dr. Hadoop also has the metadata about all the tables, which table is large and which is small, so it can even suggest going for a map-side join. It also parses the Hive query and checks whether your query uses LIMIT and whether you should enable the limit optimization, considering all the other factors around LIMIT.

So far this targets only one job at a time. The next step for this tool is to go across jobs, learn the behavior of your job over a period of time, and suggest better recommendations from what it has learned, a sort of machine learning. The tool is still under development right now, and we have further plans for it.

Q: Hi, I just want to ask: how do you decide data locality for the mappers? How do you find out which mapper is data-local?

A: If you have seen the job's counters, there is one counter called data-local and one called rack-local; those two counters tell you how many of your mappers were data-local.

Q: I am aware of that, but I am wondering whether you wrote any customization, basically overrode that code, to find this out. If a mapper is not rack-local and not on the same machine, if it reads from a remote node, I do not think that counter will tell us.

A: So what we do is this: you have the total number of mappers, you have the data-local mappers, and you have the rack-local mappers; if you subtract those from the total, the remainder is the remote mappers. That is how we do it. We do not have any extra code; we just utilize these counters and calculate from them.

Q: OK, thank you.
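The locality arithmetic from that answer fits in a few lines; here is a tiny self-contained sketch. TOTAL_LAUNCHED_MAPS, DATA_LOCAL_MAPS, and RACK_LOCAL_MAPS are real job counters, while the class and the sample values are illustrative.

```java
public class LocalityCheck {
    public static void main(String[] args) {
        // Sample counter values as read from a finished job.
        long totalMaps = 1000;  // TOTAL_LAUNCHED_MAPS
        long dataLocal = 820;   // DATA_LOCAL_MAPS
        long rackLocal = 130;   // RACK_LOCAL_MAPS

        // No extra counter is needed: whatever is neither data-local
        // nor rack-local must have read its split from a remote node.
        long remote = totalMaps - dataLocal - rackLocal;

        System.out.printf("data-local %.1f%%, rack-local %.1f%%, remote %.1f%%%n",
            100.0 * dataLocal / totalMaps,
            100.0 * rackLocal / totalMaps,
            100.0 * remote / totalMaps);
    }
}
```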
Q: Hi, this is Janesh. This Dr. Hadoop that you have, is it available only inside your company, or is it available to others? Is it proprietary?

A: It is still under development, and we do have plans for it.

Moderator: Let me repeat the question: when you collected the requirements for this software, were you inspired by some other tool, proprietary or open source?

A: To be honest, no. But there is one product called Hadoop Vaidya which is similar to this. It has a limitation, though: you have to give it the logs, and it does not maintain its own repository. Since it is not maintaining any data, it does not build its own database; it works on one instance of the data and gives you recommendations. This tool is different in the sense that it actually maintains the history of your jobs, and it has the capability to learn about your job and give you recommendations. That is one difference, but that was not the motivation behind the tool: we were working in Hadoop performance engineering and saw that these are the typical problems beginners run into, and that was the main motivation.

Moderator: For the next questions, our sponsors over at BloomReach have decided that, during these stressful times, they will give out some de-stressing stress balls. For the next few people who ask questions, they will be passing some out, so make sure you get your hands up high; we want as many questions as we can. We have roughly five more minutes for questions.

Q: Hi, this is Rahul, I am from 24/7. You mentioned that Dr. Hadoop calculates the average mapper time and suggests a split size. Does it also set the split size?

A: No, it gives you the recommendation; it does not set it in the job. You can set it yourself in a later run.

Q: OK, just a recommendation. Thanks.

Q: Hi, this is Mayank. Do you have plans to make Dr. Hadoop easily usable on AWS EMR?

A: We have not looked at that part yet. As I said, it is under development, and right now it works with our own clusters.

Moderator: Unfortunately, that is all the time we have for questions. Everybody give a big round of applause. Thank you very much, much appreciated.