All right, so continuing on our Spark theme, next up are Nicholas and Alejandro, who will be taking us through a cage match between Spark and Hive. Take it away, guys.

All right. First of all, thank you very much for attending this lightning talk. I'm Alejandro Montero, a master's student at the Barcelona Supercomputing Center, and over the last couple of months we've been working with BigBench to compare the performance of different big data engines, focusing specifically on Hive and Spark.

BigBench is a specification-based benchmark with an open-source implementation, and it has recently been proposed as the first end-to-end big data benchmark, because right now it's the only one that covers all the major big data characteristics, which, as you may know, are volume, variety, and velocity.

Some very quick characteristics of BigBench: it's an extension of TPC-DS that adds new SQL queries and use cases such as machine learning, natural language processing, and others. It can support multiple implementations, engines, and table formats. It can also execute multiple parallel query streams at the same time on the same cluster, and it can define different scale factors; we used scale factor 100, which is approximately 100 GB in table size.

So what is BigBench actually doing?
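As a quick aside on the scale-factor knob just mentioned: the talk gives one data point (scale factor 100 is roughly 100 GB of table data). Here is a minimal Python sketch of that relationship, assuming the mapping from scale factor to data volume is approximately linear; that linearity, and the function name, are my assumptions for illustration, not part of the BigBench specification.

```python
# Hedged sketch: the talk states scale factor 100 ~ 100 GB of table data.
# ASSUMPTION: the scale factor maps roughly linearly to data volume.
GB_PER_SCALE_UNIT = 100 / 100  # ~1 GB per scale-factor unit, from SF 100 ~ 100 GB


def approx_dataset_size_gb(scale_factor: int) -> float:
    """Rough estimate of BigBench table data volume for a given scale factor."""
    return scale_factor * GB_PER_SCALE_UNIT


for sf in (100, 300, 1000):
    print(f"SF {sf}: ~{approx_dataset_size_gb(sf):.0f} GB")
```

For everything in this talk, the benchmark was run at scale factor 100.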
It's emulating a store that sells items both physically and via a web page, and for that reason it provides this data structure. First we have the structured data, which is the kind we're used to: easily indexed and recoverable. To that we add two more sources: the web logs, which contain the clickstreams of every user navigating the web page, and the reviews of the users who actually bought an item and want to review it.

As for the workload itself, there are 30 queries divided into four kinds of use cases: 14 pure SQL queries, which retrieve information from the structured section; four queries that use custom MapReduce pre-processing of the data before selecting it; seven natural language processing queries; and five machine learning queries.

Don't be overwhelmed by this: this is the software stack of the implementation we've been using. Very quickly, from bottom to top: all files are physically stored in the Hadoop Distributed File System. But since we're running queries, we need a middleware, a metastore, to store the logical tables. On top of that we need the SQL engine that receives the queries from BigBench, parses them, and retrieves from the metastore the locations of the files we want to recover. Once we have those locations, we can use one of the three execution engines to actually retrieve the physical information from HDFS. The engines are classic MapReduce; Tez, which, in case you don't know about it, is a framework on top of MapReduce that creates a directed acyclic graph to reduce latencies and improve the overall performance of mappers and reducers; and the Spark engine. For the machine learning queries we also need an application to perform the learning techniques, and we can use two: Mahout, which is based on MapReduce, or a custom-built Spark MLlib library. And of course YARN is the one managing all the containers for every single application here.

So we've run all permutations of the engines you see on this slide, but we still have a few works in progress. We have some results for Hive 2, but they are quite odd, so we're still working on that. And for Spark 2: it was compatible with Mahout, but the custom MLlib library was not binary-compatible with Spark 2, so major code refactoring is needed, and we hope to get results pretty soon.

Hardware-wise, we're using an HDInsight platform-as-a-service cluster, model D4 v2, with four worker nodes, each featuring an Intel CPU with eight cores and 28 GB of RAM; the HDFS storage is completely remote. For the software stack, HDInsight relies on the Hortonworks Data Platform 2.5. We noticed that both MapReduce and Tez are really well tuned, so we decided not to change a bit of the configuration. What we did notice, though, is that Spark was added recently and its configuration is quite strange: it uses only one executor per worker node, and that executor has three of the eight cores available on the machine.

For the results, we decided to divide them by use case. Starting with pure SQL: as expected, MapReduce is the slowest of the group. The fastest is Spark 2, which is very close to the other engines, Spark 1 and Hive on Tez.

We wanted to see a little more of what was happening inside a pure SQL query, so this is a trace of the CPU behavior of one of the queries, query 12 to be precise. What we can see here is that Tez reaches 100% CPU usage, which indicates it is CPU-bound, and, although in this case you cannot see the numbers for resizing reasons, it is a lot faster than both of the other engines: it finishes in 100 seconds, while Spark 1 finishes in 200 seconds and Spark 2 in 160. Moving on, we see that Spark 1 and Spark 2 both top out at around 30% CPU usage (sorry, you cannot see the y-axis). Most interestingly, we can see that Spark 1 shows a lot of I/O wait for some reason, while Spark 2 deals with that: it doesn't show any I/O wait anymore, and it ends a lot faster in this case. It is also using only about 30% of the CPU, and that may be because of the software configuration I just talked about.

Very quickly, for the second use case, custom reducers that pre-process the data before selecting it: Hive on Tez is the fastest here, followed by Spark 2, with Spark 1 very close behind, and MapReduce is again the slowest. Moving on to natural language processing: Tez is once again the winner, by a long shot in this case, followed by Spark 2, with Spark 1 really close to Spark 2, and MapReduce is really, really slow.

And finally, for the machine learning section, we can see two interesting things. First, changing one execution engine for another doesn't bring us any real difference in performance. What does give us a difference in performance is changing the application that actually performs the machine learning: going from Mahout to Spark MLlib, on any of the engines, gives us a two-times improvement in performance. As I said before, unfortunately we were not able to test Spark 2 with Spark MLlib, but we're hoping to see the same two-times improvement there as in the other cases.

Finally, the aggregated results for the four use cases. What we can see is that, for the whole suite, the fastest combination is Tez plus Spark MLlib. Second in line is Spark 1 with Spark MLlib, followed by Spark 2 plus Mahout. We're hoping to see Spark 2 plus Spark MLlib be a lot faster when we have results, but right now it's in third position, and MapReduce is the slowest of the group.

So, just to finish, some conclusions we can gather from these results. First off, Hive plus Tez improves SQL performance by a long shot over MapReduce; it's slightly faster than Spark 1, but slightly slower than Spark 2. And we have to make something clear at this point: the Spark implementation of the queries is the same as the Hive one, meaning they are using the very same SQL queries, and in this implementation those queries are heavily optimized for Hive. So tweaking them for Spark may give us different results. The second conclusion we gather from this study is that Spark MLlib is way faster than Mahout, and we encourage you to use it instead. And finally, the best production combination at the moment is to use Apache Tez for the SQL sections of your queries, and if you need to do machine learning techniques, stick with MLlib, which is the fastest one.

Before finishing, I encourage you to attend the presentation my colleague Niko is giving tomorrow at 12 o'clock, on using non-volatile memory to improve the performance of HBase, Hadoop, and all this stuff; it's quite interesting, so I encourage you to attend. And that would be all. Thank you very much. Do we have any questions?

All right, well, thank you so much. Thank you.