Okay, let's get settled so that we start on time and hopefully end on time. Welcome to today's presentation. The title is Hadoop on OpenStack Cloud. This presentation is based on a proof of concept that was undertaken by our team in Dell EMC, based in Austin, Texas. The executive sponsor for this project was Dr. Arkady Kanevsky, who is going to be a presenter, and you will get to know more about him when he presents. Mr. Fazal was the contractor who did most, if not all, of the work. My name is Nicholas Wakou and I was the lead engineer on this project.

So what we have for you today: we will show you the reference architecture that we used for this POC, and then the motivation, why we did it. Then we'll take you through the tests that were performed and how they were performed, we'll share the results that we got, and we'll conclude after that, hopefully within time. We intend this to be an interactive session, so feel free to ask questions, but we request that those questions be specific, precise and to the point, so that we can get through everything that we intend to share with you. So without much ado, we'll call upon Dr. Arkady Kanevsky to start off the discussion. Thank you.

So let me talk a little bit about the motivation, why we started this endeavor. It's very simple. In Dell EMC we have a solution which is optimized for Hadoop, and it's working very well. And we have another solution, which is an OpenStack one. They're not too far apart with respect to the hardware and the configuration they're using. And I asked myself: if you want to run Hadoop on OpenStack, how close is the performance you're going to get between runs on the optimized hardware configuration solution and on the cloud one?

So first we did two things. One is we looked at what hardware we use for the Hadoop solution and what hardware we use for the OpenStack solution, specifically for the compute node, because that's where the bulk of the work is. We figured that the additional load generated by OpenStack itself, for creating the networks, creating the volumes and attaching them, creating VMs and so on, would not be too big a deal; at least that was our thought. So we stuck with that. The other thing is that we wanted to make sure that the hardware configuration with respect to RAID, BIOS and several other underlying hardware settings was identical for both of them. I'll get into a bit more detail on the specific hardware in a minute.

And then we asked what we were going to use to evaluate this from the benchmarking point of view, because we wanted it to be something which we can relate to. We chose the TPCx-HS benchmark, which is the one we publish when we release our Hadoop solution, as a verifiable result which the community can see; it's published through the TPC consortium, and various other vendors and companies publish those results as well.

So the architecture we chose was the bare minimum we could do to start the work, and it follows the model which we have for our OpenStack solution. The results presented here are on a slightly older version of the solution, so it was actually run on the Liberty code base, but as you will see from the results, they are not really impacted by the version of OpenStack.
One other point I wanted to make is that while our bare-metal Hadoop solution is optimized by the administrator, so the administrator has full control of all of the hardware: servers, switches and storage, the same is not true for OpenStack, because there is a clear separation between what a user can do and what a user can see. So the design we came up with, and we'll go into more detail as we go through each of the results, is that we provide the architecture and configuration which we recommend, targeted for Hadoop, but from the user's point of view they don't see that. They operate through the standard OpenStack APIs and don't notice any difference.

So from the hardware point of view it's a fairly typical OpenStack configuration. We have one simple node which we use for administrative purposes; for this work we used TripleO for deployment and for controlling the configuration of the nodes. We have an HA environment, so we have three controller nodes where all of the OpenStack services are running. We have three nodes where we run the compute. Some of the results were obtained using somewhat lighter compute nodes with respect to the memory and the cores available in the processors, but those do not impact the final result, because the final results were on the same hardware we used both for the bare-metal Hadoop cluster and for the compute nodes of the OpenStack cluster. And finally, we used Ceph as the storage layer, both for block storage and for the image store.

We have not investigated what happens when you keep your data in Ceph as object storage and try to process it out of there. But given the performance results we have for Ceph block storage, we don't expect that going through the object interface will improve the performance drastically. Moreover, because we have to go through the RGW, through the control plane, to get to the data, we expect the results would be worse. So once we got the results on Ceph block storage, we realized that we probably don't want to go there, because that's not where we're going to get any improvements.

If you are a true hardware buff: the configurations we used are off-the-shelf standard Dell hardware. We used two flavors of nodes. The 6-series, which is originally what we used for the compute nodes, is somewhat lighter with respect to storage capability; that's usually good enough for most compute loads, for cloud-native apps, for dev and test, but it cannot sustain the same storage capacity. So for both the Ceph nodes and for the compute nodes used in the true comparison against Hadoop, we used the 7-series, which has 24 disks, I think 1.6 TB capacity per disk. So this compares apples to apples, because we use identical hardware for bare-metal Hadoop and for Hadoop on OpenStack. With this I'll pass the baton to Fazal to go into the details.

Thank you. Good evening everyone. As Arkady has already discussed the motivation of this POC and the reference architecture, I'll quickly walk you through the deployment of our system under test and OpenStack Sahara's architecture and its components. OpenStack Sahara provides a robust interface to easily provision and scale Hadoop clusters.
As an OpenStack component it is fully integrated into the OpenStack ecosystem, so a user can administer an entire Hadoop workflow from the Horizon GUI, from configuring a cluster all the way to launching jobs on it. A typical Sahara cluster consists of a name node instance and multiple data node instances, and all of those instances were deployed and configured by OpenStack Sahara.

We benchmarked the Cloudera Distribution of Hadoop (CDH) version 5.3 using TPCx-HS as the test workload. TPCx-HS was the industry's first big data benchmark standardized by a major industry-standard performance consortium; it is a derivative of the Apache Hadoop workloads TeraGen, TeraSort and TeraValidate. This test workload was installed and executed from the name node instance by the cloud user, and as Arkady has already mentioned, the underlying hardware configuration was not exposed to the cloud user.

Coming to OpenStack Sahara's architecture: as you can see, just like any other OpenStack component, we have auth, a REST API, and a data access layer that persists internal models in the cloud database. Then there is the provisioning engine, which is responsible for actually provisioning the Hadoop cluster using the Heat orchestration service. Then there are vendor plugins; vendor plugins are a pluggable mechanism which lets you deploy a specific Hadoop distribution, and I'll talk more about them in the next slide. Then we have EDP, Elastic Data Processing, which helps you manage and launch jobs from Sahara's core component. And just like any other OpenStack component, Sahara has its own Python client, and the Sahara dashboard is fully integrated into Horizon, so you can deploy a cluster, resize it or even delete it with single clicks of a button.

Looking into the components, the major ones include the plugins, which as I've already said are the pluggable mechanisms; several plugins are available as of now, including Apache Vanilla, the Cloudera Distribution of Hadoop, and some others. Next is the image registry. As you all know, OpenStack starts virtual machines from a pre-built image with an installed OS, so the image requirements for Sahara depend on the plugin that you're using as well as the framework version: some plugins might require a simple cloud image and will install the framework on the machines from scratch, while others might require images with pre-installed Hadoop packages.

In order to simplify the provisioning process, Sahara employs the concept of templates. There are two types of templates: node group templates and cluster templates. As the names suggest, node group templates are for node group creation and cluster templates for cluster creation. Templates remove the burden of specifying the required parameters each time a cluster needs to be deployed. In a node group template you specify not only the roles of the nodes in that node group but also the flavor and the security policy you need for them. For example, a name node group template can be assigned the m1.medium flavor and the roles of HDFS name node, secondary name node, YARN job history server and resource manager; likewise, for a data node group you can have the HDFS data node and node manager roles and set it to use the m1.xlarge flavor. A cluster template, on the other hand, is basically a collection of node group templates: you specify not only which node group templates you want to use in your cluster but also their quantities. For example, with a cluster template of one name node group and three data node groups, you will end up with one name node machine and three data node machines running in your cluster.
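As an illustration of the template mechanism just described, here is a minimal sketch using python-saharaclient. The flavor IDs, image ID, network ID and plugin version are placeholders, the CDH process names are given as I recall them from the plugin documentation, and the exact keyword arguments can differ slightly between Sahara releases; treat it as a sketch, not the configuration used in this POC.

```python
# Illustrative sketch only: IDs, versions and process names are placeholders/assumptions.
from keystoneauth1 import loading, session
from saharaclient import client as sahara_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='demo', password='secret',
                                project_name='demo',
                                user_domain_name='Default',
                                project_domain_name='Default')
sahara = sahara_client.Client('1.1', session=session.Session(auth=auth))

# Node group template for the master (name node) role set.
master_ngt = sahara.node_group_templates.create(
    name='cdh-master', plugin_name='cdh', hadoop_version='5.3.0',
    flavor_id='<m1.medium-id>',
    node_processes=['HDFS_NAMENODE', 'HDFS_SECONDARYNAMENODE',
                    'YARN_RESOURCEMANAGER', 'YARN_JOBHISTORY'])

# Node group template for the workers (data nodes).
worker_ngt = sahara.node_group_templates.create(
    name='cdh-worker', plugin_name='cdh', hadoop_version='5.3.0',
    flavor_id='<m1.xlarge-id>',
    node_processes=['HDFS_DATANODE', 'YARN_NODEMANAGER'])

# Cluster template: one master node group and three worker nodes.
cluster_tmpl = sahara.cluster_templates.create(
    name='cdh-cluster', plugin_name='cdh', hadoop_version='5.3.0',
    node_groups=[{'name': 'master', 'node_group_template_id': master_ngt.id,
                  'count': 1},
                 {'name': 'worker', 'node_group_template_id': worker_ngt.id,
                  'count': 3}])

# Launch a cluster from the template (image and network are placeholders).
cluster = sahara.clusters.create(name='tpcx-hs-cluster', plugin_name='cdh',
                                 hadoop_version='5.3.0',
                                 cluster_template_id=cluster_tmpl.id,
                                 default_image_id='<cdh-image-id>',
                                 net_id='<management-net-id>')
```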
The provisioning engine is the core component of Sahara; it instructs Heat to communicate with Nova, Neutron and the other OpenStack services. OpenStack Sahara used to communicate with those services directly, but since the Liberty release that has been deprecated, and we now have the Heat engine for that purpose.

Coming to the methodology that we used for this POC work: first of all, we used the bare-metal tests as a baseline, and the results of all the tests executed in the cloud environment were normalized to the bare-metal test results. Secondly, we used TPCx-HS as the benchmark and its performance metrics for the comparison of those results. In order to evaluate the performance of Hadoop in different cloud configurations we had to define some configuration parameters, such as the worker instance configuration, where we tested how many instances per compute node, along with the vCPUs, RAM and storage allocated to each, would give us better performance. Then we tested CPU oversubscription, memory oversubscription and the block storage: we had two different types of storage available, the shared Ceph storage and the local disks on the compute nodes, so we had to test and compare their performance. We adopted an iterative testing methodology where we iterated through a set of values for each configuration parameter and evaluated the performance; the value that showed the best performance for a parameter was carried forward into the subsequent tests, and in this way we concluded with the best value for each parameter.

This is basically the resource allocation that we used for our instances. We had three different types of instances. We used the Cloudera plugin, which comes with the convenience of Cloudera Manager as well, so we had just one Cloudera Manager instance with these resources, and one name node instance. The resources for both the Cloudera Manager and the name node instances were fixed throughout the tests, while we iterated through the resources for the worker instances. Now I would like Nicholas to come over and talk about the tests and the results. Thank you.

Thank you. So we continue with the results. When we talk about worker nodes we are basically talking about data nodes, if you are used to the Hadoop lingo. We started off by determining the best instance configuration: what kind of Nova flavor should we be using to get the best or optimal performance? We looked at five instance configurations: three of them were the standard Nova flavors medium, large and extra large, and two of them were custom, one of which was just one instance per physical server while the other was varied. Each of those instance configurations was a test, so we ended up with five tests. Through all these tests we made sure that the total number of CPUs accessible within the test remained fixed, and the same was true for the total amount of memory accessible to the worker nodes, while we varied the number of worker instances per physical server.
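To make the iterative methodology described above concrete, here is a small illustrative sketch of how results can be normalized to a bare-metal baseline and the best value of one parameter carried forward; all the numbers are placeholders, not the measurements from this POC.

```python
# Illustrative sketch of the normalization and iterative sweep described above.
# Every performance number below is a placeholder, not a measured result.

BARE_METAL_RESULT = 100.0  # hypothetical TPCx-HS performance metric on bare metal

def normalized(cloud_result: float) -> float:
    """Express a cloud run as a fraction of the bare-metal baseline."""
    return cloud_result / BARE_METAL_RESULT

# Candidate values for one configuration parameter at a time, e.g. the number
# of worker instances per physical compute node.
candidates = {1: 88.0, 2: 90.0, 4: 91.0, 8: 87.0}  # instances/node -> metric

best_value, best_result = max(candidates.items(), key=lambda kv: kv[1])
print(f"best instances per node: {best_value} "
      f"({normalized(best_result):.0%} of bare metal)")

# The winning value is then held fixed while the next parameter
# (CPU oversubscription, memory oversubscription, storage backend, ...)
# is swept in the same way.
```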
What we found was that there wasn't very much performance difference between the instance configurations; at most we saw a 5% difference, but otherwise it was fairly close. The other thing we noted was that there was a sweet spot of four instances per physical node, so we adopted that as our standard configuration for all the subsequent tests. As you will see later on, the instance configuration shown as number four, the one we called i5, is the one that we ended up adopting for all subsequent tests, and we also ended up using only four instances per physical node.

After that we wanted to see how far we could get away with oversubscription from a performance perspective. We knew that oversubscription is definitely going to make you lose performance, but at the same time it is sometimes a necessary evil, and we just wanted to know how far we could get away with it. So again using the optimized configuration that we called i5, we maintained the amount of memory and kept ramping up the number of CPUs. We did four tests: the first test had a one-to-one ratio between virtual and physical CPUs, and then we ramped up to one-to-two, one-to-three and one-to-four. What we found was that CPU oversubscription is indeed not good: a 2x CPU oversubscription will make you lose about 2% of performance, and a 3x will make you lose about 13% of your overall performance. And again, when we talk about performance, we were looking at the performance of the Hadoop workload we were running, the TPCx-HS workload; that is what we were standardizing on.

Next we tried memory oversubscription, and in this case we maintained the number of CPUs but ramped up the memory. We were ramping up in terms of percentages, not in multiples of two or three: we did a one-to-one test, then tests at 10%, 20% and 30% oversubscription. What we found was that memory oversubscription was really a non-starter, because a 10% memory oversubscription gave you a 64% performance degradation. So if you were to do any oversubscription, you might get away with a little CPU oversubscription, but memory oversubscription will really hurt you.
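As an aside, the oversubscription levels just described correspond to the Nova scheduler's allocation ratios. The short sketch below only illustrates the arithmetic; the node and instance sizes are placeholders, and while cpu_allocation_ratio and ram_allocation_ratio are real nova.conf options, the exact values used in this POC are only implied by the ratios quoted above.

```python
# Illustrative arithmetic for the oversubscription ratios discussed above.
# Physical node and instance sizes here are placeholders.
physical_cores_per_node = 40      # hypothetical physical CPU threads per compute node
physical_ram_gb_per_node = 256    # hypothetical RAM per compute node

instances_per_node = 4            # the "sweet spot" found in the instance tests
vcpus_per_instance = 20           # placeholder: sized to create 2x CPU oversubscription
ram_gb_per_instance = 64          # placeholder: sized for no memory oversubscription

cpu_ratio = instances_per_node * vcpus_per_instance / physical_cores_per_node
ram_ratio = instances_per_node * ram_gb_per_instance / physical_ram_gb_per_node
print(f"vCPU:pCPU ratio = {cpu_ratio:.1f}, RAM ratio = {ram_ratio:.2f}")

# In Nova these limits are governed by cpu_allocation_ratio and
# ram_allocation_ratio in nova.conf: a 1:1 memory policy means
# ram_allocation_ratio = 1.0, while the 2x/3x/4x CPU tests correspond to
# cpu_allocation_ratio values of 2.0, 3.0 and 4.0.
```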
With that information we decided to move to the next level of testing, again maintaining the optimal instance configuration and ensuring that there was no memory oversubscription. Now we had to make a decision on what kind of storage we should be using: do we go with Ceph shared storage, which is really what our reference architecture at that time was recommending, or do we use local storage? There are many reasons why you could go either way, but we wanted to find out which one gave us the best performance. One thing we did was to take Ceph's default replication of three and dial it down to one, mainly because the TPCx-HS workload that we were running comes with its own replication of three. What we found was that if you run on local storage you have a 32% advantage over Ceph with replication one, and so we started seeing that we were ramping up in terms of performance. Now there are many reasons why you would obviously use Ceph; it's not just about performance. But where performance is a serious consideration, you are definitely going to get more if you are using local storage.

Then we tried CPU pinning. In this case we just configured our instances to be aware of the NUMA topology of the underlying infrastructure, but everything else remained the same; the only thing we changed was making sure that CPU pinning was in place. The performance improvement was not really dramatic; it was actually quite small. We think that was mainly because of the hypervisor overheads and some other things. This is an area where we could have done more investigation; we didn't have the time to do that, but we got a small, very minute performance improvement. It's also an area for further investigation and research.

Then we looked at disk pinning, in this case again making the instances aware of the I/O subsystem; we maintained the NUMA nodes and then implemented disk pinning on local storage. What we found was that we could get an extra 15% due to disk pinning. At this point we found that we were at 94% of the performance of bare metal, and it was also the time when the deadline for completing this project was coming to an end. The executive sponsor was on our back; he wanted everything wrapped up quickly. So we went to him and said, hey, we are at 94%, is that okay? He said, fine, we can live with that. But at the back of our minds we knew that we had just touched the tip of the iceberg: there is a lot of opportunity within OpenStack that we could have exploited, and can still exploit, to not just match bare-metal performance but even get better performance than bare metal. Having done this, we are confident that if we were given more time we could actually do that. So this is where we ended; maybe we shall continue with it. But I will ask Dr. Arkady to come and conclude.

Not much left to do. I just wanted to put all the results together and demonstrate, piece by piece, how each of the configuration options changes the performance spectrum. There are several things to note. While we used terms like CPU pinning and disk pinning, to be very fair the Hadoop user doesn't see that; the people who deploy Hadoop don't see that at all. Disk pinning basically means that you, as an administrator, configure each of the disks on the local compute node as a Cinder volume and attach those volumes locally, so you don't notice anything under the covers; you basically use the standard OpenStack calls to operate on the whole thing. We do the CPU pinning under the covers, and we create VMs of the right size to make sure they deliver the best performance, so when Sahara deploys Hadoop it uses those VM sizes to operate on. So let us stop at this point; if you have any questions, please come to the mic and ask them, and then we'll have a raffle.
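For the "under the covers" CPU pinning Arkady mentions, one plausible mechanism (my assumption, not a detail given in the talk) is Nova flavor extra specs. The sketch below uses python-novaclient with placeholder flavor sizes; hw:cpu_policy and hw:numa_nodes are standard Nova extra specs, but whether the POC used exactly these is not stated. Exposing the local disks as Cinder volumes, as described above, is a separate Cinder-side configuration not shown here.

```python
# Illustrative sketch: pinning guest vCPUs to host cores via flavor extra specs.
# Flavor sizes are placeholders; the compute nodes also need pinning enabled in
# nova.conf (e.g. vcpu_pin_set) for this to take effect.
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='admin', password='secret',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
nova = nova_client.Client('2', session=session.Session(auth=auth))

# A hypothetical worker flavor sized so that four instances fill one physical node.
worker = nova.flavors.create(name='hadoop.worker', ram=61440, vcpus=10, disk=40)
worker.set_keys({'hw:cpu_policy': 'dedicated',   # pin vCPUs to physical cores
                 'hw:numa_nodes': '1'})          # keep each guest on one NUMA node
```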
My question is, did you try storage backends different from direct local storage? Did you try LVM-backed solutions? Perfect question. The question is, have you tried an LVM-backed solution? If you look at the green line, which is labeled local storage, that is the LVM-backed solution: we take all of the disks on the node, create a single LVM volume on them, and expose that volume as a Cinder volume. So when you run the test you basically attach a single volume across all of the VMs which are running on that node. It's still local storage and a volume is attached to them, but instead of separate volumes pinned to each VM it's a shared one, and because of that we see quite a large difference in the performance.

I have two questions. The first one is about the Springer reference: is that about the paper itself, and did you publish the results or not? Well, we did publish it. These results were presented at the TPC Technology Conference, which happened in New Delhi in August, and they are going to be published by Springer. So this is something to come; it is referenceable material you can cite, as of course is the video once it comes out, but there is a Springer book that will come out from, I guess, the big database conference which includes the TPC Technology Conference. It's published by Springer, and it will contain the full paper.

It would be very interesting to reproduce the setup, and I was wondering whether you will also publish or share your configurations or something of that kind. Yes, somewhere among those slides we do have the reference, but as of now, because it happened so recently, Springer hasn't yet given us the final source. The paper itself will have all the information that could help you reproduce the test; it has the full configuration information, including all of the hardware bills of materials.

And the last question: it's unfortunate that you didn't reach the point where you exceed the performance of bare metal, but do you have intuition or hints for how this can be achieved? Yeah, I can try that. Well, first of all, there are things like the LVM driver we used: if you went with a raw driver you would probably get better performance. We also think that in places like the CPU pinning there is a possibility we didn't configure it very efficiently, because we only got a 2% performance improvement; there could be more. And there is a ton of other things; again, this was a limited-time project, and there is a lot more on the OpenStack side that we could do. I'll mention one angle which we wanted to pursue but didn't have time for: we have not tried to optimize the Java virtual machine. Hadoop runs in the JVM, so basically each VM has a single JVM running that has not been tuned for that VM size. That's where we expect to get the remaining improvements, to go over 100%.

Since I'm assuming you probably want to repeat this in depth, as you're discussing this, have you been evaluating the idea of changing the file system from HDFS to something else?
Thank you, excellent question. The answer to that is twofold. The first part is very simple: the only configuration we tried was replication at the HDFS level and no replication at the Ceph level. The immediate thing we wanted to do, but again ran out of time for, was to flip it around: have a replication factor of one on HDFS and rely on the Ceph replication factor. So that's one angle to pursue. The second, as you point out, is that Ceph is just one of the file systems we can try; we can try different file systems both at the Ceph level and at the local level. Again, we tried just the standard HDFS configuration; we have not tried switching the local file system, for example to ext4 or some other configuration. The Ceph file system was one of the candidates we actually wanted to try; unfortunately he made us stop, but otherwise that's something we could have done. And there is now a ton of other file systems that we may consider; obviously, if we get another opportunity, we'll definitely go through a number of them, but the Ceph file system was pretty much at the top of our minds at that time. Also, we only tried this on three compute nodes, so I think as we increase the number of compute nodes, and take the name node instance and the Cloudera Manager instance completely out of the picture, I think we'll get a little more balanced performance. That was another angle we thought of too, but it's like, okay, now we need to reconfigure hardware; sorry, no time for that. Any other questions?