Hello, let's start. My name is Andrew Lazarev and, as the slide shows, I work for Mirantis. For the last year I have actively contributed to the Sahara project, and I recently became part of the Sahara core team. The same question has been raised a lot of times: what is the performance of Hadoop on OpenStack? Even at yesterday's Sahara presentation, the only question raised from the audience was about performance. During today's talk I want to shed some light on this topic.

I'll start with an introduction and describe what I'm testing and why. After that there will be the results themselves, and the last topic I want to cover is data locality and how Hadoop users can benefit from it by running Hadoop on OpenStack.

First of all, what is Hadoop? Actually, there are two correct answers to this question. It can be either the whole Hadoop ecosystem, with the whole spectrum of Hadoop services, or it can be the narrow meaning of Hadoop: only the core components, MapReduce and HDFS. This talk is mostly focused on the narrow meaning of Hadoop, so only the performance of the core components was tested.

Why do we need Hadoop in a virtualized environment at all? The main driving force of virtualization is flexibility. It allows using resources only for the jobs that are really needed right now. If a user needs a Hadoop cluster, they can just create it; if the cluster is not needed anymore,
the user can just delete it. Some companies try to utilize resources by running big Hadoop clusters and sharing them between a lot of users, but in this case we need to provide a sufficient level of isolation for tenants. Hadoop wasn't originally designed to isolate jobs from each other. For Hadoop 1, for example, it is nearly impossible to isolate individual jobs from each other; the situation is better for Hadoop 2, but it is still not trivial. Running Hadoop on OpenStack allows isolating tenants at the hypervisor level, and even if some job crashes a cluster, only that cluster will be affected, not other clusters.

When we talk with customers about Hadoop on OpenStack, they usually understand well why we need it for test and staging environments, but they still have concerns about production, and the main concern is performance. So the main goal of this talk is to measure the cost of migration from a bare-metal cluster to an OpenStack cluster.

When I say that virtualization gives more flexibility,
I assume that there is a simple way to create clusters, modify existing clusters, and delete unneeded clusters. For OpenStack there is the Sahara project, which is data processing as a service: it allows provisioning Hadoop clusters and running on-demand jobs on them. Starting from the Juno cycle, Sahara is an officially integrated OpenStack project. It has a pluggable architecture that allows creating clusters with different Hadoop distributions and supporting different versions of Hadoop. Currently Sahara supports the Apache Vanilla plugin and the Hortonworks Hadoop distribution; both support Hadoop 1 and Hadoop 2. It also supported the Intel Hadoop distribution, but recently Intel unfortunately announced that they are closing their Hadoop direction and cooperating with Cloudera.

Okay, what can be the performance impact of moving from bare metal to a virtualized environment? First of all, there is some direct impact: I/O operations, network, and CPU. But there is also indirect impact. For example, when we run Hadoop on bare metal, we need to tune only one operating system, the system where Hadoop is running; in a virtualized environment we need to tune both the host and the guest systems, which is much more complicated. Another impact is that the hypervisor itself requires resources: for OpenStack, at least one of the nodes will be used as the OpenStack controller, and we can't use its power for running Hadoop. Another example is that on compute nodes we can't use all of the memory, because we need to leave enough memory for the disk cache.

Okay, now I'll describe the environment I was using. Probably some of you have already noted that Mirantis introduced the Mirantis OpenStack Express service, which can be described as OpenStack as a service: the user provides the parameters of the OpenStack cluster he wants, and after that he receives a fully installed and configured OpenStack cluster. I think this is pretty awesome, and I was privileged to
be one of the beta testers of this service. On it I used a 20-node cluster. Each node had 24 CPUs (actually 12 cores, but with hyper-threading), 32 gigabytes of memory, a one-terabyte disk drive, and two gigabit network interfaces. Note that all testing was done with ephemeral storage; using external storage is out of scope for this research.

In terms of operating system, I was using CentOS 6.5 for both host and guest systems. As OpenStack there was Mirantis OpenStack, and on my system there was a pretty old but stable QEMU/KVM 1.2; I'll say later why this is important. As for networking, there was native Neutron with GRE running on Open vSwitch 1.10, and I am listing this because the performance of the previous Open vSwitch version wasn't acceptable. For Hadoop, all tests were done on Hadoop 1, but I believe the results are applicable to any Hadoop version. For bare metal I used the same compute nodes as for OpenStack; I didn't use the OpenStack controller as a Hadoop node, to make the comparison with OpenStack fair. For OpenStack I mostly tested on a cloud with 19 nodes, and I'll say later why.

Okay, let's switch to the results themselves.
So I ordered a cluster on Mirantis OpenStack Express, created a virtual machine on it, and measured disk write speed using the built-in dd tool. It showed that disk write is 40% slower on a virtual machine, and this is actually not something I wanted to see. But maybe Hadoop does better. Hadoop contains a built-in job to measure I/O operations, so I created a Hadoop cluster and started a job to write one terabyte of data to the cluster, and it showed that Hadoop on OpenStack is two times slower.

The first idea why this could happen is that one virtual machine can't utilize all of the host resources, and that if I spawned several virtual machines on the same host, performance would probably be better. I tried it, and performance was even worse. But there should be some solution; it should be fixable somehow. There is a disk cache modes parameter in the Nova config file that controls how KVM configures disk cache parameters for the guest system. By default, the guest system fully relies on the host page cache, and there is an option to enable the disk write cache for guest systems. KVM doesn't recommend this option for systems where a crash of the VM can cause data loss, but in our case Hadoop replicates data by itself and was designed to be highly fault tolerant, so it seems pretty safe to enable this option. So I created one large VM per host, with all memory dedicated to this virtual machine, and ran the same tests. Disk write performance in this configuration is pretty much the same for the host system and the guest system. And what about Hadoop?
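The disk write check and the cache-mode change described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands from the talk: the file size is tiny, and the nova.conf section placement should be treated as an assumption that varies by Nova release.

```shell
# A rough local disk write check with dd, similar to the measurement above
# (the real test wrote far more data than this illustrative 64 MB).
# conv=fdatasync makes dd flush to disk before reporting a speed.
dd if=/dev/zero of=./dd_testfile bs=1M count=64 conv=fdatasync
rm -f ./dd_testfile

# The guest disk write cache discussed above is enabled on compute nodes via
# nova.conf; the option name is real, but verify the section for your release:
#   [libvirt]
#   disk_cachemodes = "file=writeback"
```

With writeback enabled, a host crash can lose data that the guest believes is written, which is why this is only reasonable for workloads like HDFS that replicate data themselves.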
Hadoop is a little bit slower, but not so drastically: performance is 18% slower, and I think this is already an acceptable result. But one large VM per host is not how Hadoop on OpenStack is supposed to be used. Usually customers want to share compute node resources between different clusters and between different nodes within one cluster. One large virtual machine per host can be a temporary solution until Sahara supports Ironic for bare-metal provisioning, but in general this is not how Hadoop on OpenStack is intended to be used, so there should be some other solution.

As I said before, I used a pretty old version of QEMU, 1.2. In a later version, 1.4, a new block device driver was introduced, and Red Hat claims it is two times faster on random writes and gives a significant performance benefit for other operations, like sequential reads and writes. Mirantis OpenStack stayed on version 1.2 because version 1.4 wasn't stable when it was tested, but currently version 1.7 is available, so there is a big chance that all the instability issues have been fixed and the new block device driver can be used in production. I didn't have enough time to test it, so more testing is required here.
Okay, let's switch to other factors that can be affected by virtualization. For disk reads, the hdparm tool shows that cached read speed is pretty equal for guest and host systems. It also shows that buffered read speed is much better on a virtual machine. This is an expected result, because when the tool is running on a virtual machine, the host page cache is used and the operating system of the virtual machine knows nothing about it. For Hadoop, reading is a little bit slower than on bare metal; the performance impact in our I/O test is about 20%. As for network, I was using networking with GRE, and the performance impact in this configuration is about six percent; it could probably be better using VLAN instead of GRE. For testing CPU impact, I used the built-in pi job, which mostly depends on CPU, and the test shows that the performance impact on CPU is only two percent. So for computation tasks the performance impact is minimal.

Okay, that was the impact on specific resources, but what about real-life scenarios? The Hadoop distribution includes a TeraSort job that pretends to represent a real Hadoop workload; it involves all kinds of computation. Since intensive writing is involved, the write cache in OpenStack was enabled, and I tested the configuration where one large virtual machine was running on each host. For this real-life workload, the performance of Hadoop on OpenStack is only six percent slower than the performance of Hadoop on bare metal.

Okay, the last topic I want to talk about is data locality. Hadoop can consider the distance between nodes: if Hadoop is installed, for example, in two big data centers, it is better to schedule a job to the data center where the required data is located. Using this fact, we can benefit.
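The topology awareness just mentioned is usually wired up by pointing Hadoop 1's `topology.script.file.name` property (in core-site.xml) at a small script that maps each address to a rack path. A minimal sketch, where the subnets and rack names are purely illustrative:

```shell
#!/bin/bash
# Minimal Hadoop rack-topology script sketch. Hadoop invokes the script with
# one or more IPs/hostnames and reads back one rack path per argument, in the
# same order. The subnets and rack names below are illustrative only.
resolve_rack() {
  for host in "$@"; do
    case "$host" in
      10.0.1.*) echo "/dc1/rack1" ;;
      10.0.2.*) echo "/dc1/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done
}
resolve_rack "$@"
```

Anything the script cannot classify falls into `/default-rack`, which is also what Hadoop assumes for every node when no topology script is configured at all.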
We can get all the benefits of isolation. For example, we can assign the data nodes and the processing nodes of Hadoop to different virtual machines, and they will run in different virtual machines on the same host. On the previous slides, when I talked about network speed, it was the speed between virtual machines on different hosts; on the same host, network speed between virtual machines is much faster, and it is comparable with disk speed. To demonstrate how running Hadoop on OpenStack can help with this kind of isolation, I used the same TeraSort example and separated the data nodes and the processing nodes. The test shows that if Hadoop knows nothing about the topology of the system, there is a significant performance degradation, but if we provide the topology to Hadoop, performance in this configuration is nearly equal to the situation when the data processes and the processing processes were running on the same virtual machine.

Okay, so this research showed that on real-life workloads, the performance of Hadoop on OpenStack is only six percent slower than the performance of Hadoop on bare metal. The main performance degradation is caused by I/O operations, and an upgrade of hypervisor tools like QEMU will probably help with it a lot, just like the upgrade of Open vSwitch helped a lot with network speed. The main benefits of virtualization are flexibility and isolation: isolation both for nodes in different clusters and for nodes in the same cluster. If you have any questions, you are welcome. That's all from my side. Thank you.

[Audience] One question. Have you also evaluated Linux containers, like Docker, and their effect on I/O operations?

[Andrew] That was out of scope of this research, so I just don't know; I didn't try it.

[Audience] Hi. First, congratulations, well done. I have a question regarding the benchmark part. You mentioned that you changed the number of virtual machines on the host, right?
So for different virtual machine sizes, did you change, for example, the Hadoop mapper and reducer numbers?

[Andrew] Yes, sure. Usually at the OpenStack Summit there are many more OpenStack experts than Hadoop experts, so a lot of technical details were left out of this presentation to make it clearer. It's a good practice to run the same number of workers as we have CPUs.

[Audience] Okay, thank you.

[Andrew] You have a question also?

[Audience] Did you find any performance changes or differences between the tests that you performed on bare metal using physical disks versus virtualized storage? If you touched on that, I must have missed it.

[Andrew] As I said at the beginning of the presentation, all testing was done with ephemeral storage, so I didn't test with network storage, volumes, or NAS; this wasn't tested during my research.

Okay, if there are no questions, thank you everyone for attending.
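The workers-per-CPU rule of thumb from the Q&A corresponds in Hadoop 1 to the per-node task slot limits in mapred-site.xml. The property names are real Hadoop 1 settings, but the values below are only a sketch for the 24-thread nodes described earlier, not a tuned configuration:

```xml
<!-- mapred-site.xml: cap task slots roughly at the node's CPU thread count -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>16</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
```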