One, two. OK, let's start. So today we want to present some results of performance testing of Hadoop on top of OpenStack. And I want to introduce the colleagues presenting together with me. Nikita is a software engineer at Mirantis, running the Sahara team now and working primarily on Sahara. Paul Work is an OpenStack software lab manager at Intel. And I am a principal engineer at Mirantis, working on Sahara and serving as PTL of the Sahara project.

A few words about the agenda. We are going to take a look at the Sahara project in OpenStack, at the performance lab setup and the challenges we hit during the testing, at an overview of the testing itself, and at some conclusions.

So, some disclaimers, as always, about running Hadoop in the cloud. There are pros and cons. On the pros side we can talk about a few items. First of all, it is a controlled infrastructure where you can deploy through the API and easily orchestrate different distributions of the data processing frameworks, such as Hadoop, Spark, and so on. You can use it for dynamic resource utilization, for example using spare capacity in the lab to run additional data processing workloads. And thanks to the isolation in the cloud, you can run multi-user, multi-tenant environments, sharing a single Hadoop cluster between tenants and users and enabling mixed, joint workloads. On another plus side, using virtualized environments in a cloud you can run multiple versions of Hadoop on the same infrastructure with isolation on all levels, so they will not take resources away from another cluster, and there are security benefits from running them, for example, on different networks. So you can run vanilla Hadoop, the Cloudera or Hortonworks distribution, or some other distributed frameworks on the same cloud. And in addition, it helps you share data across cloud applications, because all of the data is available through the same API on the cloud.

As always, there are also drawbacks to running Hadoop in the cloud. The first one is performance overhead, and it is very important, because we are going to have overhead on CPU and network for sure, and this overhead should be counted as one of the cons of running Hadoop in the cloud. Scheduling becomes non-trivial when you move Hadoop to the cloud, because you still need to manage the affinity of the data to the compute part, and you need to ensure that your data replicas will not all end up on the same hardware node, so that HDFS replication actually protects you. And of course, when you have a large number of VMs on a single physical node, it creates some additional performance overhead and makes the affinity case even more complex.

There are already existing solutions for running Hadoop in the cloud. Almost all popular public clouds currently have something related to big data, and specifically to Hadoop. The most popular are probably Amazon EMR, the Google Cloud Platform offerings, and the Hadoop and Spark workloads in Azure. And if we look at private clouds, at OpenStack specifically, we can use the Sahara project. It is an official OpenStack integrated project that provides a data processing service and data processing framework provisioning. Using Sahara, you can provision frameworks onto OpenStack and use the data processing API, and you can choose different distributions of Hadoop and other frameworks, with their versions.
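As a rough illustration of that last point, this is how you could ask a Sahara endpoint which plugins and framework versions it can deploy. This is a minimal sketch using plain REST calls; the endpoint URL, project ID, and token are placeholders, and the /plugins path follows the Sahara v1.1 API layout as I understand it.

```python
# Minimal sketch: list the provisioning plugins (Hadoop distributions) and the
# framework versions a Sahara endpoint advertises. URL, project ID, and token
# are placeholders, not values from the talk.
import requests

SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # hypothetical endpoint (8386 is Sahara's usual port)
TOKEN = "<keystone-token>"                                # obtained from Keystone beforehand

resp = requests.get(f"{SAHARA_URL}/plugins", headers={"X-Auth-Token": TOKEN})
resp.raise_for_status()

for plugin in resp.json().get("plugins", []):
    # Each plugin (for example "vanilla" or "cdh") lists the versions it can deploy.
    print(plugin["name"], plugin.get("versions", []))
```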
The cluster topology can be specified as well. You can define cluster and node group templates through Sahara to enable different sets of processes to run on each role of the cluster, and you can pass any kind of configuration to all the processes you are running on that type of Hadoop cluster.

A few words about the Sahara architecture. It is a very complex picture, probably not clear at all, so in a few words, the main points are these: Sahara provides a REST API, backed by Keystone for authentication, and you can interact with Sahara through the Python Sahara client and the CLI. We also have the Sahara dashboard in Horizon, which exposes all of the API functionality through the web UI. We have integration with different data sources, such as HDFS and Swift, and you can use Manila for shared file systems. Sahara uses Heat as the provisioning engine for the underlying resources, and that transparently uses Nova, Glance, Cinder, Neutron, and other services to provision the underlying resources on top of OpenStack to run the Hadoop cluster. So now I will hand over to Paul.

About a year ago, we were very interested in learning more about OpenStack and its performance for some big data applications that our software developers were working on. We partnered with the Mirantis folks to set up this experimental project to demonstrate the functionality of Sahara and the performance possibilities with direct block storage. To do this, we set up in one of our labs in Oregon a small cluster environment with 12 identical servers, all connected through a 10-gigabit switch, which was then connected through a VPN appliance to the internet; this allowed the Mirantis folks to come in over the internet to perform their testing and experimentation. It took us a little bit of a learning curve to get this set up and functioning normally, but after a short period it was very successful for us.

Each of these servers was configured identically with two Intel Xeon E5-2699 v3 processors; these are the Haswell generation, each with 18 cores running at 2.3 gigahertz. We also configured these servers with 192 gigabytes of RAM, in twelve 16-gigabyte sticks. And they each had 24 SATA hard drives, standard 1 terabyte, 7,200 RPM each. To get the maximum possible performance out of the hard drives, we set up each server with three RAID controllers set to pass-through mode. Then we also had one SSD boot drive, 800 gigabytes in size, and a dual-port 10-gigabit NIC on the motherboard. To tie all of this together, we used an Extreme Networks Summit switch, the X650 variety, and then a Comcast external uplink on an isolated network. This preserved security, so that no one else could hack in, and the Mirantis folks also couldn't get into the rest of the Intel internal network. And with that, I'll turn the time over to Nikita to talk about the experiments and the results.

Thanks, Paul. First of all, I'd like to briefly describe the software stack that was running on these servers. We had the controller and compute nodes running Ubuntu 14.04, the virtualization was backed by QEMU/KVM version 2, and the OpenStack distribution was Mirantis OpenStack 6, which is based on the Juno release of the upstream branches. Of course, during the testing we faced a few architectural challenges that we needed to solve to reach a reasonable level of performance.
I'd like to highlight three of them: the CPU-to-RAM utilization in the virtual environment, so that we keep the CPU and RAM as utilized as they are on the hardware nodes while running Hadoop workloads; the storage question, which we had to solve somehow so that the Hadoop daemons could read directly from the local drives on each compute host without network or other overhead; and of course, we had to tune the Hadoop distribution itself to maximize the utilization of CPU, RAM, and network.

The key to proper CPU and RAM utilization is keeping the ratio consistent across your nodes. Each VM should have an equal amount of RAM per vCPU core to allow the YARN containers to be spawned consistently. Then, what we noticed is that the VM flavor really matters, and you do not want to stack up a lot of small VMs on a huge compute host. And then there is the actual affinity issue, which could result in data loss: the DataNodes should be aware of whether they are running on a single compute host or on different compute hosts.

As for utilizing the resources and keeping the balance, well, there is no single answer for how you should tune your OpenStack or your flavors; all workloads are different. It is always true that adding CPU power will allow you to spawn more YARN containers and run more tasks in parallel, and for workloads like Spark, streaming, or plain MapReduce, that is usually the key. But for more balanced workloads, which also use IO and some caching, you need at least 2 gigabytes of RAM per vCPU core. So our final flavors had 32 virtual cores and 64 GiB of RAM for each DataNode and NodeManager VM, as sketched in the example below.

The results also show that a KVM virtual machine will usually consume 5% to 20% more RAM than you allocate as guest memory. That again gives you the hint that you should try to avoid spawning smaller VMs; they of course allow more flexibility in your scheduling, but a larger VM will give you less overhead. And you should never over-commit your resources while running heavy workloads; otherwise, you just end up with a race for the resources instead of running the workload.

The typical anti-affinity use case is when you try to spin up a Hadoop cluster with an HDFS service. If you just leave Nova with the standard scheduler filters, it can easily place your VMs with DataNodes on the same compute host. By default, the replication factor in HDFS is set to 3, so once you lose compute host number one, and if you had all three replicas on it, of course you will lose some of the blocks. Sahara doesn't allow this to happen: it will always place at least one of the replicas on a different compute host, outside of the first one.

Next, I'd like to cover the storage decisions that we used to optimize the performance. The things you should pay attention to while setting up these Hadoop-like workloads for batch processing are, first, the locality of your storage devices, because Hadoop was initially designed with locality in mind, and the closer the device is to the data processing unit, like your container, the better the performance is. Of course, you should not forget the reliability of the storage; replication and backup are always things to be considered. And from the operator's perspective, the ease of setting up the storage should not be forgotten either, because otherwise, why would you go to the cloud if it is harder to set up than a hardware server?
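Here is that flavor-sizing sketch: a minimal example of how such a 32 vCPU / 64 GiB flavor could be defined with the openstacksdk. The cloud name, flavor name, and root disk size are placeholders, not the exact values from the lab.

```python
# Sketch: a flavor that keeps roughly 2 GiB of RAM per vCPU, matching the
# 32 vCPU / 64 GiB worker sizing discussed above. Names and the root disk
# size are illustrative placeholders.
import openstack

conn = openstack.connect(cloud="perf-lab")   # assumes a matching clouds.yaml entry

flavor = conn.compute.create_flavor(
    name="hadoop-worker",
    vcpus=32,
    ram=64 * 1024,   # MiB, i.e. 64 GiB -> 2 GiB per vCPU
    disk=200,        # root disk in GiB, e.g. for YARN temporary files
)
print(flavor.id, flavor.name)
```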
So as for storage in the cloud, there are three popular solutions at the moment, and we would categorize them into these areas. The first one is Ceph, or actually any other distributed backend, which is connected to the compute node over the network interface and stores the data across the Ceph cluster or whatever other clustered system you use. The next one is LVM partitions, which can be stored locally or remotely, and those partitions are usually attached to the VM via the iSCSI protocol. And the last one is local device storage, which is the approach Cinder's block device driver allows: attaching a real block device directly to the virtio bus of the VM.

If you look at the diagram showing these three approaches, you can clearly see that in the first two cases, Ceph and LVM, there is always a possibility that you will have network interaction to read from the storage, while the block device driver in Cinder guarantees that the devices attached to the VM are always local, on the same compute host. There are drawbacks to each of these approaches, but what I would like to point out here is that the block device driver, which was used for this testing, has no network traffic impact at all on the cloud and no CPU impact on the compute host, because it does not require any iSCSI or LVM managers running. Of course, there are drawbacks: it does not support live migration, you cannot evacuate your block device from the compute host, and it does not allow flexible scaling, so the whole device is allocated all the time.

So that was the OpenStack configuration required for the testing, and now we will go through the configuration done on the Sahara side. First of all, we had to choose a Hadoop distribution; then, once we had a Hadoop distribution, we needed to define the topology to deploy; and finally, we had to configure the locality and the services of the Hadoop distribution.

As for the vendor choice, Sahara supports different vendors of Hadoop distributions, and we also have the vanilla plugin, which installs the upstream Hadoop packages. For this testing, we were using Cloudera Hadoop. The topology of our cluster was basically three node group templates, based on role separation. The smallest node group template was running the Cloudera Manager, and it was used only during the provisioning phase. The master template was running all the YARN and HDFS master processes, plus Hive and ZooKeeper to handle the coordination and the Hive queries. The rest of the cluster was occupied by the worker node groups, which were running NodeManagers together with DataNodes, and those had only local disks attached as storage. In total we had 12 worker VMs, grouped two per compute host, so six compute hosts were completely occupied with workers, and one was running the master VM and the manager VM on ephemeral storage.

To sum up, the layout of the flavors looked like this: as already said, 32 vCPUs with 64 gigabytes of RAM. We used some extra root disk storage on the workers for the temporary files required by the YARN containers. All the VMs had their swap disks disabled and were connected to that 10-gigabit internal network, and the volumes were used only on the worker nodes.
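To make that worker layout a bit more concrete, here is a rough sketch of what the worker node group template could look like when created through the Sahara REST API. The field names follow the Sahara v1.1 API as I understand it; the endpoint, token, plugin version, flavor ID, and process names (which vary by plugin) are placeholders rather than the values used in this lab.

```python
# Sketch: a worker node group template (NodeManager + DataNode) with locally
# attached volumes, posted to a Sahara endpoint. All concrete values below are
# illustrative placeholders.
import requests

SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # hypothetical endpoint
TOKEN = "<keystone-token>"

worker_template = {
    "name": "cdh-worker",
    "plugin_name": "cdh",
    "hadoop_version": "<plugin-version>",
    "flavor_id": "<32-vcpu-64gib-flavor-id>",
    "node_processes": ["YARN_NODEMANAGER", "HDFS_DATANODE"],  # names depend on the plugin
    "volumes_per_node": 12,    # one volume per local disk handed to the VM
    "volumes_size": 1000,      # GiB, matching the 1 TB drives
}

resp = requests.post(f"{SAHARA_URL}/node-group-templates",
                     json=worker_template,
                     headers={"X-Auth-Token": TOKEN})
resp.raise_for_status()
print(resp.json())
```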
Once the cluster is provisioned, the Hadoop configuration has to be applied to the YARN and HDFS services. And it is always important to know that any Hadoop distribution comes already configured with some default settings. Those default settings will happily let you run some small tests; they are consistent and they work, but they will never give you the best utilization of your cloud or your bare metal. So please do not try to achieve performance while running on the default distribution settings.

I'd like to briefly go through the configuration settings we did for the YARN service. There are actually two groups of configuration options. The first one is related to the MapReduce framework, and what it basically says is how many CPUs and how many gigabytes of RAM you can allocate to each map and reduce task. Then there are similar configurations, but for the YARN service itself, and those tell the YARN framework, which is the underlying framework beneath MapReduce, how many gigabytes and cores should be allocated to the containers in which the map and reduce tasks run.

The optimizations for HDFS were not as significant. We just increased the read-ahead parameter, so it could cache more data into memory while reading, and the block size was also significantly increased from the 64-megabyte default, since the cluster would be used for the synthetic testing; for different workloads, again, the setting may be different. We also disabled the HDFS permissions, just to ease the access to HDFS, but this does not have a huge impact on performance, and it should definitely not be read as a recommendation to run in production with permissions disabled.

So what did we use for testing? Well, we took the very standard Hadoop benchmark, which is TeraSort. It is a three-phase synthetic benchmark, and each of the three phases puts some components under stress. First of all, of course, there is the data generation step, which does a lot of disk write operations, writing random 100-byte rows to HDFS. Then the actual TeraSort benchmark comes into play: it reads all the generated rows, splits them into chunks, and sorts them by key, with entries sharing the same key sorted by value; after that, it stores them in sorted order. Replication of the output also comes into play, so there is a lot of network traffic being generated in this phase. And the last one is TeraValidate, the hash-based validation of the TeraSort output, which you can basically count as a disk read stress test.

So now we are coming to the actual results overview. Basically, we did maybe a few tens or hundreds of runs for each of these benchmarks, and here are the one-terabyte benchmark results collected on the virtual machines, compared to the same setup on bare metal. The bare metal setup spent about 15 to 16 minutes sorting one terabyte, while the VMs were somewhere around 20 minutes. The same test with three terabytes showed that bare metal can complete it in under 50 minutes, while the VMs were getting close to one hour on average. Judging only by the plain execution time of the benchmark, we can say that on a balanced workload the bare metal machines complete it 15% to 20% faster than the VMs. But this conclusion asks more questions than it gives answers, so we collected CPU and other metrics to analyze further.
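For reference, the three phases described above are normally driven with the stock examples jar that ships with Hadoop. A minimal sketch, assuming a CDH-style path to the examples jar and an illustrative row count; these are not the exact values used in the test:

```python
# Sketch: driving the three TeraSort phases (generate, sort, validate) from
# Python. The jar path and the row count are illustrative assumptions.
import subprocess

EXAMPLES_JAR = "/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples.jar"  # hypothetical path
ROWS = 10_000_000_000  # 10^10 rows x 100 bytes each = roughly 1 TB of input data

def run(*args):
    # Run a hadoop command and fail loudly on a non-zero exit code.
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, *args], check=True)

run("teragen", str(ROWS), "/benchmarks/terasort-input")                             # write phase
run("terasort", "/benchmarks/terasort-input", "/benchmarks/terasort-output")        # shuffle/sort phase
run("teravalidate", "/benchmarks/terasort-output", "/benchmarks/terasort-report")   # read/validate phase
```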
Starting with the CPU utilization graph: it shows that the map and reduce phases look quite similar on bare metal and in the VM setup, and there is a dip when the shuffle phase happens, because not much CPU power is used while the data is being transferred across the nodes. The RAM utilization for the VMs and bare metal also shows that the phases pretty much match up, but the bare metal nodes seem to utilize more RAM in each phase. Even taking into consideration that running two VMs on a node is not the same as running that single node directly under the same stress test, we still see that RAM is better utilized on bare metal. As for the network traffic, we actually see the opposite picture: the VM tests generate a lot more traffic across the cluster during the middle phase, which is the shuffle phase of the MapReduce job.

Looking at all three of these together, we can see that the CPU utilization was quite high in both cases, so it is not the bottleneck behind this 20% gap, but RAM consumption and the significant network traffic are the areas that need to be investigated further to bring the VM performance closer to bare metal.

So what are our conclusions so far from this comparison? First of all, the Sahara service and the OpenStack components proved able to provision and run a Hadoop distribution properly in a virtualized environment, placing the VMs with the regular scheduler filters, so no additional development was required for that. We also learned that bringing large VMs, with more CPU cores and more RAM, onto a huge node gives better performance and better utilization, and the Cinder and Nova APIs are just fine for that. We also learned that the Cinder block device driver can completely remove the network overhead from the storage layer, and it was quite easy to configure on the compute hosts running the cinder-volume process.

So what are the next steps? First of all, better tuning of CPU, disk, and network should allow us to close that 15% to 20% gap between bare metal and virtualized, so we need to continue investigating that. Also, there is the Ironic case: in some situations, installation onto bare metal may still be an option, even in a cloud environment, through the Ironic interfaces, but at the moment the user experience for the Ironic and Sahara integration is not that good, so some UX improvements should be done. And the third case to investigate is the hybrid approach, where part of the Hadoop cluster lives on bare metal and handles the long-living, persistent data that shouldn't move around a lot, while the workloads run on VMs spawned on the same or nearby hardware to handle bursty, spiky workloads and then scale back down to the bare metal part.

So that is actually what we wanted to share about our scale and performance testing of virtualized Hadoop environments. If you have any questions, you're welcome.

Just a very quick question: did you guys look at why the RAM utilization went up?

OK, so what we think could cause this higher RAM utilization on the hardware nodes is that the Java garbage collection works slightly differently with huge amounts of RAM. The NodeManager daemon was running with 64 gigabytes in the virtualized case and with 128 gigs on hardware, so that might cause the difference in consumption. Probably the caching system also behaves differently with these huge amounts of RAM. It is very hard to trace whether the consumption actually goes into the YARN containers themselves or into the MapReduce tasks. So that may be the difference.
Did you try any flavors larger than that 32-vCPU, 64-gigabyte one, all the way up to a one-to-one ratio with the host? Well, we didn't try to launch one VM per host, but when trying to launch larger VMs, we saw that the difference between the guest RAM size and the actual RAM used by the process on the host system actually grows. So the bigger the flavor is, the more RAM is wasted somewhere between the hypervisor and the host operating system, and I guess at some point we would not be able to just spawn one VM the size of the host.

Another question: when you spawn multiple VMs on a physical host, how did you divvy up the block devices? How exactly were the block devices scheduled to the multiple VMs running on that host? OK, so there is an instance locality filter, which is part of the Cinder scheduler. When spawning, the instance appears first, and then when you create a volume, you can just tell Cinder to place that volume on the same host as the instance. The cinder-volume service will create the volume, which is really just a mapping to a real block device, like /dev/sda or /dev/sdb, and then while attaching it you just point Nova at the Cinder volume and it will attach it to the virtio bus. It will not use iSCSI or anything else, because the block device driver doesn't need it.

So you directly hooked up a certain number of block devices; how many block devices per VM was that? We had 12 per VM. OK, thank you.

With the three copies, how did you get the third VM onto a separate host, and how did you make sure that was consistently applied across your cluster? To handle that, you need to use the Nova scheduler filters, and there are actually two approaches. With the first one, you can say: run this VM on a different host from the other VM. So if you spawn them sequentially, you can say run the first VM anywhere you want, then the second VM anywhere other than the first one, and the third one anywhere other than the first and second. But that is a very slow approach. The faster approach is to use server groups with the anti-affinity policy, which lets you limit how many of these nodes are allowed to land on a single host, and then the Nova scheduler will distribute them.

OK, so if there are no more questions, I think we can close this talk. Thank you.