All right. Hello. Welcome to benchmarking Sahara-based big data as a service. My name is Trevor McKay from Red Hat. On the schedule, this was actually listed as Matt Fairley. If you know Matt, he sends his regrets. He couldn't make it, so I'm filling in for him. I'm here with my colleagues, Zhidong Yu and Weiting Chen from Intel. We put this thing together. So I'll be doing introductions, and then Zhidong will be covering the performance data. We're going to start with why Sahara. Sahara has been around for a few cycles now. It's big data on OpenStack. We'll talk a little bit about what the motivation is for having that as a service. We'll go into details. A little bit about the architecture and features. Then we'll cover deployment considerations. From a performance perspective, what do you need to think about when you're putting these things together? Containers, bare metal, VMs, where your data is located, those types of things. They'll show some results that they came up with from Intel. Then we'll talk a little bit about where Sahara might go, and specifically what we're going to try to do for the Liberty cycle. Let's see. Big data is everywhere these days. It's really not a niche thing anymore. Everybody's doing it. My new friend, Asim, here from Cisco has extensive experience in that. You can talk to him about use cases. If you're working at Walmart or Yahoo or Facebook or Twitter, or you're doing genomics like the SwiftStack guys were talking about with Fred Hutch, there's just piles and piles of data. The CERN guys write petabytes, terabytes, exabytes of data all the time. Someone you work with is probably running workloads on something like Amazon or Azure or Google. It's just a reality. This stuff isn't going away. There are a few reasons to run workloads there. One is just virtualization, the benefits of OpenStack. We all know that. You also want these modern features.
You want storage, you want database as a service, you want Elastic MapReduce type facilities from Amazon. When you migrate to OpenStack, just getting the core services isn't necessarily enough. We love Neutron, we love Cinder, we love Glance, we love all that stuff. That's the DefCore stuff, and it's really, really important, right? Essential. But on top of that, you start to need to add applications. So newer projects like Sahara and Trove start to cross that boundary between just core services and applications out of the box, so that you can do things at the application level and you don't have to write them yourselves. So this is sort of Sahara's reason to exist, to make that possible for you. And of course, as the bottom line says, writing the applications is complex enough without having to manage all the infrastructure. Let's see. Okay, so big data analysis is tough. You have to acquire the data. There's a lot of it; that can be challenging. You have to organize it. Anybody who's worked in ETL knows that that can be hard, just managing this stuff and shipping it around. Then you have to write your analysis, figure out what questions you want to answer, and you've got to take your output and do something with it, right? Present it to somebody so that they have an answer for, you know, some customer informatics question that they were trying to answer. And again, all that complexity is just in the domain space itself. That doesn't include the tooling and infrastructure you need to do it. So, this is why we made Sahara. And some headline features here in case you don't know what it is. Sahara gives you repeatable provisioning capabilities. So, you define cluster topologies, what you want them to look like. You store them in a database. You load your images in from Glance, and you walk up and you press a button, and in a few minutes you have a Hadoop cluster or a Storm cluster or a Spark cluster or any other framework you want to write a plug-in for.
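To make "repeatable provisioning" concrete, here's a rough sketch of the kind of template you store: a plug-in name, a framework version, and node groups. The field names loosely follow the shape of Sahara's cluster template API, but the flavor IDs, process names and counts below are illustrative placeholders, not a definitive payload.

```python
# Sketch of a Sahara-style cluster template as a dict/JSON payload.
# Field names approximate the cluster template API; values are placeholders.
cluster_template = {
    "name": "vanilla-3node",
    "plugin_name": "vanilla",          # which plug-in provisions the cluster
    "hadoop_version": "2.6.0",
    "node_groups": [
        {
            "name": "master",
            "flavor_id": "m1.large",   # placeholder flavor
            "count": 1,
            "node_processes": ["namenode", "resourcemanager"],
        },
        {
            "name": "workers",
            "flavor_id": "m1.medium",
            "count": 2,
            "node_processes": ["datanode", "nodemanager"],
        },
    ],
}

# Total instances this template would spawn when you "press the button":
total_nodes = sum(ng["count"] for ng in cluster_template["node_groups"])
print(total_nodes)  # 3
```

The point of storing this in Sahara's database is that pressing the button twice gives you two identical clusters; the topology lives in the template, not in anyone's head.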
So, once your cluster is up, now you want to run jobs. And this is what EDP is about. That stands for Elastic Data Processing. That's basically your job management system. So, you might have bursty workloads. You might need, you know, various kinds of storage. So, we have things like cluster scaling. We've got integration with different storage types. We integrate with Swift and Hadoop-compatible file systems. We also pay attention to security for you, and things like anti-affinity and high availability considerations, so that once you have this up, your cluster doesn't go away or become, you know, unserviceable. So, these are all the things Sahara gives you. Here's a quick look at the architecture. So, this giant blob in the middle is the main Sahara engine. As with most things in OpenStack, it has a client that talks through a REST API. We have a page in Horizon. Within the main Sahara body, that DAL box down there at the bottom is the interface to Sahara's local database, where it tracks state and those kinds of things. It has provisioning logic in it, which talks through Heat to build you clusters, and uses Nova to spawn instances and Cinder to set up non-ephemeral storage. We get our images from Glance. We use Swift to store input and output data sets. And then, of course, there's the vendor plug-in piece. Vendor may be a misnomer, because there's actually a collection of grad students from South America who are doing awesome work. So, you don't have to be a vendor to write a plug-in for Sahara. If you want to write one, just write one. So, we should maybe change that name. I don't know what else to call it, though. But the vendor plug-ins have the logic to spawn a cluster of a particular type and configure it at runtime. So, you can see from this that Sahara really fits the big tent model of OpenStack. We're really a consumer of a large majority of the OpenStack services, and we use them ourselves.
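To give a feel for what a plug-in actually implements, here's a minimal sketch of the shape of a provisioning plug-in. Sahara's real base class (`ProvisioningPluginBase`) has more hooks and different signatures; the method names and the toy `SparkPlugin` below are simplified illustrations, not the actual interface.

```python
from abc import ABC, abstractmethod

class ProvisioningPlugin(ABC):
    """Simplified sketch of the contract a Sahara-style plug-in fulfills."""

    @abstractmethod
    def get_versions(self):
        """Framework versions this plug-in can deploy."""

    @abstractmethod
    def configure_cluster(self, cluster):
        """Push framework configuration out to the spawned instances."""

    @abstractmethod
    def start_cluster(self, cluster):
        """Start the framework services on the configured instances."""

class SparkPlugin(ProvisioningPlugin):
    # Toy stand-in for a real Spark standalone plug-in.
    def get_versions(self):
        return ["1.3.1"]

    def configure_cluster(self, cluster):
        cluster["configured"] = True

    def start_cluster(self, cluster):
        cluster["running"] = cluster.get("configured", False)

cluster = {"name": "demo"}
plugin = SparkPlugin()
plugin.configure_cluster(cluster)
plugin.start_cluster(cluster)
print(cluster["running"])  # True
```

The engine in the middle stays generic: it spawns instances through Heat and Nova, then hands the cluster object to whichever plug-in knows how to configure and start that particular framework.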
I forgot to mention there is a raffle at the end of this for this awesome biometrics gear that will measure your heart rate when you come to exciting talks like this. So, stick around. You don't want to miss the raffle, okay? Let's see. So, how do plug-ins work? We have plug-ins for particular processing engines. So, there's an HDP plug-in for the Hortonworks Data Platform. There's a Cloudera plug-in for the CDH stuff. We have a vanilla Apache Hadoop plug-in. We've got one for Spark, the Spark standalone deployment. We've got one for Storm. Really, any processing framework that you want to add, you can. I was just talking to somebody today who has a new storage system from their company, and they're thinking about writing their own plug-in to be able to use their storage with Sahara. So, obviously, to vendors, this is a way to integrate with OpenStack and get accessibility to a user base who's consuming OpenStack. You know, you've got a big OpenStack cluster at a place like Walmart or something like that. They may have various applications to run on their stack. One of them probably is going to be big data processing, so it becomes one-stop shopping. Let's see. Oh, MapR, I forgot to mention. That's relatively new. I think maybe last cycle it was being added. Storm is brand new. And, of course, in downstream distributions like RDO from Red Hat or Mirantis OpenStack, those vendors will certify particular plug-ins, right? And they'll do all the QA and give you reference architectures, so that you know when you go and deploy this thing it's going to work and you don't have to worry about it. Let's see how we're doing on time here. Oh, okay. Great, perfect. Okay, so now I'm going to hand this over to Zhidong and he's going to go over the performance findings with you. Good afternoon. My name is Zhidong Yu and I work at Intel. We are committed to cloud and big data, and we believe that for big data, moving to the cloud is an inevitable trend, so that's why we work on this Sahara project.
So when you start deploying a Sahara-based solution, there are a number of things you should consider. In this section we will go through those considerations. It's kind of like a functionality benchmarking. Then, following this section, we will show a little bit of the performance testing we have done and the results. So that's the performance benchmarking. Then, based on what we have seen in the functionality and the performance benchmarking, we are going to talk a little bit about our vision of the future of Sahara. Then finally we will have a summary and a call to action. So the first thing you need to think about is the storage architecture. There are basically two major different ways to provide the storage in Sahara. The first is you can let the tenants provision the storage by themselves. By default Sahara can help deploy the HDFS components within the virtual machines. So the good thing about this model is everything is almost the same as the traditional bare-metal model, where the compute and storage are co-located together. But the problem with this is that when you run it on top of cloud, you are actually expecting an on-demand service. That means you will launch the virtual cluster and terminate it from time to time. So with this model you have to import the data from somewhere else before you can really do the data processing. And you also need to persist the data back to somewhere before you can terminate the cluster. That's not exactly what users want to see. So instead of this there's another option: the tenant could provision the storage in another set of virtual machines, like scenario 3 in the chart. With this scenario the storage VMs can persist longer than the computing clusters. So that kind of fixes the problem we were talking about in the first scenario. In those two scenarios the storage is actually a disk attached to the virtual machine. The disk could be either an ephemeral disk or a Cinder volume.
With an ephemeral disk, the good thing is that the computing and the underlying storage are actually co-located on the same physical machine. So there is still a chance for Hadoop to achieve data locality. But if you use Cinder as the volume back end, it depends on how you implement the Cinder service. You may use LVM, which is co-located with the compute, or a network-remote Ceph implementation or something. So in that case you may lose the possibility to achieve data locality. Another way to provide storage is that the admin could deploy the storage system in advance. In that way the storage is actually logically disaggregated from the computing tasks. Tenants will only run the computing tasks within the virtual machines. But this implementation could be either physically co-located with the computing or a totally network-remote solution. So in the diagram here, in scenario number two, we deployed the HDFS at the compute host level. The HDFS system is external to the virtual machine, but with proper consideration it's still possible to achieve local data access. But in the other scenario, scenario number four, the virtual machines are accessing data in a network-remote Swift. In that case all the data access will go over the network. There is no data locality. By the way, in both modes, since the virtual machine needs to talk to the outside of the VM, the network is very important. In our testing we found one of the new Neutron features, DVR, is very helpful for this. So our opinion is that a disaggregated storage system has much more value than a built-in, I mean within-the-virtual-machine, storage system. In this disaggregated mode there are no data silos; all the virtual Hadoop clusters can share the data in the same system, so there is more opportunity to find potential new business usages or something. And another thing is that, with this mode, Sahara may be able to leverage the OpenStack Manila service.
So that means Sahara may be able to call the Manila API on behalf of the tenants and let Manila handle all the storage provisioning, since Sahara is just a consumer of the Manila services. This approach also creates more opportunity for the vendors, so they can build some advanced solutions. For example, creating an in-memory overlay file system within the compute nodes, like Tachyon or something. So this is about the storage architecture. Then the next thing is the compute engine. Basically we have three choices: virtual machine, container and bare metal. We all know the pros and cons of virtual machines. For containers, we know they're lightweight. Provisioning is very fast, but the security is kind of a problem because it's weaker than a virtual machine. The other thing is that currently the nova-docker driver is still not in the upstream, so Sahara actually cannot support containers very well. The last option is bare metal. I know this morning there was a panel discussing using Sahara and bare metal together. For me, the good thing about bare metal is that it has the best performance of all three options. But the problem is that we have tried to deploy Ironic in our lab, and Ironic is less mature than both VM and container. The other thing is that resource utilization efficiency is still a problem. Every single virtual cluster is still monopolizing a single physical cluster, so there is no resource sharing at all, and migration is also a problem. Provisioning is also slower than containers. My point, my opinion, is that the container seems to be the most promising technology for running big data on top of cloud, but unfortunately currently Sahara doesn't support containers very well, so there are many things ahead.
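To keep the three compute engine options straight, here they are side by side. These are our qualitative labels, taken straight from the points above; a recap sketch, not a benchmark.

```python
# Qualitative recap of the three compute engine options discussed above.
compute_options = {
    "vm": {
        "provisioning": "slower than container",
        "isolation": "strong",
        "sahara_support": "good",
    },
    "container": {
        "provisioning": "very fast",
        "isolation": "weaker than VM",
        "sahara_support": "poor (nova-docker driver not upstream)",
    },
    "bare_metal": {
        "provisioning": "slower than container",
        "isolation": "physical, but no resource sharing",
        "sahara_support": "immature (Ironic still maturing)",
    },
}

# E.g., which options provision quickly:
fast = [name for name, props in compute_options.items()
        if props["provisioning"] == "very fast"]
print(fast)  # ['container']
```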
On the other hand, there is another problem common to all those three options, which is that even with Sahara the tenants still need to worry about how to determine the appropriate size for the cluster and what kind of flavors should be used with this cluster. They can choose to use a large flavor with fewer nodes, or they can use smaller flavors with more nodes. So it's not easy for tenants to make this decision. In many cases the tenants are just developers or data scientists; they just want to use the services. They don't want to know too much about the underlying complexities. The third thing in the considerations is the data processing API. By default, when you finish provisioning a cluster, you can use it as usual. Tenants could just SSH into the virtual machines and consume the service as usual. If the specific distribution provides a high-level API like Oozie, tenants could use that as usual. But in Sahara we recommend using the EDP API. EDP is designed to be an abstraction layer for users to consume the underlying service. So ideally EDP should be vendor-neutral and plug-in agnostic, meaning you can always switch among the Sahara plugins like CDH or HDP. But unfortunately, currently EDP is still under development. We have implemented a lot of features, but we could add more, like more job types. And I believe a few hours ago there was a session talking about adding more features to EDP. So there's another option. You can use a third-party abstraction layer like the Cask CDAP. Cask CDAP is an abstraction layer to make big data developers' lives much easier than before. In theory those components can be used to replace EDP, or anything that can do that kind of job. But unfortunately Sahara currently does not support third-party abstractions. So as a summary, this is a matrix showing all the things you need to consider. On the bottom there are storage and compute. There are a few options. Then the next one is the distributions.
There are many options, but we did not talk about them here because it's totally up to the consumers or the users. On the top is the data processing API. Currently we have the traditional way, the EDP, and potentially we could support third-party APIs. So with this in mind, going back to this page, we did a few performance tests to compare the options for the storage and compute layers. The top layer is totally about functionality, so we did not do any performance testing there. We just used a four-node cluster to compare a few configurations. The first one is we compared the ephemeral disk performance with the bare-metal disk performance. Basically, the left of this chart is the read performance, the right is the write performance. We see 1.3x read overhead and 2.1x write overhead. We understood that in the cloud context the disk access pattern totally changes, right, because in OpenStack a typical configuration is that you have a bunch of disks, you create a RAID and have a logical volume, then the logical volume backs the Nova instance store, and then every virtual machine will have its own image file from the instance store. The access pattern is totally different from the bare-metal scenario. However, our characterization shows that only 10% of the overhead is from the access pattern change. So most of the overhead is still from virtualization, like the IO and memory efficiency. We have already done a few tunings here, not very extensive, but the performance is still not good. So we believe if you want to achieve good performance, heavy tuning is required. We compared another scenario. We moved the HDFS from the virtual machine to the host level. The good thing is that we can see the performance for the write case improved significantly. For the read case it's still higher, but we know the reason. The reasons for the improvement in the write case are, first, that the virtualization overhead is removed.
Now the HDFS can access all the disks in the physical node. I mentioned that the Neutron DVR feature helped optimize the network path. So that's another reason for the performance improvement. However, the reason for the remaining read overhead is the location awareness feature for Hadoop in cloud environments, called HVE. This is a feature developed by VMware in Hadoop to support their vSphere product. Actually Sahara can benefit from that feature as well, but somehow, when we tried to enable this feature, it didn't work in our environment. So if we can make this work, we expect to see a better performance for the read case as well. Then we compared this with the Swift performance. Same as before, this overhead is higher than the ephemeral disk case, and I believe we know the reason. The reason is that the location awareness feature for Swift is not enabled. We tried to enable this, but somehow we ran into errors. So in Sahara, if this feature can be turned on, then we should see similar or even better performance than the internal HDFS. So the conclusion from the last page and this page is that if we can move the data from the virtual machine to the outside, we have a chance to achieve better performance. And at the same time, as I said before, now we have a centralized storage. It may bring more business opportunities. So the last test case compared bare metal versus container versus VM. I just said Sahara doesn't support containers very well, so we actually did a few hacks to make the container work. We used the nova-docker driver. So the performance shows the virtual machine has 2x overhead, the container has 1.4x. One thing I need to call out is that we have done a few tunings for the VM case, but everything with the container case is out of the box, and it's still much better than the virtual machine case. So we think the container is very promising in terms of performance compared to VM.
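A quick note on how to read the Nx overhead figures in these comparisons: Nx means the run took N times as long as the baseline (bare metal), or equivalently delivered 1/N of the throughput. A trivial sketch with made-up throughput numbers, not our measured data:

```python
def overhead(baseline_mb_s, measured_mb_s):
    """Slowdown factor relative to the baseline (e.g. bare metal)."""
    return baseline_mb_s / measured_mb_s

# Hypothetical DFSIO-style throughputs in MB/s (illustrative only).
bare_metal = 200.0
vm = 100.0         # half the bare-metal throughput
container = 143.0  # a bit under three quarters of it

print(round(overhead(bare_metal, vm), 1))         # 2.0 -> "2x overhead"
print(round(overhead(bare_metal, container), 1))  # 1.4 -> "1.4x overhead"
```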
So based on what we've seen in the functionality and the performance benchmarking, here are a few things in our mind that we expect to see in Sahara. This is not a roadmap. This is just something we think it would be better for Sahara to support. The first thing is we need an architecture that allows disaggregated compute and storage. We want Sahara to support more storage back ends. If possible, Sahara could be integrated with Manila. The second thing is we need better support for containers and bare metal. I believe bare metal is already in the roadmap; there are talks in this summit. But for containers, we should consider whether Nova is the only option, or maybe we should use Magnum. The third thing is the EDP. As the abstraction layer for Sahara, it needs more improvements, like data connectors, workflow, a policy engine, SLA, auto-scale, auto-tune. So ideally, as I said before, the users don't want to worry about anything in the underlying infrastructure, like how many nodes or what kind of VM flavor. Ideally, this abstraction layer should take care of this. Even auto-scale: when the load is high, this layer could add more nodes to the cluster, something like that. Finally, I think Sahara should offer broader vendor integration opportunities, not just the big data engines like Cloudera and Hortonworks. This is because a complete big data stack may have many options at every layer. You have the storage at the bottom, then you have the data platform, and you also may have a few data analytics layer components on top of that. So there are many vendors at each layer. I think Sahara should open these opportunities to all the vendors. You can see an example here. Another example is the SAS analytics products. Okay. So this is pretty much my talk. Trevor will continue with the current roadmap for Sahara in Liberty and then conclude with a summary. Okay, so quick note. How many questions do you think we have out there? Show of hands.
People want to ask questions? Haven't thought of it yet? Because I can talk fast or slow. All right. Looks like maybe we have a couple. Okay, so let's see. I went back too far. There we go. So that last slide wasn't a roadmap. This is a roadmap. Okay, and these are the highlights for the cycle, the big things we're going to be doing. HDP is one of our major vendor plugins. It's going to be ported to HDP 2.2. If you're familiar with Hortonworks and Ambari, you know that there are a lot of improvements there. We're going to be working on HA for the CDH and HDP plugins. That's HA on the cluster level. So there's HA for Sahara itself, and there's also HA within the cluster, within the Hadoop services. That's what that's about. We are using Heat as our primary provisioning engine now. We used to have a direct engine. We had Heat as well. Now the direct engine is being deprecated. Heat is the way to go. That's what you should use. And we're going to add features to it. We also plan to bring Sahara to the Python OpenStack client, the one-stop-shopping client. So that will be great. Bare metal clusters, we've already talked about that several times this summit. So we're going to be working on integration with Ironic. Security enhancements are very much on our mind, because we realize that that's important. So there will be a lot of work there. And also a number of things in EDP land. So job scheduling, repeatable jobs, future jobs, coordinated workflows. We were just talking about that today in a panel. So ultimately, someday, we hope to have things like directed acyclic graphs with dependencies from one job feeding into another, so you can develop more complex things. Log retrieval for simple debugging. Very simple feature, but very helpful when something goes wrong with your job. You want to know why. And also, if you've used EDP before, you know that the argument passing and the parameters that you can pass to a job are somewhat fixed for some of the job types.
So we have a model figured out for making that much more general and much more flexible, so you can run all kinds of things. Okay, so our call to action. We did a lot in Kilo. We have real customers out there using this stuff, running real workloads. We're beginning to get all kinds of customer feedback, with customers asking us to help them out. So this is something that you can really use. It's maturing. Let's see. Honestly, I'm not sure, well, I guess that's a summary. I'm not sure about the thought behind the second point. It's a true statement: bare metal is still preferable for performance. There was an interesting note on that on the panel earlier today. It was a developer from Cloudera, and he noted something which I think is true. The performance characteristics are going to be sort of dependent on your workload. So bare metal is going to be more important for people who are doing real-time, on-demand kinds of analysis, queries against data, trying to figure something out, and that's where your performance is going to matter more. If you're running batch stuff in the background overnight, it's not going to matter as much. So there's still room. VMs might be slower than bare metal. Okay, fine. Does that mean if you're running Sahara, you have to go out and buy a whole bunch of new hardware and use Ironic and run stuff on bare metal? Well, no, it depends on your workload. So there's a performance difference, and tuning is important, but it also is going to depend somewhat on who you are and what you're doing. This is probably the major point at the end here. We're a smallish team. I don't know, I haven't added them up. Maybe what, 20 contributors, something like that, really active people. We can certainly use more. We come up with more awesome ideas than we have people to do them, right? So as someone else said this week, if you're looking for a project, come join ours. I guarantee you we'll give you something to do.
Let's see, we have our, let's see, we're doing, oh great, five minutes for questions. I didn't put it on the slide. Our meetup is tomorrow in room 218. That's 218. We'll be there all day long. Please come, we love visitors. We had a bunch of visitors today. So come hang out with us, tell us what you think. Ask us questions. It's great. He hung out there all day today. So I guess that's it. And if you have questions, please use the mic. If you don't, I'll have to repeat them. Questions from anybody, for either of us. Andrew Lazar from Mirantis. I was doing performance testing of Sahara two summits ago, and during my testing there was a really huge difference in performance depending on configuration. Your slides show just the results; can you share more details? What was the host cache parameter? Which case do you mean? This one or this one? For example, the ephemeral storage use case. This one, we used DFSIO. Actually, I compared our results with yours. Because with these tests, the speed is very sensitive to, for example, chunk size. So you will get completely different results for small chunks and for big chunks. And the tests are very sensitive to the host page configuration. Because if you enable... In this case, we enabled huge pages and we did see performance improvement with huge pages enabled. I think our case is a little bit different from the results you guys published last year. Of course, you used a 1TB data size, and you used DD as the workload. We used DFSIO, with a much smaller data size, only 64GB. But I compared the results. They're very similar. You guys reported a 2x performance overhead. And what drivers do you use? Drivers of QEMU or... Disks? No, I mean... QEMU drivers... QEMU? What kind of... Because we saw... You mean virtio or QEMU? Yeah, virtio drivers. I think it's 2.7 or 1.7. Because we see... Yeah, QEMU, but Andrew's question is the driver mode for the storage. Is it virtio or QEMU? Yeah, the version of the driver.
Because we see a huge difference between 1.7 and 2.7. So, what driver was used? It depends on what the default is in OpenStack Nova. For this part, we did not do much tuning. We only tuned a few OS-level things like huge pages. We tried to make the performance look better, but on the other hand, we realized maybe we just needed to show the out-of-the-box performance that the majority of customers will see. So we didn't want to do too much or too extensive tuning. Thank you. So, hi. During this week, I've heard two conflicting things. Some people say, oh, compute nodes have local storage, almost always. And then other people say compute nodes always use network storage. And I'm saying that when you look at this problem, you are saying that the overhead is mostly because of IO. Do you see a solution that makes this optimal in both cases? It's more like a curiosity. I don't really have a question. Well, you know, with the data here, we compared a few cases. We did not really compare local storage versus remote storage. So currently, we don't have data points to show how bad or how good it is if you put all the storage on the network-remote side. But do you think the overheads would be the same if you're comparing network IO and disk IO, so you have the same... Oh, no. I would expect the worst performance if you used a totally separated Ceph or GlusterFS as the storage back end. Because after all, all the data traffic goes through the network. But on the other hand, it's a typical deployment for most OpenStack. People like to use Ceph as the universal back end for Glance, Nova and Cinder. If you deploy Ceph or Gluster together with the compute, they may bring other issues like resource contention. You will not get deterministic performance. So that's one thing we plan to do in the future. We will compare the performance. In those cases, the disks are all collocated with the compute.
We will compare those cases with other cases like a network-remote Ceph or GlusterFS. Right, so you don't believe that the overhead is just the IO itself, the usage of the disk. So I think we're over time. We'd love to answer your questions anyway, so come on up and we'll answer them. But I guess that's it for the official talk.