Making sure that different plugins and solutions work on top of OpenStack. So this session is about Hadoop and OpenStack. At Rackspace, we are a pretty active user of Hadoop. We've been using it as far back as I can remember, since about 2008; Email & Apps was a big user of Hadoop from the beginning. A lot of internal teams within Rackspace, especially in the cloud, use Hadoop, mostly to do log processing and other kinds of analytics on big data. From a customer's point of view, customers who want to do Hadoop on Rackspace can use our dedicated servers; we provide support on top of the servers, and you can work with a partner like Hortonworks to get support for Hadoop. On the public cloud side, we don't have a Hadoop-as-a-service offering yet, but you can do it yourself, and you can expect something in that space from Rackspace soon. In the private cloud, we can deploy an OpenStack private cloud at your data center or at ours, and among the different applications you can run on an OpenStack cloud, Hadoop is one. It's an especially good use case when the data is being generated within your own data center: if you have OpenStack running in that data center, you don't need to go outside your data center to process the data. Beyond that, Hadoop on a private cloud is a good use case for development and testing of Hadoop too. If you are not familiar with Hadoop, I'll quickly go through the architecture. Hadoop mainly consists of two parts: a master node and a bunch of worker nodes. The worker nodes are called data nodes, and they run the DataNode service and a TaskTracker. The master node runs the NameNode and the JobTracker. So you have a NameNode managing a bunch of DataNodes, and Hadoop replicates data in order to provide reliability.
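To make that replication idea concrete, here is a toy Python sketch of the placement decision. It is purely illustrative: HDFS's real NameNode placement logic is rack-aware and far more involved, and the function and node names here are made up for the example.

```python
import itertools

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin.

    A toy stand-in for the NameNode's placement decision: because every
    block lives on several distinct nodes, losing one node loses no data.
    """
    ring = itertools.cycle(range(len(datanodes)))
    placement = {}
    for block in blocks:
        start = next(ring)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
# Each block is assigned to 3 distinct data nodes out of the 4 available.
```

The point of the sketch is just the invariant: with a replication factor of three, any single data node can fail and every block still has two live copies.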
That is the HDFS, Hadoop Distributed File System, side of it, and Hadoop provides computation on the data through a framework called MapReduce, where you write MapReduce applications to do parallel processing of the data. So if you want to do Hadoop on the cloud, you probably want to go through the installation first. Since Hadoop is a distributed system, there are many components running on different servers, so you probably want to use provisioning software. If you do a quick Google search, you will probably come up with Apache Whirr. It's an Apache project that started with deploying Hadoop on Amazon EC2, and since it uses the jclouds API, it supports a lot of different clouds, including OpenStack. It is easy to use and offers a common API across clouds, but it's pretty limited in functionality. With Whirr, you define an instance template where you specify the number of name nodes and data nodes and where you want them installed, and then you run launch-cluster with that configuration file and it launches the cluster for you. But once you launch the cluster, you really cannot do much with it: you cannot modify it, and it doesn't provide monitoring and things like that. So it's good for development and testing. The next step you'll probably take is to write some Chef recipes and do the Hadoop deployment yourself. Chef is really good at that, and if you're already using the same configuration tool to deploy other applications in the cloud, using it for Hadoop as well works nicely. But when you're talking about managing a cluster, Chef can be pretty difficult to use. Still, you can make it work.
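As a rough illustration of the MapReduce model just mentioned, here is a minimal local simulation in Python of the map, shuffle, and reduce phases for a word count. It's a sketch of the programming model only, not an actual Hadoop job, and the function names are invented for the example.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reducer: sum all the counts emitted for a given word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group the mapper output by key, then reduce each group.
    # On a real cluster these phases run in parallel across many nodes.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

result = word_count(["hadoop on openstack", "hadoop scales"])
# result["hadoop"] == 2
```

The appeal of the model is that the mapper and reducer are pure functions over records, which is what lets Hadoop parallelize them across the data nodes holding the input blocks.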
With Chef and the knife plugin, you could do something like knife openstack server create with a given image and flavor, and then give the node a role of a master node or a data node. As you need more data nodes, you keep running knife server create to add them. It's pretty good, and most people who do Hadoop on the cloud use Chef, and they can use the same recipes for dedicated servers too; it doesn't have to be cloud, apart from the knife bootstrap part. Another thing that keeps coming up is Ironfan by Infochimps. It's also open source software, and it is very good on the EC2 side, but they don't have support for OpenStack. They do have a press release and a blog post saying that they support OpenStack, but there is nothing in the code. I talked to them and they said, yeah, we did talk about it, but we never did it. If you're using Amazon, it's pretty good, and somebody could probably go and add a provider for the Rackspace or OpenStack cloud too, but it's not there yet, so somebody needs to do that work. It builds on top of Chef and knife. Another option is Apache Ambari, also an Apache project. It is mostly UI-driven, point and click, though it also has an API; almost everybody uses the UI side of it. Beyond HDFS and MapReduce, it provides add-on services like Hive and HCatalog, and it provides monitoring through Nagios and Ganglia. With Ambari, you have an Ambari server, and then you install agents on the nodes where you want the Hadoop cluster. So you have to already have a list of nodes where you want to run Hadoop; there's really no way to provision the machines themselves.
So if you look at all the deployment options, there is really a need for a Hadoop project that works with OpenStack seamlessly with VMs, and that handles not just the deployment but managing and monitoring the service. When I submitted this talk a couple of months ago, I didn't know whether there was any interest in Hadoop on OpenStack, because in the previous conferences I didn't see anything like that. But since then I've seen a lot of interest, and a project has been announced in the OpenStack community for doing Hadoop as a service on OpenStack, with a lot of companies working on it. So at this point I'm going to hand it over to Himanshu to go through the rest of the talk on what he's doing with Hadoop and OpenStack. Himanshu is from Hortonworks, and if you don't know Hortonworks, they are one of the leading contributors to Apache Hadoop, and they have recently joined the OpenStack Foundation too. So they really bridge the gap between OpenStack and Hadoop and work in that area. Cool, well, thank you, Sudarshan. My name is Himanshu Bari. I'm a product manager at Hortonworks, and one of the things I'm responsible for is taking our Hortonworks Data Platform and enabling it across different cloud environments. A quick snapshot of what Hortonworks is, since this is an OpenStack conference and some folks may not know us: we are the only 100% open source distribution of Apache Hadoop. We are a spin-off from Yahoo, so we employ the key original architects, developers, and operators of Hadoop within Yahoo. We work with these folks to drive a lot of innovation within the Apache Hadoop community, and we also enable a lot of our partner ecosystem; some of that is validated by the key strategic partners you see listed here. And it's not just about Apache Hadoop: a true enterprise distribution for Hadoop involves a lot of supporting projects around Hadoop.
So we take all of these together, run them through our test suite, and distribute a complete, 100% open source enterprise Hadoop distribution. And the last piece is that from all the experience we've gained supporting the Hadoop platform at scale across various enterprises, we provide enterprise-class support to all of our customers. So that's a quick overview of Hadoop and Hortonworks, and that takes us to the next question: why Hadoop and why OpenStack? Hadoop and OpenStack are probably two of the biggest trends in IT today. So it sort of makes sense: two celebrities, let's get them married. The way I see it, we don't want the marriage to be like a typical celebrity marriage; we want it to last. So that raises the question: what does each party bring to the table? Let's spend a couple of minutes talking about that. What does OpenStack really bring to Hadoop? As we work with our customers, what we find is that as the little elephant settles down within a typical enterprise, there are a lot of challenges, specifically operational challenges, around Hadoop in the enterprise. If you look at the elephant here in the middle, you've got the different departments: there's finance, there's marketing, there's compliance. Everybody wants their own little version of Hadoop, and they have different requirements in terms of capacity, privacy of data, and other issues. Combine that with the different sources of data, like data from the web and data from mobile, and that means not only do you have different clusters for different departments, there might be different sets of clusters to support the different characteristics of the data. On top of that, there are different types of use cases: there are batch use cases and there are interactive use cases. So we often find customers wanting different sets of clusters to support different types of use cases.
And that's not all: as an enterprise goes through its typical Hadoop adoption journey, they find that they have to run clusters for QA, for production, for testing, for performance validation. And that means there are all these different types of clusters that a typical operations department has to support through the whole life cycle of the project within the enterprise. To solve that problem, what we started seeing customers do is go to Amazon, spin up a cluster, and do some basic testing there. And the next question that raises is: okay, we want to bring this in-house now, how do we do that? The fundamental gap is that Amazon is not an environment the customer controls. So just because something works on AWS doesn't necessarily mean it's going to work in your own private cloud deployment. This is what OpenStack really brings to the table for Hadoop: it gives you a way to alleviate all the different operational complexities of running a Hadoop cluster within the enterprise. We'll talk about some of those use cases in detail next. The other side of the coin is: what does Hadoop bring to the table for OpenStack? In my mind, it's really three things. First, Hadoop is new to IT, new to the enterprise, so it doesn't have all the bells and whistles and legacy processes attached to it. That gives you the perfect use case for a typical enterprise to do their POC on OpenStack. Second, it scales horizontally. For a typical cloud application, the general expectation is that high availability and scalability should be built into the application itself, and it should not rely so much on the infrastructure to provide those capabilities. That ties well to the architecture of Hadoop, because it scales horizontally. And third, it's really a big, giant shared platform in the enterprise.
We just spoke about how different groups need to use the Hadoop platform for different use cases. So all three things put together mean it's the perfect greenfield use case for OpenStack in the enterprise. All right, with that, I want to say that Hortonworks completely supports Project Savanna, which was recently announced by our friends at Mirantis. Savanna really promises to be the glue that ties Hadoop and OpenStack together. Now, just like in a marriage, there need to be certain rules about how you work with each other to make sure there's no conflict, right? So there's a little charter I have on the right here that calls out the responsibilities of Savanna: what is it that Savanna will bring to the table for this integration between Hadoop and OpenStack? It starts with tracking Hadoop clusters and mapping the different clusters to tenants, and it will provide an API. The API part is really key here. What it's saying is that we don't want Savanna, we don't want OpenStack, to be too deep into the details of how to configure and manage a Hadoop cluster. And at the same time, we don't want the Hadoop cluster to be too aware of the fact that it's actually running within OpenStack. That separation of responsibility is what Savanna brings to the table: it works with both parties through APIs. So that brings us to the next step. Okay, it's a great idea, let's integrate Hadoop and OpenStack; what are some of the key benefits you would see? What I've done here is bucket the benefits in three separate ways, starting with self-provisioning. There are a couple of use cases under each bucket, right? On self-provisioning: like we discussed, all the different groups within the enterprise want a Hadoop cluster, and typically we don't want the operators of the clusters to be the bottleneck.
If you enable self-provisioning, you can have the different departments and users within the enterprise provision their own clusters. The second aspect is that a Hadoop cluster is not an easy thing to deploy; there are a lot of moving pieces, so you have to find a way to reduce operator error in the process. What this integration between Hadoop and OpenStack promises to provide is prebuilt templates, so you can just say: go provision a QA cluster. The second bucket is elasticity. You can have a pool of physical resources and scale up and down by adding and removing nodes from your Hadoop cluster as needed. You can solve problems like cluster time sharing. We hear from some customers that they have workloads running on OpenStack, but at night there's not much activity happening, so can they use that set of resources to run some Hadoop batch jobs? That's not really possible if you're doing this in a physical environment, so that's one use case we see as a key benefit of this integration. And the third bucket is multi-tenant Hadoop. There are two aspects here. The first is that by having the Hadoop cluster run on top of OpenStack, you can support multiple types of SLAs. Based on the groups and use cases, there are different resource requirements for the Hadoop cluster; by virtue of having it run on top of OpenStack, you can tune those resource allocations per cluster and provide specific SLAs to different groups. The second point there is simplification of maintenance. Again, if you look deeper into how a typical IT department is going to support Hadoop and go through the lifecycle of various versions of Hadoop, you've got multiple clusters running, and they might be on different versions. So how do you handle use cases around maintenance or rolling upgrades?
By having the Hadoop cluster run on top of your OpenStack environment, you get a simple platform that alleviates a lot of these maintenance and upgrade related issues. So we'll dig deeper into each of these buckets, and I'll lay out some of what we at Hortonworks are thinking in terms of how to approach the different use cases along these three buckets. First is self-provisioning. We have two phases here: phase one is template-based provisioning, and phase two is Hadoop as a service, which is job-flow-based provisioning. Let's start with phase one, the template-based provisioning approach. One option here, like I was saying, is that you want to provide a click-to-provision experience to your users. So what we will do is enable something called a cluster template. Typically we see an enterprise go through a process where they tweak and tune their cluster for a use case, and then they say: hey, you know what, this is great, can we provision yet another cluster with this exact same configuration? This will give them the ability to take that configuration, save it as a template, and the next time they want to provision, they can do a single click and provision the entire Hadoop cluster based on that template. That raises the point that you don't always want just single-click provisioning; you want some flexibility too. And that's the second part here, right? We want to support not only cluster-level templates but go one level deeper and support node-level templates. The node-level templates can come in two flavors. One is the typical Amazon-type flavor, where you have node.large, node.small, and so on. The second type is more Hadoop-specific, because a typical Hadoop cluster has different kinds of nodes, and each node has a different function.
So you can create a template based on the function of the node within the Hadoop cluster. You can modify those templates and then either save the modified template or provision a cluster based on it. So again, to summarize: single-click provisioning for simplicity, plus the flexibility to modify the templates and provision based on that. Phase two is where we move into more of an Amazon EMR type of experience. The first step is that you allow the user to upload data, either to an HDFS-based cluster or to a simple Swift-based store, right? Then, based on that data, you define job flows. The job flows will be analogous to the type of job flows you see on Amazon EMR today. And then you get the results on either Swift or HDFS, based on the user's preference. So let's walk through some of the detailed steps of how this process will work when it comes to provisioning. The basic assumption here is that the VM image is just the OS: there's nothing else on the VM, just a vanilla OS image, and that's stored in Glance. The user comes into Horizon, and it doesn't necessarily need to be a Horizon UI plugin; it can be a direct interaction between the user and the Savanna controller APIs, right? But the point is the user comes in and specifies the cluster blueprint. That's a new term, so let me introduce it: a blueprint is basically all the different configuration parameters of the cluster, the number of nodes you want, how many resources you want to allocate to the different nodes, the different Hadoop services you want running on the different cluster nodes, et cetera. All that information is captured in a blueprint and submitted to the Savanna controller. The Savanna controller then looks at this and says, okay, I need these 10 VMs to fulfill this request. So it goes to Nova, fetches the VM image from Glance, and fires up all those VMs.
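To give a feel for what such a blueprint might contain, here is a hypothetical example rendered as a Python dictionary. The field names and values are invented for illustration; they are not Savanna's actual API schema.

```python
# Hypothetical cluster blueprint: the field names below are invented for
# illustration and are not Savanna's actual API schema.
blueprint = {
    "cluster_name": "qa-cluster",
    "node_groups": [
        {"name": "master", "flavor": "m1.large", "count": 1,
         "services": ["namenode", "jobtracker"]},
        {"name": "worker", "flavor": "m1.medium", "count": 9,
         "services": ["datanode", "tasktracker"]},
    ],
}

def total_vms(bp):
    # The controller would sum the node counts to know how many VMs
    # to request from Nova before handing configuration off to Ambari.
    return sum(group["count"] for group in bp["node_groups"])

# 1 master + 9 workers matches the "I need these 10 VMs" step described above.
```

The key design point the sketch reflects is that the blueprint carries everything Hadoop-specific, so the OpenStack side only has to count VMs and flavors.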
Notice that at this point what you have is a whole bunch of VMs running; there's no Hadoop cluster yet, right? And this takes us to the point I made earlier: we don't want OpenStack to be responsible for knowing all the Hadoop-specific configuration that has to be done to start up the Hadoop cluster. So that's where the Savanna controller installs the Ambari server. Apache Ambari is a centralized management platform for Hadoop. The goal here is for Savanna to install the management platform for Hadoop and then delegate the responsibility of configuring the cluster itself to Apache Ambari. And that's what it does: it passes the entire cluster blueprint to Apache Ambari, and Ambari takes care of configuring all the different nodes in the Hadoop way and starting up the entire cluster. So this is one option, right? The second option is that some customers might say they're okay with incurring the overhead of managing different purpose-built VM snapshots. That's what I have here: the VM image in this case is pre-configured to be a certain type of Hadoop node. This assumes the customer installed a Hadoop cluster, took snapshots of the different VM images, and stored those images in Glance, and the provisioning is based on those existing VM images. In this case the workflow is by and large the same, but the key difference is that when Nova provisions the VMs and they start up, they are already Hadoop nodes. It's not a functioning Hadoop cluster yet, because the wires have not been tied together, but the VMs that are spun up are actual Hadoop node VMs. At that point, the Savanna controller tells the Ambari management server which slave nodes it needs to manage to turn this into a Hadoop cluster.
And it informs the different slave nodes which master nodes they should be talking to, right? Once that wiring is done, you have a complete Hadoop cluster running. So these are the two workflows we were envisioning from the provisioning standpoint. To give some color on what the user experience would be, we have a simple mock-up here. What it's trying to show is that there are two options. If the customer picks the custom option for the template, that means they want to specify all the configuration parameters themselves, and they go about specifying the capacity, the instances, the data persistence options, and the different Hadoop services. Otherwise, they can just pick a pre-built template and provision based on that. So once the cluster is provisioned, how do you actually go about managing and operating it? What we are thinking here is that since the Ambari management server is provisioned as part of the cluster, there is an option that when an OpenStack tenant logs into the Horizon dashboard, then based on the VMs they have access to and the Hadoop clusters running on top of those VMs, they will see a link to manage each and every one of those Hadoop clusters. That will be a link to the Ambari management server. When they click on that link, one option is that there is single sign-on between the OpenStack tenant and the Ambari management server itself, and the Ambari management console loads up within the same frame as the OpenStack dashboard. Again, that goes toward an integrated experience, so the OpenStack tenant doesn't have to worry about signing in to different consoles or keeping track of the links to manage the different Hadoop clusters.
And again, it ties back to the point that we don't want to reinvent the wheel on the OpenStack side, and we don't want to build all the logic of managing a Hadoop cluster into the OpenStack side. All of that will be delegated to Hadoop management platforms like Apache Ambari. So let's talk about elasticity next. What I've done here is break down the question of elasticity along two dimensions. On the x-axis, we have the cluster lifetime: there are short-lived clusters and there are long-lived clusters. On the y-axis, we have node elasticity and how you configure it. One option is manual, where the assumption is that the user comes in and manually adds and removes nodes. When I say manual, it still assumes there is an API and a UI for doing that; it's just not happening automatically. The second option is rule-based. In terms of which specific parameters you would be able to use in a rule, that is still something we are evaluating, but to give you some examples: when you configure a job flow, you can say, hey, the fifth step in my Hadoop job flow is going to need a lot of compute capacity, so I'm going to define a rule in my job flow that says when you reach step number five, fire up five new TaskTracker nodes so I have additional compute capacity within the cluster. That's one way. The second way is to make the rules specific to the cluster's resource utilization itself. Based on the CPU and memory utilization of a node within the Hadoop cluster, you can say that if my CPU utilization is constantly trending upwards beyond a certain threshold, not just spiking, but a constant trend, then that means I need to add more compute capacity to the Hadoop cluster, and I can specify a rule for that. So when are we planning to deliver on these different dimensions?
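A sustained-trend rule like the CPU one just described could look something like this toy Python decision function. The threshold, window size, and the idea of feeding it raw utilization samples are assumptions made for the sketch, not anything the project has specified.

```python
def should_add_nodes(cpu_samples, threshold=80.0, window=5):
    """Return True if CPU utilization shows a sustained climb above
    `threshold`, rather than a momentary spike.

    Toy rule: the last `window` samples must all exceed the threshold
    and be non-decreasing.
    """
    if len(cpu_samples) < window:
        return False
    recent = cpu_samples[-window:]
    above = all(s > threshold for s in recent)
    rising = all(b >= a for a, b in zip(recent, recent[1:]))
    return above and rising

# A single spike does not trigger scaling...
spike = should_add_nodes([40, 45, 95, 44, 43])
# ...but a sustained climb above the threshold does.
trend = should_add_nodes([70, 81, 84, 88, 91, 95])
```

The spike/trend distinction is the whole point: reacting to every spike would thrash the cluster with add/remove cycles, so the rule insists on a sustained pattern before asking for more TaskTracker nodes.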
First, for phase one, we are targeting manual node elasticity. In terms of where we see these being useful: long-lived clusters are useful for predictable workloads. If you have, let's say, a QA environment that has to be up and running for testing all your Hadoop scripts, then as your cluster scales and the amount of testing you're doing goes up, you eventually need to add more capacity to that cluster. The long-lived cluster with manual node elasticity is great for that. The short-lived cluster with manual elasticity is great for the typical dev and QA workloads, which are very short-lived. Let's say I'm an analyst or a developer within the enterprise, I wrote a couple of Hadoop scripts, and I want to see if they work. Before I move those scripts to the production environment, I can go to my OpenStack-based Hadoop cluster provisioning system, provision a cluster, and do my testing that way. The second aspect is rule-based elasticity. With rule-based elasticity and a long-lived cluster, that's the first quadrant there, it's useful to support highly variable workloads. These are basically the types of clusters you would provision based on job flows: you start a job flow, the cluster runs, and you have rules in the job flow itself to scale the cluster capacity up and down. You can use this for that, and that's phase two. The other point around elasticity is the question that often comes up: okay, where is the data? As I said, there are two options: one is to have the data on HDFS itself, and the second is to have the data in your Swift object store. So what we have here is how it will work for the use case where the data is actually residing in Swift.
What I'm saying here is that for phase one, we will deliver an HDFS-to-Swift bridge. The user experience it provides is that your existing Hadoop jobs, your MapReduce, Pig, and Hive jobs, don't necessarily need to change. They keep interacting as if the data were on HDFS, right? They still see the data in a hierarchical fashion, and they still do operations like create, read, and write, just as with a typical HDFS file system. What the bridge does is transparently map all those operations to the Swift world, and it also maps the hierarchical structure on the HDFS side onto the flat structure of the Swift object store. In the example you see here, we have a directory and file1 in a hierarchy, and it maps to dir/file1 on the Swift side. That works because Swift object names can contain forward slashes, so you can achieve the abstraction of a hierarchical structure by putting slashes in the object names. The integration will support multiple containers as well as multiple Swift stores, and there will be single sign-on through Keystone. All this configuration has to be done on the Hadoop node through which you want to access the Swift object store. There's a question in the back. Yes. So the question is: they don't use Keystone; can you do this integration without Keystone? I'm not 100% sure it will work without Keystone, but I believe the answer is yes. If you don't want a security setup, you should be able to basically bypass Keystone. Okay, so phase two is focused more on bug fixes and optimizations for this integration. All right, the last bucket is multi-tenancy. The way I've broken down multi-tenancy here is along three dimensions: there is access isolation, there is resource isolation, and there is version isolation.
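The flattening that the bridge performs can be sketched in a few lines of Python. This just illustrates the idea of mapping hierarchical paths onto slash-containing flat object names and emulating a directory listing with a prefix query; it is not the actual bridge code, and the helper names are made up.

```python
def to_swift_key(hdfs_path):
    # Map a hierarchical HDFS path like /dir/file1 to a flat Swift object
    # name "dir/file1". Swift allows slashes inside object names, so the
    # hierarchy survives purely as a naming convention.
    return hdfs_path.lstrip("/")

def list_directory(keys, hdfs_dir):
    # Emulate "ls <dir>" on a flat namespace: a directory listing becomes
    # a prefix query over the object names.
    prefix = hdfs_dir.strip("/") + "/"
    return [k for k in keys if k.startswith(prefix)]

keys = [to_swift_key(p) for p in ["/dir/file1", "/dir/file2", "/other/file3"]]
listing = list_directory(keys, "/dir")
# listing contains only the objects "under" /dir.
```

This is why existing jobs don't have to change: create, read, and write on a path translate one-to-one to operations on the corresponding flat object name.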
On the resource isolation side, you get it largely for free by virtue of running your Hadoop cluster within a VM boundary. But there's one point there, the second bullet under resource isolation: the ability to pin a Hadoop VM to a particular physical node. That becomes important as you start supporting multiple tenants on top of OpenStack for your Hadoop clusters, and if you want to provide certain SLAs to certain tenants for their clusters. If I'm right, today there's no way to directly instruct the Nova scheduler to fire up a VM on a particular physical node. So there needs to be an enhancement to the Nova scheduler to add this functionality, where you can instruct it to pin a VM to a certain physical node. That will be part of phase one. On the access isolation side, like I mentioned, there will be single sign-on between the Horizon dashboard and the Ambari management console for the tenant's Hadoop clusters. But the thing to note is that there will be a single instance of the Ambari management server per cluster. So if a tenant has, say, five different clusters, they will have five different links to five different Ambari management servers within the OpenStack deployment. That's something we will fix in phase two, where there will be a single instance of the Apache Ambari management server per tenant. Regardless of how many clusters the tenant has access to, the management of those clusters will go through a single Ambari instance, so you don't have to spin up a new Ambari management server for every cluster the tenant creates. So that's it. I guess at this point, for next steps, if you have more questions about this, please come find me; I'll be at the booth. Or you can email me. You can also download the Hortonworks Sandbox. Now, a quick word about the Sandbox.
The Sandbox is a pre-built environment, a pre-configured cluster with test data sets, which you can use to kick the tires on Hadoop and get a feel for it. There's the link to download the Hortonworks Data Platform, and follow us on Twitter. So that's all. I hope this was helpful. Thank you.