All right. Good evening, folks. Today, we'll be talking about Hadoop on top of OpenStack. So before we get into the topic, let me just give you a quick snapshot of who Hortonworks is. Hortonworks is the provider of the only 100% open source, enterprise-ready Apache Hadoop distribution. We are a spin-off from Yahoo. We employ the original architects, developers, and operators of Hadoop within our company. We work with the ecosystem to advance all the different projects involved with Apache Hadoop. We distribute the only 100% open source platform, and we work with enterprises to support that platform.

So let's get down to the topic. The question is, why run Hadoop on OpenStack? If you look at the IT world today, big data Hadoop and OpenStack are pretty much the two big celebrities. The most logical thing is, hey, let's get them married, right? But we don't want this to be like a regular celebrity marriage. We want it to last. And for that, we need to figure out the fundamentals. Does it really make sense? We'll spend a couple of minutes on that, starting with the question: what does OpenStack really bring to the table for Hadoop?

The way we see it, OpenStack alleviates a lot of the operational issues with Apache Hadoop. In the typical enterprise, if you look at the little elephant in the middle there, when they embark on the Hadoop journey, they see different groups within the enterprise wanting to run their own Hadoop clusters. You've got finance, marketing, compliance. All of these groups have different requirements in terms of data privacy, capacity, et cetera. On top of that, there are different data sources: you've got mobile data, you've got web data. And not only that, there are different use cases that need to be supported: there's batch, there's interactive. So what we see happening is that enterprises end up having to spin up multiple versions of the Hadoop cluster. And if that's not enough, as they go along in their journey of supporting Hadoop, they find that to support their production deployments, they've got to do QA, they've got to do performance testing. All of this means different versions of the cluster running. Having to do this on top of a physical environment is extremely cost prohibitive and takes time. So OpenStack really helps solve these problems for Hadoop.

The next question is, what does Hadoop provide to OpenStack? The way we see it, three really important things. First, it's a low-risk application within the enterprise. As these teams embark on their Hadoop journey, Hadoop itself doesn't have any legacy within the IT world. There are no legacy processes, and people don't understand Hadoop very well yet, so it sounds like a perfect POC application to kick the tires on OpenStack. Second, Hadoop provides horizontal scale. For a typical cloud application, it's expected that the application itself provides the linear scalability and does not rely on the infrastructure, and that's exactly what Hadoop brings to the table for OpenStack. And third, as we discussed, it gives you a shared platform. So all in all, it provides a great greenfield use case for OpenStack.

So with that, it makes sense to integrate Hadoop and OpenStack. Hortonworks has announced its support for Project Savanna, which was proposed by Mirantis some time back, to facilitate the integration between Apache Hadoop and OpenStack. Now, just like any successful wedding, there have got to be certain ground rules that you sign up for.
And the way we see Savanna working with this integration is that it will provide the glue layer between Hadoop and OpenStack. So Hadoop doesn't need to know about the internals of OpenStack, and vice versa. Savanna will track the Hadoop clusters, do the mapping between the tenants and the clusters themselves, and provide an API for Apache Ambari to manage the Hadoop clusters, so you don't have to reinvent the wheel on the OpenStack side.

So, to summarize some of the key benefits. The first is self-provisioned Hadoop. Operators don't need to be the bottleneck every time a cluster provisioning request comes in; the user can self-provision the cluster as they see fit. The second is reducing the errors in the provisioning process itself. Hadoop is not a simple system to get up and running. It has a lot of moving parts. If we package this behind a well-structured, template-based provisioning process, it really takes away all the different areas where there could be operator error in running a cluster.

The next is elastic Hadoop. You can create a pool of resources on top of your physical environment and scale the cluster up and down as needed. This can be incredibly helpful as we see enterprises do different levels of performance testing on Hadoop, or even for their production Hadoop clusters as their capacity requirements go up and down based upon the workload.

Then there's cluster time-sharing. We see a lot of customers saying, hey, we've got our OpenStack deployment running, but it doesn't do a whole lot at night. Can we run some Hadoop batch jobs during nighttime? With this integration, they'll be able to do that: share time between different tenants and different workloads based upon factors like time of day.

The next one is multi-tenancy. As I said, enterprises have to support different use cases, data patterns, and operational groups within the enterprise, and these have different SLA requirements on top of Hadoop. There are batch workloads and there are interactive workloads, and these workloads need different levels of resource assignment. By running the cluster on top of OpenStack, you can control the resource assignment at a VM level and support different SLA requirements.

And the last aspect is maintenance, which is often ignored by vendors but is incredibly important for the operators who run these clusters. As they see Hadoop expanding within the enterprise, there will be different versions of Hadoop clusters for the different internal groups they have to support. How do they take care of upgrading these clusters? How do they take care of running different versions of Hadoop across the same set of physical infrastructure? All these problems are solved by this integration.

So next, I will walk through some of the features along each of these buckets in some more detail. The first is self-provisioning, and there are two options here: one is template-based provisioning, and the second is job-flow-based provisioning. On the template-based provisioning, in the simplest terms, the aim is to capture the entire requirements of the Hadoop cluster in the form of a uniform template. Typically, an enterprise goes through a process of tweaking and tuning the cluster until they're happy with it, and the next time, they just want to cookie-cutter that. So you can capture all those requirements in the form of a template and provide a single-click, simple provisioning experience to your users.
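To make that concrete, here's a minimal sketch of what single-click, template-based provisioning could look like against a Savanna-style REST API. The endpoint path, port, template fields, and all the names here are illustrative assumptions, not the finalized Savanna interface.

```python
import requests

SAVANNA_URL = "http://savanna.example.com:8386/v1.0"  # hypothetical endpoint
TENANT_ID = "tenant-a"
HEADERS = {"X-Auth-Token": "keystone-auth-token",  # obtained from Keystone
           "Content-Type": "application/json"}

# A uniform cluster template capturing the tuned cluster requirements.
# The field names are illustrative; the real schema may differ.
cluster_template = {
    "name": "finance-batch-template",
    "plugin": "ambari",
    "hadoop_version": "1.2",
    "node_groups": [
        {"name": "master", "flavor": "m1.large", "count": 1,
         "processes": ["namenode", "jobtracker"]},
        {"name": "worker", "flavor": "m1.medium", "count": 8,
         "processes": ["datanode", "tasktracker"]},
    ],
}

# Register the tuned template once...
resp = requests.post(f"{SAVANNA_URL}/{TENANT_ID}/cluster-templates",
                     json=cluster_template, headers=HEADERS)
template_id = resp.json()["id"]  # assumed response shape

# ...then every later cluster is a single, cookie-cutter call against it.
requests.post(f"{SAVANNA_URL}/{TENANT_ID}/clusters",
              json={"name": "finance-batch-01",
                    "cluster_template_id": template_id},
              headers=HEADERS)
```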
But there will be people who want more control over this template provisioning process. For that, we would provide a second level of template that is a little more granular, and these will be node-based. The information you capture in them can be resource-based, so you can have a template based on the size of the node, or it can be function-based. There are different nodes with different functions within a Hadoop cluster, so you can have a template for a name node, a region server, a data node, as you see fit. Users can modify those templates, then provision clusters based on the modified versions of those templates, or save those templates.

And in the second phase, we look to provide more of an Amazon EMR-type experience, where customers can come in and upload the data either to a Swift object store or to HDFS, then pick the type of job they want to run, Pig, Hive, or the other job types, and get the results, again, either on Swift or HDFS. So two basic options, split across two different phases.

Let's get into a little bit of detail about how this provisioning process will work. The case I'm covering here assumes there's a simple VM image that just has the OS, nothing else in it, right? The user goes to Horizon and specifies the cluster requirements: all the details about nodes, services, how many resources it needs. That goes to the Savanna controller, which will, in turn, go to Nova, fetch the right images from Glance, and fire up those virtual machines. At this point, there are just virtual machines; there's no Hadoop cluster yet. And that's where Savanna delegates the responsibility to the Hadoop management platform, Apache Ambari. It will start the Apache Ambari management server and pass it the cluster blueprint, and let Ambari figure out all the service configuration and set up the Hadoop cluster itself. I'll show a small sketch of what that blueprint hand-off could look like in just a bit. This puts a clean delineation between OpenStack and Hadoop, so both of them can take care of their own set of responsibilities.

Once the cluster is provisioned, the next step is: how do I actually manage and monitor this cluster? The way we envision this happening is there will be a single sign-on between the OpenStack Horizon dashboard and the Apache Ambari management server. Based upon the VMs the tenant has access to, when they log in, they will see the Hadoop clusters running across those VMs. They can click on the Manage link for those clusters, and they'll be single signed on into the Apache Ambari management UI console. There's one option of loading the Apache Ambari UI within the Horizon console itself, or it could be a separate window.

All right, moving on to elasticity. The way I've broken down elasticity here is along two dimensions: on the x-axis, there is the cluster life span, and on the y-axis, there is node elasticity. For phase one, we will be focusing on manual node elasticity. What I mean by manual is there will be an API on top of the Savanna controller, and a UI plug-in for Horizon, where users can add and remove nodes on their clusters; I'll sketch that call in a moment too. And VM provisioning will be completely transparent to that process. It will be supported for long-lived as well as short-lived clusters. The short-lived use case is very, very relevant for dev and QA type workloads, where someone just needs to run some script, some Pig script, or a MapReduce job, and see if it's giving out the right results before they migrate that thing to production.
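Here's the blueprint hand-off sketch I promised. It assumes an Ambari-style blueprint REST endpoint; the host-group layout, component names, address, and credentials are simplified, hypothetical examples rather than a definitive schema.

```python
import requests

AMBARI_URL = "http://ambari.example.com:8080/api/v1"  # hypothetical address
AUTH = ("admin", "admin")                             # placeholder credentials
HEADERS = {"X-Requested-By": "savanna"}

# A simplified cluster blueprint: which Hadoop services run on which group
# of hosts. Savanna would derive this from the user's cluster template.
blueprint = {
    "Blueprints": {"blueprint_name": "batch-cluster",
                   "stack_name": "HDP", "stack_version": "1.3"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "JOBTRACKER"}]},
        {"name": "worker", "cardinality": "8",
         "components": [{"name": "DATANODE"}, {"name": "TASKTRACKER"}]},
    ],
}

# Hand the blueprint to the Ambari management server; Ambari then works out
# the service configuration and drives Hadoop setup on the plain VMs.
requests.post(f"{AMBARI_URL}/blueprints/batch-cluster",
              json=blueprint, auth=AUTH, headers=HEADERS)
```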
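And here's the kind of manual elasticity call I mean, again as a minimal sketch against a hypothetical Savanna scaling endpoint; the path and payload shape are assumptions.

```python
import requests

SAVANNA_URL = "http://savanna.example.com:8386/v1.0"  # hypothetical endpoint
TENANT_ID = "tenant-a"
CLUSTER_ID = "cluster-uuid"  # returned when the cluster was provisioned
HEADERS = {"X-Auth-Token": "keystone-auth-token",
           "Content-Type": "application/json"}

# Grow the "worker" node group from 8 to 12 nodes. The Savanna controller
# provisions the extra VMs transparently, and the new nodes join the cluster.
requests.put(f"{SAVANNA_URL}/{TENANT_ID}/clusters/{CLUSTER_ID}",
             json={"resize_node_groups": [{"name": "worker", "count": 12}]},
             headers=HEADERS)
```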
And the long-lived cluster is more suitable for folks who are actually running production clusters, or even staging environments, on top of OpenStack. For phase two, we will focus more on rule-based node elasticity. When I say rule-based, the rule could be defined on multiple different factors. One of them could be the state of the job flow itself. You could foresee a scenario where a certain step of the job flow needs a lot of compute power, and there needs to be a way for the user to specify that rule and automatically provision the additional VM instances that provide that extra compute power. This could be useful for job-flow-based clusters, where the provisioning is based on the job flow itself and the cluster goes away after the job flow completes.

Now, some detail on the Swift object store integration. Earlier, I mentioned that there is an option to upload the data to Swift or to HDFS. When the data resides on Swift, there needs to be a way for a MapReduce job to consume and process that data. That will be enabled by this HDFS-Swift connector bridge. For a typical MapReduce job or a Pig or Hive script, nothing changes: the script or job still functions as if it is working with HDFS, so it deals with the directory and file system hierarchy. All that happens is that the bridge transparently converts all the commands into a form that the Swift object store understands, and it does that through a single sign-on through Keystone. The reason this is possible is that the Swift object naming convention supports forward slashes. So if you look at this picture, all that is happening is that the directory structure is being collapsed into a flat naming structure: we have dir/file1, which provides the simulation of actually being a hierarchical directory structure. I'll show a tiny sketch of this mapping in a moment. The other point I want to mention here is that we support multiple different Swift stores, not just one, and multiple different containers across those Swift stores. The way to set up this integration is that there is a single shim you install on the Hadoop cluster nodes through which you expect to access the Swift object store; you do the configuration, providing the Keystone login information, and you'll be up and running. For phase two of this integration effort, we will be focusing primarily on bug fixes and optimizations.

All right, the last part is multi-tenancy. For phase one, I've split it into three dimensions: there is access, resource, and version isolation. Now, resource and version isolation come sort of for free, to some extent, as a function of running inside virtual machines. There is one aspect of resource isolation, though, that is important here, and that is the ability to pin a Hadoop node to a certain physical host. The reason that is important is that as we see enterprises supporting different internal customers, they might have to provide a certain set of SLAs to those internal customers. And for that, to avoid the typical noisy-neighbor problem in a cloud environment, there needs to be a way to say that, for a given tenant, these are the different hosts that the tenant will get, these are the hosts on top of which the Hadoop cluster should run, and nobody else should interfere on those hosts. The ability to pin a Hadoop node VM to a certain set of physical hosts will provide that functionality.
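Here is that path-mapping sketch. It's a toy illustration of the idea, not the actual bridge code: it just shows how a hierarchical HDFS-style path collapses into a flat Swift object name, and how a "directory listing" becomes a prefix query over flat names.

```python
# Toy illustration of the HDFS-to-Swift name collapsing, not the real bridge.
# A Swift container holds flat object names; forward slashes in those names
# let us simulate a directory hierarchy.
container = {
    "dir/file1": b"...",
    "dir/file2": b"...",
    "dir/subdir/file3": b"...",
}

def to_object_name(hdfs_path: str) -> str:
    """Collapse an HDFS-style path into a flat Swift object name."""
    return hdfs_path.strip("/")

def list_directory(prefix: str):
    """A 'directory listing' is just a prefix query over flat object names.

    Note this naive version also returns nested objects; a real connector
    would use Swift's delimiter-style queries to emulate a single level.
    """
    prefix = prefix.strip("/") + "/"
    return [name for name in container if name.startswith(prefix)]

print(to_object_name("/dir/file1"))  # 'dir/file1'
print(list_directory("/dir"))        # ['dir/file1', 'dir/file2', 'dir/subdir/file3']
```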
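And for the host pinning I just described, one way to express it is through Nova's placement controls. This is a minimal sketch assuming the python-novaclient library and the host-scoped availability zone form "zone:host"; the host, image, flavor, and credential names are all made up.

```python
from novaclient import client

# Credentials and endpoint are illustrative placeholders.
nova = client.Client("2", "user", "password", "tenant-a",
                     "http://keystone.example.com:5000/v2.0")

# Pin a Hadoop data node VM to a specific physical host using the
# host-scoped availability zone form "zone:host".
nova.servers.create(
    name="hadoop-datanode-01",
    image=nova.images.find(name="hadoop-base-os"),
    flavor=nova.flavors.find(name="m1.medium"),
    availability_zone="nova:compute-host-01",
)
```

A scheduler applying tenant-to-host filters could achieve the same isolation without users naming hosts explicitly; the point is simply that placement is controlled at the VM level.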
For phase two of this effort, we will focus on providing a single Ambari management server per tenant. In phase one, as you provision more Hadoop clusters, a separate instance of the Ambari management server is spun up for each cluster. For phase two, we will collapse that down to a per-tenant basis: there will be only one Ambari management server per tenant, so you don't have to deal with the overhead of managing multiple management servers. And then there is, of course, the enhancement needed to integrate with Keystone, so you can do authentication and authorization at the job-flow level.

So that's all; that brings me to the end of this talk. To get more information, we provide the Hortonworks Sandbox, which is a pre-built environment with test data. Follow us on Twitter, or email me; my email address is right there. Thank you very much.