Okay, hello folks, let's start. Today's talk is about Sahara. Sahara is a project that brings Hadoop to OpenStack. Today's presenters: myself, Sergey Lukyanov, I'm a principal software engineer at Mirantis and the project technical lead of the Data Processing program, which includes the Sahara project in OpenStack. The next speakers are Matt Farrellee from Red Hat and John from Hortonworks.

Our agenda is a brief overview of the Sahara project, highlights from the Icehouse release, an overview of the HDP plugin, and a brief overview of Juno plans.

Okay, so the mission of the Data Processing program is to provide simple, scalable, elastic provisioning of data processing tools and a number of operations on top of them. In Sahara, for the foreseeable future, we are supporting Hadoop on top of OpenStack, with two levels of operations for Hadoop: provisioning and operating Hadoop clusters, and then, on top of cluster provisioning, scheduling and operations for Hadoop jobs. In the future we will grow to support other data processing tools, not only Hadoop, but for now, as I said, we're working with Hadoop and the Hadoop ecosystem.

A bit about Hadoop: it's not just a single project.
It's something like OpenStack: it's based on two core services, named HDFS, the Hadoop Distributed File System, and YARN, and there are tons of different services and tools on top of those two core Hadoop projects. There are projects for stream processing, for DSLs, for running MapReduce jobs, and so on.

So let's take a look at the question of why we think it's a good idea to bring Hadoop to OpenStack. As you can see, there are three lines: the blue line is OpenStack growth, the red one is Apache Hadoop, and the yellow one is the Amazon Elastic Compute Cloud. The Amazon cloud contains a product named EMR, which is something like our concept of elastic data processing in Sahara. As you can see, Apache Hadoop was started much earlier than OpenStack, but both products have the same angle of growth, and that's why we think it's a very good idea to merge our communities and bring Hadoop to OpenStack.

Let's talk a bit about use cases. The first, central use case of Sahara is self-service provisioning of elastic clusters. This elasticity gives end users the ability not to keep thousand-node Hadoop clusters on hardware, but to just add and remove nodes on demand. This enables a new use case around the dev/staging/production life cycle, where you can create clusters with the same configuration and the same topology, but different sizes or different compute power, for development, QA, and production, using the same tool and the same cloud, or different regions of one cloud, and so on. One more central use case is running Hadoop workloads on top of the created clusters; you can run workloads on top of existing clusters or on transient clusters too.

So let's take a look at the architecture. We have a Horizon plugin that gives users the ability to control and operate the Sahara service using the OpenStack dashboard. It's currently implemented as a plugin for the dashboard, and we are going to merge it into Horizon during the Juno cycle. We're using
Keystone for authentication, and we're currently using Heat for resource provisioning; that's how we provision OpenStack resources for the cluster, like instances, volumes, networks, and so on. Additionally, a very big piece of Sahara functionality is an implemented filesystem for Hadoop that supports accessing data in Swift directly, without any proxies or anything like that. That means you can specify data sources in Swift and use them from the Hadoop cluster provisioned by Sahara.

Let's take some notes about the current status of Sahara in OpenStack. We are an officially integrated OpenStack project, and we just shipped our first aligned release, named Icehouse, in time with all of the OpenStack projects; Matt will give some highlights from it. We have a bunch of supported Hadoop distros. The first one is vanilla Apache Hadoop: it's just a simple installation of vanilla Hadoop with some tooling on top of it, made mostly as a reference. The second distro is the Hortonworks Data Platform. And then we have the Intel Distribution of Hadoop too, which is currently in the process of being merged in; there is a blueprint for it. Additionally, we have an implementation of Spark under review. It's not Hadoop, and I think it's really the first non-Hadoop data processing tool in Sahara. I'm also glad to say that Sahara is included in different OpenStack distros, in Red Hat OpenStack and in Mirantis OpenStack too. And I'm glad to say that we have a big community right now, and I'm really glad to see new names in this list. So, let's move to the next speaker.
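As an aside on the Swift integration just described: Hadoop jobs on a Sahara-provisioned cluster address Swift data with a dedicated URL scheme. Here is a minimal sketch of that path format; the container and object names are made-up examples, and the ".sahara" suffix reflects the provider name the Icehouse-era Swift filesystem configuration commonly uses.

```python
# Sketch of how a Hadoop job on a Sahara cluster refers to data stored in Swift.
# The ".sahara" suffix selects the Swift filesystem provider that Sahara
# configures on the cluster; container and object names here are hypothetical.

def swift_data_source_url(container, path):
    """Build a Swift URL in the form Hadoop's Swift filesystem expects."""
    return "swift://{0}.sahara/{1}".format(container, path.lstrip("/"))

print(swift_data_source_url("demo-container", "/input/logs.txt"))
# swift://demo-container.sahara/input/logs.txt
```

A URL like this can then be registered as a Sahara data source and consumed directly by MapReduce jobs, with no proxy in between.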
Hi, so I'm Matt. I'm going to talk to you about more of the specifics of what's actually happening in the Icehouse release of Sahara. If you're not aware, this is actually the fourth release of the project code, the first release where it was called Sahara, and the first release where it was incubating with OpenStack. Through the four releases we've seen nice growth of participation. In this release we actually had 142 bugs, which I think was pretty high, nice growth for us. We saw a tremendous number of blueprints and new features added to the code; I'll go through some of those and then hit some of the big ones around workflow management and the CLI, and then John will talk about HDP enhancements. Through all that we actually saw about 32 people who contributed and made their way into the Launchpad metrics; if you go look at the commit stream coming through, there were another dozen people or so involved in the project. And we did it all using basically the standard process. You can't quite see this here, but we released at the same time; this was the first release where we weren't doing all the releases ourselves, we just went through the same system that all the other OpenStack projects do. And all the things I mention are actually just part of what we've been working on: if you go look at the client library, or the diskimage-builder elements, or the puppet modules, or the dashboard, you'll see even more contributions.

So with that, some of the big features I want to point out for the Icehouse release. The first one is our Tempest integration. Everybody who's even remotely familiar with OpenStack knows that OpenStack is really good at maintaining code stability, API stability, and code quality. Sahara itself has had
a large amount of unit tests and integration tests through its four releases. In this release we've started pushing things up into Tempest, which will help us maintain even better API stability and integrate even more with the OpenStack infrastructure. That's good for anybody who wants to use it, plus anybody who wants to develop on it. The second thing that's going to help people who want to develop on it is that there's more integration in DevStack. This helps people who want to do development, but it also helps the project itself as we want to get into more of the OpenStack gates and maintain compatibility there.

The third thing I want to point out is that Icehouse was the first release where we pushed Hadoop 2 plugins into the system. If you're not familiar with Hadoop, there was an architectural shift between Hadoop 1 and Hadoop 2. In Hadoop 1 there's pretty much just HDFS and MapReduce, and then an ecosystem of projects that interact or integrate with one of those. In Hadoop 2 there's actually a platform provided by the YARN layer, which is a resource manager, and MapReduce is just one of the frameworks. This will work out pretty well for us as we go forward as a data processing program within OpenStack. Also, having this available matters because everybody who is using a Hadoop distribution these days, whether it's Hortonworks, Cloudera, MapR, or whatever, is going to be using Hadoop 2, and they need that functionality when they come to OpenStack.

Just a few other scattershot features coming in: we actually released support for HBase; we're now building images for running Spark in your cluster; and we've done a lot of work to move our original native provisioning engine to use Heat. I guess Sergey will talk about this, but the hope is that we'll just be
switching over to that for Juno. There was also a push to add internationalization; for folks who don't want to run all their services as root, we added some rootwrap support; and as part of the Heat work, and part of changing our architecture to be more agent-based, we actually have the start of an implementation of guest agents.

So those are the high-level features. Now I want to talk about two big areas. One of the big areas for Sahara is elastic data processing, or what we call EDP. Basically, this is our take on how you do workflow management, the goal being to make sure that people can come to an OpenStack system and not have to care about the details of the cluster they're going to run their work on; they just have their data and some sort of question they want to answer, and we move them through the system as smoothly as possible. I've got a quote up here from Amazon EMR about how they're actually running millions of these EMR clusters, these workflows, on top of EC2. So having this functionality in OpenStack actually brings a workload to OpenStack, which will allow for higher adoption, or at least we hope that is the case.

So, the Icehouse updates for EDP: we now have support through the Hortonworks HDP plugin; previously it was primarily through the generic vanilla reference implementation plugin. So this is great.
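As a rough illustration of what launching an EDP job involves at the API level, here is the kind of JSON payload posted to a job's execute endpoint. This is a sketch: the field names follow the Icehouse-era Sahara REST API as I understand it, and all IDs are placeholders.

```python
import json

# Sketch of the JSON body used to launch an EDP job execution against an
# existing cluster. Field names follow the Icehouse-era Sahara REST API;
# the IDs are placeholders, not real UUIDs.
job_execution = {
    "cluster_id": "<cluster-uuid>",
    "input_id": "<input-data-source-uuid>",    # e.g. a Swift or HDFS data source
    "output_id": "<output-data-source-uuid>",
    "job_configs": {
        "configs": {"mapred.reduce.tasks": "1"},  # Hadoop configuration overrides
        "args": [],                               # positional args (Java actions)
        "params": {},                             # named params (Pig/Hive scripts)
    },
}

print(json.dumps(job_execution, indent=2, sort_keys=True))
```

The same structure is what the relaunch feature mentioned below reuses: only the inputs and parameters change, not the job template itself.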
So you can set up your HDP cluster and then drive all of your workflow through it using Sahara itself. More features we added there: we started out originally with being able to ingest data from Swift, do the processing, then output it back to Swift. We've now added the ability to use HDFS itself for input and output, which means that if you have long-running clusters you can continually run workloads across them, or even if you have remote clusters you can ingest from a remote cluster, do processing, and then maybe output to another remote cluster. We've also added some new actions, new types of workload you can run: previously it was mostly Pig, Hive, and MapReduce; we added support for MapReduce streaming and, maybe most interesting for the total hackers in the group, Java actions, which basically let you run any program as long as it's Java. It doesn't have to be a specific workload type. And then, since in the long term we want EDP to be a really smooth and easy-to-use framework, we've added the ability to relaunch jobs with new inputs and new parameters, so that you don't have to go through the whole process of creating job templates and whatnot again.

Okay, so that's EDP. I want to talk to you briefly about the command-line interface, which we've added. The goal here is basically to have parity between what you can do with the Horizon dashboard and what you can do with the command line. A lot of people, at least myself, tend to gravitate more towards CLIs, which are easier to script; in the long term it will also help us do end-to-end testing of the system. With this you can do everything you could do with the dashboard, sometimes in not as smooth a fashion, but we're working on that. When I say not as smooth a fashion: the dashboard UI will actually walk you through wizard screens to build up different
structures, then create a cluster and launch it. With the CLI we do that minimally, but we allow import and export of JSON formats, so you can build something with the dashboard, export it to JSON, tweak it a little with sed here and there, and then import it again if you want. We cover image management: this is Sahara's way of associating images in the Glance registry with the plugins Sahara runs, to limit the number of times people try to run a cluster with an incompatible image. We have all the functionality for node group templates, cluster templates, and job templates; as I mentioned, this has more of the JSON input/output, but it's still very functional, and we're starting to actually use it in a number of our tests. And we handle data sources. If you want to come find us afterwards I can show you the typical workflows, or if we have time at the end I've got a nice little slide that describes the workflow, but data sources are things like "this Swift container" or "that HDFS directory", and job binaries are things like "this jar file that has this main method" and "this other jar file that has another method", that kind of stuff. All of that can be built up through the CLI as well. Everything I mentioned before is all set up for these last two commands: being able to actually create clusters and jobs and launch them. We're adding some more functionality, like being able to scale, coming up.

That's it; at this point I want to hand over to John to talk about HDP.

Thanks, Matt. Hi, my name is John, from Hortonworks. Hortonworks provides a plugin to Sahara, and the Sahara project is very important to Hortonworks. The HDP plugin has been part of Sahara from the beginning of the project. Right now the HDP plugin provides support for all of the Sahara functionality.
That covers Nova and Neutron, scaling, Swift integration, Cinder support, data locality, and EDP. One of the differentiators for the HDP plugin is that it utilizes Apache Ambari. Apache Ambari is the management framework used in the HDP stack. The way the HDP plugin works is that everything goes through Apache Ambari, through REST calls: it actually first provisions Apache Ambari for you within your VMs based on your topology layout, and then it uses Apache Ambari to actually lay out the cluster and start the cluster for you. More important for the end user is that after your cluster has been started up, you have Apache Ambari available to you for monitoring and managing your cluster. After you've provisioned it, you can go ahead and set up your security with it, you can do monitoring via Ganglia, you can add additional nodes and services if need be, etc. So you're not just stuck with a cluster that's hard to manage and monitor: you have a nice UI. There are a couple of screenshots on the right there that show some of the graphs available to you. That's something we've been working very hard on, providing good management and monitoring support, and that's available to you here through the plugin.

I'm also going to talk about why this is important to us. Sahara provides the ability to combine the HDP stack, all the functionality available in the HDP stack, with an easy-to-use, reusable deployment tool, Sahara, within an OpenStack framework. We also provide several pre-installed images: we have several HDP images you can use, for our 1.3 and 2.0.6 stacks, and they're available on S3.
You can also use a generic image; you'll get a slightly slower startup time, but that way you don't have to have the image pre-populated with HDP bits.

All right, so talking about the HDP stacks. We support several stacks: 1.3.2, 2.0.6, and 2.1. 1.3.2 is the older MapReduce-and-HDFS type of functionality; then we moved up to 2.0.6, where we provided YARN; and 2.1 is going to further introduce Storm and Falcon, and going forward we'll potentially have Solr and Cascading in the near future. The neat thing with this combination of Sahara and the HDP plugin is that as you download the latest version of Sahara, you're going to get the latest plugin, which is going to give you support for the latest HDP stack; you won't have to do any additional work. So as we add new services to the HDP stack, they'll become available to users of Sahara through the plugin.

All right, so talking a little bit about disk images. There's a project called diskimage-builder; it's a separate project, but it's used by all of the plugins for Sahara to build VM images. It's a consistent API across all the providers that gives you a mechanism to build VM images. The HDP plugin supports a plain image, plus 1.3.2 and 2.0.6, and soon 2.1, preloaded images. Those preloaded images have the HDP bits already installed on them in the form of a local repository, and that improves the startup time of your cluster. Like I said, we also support a plain image, where it goes over the wire and pulls those things down if you choose to start out with, say, a plain CentOS image. The scripts can also be customized: in addition to running the script and getting the default VM image for, say, our 2.0.6 stack, you can augment the script to include changes for security, maybe some custom packages you want to add to your VM images,
maybe some tweaks to kernel settings in your OS, etc.

All right, so I'll talk a little bit about blueprints. This is a little forward-looking. Blueprints are a mechanism we use within Apache Ambari to completely describe a cluster in a textual JSON format. This opens up some very interesting use cases between the physical and the virtual world. So what is a blueprint? First of all, it's a complete description of a running cluster. So say you have a running cluster in the physical world, a physical cluster you've deployed by hand, which has probably taken you quite a long time to get everything right. You'll be able to ask Ambari for a blueprint of this running cluster, and it will give you a JSON file that fully describes it. At that point you'll be able to take this blueprint, import it into the HDP plugin, and ask it to spin up the same cluster with the same configurations and the same topology. You'll then be able to override any topology-specific settings, or any configurations that you feel you need to override in your virtualized environment; you'll be given a chance to override those while provisioning your cluster. What this allows you to do is mimic something you took a lot of time provisioning in your physical world, and then just tell Sahara to deploy it, and now you have a replica, if you will, of this cluster in your virtualized world. More importantly, it's repeatable.
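To make the blueprint idea concrete, here is a minimal sketch of the shape of an Ambari blueprint: stack identification, host groups with their components and cardinalities, and configuration overrides. The group names, component lists, and values below are illustrative, not taken from a real export.

```python
import json

# Minimal sketch of an Ambari blueprint: a JSON document describing a cluster's
# stack version, its host groups, and the components placed on each group.
# All names and cardinalities here are made-up examples.
blueprint = {
    "Blueprints": {"blueprint_name": "hdp-demo",
                   "stack_name": "HDP",
                   "stack_version": "2.0"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "workers", "cardinality": "3",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
    "configurations": [
        {"hdfs-site": {"dfs.replication": "3"}},  # per-service config overrides
    ],
}

print(json.dumps(blueprint, indent=2))
```

Because the whole topology lives in one JSON document like this, exporting it from a physical cluster and feeding it to the HDP plugin is what makes the physical-to-virtual replication described above possible.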
So now, once you have this blueprint, you can go ahead and provision as many of these clusters as you want based on it. One of the things we're looking at internally is a use case where support might use something like this: we'll get a support ticket filed by one of our customers, and we'll ask them to send us their blueprint; we'll go ahead and provision a cluster using this blueprint in our private cluster here, and we'll then be able to look at the customer's problem based on their exact configurations and topology. This is something that I think is going to be released in Apache Ambari in about a month, and then soon followed in Sahara, in the HDP plugin. Right now we don't have the exact blueprint syntax supported in Sahara in the HDP plugin; what we do have is something very similar. So you could almost do this right now with the HDP plugin, where we have something very similar to a blueprint: it's JSON-based and was derived from the original blueprint specification. We make use of the feature within Sahara that allows plugin providers to have provider-specific templates. What we do with that is: you upload this provider-specific template, and then it goes ahead and creates your node group templates and your cluster templates. So if you don't wish to create these templates in the native manner through the Horizon tooling, you can instead upload the HDP-specific template, and we'll go ahead and create the cluster and node group templates for you. Like I said, it provides an interesting use case between the physical and the virtualized worlds, where you can repeatedly deploy a cluster based on a standard JSON template. And that's all I have right now for the HDP plugin.

Okay, and the last two slides, I think, are about Juno plans.
We have a brief general roadmap for Juno: integrate better with the OpenStack ecosystem. We have a plan to integrate with Ceilometer and Ironic; we are slated to publish some of our metrics and statistics about clusters, about Hadoop workloads, and so on, and Ironic will make us able to provision bare-metal and hybrid clusters, which is a very interesting use case too. Additionally, we are working on distributing Sahara, to make it run in multiple processes like other OpenStack projects, and we're working on guest agents to remove the direct access from the Sahara controllers to the Hadoop cluster nodes. And of course we'll have a bunch of EDP enhancements, like new data sources and new job types, and, as I said in the first slides, we're working on merging the dashboard into Horizon. In fact, our roadmap will be discussed at the design summit at the end of the week, Tuesday afternoon and Friday morning. Additionally, we'll have two more talks about Sahara: a technical deep dive into the EDP functionality tomorrow, and Hadoop performance on OpenStack on Wednesday. So that's all. Thank you, and now the questions.

Yeah. Probably you should take a mic.

Can you elaborate a bit on how this blueprints concept is related to Heat, and in general, what are your plans on leveraging Heat?

Sure, so they're actually independent. Heat is the way we go ahead and do our back-end provisioning of the VMs; blueprints are more about what the topology of your cluster is. When you think of provisioning your clusters in multiple phases, the first is provisioning the hardware, the VMs themselves, and making sure they're set up. That's all done in Heat.
That's a layer below blueprints. Blueprints come in on top of that: you have all your VMs set up, your networks all set up, and now you want to go ahead and lay down your Hadoop cluster; that's where blueprints come into play. So basically the second phase can be defined by the blueprint, which is basically an HDP-specific cluster template. So regardless of whether you're using a blueprint, and regardless of which particular plugin you're using, it's always going to use Heat, with Nova underneath, on the back end to provision your VMs. At that point each plugin then provisions slightly differently, and one of the ways you can provision through the HDP plugin is through a blueprint template. Does that answer your question?

Okay, I have a question on one of the slides; I think it was HDP 2.1, which is coming up. You mentioned Storm. Can you talk a little bit more about Storm and what your plans are with that?

So Storm, you know, is an app that we run within the YARN framework of HDP 2.0, in this case 2.1.
So it's It's a streaming API for doing real-time processing So that you know, that's going to be made available to users of the HTTP plug-in through the one of our 2.1 HTTP stack So I don't know the right agenda talked about the low-level details what actually storm is For this talk, I think it suffice to say that we make the 2.1 stack available to you and it's part of that stack The storm is is available as a service and based on you can deploy the service in its various components to the to your cluster via Sahara that will be available via a blueprint or Through the standard cluster templates and from the Sahara point of view will support storm topologies as a part of EDP I Got a little confused when you describe the plugins and the main projects Sahara Can you specify a bit more, you know, just briefly, you know, why are the new improvements that you plan to to add to Juno from Ice House and What are the reasons you are adding these new? new additions to the project Like what are your plans for Juno? Basically, what are you additions the new the new things you want to add there? Do you mean Edition of new plugins. Yeah in the project like our plugins or improvements to the project that you're you envision from ice house to Juno moving on It's improvements. It'll be ruined in all areas. We have plans on the new plugins. We have plans on in chains in our EDP stuff with a new job types data sources Probably updates for existing plugins with new versions for example for HDP plugins to dot-one version with new services Is the instantiation of an HTTP cluster intended to be within the guess of a KVM Hypervisor or is it Intending to leverage I want to do bare metal deployments of the HTTP instances themselves Everything is deployed to use in the novel instances. 
So it depends on how your Nova services are configured. For example, when we have the Ironic integration, or when it is integrated inside Nova, we'll be able to provision bare-metal and hybrid clusters. Currently we provision only virtual clusters.

Is there a capability to do them both at the same time, in the same environment: have a pod of Ironic deployed and a pod of virtualized?

What's important here is that we primarily use the Nova interface. So as soon as the Nova interface provides us with that capability, we can do it; we're not trying to circumvent it or do something extra.

Well, actually, one of the things I'm trying to figure out for our environment is how we can build customized definitions of the blueprints, so that we can direct that interface to say this segment needs to be on Ironic, this segment needs to be virtualized. Are we going to have that in the descriptor, in the JSON, or do we have to extend it to make it do that?

Right now the JSON is specific: it doesn't have any notion of virtualization. It's basically just topology, and then you have to map that onto another layer, which is what your VMs look like, what your flavors are. So there's going to have to be work done, but that's certainly a use case we're looking at.

Do you have any plans related to Apache Spark?

Yeah, I mentioned that we already have diskimage-builder elements to create images that have Spark. There's actually a Spark plugin that's under review right now; it was just updated again this morning. So there's good progress there.

So, I think you mentioned something that I wanted to ask about. Other than the HDP plugin, do you have another example of a plugin
that you can elaborate on a little bit more, to help me understand the layered difference between the Sahara API and the underlying plugins?

So you're asking about the relationship between the HDP plugin and Sahara proper, is that right?

And I'd like to know if there are any other plugins that you can explain a little bit, other than the HDP one.

Any other plugin. So, yeah, we have a generic plugin that we support, and we have the Intel one that we currently support, but the Intel distribution kind of went away, so that may or may not be around for a long period of time. So right now it's basically HDP, generic, and Intel; we're not sure of the future of the Intel plugin based on the recent news, with Intel kind of leaning away from that distribution.

As for your question about the relationship between the plugins and Sahara proper: there's a formal contract between the two. Sahara is responsible for talking through Heat and provisioning all your VMs; it understands cluster templates, and it basically creates your virtualized cluster, fully networked. At that point control is handed over to the plugin through an SPI, and it says: here's the topology, here are the nodes, here's your Hadoop cluster topology, here's the cluster topology mapped onto your VMs, etc. And then it's up to the plugin provider to map that cluster topology onto the appropriate VMs. Does that make sense? So there are very distinct steps between the plugin and Sahara proper: the plugin basically says, I have a bunch of nodes that I want to deploy Hadoop onto, and that's what it does.

Okay, a quick question regarding the Ambari monitoring piece: is it going to be replaced by Ceilometer, or is Ceilometer going to stay at a very high level? I'm
I Just trying to understand the you want to plan for Cilometer. So by the Cilometer. We have two directions of integration with Cilometer first of all we can post different statistics about number of created clusters The life lengths of clusters number of jobs per cluster Something like this some statistics about Sahara objects and the another one Direction is to read some statistics from inside the cluster and repost them to Cilometer, but only some very important statistics We will discuss it on the design summit So so so it's it's not a replacement, you know the Apache Mbari is very fine grain monitoring of your your cluster where Cilometers I've been more coarse-grained really You can really drill down into very low level details in Apache Bari touring inside the furniture link like Mbari will work As an addition to other tools, yes, they don't replace each other they Have one question. So how does a UI look for this entire thing? So say for example the HDP plug-in does it have its own UI or is this all integrated into horizon? So stuff like you know say job cues or I want to know how how the cluster is or Some of the graphs that you just showed so were they as part of horizon or did every plug-in have its own UI for now the whole as her API is available through the horizon and We're a posting links to Hadoop tools like Uzi like web the UI for HDFS Is the links in horizon so you can just click the link and open the Hadoop specific web Management tools for example in Uzi you can see how your job's running on cluster and etc So that's how a patch in Bari is exposed It's exposed as a link so you click on the cluster you just provisioned and there's a bunch of management links and patching Bari would be one specific to the HDP plug-in at that point You then it brings it to the the typical UI for a patching Bari now you're outside of horizon It's not a pain within horizon Thanks. Thank you very much