Welcome to our session on the Savanna project, which is Hadoop-on-OpenStack integration. The project was started at the beginning of this year as a collaboration between several companies, and today we'll be presenting the results of the work that's been done.

Today's speakers: I'm Ilya Elterman, I run data platform products as part of Mirantis OpenStack. We also have here Matt Farrellee, principal engineer at Red Hat and core Savanna contributor, and we'll have Sergey Lukyanov from Mirantis, who is the Savanna project technical lead.

What we're going to talk about: I'll give an overview from different angles and try to provide as much context as possible for those who aren't aware of what we're working on; then Matt and Sergey will talk about the actual features and the status quo of Savanna, as well as the roadmap, and we'll show you a small live demo of how it works.

So, what is Savanna? Savanna is a project to bring big data and data processing to OpenStack. It sits at the operations level, providing provisioning and operation of Hadoop clusters, and on top of that, as the next layer, it provides capabilities for scheduling and operating Hadoop jobs. There was a misconception about whether this mixes two different sets of APIs and controls in one project. The answer is no: these are just different levels of operation and infrastructure support that Savanna provides. Essentially, it provides automation tools for people to do self-service provisioning of Hadoop clusters and to run jobs on top of them, on Hadoop on OpenStack.

To start with, I'd like to give some insight for the OpenStack crowd on what Hadoop is. Hadoop, pretty much like OpenStack, is not just a single product.
It's a platform, a big data platform, which is an open-source project governed by the Apache Software Foundation. It consists of a set of components, or individual projects, each with its own life cycle and roadmap, pretty much like OpenStack. There is a set of core services and a set of integrated services that work on top. Also, somewhat like OpenStack, there are different vendors working to build, out of these open-source components, the actual supported platforms that get shipped to real customers.

I took the example of the Hortonworks Data Platform, as Hortonworks is working with us on Savanna, to show what the Hadoop ecosystem looks like, and this picture doesn't even show the full list of services available in Hadoop. The core services include HDFS, the distributed file system where you actually store the data, and then there is the data processing engine on top, which was originally MapReduce and is now MapReduce 2.0, also known as YARN, a more general-purpose distributed data processing engine with different plugins for how you actually do the computation. There is also a layer of services that works on top of that, for instance Pig and Hive, which are, in a sense, DSLs for doing the actual data manipulation. Pig and Hive themselves do the actual data transformation and manipulation, while Savanna just provisions those services and helps users configure them to work correctly.

In terms of whether it makes sense to bring Hadoop to OpenStack or not, and who cares about Hadoop, this graph actually shows who cares about Hadoop. The red line is Hadoop's popularity in Google Trends. Hadoop started earlier, and it's growing; you can
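To make the MapReduce model mentioned above concrete, here is a toy sketch in plain Python of the map, shuffle, and reduce phases that the engine runs in a distributed fashion. This is a hypothetical illustration of the programming model only, not Savanna or Hadoop code.

```python
from collections import defaultdict
from itertools import chain

# Toy illustration of the MapReduce model: count words across input "blocks".
# In real Hadoop the blocks live in HDFS and the phases run on many nodes.

def map_phase(block):
    # Mapper: emit (key, value) pairs -- here, (word, 1) for each word.
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the values for each key -- here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

blocks = ["big data on OpenStack", "big clusters process big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(b) for b in blocks)))
print(counts["big"])   # "big" appears three times across the blocks
```

Pig and Hive scripts ultimately compile down to chains of jobs shaped like this; Savanna's role is to provision and configure the cluster that runs them.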
see the line of growth. The angle of growth is the same as for OpenStack, and in terms of interest from the overall worldwide community it is still quite significant, bigger than it is for OpenStack, so the integration will serve OpenStack well too.

Why would I do it? What are the use cases? The central use case is self-service provisioning of clusters: you can do it for proof-of-concept purposes or to provision an elastic production cluster. This elasticity concept also helps to better utilize the capacity of the overall OpenStack deployment: imagine you have a thousand-node OpenStack cluster with a bunch of compute that from time to time is just standing by; being able to provision workloads there using Savanna and Hadoop enables much better utilization. And the last use case, also very important, is once again to ease data processing for the end users: Savanna can take care of all the infrastructure and provisioning work, and the end users can run the actual workloads.

This picture shows how Savanna integrates with the OpenStack ecosystem. It can be controlled through Horizon, and we have a plugin for Horizon; it uses Keystone for authentication. We do integration of Hadoop and Swift, so that Hadoop can read data directly from Swift, and we plan to add support for Trove data sources. Currently Savanna talks to Nova, Glance, Cinder, and Neutron directly, but we plan to add Heat as an abstraction layer.

Here are a few notes on where Savanna stands. Savanna is an official incubated OpenStack project, accepted quite recently. We released the 0.3 version on the same date as the Havana release, and going forward
we'll just keep to whatever the OpenStack naming conventions and code drops are.

Here's the list of Hadoop distros: the first two of them, vanilla Apache Hadoop (the reference implementation) and the Hortonworks Data Platform, are implemented and officially released as part of the official release; the Intel distribution is submitted for review, and the Cloudera distribution is submitted in the form of blueprints. And Savanna is actually already included in two distros, which I'm really pleased happened this early: it's included in Red Hat RDO and in Mirantis OpenStack.

Another point to make: Savanna at this point is not just a toy, it's an actual tool that can be used in production. Here's a preview of what Savanna's performance on its own is. We'll be doing more studies on the actual performance of Hadoop on OpenStack and Savanna in a much more rigorous way, but this preview shows that a 200-node cluster can be provisioned in less than seven minutes using Savanna.

I'm really grateful to the community around the project, and to Hortonworks and Red Hat in particular, for doing lots of heavy lifting in making Savanna happen, and I'm also really grateful to see new names coming in and making actual contributions to the project. With this I'm passing the word to the next presenter to talk specifically about elastic data processing.

So I'm going to be talking about where we're taking EDP, that is, elastic data processing.
It's the feature that's released for the first time in this third release of Savanna. All of this happened in six months: building on the initial code drop that the Mirantis folks did with some basic cluster ops, to having the Hortonworks folks come in and work to create a plugin architecture so that different management layers can be plugged into Savanna, and then finally all of that infrastructure to actually enable the primary use case of delivering the Hadoop ecosystem to end users.

When we talk about end users here, we're not talking about people who have detailed knowledge of what it takes to manage a Hadoop system, spin up a Hadoop system, tune a Hadoop system, or even do basic configuration of a Hadoop system. An end user is just somebody who's got two things: some data, and some question that they want to answer. The data lives in some sort of repository, be it Swift, HDFS, or some other file system that you can attach to your cluster, and the question is typically encoded, embodied, in code.

A little bit about why we're doing this and why this is an important use case for us. This is not necessarily the most exciting quote; I've had a couple of conversations with people that have real numbers on this, and if you happen to be one of them and can give me a reference, I'd love to see it. But basically EDP is not a new idea; it's been out there for at least five years in Amazon, and at this point Amazon is doing, they say here, millions of launches of Hadoop clusters a year. If you do the math (1,000,000 launches divided by 8,760 hours in a year is about 114), that works out to a little more than a hundred launches an hour for every million, and this is growing.
This is just AWS; there's also Azure, and Google obviously has their own. As those different clouds add this value, they're adding both variety and depth to the offerings on top of their public clouds. Why do we care about that? We care because those offerings are rarely, if ever, open; they're pretty much always proprietary. They rarely, if ever, allow for feedback from their users to shape how they evolve, and they primarily enable lock-in to the cloud provider itself.

For instance, on the Google side, Google has a MapReduce implementation that they integrate with some of their data services, so that's something you can only get when you live inside the Google cloud. If you look at Azure, they've gone even deeper in some places, integrating with multiple data sources and with multiple presentations, like spreadsheet interfaces for your processing, which is actually very common; people, I'm sure, are really familiar with the spreadsheet kind of interface. And then AWS themselves have been doing this for over five years now; they've done all the things the Microsoft and Google folks have done, plus they've been adding public data sets for a long time. So when you go to one of these clouds, you not only have Hadoop provisioned for you in a simple fashion, you have deep integration into different offerings and you have data that's available to you.
All these things are locking you into those clouds, so we can do better in OpenStack. This is the motivation for Savanna and for doing EDP: with this community, this group of people, we can not only match that functionality, we can actually exceed it. In a number of places we have already exceeded it, in the way we allow for tuning and optimizing the clusters that are actually deployed, by taking information about the OpenStack deployment itself and bubbling it up into the Hadoop schedulers. So basically, pulling all that together: using this community, and you all, to do better and to eliminate the barriers to entry for people who have already deployed on AWS or Google or Azure, and to move those workloads into OpenStack.

To that end, and I think I mentioned this already: in six months we've pretty much matched functionality with EMR. We haven't necessarily added all the other side offerings like databases and data sets and whatnot; that's not necessarily our purview, but it's maybe the purview of some other people in this audience.

For this third release of Savanna, we've done integration into the UI, so you can go to Horizon, click through, create a template, and start up a cluster from that template. The demo that Sergey is going to show will show you how you can take a Pig job, submit it through Horizon, let it run, and get your results back. In addition to the UI integration, in good OpenStack fashion we've also produced the APIs, which others can start to integrate with. Specifically, on the data sources I mentioned: we have Swift for this release; on the roadmap
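To give a feel for what driving EDP programmatically looks like, here is a rough sketch of the JSON payloads a client might build for the REST API described above. The field names and the `swift://` URL scheme are assumptions based on this talk and my recollection of the early EDP API; treat them as illustrative rather than as an API reference.

```python
# Hypothetical sketch of EDP REST payloads: a Swift data source and a job
# execution request. Field names are illustrative assumptions, not the
# authoritative Savanna schema.

def data_source(name, container, path, user, password):
    # A Swift-backed data source: the only backend in this release.
    return {
        "name": name,
        "type": "swift",
        "url": "swift://%s/%s" % (container, path),
        "credentials": {"user": user, "password": password},
    }

def job_execution(job_id, cluster_id, input_id, output_id, args=None):
    # Ties a job (e.g. an uploaded Pig script) to a cluster and data sources.
    return {
        "job_id": job_id,
        "cluster_id": cluster_id,
        "input_id": input_id,
        "output_id": output_id,
        "job_configs": {"args": args or []},
    }

src = data_source("demo-input", "demo", "input", "demo-user", "secret")
run = job_execution("job-1", "cluster-1", "in-1", "out-1")
print(src["url"])   # swift://demo/input
```

A client would POST bodies like these to the EDP endpoints, which is exactly what the Horizon plugin does under the hood on the user's behalf.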
I think we're talking about HDFS and some other options to extend that. The job types, which are basically the way you describe your question right now, are MapReduce JARs, Pig, and Hive. One last point to mention here: when Ilya was pointing out what the Hadoop ecosystem is like, all these different projects that come together to form the platform, we're actually leveraging that platform itself. There's nothing proprietary or surprising in what we're doing from a Hadoop community perspective, so anybody who knows Hadoop can show up here, look at this, see that we're using Oozie, and have an understanding of the system itself. With that, I'll hand off to Sergey.

Okay, thank you, Matt. Let's take a look at the current state in terms of non-EDP features, I mean cluster ops. We have a REST API that provides the ability to create clusters in one click using pre-configured templates. We have two types of templates, for node groups and for clusters, where you can specify different Hadoop configurations and some OpenStack-related configurations like flavors, Cinder volumes, network configuration, etc.

The next point is really cool: we support manual cluster scaling. You can use the REST API or our dashboard to add or remove nodes in an existing cluster, and you can add new types of nodes. If you need more storage nodes or more computation nodes, you can just do it, and you can remove existing nodes; in the case of removing data nodes, decommissioning will be done automatically.

The next interesting thing is that Savanna provides anti-affinity and storage placement control. That means you can specify that, for example, all data nodes should be grouped in one anti-affinity group, which means all data nodes will be located on different physical hosts and HDFS will be reliable. As for placement, Savanna supports Cinder volumes as a backend for HDFS.
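The two template levels just described can be sketched as plain data structures. This is a hedged illustration of the shape of node group and cluster templates; the exact field names are assumptions for the sketch, not the real Savanna schema.

```python
# Hedged sketch of Savanna's two template levels: node group templates
# (what one kind of node looks like) and cluster templates (how node
# groups compose into a cluster). Field names are illustrative.

def node_group_template(name, flavor, processes, count, volumes_per_node=0):
    return {
        "name": name,
        "flavor_id": flavor,                   # OpenStack flavor for the VMs
        "node_processes": processes,           # Hadoop daemons to run
        "count": count,
        "volumes_per_node": volumes_per_node,  # optional Cinder volumes
    }

def cluster_template(name, plugin, version, node_groups):
    return {
        "name": name,
        "plugin_name": plugin,                 # e.g. "vanilla" or "hdp"
        "hadoop_version": version,
        "node_groups": node_groups,
    }

master = node_group_template("master", "m1.large",
                             ["namenode", "jobtracker"], 1)
workers = node_group_template("worker", "m1.medium",
                              ["datanode", "tasktracker"], 3,
                              volumes_per_node=2)
cluster = cluster_template("demo", "vanilla", "1.2.1", [master, workers])

# Manual scaling, as described, amounts to changing a node group's count
# on a running cluster and letting Savanna handle (de)commissioning.
workers["count"] += 2
```

The "one click" workflow is then: pick a cluster template, give the cluster a name, and let Savanna expand the templates into actual Nova instances.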
So HDFS will store its data on the Cinder volumes, and Cinder can be backed by network storage, for example, or by hardware on the hosts.

The next thing is about data locality, as I said: we support both rack-level and four-level awareness for HDFS and Swift, which means you can specify the topology of your OpenStack installation, including data centers, racks, and so on. Savanna will map this information to the format readable by Hadoop, and Hadoop will use it for running jobs near the data. It's supported for Swift too: we have a patch already merged into Swift, and a patch already merged into Hadoop to support it. As part of the Swift integration, in addition, we have the ability to use Swift as the data source and the output for jobs running on Hadoop.

In terms of integration with OpenStack: we have integration with Nova, Glance, Cinder, and Neutron to provision resources, and we have integration with the OpenStack dashboard. As Matt said, all the functionality that's present in our REST API is supported in the OpenStack dashboard plugin, using our Python client bindings. The next point: you can use both Neutron and Nova network with Savanna, so there are no networking limitations. And we're using the Keystone trusts API, which was released and enabled by default in the Havana release, to perform some asynchronous operations, like transient cluster support, where a cluster is removed after all its jobs have been executed.

Okay, that's my part. Let's take a look at the dashboard and the roadmap; I'll interleave these points so we don't just wait for the job execution. Let's start with the live demo. We prepared some funny stuff: we'll try to calculate the number of TODOs, the number of TODOs per person, and generate a list of the top persons by TODOs. Okay, let's go to the dashboard. Hmm, we have a timeout, we have some problems; it probably doesn't allow more than that.
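The data-locality support just described boils down to mapping an operator-supplied topology description into the network paths that Hadoop's rack-aware scheduler understands. Here is a minimal sketch of that mapping, assuming a simple "hostname /datacenter/rack" file format; the actual format Savanna consumes may differ.

```python
# Minimal sketch of topology-aware mapping: turn an operator's description
# of where hosts live into Hadoop-style network paths. The input format
# ("host /datacenter/rack") is an assumption for illustration.

TOPOLOGY = """\
compute-1 /dc1/rack1
compute-2 /dc1/rack2
compute-3 /dc2/rack1
"""

def parse_topology(text):
    # host -> "/datacenter/rack" path, as Hadoop's rack awareness expects.
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            host, path = line.split()
            mapping[host] = path
    return mapping

def resolve(mapping, host):
    # Hosts we know nothing about fall into Hadoop's default rack.
    return mapping.get(host, "/default-rack")

topo = parse_topology(TOPOLOGY)
print(resolve(topo, "compute-2"))   # /dc1/rack2
print(resolve(topo, "unknown"))     # /default-rack
```

With paths like these, the scheduler can prefer placing a task on the same host, then the same rack, then the same data center as the data it reads, whether that data lives in HDFS or in Swift.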
Let's try a different connection; okay, the connection looks okay. We have an existing cluster that was provisioned before the session; it was created from cluster templates. Let's take a look at the EDP functionality in this live demo.

To run a Hadoop job, we need to create data sources for both input and output; let's do it. For now we support only Swift. Let's create the input: you specify the path in Swift, and you need to add credentials to access the data. The same thing for the output. Okay, now we need to upload job binaries; in our case it will be the Pig script that calculates the TODOs. Let's name it todo.pig. Now we have the Pig script, and we can create the job using it; let's name it "todo".

Now we have the job and two data sources, and we need to upload our data to Swift. We need a container, and here are all the sources from the OpenStack organization, archived. Let's upload it; it's not very fast, there are about 600 megabytes in it. Okay, we've uploaded this data to Swift; here is the container with the input, and let's add the output object. Let's check the data sources I created; yes, everything's okay.

Now we can run our job on the existing cluster. As you saw previously, you can also run a job on a transient cluster: that means you specify a configuration for a cluster, it will be created for this job, and it will be automatically removed after the job completes. We'll use the existing cluster, so you need to specify the input data source, the output data source, and the cluster to run the job on. In addition, you can specify arguments and parameters that will be passed to the job. We're using Oozie to manage job workflows.

Now we can go to the clusters. Here is the existing cluster where we ran the job, and we can take a look at the web UIs. Here is the web UI of HDFS;
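The Pig script itself isn't shown in the slides, but the computation it performs, TODOs per person, can be sketched in plain Python. This sketch assumes the common OpenStack comment convention `# TODO(username): ...` as the way TODOs are attributed to people; both the convention's role in the demo and the code below are illustrative stand-ins, not the actual demo script.

```python
import re
from collections import Counter

# Illustrative Python stand-in for the demo's Pig job: count TODOs per
# person across source files, assuming the OpenStack comment style
# "# TODO(username): ...". Not the actual demo code.

TODO_RE = re.compile(r"TODO\((?P<who>[\w.-]+)\)")

def todos_per_person(source_files):
    counts = Counter()
    for text in source_files:
        counts.update(m.group("who") for m in TODO_RE.finditer(text))
    return counts

files = [
    "# TODO(alice): handle retries\nx = 1\n# TODO(bob): remove hack\n",
    "# TODO(alice): add tests\n",
]
top = todos_per_person(files).most_common(1)
print(top)   # [('alice', 2)]
```

In the demo, the same grouping and counting is expressed in Pig Latin and executed by MapReduce over the 600 MB of sources sitting in Swift.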
we see three live nodes here. Here is the Oozie web UI; it doesn't quite fit the screen. You can see the job that was executed before the session, when I checked the laptop. Here we can see, for example, the job logs and the job configuration, with our job workflow and job parameters; here is the job definition. Let's take a look at the current job, which is running now. Here is the web UI of MapReduce, where we can see that there are running tasks on the workers.

Okay, let's return to the slides and talk about the roadmap. As we said, we have been incubated for the Icehouse release, so our main goal is to graduate from incubation in Icehouse and become an integrated project. Accordingly, our main goal is to integrate with the OpenStack ecosystem better than we do now. As I said before, we have integration with the core projects to directly provision and orchestrate resources, and we are moving our orchestration code to Heat. We already have a proof of concept that uses Heat to provision resources; it works okay. We'll start moving our code to Heat, I think, next week, and our plan is to finalize it by the end of the Icehouse release, deprecate our current direct provisioning code early in the J release, and remove it by the end of the J release.

We already have some code in DevStack that provides the ability to install Savanna using DevStack, with not very complex configurations but with support for Cinder, Neutron, and others, and now we are pushing additional code to install Savanna with some prepared templates, to make it easier to use. In terms of testing, we are now moving our integration tests to Tempest, and I hope we'll push the first patches in a few weeks. We're also thinking about how to set up the DevStack gating process for Savanna, because it's not very easy to run Hadoop on the virtual machines, and nested virtual machines, that are used for DevStack gating now.
They're not very large. For metrics and measurements we'll use Ceilometer, and we have some thoughts about extracting Savanna-level metrics and pushing them to Ceilometer, stats like the number of clusters per vendor; that could help enable billing for paid support, for example. And of course we're looking at Ironic, to better support bare-metal cluster provisioning, and hybrid cluster provisioning too.

As for EDP, we are thinking about enhancements in two different directions: supporting external HDFS, and supporting RDBMSs as data sources, for example provisioned by Trove. Of course, we are also thinking about code hardening and releasing a polished API v2, and as we said, we need to do some additional, more complex performance testing.

Okay, that's all for the roadmap. Let's return to the UI and take a look at how the job is doing. The job has succeeded. We can take a look at the Oozie UI too, and here is a success status as well. So we can now go to the Object Store, to the demo container, and here we see the output. That's the Hadoop output layout, it's not very funny, and here we can see the results. Hmm, where am I? Oh, here it is: not many TODOs. And here is some guy who has a lot of work planned. Okay, I think that's all for the live demo and for our slides.

One more note: we'll have design summit sessions tomorrow afternoon. We will have four sessions; here is the link to our schedule. You're welcome to participate. We'll discuss networking problems, scalability, further integration, and our more detailed roadmap for Icehouse. This was only an overview of the roadmap, and there we will discuss detailed blueprints. Okay, thank you.

So we have saved a little bit of time for questions. Here's the first one: Hadoop is obviously functional for big data.
Let's say I have a 40-terabyte file of data. Could you go through the data flow, and how that would be co-located with the processing, so that I'm not moving 40 terabytes across the network too many times?

It depends on where you have this file; in the first place you need to have it somewhere, and one of the options is to have it in Swift. In that case, we expose the information about where each chunk of the data is located in Swift, and we pass this information to Hadoop, and Hadoop can use it to schedule the tasks closer to the data. If you run Swift and compute on the same nodes, then it can end up not copying the data at all. So for a mixed installation, with Swift and compute on the same nodes, Savanna will pass the awareness configuration to Hadoop, and the Swift file system will use this information to access data on the local hosts. Just to add a little bit to that: Swift being the interface, you can always plug in different file systems below it. For instance, we've done this with GlusterFS, where you can have your data in a POSIX file system and expose it through a Swift interface, which can then be consumed by the Hadoop cluster that Savanna started up, to get locality. Often it depends on how much data there is; sometimes you can't do co-location in Hadoop. Sorry, maybe let's give people a chance to ask some more questions.

That's a big topic; we don't have any official results so far. We can talk about that later, because we are looking at benchmarking virtualized Hadoop versus bare metal. There's a big data panel with different people
I mean, it's really depends on how you measure and what you measure and really depends on type of rock loads Yeah, general expectation general expectation for the performance degradation 10 to 20 percent But it's also kind of arguable topic. I mean in terms of what you do benchmarking Ideally you need to do it in terms of the cost rather than just raw compute or whatever So it's it's a kind of a big topic No, we are not integrating with you. We were just running Hadoop on top of their Virtual machine on open stack. Yeah, and we was like once ironic is kind of gun Will be ready for production. It's exactly the same procedure will be used to deploy actual Hadoop on bare metal managed by ironic We've worked with the the Horingworks folks to to produce a patch for Hadoop That will allow it to get location information from a swift Yes We do not have that today In terms of the center, there are two options one is to use a femoral drive Just to place HDFS I mean even right now and today with the someone you can place the HDFS on the local femoral drive I will just work as a normal cluster or you can place it on the center volume so Yeah, so you can you can do both and it comes back to the kind of like workload and performance question because it's when we've Been running we do see a speed up With cinder, but you often have to tune the back end of cinder pretty substantially to make sure you get your architecture correct It's complex topic. We should talk after Any more questions AWS AWS API there is no there is no API compatibility. All right You can you can certainly do that the the kind of like depth of the the different layers of Of the API that we've built up a lot will allow you to do that or to have more more persistence Clusters that you can then run jobs against so but that's not the only use case I mean we're we're trying to be not to be overly prescripted to the customers So open stack is an open platform. 
clusters that you can then run jobs against, but that's not the only use case. We're trying not to be overly prescriptive to the customers: OpenStack is an open platform, and you can shape it the way you need. We have a really extensive mechanism for customizing Hadoop clusters, providing all of the Hadoop cluster parameters, and you can influence how the topology of the cluster is laid out. So you can use it simply as a tool to provision and operate your Hadoop clusters, or you can go one level up the stack and use it just to manage the workloads.

I think we're running out of time, and for those who are most interested in and loyal to the concept of Savanna, we have a limited number of Savanna t-shirts left that we are giving out at this talk, one t-shirt per head, first come, first served. Please help yourselves.