Hello, folks. Today we will talk about the Sahara project and its Elastic Data Processing (EDP) feature. This is a technical deep dive into that functionality. Today's speakers are myself, Sergey Lukjanov, the technical lead of the project and a principal engineer at Mirantis; the second speaker is Alexander Ignatov, also from Mirantis, a senior engineer; and the last one is Trevor McKay from Red Hat, also a senior engineer. Our agenda today is a very brief overview of the Sahara project, an overview of the EDP architecture and the technical concepts of this feature, and a live demo. So let's start.

Our project is about giving operators and users functionality for creating and provisioning elastic Hadoop clusters. This elasticity lets users utilize the resources in their clouds much better than keeping thousands of bare-metal Hadoop nodes around the whole time: you can simply add and remove nodes from the cluster. The second direction of Sahara is providing operations for Hadoop jobs, Hadoop workloads, on top of the provisioned clusters.

So what is Hadoop? Hadoop is a big platform, not just a single project, much like OpenStack itself. It consists of two core projects, HDFS and YARN, which provide a distributed file system and a distributed data processing engine, and there are a lot of different services and projects built on top of YARN, for example for streaming processing, for batch processing, and many other tools. It is a very big and fast-growing platform in the big data world.

Why do we think it is a good idea to bring Hadoop to OpenStack? This Google Trends chart shows graphs for OpenStack, for the Amazon elastic cloud, and for Apache Hadoop. As you can see, Apache Hadoop started several years earlier than OpenStack, but both projects have the same angle of growth. That suggests they will keep growing well in the future, and that is why we think it is a good idea to bring the big data world, starting from the Hadoop project, to OpenStack and have a service integrated here.

Let's take a brief look at the architecture. We have a UI plug-in in Horizon that implements all of the functions provided by the Sahara service; you can do anything Sahara can do from the UI. The plug-in is now in the process of being merged into Horizon itself, so I hope we'll have Sahara integration as part of Horizon in the Juno release. We use Keystone for authentication and communication, like all other services, and we use Heat for provisioning the different resources for our Hadoop clusters: instances, networks, volumes, and so on.

A few notes about the current status of the Sahara project in OpenStack. We are an officially integrated project for the Juno release, and we support different Hadoop distros. The main one is vanilla Apache Hadoop, built by ourselves as a reference implementation of a plugin: it just installs vanilla Apache Hadoop and some tools on top of it. The second one is built by vendors: the Hortonworks Data Platform, a big management platform that installs a Hadoop cluster and tons of different tools on top of it. Additionally, we have the Intel Distribution of Hadoop plug-in; that distribution is being folded by Intel into the Cloudera distribution, and some parts of that merge are currently on review as blueprints. I'm also glad to say that Sahara is included in several OpenStack distros: it's included in RDO and in Mirantis OpenStack.
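As a rough illustration of the architecture described above, a client's first steps against Sahara might look like the sketch below: authenticate with Keystone, then find Sahara's endpoint in the service catalog. The URLs, credentials, and the "data-processing" service type name are assumptions for illustration, not details from the talk.

```python
# Minimal sketch: get a Keystone token, then look Sahara up in the catalog.
import requests

AUTH_URL = "http://controller:5000/v2.0"           # hypothetical Keystone endpoint
CREDS = {"auth": {"passwordCredentials": {"username": "demo",
                                          "password": "secret"},
                  "tenantName": "demo"}}

resp = requests.post(AUTH_URL + "/tokens", json=CREDS)
resp.raise_for_status()
access = resp.json()["access"]

token = access["token"]["id"]                       # used as X-Auth-Token later
tenant_id = access["token"]["tenant"]["id"]

# Sahara registers itself in the service catalog; the service type shown here
# is an assumption (it has been exposed as "data-processing" in some releases).
sahara_url = next(entry["endpoints"][0]["publicURL"]
                  for entry in access["serviceCatalog"]
                  if entry["type"] == "data-processing")
print("Sahara endpoint:", sahara_url, "tenant:", tenant_id)
```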
This slide is about our contributors. The first three logos are Mirantis, Hortonworks, and Red Hat, the companies that started the project about a year ago, in collaboration, at the Portland summit. I'm glad to see a lot of logos here, new names and companies contributing to our project. The next speaker is Alexander, and he will talk about EDP.

Okay, thank you. Hello, I'm Alexander from Mirantis, and today I'm going to talk a little bit about EDP, its architecture and technical concepts, from a high-level point of view. EDP is a key feature of Sahara which allows users to execute and manage Hadoop jobs on clusters provisioned by Sahara. Today, Sahara EDP supports several types of external data sources, such as Swift and the Hadoop Distributed File System. Sahara can also work with several kinds of MapReduce jobs: Java actions, which are Java programs compiled into MapReduce instructions; MapReduce itself; Pig scripts; and Hive queries. For those not familiar with Pig: Apache Pig is a platform for analysis of large data sets which contains a high-level language for expressing data analysis programs. And Hive is a tool well known in the big data world, allowing users to express SQL-like queries over non-relational data in NoSQL storages.

For executing jobs, Sahara uses Oozie. Oozie is a workflow scheduler system which is used to execute and manage Hadoop jobs. Today, Sahara supports EDP on both of the Hadoop distributions included in its plugins, the HDP plugin and the vanilla plugin. And there is one new, interesting feature: job execution on transient, or temporary, clusters.

So why is EDP needed? The first use case I'm going to talk about is simplified task execution. Sometimes a Hadoop user does not want to know which cluster is used for the calculations over his data, which configuration is used for it, how the cluster was provisioned, or which cloud resources it uses. Sahara covers this use case. The second use case is about utilizing cloud resources efficiently. Sometimes a user needs to run a resource-intensive MapReduce task only for a short period of time, and that task could be run nightly. So Sahara EDP can run the job at night, and during the day the cloud resources can be used for other purposes. The next use case is something like auto-scaling of clusters. It can happen that a MapReduce job running on a Hadoop cluster would work faster if we add new computation resources to it, like DataNodes for the HDFS layer and TaskTrackers for the MapReduce layer. EDP addresses this and can speed up the calculation that way.

So let's go deeper into the EDP concepts. Sahara EDP works with three types of objects: data sources, job binaries, and job executions. The first one is data sources. A data source object represents a user-defined URL for an input or output location in some external storage, like Swift. With that, each Hadoop cluster run by Sahara knows where to get the input data and where to store the results. The next one is job binaries. A job binary likewise provides a user-defined URL to a program written by the user which will run on the Hadoop cluster. These can be Pig and Hive scripts, executable jar files, and pluggable binaries and libraries. There are two options for storing job binaries. The first is Sahara's internal database; in that case, no extra credentials are needed to get the binaries. The other option is to store job binaries in Swift; in that case, the user has to provide some additional credentials to get the job binaries from the Swift containers.
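To make these objects concrete, a minimal sketch of registering a data source and a Swift-backed job binary against Sahara's REST API might look like the following. The endpoint paths, field names, and URLs are recalled from the v1.1 API and should be treated as assumptions.

```python
import requests

SAHARA = "http://controller:8386/v1.1/TENANT_ID"    # hypothetical Sahara endpoint
HEADERS = {"X-Auth-Token": "TOKEN"}                  # token obtained from Keystone

# A data source: a named URL pointing at input (or output) data in Swift,
# plus the extra credentials that Swift access requires.
input_source = requests.post(SAHARA + "/data-sources", headers=HEADERS, json={
    "name": "pet-store-input",
    "type": "swift",
    "url": "swift://demo-container.sahara/input",
    "credentials": {"user": "demo", "password": "secret"},
})

# A job binary: a named URL pointing at the user's program. Here it lives in
# Swift; the alternative is Sahara's internal database, which needs no credentials.
pig_script = requests.post(SAHARA + "/job-binaries", headers=HEADERS, json={
    "name": "analyze.pig",
    "url": "swift://demo-container.sahara/binaries/analyze.pig",
    "extra": {"user": "demo", "password": "secret"},
})

print(input_source.json())
print(pig_script.json())
```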
So let's look at how to execute a job, step by step. At the first step, the user already has input data uploaded to external storage and the job binaries uploaded to Swift or to the Sahara database, and he wants to run the job. But before that, in the second step, he needs a running cluster. It could be an already running cluster, a new cluster, or a transient cluster. A transient cluster is a cluster dedicated to running only a single Hadoop job; after that job finishes, Sahara will go and kill the cluster. There is one more restriction: a cluster that will run Hadoop jobs through EDP must have the Oozie service. Oozie is the tool which allows Sahara EDP to run the jobs; it communicates with the JobTracker to push MapReduce instructions to it, and so on.

At the next step, the user goes to Sahara EDP and provides all the additional, job-specific configuration, like the number of map tasks to be used in the map stage of the MapReduce calculation, the number of reduce tasks, how much Java heap will be used for each task, and so on. The user also has to provide URLs for the job binaries and data sources, and extra credentials if they are needed. Finally, the user pushes the launch-job button. When the job execution starts, Sahara copies all job binaries to a shared location in the HDFS of the provisioned cluster, letting all the Hadoop services get them later during the actual job execution. At the next step, Sahara EDP generates a workflow file, a kind of scenario file, for Oozie. It contains the job-specific configuration, the URLs to the job binaries and data sources, and the extra credentials if there are any. Then the data processing happens: the Hadoop services read the input data, do the calculations, and store the results in the provided location. At the same time, Sahara EDP monitors the state of the job execution. There are basically three stages: first it is in the pending state, then in the running state, and then in the succeeded or killed state. And the final step is when the user goes to the data storage and grabs the output results. That's all from my side. I'm passing the ball to you.
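A minimal sketch of that launch-and-monitor flow against Sahara's REST API could look roughly like the following. The paths, field names, configuration keys, and status values are assumptions based on the v1.1 API, not anything shown on stage.

```python
import time
import requests

SAHARA = "http://controller:8386/v1.1/TENANT_ID"    # hypothetical Sahara endpoint
HEADERS = {"X-Auth-Token": "TOKEN"}

JOB_ID = "JOB_UUID"          # a job object grouping the uploaded binaries
CLUSTER_ID = "CLUSTER_UUID"  # an existing (or transient) cluster that runs Oozie
INPUT_ID = "INPUT_UUID"      # registered input data source
OUTPUT_ID = "OUTPUT_UUID"    # registered output data source

# Submit the job execution with job-specific configuration (for example the
# number of map and reduce tasks); Sahara turns this into an Oozie workflow.
launch = requests.post(SAHARA + "/jobs/%s/execute" % JOB_ID, headers=HEADERS, json={
    "cluster_id": CLUSTER_ID,
    "input_id": INPUT_ID,
    "output_id": OUTPUT_ID,
    "job_configs": {
        "configs": {"mapred.map.tasks": "4", "mapred.reduce.tasks": "2"},
        "args": [],
    },
})
execution = launch.json().get("job_execution", {})

# Poll the execution; it typically moves PENDING -> RUNNING -> SUCCEEDED or KILLED.
while True:
    info = requests.get(SAHARA + "/job-executions/%s" % execution.get("id"),
                        headers=HEADERS).json()
    status = info.get("job_execution", {}).get("info", {}).get("status")
    print("job status:", status)
    if status in ("SUCCEEDED", "KILLED", "FAILED"):
        break
    time.sleep(30)
```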
So, we were hooked up from this Lenovo back to a stack at Red Hat. I was going to show you this great stuff live, but for some reason the weakest link, the projector, won't pick my machine up. So I'm going to have to do this, well, somewhat live, from a video. Can we go back to the slides first? I had a few introductory slides. Okay, I apologize for the delay... there we go. Okay, great, something's going right.

All right. The demo that we have on video is called Big Pet Store. Some of you from database design days might remember that the pet store was a big, popular example application where you track purchase transactions of pet supplies. In the Apache project now, there is this demo, Big Pet Store. It's meant to be essentially a test laboratory for all things Hadoop. You can play with it; it exercises the Hadoop ecosystem. It generates data, cleans data, and processes data. And the really cool thing about it is that it's being actively developed with integration testing, so it's a nice platform for testing that you can always count on being there. I've got the git address up there; you can clone it, build it, and run it. We're going to look at that in a minute.

So, what does the demo do? It generates a million records. It could be much bigger than that, of course; big data is much bigger than a million records, but just for the sake of the display it was set at a million. Then we run over it and clean up the CSV a little bit, rearrange it, and then run a query on it to extract cumulative counts of purchases of pet supplies per state. It shows you the basic building blocks of how you run jobs in EDP. Essentially, there are really only three things beyond the cluster itself. There are job binaries: just a Pig script, or a Hive script, or a jar file. Then there are jobs, which group multiple binaries together into a bundle, and you can parameterize those at launch time. And then there are data sources, which are basically objects that represent paths. For some things like Swift, there will be some kind of authentication mechanism in there; currently it's user and password credentials, and we're looking at doing something different in the future in the design sessions. But all the information that you need to access your data path is encapsulated there. Those are the basic building blocks.

This is just a view, I don't know how well you can see it, of the data that comes out of this thing, with a listing from Hadoop. You've got the block on the top, which shows you the sort of unclean data; then it gets processed in the second job, tab-separated, with some of the date fields normalized, and things like that. And then in the third job, it runs a query using Pig across all the data and pulls out a sample by state. And of course you can see that it ends at Arizona; it's actually potentially much longer than that. That's what the data ends up looking like.

And I think, yes, okay, that's our next slide. So, if we can switch back to the video, I'll try to keep track of the time here and make sure we have time for questions. I think we're fine, and I may need to pause this at certain points too. What's that? Yeah, it should be about three minutes. All right, here we go. So, this was previously recorded, thank goodness. The first thing we're showing here is the definition of a job binary. You give it a name, in this case bigpetstore.jar, you select your file off of disk with a browser, you upload it and hit create. That's all it takes; now you have a job binary, and it contains the classes that Big Pet Store needs to run. Next, we create a job. Also a very simple form here: we give it a name, we select the job type, in this case a Java action, which corresponds to Oozie Java actions, and we select libraries from a dropdown of things you've already uploaded. Now our job is defined. Now you can launch, either on a transient cluster or on an existing cluster; this is the existing case. For Java actions, of course, you need the main class: what are you going to run? And then you pass arguments to it. In this case, the first argument is a million records, and the second argument is a relative HDFS path, so it's going to generate this data under the Hadoop user on the cluster. And then we launch it. It'll sit there and be pending for a while.
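For reference, the launch request behind a Java action like this might look roughly as follows; the configuration key, the class name, and the paths are illustrative assumptions, not the actual Big Pet Store values.

```python
# Hypothetical request body for launching a Java action on an existing cluster.
java_action_launch = {
    "cluster_id": "CLUSTER_UUID",   # the existing cluster chosen in the form
    "job_configs": {
        # For Java actions, the entry point is passed as a configuration value;
        # the exact key name is recalled from memory and may differ.
        "configs": {"edp.java.main_class": "org.example.bigpetstore.GenerateData"},
        # Free-form arguments, as typed into the launch form: a million records,
        # written under a relative HDFS path (resolved under the hadoop user).
        "args": ["1000000", "bigpetstore/generated"],
    },
}
print(java_action_launch)
```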
When it turns to running, it means Oozie has picked it up and started to execute it. And, I'm not sure... okay, so here we go. This is the Oozie console. Do you want to pause just for a second? Because I think it's going to outpace me. Thank you. So, one of the nice things about Sahara is that we have links to a bunch of the web UIs for different tools exposed through Sahara, and the Oozie link is one of those. It's very helpful for checking the status of your job runs, or for debugging things when they go wrong and finding out why a job didn't work. If you go under the cluster details, you can pull up the URL for the Oozie console, and that's what this is here. In there, you can examine the workflow that was generated and look at the Oozie logs. We also have links to the various MapReduce UIs, and that's all available from the cluster page.

So, okay, go ahead. All right, we're just looking at a few logs here, and eventually this will go to succeeded and say that it's finished. There we go. So now we have our data. And I believe this now is going to... okay, it's doing a listing of the data on the node to prove that it's actually there. The next job will be the one that does a combined clean and analyze. One of the nice things about Java actions is that they're completely freeform: you can do whatever you want and pass anything as arguments. Some of the other job types are more constrained. For instance, a traditional MapReduce job: there aren't as many parameters you can give to it, and it demands that you pass an input and an output object. So here we're just reusing the same jar we had for the... well, actually, okay. Would you hit pause again? Sorry, I'm just trying to get my bearings here. This job is a Java action that actually runs Pig as a second step; it uses the Pig APIs to run it directly. So it has a Pig script and the Big Pet Store jars lumped into it. It will execute the Java portion first, which generates the clean data, and then it will use the Pig API to execute a query on that. And it will pull out the final result, which I believe we will see in a nice little web display, a color-coded map of the US, when we're done.

Again, Java actions are very flexible; you can run anything you want. Some of the example jobs are things like estimating the value of pi. Pretty much whatever you want to do. Here, we're adding libraries. Some of the job types will have a main script; some will have supporting libraries when you have multiple things bundled. And we go through the same process here: launch on the existing cluster, specify the class you're going to run, and the values. In this case, it's just an input and an output path, because it's going to run on the generated data and then clean it.

Let's see how we're doing for time. Okay, we're still good. Actually, while that's finishing up, there are a few other things I wanted to note. This example, the Big Pet Store stuff, will be in the Sahara repos really soon, so you can run it yourself. We have a repo called sahara-extra; that's where the examples go. We also have a CLI, which does all of the stuff you can do through the web UI. I've actually written an integration test against the CLI, and I use it to launch clusters all the time. It's great, it's very easy to use, it just takes some JSON inputs. So we should have more examples coming your way. Let's see if we get... there's our display. So that's our processed output with the result of the query.
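As a side note, the kind of per-state aggregation that this second job runs through the Pig API might look roughly like the following Pig script, shown here as a Python string that could then be uploaded as a job binary. The column layout is an assumed example, not the real Big Pet Store schema.

```python
# Hypothetical per-state purchase count query, kept as a string for upload.
PER_STATE_QUERY = r"""
-- load the cleaned, tab-separated transactions (assumed column layout)
transactions = LOAD '$INPUT' USING PigStorage('\t')
    AS (store:chararray, state:chararray, product:chararray, price:double);

-- cumulative count of purchases per state
by_state = GROUP transactions BY state;
counts   = FOREACH by_state GENERATE group AS state,
                                     COUNT(transactions) AS purchases;

STORE counts INTO '$OUTPUT' USING PigStorage('\t');
"""

with open("per_state_counts.pig", "w") as script:   # then register as a job binary
    script.write(PER_STATE_QUERY)
```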
And we should be pasting it here. There we go, and that's our color-coded map. So this is your executive summary when you have to talk about how many pet supplies you sold; you can impress people with a flashy graph, and it all came out of Hadoop on Sahara. That's it for the demo. I can show you the live one out on the picnic table if any of you want to see it. Can we switch back to the slides? I have a couple of slides left.

So what do we do next? I believe EDP was new in Havana, and we've developed it pretty rapidly, and we like it very much, but there are still a lot of things we'd like to do. Here are some potential areas for further development. Other job models besides Oozie: right now we're locked down, everything is expressed as an Oozie workflow, but it doesn't ultimately have to be that way; that's just where we started. We'd like to make the job system pluggable, so you could run other stuff, custom stuff, and then the current Oozie offering would become just one option. I know we've talked a little bit about a Spark plug-in, and other things like that, which would be an even bigger divergence from Hadoop. But ultimately we would like to be able to run all different kinds of workflows with different engines. In Oozie itself, there are some things we'd like to add. One is user-uploadable Oozie workflows. Oozie can do an awful lot, and some of those things are not always easy to express through a web UI; if we let you design your own workflows and upload them, then there are no constraints, you can do whatever you'd like. And going along with that, we'd like coordinated jobs: directed acyclic graphs, where the output from one job is the input to another, that kind of thing.

One thing we're very passionate about is usability. We would love your feedback if you have a chance to play with it. We want better error reporting. Nothing is worse than when a job fails and it just says failed or killed and doesn't tell you why. Somebody knows why, right? Some layer of software knows why; it knows enough to say killed. How about telling the user what happened? So we're looking into that to make it a better experience. And then the user experience in general: we want this to be something that people just love to use, that they think of as very easy. I have some big data processing job I want to run; I'm going to go use Sahara, because it makes my life simple.

So, we are on #openstack-sahara on Freenode, where we are all the time. Please stop by and ask questions; some friendly person will answer you. We have the openstack-dev mailing list, which is pretty active; just put [sahara] in the subject and you'll find us. And I think it doesn't want to advance... that might have been the end. There we go. So we have design sessions in room 304. Thursday and Friday we're going to be talking about some of these things: how to make this more usable, how to make it consumable, how to make it pluggable, all that kind of stuff. So if you have any interest at all in that, go ahead and drop by. And if we have time... oh great, we have eight minutes for questions. So if you'd like to ask questions, one of us will answer, whoever seems most appropriate.

So I have a few questions. The first one is: are you supporting non-MapReduce type workloads as well on Sahara? Is that on the roadmap?
I'm looking at streaming stuff like Cloudera Impala, or some of the newer processing paradigms that have been made available with YARN in Hadoop 2. So, right now it's only MapReduce, because we're still relatively young as a project. But with the move to Hadoop 2, and you noted YARN, YARN can execute anything, so there's a big field for us to expand there. At present it's only MapReduce, but we would like it to be more than that in the future. Got it. The second question is: through Sahara, can I provision a Hadoop infrastructure end to end, including the NameNode and the JobTracker? Or is it attached to existing infrastructure? Yes, you can launch a Hadoop cluster straight from Sahara. In fact, that was part of what I was going to show you live if I had time. Essentially, you define node group templates. Usually you have one or two master templates, depending on how you want to break down your master components, and then you'll have some worker templates. You define those and then group them into a cluster template, which gives the number of each type of template, and then you just launch. Once you have the cluster template, you can hit launch all day, assuming your data center can handle it, and it's very, very easy to create clusters that way. Got it, thank you. The final question is about performance. Have you benchmarked, or are you looking at benchmarking, Hadoop on OpenStack versus bare metal, non-virtualized? Yes, we have a separate session about it tomorrow that you can go to; it's named something like Hadoop Performance on OpenStack. Okay, got it. Great presentation, guys. Thank you. So it looks like we're out of time, and there are no more questions. You can always find us at the Mirantis booth and probably at the Design Summit; you can attend and take a look at a live demo, and this guy and this guy can show you. Okay, thank you all.
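To flesh out that provisioning answer, a rough sketch of the node group templates, cluster template, and cluster launch calls might look like the following. The plugin name, Hadoop version, process names, and field names are assumptions recalled from the v1.1 API, not exact values.

```python
import requests

SAHARA = "http://controller:8386/v1.1/TENANT_ID"    # hypothetical Sahara endpoint
HEADERS = {"X-Auth-Token": "TOKEN"}

# One master node group template and one worker template (values are examples).
master_template = {
    "name": "master",
    "plugin_name": "vanilla",
    "hadoop_version": "1.2.1",
    "node_processes": ["namenode", "jobtracker", "oozie"],
    "flavor_id": "2",
}
worker_template = {
    "name": "worker",
    "plugin_name": "vanilla",
    "hadoop_version": "1.2.1",
    "node_processes": ["datanode", "tasktracker"],
    "flavor_id": "2",
}
requests.post(SAHARA + "/node-group-templates", headers=HEADERS, json=master_template)
requests.post(SAHARA + "/node-group-templates", headers=HEADERS, json=worker_template)

# The templates are grouped into a cluster template with a count for each group.
cluster_template = {
    "name": "small-hadoop",
    "plugin_name": "vanilla",
    "hadoop_version": "1.2.1",
    "node_groups": [
        {"name": "master", "node_group_template_id": "MASTER_TEMPLATE_UUID", "count": 1},
        {"name": "worker", "node_group_template_id": "WORKER_TEMPLATE_UUID", "count": 3},
    ],
}
requests.post(SAHARA + "/cluster-templates", headers=HEADERS, json=cluster_template)

# From the cluster template, launching a cluster is then a single call.
requests.post(SAHARA + "/clusters", headers=HEADERS, json={
    "name": "pet-store-cluster",
    "plugin_name": "vanilla",
    "hadoop_version": "1.2.1",
    "cluster_template_id": "CLUSTER_TEMPLATE_UUID",
    "default_image_id": "IMAGE_UUID",
})
```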