Check, one, two. Okay folks, ready? Hi, thanks for showing up. We're about to start. Today we'll do a session called "OpenStack and Hadoop 101". Sorry if anyone can't see our black slides here; I was expecting it to be like it usually is at OpenStack summits, with the large projectors. Anyway, I'll distribute the slides afterwards, so if you can't see something, there's no need to take pictures all the time.

We originally called the session "Getting Big Data Cloud Done Right". Eventually we renamed it a little bit: "Getting Big Data Clouds: Building Blocks in Place". I'll explain the difference later.

We will discuss a couple of things. We'll discuss the main use cases for a virtualized big data solution on top of an OpenStack cloud, and what the difference is compared to running on bare metal. What are the building blocks on the OpenStack layer? What are the building blocks on the Hadoop layer? So we'll do an overview of the Hadoop ecosystem, an overview of the OpenStack ecosystem, and how these two map together technically: how the Sahara project puts them all together and how it works in practice. And we'll discuss some of the best practices and some of the known problems and questions that exist in this area: some of the gaps not yet closed for OpenStack, some of the gaps not yet closed for Hadoop, and some of the testing and certification which we are still yet to do.

So, I'm Dmitry Novakovsky. I'm a product manager, ex-presales, at Mirantis, and together with me I have two more folks today.

My name is Sergey Lukjanov. I'm the PTL of the Sahara project in OpenStack, and I'm a principal engineer at Mirantis, leading the big data track.

Oh, it's working. Okay, great. I'm Trevor McKay from Red Hat. I'm one of the Sahara cores; I've been working on Sahara for two and a half, three years.

So, the right folks to ask questions about Sahara, that's for sure. Before we start, I'd like to do a quick quiz. Can you please raise your hand if you're running an OpenStack cloud, in PoC or a later phase, in your organization today? Okay. Now, can you please raise your hand if you're running big data workloads on top of OpenStack? Great, okay. So the rest of the folks are here to see how it can actually be done. That's good.

Now, a couple of words about use cases. Those of you already running this will probably know them. On the slide you see two dimensions. One dimension is the technical use cases: how can a big data solution be consumed out of an OpenStack cloud? The basic and simplest use case is obviously to provision a Hadoop cluster on top of a virtualized OpenStack environment and consume it directly, like you would consume any other bare metal deployment. Another use case is elastic data processing, or EDP: a separate API endpoint exposed by the OpenStack Sahara project, which we'll discuss in detail later. It allows you to create Hadoop clusters on demand on top of a running OpenStack cloud: not pre-provisioning a cluster and then running the data processing job on it, but actually provisioning it on demand for a job that you already have.
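To make EDP a bit less abstract, here is a minimal sketch of that flow against Sahara's REST API. It's an illustration only: the endpoint, token, IDs, and payload fields follow the v1.1 API conventions as we recall them, but exact names vary by release, so check the API reference before relying on any of this.

```python
# Hedged sketch of the EDP flow: register a data source, then launch a
# job against it. Endpoint URL, token, and all IDs are placeholders.
import requests

SAHARA = "http://controller:8386/v1.1/PROJECT_ID"  # 8386 is Sahara's usual port
HEADERS = {"X-Auth-Token": "TOKEN", "Content-Type": "application/json"}

# 1. Register the input location (here, an object in Swift).
src = requests.post(f"{SAHARA}/data-sources", headers=HEADERS, json={
    "name": "wordcount-input",
    "type": "swift",
    "url": "swift://demo-container.sahara/input.txt",
    "credentials": {"user": "demo", "password": "secret"},
}).json()["data_source"]

# 2. Execute a previously registered job. Sahara can provision a cluster
#    from a template on demand, run the job, and tear it down afterwards.
run = requests.post(f"{SAHARA}/jobs/JOB_ID/execute", headers=HEADERS, json={
    "cluster_id": "CLUSTER_ID",
    "input_id": src["id"],
    "output_id": "OUTPUT_DATA_SOURCE_ID",
    "job_configs": {"configs": {"mapred.reduce.tasks": "1"}},
}).json()

print(run["job_execution"]["id"])
```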
The third approach is to tap into workflow engines on top of a pre-provisioned Hadoop cluster on OpenStack. So that's the first dimension, the technical use cases. For the business-friendly use cases, for those of you who need to justify introducing a big data solution on top of OpenStack, there is a really nice white paper from Cloudera, and here is the link for it, called "Ten Hadoopable Problems". Those of you still struggling to get your management to approve the project might benefit from using exactly that one.

Now we'll transition: I'll hand over to Trevor to do an overview, a little bit of history of the Hadoop ecosystem and its current state today, and then we'll get into showing you how the Hadoop ecosystem maps to the OpenStack ecosystem.

All right, thank you. This is my timer here, to make sure I don't go over; we'll start the countdown, and if it goes off, I'm done. Let's see... there we go. Okay, I can work the buttons. All right. So what 101 class would be complete without a little history, right? We always have to have a history lesson. Hadoop as we know it today really began at Yahoo in 2005. They had a lot of searching to do, a lot of pages to rank, and they started looking for a solution. It turns out that Google had published some white papers in the early 2000s on a distributed file system and on the MapReduce algorithm, and Yahoo picked them up and started working with them. Things go on; 2008 is a big year for the project: Cloudera is formed in a separate effort, Yahoo moves Hadoop to Apache, and it is now a full Apache project. 2011 comes along, another big year: Hortonworks is spun off from Yahoo to focus 100% on Hadoop, and Apache releases Hadoop v1 in December. Then 2013 comes along and we have version 2, with some improvements that we'll talk about soon. So those are the beginnings; it grew up pretty quickly.

Today we have a choice of multiple Hadoop distros. There is the Apache project itself, the upstream, the foundation for everything: you can find all the pieces there, pull that stuff down, and run it. But we also have some commercially supported distributions, and these are the biggest ones out there. You have CDH, which I believe stands for Cloudera's Distribution including Apache Hadoop, from Cloudera; the Hortonworks Data Platform from Hortonworks; and you have the MapR distribution. All of these are supported by Sahara, by the way, so you can run vanilla Apache Hadoop or any of these other three.

Let's see. Okay, so what was the primary problem they were trying to solve? You probably know this, but we'll go over it anyway. The problem was: how do we search, or run analytics, on data sets of ever-increasing size? It's not megabytes and gigabytes anymore; it's terabytes and petabytes. So how do you do that? The answer is divide and conquer. This is not necessarily a new idea: go back to the 80s and 90s, if you remember all the development and excitement around quicksort.
Same idea: break stuff down into smaller and smaller pieces, solve the problem, and reassemble. That is exactly what MapReduce does, in a parallel fashion: you take your data set, chunk it up, throw it out to a pool of servers, get intermediate results, and then reassemble them. Anything you can decompose into smaller chunks and process independently is a good candidate for a MapReduce-type approach.

Okay, so what are the common components in Hadoop? There are basically three, and we'll talk about the fourth one in a minute, which is really just a refinement. Hadoop Common is sort of like Oslo in OpenStack: it's the kitchen sink for all the common libraries. Anything common across multiple modules goes in Hadoop Common and is not reproduced elsewhere. HDFS is the distributed file system; all the data and executables are pushed around a Hadoop cluster using HDFS. MapReduce is the data processing platform itself. Originally it was Java only, I think, but there are a bunch of different language bindings now, so you can write MR programs in C++ and Python and probably other things. YARN is the resource management and scheduling component. If you do some research you'll hear about Hadoop version 1 and Hadoop version 2, and the distinction between those is really this: initially, MapReduce handled the data processing platform as well as all the resource management; it was one big monolithic piece. In v2, Apache broke the resource management and scheduling out of the MapReduce piece and put it in YARN. The advantage is that YARN is now general purpose: you can actually use it for other stuff. It can run any framework you want, as long as you write an ApplicationMaster class.

Let me see what my timer says. Oh great, plenty of time. Okay, so what does HDFS look like? This is a pretty simple diagram, a high-level view of HDFS. Basically, you have a NameNode. The NameNode tracks where your data is. Your data exists on DataNodes, and it's replicated across multiple DataNodes for high availability. When you have a client that wants to interact with HDFS data, it asks the NameNode, "where's my data?"; the NameNode responds and tells the client, "your data is on these DataNodes"; and then the client goes out and reads and writes the DataNodes directly. So that's HDFS in a nutshell. And HDFS really underlies a lot of things; there are different processing frameworks, and we'll talk about a couple of those later on, but this HDFS component of Hadoop undergirds a lot of them. I'll also mention that HDFS is as much an API as it is an implementation: there are things like GlusterFS from Red Hat, and MapR has their own implementation of HDFS (I forget the term, but it's the MapR file system). So there are a lot of possibilities there.
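That ask-the-NameNode-then-talk-to-a-DataNode flow is visible even in the WebHDFS REST interface that ships with stock HDFS. Here is a small sketch; the host names and file path are made up, but the two-step redirect is the real protocol.

```python
# Reading a file over WebHDFS: the NameNode serves metadata only and
# redirects the client to a DataNode for the actual bytes.
# Host names and the file path here are hypothetical.
import requests

NAMENODE = "http://namenode.example.com:50070"  # default NameNode HTTP port in Hadoop 2.x

# Step 1: ask the NameNode to open the file. It answers with an HTTP 307
# redirect pointing at a DataNode that holds the first block.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/demo/input.txt",
    params={"op": "OPEN"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]
print("NameNode redirected us to:", datanode_url)

# Step 2: fetch the data directly from that DataNode.
data = requests.get(datanode_url).content
print(data[:80])
```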
We'll also talk about the YARN architecture here a little bit. There is a single ResourceManager; this is very analogous to what we just saw with HDFS. The ResourceManager talks to NodeManagers on each of the execution nodes, and then for each framework that you run you'll see the little purple circle there, or oval: that's the ApplicationMaster for a framework. What the ApplicationMaster does is go back to the ResourceManager and say, "hey, I need this amount of resources to run this framework"; the ResourceManager allocates it, and the NodeManager is kind of the middleman managing all of that. MapReduce itself in Hadoop has its own ApplicationMaster included, so obviously you don't have to write that yourself; it's deployed for you. But that's really what's happening under the hood.

All right, so this gives you enough to write the typical hello-world Hadoop app, which is probably word count: take a book, read it in, and then find out how many times each word in the book appears. For everybody I know, that's their first Hadoop app. Very impressive.
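Since MR programs can be written in Python via Hadoop Streaming, here is what that word-count app can look like: two small scripts that read stdin and write stdout. The file names are ours, and the invocation given afterwards is illustrative; the path to the streaming jar differs per distribution.

```python
#!/usr/bin/env python
# mapper.py -- emit "<word>\t1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per word. Hadoop sorts mapper output by
# key, so all lines for a given word arrive together.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

You would hand these to the streaming jar with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input books/ -output counts/`.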
But you need more, right? That's not enough. We've got the core, but we need other stuff. In production environments you want things like cluster monitoring. If you have complex workloads that are expressible as a graph, you might have synchronization points: say, tasks one, two, and three have to finish before task four can go on. You've got migration out of legacy data systems, or back and forth across frameworks. So those are some of the kinds of things you need. And writing MapReduce jobs is not always easy: you want high-level scripting, more expressive things, so you can get your job done faster. You want a SQL-like front end so that you can query big data databases as if they were SQL. All of these desires, the things people found as they started using Hadoop, led to the development of a pretty rich ecosystem.

We'll talk about just a few of those things here; there are a lot of them, so this is just a handful. Pig lets you write scripts; there's a language called Pig Latin, and it's sort of freeform scripting over Hadoop data sets. It compiles your script into as many MR jobs as it needs and runs them for you, which is very nice. Hive is a SQL-like query language. You've got Oozie, which is a job scheduling and management system; Sahara actually uses this under the hood, and it supports pretty complex DAG-type jobs (a DAG is a directed acyclic graph: basically multiple jobs with dependencies). ZooKeeper, which I think a lot of people here know from other contexts, you can use for high availability, synchronization, and distributed config. HBase gives you, I believe (I always get this wrong), millions of columns by billions of rows, or maybe it's the other way around, I can't remember, but it's really big; that's the point. HBase lets you look at really big databases. HCatalog gives you interoperability support: when you want to use your data in Pig and Hive and MapReduce, HCatalog gives you metadata that enables that. And you've got Sqoop for moving stuff in and out of Hadoop and legacy systems. This is just a handful; there are many, many more, and people are making new projects around this stuff all the time. There's also stuff outside of Apache, like Hue, Nagios, Ganglia, and other things being added. So there is a very rich ecosystem here.

So, wow, how do I choose? How do I know what's going to overlap? How do I know that it's going to work together? What version should I use? Have I forgotten anything? Well, this is where the distros help you out; they answer those questions. Here's a typical HDP deployment. You can see you've got multiple methodologies under the data access layer: there's Pig, there's MapReduce, it's got Storm. Of course HDFS and YARN are still there, and you've got things like security and governance around the outside; you can see Oozie on there. This looks like HDP version 2, and it's a subset that Hortonworks put together to give you all the capabilities you want. Likewise, there's something here from CDH, very similar in that it has HDFS and YARN at the core, but they choose services a little bit differently to give you the same capabilities; for instance, they have Impala if they're doing SQL. So distros are your friend: they help make the choices for you, and obviously, as new services come online, these companies release new stuff. And the beautiful thing is that Sahara gives you access to all of them simultaneously, so you don't have to choose; you can have everything.

Great. I hinted at Sahara a little bit, but then the next obvious question is: how does this all map to OpenStack? How am I going to get this on my cloud? Hey, that's my alarm. Perfect. So I'll hand it over to Sergey at this point.

Thanks very much. Okay, let's take a look at it from the OpenStack point of view. What's the place of big data in OpenStack? Obviously, it's a workload on top of OpenStack. Actually, there is another place for big data: the cloud infrastructure itself, for example for log processing and management, etc. But let's look at the workload part.

In OpenStack, big data is implemented by the Sahara project. That's the codename of the officially integrated Data Processing program in OpenStack. As was already said, there are two main goals of the Sahara project: the first is to provision and operate data processing clusters, and the second is to schedule and operate data processing jobs. From the Sahara point of view, we're working not only with Hadoop but with data processing frameworks in general, to support new ones on demand with a pluggable approach. For now, in Sahara, data processing means support for Hadoop, Spark, and Storm.

So how does Sahara actually work? It's fully integrated inside OpenStack. From the user's point of view, we have a separate dashboard inside Horizon that exposes all of Sahara's functionality through the UI, and there is a Python Sahara client. Sahara is integrated with, one second, it uses Heat for the underlying resource provisioning: virtual machine instances, IP addresses, volumes, etc. Starting with Mitaka, Heat will be the only option for provisioning; there will be no way to use the service directly without Heat. For external data storage, Sahara uses Swift, the default object store in OpenStack, and Sahara integrates with Swift to enable it as a file system for Hadoop, so it can be transparently used as a file system from Hadoop jobs.
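To give a feel for what that Swift integration means at the Hadoop level: jobs address Swift objects through `swift://` URLs, and a Swift-aware FileSystem implementation (the hadoop-openstack module) is configured with Keystone credentials. A hedged sketch follows; the property names follow the hadoop-openstack conventions as we recall them, but verify against your distribution's documentation.

```python
# Hadoop configuration properties (normally set in core-site.xml or
# passed per job) that let jobs read swift://container.sahara/... URLs.
# All values are placeholders; "sahara" in the keys is the service name
# that appears in the URL authority ("container.sahara").
swift_fs_conf = {
    "fs.swift.impl":
        "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
    "fs.swift.service.sahara.auth.url": "http://controller:5000/v2.0/tokens",
    "fs.swift.service.sahara.tenant": "demo",
    "fs.swift.service.sahara.username": "demo",
    "fs.swift.service.sahara.password": "secret",
}

# A job would then address its input as, for example:
input_url = "swift://demo-container.sahara/input.txt"
```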
Now, some words about the current state and ecosystem of Sahara in OpenStack. As I already said, it has been an official OpenStack project for the last two releases or so. It was initially started by Mirantis, then Red Hat and Hortonworks joined, and we now have many more contributors. For example, we have vendors contributing their own drivers: MapR fully supports their plugin inside Sahara.

Okay, so, the provisioning part. The main and most interesting piece of the provisioning stuff is that Sahara gives end users the ability to manage the whole configuration layout for Hadoop clusters. In Sahara there is a template mechanism that lets users specify the whole layout and the configurations, manage them, and create clusters from the templates many times over; so you can have one template and reuse it, for example, for both bare metal and virtualized deployments.

There is a concept of node group templates in Sahara. A node group template consists of the specification of the processes that will be executed on a node; it's in fact a role for the cluster node, for example master or worker, and it contains the list of Hadoop processes that should be executed on that node. In addition, it contains an OpenStack flavor, to specify the number of CPUs, the RAM, and so on that will be used for creating the virtual machine, and there are some storage- and networking-specific configurations, such as Cinder volumes, or the network that should be used to allocate IP addresses for the cluster.

The other template type is the cluster template. It specifies a list of node group templates with the number of instances for each of them. In addition, it specifies some cluster-wide configuration, like anti-affinity for processes inside the cluster, or cluster-level Hadoop configuration like the replication factor. Sahara then checks the configuration before actually starting the cluster: it validates the number of services, their configurations, and so on. And based on an already-created cluster, you will be able to scale the cluster up and down using the Sahara interface, just by changing the number of instances in the node groups.

Here is an example of a template for an HDP cluster. It consists of three node groups. The master node group contains the NameNode, the ResourceManager, the history server, and a bunch of additional services, for example Ambari, which is used for the actual cluster provisioning; it could also contain some other master services, for example the HBase master. The second node group is the worker node, which contains the main worker processes for Hadoop: the DataNode for HDFS and the NodeManager for YARN. And we could have another node group for a secondary NameNode and, for example, the Oozie server and client. These three types of node groups build the cluster topology, and we can reuse them to create clusters of exactly the same configuration, to reproduce the same configuration.
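As a rough illustration of what such templates look like on the wire, here is a sketch of node group and cluster template payloads. Field names follow Sahara's v1.1 REST API as we recall it; exact names, required fields, and version strings vary by release and plugin, so treat everything here as a placeholder.

```python
# Hypothetical node group and cluster template payloads for a small
# HDP-style cluster. Flavor IDs, version strings, and counts are made up.
master_ng = {
    "name": "master",
    "plugin_name": "hdp",
    "hadoop_version": "2.0.6",
    "flavor_id": "FLAVOR_ID",
    "node_processes": ["namenode", "resourcemanager",
                       "historyserver", "ambari_server"],
}

worker_ng = {
    "name": "worker",
    "plugin_name": "hdp",
    "hadoop_version": "2.0.6",
    "flavor_id": "FLAVOR_ID",
    "node_processes": ["datanode", "nodemanager"],
    # Storage options, e.g. two 200 GB Cinder volumes per instance.
    "volumes_per_node": 2,
    "volumes_size": 200,
}

cluster_template = {
    "name": "hdp-small",
    "plugin_name": "hdp",
    "hadoop_version": "2.0.6",
    "node_groups": [
        {"name": "master", "node_group_template_id": "MASTER_NGT_ID",
         "count": 1},
        {"name": "worker", "node_group_template_id": "WORKER_NGT_ID",
         "count": 4},
    ],
    # Cluster-wide Hadoop configuration, e.g. the HDFS replication factor.
    "cluster_configs": {"HDFS": {"dfs.replication": 3}},
}
```

Scaling the resulting cluster up or down is then, as described above, just a matter of changing a node group's instance count and calling the cluster's scale operation.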
A few words about the supported distros. Right now in Sahara we support three vanilla plugins; they're named "vanilla" because they are just the upstream package versions of the frameworks: Hadoop, Spark, and Storm. And we support three vendor distros: HDP, CDH, and MapR. Actually, all of the plugins are, or were initially, supported by the vendors: the HDP plugin was written by Hortonworks initially; the Cloudera plugin was written by Mirantis but is now very actively supported by Intel folks who are working with Cloudera; and the MapR plugin was fully contributed and is supported by the MapR team.

So how does this grow with the ecosystem? There are many new frameworks in the Hadoop ecosystem joining Apache and the Hadoop world; for example, Cloudera and Hortonworks bring in some new frameworks with each major release. And thanks to its very pluggable architecture, Sahara can seamlessly support any new data processing framework: any framework can be implemented through the plugins. That's actually the intention of the project: to support new frameworks based on demand and integrate between them to provide a data processing stack. Right now, the provisioning part is fully pluggable, so any data processing framework can be implemented as a plugin in Sahara. The EDP part is not so pluggable right now, but it's becoming more pluggable, with transparent support for new job types and data sources as plugins; our actual plan is to extract all of the EDP stuff into plugins, to make it even more flexible and to make it easy to support new data sources, for example. So right now Sahara is mostly Hadoop-centric, but it already contains some Spark and Storm support, and it's growing to support other data frameworks as well.

Okay, so you've heard a little bit about the Hadoop ecosystem, and you've heard a little bit about how it maps into OpenStack via the Sahara project. The next question we usually hear at Mirantis and Red Hat is: okay, how do I put this all together? In this section I will share some practical tips and some practical measurements which we did by running big data workloads on some of our test clusters, and some of the best practices we came up with.

I mentioned at the very beginning that we changed the title of the presentation. To be honest, we wanted to bombard you with more statistical and performance data, but what happened is that the lab which we were using to do the measurements became unavailable to us for a period of time, so we had to postpone some of the statistics publication.
We will do it over the current cycle. Still, we will give you some best practices now, and we'll also be open for questions at the end.

Before I start: Sahara, as Sergey mentioned, is now part of all the official OpenStack distributions, including Mirantis OpenStack, Red Hat OpenStack Platform, and Canonical's, so you can get it installed pretty much anywhere automatically.

Now, to actually doing the notes. The first question we usually hear is how running Hadoop jobs in a virtualized environment compares to running on bare metal, which is how vendors like Cloudera and Hortonworks usually recommend deploying. The rule of thumb, the number we usually give, is that KVM as a hypervisor introduces roughly 10 percent or less of overhead. Of course, it varies depending on the specifics of the Hadoop jobs you are going to be running, but if you're talking to your manager or to a customer, "10 percent or lower" is usually the safe number to quote.

Obviously, we recommend that you isolate the networks: separate the storage traffic from the OpenStack management traffic, at the very least, to make sure that the traffic the Hadoop workload generates will not kill your OpenStack cluster. That's really not nice.

On the storage level, for the Hadoop VMs themselves, the best-practice recommendation is to use neither Swift nor the default LVM driver, but the block device driver: the driver that pretty much lets you pass a block device directly through into the virtual machine. It avoids the iSCSI overhead, which is significant, especially if you attach 10 to 12 volumes to a single virtual machine, and it avoids the LVM overhead, which is not significant by itself, but with LVM you cannot bypass iSCSI. So this way you pretty much utilize the full hardware disk and pass it through into the virtual machine, yielding the best performance.

There are two ways to do it today, at least if you're using Mirantis OpenStack. One is configuring it manually, like the guy describes in the link here. (Okay, it doesn't work like this; anyway.) The second way, again if you're using Mirantis OpenStack, is a Fuel plugin, which we will make publicly available soon; if you want to test it earlier, shoot us an email and we'll get you set up to test it. With it, you can configure your OpenStack cloud at deployment time to support the block device driver and allow VMs to be created with persistent storage like that.

One more item here: the scheduler hints passed by Sahara. The way Sahara works by default is that it hints the Nova scheduler to place Hadoop VMs so that they run on the same compute node where the volume is going to be created. That way you preserve the main rule of running Hadoop workloads: keep your compute as close to the storage as possible. Sahara strives to do this by default.
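For intuition, here is a hedged sketch of the kind of placement hint involved, using python-novaclient's `scheduler_hints` argument; the `same_host` hint is honored when Nova's SameHostFilter is enabled. This illustrates the general mechanism, not the exact code path Sahara itself uses.

```python
# Hypothetical illustration: ask Nova to land a new Hadoop worker VM on
# the same compute host as an existing instance, keeping compute next
# to its storage. Requires SameHostFilter in the scheduler filter list.
from novaclient import client as nova_client

# `keystone_session` is assumed to be a keystoneauth1 Session you have
# already built with your credentials.
nova = nova_client.Client("2", session=keystone_session)

server = nova.servers.create(
    name="hadoop-worker-5",
    image="IMAGE_ID",
    flavor="FLAVOR_ID",
    scheduler_hints={"same_host": ["EXISTING_INSTANCE_ID"]},
)
```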
Now, a sample configuration. This is a sample hardware-plus-virtualization configuration for a big data cluster which we once came up with for one of our customers. Here you see the OpenStack compute hosts. They're quite beefy: two sockets' worth of sixteen-core CPUs, 256 GB of RAM, a simple pair of disks for OpenStack and the host operating system, and four-terabyte hard drives used via JBOD. The rest of the configuration is the virtualized machines for the Hadoop components themselves: the manager node, the master node, the worker nodes, and a couple more nodes.

On the networking side, Sahara uses standard networks when it provisions big data clusters with Hadoop. So again, make sure you separate not only the storage networks but also the tenant networks, on different physical links if possible, because, again, nobody likes killing their OpenStack cloud by placing too much Hadoop load on it.

On RAM: please don't oversubscribe. If you oversubscribe CPU, things will just get slow; if you oversubscribe RAM, jobs will never finish running. That's the usual practice with KVM anyway, but it's especially relevant for big data workloads.

And last but not least, as we were discussing with folks today: with all the practices we can list here, you still stand a chance of running into poor performance, or into a killed OpenStack cloud. The key idea is that for real production workloads you will still most likely need some consulting from your big data vendor, whether it's Hortonworks, Cloudera, MapR, or somewhere else. On the OpenStack side we can optimize for more or less generic usage of Hadoop on OpenStack, but it will still very much depend on what you will run; it depends much more on the workload than when you're running generic stuff in individual machines. So if you're deploying big data, get ready to get some consultancy from folks who know big data really well, not only OpenStack.
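One small worked example of why the storage side deserves this much attention: HDFS replication multiplies your raw capacity needs. A toy estimate, with made-up numbers in the spirit of the sample box above:

```python
# Toy HDFS capacity estimate. All numbers are illustrative; real sizing
# must also account for intermediate (shuffle) data and per-disk headroom.
nodes = 10            # compute hosts running Hadoop worker VMs
disks_per_node = 6    # 4 TB JBOD drives dedicated to HDFS
disk_tb = 4.0
replication = 3       # default dfs.replication
overhead = 0.25       # reserve ~25% for shuffle space and headroom

raw_tb = nodes * disks_per_node * disk_tb
usable_tb = raw_tb * (1 - overhead) / replication
print("raw: %.0f TB, usable HDFS: %.0f TB" % (raw_tb, usable_tb))
# raw: 240 TB, usable HDFS: 60 TB
```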
Now, some of the open problems and questions that are relevant for Sahara and for big data workloads today. First of all, we're all working in the community to get you not only virtualized Hadoop but also bare metal Hadoop. Sahara natively integrates with OpenStack's capability to do bare metal via Ironic, though there are some specifics one needs to account for. For example, right now we have patches in review, and some already merged in the current Sahara cycle, to do smarter things with Ironic: for example, when Sahara provisions a Hadoop image onto a bare metal node, it will also automatically take care of provisioning all the available disks so they are properly used by that node, bypassing iSCSI, bypassing pretty much everything, including Cinder itself.

One more thing in the works is to allow different Cinder volumes to be used separately for HDFS and for intermediate storage results.

Another thing we're looking at, which I mentioned will be presented in our next development cycle, is certifying some reference architectures on hardware: this is the distribution we used, this is the Hadoop distribution, and this is how much performance you get. This has been taking too long; we've been focusing a lot on developing Sahara, developing plugins, and so on, and not so much on the hardware certification. We're closing this gap right now.

And finally, last but not least: new NUMA capabilities have been merged into Nova lately, mostly for NFV workloads. We've also got to take a look at how to properly leverage them for big data, including things like socket affinity: pinning the VMs to a specific CPU socket to avoid cross-socket switching. That already gets us beyond the level of having Sahara as an abstraction layer with support for different distributions; it gets us to the point of actually running this in production and making sure we get the most of the performance available from the hardware that is used within the cloud.
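For reference, those NUMA and pinning features are driven through flavor extra specs in Nova. A hedged sketch: `hw:cpu_policy` and `hw:numa_nodes` are the standard keys, but their availability depends on your Nova release and hypervisor support, and the flavor dimensions below are made up.

```python
# Hypothetical: create a flavor whose VMs get dedicated, pinned vCPUs
# confined to a single NUMA node (typically one CPU socket).
from novaclient import client as nova_client

# `keystone_session` is assumed to be a keystoneauth1 Session you have
# already built with your credentials.
nova = nova_client.Client("2", session=keystone_session)

flavor = nova.flavors.create(name="hadoop.pinned",
                             ram=65536, vcpus=16, disk=40)
flavor.set_keys({
    "hw:cpu_policy": "dedicated",  # pin vCPUs to host pCPUs
    "hw:numa_nodes": "1",          # keep the guest on one NUMA node
})
```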
Questions?

Q: Sergey, why do we recommend going with the block device driver; why is our default recommendation the block device driver and not Swift?

A: Actually, the main reason is that the Swift support in Hadoop is implemented in a way where you access data in Swift through the proxy nodes, which means you have a bottleneck on those nodes, and performance will not be as good as accessing the data directly. And on the Hadoop side, almost all workloads are very data-intensive, so you need very performant storage; it could be dozens of terabytes of data. That's why the default option for us is the block device driver with directly attached disks. Swift can be used for storing some inputs or results; for example, results that are much smaller than the inputs, or for when you have no Hadoop cluster running. So we would keep Cinder volumes backed by the BDD; think of it as a block device passed through directly into the VM.

I think we're supposed to use the mic for questions; we should have mentioned that earlier. Do you want to restate your question from the mic, for posterity?

Q: Just curious about scaling the cluster up and down. HDFS replicates the data, so as you scale the cluster up and add data nodes, the cluster has to rebalance. Is that something you find happens quickly and easily? How often would you typically see people rebalancing or scaling clusters up and down?

A: While scaling the cluster down, Sahara does the decommissioning before it disables a data node. In fact, the state of the node is put into a maintenance mode, like read-only, and then the decommissioning process starts, moving all the data blocks to other data nodes. That's the full scale-down. For scaling up, we're not doing a rebalance, because rebalancing could be an endless process in Hadoop, and there is no real advantage to rebalancing when you add new data nodes: if you start writing new data, it will be put on a free node first. That's not true for HBase; this applies only to HDFS. For HBase you will probably need to run a rebalance manually; HBase is a very specific case.

Q: We have our own object storage service based on Ceph RGW, not on Swift. How can we integrate with Ceph instead of relying on Swift?

A: There are two options, basically. The first one is to use the RADOS Gateway for Ceph; it is exposed as a Swift API, but it introduces proxy nodes, and the performance will be bad because all your data will go through the RADOS Gateway. The other option is to use the Ceph file system, which does exactly the same thing but as a file system, and the performance will be much better. There is no native Ceph support for Hadoop, and we're evaluating writing a Hadoop plugin to support Ceph natively; we're thinking about it.

Comment from the audience: That makes sense and matches how it's implemented. We actually run Ceph in production with the Swift API, and a Hadoop cluster on top of it with Swift as the data store. We're actually giving a talk on that tomorrow, so if folks are interested in seeing how we're doing that, come by.

Q: Does the Sahara API plan to support the scenario where you have a separate user for creating a cluster, while a different group of users can only submit jobs and get the results?

A: In Mitaka, well, probably at the end of the Liberty release, we added ACL support, so you can mark a cluster public and it will be publicly available to other tenants in OpenStack. Those users will not be able to operate the cluster itself, but they will be able to run EDP jobs on it.

Q: Another question about the API. For some of these emerging Spark-like interactive analytics, it would be desirable for the user to query the API and get the response right away, say if you're doing a query on some big data. Is there a plan for the Sahara API to support that kind of model?

A: We don't have such functionality right now, and as for plans, it's probably not on the roadmap for now. Probably in the v2 API we'll think about supporting interactive queries.

Q: [About how the block device driver works.]

A: So, the BDD driver. It's in fact a very, very small driver that just uses virtio to attach the block devices, the hard disks from the node, to the virtual machine; in fact, it just tells libvirt to add a virtio device. Virtio performance is just a few percent slower than bare metal, so virtio introduces only a very small overhead; it's directly passing the disks through to the virtual machine. So at scale, yes, in total the overhead will be the same one to two percent in performance. It doesn't introduce a huge difference in the performance your client sees; actually, we were running tests, and on long-running tasks you will never see that one or two percent.