The music's gone. I guess that's our cue. We should start. All right. So hi, everybody. It's the third day of the summit, the afternoon, and you made it here. I'm very happy to see you all. My name is Hart Hoover. I am with Cisco. This is Tony. Tony, say hello. Tony Dubiel, Metacloud, for Cisco. Solutions architect. Yep, and I'm a product manager for Cisco. And we're here to talk about using OpenStack orchestration for big data.

So here is what we're going to talk about. Surprise: we're going to talk about big data. We're going to talk about deployment options for big data workloads. We're going to talk mostly about OpenStack orchestration. So even though this is on the big data track — psych — it's not really a big data talk. It's a deployment talk with orchestration. And then we're going to talk about what the community talks about with regards to OpenStack orchestration and big data, and finally, a demo. Ready? All set? Strapped in? Let's do this. Let's do this.

So big data. It's a thing. There's a whole track for it. The reason is because the digital economy is disrupting all the things. As we grow as internet citizens, as the internet gets bigger and bigger, and we generate more and more things that generate more and more data — this is how many bytes of data we generate a day. Unfortunately, most of that data is never captured, or it's collected with no action taken. This is in the context of business workloads — so not whether or not your light is on or off in your house, but more like: I am interested in internet-of-things light bulbs; do you record that data and use it to market to me or not? And as we grow on the internet, by year-end 2018, 25% of durable goods manufacturers will utilize data generated by smart things in customer-facing sales, billing, and service workflows. And by 2018, they estimate 6 billion things will request support. That's based on a Gartner report released recently.

Use cases for big data and analytics are all over the place, from retail to law enforcement and defense to health care to education, research, financial services, communications, automotive, oil and gas — all these things. Big data is transforming the world that we live in. It's becoming an integral part of the enterprise IT ecosystem across major verticals, including these you see here, and a significant business opportunity for Cisco partners and customers exists.

So to that end, what are we going to talk about deploying today? Hadoop. Hadoop is a framework for scalable, distributed software. It's based on the MapReduce algorithm from Google. It enables data-intensive processing of huge amounts of data, running on commodity cluster hardware. Hadoop is open source, part of the Apache Foundation, and the commercial ecosystem is led by Cloudera, Hortonworks, and MapR. Hadoop has several compelling value propositions, mostly around its flexibility: it's able to merge traditional reports with unstructured data, all in one place. So big data is not just about the data inside the corporation; it's about leveraging the entire world of data that exists out there and applying that data to business decision making.

This is the architecture for Hadoop at a very high level. You've got a primary node where the ResourceManager lives, which is basically a master that manages resources in the Hadoop cluster. It has a scheduler to allocate resources to the various applications across that cluster. And then the secondary nodes are known as data nodes. Hey, guess what that has? Data.
Also referred to as a worker node, it's responsible for storing and computing data, and it responds to the primary node for file system operations. Generally, data nodes — surprise, surprise again — require a high amount of storage space.

Why are we talking about Hadoop and big data workloads at an OpenStack summit? Because you people care about it — or, users care about it. This is from the April 2016 user survey report, last year's report, where big data workloads made up 27% of what people are using OpenStack for, and 18% of those workloads were in production. In the report that just came out, that number is 26% in total, so roughly the same, but you see an increase in production workloads from 18% to 21%. So that's why we're here talking about Hadoop workloads. I'm going to turn it over to Tony to talk about big data deployment options. Take it away. Hey, thanks, Hart.

We're going to use two deployment options as an example to set up a backdrop for the demo. Cisco has several different what we call validated designs for big data, Hortonworks being one of those, so we're just going to use that as a reference point. And we're going to compare UCS bare metal to MetaCloud. What I mean by UCS is our Unified Computing System servers, and we're talking about our C-Series, which are rack servers; we use them both in bare metal and in MetaCloud. When I mention MetaCloud, I'm talking about Cisco's OpenStack service, and we'll talk more about that in a few minutes.

But before we do that, I was just going to do a quick level set on Hortonworks. Most of you are probably familiar with the Hortonworks Data Platform, HDP. Really, that is the Hadoop cluster back end that we would use for the data at rest. And then more recently, a lot of our customers are using the Hortonworks DataFlow, HDF. That's more of the front end, and it's focusing on the streaming analytics — the data in flight. So when you couple that together — Hortonworks, in this case we're using 2.4.2, the documented release in our validated design, but you could use newer releases — it's really a combination of multiple open source Apache applications, and Hortonworks provides a service of stitching those together in the best practice for delivering your Hadoop cluster.

So what we've done is we've validated that design, and if you're interested in more details around the hardware specifications for each one of the applications, you can follow this link. And I mentioned we have several other validated designs as well. I just wanted to pick on Kafka as an example, since we're talking about streaming analytics: because it's a distributed analytical application that's ingesting data, the network's really important for that type of use case. We'll touch on more detail in the next couple of slides. And lastly, we want to talk about the benefits of the APIs, in particular the OpenStack APIs, because the availability and functionality of those is really consequential to the success of the Heat templates and the HOT orchestration that we'll use for deploying Hadoop in an automated way.

So first, in terms of bare metal, I just want to talk about the network infrastructure. In our validated design, we use a switch called the Fabric Interconnect. Within Cisco, that switch functions as a very rudimentary switch: it only provides local connectivity between our servers. But it also provides our management system.
So if you're familiar with Cisco servers, you know that part of our secret sauce is UCS Manager. UCS Manager provides a toolset for provisioning the servers, and this toolset is delivered in a GUI and also provided as an XML API. We leverage the notion of service profiles: essentially you can create pools of hardware resources and make those available in policies for the firmware, the BIOS, the network, the compute, the storage, and it makes it very easy to provision those servers. But the challenge that we have is that the scope of it, again, is internal to the pod.

When we start talking about streaming analytics, the network is paramount, in terms of not only bandwidth but performance and quality of service. So we need a much more feature-rich network edge; the external network is very important. As an example of that, I included two ASR routers — Cisco routers — and two Nexus 9K switches, because those are newer products that offer NETCONF and have their own APIs. So the network is very programmable, and you can provide orchestration to those devices. Now, unfortunately, in the bare metal configuration, those are separate APIs — UCS Manager versus the network devices — so you would have to stitch those together with additional solutions, such as UCS Director as a Cisco example, or other toolsets like Ansible. So that's just one area of complexity there.

The next component is HDF. I mentioned Kafka as an example of a distributed application, and it has specific hardware requirements because of the high I/O. So in our configuration, we have to use more expensive SSDs — solid state drives — for that type of application, and there are other applications where we would build particular nodes with purpose-built hardware that supports the specifications of those particular applications. When we talk about HDP, the Hadoop cluster, this is where we can take advantage of more commodity-type hardware. We can use spinning disk, SAS hard drives, and it's a low-cost option to build a significant amount of storage for HDFS. So for that Hadoop file system, we're able to build 2.5 petabytes of usable storage in a single rack — that's about 7.5 petabytes raw, but because of the replication you have one third of that available to you — and you can use more commodity-class hardware.

And then in a scale-out, we scale out horizontally. In this linear scale-out, we're able to take a rack, and then because we have top-of-rack switching, we can extend it to four racks, or 64 HDP nodes, up to 10 petabytes of usable storage for the Hadoop configuration. And if we want to go beyond that, it's just a matter of adding additional Fabric Interconnects and top-of-rack switches, and you can scale out to larger numbers in your data center.

So here's a summary of bare metal. Now, although I mentioned APIs and there is self-service, it's limited in that it's not ubiquitous across the entire platform: you're needing to tool your network with different solutions than you would use for your servers. And then obviously the Ambari component of building out Hadoop is not a part of UCS Manager. So from an automation perspective, it's a bit more difficult and more scripts are needed. And then from a managed-service perspective, currently at Cisco we do not have a fully managed service for bare metal to deliver Hadoop, so it's more of a do-it-yourself model with a validated design.
So for our customers that are looking for fully managed, and from a virtual Hadoop perspective, we have MetaCloud. MetaCloud in a nutshell is positioned for customers that are looking for a public cloud experience, but they want to deliver it inside of their own firewalls, in their own data centers, or in colos. MetaCloud is delivered as a service: we deploy all of the hardware infrastructure and OpenStack services and make it available to our customers.

What this represents is the head end. For us to install OpenStack, we remotely deploy OpenStack on three controller nodes — the three C220 servers that you see depicted here. We also include a router for out-of-band management so that we can remotely manage the hardware infrastructure. And then we include a pair of ASR routers and Nexus 9Ks, because we can leverage those for hardware-assisted Neutron: we install a Neutron plug-in into the controllers, into the Neutron service, and all of the internal and external networking is programmable through the Neutron APIs. What that provides is self-service that is pervasive, whether you're using the Horizon dashboard or the APIs. And from the remote management perspective, all of the underpinnings and complexity of the lifecycle management of OpenStack — we're covering that under a four-nines SLA as part of the managed service. And in terms of OpenStack, it's truly HA, because we're also installing HAProxy for the controllers, Corosync, Pacemaker, and various other probes to make sure the availability of the APIs is always there. And then the Heat service that Hart is going to use — we make that available so that you can orchestrate on top of OpenStack to build your Hadoop cluster. And from a networking perspective, we're using enterprise-class hardware, so we have 40-gig infrastructure east-west between servers, and we also create an edge network that's programmable and fully HA, with failover capabilities because of Cisco networking technology such as HSRP.

We can scale out in much the same way. We can use the same types of hardware that we use for bare metal — and again, when I say bare metal, I mean traditional bare metal, not Ironic; we'll talk about that in a second. So as in traditional bare metal, we can use more expensive, enterprise-class hardware for things like HDF, and we can take advantage of host aggregates to reserve hardware for those workloads (there's a sketch of how that can look just below). And then for the Hadoop cluster, we can use lower-end hardware as needed to build out an environment. From an expansion standpoint, we can scale out in a linear fashion within a single head end — that notion of a single set of controllers, network, and APIs — and we can grow that up to 400 compute nodes. So your Hadoop cluster could grow fairly large within a single head end.

So the benefit here is that we offer a single framework of automation and APIs that we're using across the compute, the network, the storage, and also extending that to the Ambari APIs to build the Hadoop cluster — and you'll see more of that in the demonstration. And then from the lifecycle management perspective, we're taking care of not only the OpenStack, but the hardware — the servers, the network infrastructure — providing firmware updates, rolling updates, upgrades, the actual underpinnings of the operating system and KVM.
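To illustrate that host-aggregate point, here's a hedged, minimal HOT sketch — not from the talk's demo. It assumes an operator has already created an SSD host aggregate with metadata ssd=true and a flavor whose extra specs (aggregate_instance_extra_specs:ssd=true) restrict scheduling to that aggregate; the flavor, image, and network names are made up:

```yaml
heat_template_version: newton

description: >
  Sketch: land a Kafka/HDF node on reserved SSD hardware via a host
  aggregate. Assumes an aggregate with metadata ssd=true and a flavor
  "hdf.kafka" carrying aggregate_instance_extra_specs:ssd=true.

resources:
  kafka_node:
    type: OS::Nova::Server
    properties:
      name: kafka-node-01
      flavor: hdf.kafka      # hypothetical flavor pinned to the SSD aggregate
      image: centos-7        # hypothetical image name
      networks:
        - network: hdf-net   # hypothetical tenant network
```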
And for customers that have concerns and are a bit skeptical of virtual machines for a Hadoop cluster, we offer PCI passthrough as an alternative to virtio, for hardware access to those RAID controllers. And moving into our product in the near future, we will support bare metal for customers that want to mix virtual machines and bare metal together, but provision it all under the same umbrella of APIs. So that would be the overall benefit of a MetaCloud-provided platform for a virtualized Hadoop. With that, I'll hand it back to Hart for the actual Heat overview and demonstration. Thank you.

So now let's talk about OpenStack orchestration. Now that we've got a cloud deployed, how do we get Hadoop workloads on there? For those that are new to OpenStack orchestration: what is that? OpenStack orchestration is also known as Heat. It is a service used to create and manage the lifecycle of an OpenStack cloud application. Every application is represented as a stack of the infrastructure resources required to run the application, as well as the application itself.

Heat is made up of several components. You have the client-side stuff: either the Python Heat client, or there are tools built into the OpenStack client that you can use client side. Server side, there are several services: the Heat API, which is the Heat OpenStack-native REST API; a CFN service, which provides an AWS-style query API compatible with CloudFormation templates and processes API requests by sending them to the Heat engine via RPC; the Heat engine itself, which is the main worker performing the orchestration; and then some shared services, MySQL and RabbitMQ. MySQL is used to keep track of stacks and resources, and the message queue is used for — guess what? — passing messages back and forth between OpenStack services. When you do a POST to the OpenStack orchestration workflow, you are talking directly to the Heat API, which drops a message in the queue that's picked up by the engine. The engine does the work of deploying resources for you and then records that work in the database. When you do a GET request — you want information about a stack — it just reads directly from the database, and you get your info very quickly. That's the high-level overview of OpenStack orchestration. If you're in here and you're a core contributor to Heat, I apologize in advance.

So let's talk about HOT. HOT is the Heat Orchestration Template format: templates are the instructions for how you tell Heat what to provision. The components of a Heat template: there are the sections of a template, parameter constraints (which we'll talk about), some pseudo parameters, and then intrinsic functions.

As far as sections of a Heat template, we have what I call the meta section, which is basically: what version of Heat are you using? That will determine what resources the service is able to provide you. You can specify it by date, or by OpenStack release name after Newton — so you can say heat_template_version: newton. And then provide a description. A description is always important, so if someone comes back later and wants to know what this Heat template does, there's a nice description there for your coworkers or your future self.

Parameters in a Heat template could be described as inputs: things that I want to feed into a template. They let you use the same template for multiple environments — for example, you have one template but you use it for both dev and production. So in dev I want to use this SSH key; in production I want to use that key, or this image, or this flavor, et cetera, for compute.

And since we're talking about parameters: parameter constraints. You can set constraints on your inputs. For example, if you have a key, you can say I want the key name to be this many characters long. Allowed values is also really good for that: you can say only use existing keys. You can have a range — for example, I want to deploy X number of Hadoop data servers, and it has to be between zero and ten or something. Heat also allows you to do regular expression patterns, and custom constraints if you want to write your own. That's a bit outside the scope of what we're discussing today, but: constraints on parameters.
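As a hedged illustration — not the exact template from the slides — a parameters section with these kinds of constraints might look like this; the image, flavor, and count names and defaults are made up:

```yaml
heat_template_version: newton

description: Parameter examples - same template, different inputs per environment.

parameters:
  key_name:
    type: string
    description: SSH key to inject into the instances
    constraints:
      - length: { min: 1, max: 64 }        # key name must be this many characters
      - custom_constraint: nova.keypair    # only allow existing keys
  image:
    type: string
    default: centos-7                      # hypothetical image name
  flavor:
    type: string
    default: m1.large
    constraints:
      - allowed_values: [m1.large, m1.xlarge]
  datanode_count:
    type: number
    default: 3
    constraints:
      - range: { min: 0, max: 10 }         # between zero and ten data nodes
        description: Bound the size of the Hadoop cluster
```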
Back to the overall sections. Resources are what Heat actually deploys. So in this example I have here, it will deploy a compute server using Nova. It has certain properties: a flavor and an image, a key, a network. You can also use parameters in your resources — so you have a parameter, for example, for a flavor or an image, and then in your resource you say get_param: image, right?

Last but not least in a Heat template: your outputs. Those are exactly that — outputs. This is what you get at the end of your template when it's run. So in this example: what is the IP address of my deployed compute instance? And the value would be the first IP address that is provided by the server.

Heat also has some pseudo parameters that are just part of every stack and are available for you to use. You can use the name of the stack that you're actually deploying, the stack ID, or the project ID. These are good to use in resource names, just to keep resources straight, especially if you're deploying multiple stacks in a single tenant or project. Just a word to the wise: use some pseudo parameters in your stacks.

And finally, some intrinsic functions with regard to Heat. These are just available as part of the Heat service in your templates. You can get attributes, you can use files, you can refer resources to each other — which we'll show in a minute. You can do string replace, which is good for things like cloud-init information. You can use resource_facade, a custom resource you can use with parent-child templates; that's some advanced Heat stuff, again outside the scope of this talk. The digest function is good because you can use it for hashing things like passwords. And repeat is used for repeating resources given a list of items — for example, security group rules.
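Pulling those pieces together, here's a hedged, minimal sketch — again, not the demo template — showing a resource, an output, a pseudo parameter, and a couple of intrinsic functions in one place; the server name, image, and flavor values are hypothetical:

```yaml
heat_template_version: newton

description: Boot one compute instance and report its first IP address.

parameters:
  key_name: { type: string }
  image: { type: string, default: centos-7 }    # hypothetical default
  flavor: { type: string, default: m1.large }

resources:
  my_server:
    type: OS::Nova::Server
    properties:
      # Pseudo parameter OS::stack_name keeps names unique across stacks.
      name:
        str_replace:
          template: $stack-node
          params:
            $stack: { get_param: 'OS::stack_name' }
      key_name: { get_param: key_name }         # intrinsic function: get_param
      image: { get_param: image }
      flavor: { get_param: flavor }
      user_data:
        str_replace:                            # handy for cloud-init payloads
          template: |
            #!/bin/bash
            echo "Deployed by stack: $stack" > /etc/motd
          params:
            $stack: { get_param: 'OS::stack_name' }

outputs:
  instance_ip:
    description: The first IP address of the deployed compute instance
    value: { get_attr: [my_server, first_address] }   # intrinsic: get_attr
```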
So with that, let's talk about Heat templates and the demo. Luckily, the OpenStack community provides a fantastic Heat template for Hadoop deployment, and it's available under the software config section on the website, where you can see this — I shamelessly stole this graphic. The Hadoop architecture that we discussed earlier is right here, with the master node here, the ResourceManager, and then these are the data nodes here in a cluster, separated into security groups using Neutron. The edge and admin nodes are a little bit extra: the admin node is used for straight-up system-wide administration, and the edge node is basically acting as a jump box for a user. For security, direct user access to the Hadoop cluster should be minimized, so this way you have kind of a jump box here where the user coming in can either put data into object storage or just use CLI tools to talk to Hadoop directly. All data import and export processes can be channeled through the edge node.

Again, that big data deployment is available here at this link, but I'm going to pull up the template here. Where's my mouse? I lost it. There it is. Cool, you can see that. So this is on OpenStack.org — you can see where I shamelessly stole the graphic from, and you can see the services included in the configuration. Looks like, yeah, okay. And what's nice about it is they show the projects through the Project Navigator, so it shows their adoption rates, how mature they are, et cetera. Fantastic resource. And the Heat template is somewhere... there it is, a small link. It's a zip file you can download and look at if you want.

I'm gonna pull up over here — should have had a bookmark, of course. So everyone can see that, right? Super small text. Cool, so I'm gonna make it a little bigger. You can see I have my meta section, my awesome version, and my cool description: this deploys Hadoop and stuff. I've got my parameters here — these are inputs that I'm providing to my Heat service. I have things that are hard-coded; I don't recommend doing that, it's just for ease of use for me. They're just hard-coded UUIDs as default values, because I'm lazy. But you can provide all these parameters at runtime. Then we get into resources. We're using the software config resource — this is all from the community. It is a very, very, very, very, very long bash script, which should surprise no one. Cool. And then we get into actual OpenStack resources: Cinder volumes — scroll down a little further — keys, networks, Neutron subnets, networks for each group and the security rules that manage them, the admin network, routers. So all this stuff is provided in the template and defined under version control for you and your teams to manage. It's fantastic. Was most of the bash script the Ambari install? Yes, yes. So what this does is install Ambari and Hadoop on the nodes as they come up and connect them all together for you. So it takes a lot of shell scripting. And then finally, this template has no outputs — we just get to the end and there's nothing there. You could have an output here of your edge node IP; I don't, but that's okay.

As far as a demo goes, I am going to deploy that on MetaCloud in a video. So again, I'm gonna skip this because I just did this with you live. So: logging into MetaCloud, the OpenStack service, go to Orchestration, click on Launch Stack. Did I pause it? Yes, I did. There we go. Click on Launch Stack and then provide your template file. It will then auto-fill all of your parameters for you based on your defaults. Give it a name, and MetaCloud allows you to roll back if it fails, which is nice. It will start creating all the things and bring up this jumbled mess. This jumbled mess is all the resources that we're deploying; it looks jumbled because there's a lot. First, the networks come up. What's cool about orchestration is that it brings things up in order — let me get rid of my video player — so it'll make decisions based on what can be provisioned first, for the fastest deploy time. So all the networks, keys, subnets, all of that stuff gets deployed first, then Cinder volumes, and then finally the compute hosts get brought up, with all that Ambari automation happening. And if you wanted to use better hardware, you could use the host aggregate or the flavor to point it to whatever you want? Of course, of course. So as it deploys — when it finishes, you'll see the parameters that we discussed, what you actually ran it with.
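Rather than hard-coding UUIDs as defaults, one hedged alternative (not shown in the talk) is a Heat environment file that supplies those parameters at launch time; the file name and values here are hypothetical:

```yaml
# hadoop-env.yaml - hypothetical environment file overriding template defaults
parameters:
  key_name: prod-key
  image: centos-7-hadoop      # hypothetical image name
  flavor: m1.xlarge
  datanode_count: 5
```

On the command line, that would ride along with the template as something like: openstack stack create -t hadoop.yaml -e hadoop-env.yaml --enable-rollback hadoop-stack — the same inputs the Launch Stack dialog auto-fills in Horizon.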
You can list the resources that you have deployed. All of these correspond to the names in the template. Each resource gets its own ID — I'll pause it there for a second — each resource gets its own UUID, it tells you what it is if the name isn't enough, and then it gives you a status: what happened during your deployment. And as you click over to Instances — this only deploys a few instances, the master name node and then some data nodes with Cinder volumes. As you deploy, you can also see events: what happened during this deployment, if something errors out — which it probably won't, it'll probably be fine. And then when you're done with it, you just click it, click delete, and it's gone. So you can do your work on a Hadoop stack using Heat orchestration. You can see they turn red as they go away. It's kind of sad. Bye-bye, stack.

That was cool. But with this, you can deploy your big data workload, quickly do your data processing, and then turn it all off. You don't have to leave it running and incur costs, especially if your departments are using some type of chargeback model; you definitely don't want to use resources any longer than you have to. And so with that — let me get my display windows back — if you want to know more about Heat, you can ask us questions for a little bit longer, or you can come see us at the booth; we're in the marketplace. We are also on Twitter: I'm @hhoover. What's my Twitter handle? And Tony, where can they find you? @tdubiel. Thank you very much for having us, and we will take your questions. There are mics right there. The Heat template I showed in my browser is on GitHub if you want to look it over; again, the one the OpenStack community provides, from the Enterprise Working Group, is on their website. You can look at both, compare and contrast, for fun and great learning.

Cool, question, awesome. How do you see normal users accessing shared data sources? Are they connecting over to other HDFS setups? Are they pulling from Swift? Are they pulling from a Ceph-backed Swift platform? Or do users avoid that style and just bring their own? Usually they're accessing shared data sources? Sure — so usually they're using a Swift data source for ingest, and then they bring it into their Hadoop cluster, process it, and then put it back out into an object store. That makes sense, okay. I have not seen much Ceph used that way — mostly Swift. Well, you don't push the whole data set back, you just push back the result. Yeah, right.

That is an interesting question. So I've given this talk before, and Sahara has come up. The reason that I used Heat instead of Sahara directly is only because I don't see Sahara available as much as the Heat service. I would love Sahara to take that over, but so far vendors don't seem super eager to release a version of it, or, like, support it in their clouds. MetaCloud doesn't, so we use Heat — but that definitely isn't a knock on Sahara. Sahara's back end uses Heat anyway; it just manages Heat stacks for big data. But yeah, I'd love that. Yes? Eventually? All right, oh, yep.

Question, sweet. One open source big data project that I've looked at a little bit is PNDA, and I'm just wondering if you're familiar with that and have any thoughts about working with it. Yeah, we've had that come up for service providers who were using PNDA for collecting network telemetry. And it does have a validated design supported on OpenStack, so it deploys fine from that perspective.
I'm not sure about the adoption — I think it's been mainly, like, two service providers, not so much in the enterprise space. Have you deployed PNDA? Yeah, yeah. Well, cool. We'll be up here for a little bit longer, or stop by the booth. Thanks. I'll leave the screen up.