Okay, we'll get started while Kevin sorts himself out. I'm George Castro. This is Kevin Monroe. I've been coming to SCaLE for four years. This is his first year. So I'm really excited to do this talk with him. How many of you have been to one of our big data talks before? Last SCaLE. None of you? Awesome. So you guys are fresh. So let me tell you guys, first of all, thank you for coming on a Sunday morning to a talk. I realize with all the partying that's going on that it can be difficult to wake up in the morning. So thanks for coming. I appreciate that. The way we like to work is I don't like to go to talks and sit and have a guy preach at me with slides and stuff. That's not how we roll. We like to do stuff for real. So what you're going to see here is going to be real, on bare metal, here on this portable cloud that we brought for you. Kevin's got stuff fired up on all sorts of public clouds. And I want it to be driven generally by questions, or we want to know how you guys are running big data. If you are, what are your blockers? How can we help to make your life suck less? That sort of thing. So the way we're going to work is we're going to bash through some intro slides. Kevin kind of gives you some key concepts of the stuff that we're working on. Then maybe you guys can kind of tell us what you're working on. What would you like to see? And then we can either start firing up stuff live on clouds or have a discussion on how we can help you get your project to completion. Does that sound good, or is that okay? No one's getting up to leave, so that must be okay. So with that, Kevin, let's get started, and thank you. Yeah. So in the spirit of George's comment there that we want to be workshoppy, very engaging with you, I've prepared 26 slides to kick off our engagement process. So this is real world big data. What I mean by real world is that this is not theoretical. This is not academic.
All the stuff that you're going to see in these slides and in these demos is out there publicly available, totally free. You can deploy, like George mentioned a few times, to metal, to clouds, public or private. And there are no barriers to entry here. So what we really want, what I want you to leave this talk knowing, is that if you've ever thought, man, I would like to do a Hadoop service or a Hadoop solution, I just want to use Hadoop, but thought that the barrier was always getting it deployed and configured correctly, we hope after this talk you'll see that that's not a big issue. So George and I, full disclosure, work at Canonical. So we use Juju. We think it's fantastic, but we fully recognize that not everybody uses Juju. There are plenty of config management and other types of tooling that you could use to set up big data. But in this talk we're going to show you how we do things with big data and Juju. The core tenets that I have here, though, should transcend the model that you use, the tool that you use, whatever. And I want to just make sure that we're all on the same page with these basic principles of big data deployment. The first bullet there: first of all, Juju, if you don't know, there have been a few talks about it, and you may have heard Mark Shuttleworth speak and mention Juju a few times. It's just our modeling language. It's our way to take a complex environment like big data or OpenStack or any of a number of really complex things and make a model that is deployable. That's the easy part, just putting the software on a system. And then make sure that it can relate to things. So if I have a service like a name node, I know I'm going to need a data node. They kind of go hand in hand. So set that up so that one can relate to the other. And then if I only have one data node, it's not going to last me very long. Maybe scale up to 10 or 1,000 or however many data nodes to make a large distributed file system.
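Deploy, relate, and scale are exactly what a Juju bundle captures in one file. A minimal sketch of the idea, with placeholder charm names and unit counts (the exact names in the charm store may differ):

```yaml
# Hypothetical bundle sketch: charm names and series are illustrative only.
services:
  namenode:
    charm: cs:trusty/apache-hadoop-hdfs-master
    num_units: 1
  datanode:
    charm: cs:trusty/apache-hadoop-compute-slave
    num_units: 3      # scale out by raising this, or later with `juju add-unit`
relations:
  - [namenode, datanode]   # data nodes find and register with the name node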
So scalability is something that we also consider important in the model. Once you get those and you have some mechanism for deploying and relating and scaling, configuration and all that, then these next three bullets become important, especially important to us. I hope this is common sense. Again, I just want to level set us all. If you don't think of these three things, then you're a terrible person and the people that inherit your stuff later will be angry with you, right? So the first one, reliability. When I deploy something today, maybe it's a proof of concept. It's a very small cluster. And suddenly it takes off and my project has worked and I want to go up to whatever scale or whatever production environment I might want. It would be awesome, and it should be the case, that if I deploy that again, it works the same as it did before. I don't need to tear down my environment and restart from scratch again. So the model that you use and the services that you deploy should be very reliable, in that they do the same thing over and over again. And that kind of bleeds into repeatability, right? If you're on a laptop with, say, LXC containers and you say, okay, I'm ready for the cloud, let me take that model that has worked and do it in the cloud. Conversely, if you're in the cloud and, like Ashley Madison, you're like, crap, why did we put this in a public cloud? I want to come down to a private OpenStack or something that is maybe more in my control, where I get the hardware and everything. So you want that model to be repeatedly deployable and repeatedly set up and working for you. So that's a core tenet for us. And then observability, right? So this is sort of a blanket adjective here. But what I mean by that is when the thing is deploying, I want to get some status, some feedback. Can I observe what the thing is waiting on? Am I blocked?
Am I trying to spin up something, still installing the operating system, waiting for relations to things like name nodes? And it also kind of extends into: did my model perform like I expected it to? So is this the right cloud for me? And for that maybe you would use some form of benchmarking, so you would maybe run a terasort or something to say, this is what I thought would happen. It's nice to be able to verify that what you're seeing in the real world is what you expected. So these are core tenets, and again they transcend Juju. They're just good ideas for any big data deployment. Then we took that a step further in Juju with our big data charms and models, and we said, you know, there's a big barrier to entry, and we've heard from people that are third-party service vendors that they want to just create a big data service, like the next Hive or the next Spark or whatever. And they said, jeez, I don't want to be a Hadoop distribution. I don't want to support HDFS and MapReduce and all that stuff. I don't want to use it. I know I need HDFS, but I don't want to have to be suckered into deploying or bundling it. And what do I give you? A big blob just to test my service. That doesn't work well. So we thought, hey, let's try to make this model in such a way that you just plug into it, so you get to forget about the stuff on your left. That diamond on the left, that's what I call core Hadoop. That's MapReduce and HDFS and stuff like that. And then you get to just play over here at your client level. So if you're a server application developer, that's cool. You can just plug into Hadoop. It also works for people that are on the user side of big data, right? I'm a data scientist and I've got this neat model that I want to run on my data. I, again, don't know and don't want to know. It's not my field of expertise to know how to set up Hadoop and configure it. Did I put the SSH keys in the right place?
Did I distribute /etc/hosts around my cluster so that these things can talk to one another? So we've tried to model this in a way that at this box here you can forget the stuff on the left. That's the easy part, right? And easy in that we've done it once, so we can do it again quickly. Not easy in that if you tried to do this from scratch, you would be hurting. So that's why we have this little client thing over here. All that represents is an endpoint that you can talk to, that you can get into Hadoop with. And you know that as long as your plug-in is with you, you are compatible with the cluster on the left there. So it's got the right Java version, the right Java vendor, the right Hadoop version, the right Java libraries that you may want for the utilities, like hdfs dfs -ls, stuff like that that you may actually want to get in there and do. But this client, an endpoint, right? That's not all that great. It's just a box or something to SSH into. So I want to go back to talking about, if you were an application developer, what kind of client applications and services we're talking about, and I'm going to burn through these really fast. So stuff like Flume for ingestion. Again, that's not all that crazy cool, but it does let you get stuff into HDFS. Maybe you didn't know how to do that before. You can now use a Flume service. Maybe you want to expand it to include Kafka, right? So you've got a pub-sub model where I'm subscribed to a topic and I get messages in, and maybe I want to send those through Flume and get to HDFS, and of course ZooKeeper is there to keep Kafka in check. So that makes sort of an ingestion story. And then we've got some analytic stories that we can start building solutions with. Things like Pig. Again, Pig is very simple. There's not even a service that goes along with Pig. It's just a binary. But then maybe Hive is something important to you.
So now we can talk about database relations, and can we swap this in and out with MariaDB or DB2 or Oracle or whatever. I mean, you can start to get a feel for why a pluggable model is special to us. So you may say, all right, well, great. I've got these little ancillary utilities that can help me, but I read on the Internet that Spark is cool. Okay, sure, plug Spark in. We can do that. And you may say, look, Spark's cool, but I don't want to SSH to Spark and run spark-submit. That's not my forte, right? I want a notebook. I want to type my code into the Internet. So we say, all right, here's a notebook. Plugs right into Spark. It's already plugged into the main cluster there on the left. You say, no, Kevin, I don't want IPython Notebook. It's terrible. I want something else. And I say, fine, use Zeppelin. So we have these services already available for you and you can just start to mix and match your thing. You say, no, Kevin, I don't want Zeppelin. Fine, build it yourself. We have tried to sort of survey the landscape of big data and pick out those core services that are, we think, important to you. And we've got those in sort of a Lego building block, Lincoln Logs kind of fashion, where you can put these together as an application developer. Maybe you want to, I don't know, do some analytics with Pig and Flume, for example. Maybe there's something you want to put together with those two blocks. You can use those and either accelerate your application development or just get to work using those services. So what George is doing is switching over. I'm at the Build Your Solution slide since I queued them up on that. What we want to do now, this will take about 20 minutes, which, if you think about it, is pretty incredible. What we're about to do is deploy what we have considered a real big data solution. It's log analysis, right? Anytime I hear people getting started in big data, they do two things.
They sort some logs or they analyze some logs, and then they ingest some Twitter stuff, and they're like, cool, I got some practicality with my log analysis and now I can read tweets, and that's awesome, right? That's how people usually start with big data. And so we put a bundle together. In our parlance, a bundle is just a collection of charms, right? These little repeatable service units that you can deploy in a solution. And one of the bundles that we have, and we're going to kick off right now, is called the syslog analytics bundle. And it exercises a lot of those components that I kind of went through quickly in the slides. These are things like syslog forwarding to some listener, right? And in that case, we decided that listener would be a Flume agent. So our logs make it to Flume. And then once they get to Flume, they forward on to any number of scaled out Flume agents and finally make their way into HDFS. And once they're in HDFS, then we thought, well, now we've got to do something with them. We're not going to just cat logs out of HDFS, right? That's terrible. So we got Spark, again, because the internet says Spark. And so we now have a few tutorials and a few jobs where you can kind of smoke test these services together, from ingestion into HDFS to processing with Spark. And then on the other side, we're going to take a look at it from Zeppelin, since you guys hated the IPython notebook. So what we're looking at here, for those of you who haven't seen an Orange Box: it's, I don't know, do you want to give the rundown on the Orange Box? So this is a 10 node portable cloud that we take everywhere, because people don't believe, you know, you see a presentation on a laptop. It's like, hey, here's your cloud. They're like, I don't believe you. So these are real. These are 10 nodes. It's consumer hardware, though. So just pretend for these purposes, this is your cluster at work or your data center or things like that. But everything you do, each of these lights represents a node.
So when he starts firing off stuff, you'll see the power light as a node comes on and then the IO light blinking and doing things. So for these purposes, just pretend this is your cluster at work or whatever. But you can also do this on your laptop; this just helps illustrate, you know, what's possible. So the reason we broke there is just, again, with the networking and the amount of data that we're going to pull down to demonstrate this, I want to make sure that we have time before the session ends, so this will actually be stood up. What you're seeing here, this is our Metal as a Service. So again, this is just a representation of any hardware you may have laying around in your lab. MAAS is Canonical's offering to sort of talk to those boxes, talk to bare metal. It can put the operating system on there for you. It works really well with Juju. So those charms and services that I went through in the slides can deploy directly on top of those, and it works pretty seamlessly. What you're seeing here is just a list of the nodes that are in that box. What really gets exciting for you people on the right is that you'll see the lights turn on, and then you'll know it's working, right? People on that side, you'll just have to trust, and if you can't see the lights, come on over. So I want to flip to a terminal. Here we have, so we did a little bit beforehand, we just bootstrapped a Juju environment. What that means is Juju has a state server, something that knows about all the other nodes in a deployment, just a way to facilitate that communication. So we went ahead and bootstrapped that, and it's got a GUI of its own that you can look at and see what services are deployed. The real magic, though, comes in these, I don't know, 30 characters, right, or 40 characters. This concept of Juju deploying a bundle, and again a bundle is just a collection of charms. It's a collection of these services that we have identified as providing our big data solution.
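To make the processing side of this bundle concrete: the Spark job is essentially slicing and counting log events. Here is a miniature of that idea in plain Python, with an assumed, simplified syslog-style line format; this is a sketch of the concept, not the actual job shipped in the bundle:

```python
import re
from collections import Counter

# Assumed simplified RFC 3164-ish syslog line:
#   "Jan 24 10:00:01 node1 sshd[2211]: Accepted publickey for ubuntu"
LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s(?P<proc>[^:]+):\s(?P<msg>.*)$"
)

def events_per_host(lines):
    """Count parseable log events per originating host."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[m.group("host")] += 1
    return counts

sample = [
    "Jan 24 10:00:01 node1 sshd[2211]: Accepted publickey for ubuntu",
    "Jan 24 10:00:02 node2 kernel: eth0 link up",
    "Jan 24 10:00:03 node1 cron[411]: (root) CMD (run-parts)",
]
print(events_per_host(sample))  # Counter({'node1': 2, 'node2': 1})
```

In the real bundle the same grouping would run over HDFS files as a distributed Spark job; the per-host counting logic is the same shape.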
In this case again, it's analyzing syslog information. So if somebody hits enter for me on this, what we have at the bottom, this is the only command that you have to type. Once you've decided you want to use Juju and you've done our getting started page and installed the client, that's it, and it's pretty sweet. What you'll see up here, this is in a watch. You'll see machines start to get allocated. You'll see the light bulbs turn on up front, which signifies that MAAS has powered them on. It's going to install the operating system on there, and then Juju will start deploying charms, a bundle of charms in particular. Yes, sir? Okay. So I know you're going to go through all the details, but what about the configuration for the hardware? So how does it know about the 10 nodes? How does Juju know? Yeah, so the collection of hardware that you have is controlled by our metal as a service offering, right, so MAAS. Yeah, and what you'll do is you'll register your hardware. So the hardware that it works with, for example, are KVM hosts, right? It knows how to power on and off a KVM box. And it works with things that have IPMI or BMCs or whatever sort of power management controller you might have in your hardware. In the case of these things, these are NUCs, right? So I forget what Intel calls them, AMI? AMT. AMT, thank you. So MAAS has the knowledge of how to turn those things on, and so you register that as saying, here I have 10 machines and I want you to use them in this environment. Juju just comes along and says, hey, whatever substrate this is, I need a machine. I need something to install on. Nope, also free to set up. Also free to kick the tires with. Yeah, okay, certainly. Yeah. MAAS is just an example for managing a lot of bare metal; usually this is data center size scale. If you're using Juju with a bunch of machines, you can just say, hey, Juju, here's a bunch of machines to SSH to. Start using these now, yeah.
So you're not limited to using MAAS just for this bare metal, but it makes managing that metal easier. It kind of gives an API in front of your servers, as you would have with Amazon or Google or Azure or anything else. Yeah, that's a great point. So MAAS is nice when you want to be able to power that stuff remotely. But like you said, you could just have a bundle of SSH droplets from DigitalOcean and tell Juju that these are available to you. And MAAS also has a full RESTful API, so you can also drive it manually without having to actually worry about individual machines, if you wanted to get in there and do something. The question was, is MAAS 100% free software? Yes, sir. Oh, I should have mentioned in the beginning, everything you see here is 100% free software. What? Not the box. Yeah, no. How much is a box? So we don't sell these. A company makes them, Tranquil PC. They sell them. This one's about 12 to 15 grand in pounds, depending on what the exchange rate is that day. Yeah. But it's commodity hardware. If you want to build it at home, you can do it for a lot cheaper. You can subtract the whole tax. Yeah. If I wanted to test out the demo on my own hardware, obviously I don't have 10 machines or however many are in the box. Do I need MAAS in order to manage the VMs on my box, or does Juju manage that directly? Oh, that's a good question. So Juju can speak to certain pieces of VM infrastructure, mostly just LXD. So if you have LXD, which is the quote unquote hypervisor for LXC containers, which are just system containers like you'd expect to find, they're a little lighter weight. There is no direct libvirt or any other kind of communication layer. You'd have to use MAAS for that, or spin them up manually and then add them to Juju manually with SSH. And I have to install MAAS on those VMs and then enroll them? Yeah. So the question is, how does MAAS fit into this thing?
So you could actually just install MAAS on one machine, and then you tell MAAS, these are where my machines are, either via MAC address or via power credentials or via libvirt. It's very easy to say, here are my libvirt machine parameters; this is the machine name. So you don't have to do anything custom to these machines. MAAS will turn them on, commission them, get the hardware information about them, turn them back off, and then list in the GUI that these are available for you to install stuff on. So it's a lot less of a barrier there. Oh yeah, I'll just relay the question. So the question is, where do you put MAAS in this whole kind of scenario? Without getting too much into the details of it, MAAS lives to manage your DHCP or DNS for that rack below it. So it works best when you have maybe like a box where you have two NICs on it, one to the outside network and one to an internal switch where all the machines you're going to power on are. You can install it on the machine where you're running your VMs. It's probably the easiest. So you have a dev box with a bunch of VMs on there. It can manage the bridge there. So you could also put MAAS in a VM, managing its VM network. There are a bunch of different configurations for it. You can get more complicated as you start drilling down further, though. But there are different scenarios to support what you're looking to do, depending on what you want to do. Sure, so the question is if you could use Ansible or some other tool to create the VMs? Sure, so from a Juju perspective, Juju doesn't really need anything outside of the base operating system. So it expects to have a clean image of either Ubuntu or CentOS or Windows to manage on top of. So I'm not sure how using Ansible would work. I'd like to talk maybe a little further afterwards about how that scenario would look.
But there would be nothing, and maybe I'm misunderstanding the example, but there would be nothing for Ansible to provision, because Juju is just using a base operating system. So there's nothing on top of that operating system image that Ansible would need to put there that Juju would need to know about. If the machine's already on and running and SSHD is installed, which most are by default, you give Juju SSH credentials and it can connect to it. Yes. Juju connects to most every public cloud you can imagine. Yeah, sorry, I should have made that a little more clear as well. So this is just for bare metal, but Juju also speaks natively to Amazon, GCE, Azure, DreamHost, Rackspace, Joyent. If you can name it as a public cloud, it probably can talk to it. Yeah, but I see what you're asking for now. That would make sense. If there wasn't a cloud and you had an Ansible-provisioned thing at an IP address, you could do that with Juju. You could glue those bits together. I do want to make a point that afterwards, the charms are what actually execute to put the services and stuff on, and there it's up to the charm author to do what he or she wants to do. And that's where we recommend you use your Ansible, your Chef, your Puppet, and then you get the bits on disk the way you want them. Does that make more sense? My question is, when you write the charms, do you have to write them differently depending on whether you're on bare metal or a different infrastructure as a service, especially for address management? I know when I use Amazon, you've got to be very sensitive about whether they're public IP addresses. It's very finicky about it. So does this give you a uniform way of dealing with addresses? Absolutely. So the big data charms that we've written are totally cloud agnostic. They're metal agnostic. What we care about is that we are on the machine, and then we use things like hostname or whatever. There is logic in the charm that will say, what is my IP address?
Is it a routable address? Does it belong in /etc/hosts? And we will configure that as needed on any substrate that we deploy to. So the same charm, the same thing that you're deploying on that bare metal box, I have deployed on an Amazon box. I didn't change the charms whatsoever. I typed the same command, and it works the same. That is because the charm author, the person that has the domain knowledge of what a name node, for example, may want in /etc/hosts, has said: try ifconfig. Do I have it? Grok the IP address. Is that a valid IP address? I don't know. Okay, well then use hostname. Is that valid? I don't know. Then ask the kernel what my network address is. They've done the logic, if you will, to determine how it knows what IP address it has, so that you don't have to worry about it when you deploy these charms to different clouds. I mean, a flat network is trivial, but a lot of times you have a pretty sophisticated network. Even in a simple environment with a DMZ and F5s in the network, you've got to say what subnet this is spinning up on. So how does this interface with your networking? So Juju's modeling how to present these machines for you, but it also models how to do storage and networking alongside that. So say we're talking about using Juju against Amazon. Juju knows how to create and manage subnets inside of VPCs and attach those subnets to services you deploy. So the extent to which you can model, this service belongs to this subnet and should be able to bridge to services on that subnet, it can all be modeled with Juju. So that complexity is there. And MAAS also helps you model things like how do I manage which NIC goes on to what VLAN, things of that nature. So modeling that level of complexity past a flat network is something Juju allows you to do, using the native APIs of whatever service you're running against. And what about automation? Not just the NIC where it's bonded, but what network do you connect to?
Is that part of that? Sometimes there's a physical plug, or you have a virtual network on the VM where you're trying to figure out which VLAN to plug into. Right. So the question is about how far you go with automation, being able to tell where your networking stuff goes, and Juju does that, again, by talking to the native provider. So we have a VMware provider that knows how to speak vSphere and all the other components there, and you can use Juju to say, declare that this Juju construct correlates to this VLAN inside of vSphere, or this VPC inside of Amazon, and when I deploy services, I deploy them into these subnets or into these spaces, maybe one or more subnets or network VLANs or whatever that may be. That's possible there with Juju. But great question, though, fantastic. Yeah, these are great questions, but I do want to turn your attention to what has been happening since we ran this deployment, and I didn't start a timer, but I think it's been about 10 minutes. What you can see, remember we just ran the one command down here, and that bundle, remember we're getting to the syslog analytics, we're ingesting stuff with Flume and we're attaching Spark and going to run some visualization on it. That bundle was comprised of these services. Compute slaves, for us that just means both a data node and a node manager, and that's considered a slave, right? A couple of masters, there's the YARN master and the name node. Some syslog stuff to forward the logs on, some Spark to do the processing. So these are the services that are there at the top of the status output, and again this sort of ties into our observability; I can see what's happening. Down here, this is a representation of the actual machines. So our slave, for example, that thing that has the data node and the node manager, there are actually three units powering it, and I actually, we forgot to tell you to watch for the lights, but all the lights are on now.
But they were off and now they're on, so you know it worked, right? So what we're seeing here is the state each of these individual units providing these services is in. You can see our slaves are blocked. Why? Because the name node isn't ready yet. Why isn't the name node ready yet? Because we're still installing the Hadoop base. So once the name node becomes ready, it will send a message and say, who wants to connect to me? The data nodes will come along and say, I am for you, and they'll establish that relationship, configure themselves accordingly, exchange SSH keys, exchange /etc/hosts information, set the config, etc. And so at this point, the HDFS master is now setting up. We should be just a couple of minutes away from seeing what that looks like. And you can also just look at the pretty version of that if you want to. Oh yeah, fantastic. I forgot. When we deploy a bundle, it comes with a GUI, right, where you can see pictorially what services are being installed, and it has, oh boy, a lot of stuff going on here. It has some amount of color coordinated things, so you can see that Flume is doing something right now. It's white, yellow. And things will turn green. And again, this is the graphical representation. Before you say anything, I know there are a lot of elephants on the screen. We're working on how to differentiate what those elephants are. One of them is a name node. One of them is a YARN master, and we're going to put little hats on them or something so you can tell which ones are which. Yes sir, you up front. Random stranger. Yeah, when you move these little dots around, is it of any significance at all? So the question was, when I move stuff around, is it significant? No, this is the human version. So, you know, I like to order mine top to bottom based on what things are doing.
But you can certainly do everything that he did in the command line, and configure different settings and config files, and when you do this you can do a commit and you'd see who changed the cluster and whatnot, that kind of thing. Yeah. The state is in real time, however. One thing I like to do is have this up on like a big screen where everyone can see it, and then everyone's working on it and you can see what's happening, which is kind of cool for making a grand vision kind of thing. So I call this the manager view. I don't know if that gels with you there. This is my view, though. All right, so it, oh, I'm sorry. So I understand right now what's publicly available is version one of Juju and MAAS. And I hear a version two is coming out really soon. So if I spin up things on version one, how easy is it going to be to transition over to version two? That's a great question. I think you know what it is. Yeah, I just realized I may have given up the game a bit. So yeah, we are coming out with a 2.0. It is a mostly compatible version, but we are breaking some backwards compatibility with the CLI, with the UI, and with some of the deployment methods. If you're deploying on the latest version of Juju, 1.25.2, you should be okay as long as you're not doing anything with LXC containers, because that's changed between the 1.x and 2.x series. If you're planning to do something with Juju, to deploy something in the next month or two, I would recommend using the alphas. They'll be a little sharp and pointy as far as experience goes, but they'll be more in line with what will be coming out, what we'll be recommending everyone go to. But yeah, that's a good question. This is running everything on alpha, because we like the bleeding edge for demos. And it's April, it's not that far away. Did you say what the big features are? So your earlier question about networking spaces, that's something that's just landed in the alpha that's coming out in 2.0.
Another thing is the idea of how to manage models. So right now it's very expensive. For every topology that we model here, you have to have a bootstrap node for every single one. That's a machine that you have to pay for, that you have to take out of your cluster. In the next version we have this idea of models being shared on a single bootstrap node. So you create one node that manages orchestration, but you can do like a namespace segregation of topologies in there. So you don't have to do a very expensive spin-up. You have a GUI that you switch between; actually, go back to the GUI, Kevin. If you look at the top of the GUI here, there's a drop-down box where it says MAAS. You'd be able to create a new blank canvas: just type a new environment name, hit new, and you get an instant new canvas. So it makes it really easy to start spinning up different test and dev environments without having to go through the 5, 10 minute bootstrap process. That's a great way to do that. With this model you get access to this model, you get access to both these models, you only have read access to this model. So we're adding this idea of user ACLs with the ability to create these real quick disposable models. So this is your model. Your playground is completely isolated from everyone else, but as an admin I can see all of your models, while you're only confined to your one little deployment, essentially. That's a great question. There are a few other things as well. We're redesigning some of the command lines. We're making things easier. We're enhancing our storage support and making more clouds available through the tool. So those are the main components, and just kind of streamlining the whole user experience, correcting all the things that we had in 1.0. So we're breaking a lot of the command lines, making the API more robust, things like that. So great question. Yeah, you also have a question. No, that's a great question. So are we running MAAS and Juju on the same machine? Actually we are.
We are, in this case: node 0 in here, which runs the management node for MAAS, is also running Juju, but that doesn't necessarily have to be the case. As long as your laptop or whatever client you're using with Juju can connect to the machine, it can talk to it. So Kevin has a Juju client running on his laptop, which he has connected to all the clouds and bootstraps. So the Juju client just has to live on the machine where you are. In this case this box moves around so much it's easier to put the client on the box. That's a good question though, yeah. What about the concept of inventory, like Ansible has, and how do we deal with secrets? I'm just a guy in the audience. So you'll have to quickly explain what Ansible inventories are, because I don't quite remember what they are. Okay, sure. Yeah, so we have a bit of that here, or, you can't see it now, it's kind of scrolled off screen there, but on the GUI we have that idea. Juju doesn't really keep track of things across multiple deployments, so a deployment is a silo that manages itself and knows all the resources that are in there and where they are and what they're allocated to. With this new multi-model idea you could have multiple deployments all on a single node, where you could theoretically see across them and see how these machines are allocated. When we do things against cloud providers we also do tags and naming properly in those, so in MAAS we tag this is what service is on there, and in AWS we apply tags: these are the services that have been allocated to the machine. So from a management view, if you're using those dashboards, you'll see them as well. That's probably as close as you'll get to something like an Ansible inventory. On the other side, how we deal with secrets in Juju, well, it's still a really hard question to solve.
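For an inventory-style view inside a single deployment, `juju status` is the closest analogue to what's described above; a brief sketch, again with an echoing wrapper so it's paste-safe without a live environment:

```shell
# Inventory-ish views within one Juju deployment. `jj` echoes rather
# than executes, so no controller is needed to try this.
jj() { echo "juju $*"; }

jj status --format=tabular   # machines, services, and units in this model
jj status --format=json      # scriptable: pipe to jq for machine-to-unit maps
jj status --format=yaml      # same data, YAML form
```

Anything across deployments would come from the cloud provider's own tag-based dashboards, as mentioned above.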
We haven't quite settled on how we handle secrets. We've been toying with the idea that when you have to supply a configuration value that is a secret, you can declare in the configuration file that this is a secret, so for read-only users they would see stars or something of that fashion, but anyone with admin access would still be able to retrieve it. We're also building an idea of resources into Juju, where you can declare this is a binary blob or some kind of deliverable payload that goes to the service, and that could potentially be an encrypted blob of data as well. So we're still figuring that out. We're kind of watching how the other tools are doing this. I know HashiCorp's just come out with Vault, and there are some other things, so we're going to kind of see how that plays out and what best practices arise around there, because it's still a very tough thing to solve in general. But that's a great question though. There is absolutely nothing stopping you from saying, I deploy my own Vault, I go in and manage my secrets there, and then having a relationship that can distribute those secrets out. There is nothing there, of course. We'd like to eventually evolve that into a primitive Juju understands, so that it's right, exactly, but today there's nothing stopping that, of course. I think a lot of people would appreciate it if someone were to do that. We could definitely do that towards the end. I don't want to bogart too much of Kevin's big data talk. Yeah, so this is a big data talk? Not a Juju-MAAS talk, I don't know what you people are doing. I don't know what Notch is. I do not know what that is, still. Okay, okay. Yeah, so that's very valuable feedback, because we're interested in... there are so many services around big data, for lots of different facets: ingestion and processing and visualization and things. So it's very important to us to understand where the community is going and what services people are wanting in a big data solution.
So it's awesome to hear that kind of stuff. We would love to engage folks more, and we'll have some links about how you can get in contact with us if you note a service is missing, and that can certainly go on our roadmap. The most recent thing that happened: we were ingesting things into Kafka and sending our Kafka messages into Flume, and somebody said, why are you doing that? What's the point of that? Use Gobblin, right? Use the follow-on LinkedIn stuff. And we said, oh, thanks for the tidbit, and so we charmed up Gobblin for that reason, to put Kafka messages directly into HDFS. What I wanted to note, again, is that now all the services have become ready. I think we're about 20 minutes into that deployment. So what we've done here, though, I mean, it's pretty neat. We've spun up 10 or 12 or so machines, connected them all together, and made a big data sort of proof of concept that things are working. And just to show you the tidbit, I'll fire up Zeppelin and log in so you can see. I don't even see... I'm going to expose a couple of services. Oh, you're right. I'm sorry, I don't. But if you were in a cloud, by default you may see, here are some IP addresses. I'll show you on that Amazon deployment. These are obviously local to that box, but you'll see IP addresses and ports. We don't just open those up for you. You know you want the service exposed; by default they're turned off, and you expose those. I'm going to grab that: port 9090. Please silence your phones. At least turn the mic off. All right, so we just deployed Zeppelin. This is our interface into Spark. We're going to run a quick job just to verify that ingesting something has happened, that stuff landed in HDFS. We're going to sort some logs real quick: however many syslog messages have been generated. It won't be that many, again, because it's only been up for 20 minutes. But we'll get to see those.
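The expose step being described looks like this on the CLI. The service name and port match this demo's bundle (9090 for Zeppelin here) but should be treated as assumptions for other deployments; `jj` just echoes each command:

```shell
# Exposing a service: by default Juju keeps charm ports firewalled, and
# `expose` opens the ports the charm has declared. `jj` echoes, not executes.
jj() { echo "juju $*"; }

jj expose zeppelin   # opens Zeppelin's declared web UI port (9090 in this demo)
jj status zeppelin   # shows the public address and which ports are now open
# then browse to http://<public-address>:9090
```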
We as the charm author... I wrote the Zeppelin charm, and I added some default tutorials so that folks can smoke test things. It's real important to know: okay, it deployed, but does it actually work? So we're going to run the Flume tutorial here. Zeppelin, if you've never played with it, I highly encourage you to play with it. It is one of my favorite visualization projects so far. Ankash, you might know this offhand: do you know, is it still in incubator status? Okay. It's a fantastic Apache project. Bundled by Bigtop, by the way. We are very grateful for that. So we're going to run through these. I'll just go ahead and click the start button and talk through what it's doing. The first thing is Markdown. It's showing you that it has an interpreter for Markdown, in case you wanted to put some nice headings. What it's doing right now, shoot, I know this is going to be small. Unfortunately this is a... yeah, someone will want to mess with the light switch. Again, this is just a sample tutorial that we're using to demonstrate, to show you that Flume has worked. This is just a shell script that I'm running in Zeppelin that will run on the Spark unit, and all it does is SSH; it just guarantees we have syslog messages. These will time out and not show any actual data, but it will show the hit in /var/log/secure. So we know we have SSH messages here. It has finished. We've got... this is an HDFS ls, just to make sure that we've actually got some Flume data. This is today's date, so you know I'm not fibbing. Scroll down a little bit. We're going to make a... this is just showing some Scala and the Spark context ability of Zeppelin. And this one, jeez, this thing went fast. I know, right? So we're going to create a table, right? We've got all this log data now. We've ingested it into HDFS. We know that this is the directory, because when I wrote the tutorial, that's where I put these data bits. So that's coming out of HDFS.
We're going to make a temporary table to process on. And then we get to the neat stuff, right? The data visualization part. Again, this might be a little lackluster because it's only been up for a few minutes, but there's some su activity that's happened. Hover over that, jeez, I wish I had a remote. Mouse, doohickey. So what's that for? su's? Hover, I got it. Yeah, yeah. So there are our SSH hits, right? We did a loop of 10 SSHes, and this is just a nice way to show what's happened in the syslog on the machine that we were monitoring. That's coming out of HDFS. We've got some timestamp and visualization stuff here. These dates are not valid currently, but just to give you an idea, this is just a smoke test of the fact that all these big data components have talked together, right? We generated some messages. We got them into HDFS. We processed them with Spark, and we're looking at them with Zeppelin. Zeppelin, I just want to touch on this one real quick. This is what I really like about Zeppelin. There are tutorials that the Zeppelin folks have written that are neat to show off as well. Scroll up, scroll up. Oh wait, Control-V. I typed a space. Page up, page up me. Thank you. I know, right? How many people does it take to type in a thing? So would you click save for me, and then we'll run this thing. This is a neat tutorial because it shows wget-ting some stuff. Maybe you don't have log data that you want to analyze, but you do have a giant CSV file somewhere. I know a lot of people have large data sets from data.gov that are freely available to wget. Hit that play button, yes, sir. Again, this was a tutorial made by the Zeppelin folks, and it just shows how easy it is to interact with. This is what's running; scroll up a little bit. We're wget-ting some bank data from the University of California, Irvine, I believe; that's UCI. We imported some of that. Down, down, down. Again, we're going to make a temporary table, and here we're going to do some analysis of some banking data.
I think this is mortgages by the age of the people. I'll show you a whiz-bang neat thing about Zeppelin. So we've already done the analysis of mortgages owned by 30-year-olds. Let's make them, I don't know, 25-year-olds. We can rerun this live, and it'll redistribute that graph based on the data. So it's just a really neat interface into running Spark jobs. It supports Spark SQL, shell, Scala, PySpark, stuff like that. So that's our live demo on the Orange Box. Again, there's nothing in the box that's fancy. I mean, whatever. This is just a representation of metal. So don't put too much value on the fact that it's orange. It's just a box of hardware that you may very well have lying around. I do want to flip back to my stuff that's on AWS just so you know that, again, if you don't have MAAS, but you do have AWS credentials... and this will segue well into your thing. Meanwhile, any questions on the services that we just deployed? Yes, sir. DCOS? You know, I just went to a talk that compared Kubernetes and Mesos and a couple of other platforms for supporting big data. And I have to admit that I am learning along with you at Scale. So I do not have enough background on DCOS to compare and contrast with Juju. I don't know. I mean, you seem to have lots of answers. I don't know. I can answer your question. So when you look at things like DCOS and Kubernetes, you're modeling all your infrastructure as golden images. You're pressing and rolling them out, and it's an immutable infrastructure. The difference with Juju, if you're going to compare and contrast, is that you get a lot of the same features, but Juju is mutable by nature. It's designed to do long-running life cycle management of services over time. So it is a different approach to doing software delivery and maintenance over time, whereas with DCOS, you're rolling out application containers that you're recycling often.
Juju models a lot of the same infrastructure DCOS does, but it does it with a different methodology applied on top. So that's what I see here. Yeah. A comment on Mesos. I'm a little bit familiar with Mesos, not so much with DCOS, but essentially this is your data center resource management solution, rather than, say, just a deployment making sure that all the bits fit properly together and stuff like this, right? So I think Juju is solving sort of an orthogonal problem, where it guarantees the correct layout of all the bits and pieces of your stack, so that you push it out to a variety of providers and you get yourself a working solution, whereas DCOS is more focused on, if my little microservice died, how do I actually guarantee that it pops up somewhere else on a different node? And that actually leads to a number of problems if you have long-running services, like you said, right? So in the case of, say, HDFS, DCOS is actually in deep trouble, because data retention is something that you cannot solve easily. So it doesn't address data retention. I mean, basically, if your HDFS data node goes down, right, you need to rebalance your cluster. And this is a lengthy process. It takes time. Honestly, I'm not here to argue the benefits of Marathon, right? What I'm saying is that these are tools with different purposes. That's all. So as far as where Juju's sweet spot is: my understanding is that it's very good for initial deploys, but if you contrast it, for example, with Ambari, which does the whole life cycle and monitoring, where you could spin up Ganglia or spin up Prometheus or anything like that, what does Juju do? Is it well suited for doing application deploys and integrating with SBT or Maven and things like that, or is it really a tool just for infrastructure, not code deploys? That's another great question.
So one thing that Juju does, that when you see it, when you realize it, kind of transcends a lot of this stuff, is that Juju's not just inherently a big data tool. You look at Ambari: Ambari is, they've put all their expertise behind how to deploy a big data solution. We also use Juju to deploy, at scale, HA production-grade OpenStack. Of the telcos out there, the largest telcos are running OpenStack deployed and managed by Juju and MAAS on their bare metal, and it's the same primitives here. So in the same vein that Ambari is to big data, we're going up against things like Red Hat, which has its own Red Hat OpenStack deployer, and there's also... who's the other? Mirantis is a company that's like Ambari in that they just do OpenStack deploys. What we find a lot of customers and people really liking is that oftentimes you're not just deploying a big data solution; there are other things alongside it, and Juju can be that common language for how you model those services and how they maybe interconnect or don't interconnect with each other. So Juju doesn't just stop with deployment; it is also life cycle management for services over time, and it's done in a generic language that's not meant to be specific to any solution, but is the common language that all the solutions we've seen deployed are done in. Okay, so I guess there were two points. One of them is the domain you're treating here. Right, you're closer to Ansible, Puppet, Chef, Salt, et cetera, in that you're general configuration management and could do end-to-end data center, whatever it is, to integrate into. But the second, more detailed question: do you find there are fundamental issues that are different in doing initial deploys versus doing things that are more workflow-based, more deploy-based? Because one of the things we found is that with a lot of the deploy stuff, you want to have a lot more control over the workflow, as opposed to having a tool automatically generate the workflow for you.
So it's more kind of inherently sequential, although you could do it in a model-based approach. It seems like these tools kind of struggle a little, and then you have PaaS systems they integrate with, which really are specialized towards code deploys. Another great question. I'm not sure I can answer in as much detail as you've explained it, but we haven't really found anyone, especially among our enterprise customers that are using Juju and the people out in the community that are using Juju now, that has had that problem, where once they had that infrastructure deployed, whatever it is, whether it's a big data infrastructure or a container infrastructure like Kubernetes or OpenStack, they need to then start integrating with other PaaSes. They find themselves instead getting their vendors to start putting things into charms, because past the primitives of deploy, connect, and scale, Juju also allows you to do administrative-level tasks. As an example, how do you run a TeraSort against a big data cluster? Juju has primitives that allow you to say, here's how I do a TeraSort, or here's how I do administrative tasks, and it's modeled in that same repeatable fashion. Here's how I manage backups and restores and failovers and things like that. All that stuff still lives at the charm level. So charms do more than just the initial setup and deploy; they also cover, how, as an admin, do I manage this over time? Last question, a very simple example. Suppose you use Zeppelin to actually launch into the Spark shell. Would it be natural to automate, for example, doing spark-submits with Juju, so you could automate deploying your application through Juju? Our display dongle has failed us, but we're going to switch over to a different laptop. Okay, I'm sorry.
The question was: I can tell you bring up the infrastructure, but is it a natural or unnatural thing to use Juju also to automate, for example, the spark-submit, so submitting your jobs? Ah, fantastic, yes. So in the Spark charm that we have created, we have the ability to run certain actions, and under the covers it will call spark-submit. So for example, of all the default example stuff in Spark, one of the big defaults is SparkPi, right? So we have an action where you can say juju action run, or juju action do, on Spark. You can point it to a jar that you want to run, and it will put in the right spark-submit information; the right spark-submit information being, if I'm connected to a YARN cluster, that means Spark probably wants to run in yarn-client mode and let the YARN resources do it. So that's a flag on the spark-submit command. You don't have to know that. You just say spark-submit, and whatever mode you're in based on the deployment model, we will submit your job appropriately and stick the right flags in. All the way down to the number of cores: you can set on Spark how many executors, how many workers, things like that. If you have configured your model to say, even if I'm on a 100-core machine, I only want Spark to execute on 10 cores, we will acknowledge your configuration, and when you run a job through Juju, it will set the right flags to make sure that the submission adheres to your model. Yeah, so that becomes a big issue, or it can turn into an issue, of where you put those jars. If you're already sitting on top of HDFS, that's a nice location, because then whatever other service endpoints you might have, you may want those jars available to others, like Spark slaves or something like that. So let's put those in HDFS.
And we do have capabilities to put jars into HDFS, and then we have the option of setting Spark classpaths and things like that, so that workers and people will know where those are. So it helps to have a distributed file system to ease the question of how do I put this jar somewhere that everybody can get to. But yeah. Does this approach play well with something like Cloudera Manager? Yeah, so that's very interesting. When I talked about the pluggable model, the other side of the pluggable model... there are two kinds of people that we have discussed. One is the application writer that wants to write the new Hive or the new Spark. The other is the user that doesn't want to deploy Hadoop or package it or support it; they just want to use it. And then the third class of people are the people on that side with the diamond, who say, hmm, I want to make sure that I run well with Cloudera. I want to make sure that I run well with Spark, IBM, MapR, whomever. So by having this plug-in sort of interface, you can swap the core Hadoop out for something else. And then you can still have your services on the other end. So I know that this is a roundabout way to get to your question. We do fully envision that, if vendors are supplying charms to provide a Hadoop core, we will plug either side of the model, and make it as decoupled and easy to plug services into as possible. Specifically for Cloudera Manager, right? This is a service. Was it rebranded Cloudera Director, or is it the same, a similar thing? Yeah, so this is the thing where you install this management console, and then you can install other services, like I want Cloudera Hive and whatever, in your deployment. So we've actually started work on charming Cloudera Manager, so that you would deploy that and then let Cloudera Manager take over and do the actual service deployment from within it, if you wanted to.
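The jar-placement and action-driven spark-submit flow described above can be sketched as follows. The `sparkpi` action is the one named in the talk; the `spark-submit` action parameters and the HDFS path are illustrative assumptions (check `juju action defined spark` in a real deployment). `jj` echoes rather than executes:

```shell
# Driving Spark through Juju actions, per the Spark charm described above.
# Action and parameter names are illustrative assumptions. `jj` echoes only.
jj() { echo "juju $*"; }

# Put a jar where every worker can reach it, via a command run on the unit:
jj run --unit spark/0 -- hdfs dfs -put myapp.jar /jars/myapp.jar

# Run the stock SparkPi example; the charm picks yarn-client mode, core
# counts, and so on from the model's configuration:
jj action do spark/0 sparkpi

# Hypothetical parameterized submit of your own jar, then fetch the result
# (quote the placeholder so it is not treated as shell redirection):
jj action do spark/0 spark-submit jar=hdfs:///jars/myapp.jar
jj action fetch '<action-id>'
```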
And we haven't finished that, but certainly, just like I mentioned before, we're interested in what community folks are looking for. So we have heard Cloudera Manager in the past and have started charming that and figuring out what it looks like in a Juju environment. But it's still in the works, if you will. A little comment on Cloudera Manager. I'm intimately familiar with that. Uh-oh. And Cloudera Manager is not compatible with the standard Linux life cycles of the daemons. You have to write these little stupid Python wrappers in order to actually control the stuff you do there. Yeah, well, non-free, I'm not even getting there. It's technically inferior. All right, real quick. So I want to announce something before we wrap up. First of all, thanks for coming. We had lots of really great discussions. Was this valuable to you guys? Not? Okay, nobody hates us. Good. All right, so we have this thing called developer.juju.solutions. That's the website you see right here. And we want you to play with the stuff that we showed you here. So you go here, you fill out what you want to use Juju for, whether it's a little research project, you want to kick the tires, whatever. We give you up to 10 nodes. Okay, right. So by default, we give you like 10 m3.xlarges or something like that. And we auto-reap in 24 hours, so you can mess around and do all that kind of stuff. But if you want to do something bigger than that, something more serious than that, talk to us, because we want to ensure that we're reducing the friction it takes to get you guys to do your stuff. So if you want to do a benchmark on a certain piece of infrastructure or do anything that would be interesting, please talk to us. We're dying to write up your story and tell the internet how we helped you do something incredible. So again, the... Sorry about that. The URL is developer.juju.solutions. And where can we find you, the big data team specifically? Yeah, so we hang out on Freenode a lot.
There is just a generic #juju channel. So if anybody's on IRC on Freenode, just pop in and say, hey, I want to talk about Cloudera Manager, for example, or whatever you want, a service that we may be missing, and then we can fork off into a channel and we're happy to chat with you there. We do have... yeah, so we've got this blog up. This is where all our... first of all, it's our blog of what's going on. We'll have a write-up of what we've learned at Scale thus far. But there's also, if you get started, I'll show you the repository of all of our charms, all of our bundles, how you use them, how you deploy things like that. So our blog is bigdata.juju.solutions. The free AWS creds are at developer.juju.solutions. The slides have all this information in them, and they're on the Scale website, or they will be. So don't feel like you have to take pictures or remember things. And finally, if you like Juju and you say, hey, man, I want to give this a try, there's our Get Started page. Again, it's in the slides, so you're welcome to pull those down. This is just, here's a couple of... type this in and you'll have the Juju client, and now you can deploy a syslog analytics bundle. So I'm sorry we weren't able to show the AWS side, because we had a foobar and now we're out of time anyway, but I'd be happy to show that off. Believe me, it works in the clouds. You don't have to rewrite the charms for the cloud. It's the exact same bundle on any substrate that we deploy to. Is this the same demo, or were you showing different functionality other than that it works in the cloud? That it works in the cloud. We also have some benchmarks. Actually, this guy here did a lot of work on benchmarking. And so we have this concept of, let me run a TeraSort with 10 node managers. How does that run? Let me run it again with 100 node managers. How does that run? Because we have the ability to scale relatively easily.
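The "type this in and you're going" flow from the Get Started page boils down to a few commands. The bundle name here (`realtime-syslog-analytics`) matches the syslog analytics bundle the team published around this time, but verify it against the Get Started page; `jj` echoes instead of executing:

```shell
# Minimal "get started" sketch: install the Juju client, deploy the syslog
# analytics bundle shown in the demo, then scale a benchmark. The bundle name
# and add-unit target are assumptions; check the charm store. `jj` echoes only.
# sudo apt-get install juju   # (or add the PPA from the Get Started page first)
jj() { echo "juju $*"; }

jj deploy cs:bundle/realtime-syslog-analytics   # same bundle on any substrate
jj status                                        # watch the services come up
jj add-unit -n 9 slave                           # scale from 1 to 10 workers
```

The last line is the scale step behind the "TeraSort with 10 node managers versus 100" comparison: add units, rerun the benchmark, compare.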
So we scale up, run another benchmark, check the status. Or, I'm in AWS and, man, it's expensive. Let me try GCE. Oh no, that's more expensive. And so you can sort of move around clouds. You can go back and forth between local and cloudy environments to see, from a benchmark perspective, what that looks like. And so all of our bundles have the capacity to have a benchmark charm attached, and that just gives you free benchmarking of those workloads. He said yes. We are out of time. I don't know if there's a new speaker here. Thank you so much for the great questions. This was awesome. Really appreciate you. And the slides will be online shortly. Thank you so much.