If you could get up close, it would be better. Yeah, we've got a vintage projector here, 1980s. All right, less than a minute. Three, two, one. All right, time is up. So good morning, everybody. Good afternoon, almost. I almost didn't make it here today. Last night I spent seven hours at SFO. My flight got canceled and I switched terminals, but I made it. I got a Virgin flight and got into my hotel room at 1:30 AM. I'm happy to be here, though; I've really been looking forward to this, and I was going to drive if I had to. My name is Bruno Terkaly. I'm a principal software engineer at Microsoft. Let me do a quick bio and describe what I do there. By trade I'm a software engineer. I'm also an O'Reilly author, with my second class coming out; these are long classes, about eight hours. I work with ISVs who are migrating to Azure, so the customers I work with are companies like Docker, GitHub, Red Hat, and Mesosphere. You might have heard of them. I'm in Silicon Valley, which is where all these companies are, and I help them bring their software onto our platform. I also work with the leadership team at Microsoft: once in a while I'll get introduced to a company we're thinking about working with, acquiring, or investing in, and I evaluate their software to see if it makes sense for our portfolio. One of the things I'm passionate about, and that I'm going to talk about today, is the world of containerization and distributed computing, which I think is the next big thing. One day we're going to laugh at the fact that we named our computers, that we actually gave them cute little names. It just doesn't scale. We'll talk quite a bit about that at the end. Just by show of hands, how many of you here are developers or software engineers? OK, about half. How many of you are IT pros, admins, sysadmins? Then this is the right place for you; I plan to cover both audiences. This is one of the classes I've got. In it I do a bunch of stuff with Java: I show you how to install Redis, Postgres, MySQL, Mongo, all these different open source packages, and then how to develop software to do your CRUD operations against them. I've got another one coming up on Java-based web services as well as containerization, which is a bigger view of what I'm going to show you today. OK, excellent. Today's goals: what is Azure, and what is the portal experience like? We'll take a tour of a data center, which I did recently in person in Dublin, Ireland. Very exciting. I'll talk a little bit about provisioning Linux-based infrastructure. Lots of hands-on. I can never count on the networking when I'm doing these talks, so I've recorded the demos and they'll march right through. It'll be a very visual, fast-paced hour. You will not be bored, and I'm not going to kill you with PowerPoint all day; I'm going to show you things in action. So if it's OK with you, I'm going to jump right in. When you think about our data centers, and I'll show you the map later, we have 24 global data centers. And if you look at cloud computing today, there are three big cloud providers. One of them is a retailer.
One of them helps you find things on the web. And I'm going to talk to you about the Microsoft data centers here. Notice we have basic infrastructure: compute, storage, and networking. When you think about compute, you think about virtual machines, but what we're seeing more and more of, and this is an amazingly big phenomenon, is the world of containerization. A lot of people think that Docker invented this, but it's actually been around for a decade and a half, all the way back to the early Solaris days. This is the brave new world. Everyone is doing it, including enterprises in test and dev; we have yet to see it mature in production environments, though there are a few thought leaders doing that. If I were a betting man, I would say in a year or two, at most three, the world is going to be running containers in production. You can see the writing on the wall, so if I were preparing for the future, I would be thinking distributed workloads. That's where I'm putting my career, that's of interest to me personally, and that's what I'm going to talk to you about towards the end of this session. Then think about storage. There are a lot of ways to store your data, and we don't need to enumerate all of them: blobs, tables, queues, NoSQL, and of course relational SQL, which is a big deal and why I did a class on it. Then there's networking. Traffic Manager is one product: say you have two data centers and one of them goes down. Wouldn't you want your customers routed to the other data center where you've replicated your data? Or a customer wants to get to your application and you want the fastest data center. Traffic Manager can help you do that. ExpressRoute is for when you want to hook on-premises to the cloud but not go over the public internet; it's a direct connection with 10 gigabit throughput. So that's some of the basic infrastructure, but I really think the future is more about platform as a service. For a long time, developers have loved to log into their VMs and tweak them and set them up, but the trend, of course, is a more abstracted perspective. If you think about some of these offerings, take Web Apps under web and mobile: that is Apache Tomcat or IIS running as a service, and that's what my class shows. You deploy into that environment, and the Azure fabric controller automatically scales it, maintains it, updates it, and patches it. You don't sit there and worry about a web farm and load balancers; you let the infrastructure do that. So I think the trend is going to be more platform as a service, and specifically you're going to see orchestration offered as a built-in service. Microsoft now has the Azure Container Service in preview, and a lot of providers are going that direction, obviously. So that's one of the big directions here. What else do you see here? When you think about the categories of software, obviously the Internet of Things, machine learning, and data science are a big deal. I think some of these pillars are fairly obvious to you. What I want to do is show you at the portal what some of these might look like. So let's take a quick tour of the Azure portal. I have it up right now, but I have a little video I can show you. Here is the Azure portal, and imagine I want to provision something new, something Linux based in our example. Obviously we support Windows and Microsoft workloads, but I'm here to talk to you about Linux.
So here are those categories, the same ones we saw before; that's how we organize things so you can find them more easily. Notice all these container apps. Support for containers is built right into Azure: Nginx, Redis, Postgres, the most popular containers out there today. Now, I work with partners, and part of what I do is help the Clouderas of the world get their infrastructure running in Azure. So you go here and say I want a Cloudera cluster, or GitHub Enterprise, or Hadoop, and it automatically provisions; you don't sit there and configure it by hand. That's the whole point of the Azure Marketplace. There are lots of Linux-based workloads, and I'm going to give you a comprehensive list of how all that looks. Maybe you want an Ubuntu server. Maybe you want MySQL clustered with Percona. I'll talk about that as well today. Let's take a look at one more. Say I want a new Ubuntu VM. Now, obviously you may want to do things at the command line; you probably aren't going to go to the portal every time. We're supporting a lot of the distros, FreeBSD and the like, and there's a lot of activity on onboarding Linux distributions onto Azure. So let's say I search here for Ubuntu, because that's what I'm interested in provisioning. You're going to have a few choices to make. Which data center do you want it in? How big a machine do you want? How big do we go? We go up to 32 cores and 448 gigabytes of RAM, plus a bunch of networking options. You can choose SSD storage for super fast IO, and you can have, say, InfiniBand throughput on the networking side if you like. Here you give the VM a name, and it will resolve to some DNS name, or you can give it a public IP address, and I can show you how you connect to these; it's pretty straightforward. I'm going to call this VM ScaleX and give it a username and password. Nothing interesting there, really. Size is the next option, number two, and there you are essentially defining the size of the VM. Maybe you want to calculate, well, what did you see in the paper? They calculated a prime number to 22 million digits. Maybe you need the power of a G5 or a set of G5s. These are all the machines we have available. DS series means SSD storage, but we have ones optimized for high CPU, ones optimized for high memory, and so on, depending on the workload. So here's the G5. That's the big screaming machine we have. It doesn't come cheap either, because it's a heck of a piece of hardware, but that's the one you might want for, say, deep analytics, machine learning, or high compute workloads. So this is essentially one way to provision your infrastructure. But all of us here are probably using Chef or Puppet or SaltStack or the command line and Python scripts; that's where we're headed as an industry. This is for you, the developer, spinning up a VM. Let's go on to the next topic. I was recently in Dublin, Ireland, and went on the data center tour. Has anyone here been on a data center tour? It's super interesting to go on those, incredible to watch what they've done. Let's go take one together right now. This is the global footprint of Azure today: millions of servers everywhere. Now, when you think about where we put these, there are a few factors, like 30 of them.
But the main ones would be proximity to customers, the availability of talent, the bandwidth and networking capacity available to you at the time, and then the other one you might not think about, which is energy. You need energy to run these data centers. So there are a bunch of objects in the cloud, and it's growing very fast today. Everyone is moving to the cloud, some reluctantly, but it's clear that we're moving there. Tons of fiber. There's the Dublin data center, and it's a very automated location. The ratio of machines to people is mind-boggling, and that's why they're so cost-effective. From 2004 to 2007, we began to design our own large banks of batteries to ensure electricity; here you can see the battery backup for the event of a short-term power disruption. If you do have a radio, please shut it off. Emergency generators provide backup power for extended outages and for planned maintenance. I'll fix that right now. There we go. Apologies for that. This thing generates about one and a half megawatts. These are very modular data centers. A lot of it is on the roof in Dublin, because they have the perfect temperature, between 20 and 80 degrees, all year long, so they use swamp coolers; they don't even have traditional air conditioners there. The Chicago data center works off these modular shipping containers, pun intended, and inside each of these are thousands of VMs. When enough of them go bad, we actually replace the whole container. Talk about modular architecture; that's the way the Chicago data center is set up. These data centers are very secure. Getting in, there are biometric checks and surveillance cameras that record everything. Here you can see the cooling systems on the roof. They literally open the roof up to let the cool air in and the hot air out, so heat migrates upwards. We went out on the roof; it's pretty mind-boggling. After seven or eight years, these data centers don't get retrofitted or upgraded. They tear them down and build new ones. That's how quickly the technology and the science here are moving. Very secure, not so easy to get inside. Even though I was a Microsoft employee, I had to get a background check and show them my passport. Pretty safe. For each customer that has a workload running, even if you were to make it in, you wouldn't be able to find it; not even the employees know which customer is running on specific VMs. So it's obviously highly automated, and I really enjoyed that tour. The diesel generators alone were amazing to me. The coolant in these diesel generators is diesel, and they run them about once a week. Apparently those generators can power the data center indefinitely, as long as they have access to diesel. There's a lot of compliance, of course. Another reason to have lots of different data centers is all the laws around compliance; in Europe they're very concerned about privacy laws and so on. So when you think about security, here are some things Microsoft takes very seriously. We invite you to try to hack into your own VM, but if you decide to do that, you need to let us know, because otherwise we're going to shut you down; we might think you're a denial-of-service attacker. So do go try to break in, and we invest heavily in this, obviously. There's a lot of interest in being secure at Microsoft. Okay, deployment.
Now, historically, when you've been deploying your stuff, there have been two approaches. There's the imperative approach, where you write a script and go through and program things out, and then there's the more declarative approach that you're seeing here below, where you define a JSON file, or a YAML file in some other environments, to say: this is what I want. The way to think about it is, do you want to define the end result, the blueprint, or do you want to actually go through all the steps? Because the problem with the scripted approach is, if halfway through something goes wrong or you made a mistake, how do you go back and clean things up? The other challenge is, how do you do things in parallel? There are companies now spinning up 1,000 or 2,000 nodes at a time. If you use the declarative approach, the fabric controller can notice, hey, you want 1,000 VMs, let me do them in parallel. If you use the programmatic approach, you can't optimize like that as easily; you typically go one at a time. So this is another big investment from Microsoft, and we'll talk about it a bit today: how do you provision your Linux infrastructure in a public cloud using this format and these tools? When you think about all the templates, all these blueprints that we make available, this is the list. I had to write a macro to animate all of these; I wasn't going to go one by one. So these are all the templates that we have, and if we go search for some of them, let me see if I can find you one. If I go to the Azure Quickstart Templates, you'll see hundreds of templates that define various aspects of applications you might want to provision in Azure. Let's do a quick demo of what that might look like to jumpstart your provisioning process. It's all there on GitHub. You can just do a git clone locally to your machine, get it all there, and modify it to your heart's content. I'm the one who worked on the GitHub Enterprise template. There are two main files, a deployment file and a parameters file, and you just pass them in at the command line. The parameters file holds things like the username, the machine name, the machine size, things that change every time you do a deployment. The deployment file holds the resources you want: the storage, the networking, the operating system, all the core stuff that makes up your deployment. We also have a Deploy to Azure button so you can go from there straight to the portal automatically. So let's say I want to do a MySQL cluster. What's involved in that? How do you do it? Well, because we have good connectivity here today, I'm impressed, I'm able to actually do a little hands-on, which I'd rather do. Here, for example, is the MySQL deployment. At the command line, and I can show you how you do some of this, you have the deployment file, which is all the things that make up the deployment. So if I search for virtual machines here, let's go to the raw view so you can get a better picture of this. If I search for virtual machines, you'll see the virtual machines that I'm going to provision. Notice they're passed in as variables and parameters, the network card and so on. And these are the custom extensions; from these templates I can actually execute scripts after the provisioning takes place.
And this is good if you want to do things like create a database or set some permissions, code to execute right after you provision the VMs. So here are some of those extensions. Now if we go back to the command line, maybe we log in over here. So here's the GitHub Enterprise template, the one I worked on. If I look for azuredeploy over here, you will see my azuredeploy file. Let's take a quick look at it. We'll go to the resources section, because there are three sections here: parameters, variables, and resources. Resources is where you define your storage account. Let me put some line numbering in here. Notice on line 85 we've got the storage account, your public IP address, and so on. You specify your networking, all the stuff that represents your deployment. And then on the command line I have a file here, I'll just bring it up, deploy.sh. You just say: go ahead and create, put it all in a resource group, which is just a name where you group everything together, put it in the West region, and here's my deployment file that we just talked about. You pass in the deployment file, which is where you define your infrastructure, and the parameters file, which is the things you want to change, like the name, the location, and the hardware. So to sum this up, what I'm showing you is the declarative approach we take to provisioning. Let's go back to the deck and these templates and take another quick review tour. There's the command line that I showed you, and there's the deployment file with the resources, which is what you want to build out, and you essentially execute it. The basic structure is as follows: it's a JSON file with three sections, and the three sections are parameters, variables, and resources. This might be a TMI scenario where you say, hey, there's too much detail here, Bruno, but it's useful to know that there's this declarative approach to building out your infrastructure in these three sections, and there I explain what those three sections do. Resources is where I'd actually type things out, so I'd go and edit that resources section to add my networking, my compute, my storage, and anything else I want, even scripts I want to execute automatically after provisioning. That's the Azure Resource Manager. Okay, so deploying GitHub Enterprise: I talked about that one, it's pretty straightforward, but I will show you how you could remote in, since remoting in is what many of us developers or admins do day to day. I was going to walk through that, but since I had network connectivity I was able to show it to you in person already by going to that folder, so I'm going to skip past it here. Let's talk a little bit about MySQL. That was the template I showed you for a MySQL cluster. If you think about the infrastructure we want in this case, you're going to have your application tier and your web tier, whatever that might be, and what this template is going to do is build out the lower infrastructure, the data tier: three load-balanced MySQL VMs in a Percona cluster. The template I showed you has to provision a few things, and it provisions all of that in one template. So really, in one command I can build all that infrastructure out using this template mechanism.
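To make that workflow concrete, here is a rough sketch of the clone-and-deploy steps being described, using the cross-platform Azure CLI of that era. The resource group, deployment name, and template folder below are made up for illustration, and flag spellings can vary by CLI version; this is not the exact script from the talk.

```sh
# Grab the quickstart templates locally so you can edit them to your heart's content.
git clone https://github.com/Azure/azure-quickstart-templates.git
cd azure-quickstart-templates/mysql-ha-pxc    # hypothetical folder name; pick the template you want

# Deploy declaratively: the template file defines the resources,
# the parameters file holds the values that change per deployment.
azure config mode arm
azure group create myResourceGroup westus
azure group deployment create \
  -g myResourceGroup \
  -n myDeployment \
  -f azuredeploy.json \
  -e azuredeploy.parameters.json
```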
So I think that's really the takeaway when you think about the Azure Resource Manager: I can start building out this infrastructure with these JSON files at the command line. And there's a tool called the Azure cross-platform CLI. If I type azure by itself, you'll see all the commands that I can execute. I could say, for example, azure vm list, and this will go ahead and list out my virtual machines, provision new machines, et cetera, and obviously with this I can pass in those templates as well. So I can do it manually, the imperative way, or I can pass in these big, complex templates. I'm moving along quickly here. Is everyone okay? Are we in good shape? Okay, so I wanted to quickly show you at least the clone command. The way you would start working with these templates is as follows: go to the GitHub repo, grab the endpoint, go to the command prompt in some folder, and do a git clone of the templates. Once you do that, all those hundreds of files are stored locally on your machine, you can go edit them, and you're ready to go at that point. So I'm going to go to the azure-quickstart-templates folder now that it's been cloned, and you can see there's just about everything here that you could imagine. In our case, I was talking to you about the MySQL Percona cluster. I just go into that folder and you can see the files that make it up. Those are the files I go edit. Now, notice that azurepxc.sh shell script there; that's the actual, let me go back a second, that's the script that executes after provisioning on each of the VMs. That's the mechanism that allows you to go in after deployment and do some work that might be relevant for your deployment. It's called a custom script. And then, of course, there are the template files, which I talked to you about as well, including the parameters part, the name, the username, and the various things that you want to be able to pass in during deployment. Okay, great, I talked about this one already; let's go to the next slide. So why am I showing you a picture of a truck with Gerber baby food in the back? What's the point here? Anyone know where I'm going with this revolutionary technology? Well, this is essentially the birth of the container, back in 1956, when this was the breakthrough of its time in the world of shipping. And clearly, containers have been a breakthrough in our industry as well. So, Docker and containers. How many of you work in the space of containerization in your role today? About half of you are dealing with this, so you're going to find it's one of the growing trends, and the reasons are pretty simple. I'll get into the value proposition in a second, but the ability to break things down into running containers is changing a lot of things, starting with the speed of deployment. Let's get into the things that are enabled by containers, some of the value proposition. You can run these apps in isolation, and I'll give you a lower level diagram of that in a moment, but essentially you can deploy app A and app B, each with its own dependencies, and they don't interfere with each other. They might be using different glibc libraries, and you can run them side by side without a conflict. That's a big benefit. How fast do these containers fire up? Seconds, compared to VMs, which in general take a minute or two at best.
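If you want to check that startup-speed claim for yourself, a quick way is to time a throwaway container; this is just an illustrative experiment, not something shown in the talk.

```sh
# Time how long it takes to start (and exit) a tiny container.
# Assumes Docker is already installed; the very first run is slower
# because the image has to be pulled, so pull it up front.
docker pull alpine
time docker run --rm alpine true
```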
So: running apps in isolation, abstracting the plumbing, democratizing distributed apps, because now you can start bundling them together. We'll talk about that later; I'm going to show you Docker Machine, Docker Swarm, and Docker Compose. And getting into production a lot faster. That's really the big one: if you read about the value of fast deployment, there are studies showing that customers are happier, employees are happier, the software has fewer bugs, and you're able to innovate more quickly. So it's more than just cost; it's actually better for your business to be getting to production more quickly. If you take a look at this diagram, I'll talk about microservices architecture in a little while and how Docker is really paving the way for it. These are some of the points I just raised, but at the end of the day, it's about microservices. This is the move away from monolithic applications, the three-tier architectures we've all been working on. Companies are now breaking things down into microservices. That means one of my dev teams might be working on the notifications for taxis. I thought I shut my mail off, excuse me, sorry about that. I might have another team doing payments and another team doing passenger management, and it lets me really break down a complex problem. It also enables other things: I can update the payments service without affecting other sections, so it enables faster deployment. So this is the other giant trend happening in parallel with containerization, I think: this notion of connecting these services up with HTTP and RESTful APIs to bridge them together. Now, there are obviously downsides to some of this. You could argue it's more complex by certain measures, but again, this is the trend we're seeing in the industry; people are moving to this type of architecture. The points brought up in the lower right of the slide are the main reasons. So when you think about a Dockerized app, and I know you're not here to listen to me talk about Windows, but Windows Server 2016 will support containers as well, releasing sometime soon. It's just the way people are going with their technology today. When you think about Docker containers, we'll talk about how they can run anywhere. This is the new architectural style I was talking about, microservices. Now think about all these images you can go get from the Docker repository; you can use that same repository from Azure, and people just download these images. When you run these images, they become containers. That's the vernacular here. These are all the available images. So if I want to stand up Nginx, I just say docker run nginx. If I want to run MySQL, docker run mysql. We have yet to see the proof that you can run MySQL and Postgres at high scale in a containerized world. This is good for dev and test, but the world has yet to see whether it's going to work in production at high scale for databases. We know it meets the need for web servers, for the most part, because you can just add more containers. But these are the apps you can go get today from the Docker repository, and you can reference them from inside an Azure template like the ones I showed you before. So containers can also be deployed through the templating mechanism. This is just another reiteration of microservices here.
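For reference, the docker run commands mentioned above look roughly like this; the container names, port mapping, and password are illustrative details, not from the talk.

```sh
# Pull and run Nginx, publishing container port 80 on host port 8080 (port choice is illustrative).
docker run -d --name web -p 8080:80 nginx

# Run MySQL; the official image requires a root password via an environment variable
# (the value here is just an example).
docker run -d --name db -e MYSQL_ROOT_PASSWORD=example mysql

# List the running containers.
docker ps
```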
And in a moment we're going to take a look at how you provision these and run them at scale in a cluster. Again, it's the new architectural style we're seeing in the way people are writing their applications. It's all about being able to update your app and get it into production more quickly. That's perhaps the main motivator here. So when you think about virtualization, there have been two perspectives, right? Do you use just virtual machines, or do you use containerization? What's the difference between the two? Well, with virtual machines, if I want to start up my apps, I have to start them in different VMs, and I spin up a new operating system on top of my host OS. That guest OS in VMs one, two, and three means three copies of the operating system have to boot. That's why it's slow. With containerization, all the running containers share the host OS. You're not firing up a whole operating system every time you want a new app with its own binaries; you're sharing the operating system among the containers. And that's the real value proposition here: you have one operating system with containers, but three operating systems booting with virtual machines. Your question might be, well, Bruno, I thought you run containers on virtual machines. We'll get to that next. You can run these containers on bare metal, in a cloud, or, say, in your own private data center. When you think about these virtual machines running in a public cloud, you have your cloud-hosted server, your Ubuntu box. Underneath that you have the host OS, and in Azure's case it's Windows. Yes, that's right: when you run Linux in Azure, it's running on top of a Windows hypervisor environment. So the hypervisor steps in, and there's a Docker extension that is part of that hypervisor. Then you have your guest OS, which could be Linux, could be Windows, could be both at some point. Now, on top of the Linux guest OS you have the Docker daemon. When you install Docker, a little resident executable, a daemon, if you will, runs, and the containers are managed by that daemon. That daemon communicates with the Docker extension to make it all happen. And then app A and app B can each be running in a container, although in general you run one app per container, and another container can be running other apps as well. So that's the high-level architecture of a running container in a public cloud today. You might argue that there's double virtualization happening, and there is. I think the primary reason is security. The world is still trying to work out the security model for containers, and whether, in a multi-tenant scenario, you're comfortable with your workloads running alongside someone else's with only a container boundary between them. So if you think about what you're really getting here: notice that in the containerized world, app A and app B each have their own dependencies, their own binaries, their own libraries, and they don't have to be synchronized. They could each have their own version, maybe different by some small thing that would make app A and app B incompatible if they ran on the same VM. In the world of containers, they each bring along their own version of the binaries and still share the operating system. But if I want the same isolation with just virtual machines, I actually have to do what? Bring up another VM.
And that's where the slowness comes into play: you have to bring up another whole VM just because there are application binaries or libraries that are not compatible. I talked about these. Yes, sir. So the hypervisor has an agent that manages that conversation with the Docker daemon, in terms of its visibility and orchestration in the cloud? Right, and I'll talk about the Docker daemon getting installed in a moment, but yes, there will be a swarm agent in that environment, and if you're doing Swarm, both the swarm agent and the Docker daemon will be running on the Linux VM. I'll demonstrate that actually; there's a pretty cool demo coming up in a minute. Thank you. So that brings up the next point: orchestration. How do you decide where these containers go? You have 100 machines in the cloud. Do you really want to think about where those containers should go in that cloud? Now, there's a lot of technology that lets you define affinities, like I want my cache with my web server, or I want my WordPress on the same server as my database. You can set up those affinities, and I'm not going to get into all those details, but in general you want to just say: go deploy this. I don't want to have to worry about what machine is available, or how much room I have left on it because of other containers, and so on. So that brings up these technologies, and I would add to this list the Azure Container Service, which is in preview today. These are some of the big products that orchestrate the running of your containers in the public cloud. In fact, I just did a video with Mesosphere that's going to be released soon, around orchestrating Spark and Kafka and a large workload, automatically orchestrated by their software. When you think about this space, Docker Machine is the approach I'm going to talk about today. I'm not going to get into some of the others, because I have some demos here that might be interesting to you: Docker Swarm, Docker Compose, the Azure Container Service. Machine is going to let me set up Docker on any number of hosts; it helps me provision them on a bunch of raw machines. Docker Swarm lets me have a clustered network. Docker Compose lets me define the way I want my applications to be bundled and deployed on that network. So the great takeaway for you today is to have a working knowledge of what Swarm, Compose, and Machine do, because they're fairly significant in the world today. Now, you could argue that some of the other products out there are competing for this. The world is still determining who's going to own orchestration. Is it going to be Mesos and Mesosphere? Is it going to be Docker? Among the other public cloud providers there's Kubernetes, and Amazon has their container service API. Who's going to control orchestration in the future? That is yet to be determined. So this is what I want to demonstrate to you today. Step one is, how do I provision the VMs here? How many do I have in this particular setup? Well, I have my client, but there are really four I want to provision: the Swarm manager, think of that as the master node, and then three slave nodes, or Swarm nodes. My running containers are going to be placed on nodes two, three, and four. Notice I'm also going to want to install the Docker daemon and the Swarm agent on each of them. But the one that's going to do all the work for me is the Swarm manager, or Swarm master. So I'm going to basically say to the Swarm master, and I'm going to show you this: go run my containers.
It's going to figure out everything for me. I don't have to think about these three nodes; to my software, it looks like one big machine to run my workloads on. So that's what we're going to look at first. The first step is to use Docker Machine to provision these four machines in Azure. So let's look at that. I'm going to build up a little shell script that shows you the various commands. I'm going to need to put in my Azure subscription ID, and I'm going to create the master and the three nodes. So what we're doing now is just writing a little shell script that I'm going to execute to create those in Azure. There's nothing really fancy here. The assumption is that I have Docker Machine installed on some VM, and this client that's going to control everything is going to set up my environment. That's what we're doing here: setting up a Swarm cluster. Notice the command here to create the master and the three nodes. Once I've created this shell script, I'm going to quit out of here and execute it, and we're going to see it actually create the four VMs in Azure for me. So let's go ahead and run this thing. Now, through the miracle of video editing, I've shortened this down for your benefit. I think it took me about seven minutes to run, but you get to see it in less than 30 seconds. What it just did is set up those four machines with Docker. You can see in the portal that I'm part of the way through; they're showing up in the portal now. The VMs have been created, in other words. So at this point I've got this set up, but it's not yet a cluster, and I have not yet orchestrated workloads on it. I've just set up the four machines with Docker. I have a couple more steps to make this happen. We need to define, of these four nodes, who's the swarm master and who are the swarm slave nodes, and that's what this next demo is going to do. So we're going to run this command to create the swarm. That basically says, okay, I'm creating a swarm, and you're going to get this cluster ID that you need to track. It's there in the green box. You need to keep that; it represents the ID of the cluster I'm creating right now. I haven't fully assigned everything yet; that's what this command is about. I'm now going to define the master node with the following command. It's pretty straightforward. I just say, this is the swarm master, you can see that flag there, using the token we just copied, the cluster ID from the step before. Now I need to do the three nodes, so let's write practically the same script for those. It's going to be very similar; you're not going to notice a tremendous amount of difference for the nodes themselves. Let's do those. It's pretty much the same command we saw a moment ago, minus the master flag. Obviously, I've created some certificates to be able to do this at the command line, and I've uploaded the certificates to the Azure portal. So we're just creating the nodes now. Okay, when this command finishes, we have our swarm cluster. What do you think the final step is, the coup de grâce of all of this? What's the end game? To run containers. Not only that, to be able to scale them up and down with a simple command. So there you can see my nodes.
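The docker-machine commands behind that demo look roughly like this. The machine names and the way the discovery token is generated are illustrative, since the exact script isn't shown in full, and depending on the version of the Azure driver you may also need to pass a subscription certificate.

```sh
# Generate a Swarm discovery token (the cluster ID shown in the green box).
TOKEN=$(docker run --rm swarm create)

# Create the swarm master in Azure (machine names are illustrative).
docker-machine create -d azure \
  --azure-subscription-id "$AZURE_SUBSCRIPTION_ID" \
  --swarm --swarm-master \
  --swarm-discovery "token://$TOKEN" \
  swarm-master

# Create the three slave nodes: the same command, minus --swarm-master.
for n in 1 2 3; do
  docker-machine create -d azure \
    --azure-subscription-id "$AZURE_SUBSCRIPTION_ID" \
    --swarm \
    --swarm-discovery "token://$TOKEN" \
    "swarm-node-$n"
done

# List the machines in the cluster.
docker-machine ls
```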
Notice that the docker-machine ls command lists all the nodes in my cluster. Again, this is in some data center in Azure, who knows where. But the next step is to start running containers on this infrastructure, and in a moment I'm going to define what containers I want to run. So at this point we're very close to actually doing something useful. We've got our environment set up; now it's just a question of defining which containers, or images, I want to run. Those images become containers at runtime. Okay, excellent, let's go do that. So we've defined the nodes. Now, there is a container out there you may have heard of: Folding@home. Does anyone know what this container does? It's a protein folding workload that's trying to help find cures for diseases. If you look around the web, it's one of those things like the search for extraterrestrial intelligence, SETI, except for gene therapy. So I'm going to run that container. Now, that file up there, docker-compose.yml, is where I define the images that I want to run as containers. All I have in there is a worker entry with the name of my image. Now, imagine that you want to run a bunch of different things, maybe MySQL with WordPress, or any number of the literally thousands of containers out there on the Docker Hub registry. I'm keeping it simple: I'm just going to run this one image as a container, but you could make docker-compose.yml much more complex than this. So this is the exciting step, actually. I'm defining this declarative syntax for running my image as a container, and the important next step is to run the containers on my worker nodes. Notice my swarm master doesn't do that; the swarm master is just keeping track of the infrastructure and issuing the commands on my behalf. Normally, without a swarm cluster, what would I need to do? I'd have to issue the docker run command on each one of these nodes myself. Instead, what I'm going to do is say: use Compose to run that workload. And I'll say, give me three of those, or give me one of those, or whatever. I'll just say go do it, and I won't have to worry about individual machines. That's the key point here. So let's do that. This final step is amazingly easy. I'm going to basically say: every command I issue from now on, send it to the cluster, not to the machine that I'm on right now, but to my Docker Swarm. So I set that up, and every command I issue now will go against the entire cluster. Here I'm defining that worker and specifying the image that I want to run, and how many I want will be given at the command line. So that's it, my docker-compose.yml is done. Here is the magic command at the end of it all. Well, not this particular one; here I'm just showing you that I have a docker-compose file. Now I'm going to scale it, actually scale it up and down. I want three copies of that protein folding container to run, so I say docker-compose scale, give me three, and it fires up three containers. I'm not worrying about which node is doing this; I'm letting the infrastructure manage all that. At this point I can say, well, what's running? That's the docker-compose ps command. Three of them are running. What if I want two? I say docker-compose scale with two, and it takes one away.
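A minimal sketch of that final step, assuming the older Compose v1 file format; the image name is illustrative, since the exact image used in the demo isn't spelled out.

```sh
# Point the Docker client at the swarm, so commands go to the cluster
# rather than the local machine.
eval "$(docker-machine env --swarm swarm-master)"

# docker-compose.yml (Compose v1 format); the image name is illustrative.
cat > docker-compose.yml <<'EOF'
worker:
  image: example/folding-worker
EOF

# Start the worker, then scale it up to three copies and back down to two.
docker-compose up -d
docker-compose scale worker=3
docker-compose ps
docker-compose scale worker=2
```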
So that was the final demo. What have I shown you? I showed you the portal. I showed you how to provision with templates, and all the available packages on the Marketplace that you can choose from. And then I showed you the whole microservices, containerized world on Azure, and where I think the next level of activity in the cloud will be: running containerized workloads. Any questions? Are we good? Yes, sir. Right, so you're asking about the Docker VM agent running in the hypervisor. Let me see if I can find that slide for you, that fancy drawing we had. That's an artifact that manages the communication between the hypervisor and the guest OS with the daemon. I don't have the exact answer for you; I can look that up. It's a necessary artifact of making it work. It might be there to notice whether containers are active or have failed; it might be looking at things like the life cycle of the running guest OS to make sure it's still active. So it might be a mechanism for keeping track of what containers are running. Remember when I did Docker Swarm and said ps? Maybe it's this extension that goes to the hypervisor and says, give me a list of the running containers. Good question. Any other questions? Yes, sir. I think that it is possible to scale, but if you look at the real world today, very few companies are running production databases at scale in containers. In containers, right? Yeah, apologies if that wasn't clear. So running, say, Oracle or SQL Server in a container: Microsoft doesn't actively offer SQL Server in a container today. Thanks for clarifying. Yes, I definitely would not say that. Open source databases are awesome. Adam. Can you configure the Docker daemon on a specific port for given services, and does the extension watch the given containers without that? Yeah, I think it's an artifact of running in Azure, to manage the communication between the two. Yeah, I think that's fair. It's a good question. I'm wondering, if you install the Docker client on a VM, whether it automatically validates that the extension is there; I would have to dig deeper to give you a clear answer. Happy to do that; you have my email, right? I'm happy to follow up. Any of you with questions, contact me. Yes, sir. What do you mean, institutions or governments? Governance? Well, we're launching the Azure Container Service, which is a very extensible way for you to layer in your own orchestration. We do give you the raw infrastructure, and I work with a lot of ISVs that build out their own kind of platform as a service on top of our raw infrastructure, and there's certainly nothing preventing you from doing just that: leveraging the raw infrastructure to build out your own service yourself, like, for example, Elasticsearch has done. I'd have to look at the dependencies that might exist for that; I'm not sure, I'd have to take it on a case by case basis. But you are allowed to upload VHDs and run them in Azure. Happy to follow up. I'm easy to find, I'm all over the web, and I actually like following up, because I learn something. Great questions, keep them coming. What time is it? Yeah, we've got four minutes, let's use it up. Yes, sir, in the back. Good question. You're a gaming guy, Adam, do you know Xbox? The question is about where Xbox runs. Do you know? On Azure. So it's in Azure. I don't know if we've published which data center.
It's probably replicated worldwide for proximity to customers, and I think it's starting to leverage open source even; I'm not 100% sure. Sir, do you have a question? What exactly are you trying to migrate over? So what operating system are they running, which flavor of Linux? Yes, we work directly with Red Hat now. In the next month or two you'll see full support; Microsoft is in a strategic relationship with Red Hat now, so those should come right over as is. I'm sorry, what was that last point? Oh, cPanel, okay. No problem, follow up with me, I can look into that for you. There's my email; I can find out about cPanel. I think I saw a cPanel template, actually, so I think it ought to be supported. Yes, sir. Which one? CentOS. It is supported today, through OpenLogic. I can find out, but I would imagine it's compatible with your flavor of CentOS. He was asking about the community edition of CentOS. We do support CentOS, and I'm not sure exactly how it differs from the community edition. My understanding is that if you want full-blown Red Hat, you get Red Hat, and if you want the open source, non-Red Hat version, you go CentOS. Email me with that and I can find out for you exactly what that means. Happy to help you. Excellent. Yes, sir. We could use help with that. Yeah, in my opinion, that is not as good as it could be. That's what I do: I grep through the templates, looking for similar things. Agreed. My O'Reilly class talks about it. Yes, sir. Right, at the end of the day it comes down to the hypervisor. Right, I understand that with Hyper-V... I can't live without full-performance VMs. Some people are running on Xen, or running Linux on Hyper-V, and I'm looking for Linux to be a first-class citizen. I would like to see how Azure, with Hyper-V underneath, will treat Linux as first class, as far as performance and resource allocation, since all of this is going to be on the Microsoft fabric with the drivers it uses. Is that going to change anything? Well, my first answer would be, there's no substitute for testing it yourself and really finding out if there are gaps, and if there are, we can work with you to figure out why it's happening. But as with anything, when you talk about performance, it's all contextual: the workload, the type of hardware you've chosen, the flavor of Linux. I hate to give you a wishy-washy answer, but ultimately it's about: let's test it, let's compare it, and figure out where the gaps are. I agree, it's a fair question. I'm sorry, the slides? I haven't posted them yet, but I will. You can email me and I can share my slides with you. You're welcome. Thanks for coming today. Thank you for coming, appreciate it. Are we supposed to leave that? All right, well, I hope you guys thought it was a good talk. Thank you. It's a lot more work than it looks like. Testing, one, two, three. Good afternoon. Hi. Testing, one, two, three. Good afternoon. Welcome to SCALE. My name is Anthony Chow. I am a software developer. My employer does not endorse or sponsor me for my open source work, so I'm not listing who my employer is. I am on Twitter. On GitHub, you probably won't find much.
My objective for 2016 is to have more presence on GitHub, so that I'll be able to contribute more and do some work on different projects. I also blog. The blog is for me to learn: I am on a journey to the cloud, I read a lot of things, and every day when I read something I think is interesting, I blog about it. The reason is that when I write about things, I learn them in more detail, so it helps me and it helps other people. And why did I submit a talk on contributing to OpenStack? Actually, one time my wife and I went to Starbucks. She went to get the drinks while I sat down, and when she came back she said, oh, you know what? Today the drinks are free. I said, okay, who paid for them? I don't know, she said, someone already paid for us. So we enjoyed the drinks and went home, and in the evening when my daughter came back, we said, oh, you know what, today we had a strange experience: we went to Starbucks and it was free. And my daughter said, so did you pay for the people behind you? My wife and I said, oh, what is that? Oh, you're supposed to pay for the people behind you. This is called paying it forward, and that is exactly what I'm trying to do here. I learned something, and now I'm trying to pay it forward. And when you learn something, and this is one reason I like the Linux community and the open source community, everyone contributes. Later on, you pick something up, you contribute, you write a proposal, and I'll come to your session and learn from you. So let's start. Why do I want to contribute to OpenStack? If you read the title of this session, you're probably thinking about contributing to OpenStack. Are these some of the reasons I've listed that you were thinking about? I hope so. Actually, I have the pointer; I should use that. This is the classic example of not testing out the equipment. But having been in the Linux community for a long time, I have a backup system. Let me bring it up. So, what are some of the reasons? Why do you want to contribute to OpenStack? What do you think is important to you? Obviously, if you came to this session, you think this is something useful for you. Is it something listed here? You might be wondering why this picture is in the slide. Actually, this is the reason why I contribute. This thing doesn't work, and my backup system also doesn't work, but anyway. What do you see in this picture? Do you recognize Linux? Docker? Anyone know about Go? Go is very popular; I think Docker is written in Go. What I see in this picture is two things. All of the characters are very happy, and there's also food. If you look at my size, you know I like to eat. These two things are what I'm trying to pursue in my career: I want a happy environment to work in, and I want to bring food home for my family. So the reason I contribute to OpenStack is that eventually I will be able to get a job in the cloud. And I hope today we'll be able to look at some of the barriers and get over them, so that you can start and you're not afraid of contributing to OpenStack. I think every one of you knows what OpenStack is, so I don't need to go through it. But there are two things I need to point out. OpenStack is 99.5% Python based. As of now, I still haven't been able to find out what the other 0.5% is written in. I know it's mostly Python.
Most of the code is Python. But what about the rest? If you go to Twitter or other social media, you will see: oh, it is not 100% written in Python. So if you can find out what the rest is written in, let me know. So, contributing to OpenStack: I think knowing Python is one of the big requirements. The other thing that is important for us to know is that it has a biannual release cycle, two releases a year. If you are an operator, you don't really care about the release cycle; you just need one stable release and you'll be able to deploy that in your environment. But this biannual release cycle is important if you want to contribute. The reason is that when you're trying to find a bug to fix, the release cycle might affect whether your fix will be merged into the mainline. How many software developers do we have here? You know about release cycles. The release manager says, oh, the release is next week. You can say, my feature is very good, it's very useful, but if you say, I'm not sure, or I cannot 100% guarantee it, then the release manager will tell you: don't check it in. It's the same thing here. Because OpenStack has a biannual release cycle, released twice a year, you have a specific window to work on a bug. That's something we need to be aware of. Another thing I have to point out about OpenStack is that it used to use a model called incubated and integrated. If you look at the picture, in the older model you have the core and you have the integrated projects. What integrated means is that they will test your fix integrated with the other features and services as one. For incubated projects, the source code is not stored in GitHub; it is stored in StackForge, so until the technical committee decides that a project is going into integrated, it will not be in GitHub. This posed a big problem: people tended to only contribute to the integrated projects and not the others, so it was not helping the community. That's why they decided on something called the big tent. It's just like we are all under one big tent; all the projects are tagged, and they are now hosted in GitHub as well. I have a link here. Throughout the slides I have put different URLs for future reference. I will put the slides on SlideShare, and if you look at SCALE, they should post them also. I think this is helpful if you want to go back. Let's take a look at some of the projects. These are the current OpenStack projects. There's a whole bunch, and every release they have new ones. Actually, I think Chef is one piece that maybe is not written in Python. Maybe if you contribute to the Chef pieces of OpenStack, then you don't need to know Python; you just need to know Chef. And there's also Ansible: Rackspace relies heavily on Ansible to deploy a cloud, so knowing Ansible is also a good way to contribute. And this I18n project is for translation: if you know other languages, translation is also a good way to contribute. But we'll talk about that later on. So those are the current OpenStack projects. Do you feel this way about OpenStack? Have you ever worked with OpenStack? I think it is kind of a complicated system, and one reason why OpenStack is not heavily deployed is that it is not that easy to use. Although, as time moves on, people are doing different distributions.
HP is doing some kind of, what is it called, a turnkey system, where you buy just one piece of hardware. Ubuntu has one too, where you buy one piece of hardware with everything integrated, and then you can deploy from there. So it's getting better. But if you feel like this, it may be a hindrance to contributing to OpenStack. I find this picture interesting, because I buy things at IKEA too, and sometimes when I get home, it's not that easy to assemble. If you look at the picture, it looks easy, but when you actually try to put the two things together, it might not be that simple. Okay, so I have to ask you this question: what is holding you back from contributing to OpenStack? Any reason I have not listed here? It doesn't matter. I hope that after today, whatever barrier you have will be taken away. Just look at this slide: whenever it's impossible, it becomes I'm possible. I hope you'll be able to get over some of the barriers. Are these some of the barriers you are encountering? This session is about overcoming them, so if yours is not listed here, share it with the audience and maybe we can try to overcome that barrier together. Oh, okay. So let's go over the process and see if that helps. I don't claim to be an OpenStack expert; I'm still learning. But the objective of this session is that if we can overcome the barriers and you start contributing to OpenStack, I've met my objective. So let's move on. First of all, what can we work on in OpenStack? There are many things, but in the context of this particular presentation, we'll concentrate on how to make code changes and documentation changes. In OpenStack, documentation changes are also treated like code changes, in the sense that they go through similar procedures: you have to write a commit message, check the change in, and go through the Jenkins gate system before it can merge. Has anyone contributed to other open source projects? They use pull requests. This is very different: OpenStack doesn't use pull requests on GitHub. In case someone doesn't know how most open source projects do it: on GitHub, you fork the project to your own repo, clone it to your local machine, make your change, push it to your GitHub account, and then you open a pull request and they merge it in. In OpenStack, it doesn't work that way. You don't submit a pull request; you just submit your change for review, and it goes through code review and an automated test system before they will merge it for you. We'll go through that later on. And this is also a good link on how we can contribute. Like I said, testing is very important too, though I'm not sure if it's something you want to do. There are also bugs that someone reports that go through something called triage, and I know I didn't say the word quite right. The triage process is that you test the bug and try to reproduce it, to prove that it is indeed a bug, so that people can work on a fix. If you know other languages, you can do translation. I know that in China, Japan, and Korea they use OpenStack heavily, and if you can translate the documentation into the respective languages, that is useful too. If you want to see what you can do, go to that link; I don't need to go through it here today.
Before we choose what to do, there are a few things we must do first: create accounts. You have to create a Launchpad account, join the OpenStack Foundation as an individual contributor, and sign the Contributor License Agreement. The review system is another place you need an account. And then, on your local machine, there are two things to install: git and git-review. In fact, on this machine I don't have them installed yet, so maybe it's a good time to do it. I'm mostly an Ubuntu user; I don't know much about Fedora or CentOS. apt-get install git: it's a very simple process, and it's pretty quick. They have a fast network here; I think it's faster than the one I have at home. Then git-review is the other thing we need to install on the local system. If you want to contribute, these are the two things you must have. Of course you also need Python, which we'll come to later. (I'll collect these commands into one sketch in a minute.)

These two URLs describe the process very well. I don't think you'll get stuck, but in case you do, this one from Sematic.com has all the screenshots showing which fields you need to fill in; it walks through the whole process. And this one shows what you fill in when you go to Launchpad to create an account. I don't think we'll have a problem with this, but it's in the slides for reference.

So how do we find bugs, or how do we find things to work on? There is a bug system, and this is a very important URL to know. In the bug system, each bug has a status and a priority: critical, high, medium, or low. Of course, if you have a low-priority bug and it's close to the release date, most likely it will not be merged into the build. The status is something we need to look at when we pick a bug from the bug system. And there is the triage, the word I was trying to say earlier: someone repeats the test procedure to see whether it is indeed a bug. So how do we find a bug? This is where we look: all the OpenStack bugs are listed at bugs.launchpad.net. But of all of these, which one is suitable for me? Am I going to go through all of them trying to find one? One thing we can do is use the advanced search, which narrows down the possibilities. For status, if I want to choose a bug, I would not pick a New bug, because when a bug is newly reported it might not really be a bug; let the status settle down first. Confirmed is a good status to search on, because if it's confirmed to be a bug, it's something we should choose to work on. In Progress means someone is already working on it, so unless you want to collaborate with that person, it might not be for you. Fix Committed we don't want to search for, because it's already fixed and committed. Incomplete, with or without response, depends on how you want to find a bug: sometimes a report is incomplete, and if you take the initiative to follow up and fill in the missing details, that also helps.
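Before we go further with the search filters, here is the account-and-tooling setup from a few minutes ago collected into one sketch. This is a minimal sketch for Ubuntu; the package and pip names are the usual ones, so adjust for your own distro.

```bash
# a consolidated sketch of the local tooling setup on Ubuntu; the package and
# pip names are the common ones, adjust for your distro
sudo apt-get install git
sudo pip install git-review            # also available as the "git-review" Ubuntu package
git config --global user.name  "Your Name"
git config --global user.email you@example.com
```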
So it depends on what type of problem you are trying to find; those two statuses may or may not be useful. For the assignee, I choose "nobody", because that means the bug has not been picked up or assigned to anyone. Although that's not always a problem either: I find that some bugs are assigned to somebody, but when you go back to the log it turns out that person has been inactive for half a year already, so even though it's assigned you can still pick it up and work on it. You can fine-tune the search like this. So if we pick these options, how many bugs do we find? Let me go back and do the search. Even with those limits we still find almost 3,000 results, so again, it's difficult to pick a bug to work on.

One nice thing about finding a bug in OpenStack is that some bugs are tagged as "low-hanging-fruit". These are tagged as problems suitable for a first-time committer to work on. There was a thread on the mailing list suggesting we drop "low-hanging-fruit" and tag the bugs as suitable for first-time committers instead. I was on the mailing list and replied that this would not be good, because I've already submitted my first fix, so I no longer qualify as a first-time committer, but I'm still only a casual contributor to OpenStack; I don't work on the deep internals, so low-hanging-fruit is still important to me. I hope they don't change it. If you search with this tag, that's what I would suggest for first-time committers: these are the bugs labeled as low-hanging fruit, meaning they should be very easy to fix, and I recommend searching with that tag. If you just browse, you'll think, wow, they have so many problems for Fuel; the tag is one way to find a good bug to work on.

If you find a bug, what do you do? Me, I'm a more cautious person; I would not immediately assign the bug to myself. Let's go back to the slides. There must be some problem with LibreOffice; you should be able to see the slide: finding something to do. Another good thing to know in the open source community is that IRC is your friend. It's only showing the slide in this cramped view; I'm not sure why it won't give me the full slide, so we may have to live with this format for the rest of the presentation. But IRC is a good tool, and there are two things to know about it. If you don't have an IRC client on your workstation, I use the web chat at webchat.freenode.net. This is how I log in; I use it everywhere. Which channel? I made a mistake before: I went to #openstack, which is not really the right channel for developers; #openstack-dev is the developer channel. Mountains, this one, this one, this one... did I miss any? Did I select them all, only two? Yes, I'm not a robot. OK, that's how we log in to IRC. There are two things about IRC: you can ask questions on it, and the individual project teams use it too.
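If you would rather use a local client than the web chat, here is a rough equivalent. irssi is just one client choice and the nick is a placeholder; the channel name is the one mentioned above.

```bash
# one way onto IRC from a terminal instead of the web chat; irssi is just one
# client, and the nick is a placeholder
sudo apt-get install irssi
irssi -c chat.freenode.net -n your_nick
# inside irssi:
#   /join #openstack-dev      (the developer channel)
#   /msg somenick hello       (ping a specific person)
```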
They don't hold their meetings openly on that main channel, though; each project has its own. Say you want to contribute to Neutron: they have weekly meetings, and I think they're moving to alternating times so that people in different time zones can attend. That's where what gets checked in is discussed. And if you're new... someone here is logging in to IRC right now; let me ping that person. Is this someone in our audience? Hi, good, thank you for logging in. So I ping him. That's how you contact someone: say a bug is assigned to him and I want to work on it, or I want to ask a question, I ping him, and if he's online he'll answer. Some people run an IRC proxy, so they appear to be online but they're not actually there; they'll answer you later on. You have to be patient. You also have to find out, for whatever project you want to contribute to, what the best time is when the developers are on IRC and talking, and adjust to that time and talk to them.

So I can ask whether I can work on that bug, and we just keep the conversation going. People in OpenStack are, in my experience, very friendly to first-time committers for some reason, so don't worry about asking the wrong thing; just keep talking. Actually, I was very fortunate. My first check-in was a documentation bug. I was on an IRC channel, I forget which one, saying "I'm a first-timer", trying to introduce myself, and all of a sudden someone asked me (I don't even know if it was a man or a woman), "Do you mind working on a documentation bug?" I said no, anything is good. So he gave me a bug number and I assigned it to myself. You know what that bug was? Removing two trailing white spaces. That was the bug. So contributing to OpenStack is not that difficult; I was lucky that my first fix was taking away two spaces. It's not quite that simple, though, and I'll go through it later: even though it was just removing two spaces, it still took me more than a week before my fix got merged into master.

Anyway, if you want to contribute to OpenStack, you should master IRC and use it as a tool to find bugs and to work with people, because OpenStack is a collaboration. The more people you talk to, the more presence you build. In the individual channels, say Neutron's, they have the weekly meeting; if you attend and say hi, that meeting gets logged somewhere (I forget where the IRC logs go), so your presence is in the minutes and people know you were there. If you ask questions, you make yourself known, and people will say, oh, it's this person again, asking questions. I'm not saying to ask careless questions, but at least they won't treat it that way on IRC; they'll say this person is serious about contributing to OpenStack, and they'll help you out and talk to you. So that's something to be aware of: IRC is your friend for finding bugs. Let's go back to the slides. Oh, I know what, let me quit this.
Now let me... oh, it still won't let me quit. This is something I don't like. I'm a developer myself; I don't have a customer-facing job, and talking in front of an audience is still not comfortable for me, so forgive me if I look uncomfortable or the presentation isn't smooth. My objective today, again, is that if there is any hindrance or barrier to contributing to OpenStack, let's see if we can overcome it in this session. Let's move on. There's also the mailing list; I have the URL here, and there's a lot of good discussion on it. It may not be bug-related, but it's good discussion about the projects: the direction of a project, what's good, what's not, whether people want a feature to head one way or another. It's a very good resource, but bear in mind it's probably better not to use your work email address to sign up; use a dedicated Yahoo or Gmail address so you're not wading through hundreds of emails every day. And then there's Ask OpenStack. I personally don't use it much, but of course you can ask questions about OpenStack there.

Let's move on. Now, assuming you've found a bug to work on and assigned it to yourself, how do we actually work on it? What I find is that most OpenStack developers use Ubuntu systems; Ubuntu in particular seems very friendly for OpenStack development. Of course there are people from the SUSE community and from the Fedora community, especially Red Hat, who use their own systems, but I would say maybe 60% of developers use Ubuntu machines. It's your own choice; you're not limited. There are different ways to set up for working on a bug. You can use KVM to spin up a machine. Has anyone heard of Vagrant? Vagrant is really nice: you just spin up a VM and it's right there for you. This is how easy it is, a couple of commands, and after the VM is up you just run vagrant ssh and you're inside it. You can configure it however you want, and it's good for development systems. There's also VMware Workstation and Fusion; I'm not sure this crowd uses much of that, this being a Linux convention. I'm involved with the VMware community and I do have VMware Workstation, but one bad thing I found out is that my copy is an older one, version 8, and it does not support 14.04. I learned that the hard way: I tried to spin up a 14.04 Linux machine on Workstation 8, saw crazy behavior, and eventually discovered that version 8 doesn't support 14.04. But that's not a big problem.

The other two things to look at are DevStack and RDO. Usually when we deploy OpenStack in a production environment we use separate servers for each service, but as a developer you don't have that many resources, and even if you did, it's more convenient to stand up OpenStack on a single machine. DevStack is very good for that, and it's what I use most often. It's a community project that you can spin up easily, and it's nicely documented: you just git clone it, run the script, and a little later you're up and running.
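Here is a minimal sketch of that Vagrant-plus-DevStack flow. The box name is just an example (a stock Ubuntu 14.04 image), and the DevStack location is the GitHub mirror as I recall it, so check the DevStack docs for the current repo.

```bash
# a minimal sketch of the Vagrant-plus-DevStack flow just described; the box
# name is an example 14.04 image, not something from the talk
vagrant init ubuntu/trusty64    # writes a Vagrantfile in the current directory
vagrant up                      # boots the VM
vagrant ssh                     # drops you into a shell inside it

# inside the VM: clone DevStack and run the setup script
git clone https://github.com/openstack-dev/devstack
cd devstack && ./stack.sh
```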
And then you'll have a machine with OpenStack running to work on. Depending on the project, you may need add-ons, but you can easily Google your particular project: if you work on Neutron, Google the keywords DevStack and Neutron and you'll find tons of information on how to configure a machine. Although, and this has been my problem, even when there are instructions for setting up the environment, sometimes it's not that straightforward. That's where IRC comes in handy; you can ask questions there. Sometimes, and this is my experience with the OpenStack community, it's just not that easy; that's why I made this slide, because sometimes I really do feel this way. But the good thing is: don't be afraid. There is support on the other side, and the community is very friendly, so don't worry about it. It shouldn't be a hindrance. RDO is essentially the Fedora and Red Hat counterpart to DevStack. I don't use it much, but the idea is the same: on a single machine you spin up one development environment where you can make your changes and test them. Nothing project-specific there.

Then, fixing the problem depends on the bug. If you picked a low-hanging fruit, like in my case, the fix was taking away two white spaces; sometimes the fix really is just like that. But even after the fix, you still have to run the unit tests on your local machine. You don't want to be the bad guy in the community who makes changes and checks them in without testing. You can do that, nobody will stop you, but it's not good practice. Because OpenStack is mostly Python-based, we use tox to run the tests locally. The project you're working on usually has well-developed, well-maintained test scripts, so you just run one command and it runs everything for you. This article I found particularly useful and very detailed: it explains how to do unit testing, walks through the scripts, tells you to use pip to install tox, and lists the steps. One wrinkle with OpenStack right now is that Python has version 2 and version 3, and some projects use one and some the other, so it depends on the project you're working on. But the projects usually have good documentation on how to run their unit tests; you can go through it yourself, it's not difficult, and it gives good suggestions. The unit test output has a nice layout: it tells you how many tests ran and whether they were successful. It's probably hard to read here because of the colors, but you can see this one better: "Ran 14 tests" in so many seconds, OK, so your unit tests are good to go. That's something we can all do.

But what I find is that after the unit testing, committing the fix is the most difficult part. You'd think: the bug is already fixed, why is it so difficult to commit the fix? Well, let's take a look. Gerrit is the code review system; when you check in, it triggers the continuous integration tests to see whether your fix is good.
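Before we get into the Gerrit flow, here is roughly what that local test step looks like. The tox environment names below are the conventional ones, so check the tox.ini of the project you're actually fixing.

```bash
# roughly what the local test step looks like; environment names are the
# conventional ones, so check your project's tox.ini
sudo pip install tox
tox -e pep8        # style checks
tox -e py27        # unit tests under Python 2.7
```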
The commit system, Gerrit, is based on a voting system. You need two +2 votes; let me phrase it better: +2 is one vote, and you need two of them before your fix can be merged to master. You can also get a -1. Like I said, a -1 does not mean your fix is wrong; sometimes something isn't quite right with the format and a reviewer gives you a -1. Sometimes Jenkins gives you a -1 because the tests failed when it ran them. I'm not 100% sure, but my understanding is that Jenkins also uses DevStack to spin up machines to run the tests, so using DevStack or RDO for your own testing is a good match. Sometimes the problem isn't on your side; the test fails because of somebody else's problem, and then you have to follow up and find out what it is. Sometimes you can request that the tests be run again. So don't worry: on your first commit, or your first few commits, you will most likely get a -1. Even seasoned OpenStack contributors get them, for all sorts of reasons. Some people say this is why they don't like code review.

One of my experiences, which I forgot to mention when talking about releases: there was one low-hanging fruit I picked up where the report said two flags had been deprecated and were no longer used in OpenStack. I thought, good, that's the easiest thing, there's no logic change; I just need to take out the two flags, grep the tree for whatever uses them, remove the code, no problem. So I did that, I tested it, everything tested well, and I made my commit. Some reviewers said it was good to go, but one person gave me a -1 and said, no, this is not how you do things: for a deprecated flag you cannot just take it out; in this release you mark it as deprecated, and it gets removed in the next release. So, OK. I had to do a rebase, all the changes I had made were no good, and I had to do it over again. I submitted my changes, and later, in the next release, the flags were taken out (not by me).

So the commit message is not as easy as you think. They follow a very strict format, and I recommend you go through this document before you commit; it's a very good description of how to write commit messages. We won't go through it all, but these are some of the main points, things like no stray white space. The good thing about this article is that it has examples of good and bad commit messages and explains why each one is good or bad. Sometimes you'll get a -1 just because the format of your commit message doesn't adhere to a good commit message, and they'll tell you to redo it. And one thing we'll come back to: because of Gerrit, the Change-Id is very important. Keep it across revisions, because if reviewers later say your fix needs additional changes, you check in again with the same Change-Id so your revisions stay in sequence in the log. I highly recommend going through that document before you really commit your fix.
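As a made-up illustration of the shape that document asks for: a summary line under 50 characters, a blank line, a body, and footers. The bug number below is fictional, and the Change-Id line is appended automatically by the hook that git-review installs.

```bash
# a made-up illustration of the commit message shape described above; the bug
# number is fictional, and git-review's hook appends the Change-Id for you
git commit -m "Remove trailing whitespace from install guide

The install guide had stray trailing whitespace that trips the
documentation gate. Strip it.

Closes-Bug: #1234567"
```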
One good thing about OpenStack is that they have a sandbox for us to play with before we do it for real. It's very safe, and I think we can go through it here; if you have your laptop, you can play along. The URL is on the slide, but now is the time to try a practice commit. You've seen this part already: download git and git-review, then git clone. Let's see... cd; I don't have an openstack directory yet. git clone https://... openstack-dev sandbox. Good, hope that's right. Nice and fast. Now we have this directory: cd sandbox. I did this once before, and it's very nice; I wish I had known about it when I did my first check-in, because I had to learn it the hard way.

So now we're in there; let's follow the instructions. Let's see what I have now: git config. Because I only just downloaded git, I don't have anything set, so I need to run git config user.name for me, and git config user.email. Actually there's a story behind this username. I like to drink beer, and my last name is Chow, so the joke people make about me is to call me Chowder, and I like food. When I first picked a username I was working on virtualization, so I called myself vChowder-and-beer. Later, once cloud came into the picture, I changed the "v" to "cloud", and I've been using that username for a long time on most social media, on Twitter, on GitHub; I even opened a Gmail account with it. So let's set that. The next thing is very important for me. I've been working on Unix and Linux for a long time, but I never got around to learning Emacs, and the default editor here is Emacs. The very first time I made a commit, an Emacs screen appeared in front of me and I didn't know what to do; I could insert text, but I didn't know how to get out of Emacs, and I had to call my wife and ask how to quit. So this is important for me: on a new system I must change the editor to vi. I don't even use vim; I'm old school, plain vi is good enough for me. Let's check with git config: this is what I have now, my username, my email, and the editor, plus what the sandbox wrote for me.

Let's follow along. I'm not sure we need this step, so let's skip that. Just do a git checkout to make a new branch; we're checking out from the sandbox, not from any other tree on GitHub. New branch. Then, following along, cat a first file: "my first change set for..." You can type anything you want. Then git status: I have one untracked file, so I need to add it. git add, git status: now I have one new file staged. So I've made a change; now git commit. Hmm, wrong typing, how come the editor didn't change? Someone help me out here. Anyway, we need to write a good commit message: "This file fixes a big problem." What the article recommends is one summary line, within 50 characters, describing what the problem is.
Then you describe what's been changed: "Add one new file so that the problem is fixed." This is actually a bad commit message; it doesn't really help other people understand the change, but let's go with it for now. One thing I forgot to mention is that when writing a commit message there are some keywords that matter a lot. Say I'm working on bug 1234 and this is the fix: I add a Closes-Bug line. There are a few other keywords you may need too. If your change affects the API, you add APIImpact, so the API group knows this change has an API impact they need to look at. DocImpact means your check-in affects existing documentation, so the documentation team has to take a look too. And there's SecurityImpact: your fix may be fine, but if it could introduce a security hole, or if it fixes a security bug, you add that one. These flags tell Jenkins and Gerrit to inform the other groups that your change has those impacts so they can follow up. A Change-Id should also get added for me; let's see what it is. And how do I save and get out? Is it Ctrl-X? Ctrl-X and Y, enter. All right, thank you; I thought I had changed my editor.

So I've made a commit with a commit message, but we're not done. git status says nothing to commit. The most important step is that after the commit, you type this important command: git review. I'm going to type git review, but my command is going to fail; see if you can figure out what went wrong. git review... yes, it fails. It asks for my Gerrit username. What's wrong? It gives me a traceback. Let me try again with more verbose output. OK, the traceback says it doesn't know where Gerrit is... but it should, because there is a .gitreview file in the repo. Anything wrong with that? OK, let's just try this. Actually, that's not my biggest problem: it does know where Gerrit is, but there is one thing I'm missing that I haven't done. Let's go to the review site, review.openstack.org, and sign in. This is where the reviews live. The thing I haven't done is under Settings, SSH Public Keys: my laptop isn't set up, I don't have a key uploaded, and that's why it won't let me push.

These are the little things that can be frustrating when contributing to OpenStack. They're not difficult; it's just that we don't know the flow yet. Once you've done it, it becomes second nature. And if you look at these tracebacks, even if you cat the .gitreview file, it won't necessarily tell you what's really wrong. That's where Google comes in. I think Google is the most widely accepted search engine; I don't think many people use Bing, and Yahoo I find not very friendly to use, but anyway, that's a side comment and it's not important. The point is that an Internet search engine is your friend: if we want to contribute to OpenStack and we hit a problem, we search.
You just have to key in the right words and narrow it down, say, "cannot connect to Gerrit", and you'll find out what's wrong. For me, at this point, I need to pull out my cheat sheet and create an SSH key. It's not difficult: cd ~/.ssh (there's a .ssh directory in your home), then ssh-keygen -t rsa; at "enter file in which to save the key", just take the default. Before, I only had a known_hosts file, but now two new files have been created. What I need to do is cat id_rsa.pub, copy it, go back to Gerrit, and paste it into the key field. Now I should be able to push; let's give it a moment, it may not be instant. Let's go back: cd sandbox. Let's run git review again and see if it works this time, now that I've added the key. This time it's OK, because I get a number back, a change number that's generated for you when you push the review. Then we just click through. Let's go there and take a look. This is what I've committed, and you can see the reviewers: in this particular case it's Jenkins, Nimble Storage, EMC, and so on. Jenkins gives me a +1, good to go; Nimble Storage says no, not good, so I'd need to find out why. This sandbox is very nice: you can play around with it, and once you have a successful check-in there, you'll feel more comfortable doing a real check-in for a regular bug fix. This is the Change-Id, which is the important part, and this is the commit message I typed in earlier. What else can we get from here? This is the workflow. There's a lot to look at, but the sandbox is a very nice tool for getting used to the commit process and the workflow. The key thing, for me, in overcoming the barrier to contributing to OpenStack is not really the bug fix; it's the process we're not familiar with. Once you're familiar with the commit process, I think it becomes much easier to contribute. And since this was only a test, the web page recommends that you abandon the change when you're done; I'll come back to that. So that's the demo: the sandbox, play with it, and that's what I have.
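To recap the part that tripped me up, here are the key and push steps in one place, plus the thing the Change-Id buys you when reviewers ask for another revision. Paths are the defaults.

```bash
# the key and push steps from the demo, collected in one place
ssh-keygen -t rsa                 # accept the default file location
cat ~/.ssh/id_rsa.pub             # paste this into Gerrit: Settings -> SSH Public Keys
git review                        # pushes the change to Gerrit and prints its review URL

# if reviewers ask for changes, amend instead of adding a new commit, so the
# Change-Id (and therefore the review) stays the same
git commit --amend
git review
```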
Like I said, the OpenStack community is particularly friendly to first-timers. If you identify yourself as a first-time committer, people are going to help you out and give you advice; trust me, they're really friendly. And if you don't trust me, challenge me: go be a first-time committer and see whether people are friendly to you; you can come back and tell me if they're not. Don't feel discouraged, because, as the slide says, you have nothing to lose except time. If you start contributing to OpenStack, remember this slide; it has helped me tremendously. In fact, on my Twitter I use it as my background, just to remind me to stay focused, not give up, and continue my journey to the cloud. And that's all I have today. Any questions or comments? Was it helpful? I hope you go back and start contributing. You have nothing to lose; the worst thing you'll get is a -1. Thank you.

One, two, three, perfect. Can you hear me? Thanks for coming today. My name's Darron Froese. I'm darron on Twitter and darron on GitHub, and when I'm not crawling around an underwater wreck, I'm a site reliability engineer for Datadog. To start things off today, let's briefly talk about what service discovery actually is. In its simplest form, and for our purposes today, it's comprised of two main components. Service registration, where some service on some node, or in a container, or maybe even in a unikernel, says to a central authority: I provide this service at this IP address and port. And on the other side, service discovery is the other component: some process on a node or in a container asks that central authority, hey, can you tell me where to connect to this service? There are obviously other parts around that, but that's what we're going to focus on today. And that's really it.

Datadog's journey to service discovery started near the end of 2014. We had about 370 VMs in AWS, and we were ingesting about 1.2 million metrics per second. We'd been around for four years, but we were in the process of cutting apart our monolith and taking out the components piece by piece. So we were growing in staff and in machines we monitored, and we were having some pain around configuration management. Rapid growth is always challenging: it exposes the areas you need to deal with next. We had gotten to where we were by doing things a certain way, but we couldn't do those things the same way anymore. In order to scale to the amount of traffic we were seeing, we needed to add many more machines to share the load across our entire platform. As that quote indicates, you can get pretty far, in our case up to 400 machines, but it was getting increasingly cumbersome to manage raw IP addresses. We needed a better way. And by the way, you may not be able to see it, but there's an article at that link at the bottom of the slide, and it's great; even though it's two years old, it's got lots of good stuff in it. At the time we were using a hybrid of Chef searches, which take about 30 minutes to update, and large numbers of manually managed IP addresses. Those environment files in our Chef repository were some of the hottest files we dealt with on a day-to-day basis. There's nothing really wrong with that, but it was getting harder and harder to manage. As you can see from the graphic above, which I unfortunately had to obfuscate a little, the number of services we were extracting out of our monolith and adding to our application to keep up with growth kept growing and growing. If we were to prepare for a future where we all move to containers, pods, and unikernels, there's no possible way to keep the locations of all the things in a single static file. Plus, it's really error-prone to manage that file; it was getting really troublesome to merge. We could see the writing on the wall. I had first used Consul back in June of 2014, to serve as a backing store for environment variables for my Docker project, octohost. I was only using the integrated key-value store at the time, but I knew it included service discovery, and I wanted to give it a shot.
So I got approval to take a quick spike into seeing whether it would work for Datadog, and here I am, 16 months later, still on that quick spike. At the time, we thought our desired end goals were pretty simple: we wanted to register and provide a catalog of the services in our cluster, and provide the integrated key-value store across our cluster. And unfortunately for people who like to complete things and move on to the next project, those two little goals led us into an almost infinite amount of yak shaving and rabbit trails on the quest for infrastructure nirvana.

Some of you may not even know what Consul is; thanks for coming regardless, and I'll give a quick introduction. Consul is a great tool written by the folks at HashiCorp. There's a distributed and strongly consistent key-value store inside Consul that you can access from any node in your cluster. It has pretty flexible ACLs that let you lock down parts of it as needed. You can register a watch against keys in the key-value store, and when a key changes, Consul automatically runs a handler for you; it's a very efficient way to push information out. It has an opinionated service discovery framework that gives you built-in DNS and HTTP endpoints to query. You can also create locks, and you can do remote orchestration and job execution as well. It's really quite cool.
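Here is a minimal sketch of those two interfaces, the KV store over HTTP and discovery over DNS, assuming a local Consul agent on its default ports. The key and service names are made up.

```bash
# a minimal sketch of the two interfaces just listed, assuming a local Consul
# agent on its default ports; key and service names are made up
curl -X PUT -d 'on' http://127.0.0.1:8500/v1/kv/config/feature_x   # write a key
curl http://127.0.0.1:8500/v1/kv/config/feature_x?raw              # read it back
curl http://127.0.0.1:8500/v1/catalog/services                     # list registered services
dig @127.0.0.1 -p 8600 web.service.consul SRV +short               # discover via DNS
```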
Consul has server and agent nodes, and you run Consul on every node. The binary is identical; the only thing that changes is the configuration. The server nodes participate in the Raft consensus protocol to keep things consistent; it's how they agree in a distributed system. There's always a single leader among the server nodes, and if the leader slows down and stops responding, the other server nodes hold an election, kick that one out, and elect a new one. That leadership election is not a big deal; it's not like a Postgres failover where you actually have to do something. It's hands-off and happens automatically. Now, during that election, for approximately five to ten seconds depending on your network, you can't read or write to the key-value store, and most of Consul is in a kind of degraded state. If you have time and want to learn more about Raft, there's an animation and lots of information at the link above; it has some great explanations of what Raft actually does.

So given that Consul is awesome, and it is, I'm telling you, we still weren't sure whether it would work for us and whether it would help. How would it fit into our environment? How would it work given our needs? How would we even end up using it? We really had no idea. So we rolled it out into staging. There were about 100 nodes in that environment, and we used m3.medium instances for the server nodes. Our phase-one plan was pretty limited: really just an initial deploy of the server and agent nodes. We added some registered services and explored the service catalog. We wanted to see how it would act in our environment and whether it would interfere with anything. We quickly found that, no, it didn't interfere with anything; in fact, the agent binary only took between 15 and 60 megabytes of RAM on each node. Everything seemed pretty calm, maybe a little too calm. Now, given that I work for Datadog, and Datadog monitors things like Consul, this is the part of the talk where I say you need to be able to monitor Consul if you want to roll it out.

At Datadog we have a philosophy of "monitor first", which means that if we can't monitor it, most of the time it doesn't get rolled out into prod, and that really helped us with this rollout. Over the next couple of weeks, we built the Datadog integration to monitor Consul. We learned how to break it, we learned how to fix it, and we figured out that it likely wouldn't break the world if we rolled it out to prod. So: probably fine, and we shipped it. We shipped it in the state it was in, enabled but not really being used. It was sort of a dark launch: it was running and active on every node, but we didn't depend on it, and it wasn't being utilized except in a very exploratory sense. At that time we were at about 370 nodes in production. We spun up five m3.large instances and then started adding agents. You can have three, five, or seven server nodes, and at that time three didn't seem to be cutting it; it was having a few too many of those leadership transitions, so we wanted a bit more cushion to survive a failure if something happened. And the astute viewer will note that's not a token ring; they're not connected in a circle. They all connect to each other: every node talks to every other node. When everything finally got rolled out into prod, it was stable, which was pretty awesome.

During our first explorations in staging, we had discovered what we considered to be the two most important metrics to watch to know whether Consul is working correctly: do I have a leader, and when was the last leadership transition, was there one recently? Overall, these metrics tell you whether your cluster is healthy. If you have a leader, things are good right now; and if you've been having lots and lots of leadership transitions, even though they're hands-off, it's not a great sign. We'll come back to this, because those two metrics play an important role in today's talk. So now that we're live in prod, what do we do? One of the first things we did was add what we called a "datadog" service. We wanted to use the Consul service catalog to have every node in the catalog, so we'd be able to do all sorts of fun stuff with it, fun stuff like the next slide, of course. We use Chef to install the service as a small JSON file; that's an example of what the JSON file ended up looking like. We also use Chef to add all of the node's roles and its availability zone as tags. Now that we had a complete picture of all nodes, we could do things like (can you all see that? OK, sweet) a command we called ssh-to-role, to SSH to a node that has a specific role. We could also use another command called host-by-role to find all hosts with a particular role. It's incredibly useful, and it's the primary way people get around our clusters today. At Datadog we already had an orchestration solution that many of you are probably familiar with, called Capistrano. It did all sorts of things for us, and Consul has something similar called consul exec. It can run any command you want on any group of nodes; we limit it to just the Consul server nodes. The catalog of nodes is always up to date as nodes are added and removed from your clusters, so you don't have to do anything manually, which is great. And it's fast; it's pretty nice.
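To make the registration side concrete, here is a hypothetical sketch of a node-level service definition like the one Chef drops on each node, plus the kind of catalog query a host-by-role style helper can make. The service name, port, and tags are made up, not Datadog's actual values.

```bash
# hypothetical sketch of a node-level service definition (name, port, and tags
# are made up), plus a catalog query filtered by tag
sudo tee /etc/consul.d/datadog.json <<'EOF'
{
  "service": {
    "name": "datadog",
    "port": 8125,
    "tags": ["role-web", "us-east-1a"]
  }
}
EOF
consul reload
curl -s "http://127.0.0.1:8500/v1/catalog/service/datadog?tag=role-web"
```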
With consul exec, things that used to take multiple minutes under Capistrano you can do in five, maybe ten seconds if everything's slow. Unfortunately, we aren't super happy with some of the security trade-offs of consul exec. You can't really turn it off unless you do a lot of extra restricting that we weren't willing to do, so most of the time it's disabled, except when I need to do something really fast. Sorry, Mike, wherever you are. One quick pro tip: if you're using consul exec, do not tail a large or active log file, because all of the server nodes participate in sending you those bits and all have to agree (they use Raft to agree), and that's a very quick way to melt down your server nodes. They're aware of it, and I'm hoping there will be a fix, but we'll see.

Remember the strongly consistent key-value store I talked about earlier? It turns out it's really handy to have a global data store on all of your nodes that's available from localhost, just an HTTP call away. We wanted to use it for configuration data, but we also wanted to know who made a particular change and when those changes were made. It turns out somebody had already built most of this: git2consul is the solution we used, and it works great. We took git2consul and created a config repo that holds selected configuration data that's widely used across our stack. It's a very popular repo at Datadog; who knew that if you build something that's easy to use and does things quickly, people will use it a lot. Every 60 seconds, git2consul checks whether there are any changes in the Git repository, pulls and merges those changes into the key-value store, then Consul distributes those changes to all nodes, and the processes we have in place act on them (I'll sketch that watch-plus-handler pattern below). It has given us a whole bunch of really cool capabilities, like quick reaction time and flexible configuration. It's also capable of sending a broken config file to every node very quickly; it's a tool with a very sharp edge.

As we used git2consul and our Consul config system more and more, and because of how we were using and abusing Consul, this exposed what we felt was the weakest part of Consul at the time: how it reacted when we read from the key-value store at high velocity. Consul's leadership transition mechanism is tied to latency. After not hearing from the leader for about 500 milliseconds, for whatever reason, the other server nodes kick it out and elect a new leader from the group. The old leader comes back with its tail between its legs, promises to do better next time, and rejoins the others, waiting for its next turn. When you read from the KV store at too high a velocity, or from too many locations at once, Consul 0.5 has a tendency to freak out and have those leadership transitions. As you can see from the graph, from January to May it wasn't that awesome. And in all fairness to HashiCorp, this was at least partially self-inflicted. Each leadership transition took approximately six to ten seconds on our network to complete, and during that time, again, the KV store is unavailable for reads and writes. So we adjusted our code, made it a little more tolerant of those interruptions, and moved along. Here we are at the end of May, and everything's up and to the right: we're growing in all areas as we start to use more and more features of Consul. By this time we're getting serious about service registration.
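Before moving on to registration, here is a rough sketch of that "watch plus handler" pattern, the piece that acts on KV changes. The key path and handler script are made up for illustration.

```bash
# a rough sketch of a key watch with a handler; the key path and handler
# script are made up for illustration
sudo tee /etc/consul.d/watch-feature-flags.json <<'EOF'
{
  "watches": [
    {
      "type": "key",
      "key": "config/myapp/feature_flags",
      "handler": "/usr/local/bin/reload-myapp"
    }
  ]
}
EOF
consul reload    # or restart the agent to pick up the new watch
```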
On this slide you can see we only get up to the letter D; many of our services were being registered, and we were playing with how to use the various discovery mechanisms. There's a simple HTTP API built into Consul that can answer many questions about nodes and services in your cluster. It's flexible and easy to use, but for us to use it would have required a little too much re-architecting of parts of the monolith, so we decided not to go that route in our app and in our extracted microservices. We decided to go primarily with the DNS interface, at least inside our application and those microservices. It's simple and flexible: as long as you know the name of the service, you can easily get a list of IP addresses where that service is running. As with anything to do with DNS, it has its own drawbacks; there's some strangeness with some libraries, but it just works the vast majority of the time. Our newest services, a custom data store and a metrics service that we deployed last summer and fall, can only be reached through this DNS interface; there's no other interface allowed. As we add additional instances to handle the load and our requirements, the new nodes are added into rotation, the DNS updates, and we're good.

As with anything new, though, we were worried about services being unstable, flapping in and out of the service catalog. What would that do to our app? Would it happen because of Consul? Would the health check mechanism work as we hoped? So we did what many pragmatic engineers do: we cheated and rigged the game a little bit. In some cases, like the one above, we made a check that can never fail, by calling /bin/true every 60 seconds. In other cases, we removed the check altogether. In still other cases, like Cassandra and Kafka, we use proper health checks at a decent interval, and I'm happy to say that service flapping generally isn't a problem. I know some of the HashiCorp team have told us there are some issues with services flapping, but it's not a Consul problem; at least we're not seeing it, and we have many, many services. If your service is flapping, in our experience, it's because your service is flapping. Sometimes we'll see a bunch of Cassandra nodes roll in and out of the catalog, and that's because Cassandra is doing whatever the heck Cassandra does. It's proven to be a very reliable mechanism. Another quick pro tip on the whole flapping thing: Consul has the concept of a data center, and within that data center every node needs to be able to talk to every other node on multiple ports, on both TCP and UDP. If any of that isn't correct, or you have firewalls in the way, that's when you see flapping, but again, that's not a Consul problem.

One of the side effects of having a very fast-paced system that ingests millions of data points per second is that using DNS was always a very risky endeavor; everything's always a DNS problem, right? In fact, before Consul, we didn't use DNS internally at all. It was always bare IP addresses, for speed. And given Consul's proclivity for read-induced leadership transitions, we were a little concerned about what adding millions of queries per second across hundreds of machines would do. So we did a few things to mitigate this. First of all, we installed Dnsmasq in front of Consul so we wouldn't be querying Consul directly.
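Here is the Dnsmasq side of that setup as a rough sketch: forward only the .consul domain to the local Consul agent's DNS port.

```bash
# forward only the .consul domain to the local Consul agent's DNS port
echo 'server=/consul/127.0.0.1#8600' | sudo tee /etc/dnsmasq.d/10-consul
sudo service dnsmasq restart
```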
Dnsmasq intercepts anything with a .consul domain name and directs it to port 8600, which is the port Consul listens on for DNS requests. Secondly, we added a short 10-second TTL to all of the services Consul knew about; Consul's default TTL is zero, and that's just a little too quick for us. We also looked at creating a hosts file based on the contents of Consul's service discovery database. OK, I guess that's the end of that slide. So we looked into our existing bag of tricks around Consul, grabbed consul-template, another HashiCorp product, and started with that. Our plan was to build a hosts file on every node and load it into Dnsmasq directly. It seemed like the most straightforward option, and it would have been great. But even in our staging environment it was chaos; it looked a little something like this, but with everything flying and no little girl. When everything was updating, when each node was querying Consul's service catalog, it was putting so much read pressure on Consul's data store that we had pretty much eternal leadership transitions. Multiply the nodes by the number of services and the number of records for each service and, yeah, it was not feasible. To be clear, again, this is not consul-template's fault either, and as a side note, Consul 0.6 fixes this. But here we were, Consul 0.6 was still a twinkle in HashiCorp's eye, and we couldn't wait until December.

So we got to thinking: let's build this hosts file on one node, use the KV store to distribute the file to all the nodes, and use one of those nifty watches to write it out to disk on the receiving end. And it worked. It worked really, really well, actually. We weren't seeing any problems at all; there were no transitions, there was nothing, it was super stable. We did very quickly find out that, without rate-limiting that process, reloading all of your Consul agents for whatever reason leads to services dropping in and out of the catalog. And if you're dealing with an automatically generated hosts file that contains all the nodes, then over a 30-minute Chef run that's about 20 nodes per minute dropping in and out of the catalog, which meant that about 40 times a minute this hosts file was being regenerated, sent to the 600 nodes we had at the time, and written out; it was doing that every one and a half seconds. That was pretty stressful for me personally as I watched it, so we made sure to enable some rate limiting.

But there was one thing we noticed during this entire process, one thing that surprised me: there wasn't a single leadership transition during this entire time. Even while we were sending this roughly 40 KB file to 600 nodes every one and a half seconds, there wasn't a single problem; Consul didn't crack under the pressure. Whereas, on the other hand, with our Consul config repo, every single time we made a config change and Consul updated, we had one or two leadership transitions. It wasn't a big deal, but it happened every single time. So this was an important clue for us about how to handle Consul. The very next day, this anonymized commit, obviously to protect the blameless, went into our configuration repo and said: "JSON files are now pretty and standardized." Somebody had taken it upon themselves to lint and clean up some inconsistencies. And that's great, most of the time; it's normally a really good thing. But not this time.
The next couple of hours were spent valiantly battling Consul and git2consul. git2consul would grab the changes and try to load them into Consul; the first few keys would get updated, the watches would fire, everyone would try to grab the data at once, Consul would crash, and then git2consul would crash. It was tons of fun. It was way too much read pressure, and again, we were hitting that limitation. In the end, we had to disable Consul on a whole bunch of nodes and then let Chef restart them slowly over the next 30 minutes. So what did we learn here? Well, one of the first things is that it's very hard to be one of the early adopters of a new distributed system. There's no Consul Stack Overflow where you can go and steal some magic incantation that fixes everything. At the time there was very little real-world information available other than the docs, which are great, but we wanted to know how other people were using and deploying this. Another thing we figured out around the same time, and something I had suspected for a little while, is that we were sort of doing it wrong. Blamelessly, of course. We had about a hundred tiny keys, and those tiny keys were being read at extremely high velocity across the whole cluster. Essentially, we were DDoSing ourselves. This was not the sort of problem that showed up in staging, of course; staging didn't have enough nodes, so we didn't know this was the problem until it was a little too late. But just to remind you: Consul's servers have a coup d'état when the leader just can't keep up. It can't keep up because it's doing all of the things all at once. Bigger CPUs can do all of the things, and more of the things, at once, and we're in the cloud at Amazon, and what the hell, we had just raised a Series C. So we went to bigger nodes. We upsized our server nodes, deployed the new and bigger servers one by one, put the old ones out of their misery, and it was good. Things were beginning to be made right again. The transitions were still happening, but nowhere near as much.

We just took a little detour into leadership transitions, and you might be asking: how did the Dnsmasq and hosts file approach go? What happened with that part of the story? Well, since I know you're all really interested: it worked really well. You might be able to see it above, or below, I guess. We added an additional hosts file and gave it a 10-second TTL. And Dnsmasq is one of the few things in the known universe that actually seems to honor a DNS TTL properly; it works great with Consul. In testing, you can see it forwarding only the first request, serving the cached answer for the next 10 seconds, and then going back to Consul to re-resolve. It's pretty nice to see software that actually respects a TTL. And it's quick. I'm not a DNS guy, but I was trying to measure with dig and I was getting zero milliseconds, which is obviously pretty hard to measure. So I built something to query for me: for items Dnsmasq forwards to Consul, on this node, it takes between 600 and 700 microseconds to get an answer back. When Dnsmasq is serving the cached answer, it takes between 100 and 200 microseconds, and for things Dnsmasq has in a hosts file, again between 100 and 200 microseconds. Fast enough for us at the moment.
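If you want to poke at this yourself, dig only reports whole milliseconds, which is why I kept seeing zero; something like the following gives you the resolver path and the reported query time. The service name is made up.

```bash
# dig only reports whole milliseconds, which is why the answer looked like zero;
# the service name is made up
dig @127.0.0.1 my-service.service.consul +short
dig my-service.service.consul | grep "Query time"
```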
I added some metrics generation around a Dnsmasq log watcher on one of our nodes last week, and this single machine, one of our thousand-ish, is constantly doing about 20,000 Consul DNS requests per minute. So clearly the Python libraries are not respecting the DNS TTL, but they're querying Dnsmasq, so I don't really care. Because of the multiple responses, that's about 50 to 60 thousand responses per minute. Querying Consul directly on our system would be mayhem; a problem, to put it mildly. Take this and multiply it by a thousand and, yeah, it just wouldn't work. So even though we'd been having some success with Consul, we were still sort of tiptoeing around, and some people were freaked out. We were only at 600 nodes at the time; what would happen as we grew even further? We had already doubled in size, and internally there were some concerns as we scaled; some people didn't trust that Consul would be able to keep up. So a parallel service discovery app was written and deployed beside Consul. Unfortunately, it's still there; I'll get to that at the end. Some people were certain the apocalypse was coming, and honestly, I wasn't sure either.

The next month, this was in August, we hit 700 nodes, and it seemed the fear was warranted, as all of a sudden nodes started randomly going deaf and mute. They couldn't talk to the servers. They couldn't see the updates we had placed in the key-value store. They lost their Consul locks, so their services disappeared, and for two of those services, Consul was the only way to find them. Yeah, that was not good. It was bad. Bouncing Consul on those nodes usually fixed the problem, but that gets pretty tiring after five minutes. And the biggest problem was that I couldn't duplicate it reliably, though I could see the grumbling in Slack whenever it happened. Shortly after that started happening, I was in New York for the week (I work from Canada most of the time), and I heard someone mention, verbally, that it was deaf again. So I immediately went into Brendan Gregg mode, well, at least partial pseudo-Brendan-Gregg mode, and I was finally able to duplicate it reliably. And then it cleared up. So I did what we normally do at Datadog and made some graphs. I wrote some code to watch the URLs that were exposing the problem, and I waited. Those graphs, and a large amount of Wiresharking that James from HashiCorp, who is sitting right over there, ended up reading a lot of (thank you, sir), helped us track down what was happening. A server node, or nodes, was losing its connection to the leader, and as a result, the agents that were talking to that particular server were going deaf as well. HashiCorp quickly found and fixed two deadlocks in the multiplexing code underneath it, and there was also another bug that spun up hundreds of additional connections per node, which was proving to be a bit of a problem. But it was still happening. The last puzzle piece dropped into place when James, again, that guy, many beers, asked me about the Xen AWS Linux bug, the "Rides the Rocket" bug, the old Quake reference, which we thought we had fixed. For whatever reason it was still there, still happening, and it was interfering with the server and agent communication: they were dropping packets that just never got read.
So a couple of days later, that bug was totally eradicated from our systems and prevented from returning through judicious use of Chef-fu. I held my breath, the GitHub issues that our team had filed internally remained open, and the parallel service discovery app kept running alongside Consul. But the tide had turned and the sentiment was trending in the right direction. That takes us up to October, and we're fluctuating between 800 and 900 nodes at this time as we're retiring some services and adding some new ones. Kafka is one of the most important systems at Datadog. All the data that we ingest goes into Kafka, and all of the consumers read from Kafka to do all the things with all the metrics. So in October, we used Consul and confd to completely swap out our primary Kafka cluster without so much as an external peep that entire time. It was pretty cool. We also started using Consul exec. And, sorry, Consul events. Consul events are sort of like Consul exec but with predefined actions: you can't just do whatever you want, you do the things you've already predetermined. So that watch right there, what it does is wait for an update event and determine whether this apt-update event is new. And if it is new, not an old one, then it actually runs apt-get update. In our old cluster, using Capistrano, it used to take between 20 and 30 minutes to do an apt-get update. And that was only if the node list was exactly the same when you started the update as when it got to each node; if a node in there had been removed, it would crap out and you'd have to do it all again. Now, it takes between 60 and 90 seconds to run an entire apt-get update on over a thousand nodes in production. Our staging environment takes about 10 seconds now. We were also using Consul lock for a number of processes. And I've got to take a drink, sorry. With Consul lock, you can run a highly available application and have a hot spare step in if something happens to the currently running application. So we normally run three instances of these jobs with only a single one running; when that one decides to take a rest, or crash, or do whatever it feels like doing, one of the remaining processes will automatically take over to make sure that there's always at least one of them running. It works pretty well and we have a few of these at the moment. git2consul runs under this, so it's always pulling in the Consul changes, and our DNS host file generator runs under this as well. Here's an example of an Ubuntu Upstart script, there we go, that works with Consul lock. It's pretty easy to get quite a bit more reliability out of an unreliable application that you know is going to crash all the time. When the leadership transitions grew, we again bumped up the size of the server nodes one more time, because that's what you do with Consul 0.5 when you get a lot of leadership transitions. Now, as you can see at the very bottom right, we see a leadership transition every one or two days and that's it; it's pretty awesome. And this is one time where money did buy happiness, at least my happiness.
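The Upstart script on that slide doesn't survive into a transcript, so here is a rough reconstruction of the pattern, with a made-up job called my-worker standing in for the real process; `consul lock` only runs the child command while it holds the lock under the given KV prefix, so the same script on three nodes gives you one active copy and two hot spares:

```
# /etc/init/my-worker.conf -- hypothetical Upstart job wrapping a process in consul lock
description "my-worker, held behind a Consul lock"

start on runlevel [2345]
stop on runlevel [!2345]

# If the wrapper exits (lock lost, process crashed), start it again
respawn

exec consul lock locks/my-worker /usr/local/bin/my-worker
```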
So one important note is that we had two small outages last year related to Consul, but at no time during this entire period was it ever Consul's fault, nor was there ever something that happened to Consul that we couldn't explain. The first was in March, when we had a three-minute outage caused by a packaging/Chef problem, and that was kind of annoying, but it was quick and it was done. We had one outage in July that was related to somebody, who again shall be blamelessly nameless, who restarted all the server processes at the same time. And that is unfortunately one of the big no-nos with Consul: you don't do that, just don't, that's a pro tip. So Consul has been super solid this entire time. In fact, there's been more than one time where some kind of network partition has happened in US East — you know, it's Amazon, you never know what's happening, and the status page is always still green, of course. There's been more than one time where this has happened, where either Kafka or ZooKeeper or Cassandra or something has gotten screwed up and we'd have to intervene. We've never had to intervene with Consul; we just break it ourselves. As a side note, if you want to come work on these sorts of fun problems, we're hiring, so make sure you come chat with me or go to the website; just let me know. So here we are at scale. It's January 2016, and what can I say that we've learned over the last year working with Consul? Well, no matter what you heard me say during this presentation, Consul is awesome. It acts like an incredible data center backbone that helps you scale your operations by having these helpful primitives available to you: orchestration tools, a persistent local data store, service discovery. I'm a total fanboy, and if you can't tell already, I love it. I promise you I don't get a cut or a kickback from HashiCorp if you run it, but you all should if you have a need like this. Even the Hoff loves it; what's not to love. So, monitoring as you deploy it is not optional. It really should be a no-brainer. I happen to know a few people who might be able to help with that, just saying. If you're starting out, make sure you start on 0.6, and if you're not there yet, find a way to upgrade. It's really not that hard; I did it a couple weeks ago in an afternoon. The HashiCorp team took a lot of community feedback, and a lot of the bugs that I'd been filing for the last couple of years, and ended up totally rewriting the storage back end. Version 0.6 has completely solved our read velocity issues. That's not to say it never has a leadership transition, but it's not all the time; in fact, like I said, it's once every couple of days now. They also fixed a number of bugs and added a whole bunch of new features, ones I haven't even touched on today. The new client binary takes about a third of the memory from before; from the 15 to 60 it was at, it's now quite a bit lower, stabilized at about 20. The servers take about a quarter of the memory, and it's a little bit harder to see, but that red line is when I finished upgrading all the clients and all the server nodes, and now it sits around about half a gig. It moves up and down a little bit, but it's nowhere near the four or five gigs that it was before. When's the last time a software upgrade actually used less memory, right? Consul 0.6 is the bomb, there's no question. So Consul servers really love a sizable CPU, so make sure you feed them the right size of machine. If you're in the cloud, just make sure to upgrade your server nodes until you don't see leadership transitions anymore.
If you're not in the cloud, then get your purchase order in early and get yourself some new machines racked, because it'll be a while. I have some example sizing that I did with 0.5, so you might actually be able to get away with smaller nodes now, because Consul 0.6 is just that much more efficient, and as always, your mileage may vary depending on how you use it, how many services you have, all those things. One thing that's been emphasized through this whole process at Datadog has been to architect for failure better. Consul is a distributed system, and so you don't have the luxury of having everything on one node. Connection problems will happen. Nodes will connect and disconnect. Add retries to your connection routines. Add exponential backoff and circuit breakers. All these things will make your stack more resilient, and that's obviously not a Consul-specific recommendation, but it's something that was really pounded into us when we were dealing with last year's shenanigans. The next one's fairly self-explanatory: if you're inducing constant leadership transitions with the velocity or volume of your reads, you need to upsize your server nodes and/or change how you're doing it. For example, if you're running an app server, and the app server spins up multiple processes, please don't make each of those processes read from the KV store on the same node. That's bad. If you can avoid having each process do that, it's a pretty big win. Make efficient use of those connections and Consul will go much further. If you've got a lot of machines reading all of your keys at once, you might end up having a lot of pain like we did. One thing we're looking at trying now is using fewer and larger keys — not anywhere near the 512-kilobyte maximum, but having lots and lots of really small keys is really the root of a lot of our problems. One thing we decided early on was to lock down the key-value store using ACLs, so that changes were only coming from places that we knew and feeds that we could audit and examine. If your configuration can induce behavior change — and, well, configuration is supposed to be able to do that — then having it available as a free-for-all is a bit of a nightmare. Lock it down and make those configuration changes auditable. Consul watches are super powerful, and they're how we distribute the data in the key-value store, but they can fire a bit too often. So I have a small Go utility that I built on a weekend that we've been using to make sure a watch handler only fires on a new event. It tracks the state at the time the watch last fired, and if it's the same event, it doesn't fire again. It's called Sifter and it's available on GitHub; we use it in production at Datadog. Pretty nice. My last tip is pretty key: if your output isn't unique and you can get away with it, don't build your configuration from Consul data or services on every node. Build it on a single node and then use the KV store as the transport mechanism.
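As a tiny sketch of that last tip, assuming a made-up key config/myapp and a hypothetical handler script, the "build once, transport through the KV store" pattern needs nothing more than the KV HTTP API on the build node and a key watch everywhere else:

```bash
# On the one node that builds the config, push it into the KV store:
curl -s -X PUT --data-binary @/tmp/myapp.conf \
  http://127.0.0.1:8500/v1/kv/config/myapp

# On every other node, a key watch delivers it (the handler name is made up):
consul watch -type=key -key=config/myapp /usr/local/bin/write-myapp-conf.sh

# write-myapp-conf.sh would decode the watch payload and write the file, e.g.:
#   jq -r '.Value' | base64 -d > /etc/myapp.conf
```

In practice you would run that watch from the agent's configuration or under an init system rather than by hand; this is essentially the job that the tool described next formalizes, with safety checks added.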
Now that's my last tip, but related to that last tip, I have one more thing today. The last year has been a huge learning curve for us working with Consul, and transporting configuration data using the KV store has been a very pleasant surprise. It's fast, it's stable, it's reliable, and so we've been steadily revising how we do it. As with many things, the first version was written in Bash, the second version was written in Ruby, and the last version is written in Go. It's been in production now for a couple of months. It operates tens of thousands of times per day for us, and it's been really fun to develop. It handles the delivery life cycle: inserting the configuration data into the key-value store, as well as extracting and delivering it on the other end. We also have a Chef cookbook that helps you create the Consul watch that's needed to most efficiently distribute the config files. We call it KV Express, and we're making it available today. It's a tiny Go binary. Again, it handles the inserting and delivering of the files. We're a metrics company, so it emits metrics and events just so you can audit and measure what's happening. The file that's uploaded is the same as the file that's delivered: we compare hashes to make sure it makes the journey intact, and we only use the finest of all the hashes. It's efficient: if a file doesn't change and it's about to be reinserted, it stops and doesn't reinsert it; if Consul thinks the watch needs to re-deliver it, it checks whether it's the same file and doesn't re-deliver it. It's optimized for safety, because we don't want to write a blank file — that's generally pretty bad for whatever service depends on that config file. And you can also run commands after the file is delivered. It's also super fast. Once it triggers from a Consul watch, it's under 500 milliseconds to deliver a 40-kilobyte file to 1,000 nodes. The stats are weighted most heavily under about 300 milliseconds, but there are always stragglers that make my histograms and heat maps look terrible. I originally tried to measure the entire process from start to finish based on syslog timing, but everything was happening within a single second, so that wasn't helpful. So I added some higher-precision logging: inserting into the key-value store usually takes about 100 milliseconds or under, and that's on the left; delivery from the key-value store usually takes under 300 milliseconds, and very often much closer to 100. When it inserts records into the key-value store, it can throw an event to Datadog so that we can audit what's happening. You can see on the bottom we added some Postgres node somewhere, and on the top there's a Kafka node getting removed as well. These events can be shown on a timeline, they can be graphed, they can be measured, you name it. Again, because we're a metrics company, we emit a ton of metrics: the size of the file, the lines in the file — again, zero-line files are bad — when it's firing, how often it's firing, how long it takes. We also emit other things like panic metrics, not-long-enough metrics, checksum mismatches, things like that.
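KV Express itself is the place to look for the real implementation, but the core safety check just described — hash the payload, skip identical content, never deliver an empty file — is small enough to sketch. This is a simplified illustration in Go, not the actual kvexpress code:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"io/ioutil"
	"os"
)

// deliver writes data to path only if it is non-empty and its SHA-256
// differs from what is already on disk. It reports whether a write happened.
func deliver(path string, data []byte) (bool, error) {
	if len(data) == 0 {
		// Never clobber a config file with an empty payload.
		return false, errors.New("refusing to write an empty file")
	}
	newSum := sha256.Sum256(data)

	if existing, err := ioutil.ReadFile(path); err == nil {
		oldSum := sha256.Sum256(existing)
		if bytes.Equal(oldSum[:], newSum[:]) {
			return false, nil // identical content, nothing to do
		}
	}

	// Write to a temp file and rename so readers never see a partial file.
	tmp := path + ".tmp"
	if err := ioutil.WriteFile(tmp, data, 0644); err != nil {
		return false, err
	}
	return true, os.Rename(tmp, path)
}

func main() {
	changed, err := deliver("/tmp/example.conf", []byte("key = value\n"))
	if err != nil {
		fmt.Println("delivery refused:", err)
		return
	}
	fmt.Println("file written:", changed)
}
```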
And I have a very, very quick demo; it should take just a couple of minutes. I originally tried to make a video of the demo from start to finish, and unfortunately, even for me, it was super boring, because it's kind of one of those things that happens in the background. So what I'm going to do today is clear that and show an ad hoc usage of KV Express on our prod cluster, so don't tell anyone. In addition to using it in watches to deliver things, sometimes you might need to deliver a config file to a bunch of nodes quickly, right? Because you're having an outage, you need to change something, and Chef is busted or something. So we also have what we call ad hoc usage, where we've set aside a temp keyspace in Consul that we can use to write to. So I'm just going to cat my demo script. And here, I'm just going to sudo. This command is the "in" command: it says kvexpress in. I'm feeding it a URL, and it's just a 600-line config file I grabbed from the SaltStack website this morning. And I'm going to insert that into the scale14x key in the temp prefix. When you do it with a URL, it automatically shows the output, just to make sure that the thing you put in is the right thing. And if we take a look here with a KV listing of the temp prefix, we can see it's there; there's the SHA-256 checksum for the text file that we grabbed. So what I'm going to do here is take this file, and this is a consul exec — again, one of those scary things that you sometimes get to use. What we're doing is saying: for the service datadog, run kvexpress out; -d sends out the metrics; again, looking at the right key in the right prefix; and we want to write temp/scale14x.conf. And the -e at the end runs a command at the end of the process. So if it's a config file for Apache, you might want to do sudo service apache restart or something like that. In this case, I'm just going to correct the file name first so that it's not left lying around, and I'm going to change the name, because I think I did this a few times testing already today. Like I said before, if it's the same file with the exact same text, it won't do it. So that's going to go across 1,100 nodes — almost 1,200 nodes — and it's done, right? So it's kind of fun. Before, we used to do that with Capistrano; it would take so long to do everything in parallel that sometimes we would just stop everything and do it slowly instead. So, here — wow. So that's KV Express. It's available there as soon as I can click open. So it'll be available quite shortly.
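For anyone trying to reproduce the on-screen checks, on Consul 0.6 they go through the KV HTTP API (the consul kv subcommand came later); the prefix and key names here just mirror the demo, and the exact key layout kvexpress uses underneath may differ:

```bash
# List the keys under the temp prefix
curl -s "http://127.0.0.1:8500/v1/kv/temp/?keys"

# Fetch the raw value that was just inserted
curl -s "http://127.0.0.1:8500/v1/kv/temp/scale14x?raw"
```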
And I think we have time, a little bit, for some questions. Sir? Of the leaders, okay. So, encryption for secrets. We don't have any secrets in there at the moment, because there's no good way to do it right now other than maybe using something like Vault. So hopefully, now that this has calmed down and the presentation is done, I can start working on Vault for that. Vault is another HashiCorp product; it ties in with Consul, and it's probably a better way to do it. We looked at a few other ways, and there's just nothing really good. As far as the horizontal scalability of the servers: like I said earlier, you make them a little bit bigger until the leadership transitions go away. I don't see a problem with Consul 0.6. You could probably get away with three server nodes in production, depending on your use case. There are some people who have very little read traffic. We had a lot, and sometimes we were hitting the key-value store from 1,000 nodes, some of them with 10 processes at once, reading 100 keys. We're still doing that, and it's not breaking anymore under Consul 0.6. So it's great. It's ready for production, unless you have 10 or 20,000 nodes; I think at that point it starts to break down. Does that answer your question? Okay, anyone else? Oh, in the back, sir. Well, we're not saving any money, that's for sure. With those c3.2xlarges, we're not saving anything. The biggest problem was around that service discovery, and it really was that whenever we were trying to update anything in that environment file, it was taking two hours to get out. And that's too long. If we had a problem, the first thing people would do was, okay, stop Chef, everybody, right? And we would stop Chef and then roll things out manually, just because we didn't have enough time and no one trusted Chef. And Chef's great, and we use a lot of it, but in that 500 milliseconds versus 30 minutes or two hours, you have a lot more capability. Like I mentioned earlier, when we replaced our entire Kafka cluster, we wouldn't have been able to do that with Chef without a lot of pain, and it would have taken us a lot longer than the two or three days it took. So it just gives us the agility that we need, especially when things are hitting the fan. That's right. Sorry, sorry. You there. I know, that's crazy — we were writing ours at the same time. Now, Consul Template is great; we still actually use it to template stuff from the service discovery side. And the dedupe feature — Armon had a talk on Tuesday, and Armon's one of the original authors of Consul — essentially does a similar thing. The only thing that's different now is that we've integrated ours into our Chef workflow, so we use an LWRP: when it goes through Chef, it looks to see if there's a file there; if there's no file, it grabs it through KV Express and then writes the watch automatically. So we don't have to do any of the setup on every node for all of these watches, because it's handled through the LWRP. Essentially, they're both doing very, very similar things, and if that had been available in November when I was writing this, we might not have done it, but it wasn't, so. Sorry, sir. Oh yeah, we do all that. We use stale reads, we have all that stuff. We actually never found some of the consistency modes to make a very big difference in our environment. I might have been doing it wrong — I know I do a lot of things wrong — but we never found that to be a very big difference for us. We still have it on, but it really doesn't make a difference. Yes, some of the KV requests that we make are in that mode, yes. You know what? I never got around to timing it at the time. It was more just get it to work, because everything was burning, so yeah. Anyone else — sir, over here? Sure, so your first question, about ZooKeeper: Consul comes with all the service discovery built in. That parallel service discovery app that we wrote was based on ZooKeeper, because we already have a bunch of ZooKeeper clusters because of Kafka. But ZooKeeper doesn't have the DNS interface, it doesn't have the services; it's not built in — you have to handcraft it all yourself. And we like ZooKeeper, it's pretty good. But for this, we also wanted to be able to query it over HTTP and not have to have a really thick client, and that just isn't available for ZooKeeper as far as I'm aware. Your second question — well, part of that is we don't get to depend on stuff that only AWS offers. And we're doing about five million metrics per second; we've increased five times over the last year. A lot of the things that Amazon comes out with can't handle, right away, the volume that we're at at the moment.
So at this point — for some low-volume stuff I'm looking at using some Lambda stuff to throw data into Kinesis, for things like logging and some metrics that aren't on our critical path. But if Kafka doesn't work, we're screwed. So that's just kind of how it is. Sure, and it totally depends on your volume too. If you have a similar volume, then it might work, but again, for us, we couldn't at the time. Anyone else? Sorry, I saw you first, sorry. Yeah, you. No, sorry, that's 300 milliseconds from the time the watch fires on the local Consul agent to the time it's done writing the file and finished. So it's all in one data center. I actually wanted to measure, for this talk, how long it takes to write and then what the gap is before nodes start reading. The only problem was that with NTP I couldn't get sub-microsecond precision, so sometimes nodes were already starting before it was even done writing. I want to figure out exactly how long it takes, but it's so fast that I couldn't measure it yet. No, no, we're just in three AZs in US East at the moment. We have hundreds and hundreds of terabytes of data in S3 and in all sorts of other things, so moving to multiple data centers is pretty interesting for us. So — sorry, you. Yes. Yeah, that'd be a problem for this, for sure. We don't have that situation: we're still all on VMs and we have consistent ports. Containers haven't been an option because of the amount of data that we have; we have almost nothing that's stateless, so we can't use containers at the moment. And when we go down that route, we might have to use an overlay network or something else that gives you an IP address per container. But otherwise you'd have to use SRV records, which Consul supports as well; we just haven't needed to yet. Yeah, I think so. We haven't tried that yet. So — Brent, if there's another one — sir, over there. Yeah, our applications are not currently aware of SRV; they're not querying for that. We register with a small JSON file on each node, based on the role from Chef. So when we have a Chef role, it writes a specific JSON file for that role only, and the JSON file registers the service. Again, once the node gets killed, the service goes away. Is that an answer? Okay. Who else? Yes? Oh, one back there. And then, sorry, you, after that. Not yet — like I said, I just finished upgrading two weeks ago. We haven't, but we're looking into that because of the AZs and some of the weird stuff we've seen between them. So we're going to try it, but we haven't yet. Sorry. No worries. No — so on my octohost project, which is basically a very tiny (terrible, not terrible, but very lightweight) Heroku that does none of the things Heroku does around state, I register with an API call and point to those containers. So you can do it that way as well. We just chose to do it with these JSON files because of the stability of our nodes and Chef. But you can do it with API calls, and it's no problem to say, here's the port, here's the IP, and you're good. Did that answer your question? Yeah, you could.
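Whichever way you register — a dropped JSON file or an API call — the payload is Consul's standard service definition. A hypothetical file that a Chef role might write for a Postgres node, placed in the agent's config directory (for example /etc/consul.d/postgres.json), could look like this; the name, port, tags, and check are illustrative rather than Datadog's actual definitions:

```json
{
  "service": {
    "name": "postgres",
    "tags": ["primary"],
    "port": 5432,
    "check": {
      "script": "pg_isready -q",
      "interval": "10s"
    }
  }
}
```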
I know, I do that: I host probably 60 containers on one box, and all the config is built out of the Consul service discovery catalog, which is registered via an API call. So — oh, sorry. Sorry, I didn't totally understand that. We haven't integrated with it, but I know there is some movement around that. I know HashiCorp is building their own Docker image, and I know there is some integration; I just haven't used much of it yet. Sorry. Thanks for having me, everybody. And if you have any questions, let me know.

Test, test. All right, it's all working. Well, thank you all very much for coming. I'm Parker Abercrombie, and I'm a software engineer at NASA's Jet Propulsion Lab. Today I'm going to be talking about how NASA uses the cloud to enable a virtual reality Mars. The project that I'm going to be talking about is called OnSight. OnSight was developed as a collaboration between JPL and Microsoft, and it allows scientists and engineers to work on Mars through the power of virtual reality. The way this works is that the user puts on a virtual reality headset — we use the HoloLens device from Microsoft — and then they see Mars around them in their office, and they can walk around and explore the scene as if they were actually there. The reason we do this, beyond it just being fun, is that you can get a different sense of the nature and the scale of the Martian terrain by looking at it in an immersive 3D experience than you can get by looking at a 2D image on your computer screen and trying to reconstruct the 3D scene in your head. Our users are scientists who work with the Curiosity Mars rover, and they're mostly geologists. When geologists study a place on Earth, they actually go out in the field and walk around and look at the rocks. So we're trying to give them as close to that experience as possible for Mars, using the technology we have available today. To make this experience possible, we needed to create this virtual reality scene of Mars, but the rover moves every day, and we wanted this tool to be useful operationally. So we needed not just to do this once; we needed a way to create these scenes easily and automatically as the rover moves and new imagery is downlinked. The OnSight team built a custom image processing pipeline that takes stereo images from the Curiosity rover and builds a 3D reconstruction of the terrain around the rover. And then we built a cloud architecture that lets us run this automatically in the cloud as soon as new data comes down. So this is a cloud talk today. I'm not going to say much about virtual reality, and I'm not going to get into much detail about image processing, but I am going to talk about how we've used cloud computing and open source technology to run this process automatically when new imagery comes down and push it out to our users. By the end of my talk, I hope you've learned a little bit about how we've used the cloud to solve our problems, hopefully you'll get some ideas that might apply to your own problem domains, and I hope you'll be excited about space exploration. So the data that we work with comes from this instrument, Curiosity. Curiosity is a rover on Mars. She landed in August of 2012, so she's been on Mars for about three years now. And to give you a sense of scale, this thing is about the size of a small Jeep, so it's pretty big. Curiosity has a lot of different instruments on board. The ones that we work with primarily are the stereo cameras on the rover mast here.
You can actually kind of see the two eyes of the stereo Mastcam there. As the rover drives, it takes these pictures and sends them back to Earth, and then we take them and process them into this. What we're looking at here is a 3D reconstruction of a scene of Mars along Curiosity's drive. And I want to point out that everything you're seeing here is real. This is all real imagery sent back by the rover. There's no artistic retouching, and this is produced completely automatically, with no human in the loop. In the front here, you can actually see a little bit of the rover's shadow that was captured in the imagery, because the sun was behind the rover when that picture was taken. If anyone follows the Mars Science Laboratory mission, this mountain coming up in the background is Mount Sharp. This is Curiosity's final destination. And right now, the rover is exploring dark dunes that will be coming into the scene in just a second here. There's this band of dark dunes at the base of Mount Sharp, and that's where the rover is today. So this is where we want to get: we're going to take the stereo images and make a scene that looks like this. But before we do that, let's back up a step. This is Mars as seen from orbit. The Curiosity rover is exploring Gale Crater, which is the yellow dot here. We'll zoom in a little bit. The blue path here is Curiosity's traverse, so everywhere you see the blue path is somewhere the rover has driven, or stopped and taken images. What we're looking at here is Mars as seen from above, from an orbiter — the Mars Reconnaissance Orbiter. It has a number of cameras and sensors on board that produce these orbital maps of Mars. This imagery is about a quarter meter per pixel, which is fairly good as orbital imagery goes, but if you're going to put it in virtual reality and stand on it, a quarter-meter pixel is pretty big. So as Curiosity drives, it takes these images, and then we take those and build these 3D scenes. And we need to do this for basically everywhere along the path. Everywhere the rover stops and takes pictures, we consider a scene, and we'll build a reconstruction of that part of Mars. At some point we would like to link these all together into one kind of super-scene of all of Mars, but we're a little ways from that today. I'm not going to get too deep into image processing in this talk, but I'll give you at least the high-level version of what we do. We take the stereo images and, using a stereo correlation algorithm, we derive a range from the camera position to each point on the terrain. From those ranges, we can create a point cloud, and from that point cloud, we can do a surface reconstruction that gives us a mesh geometry for the scene. Then we take the images from the rover and paint them back on top of the mesh to get the fully textured mesh. You can see a little bit of that process here, where in the background — it's kind of hard to see on the projector — you see only the wireframe mesh geometry of the scene. And then as you get closer in, you can see how the mesh and the texture start to interact, until you have the final product appearing next to the rover's wheels. So when we're doing this, there are a couple of different types of imagery that we need to combine, because we don't have high-resolution Curiosity imagery of all of Mars, unfortunately, especially places where the rover is just arriving.
So the main types of images that we work with are black-and-white images from the rover's navigation cameras, which give fairly low-resolution grayscale imagery, and color Mastcam images from the high-resolution science camera. And where we don't have either of those types of imagery, we fall back to orbital imagery from the Mars Reconnaissance Orbiter. So for every part of the mesh we'll have some type of imagery, and some is better than others. There's an element of sensor fusion in how you stitch these things together so that you're using the best possible data for each part of the mesh. Sir — it's an in-house pipeline. So there's an element of sensor fusion in combining these things into a good-looking final product. For the purposes of my talk today, I'm going to treat that as kind of a magic black box where images come in, magic happens, and a textured mesh comes out the other end. And once we have that textured mesh, we can load it into the OnSight software running on the HoloLens and look at it in virtual reality. So we've been running this pipeline for about a year now. We've processed several hundred scenes all along the rover's drive. The size of the scenes varies quite a bit, depending on how much input imagery is available. In a place the rover has been exploring for a while, where we have a lot of images, we might have several thousand; in a place we've just arrived, we may only have a handful. A typical scene is about a thousand images, or about five gigabytes of data, and over the course of our processing we'll crunch that into a couple hundred megabytes of mesh. That process takes a couple of hours running on a single node. So when we started this project, we developed the software just on our development workstations, and we would do the simple thing you would expect: the input files go in one directory and the output files go in a separate directory. When we went to run it, we'd copy the appropriate input files to directory A, hit go, wait a couple hours, then grab the results from directory B and put them where they actually needed to be. That worked pretty well for development, but it obviously doesn't scale very well when you move into operations. So that's what led us to port this into the cloud. At the high level, we have a batch processing problem: data comes in, we need to detect when that data is available, we need to pull it down, run this long, resource-intensive task that produces some output, and put that output somewhere our users can get to it. We want to be able to see what's happening as the system does all of this, to start new builds, stop builds that are running, and see the status of what's currently running. And we have a bursty workload, where we get downlink from Mars on a roughly daily basis. So downlink happens, we need to spring into action, grab the new imagery and build a scene, and then we go back to idle until the next downlink. We don't want a lot of expensive computing resources running while not doing anything and costing money. So we built a system that we run in Amazon's web services cloud to solve this problem. And here are some of the Amazon services and open source technologies that we're using in the system. We use the Jenkins continuous integration system both to compile our image processing code and run tests, and also to actually run the image processing jobs.
So we can treat them as if they were compilation jobs. We use Ansible to deploy and configure our Linux servers. We use the LoopBack Node.js framework to expose a data API on top of our database, and we use LoopBack, AngularJS, and Bootstrap to create a dashboard view of the system. I'll go into a little more detail on how we're using all of these. So here's the high-level schematic view of the system. Things really start on Mars, where Curiosity sends data down to Earth; that goes into the mission data system, it's cataloged there, and that's where we pick it up. Our system falls into three main pieces. We have a build manager, whose job is to find out when new data is available, index that data into our database, and request that builds start. We have a build cluster, with a master node and a fleet of worker nodes that actually do the terrain builds. And we have a distribution system, which stores the results and pushes them out to our end users. So, first thing that happens: data comes down from Mars to Earth. The build manager periodically polls for new data, and as soon as it finds some, it indexes it into our database. Then it requests that the build cluster start a new build. The build master, Jenkins, will find a worker to run the build, creating a new one if necessary. That worker pulls the data it needs from the mission data system, runs the job, and pushes the results out to our S3 bucket. Then it notifies the build manager that a new build is completed. And then our users, the next time they launch OnSight, will pull down the new data and see the latest location on Mars. So here's how Amazon's web services fall out in our infrastructure. I imagine most people in this audience are probably familiar with these, but I'll very briefly introduce them in case anyone's not. In the middle column here, these are all EC2 instances, so these are all virtual computers running in the cloud. The build manager and Jenkins are medium-sized instances, and they're long-running; they're pretty much always available. The workers are the beefy machines that actually do most of the image processing, and we treat them as disposable resources: we create and destroy them all the time based on our workload. We use SQS, the Simple Queue Service, to communicate between the build cluster and the build manager, and within the build cluster itself; we have a couple of queues set up for different types of communication. We use the Relational Database Service, which is Amazon's relational database as a service — a kind of veneer over a relational database — to store information about the available source data and completed terrain builds. We use S3 to store the results of our builds. And CloudFront is Amazon's content distribution network: it takes the files we put in S3 and pushes them out to data centers that are geographically closer to our end users, so they can experience faster download times. So, this being a Linux conference, I'll also talk about operating systems. Our build manager and Jenkins run on Ubuntu Linux, and our worker nodes actually run on Windows. So it's a heterogeneous system where we run both Linux and Windows machines. And to communicate between those machines, Jenkins helps us; Jenkins is pretty good at sending jobs out to the Windows machines.
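The SQS messages mentioned a moment ago are just small JSON payloads. A hypothetical build request from the build manager to the build cluster might look something like the following — the field names and values are invented for illustration, not the project's actual schema:

```json
{
  "type": "build_request",
  "environment": "production",
  "scene_id": "sol-1154-mastcam",
  "requested_at": "2016-01-23T18:02:11Z"
}
```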
And for other data, our usual solution is to use a simple queue if it's a small amount of data, or to push results up to S3 and then pull them down on the other end. As long as it's not a terribly huge amount of data and it's all within Amazon's data center, it's pretty easy and it works. All right, so I'm going to go through each part of the system in a little more detail, and I'll start with the build manager. The build manager's job is to discover when new data is available, orchestrate the terrain reconstruction jobs, and present a dashboard view of the system. It's also the interface between the rest of the system and the database. It exposes a REST API — we used LoopBack, a Node.js framework, to build a REST API on top of our database — and the build manager exposes that to the rest of the system, so nothing else in the system needs to know anything about the database. We also have a dashboard view that we built using AngularJS and Bootstrap, and I'll demo that now. So this is our build manager dashboard, created with AngularJS and Bootstrap. At the top we have some summary statistics about the number of scenes we've built and the number of builds of those scenes. I can see the status of the cluster; at the time I recorded this, there were two nodes running, three allocated, and one build in the queue that hadn't been allocated to a node yet. Moving down, this chart shows me the trend over the past month of different builds, and as I mouse over, I can see a little bit of information about each of those builds. The table on the side shows me the status of the build queue. When I recorded this, there were two builds running. You probably can't read that, but it gives me the time they've been running. Moving down, I have a table that shows me recent build failures. I'm happy to say there aren't any in the last two months. This table tells me a little bit about each of these builds; I can click through to view the log and see why it failed. And we built a feature to acknowledge failures, so we can track whether a failure was something we had actually looked at and determined was not a problem, or fixed, or something that still needed to be reviewed. This next view shows me all of the builds that have completed in the system. Each row in the table represents a reconstruction of one place on Mars and gives me some information about that build: where it was centered, whether it succeeded. I can enable or disable a build, which means that if it looks like something broke, I'll disable it so our users don't see it. Fortunately that doesn't happen very often, but it's handy when it does. Builds can be deleted from the system, and the table is searchable if I'm looking for a particular build. We also generate these preview products as part of the reconstruction process, so I can click into any of these and preview what was actually built and see how it looks without loading it into a 3D viewer. In addition to the still-image previews, we also generate these fly-through movies that show a little bit of the 3D reconstruction. In these video preview products it's just a camera that pans around; in the actual OnSight experience you can walk around. So the question, for the recording, is what happens if you walk toward an area where you have no imagery? We have imagery everywhere, but it may be orbital imagery, so it'll just get low resolution.
Okay, so to store the data about source images and completed builds, we use Amazon's Relational Database Service. The reason we use RDS is that we get automatic snapshots. We're already running in Amazon's cloud, so automatic snapshots that can be very easily restored are a nice thing to have, and it doesn't really take any maintenance. As the back end of this, we use a MySQL database. We did consider using a NoSQL solution when we were setting up the system and decided not to, because databases aren't really our problem in this application. The amount of data that we're tracking is relatively small and is pretty easily handled by a single traditional relational database. On top of our database layer, we built a REST API, as I mentioned earlier, using LoopBack. LoopBack makes it easy to add a standard REST API and features on top of a data source, and it plugs into different database back ends. So this gives us a little bit of database independence: if we decide to swap MySQL for a different technology later, the changes are isolated to just this one layer. Other parts of our system don't have to query tables in MySQL; they can just use these REST endpoints to GET or POST terrain builds or search with filters. And LoopBack lets you add logic in JavaScript as hooks that run when your tables are updated, so you can add business logic at that level. Another Amazon service that we use in the build manager is CloudWatch log management. This is a way of pushing logs from your servers up into Amazon's cloud so you can see them and process them from the AWS console. I think this is nice because, if something's going on with my server, I can log into the AWS console and look at the logs instead of having to SSH into that box and then try to remember where the log file is. And once they're in AWS, they can be filtered, they can be pushed to the Elasticsearch service if you want to do more in-depth analysis, and you can set up alarms based on keywords that appear in your logs, such as errors. This has worked out pretty well for us. It's pretty easy to set up on Linux; it's just a service that you install on your machine, some scripts from Amazon. I know it can be done on Windows; I've not done it myself on Windows, so I can't speak to that.
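For reference, the Linux agent's configuration is a small INI-style file; a hedged example (the log path and group name are placeholders, and the real file also has a [general] section pointing at a state file) looks roughly like this:

```ini
# /etc/awslogs/awslogs.conf (excerpt)
[/var/log/build-manager/app.log]
file = /var/log/build-manager/app.log
log_group_name = onsight-build-manager
log_stream_name = {instance_id}
datetime_format = %Y-%m-%d %H:%M:%S
```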
So the next piece of the system is the build cluster. The build cluster has a master node which is running Jenkins. Jenkins is a continuous integration system, similar to tools such as Bamboo and CruiseControl. The typical use case for Jenkins is that you have it monitor your source control repo, and when code is committed, Jenkins pulls that code, runs a build, and then makes the results of that build available. Well, it turns out compiling code is actually not that much different from reconstructing Martian terrain: you have some input images that you run some executable on, and it produces some output that you need to put somewhere. So we use Jenkins both to compile our code for continuous integration and also to actually run the image processing jobs. What Jenkins gives us is the ability to manage a fleet of worker nodes and to configure jobs or scripts that will run on those nodes, and then Jenkins handles keeping track of the build queue and parceling that work out to the nodes in the cluster. So this is the Jenkins interface. If you've used Jenkins before, I think this will look very familiar to you. Over on the side, we have the view of the build queue. At the time I took this, there was one staging build in the queue. Down below we have the status of the build cluster. The master node was running a job called "manage nodes," which I'll talk about in a second. We have a couple of nodes offline, and that third one is running, I think, a production build, and there are a couple more down below. The table on the right shows the job definitions in Jenkins; each of these is basically a script that does a certain thing. The top one is "build scene staging," which does a staging build for a certain scene. The next one builds a production scene. There's one that builds the preview products, and a couple of other miscellaneous ones. We also have Jenkins jobs configured that can create and destroy instances in EC2 and manage the fleet of worker nodes. If I want to see information about a particular build in Jenkins, I can click through to see details on that build. I can click to see the console output, either after the build has completed or while it's running, if I want to see what stage of the process it's at. Jenkins tells me that this build was started 59 minutes ago and has been executing on this host, and it gives me an estimate of when the build will finish. It also gives me some information about what version of the code this was running; this was running the production code at a certain git hash. So Jenkins out of the box gives you the ability to keep track of worker nodes and parcel work out to those nodes. What it doesn't do is dynamically create and destroy cloud instances based on workload, so we extended Jenkins with some custom scripts to make that happen. The way we do this is we have a periodic task called "manage nodes" that runs every couple of minutes on the Jenkins master; it looks at the size of the work queue and the size of the build cluster, and if those two are too far out of whack, it tries to equalize them. If it sees that there's more work in the queue than there are nodes available to do it, it will spin up a new node and add it to the cluster. Likewise, if it sees that the work queue is empty and there's a bunch of idle resources sitting around, it will shut those nodes down. So the system scales, within some parameters that we control, to the amount of work that is actually needed at the time. There's also a little Jenkins-specific trick that we use in the system. Our main job is this large, monolithic image processing pipeline that takes a lot of resources and runs on beefy computers, so we only want to run one instance of that task on a node at a time. We don't want the Jenkins master to ever try to run two of these on the same node at the same time. But we do find it convenient in our Jenkins scripts to be able to call out to sub-jobs that don't take a lot of resources. So the way we set this up, so the Jenkins master will do what we want, is that Jenkins has this concept of execution slots on each of your nodes, and you can allocate more slots to nodes that are more powerful. We set all of our nodes to have seven execution slots, and then we set our main heavyweight jobs to take up four of them, which leaves three slots for the lighter-weight jobs that the main one calls out to. That way, the Jenkins master will never try to allocate two big jobs to the same node.
Okay, to summarize our use of Jenkins: we have periodic tasks that manage the scale of the worker cluster and bring nodes online and offline, depending on how much work is actually happening at the time. We use Groovy scripts to automate Jenkins, and we also call out to some command-line scripts that use the Jenkins REST APIs to view the size of the build cluster and the status of the system, and to submit jobs. And we use tags to mark different types of nodes: we maintain separate development and production environments, and we tag some of our workers as reserved for production and others as reserved for development. That way we know there will always be production nodes available when we need them. All right, so the next part of the system is the worker nodes themselves. These are the nodes that actually run the image processing code, and they're a little bit of a different beast, because these are GPU-enabled EC2 instances that run Windows Server 2012. Our image processing pipeline was implemented in .NET and runs on Windows, so when we made the move to the cloud, we just chose to keep it in that environment. The way we manage these is we take a stock Amazon Machine Image for Windows Server 2012 and install all of the dependent software that we need onto that image. This is not actually our image processing code, but all the dependencies of that code: the right version of .NET, the right version of the Jenkins jars, the right versions of the different image processing tools we call out to, and a handful of other things. Then we bake that into our own AMI, and as our needs change, we revision that AMI and update it. We expect that to happen infrequently, because it's kind of a pain to re-bake it, so we try to put only things we expect to change fairly infrequently onto this machine image. So these are GPU-enabled instances, and in Amazon's cloud there are two offerings in the GPU line, both in the G2 family: the 2xlarge and the 8xlarge. The 2xlarge has one GPU, eight virtual CPUs, and 15 gigs of RAM, and the 8xlarge is basically four of those put together, so it has four GPUs, 32 virtual CPUs, and 60 gigs of RAM. It's really kind of a beast of a machine. We use both of these for different things. The life cycle of a worker is: it comes online, registers itself with the Jenkins master, and then waits around until new work comes in. When work is available, that node will first pull our code from git and build the version of the pipeline it's going to run. Then it pulls the source data it needs from the mission data system. Then it runs the image processing pipeline, and, assuming everything goes successfully, it pushes the results to the OnSight S3 bucket and notifies the build manager that a new build is available. At that point, it can go back to waiting for more work, at which point the Jenkins master will either give it another job to do or shut it down if there's no work. In setting up the system, there were a couple of things that we learned. One thing we like to do is keep the workers basically as dumb as possible. We create and destroy these things all the time, and we do have to install a lot of software on them, and it's a manual process to do that. So we bake as much as we can that we expect not to change frequently into the Amazon Machine Image, and then we try to keep all of the rest of our code in git or other places that the worker can pull from.
So we try to keep the workers in a state where they can be brought up very quickly and pull down the resources they need. Baking things into the AMI is necessary for us to get the workers to spin up quickly, but it's a little bit of a pain to manage. The other thing we found with these machines is that using the GPU in the cloud can be a little bit troublesome. We had to fiddle around with this quite a bit to make it work; we would write code on our desktops that would work fine, and then it would not work fine in the cloud. We did get it to work, though, so it does work. So there's another thing that we use, which is another of Amazon's offerings. EC2 has two types of instances. On-demand instances are the normal ones: you pay for them by the hour at a fixed rate. But there's also a spot instance market. Spot instances are a little different in that you bid what you want to pay, and if there's excess capacity in the system and no one outbids you, you get your instance for that price, which is great, because you can get instances at much less than market value. But there's a catch, of course, and the catch is that your instance can be terminated at any time. For a lot of work, though, that's actually okay. Obviously we don't use this for anything time-sensitive or production-critical. When we get a downlink and need to build the new scene for the place the rover just arrived as quickly as possible, spot instances are a bad fit. But when we get new imagery for wherever the rover was a couple of days ago, or when we're running builds in our staging or development environments, those aren't really that time-critical, and if one of the instances gets terminated, we just try again in a couple of hours. So the question is, if a spot instance is terminated, can you restart from the middle or from the beginning? And the answer is yes, you can restart from the middle if you program it that way. In our case, we actually choose to keep things simple and just restart from the beginning. In practice, our instances don't get shut down that often — having said that, there were actually two that were killed in the last day — but other than that, they usually don't get shut down very often. What we use this for is to bid on the 8xlarge instances at the price of the 2xlarge instances, so usually we're able to get four times the processing power for the same price as the smaller instances. Spot instances have worked out very well for us.
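Requesting one of those spot workers is a single AWS CLI call. A hedged example — the bid price, AMI ID, key name, and subnet are placeholders, not the project's real values — that bids for a g2.8xlarge at roughly the on-demand price of the smaller instance:

```bash
aws ec2 request-spot-instances \
  --spot-price "0.65" \
  --instance-count 1 \
  --launch-specification '{
    "ImageId": "ami-0123abcd",
    "InstanceType": "g2.8xlarge",
    "KeyName": "onsight-workers",
    "SubnetId": "subnet-0123abcd"
  }'
```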
So now I'll get into storage and distribution. This is by far the simplest part of our system, and it's really as simple as: we put our builds into S3, and then we use CloudFront to push those results out to data centers that are geographically closer to our end users. We have users at partner institutions all around the country; at the moment we're restricted to North America. And then, when the OnSight software running on the HoloLens needs to load terrain, it simply makes HTTP calls to CloudFront URLs, and those are secured with signed cookies. So I'll talk a little now about how we deploy our system. We use the Ansible tool. This is an IT automation tool that probably many people are familiar with; it's similar to tools such as Chef, Puppet, and SaltStack. What Ansible gives us is the ability to capture the configuration of our Linux machines in code that we can check into our version control system. The goal of this infrastructure is that I would like things to be set up so that I never have to SSH into my servers. I'd like to be able to just create them, destroy them, and deploy them automatically from scripts in source control, and never have to SSH in and manually configure things. I would be exaggerating if I said that's always true — sometimes we cheat and SSH in and manually configure things — but that's the goal, and Ansible gets us a lot closer to it. Ansible organizes things into playbooks. It's a very powerful tool, and actually the session in this room later today, I think, is about deploying web apps with Ansible, so if anyone's interested in Ansible, check that out. I'll show a couple of snippets of the playbook that we use to deploy our Jenkins instance. These are written in YAML. The first step is to create a new user named jenkins; we run that command with sudo. Then we need to ensure that that user has a .ssh directory with the proper permissions, so we create that directory under the Jenkins home directory, again using sudo, with the proper mode. And then we need to copy some configuration files up, so we point Ansible at the config file template, where we want to put it on the remote node, and the mode we want. Ansible provides pretty good tools for templating these config files and then swapping out variables for different things. So we have a set of variables we can deploy for production and a set we can deploy for staging and development, and we've separated those with these variable substitutions. When it comes time to actually deploy a new one of these instances, it's as simple as creating a new EC2 instance — which we can do either through the AWS console or through an Ansible playbook — and then pointing Ansible at that instance with the -i flag, along with the playbook to run and the environment. The final argument is the vault password file. The way Ansible deals with secret keys and credentials that need to be deployed to the remote machine is that it stores them in a file called the vault, which is encrypted and stored with your project; that's because if you check things into your source control system, you don't want unencrypted passwords and keys in there. When you run Ansible, you need to provide the password to unlock the vault so that it can pull out the pieces it needs. One time when I gave this talk at JPL, almost immediately after my talk, we accidentally terminated our build manager instance. It was complete user error on our part, so I can speak from experience that it really is as simple as these two commands to recreate that instance. Also, on that occasion, we learned about EC2 instance termination protection, which is a good thing to turn on for your critical nodes. All right, so we use Ansible to configure all of our Linux machines. We're not yet using it for Windows; I would like to be in the near future, we just haven't had time to get around to setting that up. What I would like to do is this: as I described for our worker images, we manually configure a machine image that serves as the base of our worker node, and I would like to build those with Ansible instead of doing it manually, because any manual process is error-prone, especially if I'm doing it. So my goal for the near future is to use Ansible to automatically provision everything we need onto an Amazon machine and then bake that into the base machine image.
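Since those playbook snippets and the deploy command live on slides, here is a rough reconstruction of the same steps; the file names, paths, and inventory are assumptions rather than the project's actual playbook:

```yaml
# jenkins.yml -- minimal sketch of the tasks described above
- hosts: jenkins
  become: yes
  tasks:
    - name: Create the jenkins user
      user:
        name: jenkins
        shell: /bin/bash

    - name: Ensure the .ssh directory exists with the right permissions
      file:
        path: /home/jenkins/.ssh
        state: directory
        owner: jenkins
        group: jenkins
        mode: "0700"

    - name: Install a configuration file from a template
      template:
        src: templates/jenkins-config.j2
        dest: /etc/default/jenkins
        mode: "0644"
```

And the playbook run itself is roughly:

```bash
# Point Ansible at the new instance and unlock the vault for secrets
ansible-playbook -i production jenkins.yml --vault-password-file ~/.vault_pass.txt
```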
Curiosity rover captures new imagery. That imagery is downlinked to the mission data system on Earth. The build manager periodically polls that system for new data. When new data is available, it's indexed into a MySQL database running on Amazon's Relational Database Service. The build manager requests that the Jenkins master allocate a node to do the work of running that terrain reconstruction. Jenkins finds a node to do the work, spinning up a new one if necessary. That node pulls the source files it needs from the mission data system, runs the terrain reconstruction, pushes the results out to the OnSight S3 bucket, and notifies the build manager that a new build is available.

At that point... yes, sir? What do we use for a terrain build? Oh, I'm afraid I don't know off the top of my head. That's a good question. The way we run the pipeline right now is we actually run it on a single node at a time, so one job runs on one node. The reason we've done that is we chose to optimize vertically before scaling out horizontally. For this phase of the project, our time target was to build terrain for a scene in about the three-hour regime, which we were able to hit by vertically optimizing and pushing some parts of the pipeline onto the GPU. So to keep things simple, we've kept things running on a single node for the moment. That said, I do not expect that to be the case going forward. That's correct. Yeah, you're welcome.

Okay, so once the results are in the build manager and in the OnSight S3 bucket, the next time our users launch the OnSight software on the HoloLens, they'll pull down the latest terrain and see the latest place on Mars, next to where the rover currently is.

Okay, so what I've described here is very much a snapshot of the system as we've built it and as it's running now. I'm not going to say it's the best possible solution to this problem, and we do expect to improve it going forward. One thing we'd like to work more on is improving our auto-scaling ability. I described how we use Jenkins to manage the size of our worker cluster and scale it up and down based on the size of the queue. Now, it may occur to you that most cloud providers offer similar features that can also create and terminate instances based on some kind of workload metric. We made the conscious decision not to use the built-in scaling features from a cloud provider, because when we were designing the system we wanted to maintain some level of cloud independence. So we ended up rolling some of this ourselves. It's worked out pretty well for us, but having gone through it, I think we would actually rather not be writing and maintaining that code ourselves. So in the future we'll probably be looking to see whether we can swap out some of our auto-scaling logic for something provided by a cloud service provider. As I mentioned before, I want to start using Ansible to manage the Windows worker AMIs more closely; that's one of the fiddly, manual bits of the deployment process right now that I don't like. And to your question, sir: right now the image processing pipeline is a monolithic thing that goes all the way from stereo images from the rover to a 3D terrain reconstruction.
Now obviously there's a lot that happens between those two endpoints, and we'd like to split that up into more modular services that do different things and can be scaled out more easily, because going forward we will need to beat the three-hour threshold by a lot, especially looking toward 2020, when the Mars 2020 rover will land. So I do expect to do a lot more with smaller services that are scaled out horizontally, probably all still using the GPU, and then we'll need to look into how to distribute the source images to the particular workers that need them. Right now we're in a kind of easy parallelization world where each worker can just pull all of the stuff it needs, and it works out fine.

Okay, I'd like to acknowledge the great work of all of my colleagues on the OnSight team and our partners at Microsoft. I hope you've enjoyed this talk and learned something. If you have any questions for me during the conference, feel free to grab me or say hi in the hallway. If you have questions afterward, feel free to email me at parker.abercrombie at jpl.nasa.gov. If you'd like more information about the OnSight project, please see our website, opslab.jpl.nasa.gov. And if you have any feedback about this talk, I would love to hear what you liked and what you didn't like via the Google Form at the bottom. With that, we have some time for questions. Sir. Mm-hmm.

Okay, yeah. So the question is why we're using Windows for the worker instances. To be honest, that decision was made before I joined the project, so I don't have a completely historically accurate answer for you. In general, though, at the time the decision was made, they felt that was what fit the problem best. Sir. So the question is why we can't just throw more workers at it to reduce the three-hour window. And yes, we can, and that's probably what we will do. We will need to rewrite parts of our pipeline to run in parallel on multiple workers. Up until now we focused, by design, on running on a single machine and making it run as fast as possible on that single machine. We do parallelize extensively across cores on one machine, but we made the intentional choice not to scale out to multiple machines yet. Going forward, I think we'll do exactly what you have in mind. There's one up here and then...

Okay, so the first question was how CloudFront scales geographically. I'd recommend looking at the CloudFront documentation on Amazon for its worldwide reach; within the continental US, which is our region of interest, it has very adequate coverage. And the second question was what kind of GPU problems we ran into. The challenge we ran into is that on our Windows machines, for some reason, the GPUs sometimes weren't recognized, possibly due to misconfiguration on our part; I'm not 100% sure. I'm actually talking to some engineers at Amazon next week about exactly what was happening. There was a question back here. Yes, we are. To be honest, I haven't looked into this in a lot of detail. One open source tool that caught my attention recently, which I have not evaluated so I can't really speak to it, is Spinnaker, an open source project from Netflix. I saw an interesting presentation by some of their engineers last week, and we'll be looking more into that. Yeah, okay, the question is whether there's anything the open source community, OpenCV and other image processing libraries like that, can do to improve our situation.
And that's a deep question, and the short answer is yes. My hope would actually be that some of the innovations we've made in this pipeline can be pushed out into the open source community. I don't have the details in my head right now to speak to that in much depth; I'd be happy to chat with you offline, though. Yes, in the back, about archiving. Our strategy at the moment is to keep things in Amazon's data system. Archival hasn't been a big focus of ours yet, but moving forward we'll probably want to sync up with other parts of the Mars mission and leverage whatever technologies they're using. I'm afraid I don't know off the top of my head how they do it. In terms of how they run it, I'm afraid I can't answer that.

Oh, okay, the question is whether each of the scenes we build is comprised entirely of, I think you mean, new imagery from the most recent downlink, as opposed to older imagery from earlier in the mission. And the answer is the latter. What we do is, for a scene, we find all of the images taken in that region of Mars. That includes what was most recently downlinked and anything that might have been downlinked in the past. So in theory the rover could actually drive in a circle and you would pick up data from much later in the mission; this has actually happened a couple of times in the mission. Yeah, so deciding which pieces to use: I showed a slide earlier with color imagery and black-and-white imagery next to each other, coming from different places. This is actually a very deep question. If you're doing a mesh reconstruction and you have a set of images, and for any given part of that mesh multiple images saw that place, which one are you going to use? We have some heuristics that optimize this for our use case, and there are a lot of different ways you could do it. To give you an example, we kind of pretend that Mars is a static thing where nothing ever changes, and this is not true. This is a lie. The most obvious thing that changes is that this huge Jeep-sized rover runs over it. So you'll even see in some of our scenes rover tracks appear and then disappear, because some images were captured before the rover drove and some were captured afterward. But you could imagine using your heuristics for which imagery to choose to skew the reconstruction toward the state before the rover drove or after the rover drove, depending on what you're interested in, or any number of other things.

In this application, no, we did not have such a requirement. In other parts of the Mars Science Laboratory data system, there are images available to the public that can be searched by different criteria; I can't speak to exactly which technologies they're using. And in this application, to finish on your point, we limited the amount of metadata that we track to really just the bare minimum of things we need to know about, and thus far the searching has been within the scale we can do in a relational database. I think there was a question over here. The question is whether the final data product is available to the public. The answer right now is no, but there are some thoughts in that area that unfortunately I can't share right now. Stay tuned. Not that I know of; when the decision was made, Amazon was what we chose. Yes, I think that is possible. I think the more interesting question is whether you can send the data from the data system directly to the HoloLens. Getting the data from Mars to Earth is actually a whole other question.
And maybe not surprisingly, the bandwidth between Mars and Earth is kind of limited. Yeah, sir. So the question is how soon our users want the data, to paraphrase. And the answer is: now. It's actually a very subtle question you're asking. The way Mars planning works is that there are certain planning meetings that are time sensitive for planning rover operations, and the way those meetings line up with how data comes down from Mars is very complicated. Ideally, our terrain would always be ready at the beginning of those meetings. That's not always the case. In some cases we may get the data down in the middle of the night and have the rest of the night to do the work, in which case no one is calling us. And in some cases it may come down half an hour before the terrain is supposed to be ready, in which case it won't be ready. So the answer is that the users would like the data as fast as possible, and we're trying to meet that need.

Skylar. The question is whether there's a way to view the data without using the HoloLens. In OnSight today, no, it is a HoloLens-only application. However, I don't expect it to stay that way forever. In fact, yeah, not for long. Sir. Well, that's what we aim to make it. Okay, that's a great question, and it really speaks to why we even did this. We have these 2D images already, so why even bother with the 3D reconstruction? And yes, the reason we do this is that I think there is something perceptually different about looking at images in 3D versus 2D. Ideally, you would actually send a scientist to Mars and they would look at the rocks for real. We can't do that, so the next best thing is to put a scientist in virtual reality. Using the HoloLens, you're not tied to one fixed viewpoint; you're tracked, so you can move around the scene and walk around it as if you were there. Clearly it's not quite as good as being there, but you can use some of the spatial cues you use in your everyday life on Earth to understand the scene around you. At the end of the day, geologists are looking at rocks. They're trying to understand how big the rocks are and how they're laid out in relation to the scene around them, and it's very hard to do that with a 2D image. There was some work before this project started, work that motivated the whole thing, that tried to quantify how people understand a scene they see in a panorama image versus a scene they see in virtual reality. I'm not going to go into that today because I don't have the numbers in front of me, but the punchline is that people are dramatically more accurate at understanding a scene when they see it immersively than when they look at it in 2D, even experts who work in 2D. So that's the whole motivation for building an application that lets you look at Mars immersively.

Skylar. So the question is, I think, if you walked from one scene to another along the rover's path, would you switch from orbital into the new scene? That's something we would like to support at some point, and unfortunately we don't today. Well, conveniently, we have a rover on Mars. So the question is why we chose Mars as opposed to other planets, and I'm sorry to be a little bit snarky there. We have much better data on the surface of Mars than we have for other planets.
If you think back to the orbital maps I showed at the beginning of my talk, there's a lot you can see from orbit, and Mars is absolutely beautiful when seen from orbit, as are other planets and moons. But for a virtual reality experience where you're trying to put someone on the surface, orbital data just doesn't get you there; it's not high enough resolution. Where we have really good surface data is Mars, where we have robots collecting it for us. At the moment we don't have such data for other places. Earth, yeah, okay, Earth is a counterexample. I'm not sure who was first.

Okay, the question is when, in the process of planning the Mars mission, this idea of creating a virtual reality tool came up. I'm not entirely sure, to be honest. I've been at JPL for a year and a half, so I was not at JPL when Curiosity landed. I know that the idea of looking at Mars in virtual reality has been around for years and years. This particular project has been going on for about two years, and this group has been doing work in this area for much longer than that, but when the germ of the idea that is now OnSight was born, I'm afraid I don't know. Okay, the question is whether there's one for the Moon. We are not doing one for the Moon today; I don't know if we will or not. I think you were first, sir.

Sure, so the question is whether there's a reason we chose an augmented reality technology such as the HoloLens versus a virtual reality technology such as Oculus. One thing we get from augmented reality is that we wanted our users to be able to continue using the tools they're used to while they're in the virtual world. I didn't really go into how OnSight's user interface works, but we actually detect where your computer screen is and then we cut that out from Mars. So you see Mars everywhere else, and then you see your computer screen. The reason we do that is so you can continue using the tools you usually use to work on data from Mars while you're looking at Mars in virtual reality.

Sure. Yes, so the question is how accurately the spatial details are maintained when you're going from 2D images to a 3D reconstruction. This is a deep question, so I'm not going to get too far into it. We aim for roughly centimeter-level accuracy in our mesh. There are a lot of ways noise can come into the system. In fact, with the source images we use, we get range products out of them, and just by the nature of how stereo correlation works, the quality of that range falls off as you get further from the rover. Fortunately, the things we're interested in are usually near the rover, so that's not too much of a problem for us. One of our next steps is actually going to be doing more quantitative analysis of our mesh versus the input images.

So the question is how specific this technology is to Curiosity, and whether it's possible to process data from other missions. I assume you're talking about perhaps Spirit or Opportunity or other Mars rovers. In general, this is not specific to Curiosity; it's something that can be applied to any kind of stereo reconstruction process. Currently we are a little overfit to Curiosity. We have talked about supporting other Mars missions. Right now we don't have any plans to do so, because it will take a bit of development effort to make it happen. Skylar. The question is whether we use location data about where the rover is to inform the image stitching process, and the answer is yes, we do.
I think that's a little more detailed than I want to get into today; I'd be happy to chat with you offline. Okay, maybe one more question. Well, we'll see. Okay, well, thank you all very much.

Is it working? How about now? Is it muffled or anything? Okay, you'll get the high notes. Awesome. Okay, my name is Rami Alvani. I'm a... oops, that came up before. Is it auto-advancing? Sorry, that was a bad one. Okay, so let's start again. My name is Rami Alvani. I'm an engineer at Symantec Corporation here in Culver City, and today I'll be talking about one-click deployment of cloud applications using Ansible. To begin with, of course, you have my details there. If you want to reach me on social media, or the best social media ever, IRC, you can find me there.

But I want to start with a small experiment. I always try to pitch to companies, or at least parts of companies, that are not usually involved in open source to support open source a little bit more. Since my manager is here, actually... Rami, can you say hi? Since my manager is here, I thought: if you enjoy this talk, or you tweet about it, or you like something about it, when you post to social media, would you mind using the hashtag #scaleOCD, as in one-click deployment? And if you don't like something about this talk, you can meet Rami after the meeting or call Norton support. Both of them will help you out. Symantec owns Norton, just in case you didn't get that joke.

Okay, so currently I'm in a DevOps position with my team at Symantec. And the reason I'm here, or one of the reasons, is I believe in what the DevOps dude said. He said: at DevOps, we are awesome, so we take the awesome that's in our heads and give it to people who don't have that awesome in their heads. Because one important part of being a DevOps person is sharing; it's one of the core values of DevOps, and this is one of the things I'm doing here, okay? And I actually found out who that person was. His name is Jason Freeman, and he works for StackStorm right now.

So, and my slides are auto-advancing and I don't know why, so this is gonna be a little bit difficult. I work on a product called Symantec Unified Endpoint Protection. And yes, it is "soup." That's supposed to be a joke. So, thank you, I appreciate it, thank you. It's a wonderful product. Actually, if you saw... come on... sorry, I'm gonna exit the normal slide thing and use this, okay? I hope you guys don't mind. So, the product I work on, and this is the last you're gonna hear about Symantec in this regard, basically manages cloud applications. It's a brand new product that Symantec grew from the ground up to manage the endpoints that the company manages through different products, okay? And like the gentleman from JPL who was presenting before me, he actually has a Symantec endpoint installed on his box; that's why you saw the Symantec logo there. This is the product that will end up managing all those.

I also identify as a PhD student of computer science at USC. Can I get a "Fight On" from the Trojans in the room? Awesome. So, I also identify there. I also attempted to reboot the USC Linux Users Group. So, no, you don't get to; you should have said "Fight On." And we have some good work there. And one of the major things before we get on the road to one click: one other thing I identify myself as is a husband and a father, and I have my family at the end of the room there. So, thank you for letting me come to SCALE.
So, on the road to one click, you need to know and understand where you're coming from in order to know where you're going, okay? Because every organization has some sort of system for how they deploy and orchestrate their systems. I don't believe anyone here is still doing it by hand if they have more than three or four boxes, okay? And even if you're doing that, you're not cool if you don't have a configuration management system. And the growing demands nowadays of maintaining security, patching, et cetera, mean that in order for you as a DevOps person to grow up and live on and have a good life, you actually need to have systems like this in place. And if you're a DevOps person and you're not there yet, I'll give you six more months and then we can talk.

So your goal is to actually sit down, have a monitor in front of you, and watch everything as it goes along. The idea is you have everything mapped out, all your code, all your infrastructure ready, and all you need to do is just watch it go through, see the green OKs, and everything moves forward nicely. That's, you could say, the engineering fantasy, or what you're looking forward to.

So, who knows what this is? Yes? Excuse me? That's Morpheus. Where is Morpheus? He's in the construct, right? Now, this is where one click starts, okay? One click starts in the cloud equivalent of the construct: basically, a big white room with nothing. You have no cables, you have no networks, you have no routers, no switches, not even power. Well, the cloud doesn't need power, so let's set that aside, okay? But you don't have anything, and that's the true nature of one click: can you stand up an environment from scratch, which is a big deal in a lot of places, using only code, triggered by one command? That's where you need to start from. You need to start from the construct.

But in reality, even the construct is not completely empty, right? You still have a TV that you can use to see the Matrix. Now, does the TV show the Matrix in the green squiggly characters, or in full color? Does anyone know why? Ah, the nerds in the room. Why can they see it in full color there, and not the way they see it on the screens in the Nebuchadnezzar? Okay, that's homework. So, regardless, you still need some components that you cannot always start from one click, okay? They can have their own code that you can use to stand them up at any point in time, okay? But still, they need to be independent from your infrastructure as you go along.

So basically what you're asking for is to go from this, to this, to this. Well, this is a GIF that should be working, actually, okay, where all the guns are coming in the background. To this, where: whoa, I'm up. And actually, the first time or two you do it, people will be in the office at four a.m. looking at each other: wow, it actually works, I can't believe it, okay? And it's a good feeling. It's a feeling of engineering triumph, and we in DevOps actually need to have that.

Can you hear me really well? So, one of the things: I used to teach at USC, I used to TA a little bit, so I'm used to talking to people. What I'm not used to is people not talking back to me. So I would love to have this be more of a discussion kind of thing, and we can speed up and slow down as we go along depending on the time.
But more value will come out of this talk if you engage with me, so feel free to talk back to me and ask questions. We're gonna go through some steps that will take you through the one-click process, ending with some things that you would like.

So, first of all, you need to understand where you are deploying. One of the main things you need to know is the sizing of your tenants. Some of the mistakes I've been seeing, talking to people at meetups and things like that: all the environments are bundled in one account, or they're not segregated logically from each other. At some point they're actually under the same network, not even subnetted away from each other. That's something you really need to keep in mind.

Also, you need to maintain access control to your environment. You may have some people who can see the list of VMs and routers and switches but can't edit them. You may have engineers that you don't want to even go close to that. There are engineers you want to be able to see logs for production, but not the production VMs themselves, and you want combinations of all of those. So this is another thing you need to keep in mind when you do your one click, because as you configure your environment, you need to make sure these access controls are in place, and that you're not forced, because the code base or the current state of the system doesn't support it, to give SSH keys or passwords to people you technically should not give them to, okay?

Also, there are some resources that need to be taken care of, for example package repositories and code repositories. Are they shared across environments? Is the package repo you use in prod the same one in stage, or the same one in development? That's something you need to consider, because mistakes can happen and will happen, and we need to take care of that. Also, you need to make sure you segregate your private and public networks. Now, I'll be adding extra emphasis on private networking and on some aspects that may not be standard for the usual deployments people have, just because those are some of the things that we emphasize at Symantec: making sure that everything is isolated, with two or three degrees of checks and balances for anything we do. It may be a little bit too tight for your taste, but at least you can go back and relax it from there. Does that make sense? Any questions so far? One question, please? Yes, no? I'm gonna start pointing people out, yeah.

The build servers, how do we handle those? Well, the build servers need to push to the package repositories, right? So if everybody shares the same ones, you need to make sure you don't have development build servers with access to your production package repositories just because they happen to be on the same network, and when you whitelisted the firewall you just allowed the whole /24, for example, so all of them could get access. So the main thing is you need to outline all of these as you go along and document them. That way, the initial documentation is gonna be on paper, but as you get more and more into your DevOps work, into your configuration management, it will actually be documented in code. And what will happen, especially if you're using Ansible (we're gonna come to that) is that it's YAML, so you can technically write code to generate the documentation from your configuration itself.
And tools like Ansible Tower, et cetera, can help with that. Thank you for the question.

Also, there are some intricacies of infrastructure that you need to keep an eye out for, and I'm gonna point out some of them. Hardware versus software accelerators. For example, say I have SSL termination. A provider comes to you and says, I have a box with SSL termination. Well, is that box a software one or a hardware one? If it's a hardware one, where do you keep your keys when you reboot the box? Do you have an HSM? Where do you keep your keys, just in case? What kind of ciphers does it support? Does it have all the PCI-approved packages, et cetera? Also, are there any layers before the traffic hits your load balancer? Is there an F5 sitting up front? Are there two F5s sitting up front? Now, not everyone on your DevOps team needs to know that information, but one or two key engineers, when they start debugging something catastrophic, actually need an understanding at that layer to be able to make sense of it.

Also, some intricacies: are my security groups or firewalls stateful or stateless? Do you all know the difference between stateful and stateless firewalls? Okay, so there are some people who don't. The idea is: with a firewall, you block traffic coming in, right? You can do it going out too, but let's take traffic coming in. Traffic coming in, you either block or you allow. If it's blocked, it's blocked, done. But let's say you allow a whole block of addresses from within your company, for some reason, and somebody keeps calling in. If it's a stateless firewall, anyone who calls in will get in, okay? Even if someone is sending a SYN flood your way, as long as they're whitelisted and allowed to come in, a stateless firewall will let the SYN flood come in. Whereas a stateful firewall will understand that there was no negotiation, so it will just drop that packet, because there is no state for this connection between these two hosts. Make sense? Small things, but especially if you're not going with your standard AWS or your standard Rackspace, some of these things actually come into play.

Also, the definition of availability zones, okay? I work with OpenStack primarily, and if you go to multiple OpenStack providers, you'll see some of them define their availability zone as their whole data center. So technically, if your database and its failover are in the same data center, they're considered one availability zone per the SLA; if both of them go down, you were expected to have one in another availability zone. Whereas other providers have each rack as its own availability zone, so per the SLA, if you have them in two different racks, you're good. Maybe not per your own confidence, but still. Others have it at the hypervisor level, which is scary, but at least you have a clear definition and you know it when you're going in to design your system.

I'm gonna try to speed up a little bit so we have time for the rest of the info. So I'm gonna pick on TTLs for load balancers. Who uses RabbitMQ or a message queuing service? For example, say you have a RabbitMQ cluster, and in order to maintain state there's a connection between one of your nodes and a RabbitMQ node, okay? That connection stays up until a new connection is made to another node. Now, your node may only know the address of that exact RabbitMQ node.
So if it boots up and that RabbitMQ node is down, it doesn't know how to reach the cluster. So how do you deal with that? You can use load balancers to do that. However, RabbitMQ has strict requirements on the TTL for that load balancer, so if your cloud provider does not support it, you'll be debugging RabbitMQ for like three days without knowing why, until you figure out it was actually the TTL on the load balancer not being enforced. Things like this happen quite a bit.

So in this talk we're gonna talk about how we deploy on top of OpenStack, so I'll give a five-minute preview of OpenStack in general. OpenStack is basically your open source infrastructure for cloud virtualization. You have your compute nodes, your networking nodes, and your storage nodes, and then you have an interface that you use to talk to all of those. The way OpenStack works, there are different projects, and each project is in charge of a specific component of OpenStack. As a person dealing with your infrastructure, you need to understand what each component does and how to deal with it. That said, not every OpenStack provider provides each of these components, okay? Now, do you guys mind if I walk down here? I really like to move when I speak.

So: Nova is the compute layer; you do all the provisioning, et cetera, through Nova. Swift is the object store, and it's one of those things that gets dropped if you're in a private cloud and you don't request it. Glance is imaging as a service, so it's your image management: uploading images, downloading images, et cetera. Keystone is for identity, and it's one of the cornerstones of OpenStack, and the bane of my existence after one other one: Neutron. Neutron is software-defined networking, and it is the... anyone from the OpenStack Foundation here or anything? It is the messiest part of OpenStack. It drives me nuts, okay? Because you have software components, hardware components, software components from vendors, other software components; it's just crazy. And Horizon is the dashboard you use to modify things, basically a portal for OpenStack. Any questions? Okay.

So the next step after understanding your infrastructure is making sure you understand your application, and there are certain things you need to keep an eye out for in your application. For the application we're working with right now: Spring has a well-known Hello World sort of application called PetClinic, okay? And this is the one I'm using for this exercise. So here, you're gonna be talking to an nginx node that's acting as a proxy. It will proxy-pass all the traffic to Tomcat, which has the application deployed on it, and that talks to MariaDB in the backend, okay? We have some supporting cast in terms of Git, and the second thing I hate most, Jenkins (after Perforce, of course), and also the package manager, okay? So this is the example we will be following through this exercise.

So, as you start your application: your application is mostly still in development, and your developers are developing on i7 machines with at least 32 gigs of RAM. So if the application does anything wrong, they really don't see it, and it's really hard for them to profile it, or even discover it, until they deploy it on a VM that's yay big compared to their computer, right?
So as you start deploying your application, you'll notice that you need to account for your CPU and RAM requirements and document them, okay? A point in time is gonna come when you say: oh, I need this mini deployment for that PO who's gonna go and demo it somewhere. How big do I really need to make it? And this will come in handy. How much object store, how much block storage do you need? Also, and I emphasize this is an unusual scenario, your VM ingress and your application egress requirements. You need to account for anybody calling into any VM or container that you have, and make sure you allow them explicitly to call in, okay? And your application in general calls out to different locations, okay? And you need to make sure you account for all of those.

I went through the exercise where we had this application running completely in a lab environment, okay? And we took it to an OpenStack instance that was completely firewalled, where egress access is only allowed by permission. So if I wanna go to Google.com (okay, they're not a competitor), if I wanna go to Google.com from a VM, I actually need to file a firewall change request in order to access Google.com, okay? Of course, you could just VPN to another node that actually has access, but nobody knows about that one. But regardless, you sit down and you start seeing stuff failing in your environment. Fine, environment is environment, you can open up tickets. But better yet, you'll see your own CM code failing, okay? Why? Because this specific Node package required going to that weird repository instead of GitHub, which is what you got whitelisted. Oh, then I need to actually clone that locally and change the configuration so that Node will pull it from there as opposed to pulling it from GitHub, for example. Or there's a trend nowadays with slightly more complex open source software where they say: oh, just curl this URL, pipe it to bash, it sets up, and hey, it works. That never works if you're sitting behind the firewall. And it drives me nuts not knowing what's installed on my machine. What I ended up doing: I installed CoreOS, put up a huge Docker container, installed the application, did a diff, and sat down to look at everything in order to be able to deploy the application. StackStorm, if you're watching, that's you. Okay.

Yes. So that's actually the first thing I did when I was diagnosing that issue. I set up Squid proxy in an environment that has complete egress access, full access to the outside. I set up Squid proxy, deployed my whole application, sent all my logs to Elasticsearch, and then I started combing through them, with a lot of grep and regex, and I got the main set of URLs that I need access to. Okay. Then, you want to hear about intricacies, there's one small intricacy, okay. There's a certain set of firewalls that does not allow you to whitelist URLs; they only whitelist by IP. Where is that a problem? Any guesses? SSL, that's a problem, yes, but mostly it's when I'm calling something outside. Anyone who works with mobile devices or develops apps, for any reason, okay: your phone talks to Google, and any service you interact with needs to talk to Google on the back end to make sure you can push notifications and push profiles down to the devices. Good luck tracking Google IPs. You can't do that.
And you go to the security office and say: oh, just whitelist that whole /8 network, give me the whole thing. That will never happen, okay? So you need to work out some tricks. So you want to know a trick, how I dealt with that one? Hands? What I actually did is I have a job in the same environment polling those same URLs, okay, and whenever a new IP shows up, it spits it out and sends an alert, and then we file a request for it. All IPv4 in our case. We're all private networking, so IPv6 doesn't really make sense. And if you have an engineer who became a Java engineer and you want to explain how to ping localhost using IPv6, it's going to take some time. So we kind of stayed away from that.

Okay. So, going back here: also, bootstrapping requirements. You learn a lot about the application if you go into a clean environment. How did I ever bootstrap Caltrace? Where did that user come from? Oh, it only worked because that matched the SSL certificate? That's how it got there, and now we couldn't deploy it anymore. So there are certain intricacies and lessons learned. Come on. And I'm looking at the clock now, not because I'm bored; it's because I want to make sure we give enough time for the other components of the talk. Okay.

Also, one important thing: you need to know where you stand. We're all mature DevOps engineers, or aspiring DevOps engineers, or hoping we can be DevOps engineers, and you need to know the tool set that you have. Like a lot of people, we were invested in Puppet for quite a bit of time. Look, I love the Puppet guys. Actually, at the last SCALE we sponsored Puppet Labs, and I'm so proud of that. But you need to know your tool and you need to know its limitations, okay? For those who are using Puppet, one of the main issues we had with Puppet is the construct, okay? You want a tool that can bring itself up. It's a chicken-and-egg kind of problem. So, guys, how can I bring up an environment when you need a Puppet master? How do I bring up the Puppet master? With a Puppet master? Or you do it by hand. That's one thing. Also, Puppet, at least at the time and until recently, does not have support to provision resources in OpenStack natively. The support they have is repurposed EC2 code. So if I actually wanna provision to OpenStack using Puppet, that won't work for me. So you need to understand the limitations.

And some of the decisions you need to make about your environment: are you gonna treat this environment as pets or as cattle? If a node misbehaves, are you gonna just pet it? Please come back to health, I'll patch you, I'll do whatever you want, just come back again. Or are you just gonna kick it to the curb (well, PETA's not here), or at least put it to the side and bring up another one that's healthy, okay? Also, you need to decide what levels of tolerance you have for that misbehavior. Do you say, okay, if it's just a misconfiguration, property files went away, maybe the last Puppet or Ansible run didn't catch something, will I throw it away? That may be more costly than just redeploying your property files. So you need to have some sort of measure there for what to do, as opposed to an operational failure, where you just totally kick it out and bring up something else.

Also, you need to make a key decision, especially if you're going to Ansible, about your machine state. One of the main powers of Puppet is that it ensures that a specific node is in the state you expect it to be.
So even if a developer went in and changed something, if your Puppet code is right, five minutes later, 30 minutes later, that node is gonna be back the way it was before. Now, if that has value to you, then yes, you may go Puppet, you may go Chef, you may go other tools that will work for you. Ansible may work, but it may not; it may miss some of the components that you really like, if that capability is important to you. In our team, for example, we adopted that "if it's bad, kick it to the curb" mentality, so machine state is not really that big of a problem. And if you have enough sensors in place that let you know what the machine state is at a certain point in time, you can put in code to actually auto-heal as you go along.

Also, if you're using Puppet or any tool that has a DSL, you need to make sure that the more qualified people are writing code for it. Otherwise you're gonna end up with bad Puppet code, and almost everyone I talk to who uses Puppet is in one of two camps. Either they have a dedicated team doing their Puppet code, they're extremely happy, they even have Puppet Enterprise and they're happy with it, okay? Or, on the other hand, everyone in the company is writing Puppet code and the guy who actually needs to run it is going crazy: all kinds of weirdness in the code, and everything at the end is an exec to a shell script. So if you're really into everyone writing that code, then you may need to re-evaluate the tool you're working with. Also, no, you don't necessarily need to pick a single tool; you can have a combination of both. A lot of shops have a Puppet-and-Ansible deployment where they manage their machine state using Puppet but do the orchestration using Ansible.

Well, then you need to start applying the same quality practices you have for your code to your DevOps code: testing, code reviews, making sure no bad code goes in. And at the end you start phasing things out, and that's what we're actually doing at this point, because you can tolerate it up to a certain point, but afterward it's like: I have bigger fish to fry, I cannot deal with this anymore. But if you're in an organization where anyone can write that code, I think it's harder for you to do that unless you're in a position of authority, which individual developers usually don't have within a team. You should do it regardless, but you'll find people reviewing the code of whoever sits next to them, or Oscar comes to me and says: hey, I'm sending a code review, can you approve? It's not even in my inbox yet. Okay, so things like this happen, and it's the way things work. Something is really broken and needs to be fixed, but in the end you're gonna incur so much engineering debt that at some point you're just gonna need to invest in cutting that code. Does that make sense? And then we can discuss it more offline so we can move along with the presentation.

Okay, also this: if you take anything out of this presentation (I hope you take a lot, but if you take anything), you need to take idempotency. It is the golden rule of DevOps. And it is so important that, with any tool you pick, if you achieve idempotency coverage of at least 80%, you're in really great shape. Okay, what does idempotency mean? It means I have my code, I run it once, it'll deploy my whole system A to Z. Okay, I run it twice: if nothing is wrong, it'll come back and say nothing changed.
If I run it three times and a node was down, it's gonna run, fix that node, finish the run, and report that one node back, okay? So much headache and grief can go away that way. And that's the fastest way you can achieve this new buzzword, auto-healing, okay? Because I'm pretty sure, if anyone opens their configuration management code base, whatever their tool is, they're gonna find a curl call or an exec line just executing a shell script, okay? The problem with that: the shell script itself could be idempotent. I could be making an API call that is harmless, and if I run it 10 times it won't screw up anything in my system. But here's the problem. If you run an exec or a curl call, every time that call runs, it is considered a change in the state of that node. So you will get a report back saying there is a change in state. That is bad. I should not, and I repeat, I should not get a notification of a change of state unless something actually changed. Does that make sense? So part of your work will be to make sure that even a shell script behaves that way, and by the way, this is something Ansible shines at, because you can write a bash script such that, if it understands that nothing happened, it spits back "nothing happened" and Ansible will report that nothing happened. So in that case, the actual call will be fine. But at that point it's not a shell script anymore, it's an Ansible module. Make sense? Okay. Any questions? Do I need to start pointing for questions? You know I'm gonna call you out. Okay.

So one important thing to know also is when to shut up. I know a lot of you here are actually here for Ansible, so now's the time we switch to that. So, Ansible. If you ask me what Ansible does and what it is: as far as I'm concerned, it ranks in the top 10 of the best things since sliced bread, at least in my case. Why in my case? Because my responsibility is to provision a complete environment from the construct to "guns, lots of guns." That's a quote from The Matrix, okay. Complete provisioning of everything, setting up everything, calling out to external services, reporting back if anything goes down, okay. I happen to use OpenStack, and there's amazing support for OpenStack in Ansible. It's not just because Monty Taylor, one of the OpenStack board members, is in charge of the Ansible code for OpenStack. It has nothing to do with that; there's actually a huge community behind it, okay. Puppet takes the approach that you get anything you need from the Forge. Ansible is the complete opposite: everything you need is baked in. They have two repositories, a core modules and an extras modules repository, and people just contribute code there like crazy. So especially for the cloud side, anything you need is there. The orchestration component of Ansible is very mature.

So what does Ansible do? People say a lot of things, but for me these are the three big things. The first one is cloud provisioning. In my case, you're using OpenStack: you can have the credentials and details for four OpenStack providers in one file (it's called clouds.yaml), and you can run the same code base once for each provider and you'll get the same outcome in all four. Okay, I did it for two and it works great. I didn't test it with four, but with two it works.
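As a rough illustration of that multi-provider setup, a clouds.yaml along these lines could describe two providers side by side. Every name, URL, and credential here is made up; the real file would be filled in from each provider's Keystone details.

clouds:
  provider-one:
    region_name: region-a
    auth:
      auth_url: https://keystone.provider-one.example:5000/v2.0
      username: agentsmith
      password: not-a-real-password
      project_name: matrix
  provider-two:
    region_name: region-b
    auth:
      auth_url: https://keystone.provider-two.example:5000/v2.0
      username: agentsmith
      password: not-a-real-password
      project_name: matrix

Each os_* task in a playbook then just says cloud: provider-one or cloud: provider-two, and the same playbook can be pointed at either provider.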
With the same code base, as long as the OpenStack backend didn't crap out on you (which happens), you're gonna reach that same end state. So it does that. Then there's infrastructure-as-a-service orchestration: adding a load balancer when you need one, removing a load balancer when you need to. I'll give you an example. We have this case that we call the magic packet case, okay, where we have a client that calls a load balancer. This load balancer is HAProxy in an active-passive mode, okay: you have an active HAProxy that load-balances the nodes, and then you have a passive one, so that if the active one goes down, the passive one comes up and a new one is provisioned somewhere else to become the new passive one. So far so good? Okay, so what happens is, as far as we can tell, and as far as Juniper Networks, the SDN provider, says, there's this magic packet that comes in, goes to this load balancer, and crashes the HAProxy. The passive one comes up to pick up the slack, the same packet goes to it, and it crashes; by the time the next one is up, it crashes, and the next one comes up, and it crashes, and we have this cascading failure of load balancers, okay. All we needed to do was provision another load balancer and change the DNS entry from that load balancer to the new one, and since we were using local DNS on the same hardware, we were golden, okay. Make sense? So things like this are really easy and fast to deal with in Ansible, for multiple reasons, one of them being how easy it is to pick up compared to other tools.

So the way it works is... anyone have a pointer, by any chance? I'm really used to pointing at stuff, okay. What happens is you have an inventory. Everything in Ansible is based on an inventory, okay: I have these sets of nodes, and each node has certain properties, so I say "give me all web servers" and it'll give me all... oh, thank you so much. Great, appreciate it. So you have the inventory file here, and any VM you own is part of that inventory, okay. It's in INI format, and you can tag your VMs any way you want. So you can have a group of all your Phoenix 2 nodes and list 20 of them, and a group of all your web server nodes that has five of those 20, okay. So when you ask Ansible to do something, it'll just act on whatever group you give it to deal with.

And also playbooks. If I have one wish, it's that we stop calling things recipes, playbooks, scripts, whatever, hats and knives, et cetera, and just pick one terminology. We've standardized a name for everything in computer science except configuration management. I would love to change those, but I'm partial to playbooks, so we'll go with them. So you give Ansible an inventory and a certain playbook, which is a set of tasks to run, and Ansible will figure out where to run it. It'll SSH to those boxes, download its mini agent, and put it on each box. That box does not need anything except Python 2.4, by the way, to be installed on it. Then the whole thing runs, comes back, and returns what happened. That's it. Now, if you happen to have an evil shell script, it'll take that shell script, run it, and come back telling you that it changed, regardless of what happened. Questions? Okay. It sounds rosy. Trust me, it is, until it's not.

So, how do you install Ansible? Let's say you have an existing system right now and you want to use Ansible to run it. You have VMs provisioned.
If they're Linux boxes newer than CentOS 6, which is most likely the case, or on the last LTS of Ubuntu, you have a Python that's new enough to handle Ansible, okay? All you need to do is clone the latest version of Ansible if you're so inclined, or, of course, you can use your package manager or pip to install it. Just make sure, from now on, that you're using Ansible 2.0 or later, because there are major differences between the previous Ansible and Ansible 2.0, okay? You go into Ansible and you run this, just to make sure that your environment is set up properly. Now, if you want to deal with OpenStack, you need to add these lines here. Sorry, Ansible needs these on your local box, and these are for OpenStack, okay? For OpenStack, you basically need the Nova client, the Neutron client, the Keystone client, and the OpenStack client. And Shade. Shade is the Python library through which all the calls Ansible makes to OpenStack are translated from Ansible-ish to OpenStack-ish.

Yes. Yes, it is agentless in the sense that you don't need to install an agent on any box. However, the way it functions, when you tell it "go install nginx on those boxes," it'll SSH there, create a temporary directory under your user, drop in its agent (which is basically a set of Python scripts), run everything it needs to run, and then clean up after itself and come back. The agent... oh, you just get it from here; it's part of the Ansible code base, okay? And I've looked at the Ansible code base. It's actually beautiful. If you're not very good at Python and you wanna look at good Python code, in my opinion (I'm not sure about other people), this is actually some good code to look at, and you can make sense of things. There are some intricacies here and there, and we're gonna see some of them in the code samples, but it's good code.

Now, a small hint: if you ever have a problem when you SSH, especially if you're using a bastion box, you need to install this small package called sshpass, okay? Just a small hint: if you run into an SSH problem, just install this and it'll fix it. I have no idea why, but it does. Though don't build it from source; it's hosted on SourceForge, and they have so many ads and stuff on their website, so any version from your package manager will suffice. And yes? Well, okay. I know, okay? Because MIT is too liberal for you, right? No, no, it's not for security reasons. It's... okay, when you use a bastion host, all the SSH work is offloaded to the bastion host, like a jump box. So instead of you SSHing to the end server, you SSH to this server, which can SSH onward, and you need your SSH agent to follow through. There are components of sshpass that you need for that, okay? If you need your password, not your key. I can reproduce the problem. Don't get me started. Can I trash Symantec? No? We'll talk offline. And by the way, ladies and gentlemen, Professor Ted Faber. He's my open source mentor, so that's why he's giving me a hard time. Yeah, I'm gonna blame him, and you can blame SCALE on him. He's actually called the godfather of SCALE; he was there when it all started. So, thank you. Round of applause, everyone, thank you. That'll get you to stop talking. You're gonna ask questions. Okay, everybody else.

Okay, so this is when we're gonna go to code. Yes. Okay, so think of it this way, and I'll give you another example afterwards.
But this one is... so, as part of securing your own network, okay, not all your boxes need to be accessible from the outside. So what you do is you create a private network. Think of it like your home network, okay? You have your own router set up, and all the computers inside your home network are using a 192.168.something.something IP address, right? But you cannot SSH to those boxes directly. But let's say you do need to manage them; if you're so inclined, and I'm sure a lot of people here are, you want to SSH to your home box from the office. So how do you do that? You either put your box in a DMZ, or you have a box inside your network that has a public IP address, okay? And you SSH to that box, and that box, and only that box, has, think of it as two network cards: one network card attached to the outside world, another one attached to your inside network, and you can talk to your inside network from it. So that's where you use a bastion box. So especially if you're using OpenStack with a private network, you have the luxury of having none of your boxes accessible from the outside, which is almost a requirement, okay? And you have one box that you can use to jump to all those other boxes.

Yes. What happens is, in our case for example, we have one bastion box in each tenant. So if we have a specific project, we have one box in each. And that's on the overlay network, because that's the virtualized, software-defined network. Then you have actual physical boxes on the underlay network, okay, that are controlled, that are almost bare metal, basically, and you allow those to call in, okay? And you can also whitelist an external box as long as it has an IP address that's publicly accessible and known. So it depends on how paranoid you get and how many security precautions you wanna take.

So apparently I only have 10 minutes to go through the meat of the talk, and you can blame Professor Faber for that, okay? So, okay, and I officially hate this. Okay, so I put up this QR code before the talk started in the hope that people would actually... I can, you know what? Do you mind coming up to my computer, please? Okay, so this is the code base where we have the examples for this talk. So the first thing that I want you to see is... can you please go to the Ansible configuration file? Okay, now let me bring up a case, and you tell me what problem you see coming. I'm provisioning brand new boxes, boxes I've never talked to before, and I'm gonna SSH to them. Give me one problem I'm gonna face. The user is provisioned as part of the image. Host key, correct. So how do I deal with that? Yes, that's the easy way of doing it. However, in a production environment, you might actually want to have the machines call back with their host keys so you store them locally; that way you keep the keys as you expect. So, host key checking: you just set it to false, okay?

Also, you might wanna allow the forwarding agent. What is the forwarding agent? Let's say you're using a jump box like we talked about before. Now, you load your keys on your box here, my personal laptop, and I SSH to the jump box. I don't own that jump box; Joe Schmoe owns it. I don't wanna put a public and private key that are attached to me on that box, because Joe Schmoe has sudo access and he can read my private key and practically impersonate me, right? So I shouldn't be doing that. So what I do is I load an agent, and that agent forwards my credentials with me as I jump through boxes.
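For reference, the two settings just described typically end up in ansible.cfg looking something like this. This is a minimal sketch, not the exact file from the demo repository.

[defaults]
; Brand new instances are not in known_hosts yet, so skip host key checking.
; In production, consider collecting and pinning host keys instead.
host_key_checking = False

[ssh_connection]
; Forward the local SSH agent through the bastion/jump box so private keys
; never have to be copied onto the jump box itself.
ssh_args = -o ForwardAgent=yes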
Make sense? So these are things that make my life easy as I go along. Okay, let's go back. And if we scroll down, let's go to clouds.default.yaml. The one under, yeah, that's why you're up there and I'm down here. Yes. Okay, scroll down, please. So if you're dealing with OpenStack, or even if you're dealing with AWS, you need to define your credentials and your details somewhere. So this is called your clouds.yaml file, where you can define any number of clouds, and remember, it's YAML-based, so indentation, like Python indentation, does matter, okay? And here you really say whether it's a private cloud or a public cloud, whether there's a region prefix, say Virginia One or Virginia Two, and the Keystone authentication details: the username, agent Smith, the password, and so on. All these details you actually get from OpenStack itself. So once you're given access to an OpenStack instance, you use Horizon, you go to the security section, and you ask it for the details; it gives you a bash file with all this information except your password. Make sense? And the code is on GitHub, so we can follow along.

So let's go one step back. The first thing I wanna do is go to the construct. The goal of the construct is to basically build my network infrastructure, okay? And pay attention with me here, because we're actually going through some of the Ansible basics as we go along. So first things first: we said Ansible is based on inventory, right? But in order to create VMs, I'm calling an external API; I'm not calling a server that already exists in my inventory. Should I put that server as part of my inventory? No, I just say use localhost. I'm just making API calls; I'm not SSHing anywhere, right? So you can actually have modules that do not need to run on boxes. They just need to send alerts, make an API call, do something, even email someone, okay? And you just set the host that it runs on to localhost, okay? And you set the connection to local, because the default is to SSH to your localhost if you don't tell it otherwise. And gather_facts: everybody familiar with Facter in Puppet? That's basically a script that you call, and it gives you everything that could be known about your box, okay? And that's how Puppet keeps track of everything. So you're telling it, don't waste the two seconds you need to collect that information; I don't care about the box information right now. Now, Ansible has something called pre-tasks and tasks. It's simple: you have a set of tasks that you wanna run, okay? It goes one by one, predictably. Anyone tried to do one-by-one ordering in Puppet? Have you ever succeeded? I don't think so, okay? You can, okay, I know you can, but it takes real work, okay? So sometimes you wanna run tasks, but in order to run those tasks you need to have some information collected beforehand, and that's what we do in pre-tasks. So in this case, I said I have a file called project.yaml, okay, that I need you to import all the information from. So would you mind duplicating the tab? Go one step back and go to the group_vars directory. Here, yes. And project.yaml, okay, scroll down. So here I'm putting information as variables; this is where I keep my variables, basically. The whole idea is I wanna keep my playbook as generic as possible, so if I run it on different environments, all I need to do is change the variables, okay?
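Two hypothetical sketches of what was just described: a clouds.yaml entry (the cloud name, Keystone URL, region, and credentials are placeholders; Horizon hands you everything except the password), and the top of a construct-style play that runs against localhost and pulls its variables in as a pre-task.

```yaml
# clouds.yaml -- hypothetical sketch, read by Shade on behalf of the os_* modules
clouds:
  matrix-cloud:                      # referenced later as cloud: matrix-cloud
    region_name: virginia-1
    auth:
      auth_url: https://keystone.example.com:5000/v2.0
      username: agent.smith
      password: keep-me-out-of-git   # Horizon never exports this; supply it yourself
      project_name: matrix
```

```yaml
# construct.yaml -- hypothetical play header
- hosts: localhost        # no inventory entry needed; we are only making API calls
  connection: local       # otherwise Ansible would try to SSH to localhost
  gather_facts: false     # skip fact collection; we don't care about this box
  pre_tasks:
    - include_vars: group_vars/project.yaml   # load the project variables before the tasks run
  tasks:
    # keypair, network, and subnet tasks follow (see below)
```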
So I define, I'm calling my project the matrix. I give it the user credentials. I also provide the SSH key. Why do I need to give it my SSH key? I'm SSHing into boxes, so I need to give it my public key; otherwise I need to put my password everywhere, okay? By the way, hint, hint: if you're using Ansible, all your boxes should have SSH, you should have your keys installed on all the boxes, and you should have passwordless sudo, okay? And the most common playbook that people write is the playbook that SSHes to all the boxes and installs your key. There's actually a module for it. There's a question? Yes, very, very good question. So Ansible Tower, which is the product that the Ansible company sells, has an SSH key management system in it. In our case, we have a hybrid approach for full production. We actually have a routine that goes in and deletes all the keys, and even deletes all the users, because we use LDAP to authenticate. So if you're gonna use something in production, have it provision a key on the fly and plug it in for you, and have it taken out after a while, so nobody stays in production for longer than they need to. Okay, so there are different ways to do it, but if it's not production, you can be a little bit lenient about it. There are projects that you can use to do it, but I really didn't find one that I liked other than vault.io, and I didn't have a chance to vet it properly. Okay, so I also wanna define some networks. The application needs some networks, right? So I'm gonna define a network that I'm gonna call the jump network, and in OpenStack, each network needs to have a subnet, because a network can have multiple subnets. So I'm saying, great, this network has this subnet, and this is the IP address range that I want that network to have. And I go down for five networks. Would you mind going down there? And I also wanna define a router that I can use to connect all these networks together. So far, so good. And basically these variables define everything.

So let's go to the previous tab. The first thing you do in here, and I'm assuming the most vanilla setup you could have, you don't even have your key uploaded. So what you do here, the way Ansible works, is you call a task: you start with a dash here and you give it a name; this is a comment, basically. So when you do an Ansible run, this is the comment that's gonna come out. This is a module. Notice that the name and the module are at the same indentation level. Trust me, indentation will kill you, so keep that in mind. Then we indent a little, and these are the variables that this module takes. It takes a cloud, because remember the clouds.yaml file that we had? It needs to know which cloud to pull out of that file, because you can have ten, so you need to tell it which one. And because I'm smart, I kept it in a variable file. And I wanna say present, because you can say absent and go in and delete a key that's there, okay? And then I give it a name, because I need to use it later, and I give it the public key file. Makes sense? By the way, guys, I kept my key there, so if you wanna put it on any machine you want, feel free to do so. Okay, then I wanna start creating the networks. Now we're getting into slightly more advanced Ansible-ishness, okay? Two minutes, I promise I'll stop. So what I'm doing here is a loop, okay?
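Continuing the sketch under the same Matrix naming: a hypothetical group_vars/project.yaml with the project, key, and network variables, followed by the keypair task and the network loop that would sit under the tasks: section of the play sketched above. The names, CIDRs, and paths are illustrative, not the repo's exact contents.

```yaml
# group_vars/project.yaml -- hypothetical variables file
project_name: matrix
cloud_name: matrix-cloud             # which entry in clouds.yaml to use
keypair_name: matrix-key
public_key_file: ~/.ssh/id_rsa.pub
project_networks:
  - net_name: jump-net
    net_subnet: jump-subnet
    cidr: 192.168.10.0/24
  - net_name: web-net
    net_subnet: web-subnet
    cidr: 192.168.20.0/24
```

```yaml
# construct.yaml tasks -- hypothetical excerpt
- name: Upload my public key to the cloud
  os_keypair:
    cloud: "{{ cloud_name }}"
    state: present                    # absent would delete the key again
    name: "{{ keypair_name }}"
    public_key_file: "{{ public_key_file }}"

- name: Create the project networks
  os_network:
    cloud: "{{ cloud_name }}"
    state: present
    name: "{{ item.net_name }}"
  with_items: "{{ project_networks }}"
```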
So, same thing: I have a module, I have the properties of each module, but I'm looping over project_networks, because remember, under project_networks I have a set of networks, and each network has its own properties, okay? So I said, loop over all the networks, and from each item that you're looping over, give me the net_name, the network name. Fair enough? Let's scroll down. This is creating the subnets, okay? And it's the same thing: I'm looping over it, but I'm using more variables. So I give it the net_subnet, the CIDR, and actually the network name also. So far, so good. Now the last one is the OS subnet facts task. Sometimes when you're running something, the inventory is not kept alive; the inventory is read at the beginning and used going forward. Sometimes you need to load facts as you go along. So you have a module like this that says, just give me the facts, please, okay? It gathers the facts and stores them in a variable for you that you can use; it's basically just a dictionary, and you can parse it any way you want, okay? So unfortunately, thank you, Justine, unfortunately this is as far as we can go at the moment, but I'm gonna sprint through a couple of things quickly, if you don't mind sticking around for five or ten minutes. So the second step that we'll actually go through is the provision playbook, and this one is where I actually provision the VMs. In this case I'm using an Ansible role that I created. The role is called provision-vm, and the role takes as input the cloud name and the specifications of the VM itself. Now, since I'm a good boy, what I do is store all those in variable files. You can see I stored the VM specifications in a variable file: the name of the VM, the key name that I attach to it, the size of the VM, and basically how many I want of it. So it'll create as many as I need. So, going back to the presentation: you start basically from an empty set. You call the construct and you get the networks that you see here, just by calling the playbook, and then you call the provision and you have the hosts created after you call it. And when that happens, the Agent has technically lost, because you've provisioned everything that you wanted to provision, and I guess I will say, now you know kung fu, as Neo famously said. So we can discuss this more offline. I appreciate your time and your patience, and I hope you learned something new today. Thank you very much. The slides are gonna be posted on the talk page itself.
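As a closing sketch of those last pieces, here is roughly what the subnet loop, the mid-run facts lookup, and an os_server-based provisioning task could look like. These are task-level excerpts that would sit under a localhost play like the one above; the module names are the stock os_* modules from Ansible 2.0, while the image, flavor, and count are placeholders rather than the values used in the talk's repo.

```yaml
# hypothetical task-level excerpts (they belong under tasks: of a localhost play)
- name: Create one subnet per network
  os_subnet:
    cloud: "{{ cloud_name }}"
    state: present
    name: "{{ item.net_subnet }}"
    network_name: "{{ item.net_name }}"
    cidr: "{{ item.cidr }}"
  with_items: "{{ project_networks }}"

- name: Pull subnet facts back out of OpenStack mid-run
  os_subnets_facts:
    cloud: "{{ cloud_name }}"
  register: subnet_info               # just a dictionary; parse it any way you like

- name: Boot the VMs (the kind of task a provision-vm style role would wrap)
  os_server:
    cloud: "{{ cloud_name }}"
    state: present
    name: "web-{{ item }}"
    image: ubuntu-16.04-placeholder   # placeholder image name
    flavor: m1.small                  # placeholder flavor / VM size
    key_name: "{{ keypair_name }}"
    network: web-net
  with_sequence: count=3              # "how many I want of it"
```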