Welcome and good afternoon, everyone. Our presenter this afternoon, Monty Taylor, will be presenting No Ops with Ansible and Puppet. Please welcome him. So yeah, I'm going to talk about those things, or I might talk about something else, because I might ramble off in the middle of a thought somewhere in one of the slides. So we will either cover this topic, or maybe I'll just show you pictures of the worst cats. These slides: at some point I'm probably supposed to send them to an organizer somewhere, who will probably hate me because it'll take me a while to do that. They are already on the web, on the GitHubs, and they're in HTML, so you can fork them and do with them as you will. Because, what's that? What license? It's entirely possible I may have forgotten to license them, which is ridiculous. I take bugs. Anyway, you can contact me at those places if you'd like to. What would probably make my employer really happy is if some of these logos on here said HP on them. I am a distinguished technologist at Hewlett Packard, or excuse me, at HP. We're not Hewlett Packard Enterprise yet, and I will not be going to HP Inc. because I don't know anything about printers. So that's a lot of fun. It's probably most useful to point out that I do a lot of stuff with OpenStack, and I'll talk about that a bit; you'll get sick of hearing it. I do all of those things, so I sit on lots of bodies where I have to be in meetings, which is very exciting if any of you wanted to get into the world of management. I highly recommend not, as much as it's a lot of fun. I also used to work for MySQL back in the day, and at one point Stewart Smith and I hacked on this thing called Drizzle. I believe the OpenStack project is the world's largest user of Drizzle in production, since for the longest time our pastebin service was running on top of it. We did uninstall that recently. Anyway, less important.
So as I said, I tend to ramble before getting to the point of the talk. We're going to talk about all four of these things, amazingly enough, even though I can't even get one topic into a 40-minute time slot. So I'm going to do four. It will be successful. I will attempt to define no ops, because we aren't having enough fun defining DevOps; we need to go on to no ops. I'll talk a little bit about cloud applications, and about one specific cloud application that we'll use as an example case to talk about some other things. And in the process of that, I will teach you everything you've ever needed to know about both Puppet and Ansible, all in 40 minutes. Because that's going to happen. So the shortest section of the talk is just on no ops, which I did not realize was a contentious term; I kind of thought it was just a little funny. Apparently it really pisses a bunch of people off, and I'm not really sure why they're all mad at each other. I think it's just because of the type of people who like to be mad at each other all of the time. But essentially, it comes down to something like this: as developers who like to run code, which is what I and some of my colleagues are, you can code and let a service deploy and manage and scale your code. That definition wasn't from Wikipedia; I don't think Wikipedia has an article for this yet, because it's too hipster. It was on whatis.com or something like that. But to me, it kind of comes down to this: I don't actually want to spend my time doing ops. If I'm doing ops, then I've probably done something else wrong. What I really want to be able to do is change the system that I'm responsible for by landing commits into a source code repository. Because if that's the mechanism by which I apply changes to the system that I'm running, then those changes can do wonderful things like go through code review.
It turns out running sudo rm -rf / as root on a machine, maybe that's the thing you want to do, but it certainly doesn't go through code review and peer review as to whether or not it is. And sometimes maybe it should. Maybe before you delete all of your files, you should have the opportunity for one of your colleagues to look at you and say you are in fact a moron, and you forgot to type the rest of that command out. Those of you who've worked with me know that it's really better for everybody if there's a code review system between a commit that I'm going to land on a running system and that going live. So this is, in a very short sense, the thing that I want to accomplish. And I'm not going to pretend that we are actually in a state of no ops in the thing that I run, but it is the state towards which I believe we would like to trend. I believe it is impossible to get to the point where there is absolutely no ops. But ultimately, if I have to use my root access on machines to do a task by typing in a command, I like to consider that a bug. I have many bugs in my system right now. There are things I still have to do that way, but over time I would like to consider those not things that make me special, things that make me a really awesome sysadmin and let me prove my mettle, but really a sign that I haven't fully automated things and haven't fully put in the checks and tests that I could, such that I don't have to shell into a machine and do that. And there are some community reasons for that as well, which I will talk about in a few slides. But in order to talk about this: one of the things that has made it easier for us to do this is that we're people who work on cloud software. And in doing that, there's this idea of cloud native applications. It's actually an idea that I hate; I think it's kind of a silly idea, but I'm going to talk about it anyway.
And the idea behind this is that if you're going to write a newfangled application, you're going to write it as a cloud native application. The theory goes that you're going to have an ephemeral compute service that doesn't really store any local data, you're going to have some services in which you put your data, and you're going to design all of your applications to be resilient via scale out. That's where you're going to get your high availability; that's where all the auto scaling is going to happen. Your website that shows cat pictures is just going to magically handle the Super Bowl commercial that you ran because you're crazy. Sorry: there's this thing in the US that we call football, which is a different sport and has a large, anyway, sorry. That's terrible. It tends to drive traffic if you put a TV commercial on during one of its main events. And in theory, in your magical cloud app world, you shouldn't have to know about that, and it'll just work. Of course, that always happens exactly like that. But one of the neat things, if you design your applications in this way, is that you're getting high availability via scale out, right? Individual components of your system may fail, and you're like, okay, well, I lost that slave, but I have 20 others, so who cares? And that's nice, because in the realm of not wanting to do ops on things, it's really nice if each of the individual components of the system is itself throwaway, right? That way, if one of the systems has a logging error, because it ran out of disk or something like that, just throw it away. Just make another one.
In fact, if you have written your application in the right way, theoretically your auto scaling system should be able to just delete the problem node and create another one. And you don't have to get a page on your pager, and you don't have to spend one Friday a month not going out drinking because you need to be in ready access of a computer terminal to log in and fix problems. The problems just fix themselves, because they're not really problems that have to be resolved right away. And you can come back in on Monday and be like, you know, we lost 30% of our database slaves over the weekend; we should maybe investigate that and see what the algorithmic problems in our thing are. And you can do it leisurely with a cup of coffee, rather than panic because you were supposed to not be drinking because you're on call, but you really were, and so now it's two in the morning and you just passed out and your pager went off. So it's safer, really, for those sorts of things. And the theory that people talk about here is about forgetting long-lived systems. All of your things are supposed to be scaled out, so there are supposed to be no special snowflake servers that are important, that sit there, that you care for and that you even update. It's magic. It's like shared nothing, and Stewart and I used to work on a shared-nothing database, which was fantastic. You lose a node and it was great. I will say that there was absolutely nothing autoscale about that shared-nothing database. You had to pre-plan all of the memory allocation, because it didn't use malloc anywhere in its code base; you had to configure it to allocate on startup all of the memory buffers you would need. These were the old days before we had all this magical cloud stuff, right? So it's fantastic.
It's a great thing for new applications. If somebody gave you a new task, you'd say, hey, I've got a cloud, I'll write a cloud native application, and it would just work and it would be great. And you probably wrote a Cloud Foundry something-or-other, or you did a BOSH something-or-other, and magic happened, fairies and rainbows or whatever. One of the things that we've run into in our world, which is the one I like to call the real world, is that we have existing applications. It's all well and good to think that you can write all of your new applications from scratch, but it turns out that's not reality. Reality is that there are things you have that pre-exist. And in the systems that the team I work on is responsible for writing and running and managing, we have some of those. That team is known as OpenStack Infra. It's possibly not the world's most descriptive name, and possibly not the world's least confusing name if you don't happen to already be in our circle of people, but it's essentially the team that runs all of the developer infrastructure, tooling, automation and CI for the OpenStack project. And that may not sound particularly sexy or exciting, but the problem is that we're a victim of our own success, and we have around 2,000 developers at the moment. We have 2,000 developers, and we decided early on that we want to do massive amounts of testing on every single commit that any developer ever pushes up before we let it land. That system is fully automated, but we do a large amount of integration testing on all of the commits that can be produced by 2,000 developers before we land them. Each of those integration tests runs on a single-use cloud slave. So we spin up a machine and we run code in it. And when we're done, we delete the cloud node, because what we did is we installed a cloud inside of that cloud node.
And that's not a particularly clean operation, and uninstalling a cloud from inside of your cloud node is not something that we really think we're going to trust, because it probably did something to the networking stack inside of the kernel or something else, so we just delete everything. And that's fine. So this is the part of our system that is kind of cloud scale-out. For some numbers: okay, this is maybe a two-month-old slide, but when I wrote it, we had run 1.7 million test jobs in the six-month period preceding the writing of the slide, and each one of those had its own machine that was created and destroyed just for the purposes of running that test. Oh no, why did that happen? Oh my gosh, I hit a button and it skipped all of the slides. What's going on? Open source sucks. All right, well, so you're seeing all of these things in a really weird backwards kind of way. Let's see what the heck just went on there, and, yeah, how about here maybe? All right, let's see if it does that weird thing when I hit the button, because I have no idea what was going on with that. It really wanted to show you some Ruby. I'm going to blame Ruby for that. Hey, look at that. This is what is supposed to happen when I hit the space bar. So we ran, wait, wasn't that there? Anyway, sorry, you get to see this again. Wow, that's really, really fascinatingly strange. All right, we'll see if we have to do this every single time I need to advance a slide now. It's going to be an adventure. So: that produced 18 terabytes of log data over that same period of time, and there are whole other talks, in fact, just about collecting and analyzing that log data, but I promise I won't bore you with the details of that system at this very moment. All right, next slide: is it going to go to the Ruby DSL slide? That's exciting.
Oh, I think I know what's happened to that slide. Yeah, so there's a bug in my HTML. I'm sorry. So you missed a slide. There's a slide that apparently wants to tell you about a Ruby DSL, which was supposed to tell you that we ran about 15 million tests themselves in the month of December. In any case, I've now lost my entire train of thought because of HTML. This is the reason HTML and JavaScript are evil. So the fun part about this is that all of this infrastructure that we're running is itself a cloud application. For handling any of that stuff, we own absolutely zero computers. We own some accounts, which are free, that allow us to spin up as many virtual machines as we like. It's kind of nice when a couple of the big contributors to your project are themselves cloud providers; it makes it really easy to get free cloud accounts. So this system that I'm talking about, our example for talking about why you do some of these things, runs as a cloud application across both HP's and Rackspace's public clouds. So that's where it all is. And this is a simplified version of the architecture of the system; it shows some of the elements of it. Actually, the last time I showed this picture, the screen was much smaller, and I had to apologize that you couldn't read any of the words on it. But these screens are so enormously large that I actually feel like I could have used smaller type and gotten more things on there. So we've got a bunch of different things, and the way that I've drawn it here hopefully shows that we've got three different types of things. Things like Gerrit, Zuul, and Nodepool are all single-machine entities. They're things that are, in some ways, single points of failure in the system: if they go down, then there's a service outage.
Then there are other things, like the arrows going south from the Gerrit box there, which point to a field of Git replicas. Those are just machines that are Git replica slaves. Same thing with the eight Jenkins servers that are over there. These are sort of older-school. A lot of these repeated boxes are scale-out, but they're manual scale-out: we add another one when we need to, based on capacity planning and whatnot, and it's pretty easy because we have a cloud, so we just edit a couple of files and run a couple of commands and we get another one. But it's not adaptive or handling load in that way. And then in the middle, where there are cloud-looking things that are themselves in a cloud, because it's clouds of clouds, I guess, that's where we have a whole dynamic pool of machines being spun up and torn down all the time. Those actually are completely adaptive, and actually predictive in some ways: they'll pre-spin things up based on looking at what the incoming demand is. And one of the reasons that I'm pointing this out before I go into the actual Puppet and Ansible parts of this is that this is a fairly large system with a fairly large and complicated control plane, and we kind of have all of the kinds of things. We have this Gerrit guy right there; it's a Java application. It behaves just like all of the great enterprise Java applications that people tell you are very un-cloud-like and that you shouldn't use. Turns out it works really well. It's been running in a VM in a cloud for like three years now, and it does that quite successfully. And we have that next to some scale-out things, and we have all of those things adjacent to some dynamic things. And all together, that creates a service that our developers, our users, consume.
And so not all of the pieces of our application actually have to be this magical cloud native sort of thing. In fact, to take some of these, which are pre-existing pieces of software, and rewrite them, re-architect them completely to make them cloudish, would be a massive investment in time. And for right now, the fact that they're not operating that way isn't a problem. They're operating just fine. We'll deal with other problems when we come to them. And this microphone keeps coming off my ear, which is weird. Apparently I'm not as good at this as Steve Jobs was. In any case, that architecture, like most of the architectures I've ever been involved with, did not start out that way. We didn't start the project with me saying, you know, we need like 76 nodes of control plane with a scalable logging service and maybe eight Jenkins masters, and you're going to need to write a plugin for Jenkins to help it scale out in that way. We had three machines that I manually spun up. And the way that they were configured is that I logged into them with a password and then manually added some stuff. And then we needed a second one, and so I did that again, and ultimately you get to the point where that starts to get annoying. This wasn't very repeatable. There were also, at this point in time, a couple of people who weren't even necessarily associated with the admin team who had permissions to things, and who had spun up an additional machine over here that I didn't have a login to. It's the way that you start projects: there are some things in some corners, and you have no idea how to recover them if they were to go away, and it's pretty terrible. And we had a couple of services that were outside of our control, that we were consuming from somebody else, that may have been on 100 machines or may have been on one machine; I don't really know, because they're an external service.
So that's sort of step one, and it's a great starting place, because you don't want to spend six months planning for your initial stand-up of a Jenkins server; it turns out Jenkins is pretty easy to install, just in and of itself. But this stops being fun after a while, after you've added the fifth slave by hand, and then you've had to add the shell login accounts for five or six different people who needed to be able to do it. And you're like, oh, okay, I've just typed in 70 different things, and I've written this little ad hoc shell script, and it doesn't really work fully correctly, and I've got to bootstrap it on there. It sucks. So step one in making the system better was to learn what to me at the time was newfangled hipster nonsense called Puppet. And it was extra hipster because it's in Ruby. But it was good learning; I'd been doing Linux administration stuff for quite a while. And I'll tell you, at the time people told me I should use either Puppet or Chef, and I looked at both of them. And candidly, I did not make the decision to use Puppet on any real basis of it being better or worse than Chef. Just for the record, it was actually because at the time we were a bzr shop, and Chef was pretty far onto the Git bandwagon: you used Git to upload things. And I was like, well, I'm using bzr for everything else; I don't want to use Git to upload my things to the Chef server. So, meh. That is how we selected Puppet. It's a great selection criterion, I assure you. Although it's possibly a weird object lesson in how tying a thing to a particular version control system, when there are other choices out there, is, I don't know; I do the opposite now. So anyway, if you haven't heard of Puppet, which is possible, because I go to different places and people have heard of different things:
Puppet is an open source config management system. It's written in Ruby; so if you like Ruby, that's awesome, and if you don't like Ruby, it's less awesome. And that's not even necessarily to pick on Ruby, although I do like picking on Ruby. Largely it's a thing that you extend by writing Ruby extensions, so if you're not good at writing Ruby extensions, then you're going to get yourself into a really weird place. You're like, I need it to do something, and I'm looking at this weird Ruby code and I don't know what to do. So that's either a pro or a con depending on your particular predilections. One of the really important things about Puppet is that it models a state. It's supposed to be a declarative description of the state you want a server to be in, not the steps that you would take to get the server into that state. So you write a whole bunch of declarative stuff, and it figures out what it needs to do to get there. You tell it, I want this file to be this. It doesn't care what the state of the file was before; it's just going to make a file with that content now. So it's very much squash: it really wants to own the entire system. You can use it to just manage some stuff, but it's really in the business of owning your system. That's what it wants to do; that's how it's effective at its job. If you haven't gotten the config management religion yet, it's a fantastic next step after hand-installing a couple of things on a couple of machines. One of the best parts about it is that it gets you repeatable and consistent machines, which, once you go from four machines to, say, 20 machines, starts to be really, really important. And then there's the other thing that we've found exceptionally great about it.
I mentioned this earlier, in the no ops section, when I said I really just want to write commits and put them into source code: it's modeling the state of your system in source code. So you can show patches to your coworkers, to your colleagues, and they can say, yeah, that's not going to do what you think it's going to do; that's going to do really bad things. And you can then collaborate. This is actually one of the other main things that we got out of this: we have a large and ever-growing community of people, and we would like for all of them to be able to collaborate with us on the management of these systems, right? I don't want to be the special one with root powers. I want anyone to be able to come in and offer improvements to the system that we're running, and to be able to take the things we're doing and repeat them somewhere else, in that whole collaborative way. It also means less repetition for me. As you might have been able to tell from me not even being able to operate an HTML presentation, sometimes I fat-finger things or make mistakes or leave out closing HTML tags, I believe. And the more I have to manually repeat a task, the larger the chance that I'm going to get it wrong. Whereas if I can encode those tasks in config management, then it's less likely that Puppet is going to apply them incorrectly. Not impossible, but less likely; it's more likely to do the same thing over and over again. So the entry point to the puppeting for all of the OpenStack developer infrastructure systems is in that Git repository right there. So you can clone it.
And you can install all of the systems that we run, for your own use. I recommend it. Actually, I was about to tell an anecdote, but I've got a slide on it. There's also a thing associated with this called PuppetDB that we run, where you can go look at the results of each of the puppet runs: somebody sends a patch and we land it, and the result of running puppet on the server that it affected gets pushed into this PuppetDB system. You can go look and see whether it worked, or if there was a problem, or any of those sorts of things, which is kind of cool from the standpoint of opening up our infrastructure to pretty much anybody on the internet. There's no barrier to entry, well, other than signing a CLA. But other than the CLA, there's no barrier to entry for anybody submitting a patch to us. We would like to get rid of the CLA, but that's a whole other talk. So, Ruby DSL, which is the slide that it really wanted to show you a couple of minutes ago. Puppet itself is basically a Ruby DSL, which is kind of neat, and it gives you some things like this. If you want to make sure that Git is present on your system, and I believe there are very few of our systems on which we don't want Git present, you can just write something like: package git, ensure present. It's reasonably readable, at least that small snippet is. And that translates itself into some Ruby data structures, and Ruby happens, and more Ruby things, and you wind up with Git being installed on your system, which is kind of neat. There's a problem, though, with it being a Ruby DSL: some of its internal data models leak.
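The two lines being read off the slide are, presumably, the standard Puppet package resource, which looks something like this:

```puppet
# Declarative: state the outcome (git installed), not the steps to get there.
package { 'git':
  ensure => present,
}
```

Note that, as comes up next, declaring this exact resource a second time in another module is an error in Puppet rather than a harmless no-op.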
And so this is where I pick on Puppet, which is really the whole point of everything. If you want to reuse chunks of your Puppet code in different places, say on different servers, and do modular programming, not even object-oriented, just some sort of modularity in your programming, it doesn't really understand the idea of an idempotent package declaration. You might imagine that if in two different places I declare that I would like Git installed, both of those still have the outcome of Git being installed, so maybe declaring it twice wouldn't blow up the world. But you'd be wrong. You need to be really careful to only tell Puppet one time that you want Git installed, which is really fun if you consume somebody else's third-party module of Puppet code that you want to stick on your machine, because without reading all of their stuff you may not know exactly what things they do, and you might have something else that you're installing that is actually trying to do the same things. Anyway, blah, blah, blah, leaky abstractions; maybe not just exposing your internal data structures through a weird DSL would be nicer. So there are three different ways that you can go about taking your source code repository of Puppet code and applying it onto your servers. And we've done all of them at various different points, and we've done all of them thinking that the others were better, in pretty much all the combinations. There's a command called puppet apply, which is basically just a local application of the stuff. You have a directory with Puppet stuff in it, and you say, apply this, you give it a file, and it'll just run in the local machine context; there's no network server architecture or anything like that.
Which is fine, because all you're going to do is git checkout a set of Puppet code and run it on the machine. It's pretty easy, and it's a great way to get started; there's very low overhead to doing that. You really don't need a lot of other things set up to be able to run some simple puppet apply commands. You can also run a puppet master server, which is a Ruby on Rails app, I think. Oh my God, seriously? All right. I promised you that I would babble, and that is apparently what I've done. So you can run Puppet with a master and have agent daemons running on each of your machines that poll back to the master and grab things. And you can also, and this is how we're doing it right now, have a puppet master and run a puppet agent, but in non-daemon mode. So you have some other thing start the agent, rather than running it as a daemon, and run it in a one-shot mode, but it's still contacting back to the master for the data it should apply. And there's a specific reason why we do that. Just as another way of picking on Puppet, you can do things that are great, like, say you need to install a bunch of users on your machines. So I want to have a user account on all machines, and I want Jim to have a user account on all my machines, and we probably want those accounts to have SSH keys attached to them. So this is the first part of the Puppet manifest that you need to do that. And then this is the next part. And then this is the next part. It's a pretty simple, basic task, and I was originally sort of hoping that using an open source config management system would give me some more basic building blocks from which I could do standard tasks without having to do a lot of work, and that seems to have been a little bit of a boondoggle. Also, there's a really fun flaw for any of you who are using it.
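A rough sketch of what those manifest parts tend to look like; the user name and key material here are invented, and the real manifests live in the openstack-infra Git repositories:

```puppet
# Part one: create the account itself.
user { 'jim':
  ensure     => present,
  shell      => '/bin/bash',
  managehome => true,
}

# Part two: attach an SSH public key to the account.
# The key material below is a placeholder, not a real key.
ssh_authorized_key { 'jim-key-2':
  ensure => present,
  user   => 'jim',
  type   => 'ssh-rsa',
  key    => 'AAAAB3NzaPLACEHOLDERKEYMATERIAL',
}
```

The serial number stuck on the end of the key title ('jim-key-2') is the manual bookkeeping about to be described: there is no built-in rotation, so you bump the serial for the new key and mark the old resource ensure => absent.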
There's a great ssh_authorized_key primitive that allows you to install authorized key content into the authorized_keys file of a user. But it doesn't really have any great built-in ability to do things like rotating or updating keys, so you kind of have to manage serial numbers on the ends of key IDs and stuff like that, so that you can delete old ones and create new ones. Because, again, they're leaking their internal data abstractions. So I'm going to move on a little bit. I mentioned that we're putting all this in Git repositories, and I pointed you at a thing on the web where you could go and grab all of our Puppet from a Git repository. But again, this is the real world. Many of these servers have private keys and certs and passwords and things like that, and it's not a good idea to put any of those things in a public Git repository. I hope that's not news to anybody, but you really shouldn't put your private keys in your public Git repository. So there's a really cool thing that Puppet has called Hiera, which is a simple YAML database. We have it, in our case, sitting on the puppet master. And what happens is, when we run the puppet agent on a machine on which we want to apply some Puppet, it calls back to the puppet master, and the puppet master can then inject some of the secrets from the Hiera data and pass them along the wire over a connection that already has an SSL cert, a pre-signed cert. So you know that it is the right machine that is asking, saying, hi, I'm review.openstack.org, please give me the Puppet that I want to apply. And what's passed over the wire will have the appropriate secrets encrypted and passed into it, which works pretty well.
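The Hiera side of this is just YAML sitting on the puppet master; a rough sketch, with invented values and an assumed file path:

```yaml
# /etc/puppet/hieradata/common.yaml
# Lives only on the puppet master, never in the public Git repository.
sysadmins:
  - alice
  - bob
contact_email: sysadmins@example.org
```

When the agent on a machine checks in over that pre-signed SSL connection, the master resolves lookups like hiera('sysadmins') against this file and ships the resulting values down the wire with the compiled catalog.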
So we can put things like this snippet in our public repository, which says: I want you to apply this class to this server, and I want you to pass into this parameter the secret data that has the key of sysadmins. And you might wonder why the list of sysadmins for a server is secret and not something you could put into the repository. It turns out that the theory of people being able to reuse our puppet to spin up servers of their own isn't just a theory. It happened way quicker than we thought it was going to, and we started getting sysadmin mail from people who had just taken our puppet and applied it on their own thing. So I was getting all these mail bounces. And unfortunately, the machine was behind a firewall, so I couldn't shell into it, because of course my SSH keys are also in the puppet. I thought I would shell into it and fix their server for them, but it was on the other side of a firewall. It would have been nicer if they'd had IPv6, but Linus told us we don't need that. Oh, I'm sorry I said that. So anyway, that's all well and good, but essentially what all this is, is a whole bunch of things running as cron jobs or as daemons, checking in every five or ten minutes and running. And when we run puppet as an agent on the machines, it's Ruby, so it hangs inexplicably. And that's not great, because the whole point of this is me not having to shell into machines to fix processes. So if the thing that is managing all the processes on the machine hangs, it kind of sucks for fixing things. So there were sort of two things that we wanted to accomplish next. One of them is to have a way for us to run and time out an invocation of the puppet agent, so that if something goes wrong with it, we can recover in a subsequent run.
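The kind of public-repository snippet being described might look roughly like this. The class name is invented, but the pattern of passing a Hiera-resolved secret as a class parameter is the one the talk describes.

```puppet
# Hedged reconstruction: apply a class to a node, with the list of
# sysadmins pulled out of Hiera on the master rather than stored in
# the public repository.
node 'review.openstack.org' {
  class { 'openstack_project::review':
    sysadmins => hiera('sysadmins'),
  }
}
```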
And the other one is that we'd gotten complicated, and things needed to be able to go in sequences. And so we added in some Ansible here. So Ansible is an open source systems management tool. I didn't call it config management, although it can do that as well. It's written in Python, so if you like Python, this is really great; if you like Ruby, this is less great for you, but such is life. As opposed to puppet, which describes a state and then magically figures out what it is that you want to do on the machine, Ansible is very explicitly a thing that describes a sequence of steps that you want to perform. It is extremely sequential and extremely linear, which it turns out makes it really easy to debug in terms of where did it break? Well, it broke in step three. It's almost like we've reinvented batch processing from the 80s, and I think maybe it works pretty well. It works over SSH, which is also pretty cool, because it turns out SSH is everywhere and SSH is already security audited, except when it breaks, except when OpenSSL has vulnerabilities. But if SSL has vulnerabilities, every other thing that you're using to do security also has vulnerabilities. So it's pretty good in terms of your security story. And one of the things that hopefully this will show a little bit is that it's really good for incremental adoption. I first started using some Ansible commands in our shared infrastructure, and I don't believe I told anybody on the team the first few times I tried it, because I didn't have to, because I didn't have to install anything or let it take over a machine or whatever. It's really good at doing ad hoc remote execution. For instance, if you just pip install ansible, you can run `ansible '*' -m shell -a uptime` and it will run uptime on all of your machines.
And how it knows what your machines are, we'll get to in a second. But you can do this, and the only thing you need is SSH: the ability for the account that you're running this as to SSH into the other machines that you happen to have. Which means you can start off doing little things with it and then slowly grow over time to where it's taken over the world. In addition to having a command-line ad hoc execution mode, it has a YAML syntax. So most of the files you're going to stick into your git repositories are just going to be declarative YAML files, which means that it's really easy to test them for validity without actually running the stuff. I may not have mentioned that it is extremely difficult to figure out what puppet is going to do without actually just running the puppet, which isn't the world's best thing. So this is a little playbook that we've got that will go out and clean out a Jenkins workspace for a particular project name on all of our static sites. I believe I've used this once, but it was kind of neat. I was able to take a slightly annoying admin task that we have had to do a couple of times and encode it into a thing, and it does the right stuff, which is neat. You could also probably do that with a bash script with a for loop, but in this case it's YAML, so it's better. I'm sorry it's not TOML, which is apparently more hipster. And you can run that on the command line with ansible-playbook; that `-f 10` right there says please fork off ten processes, so it'll do this in parallel in chunks of ten, which is kind of neat if you want to do things in batched operations and not blow all of your resources at once. And you can pass some extra variables into it and stuff, so that's pretty cool. You probably don't want just a whole bunch of ad hoc ansible commands that you type in, because then it would be kind of pointless.
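A playbook along the lines described might look something like this sketch. The host pattern, path, and variable name are assumptions, and the syntax is current Ansible rather than whatever version the project ran at the time.

```yaml
# clean_workspace.yaml -- hypothetical reconstruction of the
# "clean a Jenkins workspace everywhere" playbook.
- hosts: 'static*'
  become: true
  tasks:
    - name: Remove the Jenkins workspace for one project
      file:
        path: '/home/jenkins/workspace/{{ project }}'
        state: absent
```

Invoked roughly as `ansible-playbook -f 10 -e project=some-project clean_workspace.yaml`, which is where the forking and extra-variable flags come in.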
You can do that from time to time, but ultimately you're probably going to wind up collecting a whole bunch of YAML files and sticking them into some directories. The organizational structure is essentially four things. There's more, but I'm already probably over time. So there's a breakdown into modules, plays, playbooks and roles. Modules are... oh, sorry. Yeah, this is the example that we're going to use to talk about those four things. We're going to use Ansible to run Puppet, which might seem weird, but it's the thing that wound up being really good for us and allowed us to time things out. So a module is basically Python; it's like adding an extension primitive into Ansible's YAML-syntax world. A few slides ago, when I showed that `-m shell`, that was actually just calling the shell module, which is just a Python thing that implements remote shell execution on a host. All of the things you execute in Ansible are just some Python. And this is actually the code from the Ansible module we wrote to run Puppet on a machine. You'd think, why don't we just use the shell module if all you're doing is remotely executing Puppet? Well, it turns out that Puppet doesn't really have great return codes, so it's a little bit complicated to run it correctly. And I just like showing that because it's funny. So this is basically all the things you do. It's got some helper things, so you say, I'm going to make a module and I've got some arguments, right? So it's going to take a timeout and an optional puppet master. And then it'll collect those things, and you start doing tasks. In this case, first, I'm going to find where the puppet command is.
And if it doesn't find the puppet command, that's a failure, because if you want to run Puppet, you need to have Puppet installed. It's very hard to run Puppet without Puppet being installed. The next thing you're going to do is run Puppet. And this is the command that you need to run Puppet successfully and consistently. You do, in fact, need all of those flags, because if you do the simple version of this, it leaves out one of them, the detailed exit codes, and then you're unable to determine whether or not you succeeded or failed in running Puppet, which is weird. And there's a couple of other default things where I did some Python-y stuff for substituting things. This is the part that I really enjoy. I want everyone to enjoy the wonderful logic of return codes in Puppet. Luckily, they did get one thing right, and that is: if it exits zero, it is success. Thank God. And if it exits one, it's a failure, but for one of two possible reasons, and the only way you can differentiate between those two possible reasons is by parsing the standard out from the command. Now, I would think to myself, maybe that's because they don't know how to return exit codes other than zero and one, and so that's all they've got to work with. Except that two, which they return in some cases, is also success, except it means it actually successfully applied the changes you asked it to apply. So, excuse me. Yeah, I boggle. And then we're running it under the timeout command, because even though it has a timeout itself, it's a Ruby thing and it hangs. So we actually run it in the context of the timeout command, and if timeout times out, that returns 124. And then, by God, if you get something else out of it, I have no idea what happened, and it's just a failure.
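The return-code dance just described can be captured in a few lines. This is a sketch of the logic, not the project's actual module code, and the stdout marker used to split the two exit-1 cases is an assumption.

```python
# Sketch of the puppet return-code interpretation the talk walks
# through: 0 and 2 are success, 1 is failure (for one of two reasons,
# distinguishable only by output), and 124 is the `timeout` command
# reporting that the Ruby process hung.

def interpret_puppet_rc(rc, stdout=""):
    """Map a `timeout puppet agent ... --detailed-exitcodes` return
    code to a (succeeded, reason) pair."""
    if rc == 0:
        return True, "success, no changes needed"
    if rc == 2:
        return True, "success, changes applied"
    if rc == 124:
        return False, "puppet run timed out"
    if rc == 1:
        # Two distinct failures share exit code 1; the marker string
        # below is an assumed example of parsing stdout to tell them apart.
        if "Could not retrieve catalog" in stdout:
            return False, "could not retrieve catalog"
        return False, "puppet run failed"
    return False, "unknown exit code %d" % rc
```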
But you need all of those things to be able to consistently run it and know whether it succeeded or failed, which, in terms of running something in system management land, is important to know, because you might need to alert somebody that something has gone horribly, horribly wrong. But it's great, because we're able to take that little chunk of logic and then use it repeatedly. Now that we've got that, we can run it in a play. So this is how using it looks. I've got a little YAML file, and I give it a descriptive name, so when it's printing stuff out it says I'm running puppet rather than something less descriptive. And because I called that module puppet, here I'm just telling it to run puppet, and these are the parameters you're going to pass in. So once I've written that module, it's actually pretty easy to consume in plays, which are the smallest unit of operation inside of Ansible world. After a while, you're going to need to organize your plays, because if you're using Ansible to manage your servers, there's probably more than one thing you want to do, unless you're doing the thing that we're doing, which is just running puppet. But if you're actually using it for more config-management things, you might want to do more complicated sequences of things. And this is where roles come into play. You can take one or more plays such as that and stick them into a YAML file, and there's a directory structure, which is just the directory structure you're going to put things in. By putting that YAML file I just showed you into a file called main.yaml under roles/puppet/tasks, this gives you access to a role called puppet. And I apologize that I named both my module and my role puppet, but you'll just have to deal with it. And then you can put those things together into what's called a playbook, which you'll see has not just references to roles, but also the hosts where you want to run them.
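Concretely, the play being described might look like this sketch sitting in roles/puppet/tasks/main.yaml. The parameter names mirror the module arguments mentioned earlier (a timeout and an optional puppet master), but the exact names and values here are assumptions.

```yaml
# roles/puppet/tasks/main.yaml -- hedged sketch of a play invoking
# the custom `puppet` module the talk describes.
- name: run puppet
  puppet:
    timeout: 30m
    puppetmaster: puppetmaster.example.org
```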
Because having a play that does a thing is all well and good, but without telling it where to run, it's not particularly useful. What this is doing, and this is actually one of the reasons we started adding this, is that in our infrastructure we need to update our git replicas first and then update our git master, because otherwise, if we're adding a new project and we start putting things into the master, it will be trying to replicate to slaves that don't have the target repositories yet. So it is important for us to apply our puppet to our git replicas first. You'll notice the hosts pattern is git0*, so it's going to run on all of the hosts named git-zero-something, and with the max fail percentage, if any of them fail, it's going to dead-stop: nope, I will not perform any further tasks. Which is exactly what we want, because we don't want it to then do the update tasks on the master server, which is our review.openstack.org. Then it'll run that, and finally it's going to run everywhere else, because we don't really care about the ordering there and they can all just do everything, except for the AFS servers, because they're slightly different. And so we can just run this over and over again. You'll notice that each of the invocations of that role we made are kind of the same: on those hosts, you're going to apply these roles. It's not the world's most unintelligible thing, and it works kind of like you want it to. So I've shown a couple of times where there's a star or some sort of pattern matching on where you want to run things. Ansible keeps what it calls an inventory, which is its way of knowing what servers you have, so that it can know what servers it might want to run something on. It doesn't, in fact, just go out and recursively query DNS for the entire world and then do pattern matching on that, although that would kind of be cool.
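The ordering being described, replicas first, then the master, then everyone else, stopping dead on any failure, might be expressed roughly like this. The git0* and review.openstack.org host patterns come from the talk; everything else here is illustrative.

```yaml
# Hedged sketch of the ordered site playbook.
- hosts: 'git0*'
  max_fail_percentage: 0
  roles:
    - puppet

- hosts: 'review.openstack.org'
  max_fail_percentage: 0
  roles:
    - puppet

# Everything else, except the slightly-different AFS servers.
- hosts: '*:!git0*:!review.openstack.org:!afs*'
  roles:
    - puppet
```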
I guess you could write that; that'd be very strange. So the inventory is a list of your servers, and also actually some variables and some groupings and things like that. The very simple version of this, and there's a bug in that slide, is to put a simple quasi-INI-format file at /etc/ansible/hosts. Alternatively, instead of having a simple declarative file, you can have a dynamic executable that returns JSON; instead of reading the file, Ansible will execute the thing you've configured, and I'll show both of those if I can manage to do that in a couple more minutes. So this is the simple version. It's just a file, and it's just got a list of the servers in it, and then we've also made a couple of groups here. So with this inventory file at /etc/ansible/hosts, I could say, hey Ansible, run on git, right? And that would run on everything in that little git section down at the bottom, which gives you a pretty simple way to organize your things. It doesn't have to be exclusive; you'll see I've got a couple of these servers listed twice. It's smart enough to know that you might want to group things in different ways, and it's fine with that. For us, we have a slightly more dynamic system. We have all this puppet stuff going around, which already has pre-signed certs for each of the systems we've got. So we wrote a dynamic Ansible inventory that gets the list of servers from puppet itself, because if we're using Ansible to run puppet, then we probably don't want to run Ansible on any hosts that puppet doesn't know about; that would be silly. So we just ask puppet, hey puppet, what hosts do you know about? It tells us, and then we run things there. And this is, in fact, the entirety of the dynamic inventory that tells Ansible about all of our puppet hosts.
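That script might look roughly like the sketch below. The real inventory's exact command and parsing may differ, so treat the `puppet cert list --all` invocation and its output format as assumptions; the shape of the JSON matches what Ansible's dynamic inventory interface expects.

```python
#!/usr/bin/env python
# Hedged sketch of a dynamic inventory that asks puppet which hosts
# have signed certs and prints them as Ansible inventory JSON.
import json
import subprocess
import sys


def parse_cert_list(output):
    """Pull hostnames out of `puppet cert list --all` style output,
    where signed entries look like: + "host.example.org" (AA:BB:...)"""
    hosts = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith('+'):
            hosts.append(line.split('"')[1])
    return hosts


def build_inventory(hosts):
    """Shape the host list the way Ansible expects from --list."""
    return {'all': {'hosts': hosts}, '_meta': {'hostvars': {}}}


if __name__ == '__main__' and '--list' in sys.argv:
    out = subprocess.check_output(
        ['puppet', 'cert', 'list', '--all']).decode('utf-8')
    json.dump(build_inventory(parse_cert_list(out)), sys.stdout)
```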
There might be a copyright header at the top of that, but there's no omitted code there. And it's pretty easy. The third thing, and this is what I've been hacking on recently, is using Ansible for cloud management. I may have mentioned that we're running all of these things in a cloud, and Ansible is a thing that can run sequences of steps. So if I want a new, say, git server, describing that in a YAML file in my git repository becomes very easy, because Ansible can just take care of that for me. Which is kind of neat, because Ansible modules are just Python, so I can have them do anything that Python does. So if I've got steps that I run in a particular place, there's a list of steps, and it can just provision servers. That's actually infrastructure as code for real: I land a commit in a git repository and I get a new server out of it. So if you had something like this, you can imagine that this might be the input data telling Ansible that you would like a machine called pypi.dfw.openstack.org in the Rackspace cloud's DFW region, running Ubuntu, with a volume attached to it, and another one in the HP cloud in region-b.geo-1. And then we can reorganize it, but I'm out of time, so I won't talk about that. These are the steps that you have to take to launch a node in a cloud. For those of you who haven't experienced the loveliness of launching a simple server that's going to run Ubuntu, it takes pretty much all of these steps. But Ansible is a thing that allows you to encode the steps you need to perform a task, and it's pretty good at that. So you can wind up with things like this, where each of the steps is: I need to launch the node, I need to create the volumes, I need to attach the volumes, I need to wait for SSH to work (because it turns out that once a cloud tells you a server is ready, it's not actually ready), et cetera, et cetera.
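Sketched as a playbook, that sequence might look something like the following. The module names (os_server, os_volume, os_server_volume) come from Ansible's later OpenStack modules and are stand-ins for whatever the talk's playbooks actually used; image, flavor, and registered-variable paths are likewise assumptions.

```yaml
# Hedged sketch of the launch-a-node sequence described in the talk.
- hosts: localhost
  tasks:
    - name: Launch the node
      os_server:
        name: pypi.dfw.openstack.org
        image: 'Ubuntu 14.04'
        flavor: '8GB Standard'
        wait: yes
      register: node

    - name: Create the data volume
      os_volume:
        display_name: pypi-data
        size: 100

    - name: Attach the volume to the node
      os_server_volume:
        server: pypi.dfw.openstack.org
        volume: pypi-data

    - name: Wait for SSH, because "ready" is not actually ready
      wait_for:
        host: '{{ node.server.public_v4 }}'
        port: 22

    - name: Add the new server to the running inventory
      add_host:
        name: pypi.dfw.openstack.org
        ansible_host: '{{ node.server.public_v4 }}'
```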
I can add SSH host keys, I can add public IPs to the thing, because if I didn't, that's not so great. And at that point, I don't have to ask puppet about the hosts that it has; I can just ask the cloud itself for the inventory, because I probably don't have any machines that aren't already in the cloud if all of my machines are in the cloud. So I can just ask the cloud, and that gives me the opportunity to say cloud more often. And this is an example of the Ansible inventory that's coming out of the cloud metadata. So it tells me I've got a server called pypi.dfw.openstack.org and all of this information that the cloud knows about that server, which I can then use to do things like: hey, it's got this device, and I might want to format and mount that on the server, which is a thing you sometimes want to do if you've made a volume for a server. And at that point, I'm actually over time and into the question-answering section, so I'm not going to tell you about secrets. What's that? It's a secret. Secrets are secret. Are there any questions? So I'm very sorry, but unfortunately we're running very close to the opening ceremony, so we'll have time for maybe two questions. Oh, two questions, excellent. Just two. So who'd like to ask a question? Okay, one, two. Sorry, closing ceremony, not opening. Closing ceremonies. So, hi, you mentioned on a couple of slides that you're talking about running masterless Puppet, and then you're saying there's no need for a Puppet master afterwards. Is that something that you do, or something that you would advocate doing? It's a place that I would like to get to, because the Puppet master itself is a scaling point. It is a server that's serving out data to all the things, and so if you went, say, from 75 nodes to, say, 750 nodes, you start to hit scaling issues.
And that's actually, and I don't mean this to be too entirely snarky, a little bit of the Puppet Labs business model: selling you a Puppet master that scales better. And I imagine that I could figure out how to scale the Ruby application server better myself, but I'd rather just make that unnecessary and have there be one less piece that's essential and that could fall over and die like that. So I would like to get there; we're not at the point where that's viable for us yet, but I would like to get to that point. Yeah. So, just one more question. I'm not quite following why you use both Ansible and Puppet. Why not just put everything in one or the other? So that's that slide that I didn't get to. I think that's possibly a fair point. The thing that I'm really liking... so right now, what you can't do with Puppet is sequencing. I can't say run Puppet on this machine, then run it on this machine. It's just not how it's designed. In most cases, it's sort of an eventual-consistency kind of thing: it'll run in places, and eventually you'll get to the state you want. And for the most part that works well. In our case, once you get into slightly more complicated topologies, where you might need to do some sequencing in how you're rolling things out, that starts to become untenable. Ansible is actually very good at controlling the sequencing of operations, and we can do that without replacing the giant amount of Puppet we've already got. It's a way for us to supplement, and sort of take a stepwise approach to introducing this technology. It is possible, one can conceive, that in the future, after we've got enough of it in and we're happy with it, a decision could be made to move fully to having Ansible do the things that Puppet is doing, because Ansible is totally capable of doing all of those things. But at the moment, it's actually a pragmatic choice. We have a ton of Puppet already.
There's a thing that Puppet doesn't do that we're accomplishing with Ansible. And actually, I think that's a really great check mark in Ansible's book: in order to take advantage of that feature, we didn't have to throw out all of the work that we'd already put into our Puppet infrastructure. It also may be that we end up happy with the amount of Puppet that's running, and it's not worth it for us to do the translation work it would take to replace it. So far it's working out fairly well, as far as that goes. But certainly, from a clean-system perspective, one system rather than two is clearly fewer moving parts. Everyone, please thank Monty for his presentation.