Good afternoon, folks. Can you guys hear me back there? Good. My name is Will Foster, and I've been at Red Hat for a little over 12 years. I first started working at Red Hat after I'd gotten out of prison. They were building the old headquarters, I was doing construction work, and I just kind of never left. That was 12 years ago, and I haven't gotten into any more trouble since. I work in the performance and scale department at Red Hat, on a small DevOps team. We do a lot of the performance and scale engineering work across our portfolio and a lot of upstream projects, OpenStack, OpenShift. Basically, if we make it, or our pals in the upstream community make it, we try to break it at scale inside of our shared environments.

My talk today is not on performance and scale, necessarily. It's on some lessons I've learned developing Ansible playbooks, and being in the position where I didn't have a lot of time but needed to very quickly spin up a complex application stack using Ansible. Included today is a demo, and it could go horribly wrong, which would be amazing and entertaining for everyone. Not so much for me. I hope that it doesn't. So let's get started.

Everyone here, I'm sure, uses Ansible or knows what it is. This is sort of the de facto slide that I have to put in. Ansible is a configuration management system. It's a little different than, say, Puppet or a couple of the other solutions out there. It's clientless, it's written in Python, and it uses SSH as a transport mechanism. It uses YAML for the logic and playbook structure, and it also uses Jinja2, which I'm quite a big fan of, for templating config files and making them modular. So why should you use Ansible? Well, the main reason for us is we don't have a lot of time, and we don't want to spend it manually installing anything if we can help it.
We want to save as much time as possible so we can use it for other stuff. So really, it's a time saver. Like any sort of automation, it reduces complexity and human error and tries to give you your time back so you can spend it on more important things.

I'm a really big fan of the movie Office Space. Who here has seen Office Space? All right, so Office Space is a dystopian film about a place you never want to work, the exact opposite of where you want to work, but there are a lot of good life lessons that come out of it. So our friend here on the right: "Looks like you've been missing a lot of work lately, Peter." "I wouldn't say I've been missing it, Bob." You're not really going to miss that type of work if you're using Ansible to automate your workloads.

Some of the goals, at a high level, that we strive for when we write automation: first, we strive for idempotency. You're not always going to get idempotency, but you want to strive for it. It's the idea that you can run a repeated task many times, and if the system is in its desired state, nothing should change. When you write an Ansible playbook, maybe you start with something simple, people start to use it and share it, they start sending pull requests, and the scope of the playbook grows substantially. All along that journey you want to preserve the idempotent nature of it, so that if you run it multiple times, nothing breaks. You also want to template as many things as possible, and that's where Jinja2 comes in. It's a very powerful templating language. Some of the other themes I'm going to talk about, and hopefully illustrate, are the idea of breaking up your roles in Ansible and compartmentalizing each thing into its own specific area.
This helps when you have a lot of pieces and one part depends on another: you want to make sure that if you introduce something that breaks, or you have a troublesome part of the application stack, it doesn't break all the other things. And Ansible provides a whole lot of very useful modules by itself, so you want to see what's out there, because the Ansible community is gigantic. They're developing at a very constant rate, and functionality that wasn't there six months ago might be there today. A good example is the firewalling support in Ansible. I did a very poor job of writing my own firewall logic before there was a really good firewalld module that handled both iptables and firewalld, and now there's a very robust implementation of it. So at the time, that was a good lesson: read the documentation and use Ansible's way of doing it rather than trying to reinvent it myself.

All right, so Baby Yoda is really important here, because when Baby Yoda writes code, he holds his hand up with the little three fingers, and those are the three major pillars. You want to compartmentalize your roles. You want to plan for growth and extensibility, so if people contribute to your playbook, it's easy to extend it to do other tasks. And lastly, you want to strive for idempotency. Now, unfortunately, when we write code or playbooks, we can't do it as well as Baby Yoda. All Baby Yoda does is roll up, hold out his three little fingers, and concentrate, and then perfect code pops out. No bugs, forever. It's the best thing in the world. Then he passes out for 48 hours, and that code is the best it's ever going to be. But unfortunately, we have to actually write the stuff ourselves, and we're going to make a lot of mistakes and a lot of bugs along the way. OK, so I promised you a demo.
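As a concrete illustration of that firewalld lesson, letting the built-in module own the firewall state takes one task instead of hand-rolled iptables logic. This is a minimal sketch, not the playbook's actual code, and the port list is illustrative:

```yaml
# Hypothetical sketch: use the firewalld module rather than templating
# iptables rules by hand. Ports shown are examples for an ELK host.
- name: Open Elasticsearch and Kibana ports
  firewalld:
    port: "{{ item }}"
    permanent: true
    immediate: true
    state: enabled
  loop:
    - 9200/tcp   # Elasticsearch HTTP API
    - 5601/tcp   # Kibana
```

Because the module is idempotent, re-running it against ports that are already open reports no change, which is exactly what the three pillars above ask for.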
So let me talk a little bit about what the demo does. This is going to be our example of a complex stack. Who here is familiar with the Elasticsearch stack? Oh, that's fantastic. OK, cool. Elasticsearch is a pretty common application stack nowadays, and there are lots of ways to deploy it. You can do it in containers; when I started getting involved with Elasticsearch a couple of years ago, I still preferred a VM approach with RPM packages. But it's a good example of what I would call a complex stack, because you have a couple of core components that depend on the others being available before you can instantiate and use them. So what we're going to see in this talk, and in some of the examples I'll go through, is places where I've used Ansible's built-in functionality to ensure that other parts of the stack are available before I try to instantiate something. And from a troubleshooting perspective, we'll go over some debug steps and other things you can use if you're in the same position.

We're going to use CentOS 7, and if you want to look at the code yourself, it's right here on GitHub under ansible-elk. Let me pop it up here. There we go. All right, so we are here. The first thing I want to look at is my inventory file. It's going to be real simple. I'm going to deploy a full ELK stack, Elasticsearch, Logstash, Kibana, Nginx, with some memory tuning as well, on a VM on my laptop called host01. I'm going to have another VM on my laptop called host02, and it's going to ship all of its system logs via Filebeat over TLS to the first VM. So the first thing to do is check this out. I'm going to take a quick look at my vars file, and we'll dive a little into the importance of this and what it is, but I want to go through some of the options. Out of the box, you don't need to change anything.
But what I want to illustrate here is that there are quite a few options I've added to this playbook over time, things that either people have asked for or we thought would be useful. Things like swapping out Fluentd for Logstash, using Apache instead of Nginx for your reverse proxy, or setting different listening ports. I bring this up briefly because it reinforces the idea of modularity within your playbook.

So let's go ahead and run this thing. This is going to install a full ELK stack, or it may explode. Who knows? If you want to start taking bets, now is the time. OK? This is going to go ahead and run, and while it's running, I'm going to continue with the rest of the material. This is basically exactly what I just said: we have host01, which is going to be our full ELK stack, and host02, which is going to be the client. "Now you've had your snack. You've played with the button. Now it's time to put your jammies on." That's the Mandalorian. I feel like Ricky Gervais in The Office, where he tells a joke and no one gets it and it's real awkward. So maybe I didn't do it right. Cool.

This is the obligatory slide when you're running a demo and you're just waiting for stuff to happen, but modified for Ansible. We're actually doing work right now. We're just Ansible-ing. We're waiting for the Ansible-ing to continue. Let's see where we are now. OK, stuff's still running. That's good. We're down to setting up Kibana. This is interesting. When I talked about waiting for services to come up and be available: I've used wait_for and a couple of the other Ansible built-ins to ensure that my dependent services are actually up and running before I try to install a whole bunch of other stuff. We'll also check the service's endpoint and the API for the actual content we want to see.
So we'll look for a 200 response code, or if it's a web application (this is my ugly firewall code, by the way), we might look for certain web content to be displayed. All right, that was quick. OK, cool. So this went faster than we thought, which is good. My last role in Ansible is an instructions role that just prints for the user what they need to do to use the software. So we should have a full Elasticsearch stack right here. Look at that. Who's going to lose money and who's going to win money? OK, so we've got a full 6.8 Elasticsearch stack. No, I don't want to share usage information with you. And we are going to dive in here.

Now, we see here at the bottom that we've already got an index created. That's because we've sent our local system logs to the local listener for Logstash, and it started to populate an index. But I tend to pause here, because some people like to change the naming convention of the index pattern. There are a couple of options with the index pattern that are more of a design choice for your usage, and I don't want to make that decision for anybody. So let's go with logstash-*. OK, cool. And this is where you want to decide: do you want to go timestamp-based, do you want to go received-at, or do you not want a time filter at all? That's a conscious decision you, as a developer or a consumer of Elasticsearch, might want to make when you instantiate it for the first time. We're going to go with timestamp and create our index pattern. And we should start seeing some logs already. All right, look at that. Sweet.

So notice that this is just host01. Remember, that's just our first VM, but we've already populated our Elasticsearch index. And this is already useful, in that you don't have to SSH to a machine and run journalctl or look through log files.
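The kind of endpoint check described a moment ago, blocking until a dependency actually answers before installing anything on top of it, can be sketched with the wait_for and uri built-ins. Hostnames, URLs, and retry counts here are illustrative, not the playbook's exact values:

```yaml
# Hypothetical sketch: don't continue the deploy until Elasticsearch
# is really up and serving its API.
- name: Wait for the Elasticsearch port to accept connections
  wait_for:
    host: localhost
    port: 9200
    timeout: 300

- name: Block until the API answers with a 200
  uri:
    url: "http://localhost:9200/_cluster/health"
    status_code: 200
  register: es_health
  until: es_health.status == 200
  retries: 30
  delay: 10
```

Every task after this point can then assume a healthy Elasticsearch instead of racing it.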
When you start sending all your logs here, you have only one place to go to search for everything, which saves you time. The second step, now that we have a full ELK stack up, with Nginx on the reverse proxy end, Elasticsearch, and Kibana for the dashboarding, is to send some data to it. So we have one more line to run here, and what it's going to do is install the Filebeat client on the second VM. Filebeat is a lightweight log shipper, mostly written in Golang, and it's going to ship whatever we tell it to to the remote server we've already set up. So if you had, say, a fleet of systems, you could run this once, have them all in your inventory, and they would all start shipping their logs to Elasticsearch.

While that's running, let's get back to the talk. I want to mention Ansible facts. Who here is familiar with Ansible facts and discovery? OK, awesome. Quite a few people. So you can correct me when I make a mistake. I really, really like system facts. You can do quite a bit with them that you normally wouldn't be able to do if you tried to write something yourself. One of the first things I'll do if I'm trying to automate something is look at all the facts available to me that describe the system: anything from the amount of memory installed, to the operating system distribution and minor version, to the version of Python, the disks, the size of the disks, what order they're in. Facts are massive. This is probably going to seem elementary to a lot of folks already doing this type of work, but the example I want to bring up with facts is that in this playbook, we actually tune the memory we give Elasticsearch at install time based on the installed system memory. We never want to go over 31 gigs of heap, because beyond that it's just not really useful.
But we want to leave about half the memory for Lucene indexes; typically, that's the recommended setup. So with however many gigs of memory this VM has, it was eight, four gigs would be given to Elasticsearch, and the other four would be left available for Lucene. This is just a good example of using facts to adjust the configuration.

We mentioned service dependencies a little earlier. This is the example of using wait_for, and the uri module, to look for a status code of 200. We don't want to bring up the rest of our Elastic stack if the main Elasticsearch daemon isn't running and serving a 200. It does us no good to install all that other stuff if the main thing it depends on isn't available. There are all these really useful built-ins in Ansible for doing this. The uri module in particular is really, really powerful. wait_for is also useful, and so is until. Some more options here: this is for the listening port for Logstash. It's kind of a chicken-and-egg problem. You can't send data to the Logstash port if the port isn't up and accepting connections, but you have to make sure the port is up and accepting connections before you continue to install your stack. All of these little Ansible built-ins are there to make sure you have a nicely orchestrated stack deployment. This is just an example of what you might see in your playbook run, doing these sorts of checks to make sure your hierarchy of dependencies is met.

One more example: we do quite a bit of automation with bare metal hardware, and so we have the unfortunate ability to use some of the vendor tools. Each vendor does its out-of-band tooling differently: you have iLO for HP, racadm for Dell, and Supermicro just has a bare-bones IPMI implementation.
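The heap-sizing rule described above, roughly half of RAM for Elasticsearch but never more than 31 GB, falls out of a single fact. This is a sketch with illustrative variable and template names, not the playbook's exact code:

```yaml
# Hypothetical sketch: derive the Elasticsearch heap from the
# ansible_memtotal_mb fact. 31744 MB (31 GB) is the ceiling.
- name: Compute the Elasticsearch heap from installed memory
  set_fact:
    es_heap_mb: "{{ [(ansible_memtotal_mb / 2) | int, 31744] | min }}"

- name: Render jvm.options with the computed heap
  template:
    src: jvm.options.j2   # would contain -Xms{{ es_heap_mb }}m / -Xmx{{ es_heap_mb }}m
    dest: /etc/elasticsearch/jvm.options
```

On the 8 GB demo VM this works out to roughly 4 GB of heap, leaving the rest for Lucene, as described above.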
We've actually since written a tool called Badfish, which is a vendor-agnostic wrapper around the Redfish API, to automate all of this. But in some cases, especially older hardware that doesn't support the Redfish API, we'll have to use the old vendor tools. The example here uses racadm to compare the boot string of the device and tell us whether we need to do further automation on that bare metal host or whether it's already in the state we want.

OK, variables and conditional logic. When we first ran this playbook, you saw that very long YAML file with all the values that let us customize the playbook. That's referred to as your vars file. It could be one file or many files; I tend to go with group_vars and an all.yml and pile everything in there, but it depends on your usage and how big your playbook is. What I want you to glean from this is that when you compartmentalize your roles, say you have five roles to deploy a full stack, the needs of your playbook are going to grow within those roles. There might be demands to substitute one part of the stack for another. In my case, someone wanted Apache instead of Nginx because they were using a lot of the mod_ldap functionality in Apache, which Nginx doesn't have, at least natively. When you break your roles into small, compartmentalized chunks, you can then use conditional logic to get more flexibility in your deployments.

All right, on to playbook and role hierarchy. There's no right or wrong way to do this. What I've learned, partly the hard way and partly from other people, examples on the internet, and the documentation, is the hierarchy that I prefer. At the very top level, I have an install directory, and that's usually where my playbooks go. There's usually a solitary YAML file for every playbook.
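The compartmentalized-roles-plus-variables pattern just described might look something like this in a vars file. The variable names here are illustrative, not necessarily the playbook's exact ones:

```yaml
# group_vars/all.yml (hypothetical names): booleans that let consumers
# swap one part of the stack for another without touching the roles.
logstash_install: true        # set false and enable fluentd to swap shippers
fluentd_install: false
nginx_install: true           # swap for Apache if you need mod_ldap-style auth
apache_install: false
kibana_listen_port: 5601
elasticsearch_listen_port: 9200
```

Each role then only has to honor its own flag, which keeps a pull request for one component from touching the others.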
So in the one we just ran, there's elk.yml, which deploys the stack on one server, and there's elk-client.yml, which deploys Filebeat (and maybe Metricbeat or Packetbeat or any of the other Beats) on the multitude of clients you have. I usually put those in the install directory. group_vars is where my variable files go, then there's a roles directory directly under install, and underneath each role I tend to just have files, tasks, and templates. Again, there's no right or wrong way to do it. If you have a simple playbook, you don't even need any of this structure. But you should count on your playbook growing over time. People are going to start using it; they're going to have more needs; they're going to say, it'd be really cool if you could do this thing, and this thing turns into 20 other things. So I try to spend some time, when I first start designing how my automation and orchestration will work, thinking through all the possible use cases, and I use a hierarchy like this so that no matter how big it gets, it's still wieldy enough, still manageable enough, for me to quickly drill into a specific area I need to develop against.

So again, this is an example. This is the stack we just ran, and we have quite a few roles in here. We've also used conditional logic to either/or things. So for example, here, if we're using Nginx, we make sure that (oh, I can't highlight that, that's an image) the Apache reverse proxy role doesn't run. That's an easy way to swap between two different roles that do the same thing. Again, there's no right or wrong way to do it, but compartmentalizing your roles really helps with the scale and growth of what could become a complex application stack. Some obvious things.
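The either/or role swap described above comes down to a when: condition on each role in the play. A sketch, assuming boolean variables like the ones a vars file might define (role names here are hypothetical):

```yaml
# Hypothetical sketch: only one of the two reverse-proxy roles runs,
# chosen by the booleans the consumer sets in their vars file.
- hosts: elk
  become: true
  roles:
    - role: nginx-reverse-proxy
      when: nginx_install | bool
    - role: apache-reverse-proxy
      when: apache_install | bool
```

Because each role is self-contained, swapping the proxy never touches the Elasticsearch, Logstash, or Kibana roles.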
If you're not using an SCM nowadays, then you probably just got unfrozen out of an iceberg, and you've never seen a computer either. But I really like Git. I think everyone here loves Git. And I tend to use Git branches to freeze versions of Elasticsearch. For the example playbook we just ran, the one I maintain, I keep the last three major versions: a 2.4 branch, a 5.x branch, a 6.x, and soon I'll start testing 7, and that'll be a branch. Either way, it lets you pin things down; that's why Git is so beautiful.

So again, this is a no-brainer: you want to automate the client operations as much as possible. You put all that work into deploying your stack, but if it's hard for the clients to consume it, to install SSL certificates or pull in config files, then you've only done half the job. Whatever effort you put into the main stack, you want to make consuming it just as easy for all of the clients.

All right, so if I didn't talk about CI/CD, then I think Jen would punish me here. You absolutely want CI/CD for the playbooks you run, and the easiest way to do it, albeit maybe the laziest, is to use the ansible-lint tool. That's at least going to give you a step that makes sure you don't make any syntax errors, it does style checking, and it's incredibly easy. It's in pip, and there are RPM packages for it in Fedora and CentOS. At its most basic, you just run ansible-lint and point it at where your playbooks are located, or really any YAML file. I like to take that a step further: because everything I do is on GitHub, I have access to Travis CI. I can make a simple Travis CI template, toss that up there, and it does the same thing for me. So if I'm developing against a branch and I push a commit, it runs all those tests with ansible-lint against the playbook code.
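A minimal Travis CI config for this workflow might look like the following. The Python version and the install/*.yml path are assumptions for illustration, not the repo's exact file:

```yaml
# Hypothetical .travis.yml: lint every pushed commit with ansible-lint.
language: python
python:
  - "3.6"
install:
  - pip install ansible ansible-lint
script:
  - ansible-lint install/*.yml
```

That's the whole gate: a pushed commit that introduces a syntax or style error fails the build before it ever reaches master.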
And what I tend to do is have a development branch and a master branch. I do all my work in development, and then I open pull requests from development to master. It literally takes five or ten seconds to push a .travis.yml and get it integrated into Travis CI. And if you're using GitHub, there's no reason why you shouldn't be using Travis. I don't have enough time to explain how to set it up, but it really is that simple, and I do have a full guide at hobo.house, which is my blog. I think the first link there, the latest post, is how to set this up for GitHub, specifically for Ansible playbooks.

All right, troubleshooting and debug tips. If you're like me, you make mistakes all the time and very rarely get things right, so knowing how to debug things and figure out what went wrong is super important. Luckily for us, there's the debug functionality in Ansible. I use this a lot if I want to pull out a certain fact, or the output of a certain task, like the return code or the standard-out content, that I then want to automate against. I'll put these little pointers into my playbook while I'm testing, before I automate against a certain registered variable, to make sure I have it right and I'm drilling in correctly. Sometimes on hardware, you'll see what you think is the primary disk, and it turns out that the way Ansible orders the index, that disk might be two down on the list. So if your task was to format and wipe that disk, you certainly want to check that you have the right one.

All right, here's another example where I'm checking some basic things out of facts: that I have the right operating system version and the right distribution major version. And lastly, you can just run with --check, and Ansible will tell you what it's going to do without actually doing it.
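The facts check just mentioned can be sketched with the debug and assert modules. The distribution values here are examples matching the demo environment, not a requirement of the approach:

```yaml
# Hypothetical sketch: inspect a fact before automating against it,
# then fail fast if the platform isn't what the playbook expects.
- name: Show the facts we're about to branch on
  debug:
    msg: "{{ ansible_distribution }} {{ ansible_distribution_major_version }}"

- name: Bail out early on an unexpected platform
  assert:
    that:
      - ansible_distribution == "CentOS"
      - ansible_distribution_major_version == "7"
```

The same register-then-debug pattern works for command output: register the result, debug its stdout and rc, and only then write the conditional that depends on it.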
It'll run through and tell you what changes it would make. All right, so for the playbook we just ran, there are two major things I'm working on. I've kind of been doing this in my free time, but also when we need to deploy something within performance and scale, I'll spend some time on it; we use the ELK stack to collect logs and performance data and things like that. The two major items are to upgrade the stack to 7.x, and to get to the point where I can do a multi-node deployment of the full Elasticsearch cluster. Right now this does all-in-one, which I think for shipping logs, for any sort of medium-sized purpose, is OK, especially on bare metal. But really, the ideal situation for Elasticsearch is to be clustered across multiple nodes. Then all of the redundancy happens at the sharding level: you make sure you have more than one shard, or several shards, per index, and you can pick which indexes are important enough to get those extra shards, so there are extra copies of that data. But this is not an Elasticsearch design talk, unfortunately.

So let's check back in on our playbook and see what happened. You could still win some money; it could still fall over. What we're looking for here are logs from our second VM, which should be shipped over TLS via Filebeat. All right: host02. Oh, so if you bet against me, you lost. If you bet on me, then maybe you won some money. We see host02 now sending its system logs directly over to the Elasticsearch stack running on host01. So, excellent. This is an example at the very bottom where we're calling the Elasticsearch API to send some payload data over to Elasticsearch. And inside performance and scale we develop an automation framework called QUADS, which automates all of our bare metal systems as well as the network switch and network infrastructure provisioning.
So we enter how we want a large set of resources to look, and when the time comes, our automation goes and chops up our different scalability labs into isolated, multi-tenant work groups; when the timeframe ends, all those resources get reclaimed and go to someone else. That's one of the places where we're using Elasticsearch. If you're interested in QUADS, you can go to quads.dev. It's also under the Red Hat Performance GitHub.

All right, I want to see if there are any questions. Thank you for your time.

[Audience question, inaudible.] Not right now. That was purposeful: I wanted to give people a chance to name their own index, but you can do it simply with a curl command against the Elastic API as well. Right, there's still that step to go in. You do it once, and it's already got data there from the localhost itself, but it was more of a purposeful design thing. It would be a very good feature request on the GitHub, though, that I'd be happy to work on, to set that as an option. Thank you, good question.

Any other questions? [Audience question, inaudible.] The question was: I mentioned Badfish, which is a vendor-agnostic Redfish API tool, and is there an Ansible role for it? There's no Ansible role for it. There's really no installation, but it's available via Docker: you can just do a docker run, or you can clone it from GitHub, and it's just Python 3, so you can run it there. So there is no actual installation.

[Audience question, inaudible.] Right, the question is that I don't seem to be using defaults/main.yml; I just stick everything in vars files. That's a personal choice, and a lot of people do it the other way as well.

How are we doing on time? OK, great. Let me show you all the Badfish tool. It's really cool. Not only do we have a really cool logo, you can read more about Badfish at quads.dev.
So you can go directly to the GitHub, and we have the core maintainer of Badfish right here in the front row, Gonzalo Rafuls, so be sure to ask him about Badfish. Sweet. But yeah, if you want to use it, you can just do a docker run, or you can clone the repo and run the Python file yourself. And our QUADS automation uses Badfish as a library as well; that's what does all the out-of-band management for us.

All right. Well, I don't want to cut into anyone's beer time, so thank you all for the opportunity. I appreciate the attendance, and if you have any other questions later, feel free to come up and talk to me after the talk, send me a message on Twitter, or find me on Freenode IRC. Thank you all.