Hi, everybody. Thank you for joining. I hope everyone ate, isn't hungry, and is ready to hear about some cool automation stuff. My name is Lauren Santiago. I'm an assistant administrator on the identity and access management team at Red Hat; I work in IT there. Today I'm going to talk to you about how I got started with Ansible, the Ansible Tower infrastructure setup in Red Hat IT, the on-call process automation we set up with Ansible and Ansible Tower, and a little bit about Nagios event handlers and Ansible Tower, and at the end we'll have some time for Q&A.

So, how I got started with Ansible. I started as an intern at Red Hat, and when they converted me to a full-time employee I didn't have very much Linux experience. I wanted to prove myself, so I was volunteering for everything that came up, and one of those opportunities was release engineer. They were already using Ansible for their releases, so that's how I first started using it. Over time I trained as release lead, which allowed me to start editing playbooks, troubleshooting issues, and helping people create their own. I worked on upgrading Ansible and upgrading the code, so I kept getting more hands-on with it, and that's how I really got into Ansible. I've been using it for about three years now. My favorite thing to do with Ansible is to help train somebody who's never used it: automate their first task, something they've always done manually, show them how to get it automated and scheduled, and bring them into using Ansible as well.

Now, the Ansible Tower infrastructure setup at Red Hat. We have more than two data centers, but for Ansible Tower we have a staging environment, which is for testing, our main environment in Phoenix 2, and our DR environment in RDU 2. Both sit behind an F5, and there are three Tower nodes. In our primary data center, Ansible Tower has a clustered Postgres database with an active node and a passive node, and we can use a playbook to cut over and switch them any time we need to. That also helps when the DBAs are upgrading or doing other work, so we can prevent an outage. Because we use Ansible Tower for something like self-healing in the on-call process, we created a DR environment; we never want any downtime. The DR site has a secondary Postgres node that stays passive until we cut over to DR, and we use a playbook to make it active. Right now the cutover to DR is manual for a lot of the applications, though not all of them. With Ansible Tower we use playbooks, but someone still has to run each playbook to get it set up. The point is that we never have an outage: Tower is always running and available for us.

So, the steps that were automated in the on-call process; well, this is all of them. A service alerts on a host that's broken. Nagios is our monitoring system, so when a host is broken or a service is alerting, Nagios calls the Ansible Tower API, which launches a job and runs a playbook. Initially it sets five minutes of downtime; it does this to give Tower time to run the playbook and fix the host without Nagios alerting again. Then, once the playbook finishes and the alert clears, a blog post is created.
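To make that concrete: at its core, the call from Nagios to Tower is just a job-template launch against the Tower REST API. Here is a minimal sketch; the URL, template ID, hostname, and credentials are placeholders rather than Red Hat IT's actual setup, and the fuller handler logic comes later in the talk.

    #!/usr/bin/env python
    # Minimal sketch: launch an Ansible Tower job template over the REST API.
    # All names below are placeholders, not Red Hat IT's real configuration.
    import requests

    TOWER_URL = "https://tower.example.com"  # placeholder Tower host
    TEMPLATE_ID = 42                         # placeholder job template ID

    response = requests.post(
        "{0}/api/v2/job_templates/{1}/launch/".format(TOWER_URL, TEMPLATE_ID),
        auth=("nagios", "REDACTED"),                # shared Nagios service account
        # Restrict the run to the alerting host; assumes the template
        # prompts for a limit on launch.
        json={"limit": "broken-host.example.com"},
    )
    response.raise_for_status()
    print("Launched job", response.json().get("job"))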
That blog post goes in our documentation space in Jive, which is a production system. It's there so the service owner, as well as the on-call team, knows that a playbook was run against a host and that the host was broken. It's also announced in IRC to the on-call team and the service owner: which host alerted, what job was run, and what fixed it. An email and an IRC message are also sent if the job fails, and in that case Nagios pages the on-call person as well. We set it up that way because it used to be that the on-call person had to acknowledge the alert themselves, then go open the documentation space, find the playbook that the service owner provides, manually follow the steps to fix the host, and then manually create the blog post. This makes it so none of them have to do that.

I wanted to include some pictures so you could see this. These are some jobs running at three or four in the morning. There is someone on call 24/7, but they didn't actually have to get paged or get online to fix anything; it all got fixed automatically. The on-call team and the service owners can look at the jobs that were run, and they also have the Mojo blog posts. They're also implementing code now that tracks alerts in the CMDB: if a host keeps alerting, or alerts multiple times within a certain time period, it opens a ticket in ServiceNow, which goes into the service owner's queue. You don't want a production host restarting services every day; obviously something more is wrong, and they want it looked into and fixed. That way everyone is more aware of what's going on.

Here's an example of the blog posts. This is one of the first basic ones. You can see it announces that a playbook was run by Tower on the host, with the date and the time, and it gives a link to the Tower job status. They're now adding variables to the inventory files so the service owners can be tagged too; instead of having to go look for it, they'll get an email or a notification that their host was broken.

Now some configuration and infrastructure management context at Red Hat, just to touch base on what I'm talking about. Puppet is our main configuration tool. Nagios is our monitoring tool; how many of you use Nagios as your monitoring tool? Ansible Engine is used for releases, ad hoc repair commands, and building hosts on VMs. Ansible Tower is used for the self-healing and DB configuration, they're also moving the release process into it, and a lot of teams in IT have scheduled jobs in Tower for daily tasks they want completed. Before this, the integration between Nagios and Ansible Tower did not exist, so we developed our own and open sourced it; I'll talk about that more later.

For those of you who don't use Nagios or don't know it: Nagios is an open source monitoring tool for computer systems. It's built for Linux but can be used on other operating systems. It does periodic checks of services, applications, and networking. It can be agentless, and the agent it uses is the NRPE agent. That's just a little background so you can follow along; I'm sure everyone has at least some monitoring service. It may be a different one, but they're all pretty similar, just with different benefits.
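For those following along, a Nagios service check definition looks roughly like the sketch below. This is a generic example, not Red Hat IT's actual configuration; the intervals mirror the five-minute and one-minute cadence described in a moment.

    # Minimal sketch of a Nagios service check (placeholder host and command).
    define service {
        use                  generic-service
        host_name            app01.example.com
        service_description  httpd
        check_command        check_http!-H localhost -e 200
        check_interval       5   ; minutes between checks while the service is OK
        retry_interval       1   ; minutes between rechecks after a critical result
        max_check_attempts   4   ; alert on the fourth consecutive critical
    }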
Here's an example from Nagios, so you can see what a host monitored by Nagios looks like. The localhost over there would be the host. There are default checks that come with Nagios, and then service owners can configure custom checks for certain services, either to make sure something is running, or to have a script run and make sure it returns a certain result. They can determine what gives back an OK status and what gives a critical status, which in turn determines what gets paged, or what causes the Ansible Tower API to run the playbook.

The standard monitoring workflow for Nagios right now: if a service is OK, Nagios checks it every five minutes. If a check comes back critical, it checks again a minute later, then again a minute later, and once it comes back critical the fourth time, it alerts the on-call person. What they would have to do then, from what I've seen, is acknowledge the alert to silence it, go look for the documentation, which lives in a separate place in the documentation space, and then perform the necessary actions provided by the service owners. Once it's fixed, the service goes back to OK and Nagios continues checking every five minutes.

The workflow with Ansible Tower and Nagios is a little different. Nagios still does the five-minute check, and when a check comes back critical it triggers the event handler. But it doesn't run the script that calls the Ansible Tower API right away, because we were trying to prevent false alarms; before, Nagios would check two or three times before it paged the person on call, and we don't want to page that person at all if we can help it. On the third check, the event handler makes the call to Ansible Tower that triggers the job and sets the downtime. Again, the downtime is set so Nagios doesn't keep triggering the event handler or page the person on call, and so Ansible has time to run the playbook, hopefully fix the host, and clear the alert. If it doesn't, that's when the on-call person still gets paged. The playbook takes the node out of rotation, takes the corrective actions, and puts the node back into rotation. One thing we pushed the service owners to do was put a check in their playbook that has to return a certain value before Ansible even puts the host back into load balancing, because we don't want to put a broken host back. We'd rather have the on-call person, or the service owner if it's normal hours, look at it and see if we can get it fixed. Then it creates the blog post about the event, sends a notification to the IT on-call person, and Nagios is back to green and happy, checking every five minutes.

Right now, when developers deploy services, they define their Nagios checks using Puppet. The checks go in Puppet modules developed by IT, and it's very little code. Like I was saying, some predefined checks come with Nagios, and they can get as detailed as they want. The service owners know their systems and what their applications need to do best, so this heavily depends on them, either working with us or coding what they want themselves. Ansible Tower generates a generic host inventory from the list of hosts monitored by Nagios.
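As a rough illustration of what building an inventory from Nagios can look like: the sketch below parses the Nagios status file and emits the JSON that Ansible expects from a dynamic inventory script. The status-file location and the parsing are assumptions made for illustration; the real integration's inventory code lives in the open-sourced project.

    #!/usr/bin/env python
    # Illustrative sketch of a Nagios-backed dynamic inventory for Ansible.
    # The status-file path and parsing are assumptions, not the real integration.
    import json
    import re
    import sys

    STATUS_FILE = "/var/log/nagios/status.dat"  # assumed default location

    def nagios_hosts():
        """Collect every host_name that appears in the Nagios status file."""
        hosts = set()
        with open(STATUS_FILE) as f:
            for line in f:
                m = re.match(r"\s*host_name=(\S+)", line)
                if m:
                    hosts.add(m.group(1))
        return sorted(hosts)

    if __name__ == "__main__":
        # Ansible invokes dynamic inventory scripts with --list and
        # expects a JSON group structure on stdout.
        if "--list" in sys.argv:
            print(json.dumps({
                "nagios_monitored": {"hosts": nagios_hosts()},
                "_meta": {"hostvars": {}},
            }))
        else:
            print(json.dumps({}))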
This is a dynamic inventory created from that list. It's helpful because of what the on-call team ended up doing: for something like SSH or NRPE, which is already monitored by Nagios on every host, they can create a check and a generic template that restarts the service when that alert fires. That way, for every single host monitored by Nagios, if one of those services alerts, Tower knows to run this playbook to fix that service.

IT also built some standard system repair playbooks. A lot of services just need a restart, and that is a really simple playbook. We also created one with extra vars, for cases like a host with multiple services, where sometimes service A needs to be restarted and sometimes service B does. They created a generic playbook, or rather a generic job template, that takes extra vars provided by Nagios and by the inventory from the service owner. Developers are welcome to build their own repair playbooks and host inventories however they want to; they can depend on Nagios or not. There's plenty of documentation, and that is one thing I will say is key: if you implement something like this, document everything, so service owners can jump in and do their own and don't have to depend heavily on you. That said, myself and others make ourselves available to help with whatever they need.

To break down the Nagios event handler definition: this is coded in Puppet, and it usually looks like this. There's a command name, for example the Tower handler we use, which is what gets called, and then a command line, which is usually a script whose code calls the Ansible Tower API. There's the state and the number of attempts; by default we have it set to the third attempt, but service owners can change that. Some say, "we don't want any action taken unless our host alerts five times," and sometimes they want action taken immediately, the first time it alerts; they don't care, they want it fixed. There's the downtime they set on the service; the host name, which is the host that's alerting; and the inventory, which they can hard-code in Puppet if it's always going to be the same, or they can use the dynamic inventory that comes from Nagios. Then there's a limit, which limits the run to that host, because sometimes they have ten hosts in an inventory and they don't want to restart or take out all of them. They can also pick limits where, say, if host A alerts but they want to fix A, C, and E, all of those go at once or one at a time. There are different limits they can set.

This is an example with the handler. The event handler is at the end, and it calls the restart-service job. They're telling it to use the generic inventory, and the service name, since it's not included in the playbook or provided in the inventory, is passed as an extra var. So Nagios tells Ansible Tower, Tower runs the playbook with Ansible, and it knows to restart httpd for the host that's alerting, based off the generic inventory, which is pulled automatically from whatever was alerting.
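The generic restart playbook described here can be very small. A sketch, under the assumption that the extra var is named service_name (the real variable name may differ):

    ---
    # Sketch of a generic "restart whatever alerted" playbook.
    # service_name is assumed to arrive from Nagios via the job template's
    # extra vars; Tower's limit restricts the run to the alerting host.
    - name: Restart the alerting service
      hosts: all
      become: true
      tasks:
        - name: Restart the service named in the extra vars
          service:
            name: "{{ service_name }}"
            state: restarted

        - name: Fail the job (and page on-call) if the service did not come back
          command: systemctl is-active {{ service_name }}
          changed_when: false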
Now, the code that was created for the handler to call Ansible Tower. This is a simplified version, so you can see it, because I wanted this to be more of a tech deep dive. The real code is not that long, probably only about a hundred and eighty lines, but I put up a more simplified version just so you can see what it's doing: the extra vars, the call that launches the job, and the if/else statement.

There are other ways we track it, too. We have Splunk, which tracks the Ansible Tower jobs and what was run on which hosts; we have the jobs tracked in Ansible Tower itself; and we have the logs of Ansible Tower being called in Nagios. That gives the on-call team another way to troubleshoot and see whether things are happening, or not happening, like they're supposed to. It's also helpful when we use the CMDB and the logs to track how many times a host alerts, whether that's every other hour, every other day, or multiple times a week, and then trigger the job that creates the ServiceNow ticket for the team to fix their host, or at least make them aware that it's alerting. That was the one concern when we first did this: we didn't want teams to think nothing was happening with their hosts, or to miss the emails and blog posts. Let's be honest, not everybody reads email, so it's easy to look past. This way you have a ticket in your ServiceNow queue: you have to look at it, fix it, and figure out the problem. And there's one other step: if they don't fix it, we're going to stop offering them the monitoring service, and they'll be responsible for it themselves.

So, the Nagios and Tower handler script. Like I said before, it didn't exist; there was no integration, and that was a little painful. I worked with other people in IT, we got some code together, and that's how we got this working. We worked with Ansible, and it's open source now; it's on GitHub. Contributions are welcome, and it's there for anyone who has Nagios and is interested in this. It makes it way easier.
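Since the slide itself isn't reproduced here, a drastically simplified handler in the same spirit might look like the sketch below. The paths, credentials, macro ordering, and attempt threshold are illustrative assumptions; the real open-sourced script is around a hundred and eighty lines.

    #!/usr/bin/env python
    # Drastically simplified sketch of a Nagios event handler calling Tower.
    # Paths, credentials, and thresholds are illustrative, not the real script.
    import sys
    import time
    import requests

    TOWER_URL = "https://tower.example.com"        # placeholder
    CMD_FILE = "/var/spool/nagios/cmd/nagios.cmd"  # assumed Nagios command pipe

    def set_downtime(host, service, minutes=5):
        """Schedule downtime so Nagios stops re-firing the handler or paging."""
        now = int(time.time())
        cmd = "[{0}] SCHEDULE_SVC_DOWNTIME;{1};{2};{0};{3};1;0;{4};tower;auto-fix".format(
            now, host, service, now + minutes * 60, minutes * 60)
        with open(CMD_FILE, "w") as f:
            f.write(cmd + "\n")

    def launch_job(template_id, host, service):
        """Launch the repair job template, limited to the alerting host."""
        requests.post(
            "{0}/api/v2/job_templates/{1}/launch/".format(TOWER_URL, template_id),
            auth=("nagios", "REDACTED"),
            json={"limit": host, "extra_vars": {"service_name": service}},
        ).raise_for_status()

    if __name__ == "__main__":
        # Nagios passes these as macros, for example:
        # $SERVICESTATE$ $SERVICEATTEMPT$ $HOSTNAME$ $SERVICEDESC$ <template id>
        state, attempt, host, service, template_id = sys.argv[1:6]
        if state == "CRITICAL" and int(attempt) == 3:
            set_downtime(host, service)
            launch_job(template_id, host, service)
        else:
            sys.exit(0)  # not the configured attempt; let Nagios keep checking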
I will make sure my slides are available to help with anything, and I'm always reachable by email too if people have questions.

Some of the success we had with this: over a hundred pager playbooks have been automated since February of last year, and I think that number is even higher now. Fifty of them were converted into a single generic service handler template, which is the restart-services one; a lot of the others were more customized, depending on how detailed the service owners wanted to be and how they worked with us to get the playbooks coded. Fifteen percent of production alerts are now automatically handled by Tower and never page the on-call person. They're moving more of the application release process from Ansible Engine into Tower, they're planning to migrate from Jenkins to Ansible Tower for base OS image builds, and they're using Ansible Engine for VM deployment and are supposed to move that to Tower as well. The DBAs, the storage team, networking: they all have a lot of their configuration set up in Tower, where they just provide an inventory, push it, kick off a job, press a button, and they have the stack of whatever they need built. Multiple teams in IT began implementing their own automation. Once people saw demos of this and how it worked, a lot of them wanted to jump on board and have their stuff included, and they also asked whether we could automate and schedule other things for them. A lot of people didn't know everything Ansible and Ansible Tower could do, and we worked with them to get it all set up. So it's been very useful and helpful, and a lot of people are interested in it.

And now I have some time for Q&A. Does anybody have questions?

[Audience question] So, they change shifts. The way it works, and I'm not on that team anymore, is that it's a global team, so there are three shift changes: usually the U.S. first, then the next region takes over, and then India and Brno. They're all supposed to follow up on the blog posts to see what was alerting and what wasn't, and they do a handoff where they talk to each other about what's been going on and update each other. Plus they have all the automation: if something keeps alerting, they have the blog posts, their communications with the service owner, and the ServiceNow ticket that gets created in the owner's queue.

[Audience question] I don't think I understand what you're asking... Oh, the way they have their on-call duty set up, they actually have the on-call rotation in Puppet, so that's how they're paged. They have email aliases that page them, and they also have an IRC bot that pages them, which they always have to have up and open. I wouldn't say those are the only two ways they get paged. And then Tower will announce in IRC and email and do the blog post for production systems.

[Audience question] Yeah, that's why they did this, to prevent that. Of course, if they do still get paged they have to get on, but they're not getting paged as often now, because it's self-healing and automated, and everything is still being blogged or getting tickets opened. They still have the responsibility of looking over it all, but instead of getting paged at two o'clock in the morning and having to hop on and fix things, the things were fixed and they don't have to do that. You're welcome.
[Audience question] Well, when Nagios calls Ansible Tower, it's the same Nagios user every time, and that helps with the tracking. A service owner can hop on and use Tower to restart their service if they need to, and that's fine; the tracking only comes from the Nagios user. That's what gets flagged: if Nagios is kicking off Tower jobs for you and restarting a service every hour or every day, that's usually not expected or wanted in any environment, but production is where it really gets flagged.

[Audience question] It will recheck it, but it won't kick off the job again. It'll page the on-call person, and they can look in there and see whether the job is still running. That was one of our issues before, and it's why we set five minutes of downtime. We didn't want to set too much, in case something major is going on; we want the person to be paged if it's not fixed within five minutes. So far, everything we've automated is fixed and validated within five minutes. There could be a case where it isn't, but usually if Tower doesn't fix it, it's a bigger issue than just the host; it's usually infrastructure-sized, something bigger is going on, and more things start going down. We also don't want playbooks taking all the nodes out of rotation. If you have four nodes, we don't want you going into a complete outage because of this, so there are preventions for that too: if you have one or two nodes already out, it's not going to take out all four, it'll start paging on-call.

[Audience question] The service owners give us that. Some service owners provide complete scripts. It's whatever they need; I don't understand all the applications and what they do, so it's whatever they want. Say a disk is full and it causes some other issue: they can create the playbook to go rm -f some files, run some other command, and then restart services. It's whatever they want to do, but they provide that to us.

[Audience question] Yes. I mean, they get access to everything but the inventory, because we don't want them going in and messing with things that aren't theirs. They get access to everything else if they want to edit a playbook. The only ones they're not supposed to touch are the generic ones, like service restart. If they want something special outside of what's provided, they can request it, or they can code it themselves and just ask for it to be pointed at their inventory, or they can use the generic one.
[Audience question] Well, that's why they create the blog post and also the ServiceNow ticket, which lets the service owner know: hey, this is happening, you need to look into your host, figure it out, and fix it. And if they don't fix it, the on-call team, the operations team, can say, "we're not going to monitor your hosts anymore." It's kind of a way of forcing them to fix it.

[Audience question] Puppet is where you configure the check for Nagios, and you tell it what to check against: whether the service is down, whether it's returning a 500 or a 404. You configure what you want it to tell you about, working with the service owner. So you have that in Puppet, Nagios knows what to look for, and if it alerts a certain way, Nagios knows what to call in the Tower API. Then you have your playbooks written in Ansible as job templates, and it calls those based off the inventory.

Could you repeat that? ... It was really painful for on-call; they were getting a lot of alerts. I used to be on that team, and on-call life wasn't for me: a lot of pages in the middle of the night, a lot of noise. They still have on-call rotations, and they still have work they have to do, and the alerts were blocking them and putting them behind, keeping them from completing their work. Now they can focus on other things for IT, operations, and infrastructure, instead of having to stop every two minutes to go look at alerts. That was the biggest issue. But I will also say that some service owners didn't realize their hosts were in such bad shape until it was being tracked by Nagios and the Tower API kept kicking off playbook runs, which brought attention to it. We were like, hey, this is happening every day, look into that, and it helped them realize there were bigger issues they had to address. So it helped in different ways too.

[Audience question] Not many, but false positives are possible: a little network blip or something, where Nagios doesn't have the connection at that second, and then it checks again five minutes later and it's clear. That's usually what the recheck design is there for. The Nagios checks were set up like that before I was even there, but when I talked to the SME for Nagios, he said it was set up that way just in case of something minor like that. I've never seen a false positive; it was always a legitimate issue. But that's why.

[Audience question] They can easily open a ticket. When I was doing this project, I was working with them hands-on all the way, because we were trying to automate as much as possible and get on-call out of having to do things manually. Now that I'm not in that role anymore, they can still open a consultation ticket for the SMEs, who will help them. If they have someone who knows Ansible, they can do it themselves, but if they don't, we'll work with them, because we want it automated and we want it to be easier for them. Yeah, it's kind of like that, but it's worth helping them for an hour if it means you don't have to do it anymore. Well, thank you.