Hi everyone, my name is Saurabh Hirani. I work as a DevOps engineer at BlueJeans, and today I'm here to talk about transitioning your monitoring system from a manual to an automated to a distributed one. A lot has been said about the benefits of automated and distributed monitoring, but not much about how you actually take your manual setup to an automated or a distributed one. We are going to talk about that, and all of the learnings I'm going to share are backed by demos in the GitHub repo of this talk, so you can try them out on your own. So let's begin.

We'll start with a very interesting question: what is your favorite server monitoring tool, and why is it your users? If your users are alerting you before your monitoring tool does, it means your monitoring system is not keeping pace with your infrastructure changes, and in most cases that is a manual monitoring setup. A manual monitoring setup is basically a demo setup that overstayed its welcome. You wanted to do some monitoring for your organization, you played with a few monitoring tools, you set up a demo system to show off to your team, you never got around to operationalizing it, and that is where it stayed. It is currently monitoring your production setup, but every now and then you run into problems. A major one: your infrastructure team added 100 new hosts yesterday. How do you start monitoring them? You have to log into the server, open up a config file, add 100 entries, and only then are you monitoring them. But what about the time gap in which you have to make this manual change? You lose monitoring in that gap. Worse yet, what happens when hosts are removed from your infrastructure? If hosts are decommissioned but still present in your monitoring system, you will keep getting alerts about them, and they will be false alerts.

So you decide to fix this, and you move to the next step, which is automated monitoring. You realize that you should never send a human to do a machine's job. So you say: I am not going to make any manual changes, I will see to it that I auto-discover hosts, and when hosts get added to my infrastructure I start monitoring them, and when they get removed I stop monitoring them. How do we achieve that? We are going to look at that in a demo. So now you are set, everything is automated. But there is still a major problem: you have one host monitoring your entire setup, which means if that host goes down you are blind. You don't know what state your infrastructure is in. So you move to the next step, distributed monitoring. Here you have multiple checkers and distributed checks, which means you don't rely on one host to do your monitoring. You have multiple hosts doing the monitoring, and if one of them goes down the checks are redistributed across the rest of the hosts. That works for you, and you also get horizontal scalability, which means as your infrastructure grows you can grow your monitoring system with it. Now, this is all hand-waving; I have just covered manual to automated to distributed monitoring in less than a minute. But an important observation can be made here.
As you go from manual to automated to distributed, the feature set you require from the monitoring system becomes more and more apparent. In the first state, all you needed was the ability to log into the server and make manual changes. In the automated setup, you need some way to discover hosts and generate configs out of them, and you probably also need some REST APIs so that you can monitor the monitor, and so on. And in the third state, you need some level of inherent clustering ability in your distributed setup. So as you keep going from each state to the next, the requirements for your monitoring tool become clearer. At this point I will call out the tools I have used. For manual monitoring I have used Icinga 1.x, which is basically a Nagios-compatible version; if you have ever used Nagios you will feel right at home with Icinga 1.x. For the automated and distributed setups I have used Icinga 2, because Icinga 2 gives me features like distributed checks. Now, the tool set you have might be very different from this, but the learnings are common across tools.

So enough talk, let's start with the demos. How do you make the transition? You had a manual setup and you want to go to the next level, which is automated monitoring. The first step, obviously, is to audit the manual setup: find out what you have. Before we start with the demo, I'd like to point out an important thing. No matter how manual your setup is, see to it that you audit it in an automated way. This might mean using tools, libraries, APIs, or writing one-off dirty scripts to do the job, but don't log into the server to get that information. If you constrain yourself to an interface, to a specific way of getting information, the scope for human error is reduced.

So let's look at a manual setup first. This is the GitHub repo of this talk. What I have is a simple Docker container running Icinga 1.x; all you need to do is start the container with sudo and you get the setup. The setup is populated with this config: one big config file, which is what you typically have in a manual monitoring setup. It says you have three app nodes, app1, app2 and app3, which are part of the host group app. Obviously these are dummy nodes, because they have the localhost IP. A host group is basically a group of hosts which share a similar attribute; in this case app1, app2 and app3 are app nodes. Further down you have redis1, redis2 and redis3, which are your infra nodes. And you have some checks running on your app nodes, some checks running on your infra nodes, and some checks running on both app and infra nodes. So you take this configuration, put it on a machine, reload Icinga, and you get something like this. Don't go by the UI; this is a very simplistic representation, but it gets the idea across. You have app1, app2, app3 running some checks, and redis1, redis2, redis3 running some checks. So your manual setup is ready. Now you have to audit it.
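To make that concrete, here is a rough sketch of what such a single hand-maintained config can look like in Icinga 1.x (Nagios-compatible) syntax. The names mirror the demo, but the directives and templates are illustrative rather than the exact file from the repo:

```
define host {
    use        generic-host            ; assumes a generic-host template exists
    host_name  app1
    address    127.0.0.1
}
; ... app2, app3, redis1, redis2 and redis3 are defined the same way ...

define hostgroup {
    hostgroup_name  app
    alias           App nodes
    members         app1,app2,app3
}

define hostgroup {
    hostgroup_name  infra
    alias           Infra nodes
    members         redis1,redis2,redis3
}

define service {
    use                  generic-service    ; assumes a generic-service template exists
    hostgroup_name       app                ; the check is attached to the whole host group
    service_description  HTTP
    check_command        check_http
}
```

Adding the 101st host means editing this file by hand and reloading Icinga, which is exactly the gap the automated setup closes.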
To audit the manual setup, I had written a bunch of scripts which use a tool called Nagira. The tool you use might differ, but we are doing something very simple. If I run the discover-host-list script, I get my nodes: app1, app2, app3, redis1, redis2, redis3. And there is no magic happening here: if I open the script, all I'm doing is making a call to a specific API and pulling some information out of it. Similarly, you can get information about your host groups: the app host group has app1, app2 and app3, the infra host group has redis1, redis2 and redis3, and so on. This way you capture the information you need: what spread do I have, what changes have been made over the years?

Now that you have this information, you are ready to move on to the next state, which is building a parallel automated setup. Remember, you cannot do anything manually here; or rather, you should not do anything manually here. So what you decide to do is take the host information and generate these configs in an automated way. How do you do that? Every organization has host information spread across multiple databases. You will have some hosts in your config management system, like Chef, Ansible, Puppet, et cetera. You will have some hosts which are not in Chef but are in your cloud provider, like AWS, which means elastic load balancers, RDS instances and so on. And you will have some nodes which are neither in AWS nor in Chef: your network devices, switches, routers, and so on. The information about these hosts is present somewhere, and that somewhere is called an inventory, a host inventory. If you can tap into these host inventories in a programmatic way, then you can very well generate the configs.

For the automated setup I've written a simple Chef cookbook. For those of you who have not worked with Chef, Chef has the fundamental concept of recipes. Recipes are basically Ruby scripts which run on machines and do something. In this case we have three recipes: configure AWS nodes, configure Chef nodes, and configure XYZ nodes, where AWS, Chef, and XYZ are your host inventories, the places where your host information resides. If you open one of these, let's say configure Chef nodes, it is a very simple Ruby script. It looks at all of the target environments. Chef has a concept of environments too, which are very similar to your normal QA, stage, prod environments. All I'm saying here is: go through all of the environments, find all of the hosts in those environments, and for each of those hosts generate its monitoring config. This is the programmatic equivalent of doing it by hand. So now, if I have to add 100 new hosts, I don't have to log into a machine and change the config. The next time my Chef run happens, or the next time my config management tool runs, it detects that 100 new hosts have been added to my config management system and adds them here. If hosts are removed, they are removed from here as well. That way you get some sort of auto-discovery and auto-removal of hosts, and you can try out the Chef recipe by just doing a simple vagrant up.
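For those who have not used Chef, here is a minimal sketch of what a configure-Chef-nodes style recipe can do. The environment names, file paths and attribute lookups are illustrative assumptions, not the exact cookbook from the repo:

```ruby
# Sketch: discover hosts from the Chef server and render one Icinga 2
# host object per node. Assumes a service[icinga2] resource is declared
# elsewhere in the cookbook.
%w(qa stage prod).each do |env|
  search(:node, "chef_environment:#{env}").each do |monitored|
    role = monitored['roles'].to_a.first   # first role on the node's expanded run list
    file "/etc/icinga2/conf.d/hosts/#{monitored['hostname']}.conf" do
      content <<~CONF
        object Host "#{monitored['hostname']}" {
          address   = "#{monitored['ipaddress']}"
          vars.env  = "#{env}"
          vars.role = "#{role}"
          vars.os   = "#{monitored['os']}"
        }
      CONF
      notifies :reload, 'service[icinga2]', :delayed
    end
  end
end
```

Because the host configs are regenerated on every Chef run, hosts that disappear from the inventory can be dropped just as automatically, for example by purging the generated files before rewriting them.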
So now you have host configs which are generated automatically. Then comes the next part: adding the checks. In the manual setup we had attached checks to host groups. We wanted some check to run on all app nodes, so we took all the app nodes, put them in a host group, took a check, and applied it to the host group, which means the check gets applied to every app node in that host group. But when you're picking up hosts from inventories, you are no longer limited to this abstraction of putting hosts in a host group. Nodes in inventories carry very rich information themselves. You have information about what OS they are running; if you have worked with AWS nodes, you have information about which region they are in, and so on. You should be able to tap that information to generate the checks. So here is another example where I'm generating the checks. Don't worry about the syntax you see; I'm just calling a function and passing it some parameters. The check_http check runs on all nodes which have the role app, but it is ignored on one node, app3. The check_procs check runs on all infra nodes. The check_users check runs on all nodes whose OS is Linux. And the last check runs on all hosts which do not have the OS Linux. (A sketch of these attribute-driven rules follows below.) So now you see we are not tied to host groups to generate checks; we are adding checks directly through host attributes. Through that you get the programmatic equivalent of the host group approach. So tomorrow, if a new check gets added and you want to apply it based on some attributes, you don't have to change the config by hand: it gets applied to new hosts as long as they match the attributes. If you add 100 new Linux hosts, those 100 hosts pick up the checks which are applied by host OS.

So through that you get a second setup, which is this: the automated setup. I'll just call this out: I have simply pointed the names automated-monitoring-transition and manual-monitoring-transition at 127.0.0.1, so this is my Icinga 1 host and this is my Icinga 2 host. Now, if you look there is a difference here, this mute sign. Once you have two monitoring setups, and we have audited the manual one and built a parallel automated one, you should see to it that you don't have alerts coming from both systems, because you don't want your automated setup paging you while you are still trying things out. That's why I have disabled notifications here.

So the audit of the manual setup is done and the parallel setup is built. Now comes the progressive cutover. You have to start cutting over nodes one by one, which means you take a set of nodes from the manual setup, cut them over to the automated setup, and move on. For the cutover, and this is important, the word progressive matters, because you have to show some progress to your stakeholders. You cannot just shut down your old monitoring system and bring up a new one; this is your monitoring system we are talking about, it cannot have downtime. So in the progressive cutover, the typical scenario is that you want to disable one guy and enable the other guy. You take a set of nodes, enable notifications for them on the new guy, disable notifications for the same set on the old guy, and then rinse and repeat.
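Before the cutover demo, here is the sketch of the attribute-driven checks mentioned above, written as Icinga 2 apply rules. The command names and host variables are illustrative, not the exact demo config:

```
apply Service "http" {
  check_command = "http"
  assign where host.vars.role == "app"    # every app node ...
  ignore where host.name == "app3"        # ... except app3
}

apply Service "procs" {
  check_command = "procs"
  assign where host.vars.role == "infra"
}

apply Service "users" {
  check_command = "users"
  assign where host.vars.os == "Linux"    # all Linux hosts, whatever their role
}

apply Service "ping" {
  check_command = "ping4"
  assign where host.vars.os != "Linux"    # everything that is not Linux
}
```

Any host object generated with matching vars picks these services up on the next reload, with no per-host edits.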
So you take the app nodes, enable them on the new guy and disable them on the old guy, and then you take the infra nodes and do the same. So let's try doing that, which means I'll be enabling notifications for the app nodes here and disabling notifications for the app nodes there. Again, these are just simple API calls: I run the manage-Icinga-2-notifications-by-hostgroup script to enable the app host group on the Icinga 2 side, and the corresponding script to disable it on the Icinga 1 side. If I refresh this UI now, the mute signs go away for the app nodes, which means I have enabled notifications on this guy. And if we go to the manual setup and refresh, you can see notifications are now disabled on the old guy. So this is how you do a progressive cutover from a manual setup to an automated setup.

Now comes the next part: the distributed setup. If you have done your automated setup the right way, you get static distribution of nodes as a side effect. To understand that, look at the same recipe we saw earlier for generating host configs, specifically this parameter: stage 1, stage 2. This is obviously a configurable parameter. You can say that monitoring host A will monitor only the QA nodes and monitoring host B will monitor only the stage nodes, or you can take it a level further: monitoring host A will monitor only the QA nodes having the role app, and monitoring host B will monitor only the QA nodes having the role infra. That way you get static distribution of nodes: two different monitoring hosts monitoring two different sets of nodes. But if you want more advanced features like dynamic distribution, high availability, or failover, you need to choose your tools wisely. Icinga 2 in that context is pretty cool. It has great integration with Chef, it has a very solid REST API, and it has distributed monitoring and high availability out of the box; you just enable them and work with them. You should definitely try it out. I have a link to an Icinga 2 cluster here, which is basically a set of two Vagrant VMs working in a master-slave mode. So that is how you do your distributed setup.

Now, throughout all of these stages, you have to remember to monitor your monitoring system as well, which means seeing to it that it consumes fewer resources than the capacity you have provisioned for it. Your monitoring system also needs to be monitored, and for that you can do cross-monitoring: the stage setup can monitor prod, and so on.

So that's about it. These were the things I wanted to talk about. There are a few other learnings that I had to skip because of time constraints; catch me afterwards and we can talk about them. The resources are listed here, click through and you can try out the demo setups. Thank you for your time. Questions?

Q: I just want to know, is this only for node-level monitoring, or can we use it for application metrics monitoring or other services?

A: Yeah, you can do that as well. If you look at what I did here, I'm running checks like check_users or total processes. You would be better off capturing that kind of information through Grafana or some other metrics-level monitoring. So in Icinga...
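For reference, each notification toggle in that cutover demo boils down to one call against the Icinga 2 REST API per host group. Below is a minimal sketch assuming the API feature is enabled on the default port 5665 and an ApiUser exists; the credentials are placeholders, and the Icinga 1.x side would be toggled through its own interface (for example, the external command file):

```ruby
#!/usr/bin/env ruby
# Sketch: enable or disable notifications for every host in a host group
# through the Icinga 2 REST API. Endpoint, port and credentials are
# assumptions for a demo setup, not values from the talk's repo.
require 'net/http'
require 'openssl'
require 'json'
require 'uri'

def set_notifications(hostgroup, enabled)
  uri  = URI('https://127.0.0.1:5665/v1/objects/hosts')
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl     = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE   # demo only; verify certificates for real use

  req = Net::HTTP::Post.new(uri)
  req.basic_auth('root', 'icinga')               # placeholder ApiUser credentials
  req['Accept'] = 'application/json'
  req.body = {
    filter: %("#{hostgroup}" in host.groups),    # e.g. "app" in host.groups
    attrs:  { enable_notifications: enabled }
  }.to_json
  http.request(req)
end

set_notifications('app', true)   # cutover step: turn on alerts for the app host group
```

Services can be toggled the same way through /v1/objects/services, and the old setup gets the mirror-image disable call.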