This session is called The Future of Monitoring Is Now, with Sensu, and I'm Howard. I'm tizzo on Twitter, GitHub, Drupal.org, and anywhere else I can get it. If you do check out my Git, it's actually on the ZivTech GitHub account; I'll show you that in a second. I have a Vagrant example project that goes along with this talk, and I'll be showing you the setup at the end in what I am not going to call a live demo, but it will be the thing running on my local machine. You should be able to clone that project, run vagrant up, and actually have it to play with if you want to tinker with the stuff I'm going to show you. I'm the CTO of Probo.CI, which is a continuous integration system. If you've ever used something like Travis, imagine it's sort of like that, but with live previews of the individual environments. So you can create a pull request, we'll spin it up in an isolated Docker environment, and we'll keep that environment around so you can click right from the pull request and go see what your site would look like if you merged that code. As you can imagine, that involves running a whole ton of services, environments, and machines, and we've got random people from all over the world running arbitrary code on our boxes, which is just as scary as it sounds. It also means that our resource utilization can be really difficult to predict, so having something to keep an eye on what's going on is really critical. I imagine it's not totally unlike what the Pantheon folks go through, only about one one-thousandth of it, and instead of a small team, it's just me. I have a small team too, but it's a much smaller team. I'm also the vice president of engineering at ZivTech, which is an open source consulting company specializing in Drupal, but also doing some DevOps consulting and that kind of thing.

So let me first give a little bit of context. We've been doing this DevOps-y stuff, and consulting around it, for a while. We use Puppet. I'm not going to go deep into Puppet, obviously, but I do have some code examples using Puppet throughout this presentation, because I think it's useful to see how you use Puppet to configure Sensu, and then I'll also show you the Sensu configurations that Puppet generates. If you use Chef or Ansible or SaltStack or whatever the newer, cooler one I haven't heard of yet is, I'm sure there's a Sensu plugin for it, probably maintained by the Sensu folks at github.com/sensu. It's probably really awesome and works really well, and if not, I'm sure you can write one. So I think this is really complementary to a tool like that. We also run the ELK stack at both ZivTech and Probo, which, if you're not familiar, is definitely something you should check out. I gave a session with Nick from Pantheon on that at DrupalCon LA, and the video's up. Elasticsearch, Logstash, and Kibana give you a way to aggregate logs across all those different servers and services so that you can then make sense of what's going on.
There's some overlap between the two, and a lot of people actually wire them together, so they can do Sensu alerting on things that are coming in through Logstash, and they can publish the events coming out of Sensu back into Logstash, so the events from the monitoring system get interspersed with the logs of the services that caused the issue they're trying to track down. The idea is you can use this set of tools to build a holistic picture. I just wanted to mention that briefly as context. Another thing we use that has some overlap is Jenkins. For us, if it changes state, it runs through Jenkins. We use Jenkins not just as a build system to run a set of tests; we also use it to run Drupal cron, so that if a cron job fails, we get an email and that log gets captured somewhere. Sure, if you've got one server, maybe you can be a very disciplined person and SSH into that server every morning over your coffee and check the logs to see what happened to cron. If you're not that disciplined, or you have 50 servers to deal with like I do, that just becomes impractical, and having Jenkins keep an eye on stuff is really helpful. Jenkins can also graph how long cron is taking over time, so you can start to see how things are evolving, and it captures all the output, so when something does go wrong, you can go back and read the stack trace. Really helpful.

But what we're talking about today is Sensu, which is one of my favorite toys, so I'm really excited to talk to you about it. Sensu is an open source monitoring framework; sometimes people call it an open source event routing framework, because what it allows you to do is set up these checks that are going to run. I should take a little poll: who here has experience with Jenkins? With Sensu? With Nagios? Yeah, me too. OK, cool. For folks that already have experience with Sensu, we're not going to go crazy deep; hopefully you'll get something out of this, but if you meander out, I won't take it personally.

All right, so let's talk a little bit more about Nagios. The future is now because we have Sensu, but some of us also still have to deal with Nagios once in a while. Nagios is a thing that I, and I think a lot of people, have a love-hate relationship with. It can work really well. It's been around since, I think, 1999. The interface was last updated in, I think, 1999. No offense to anyone who's contributed to the Nagios project. We ran a fork of it called Icinga; there are a whole bunch of forks and a whole bunch of bolt-on projects. The thing is, I think anyone who has worked with Nagios has this complicated relationship with it, and that's partly because Nagios is a complicated thing. I think we're now at Nagios version 4, but I found this great outline of the Nagios version 2 object relationships, and it totally explains what my issue with Nagios has always been. Every time I need to configure the thing, I have to admit I have to pull up the docs and check a bunch of examples and try to figure it out. Because it's like: okay, wait a minute, I just want to keep an eye on this service. So I guess I need to add a service, but do I need a service group, or is that optional? And then I define the service, and then I say that it runs on a host.
Do I need to track the host group? Okay, when something happens on that service, who's it going to contact? Okay, I guess I need to deal with this thing over here. And so you're building these interconnected, referential configuration files where it feels like you have to specify four things before you can get it to do anything. I admit, once you start to build those up, you don't have to keep defining new groups, but it's all very explicit, and you have to lay out every server and every service that's on it in a configuration file. Sure, there is some automation for making that happen like magic, but you tend to have to bounce the server for it to take effect. It all runs on one central host, which means that as you add more checks and more clients, it tends to slow down, unless you go even deeper down the rabbit hole of bolting more things onto Nagios, like the Gearman plugin and all kinds of nonsense. Narayan knows the pain.

So this is what led, a few years ago, to the hashtag and movement #monitoringsucks. A lot of people started to get on this bandwagon and say: you know what? Monitoring sucks right now. You can do it with these open source tools, but it feels like it shouldn't be this painful. It feels like this is the one thing in our stack that still feels like 1999, when all the rest of our stuff feels like, well, I guess when that movement started, 2013 maybe. Luckily, some things like Sensu have come out which have brought around the new hashtag, #monitoringlove, because now we're all feeling better, and I'm gonna show you why.

Right? Here again, scary slide: the Nagios object relationships. Ah, the Sensu object relationships. I can just explain that to you right now and you can follow it. You have a thing called a client; that's the thing that's going to be on your server checking on things. It has subscriptions, which just say: these are the things you should be checking on. We'll get into more detail. Then you have checks, which are how you check on those things. So you might subscribe a client to all of the MySQL checks, and then you might have a MySQL check for database replication, or maybe you keep an eye on memory limits, or on the number of open file descriptors MySQL has. That one's off the top of my head because it just bit me. Then you can pass those through mutators, if you want some reusable code that changes the output coming out of the checks, and then that gets sent to handlers, which just say: send an email, post to HipChat, send something to Slack. That's it. The idea is you have these really small, really simple, really composable pieces. And if you want it to do more than what you can do with those composable pieces, which is a lot, then if you're willing to write just a little bit of code here and there, you can avoid a whole ton of configuration. Because rather than having this tool know how to do a bajillion things, it's really easy to write just a little bit of code and route to your custom code. Not that you can't do something like that with Nagios as well. And Sensu plays really well with others, so you can wire it up to other systems, which I'll be talking about in a bit. One of the really nice things, going back to all that hard-coded stuff in Nagios and how you need some system to configure it: Sensu is all about automatic service discovery.
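To make that object model concrete, here's a minimal sketch of what a Sensu client definition with subscriptions looks like on disk. The host name, address, and subscription names are invented for the example; this shows the shape of the thing, not anything from the talk's slides.

```json
{
  "client": {
    "name": "web1.example.com",
    "address": "10.0.0.12",
    "subscriptions": ["webserver", "mysql", "sensu-test"]
  }
}
```

Drop a file like that on a node, say as /etc/sensu/conf.d/client.json, and the discovery he's describing just happens: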
A server comes up, it connects back to the cluster, and now the cluster knows about that server and will warn you if it ever goes away. So you don't actually need to do anything special; you don't need to touch your monitoring servers when you bring up new hosts. They'll just join, even if it's an auto-scaling group in Amazon that brought up a new AWS instance behind an ELB or something. As long as it has the same configuration as all the other nodes, it'll just connect, and all of a sudden you've got a new node in your Sensu setup. And it's designed to play really nicely with others: Ansible, Puppet, Chef, SaltStack. We'll talk more about some of the details here, but it does some simple, nice things like having a conf.d directory, so that you can have lots of little configuration files and you don't need Puppet, Chef, Ansible, whatever, to render some crazy massive template. You just have these little pieces, and Sensu merges them all together when it starts up. So you can manage things at the level of "drop this file here," and if you don't want it anymore, just remove the file and restart. You're not merging things into one big file and worrying about all the pain that comes with that. So it's way easier and way more fun to just start monitoring all the things. More warnings about monitoring all the things at the end.

So let's get into the details of Sensu and how it works. It's a microservice architecture. It's worth noting that there is a Sensu Enterprise, which I'm not going to be talking about and have not used. I know they have a whole bunch of really nice plugins that come bundled with Sensu Enterprise, and there are open source alternatives out there that are a little less well supported. If you're using Sensu in a big way, it's probably not a bad idea to support them financially by buying Sensu Enterprise, but I can't speak to it. But these are the individual services. I'm just gonna let that GIF loop and walk you through what the pieces are. So again, the client is a little Ruby process; I should mention the whole thing's written in Ruby. The client connects to RabbitMQ, and the server also connects to RabbitMQ, and the server sends data down to the client saying: hey client, I've got a check I want you to run, say on all the web servers. One web server can come back and say OK, and the other can say warn, and that goes through RabbitMQ back to the server. The server stores that information in Redis, saying: oops, web one has a problem, web two looks okay. And then the API exposes that to you, so you can go ask the API: what's the history of this check? Has it been failing? What's the status of this host? How's it looking?

Redis, if you're not familiar, is another one of my favorite toys. It can be described as a network-connected data structure server: a whole bunch of really nice in-memory data structures with a really simple wire protocol. It's really nice for doing lightweight, fast operations on keys. And it can be tuned. A lot of people think it's in-memory only; that's not true. It will persist the data to disk, and you can configure it to write a copy down every five minutes or every 5,000 writes, whichever comes first. Or you can say: you know what, Redis, I don't actually care about this stuff, run in scary mode, don't write it to disk at all. So it leaves it to you to tune that stuff.
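For reference, that persistence policy is just a couple of lines in redis.conf. The numbers here are made up to match the example he gives; Redis's actual semantics are "snapshot after N seconds if at least M keys changed," and you stack multiple save lines to approximate "whichever comes first."

```conf
# take an RDB snapshot after 300 seconds if at least 5000 keys changed
save 300 5000

# "scary mode": disable snapshotting entirely by clearing the save rules
# save ""
```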
For our purposes, the defaults run great. Don't even worry about it. RabbitMQ is part of what lets Sensu do a really good job of being a Unix-y tool, of saying: I am going to do one thing and do it well, and the rest of the stuff is not my problem. Nagios made a lot of things its problem, and then Nagios made itself our problem. Sensu says: if you want to connect to a whole bunch of hosts, and you want to cluster that and make it highly available, RabbitMQ has solved all those problems, just connect to RabbitMQ. Worth noting, Sensu now also supports using Redis as the transport layer, but I wouldn't recommend it; honestly, RabbitMQ has worked really well for me.

So the Sensu client is a Ruby process that starts up on each one of your servers and connects back to RabbitMQ. This is different from Nagios and friends, which punched a hole in your firewall and listened for NRPE and then ran checks through that, or over HTTP; there are a bunch of other ways you can wire it up. But the point is you had this attack surface. You were punching a hole in your firewall and trying to restrict who can connect to it, from where, for what reasons, and making sure that stuff gets filtered, and there have been security updates for those tools. I like to punch as few holes in my firewall as possible, so the idea that each client connects out to just one host, all secured using TLS certificates, is beautiful. The Sensu client runs checks; we'll be talking a whole bunch more about checks, but basically a check is just a command to run, and if the command is successful, everything's okay, and if it's not successful, it's not. The other thing the Sensu client does is listen on a port. We'll see an example of this: if you want to, you can push stuff right into the client and it'll go back over RabbitMQ to the server, so you can have things that live outside of Sensu that get pushed into it. We'll come back to that.

The Sensu server also connects to RabbitMQ and pokes clients by publishing checks. I should mention there are two different ways to run checks. One is standalone, which means the clients keep track of when checks are supposed to kick off. The other is non-standalone, standalone equals false, which means the server says: hey, anybody who's a web server, run this check now. And it's possible to hit the API and say: I'm poking you, Mr. Server, go poke all of the web servers and have them run their web server check. The Sensu server then processes events as the responses come back and routes them. We'll be talking a lot more about that process, but the Sensu server is the thing that ends up firing off emails, Slack messages, HipChats, whatever. It also, like I said, publishes that check data to Redis, where it can be stored and read by the Sensu API. The API reads data from Redis and provides access to the client registry, the check results, and the event data. There's also an API called the stash API, which is basically a metadata storage API where you can store arbitrary key-value pairs that your checks can read back. So if you've got checks or handlers that need to keep track of some piece of information, that's a place for you to dump it and get it back out, and it's up to you to know what you're doing with it. You can also use the API to silence alerts, resolve events, and forget clients, right?
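A quick sketch of what talking to that API looks like. The host name is made up; 4567 is the Sensu API's default port, and these three endpoints exist in Sensu Core's API.

```bash
# list every client in the registry
curl -s http://sensu.example.com:4567/clients

# manually resolve an event for a client/check pair
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"client": "web1.example.com", "check": "check-http"}' \
  http://sensu.example.com:4567/resolve

# forget a client you deliberately terminated
curl -s -X DELETE http://sensu.example.com:4567/clients/web1.example.com
```

That last DELETE is exactly the one you need in an auto-scaling world: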
That's the one catch. If your auto-scaling group brings up a server, it joins on its own, and you probably want Sensu to warn you if your servers just start disappearing. So you do need to wire something up in the auto-scaling group where, when you take a server away, you tell Sensu: hey, forget that server, I just deliberately deleted it, that wasn't an accident. Yeah, so that pretty much walks through it. Hopefully this diagram now makes more sense: the clients get told to run checks, they report back to Rabbit, that goes to the server, that goes to Redis, and then you can query it from the API and find out what's happening.

To jump a little deeper into these concepts, again, we've got clients, subscriptions, checks, and handlers. Clients are pretty simple: like I said, a Ruby process started by systemd or whatever, and they connect to Rabbit and do their thing. Subscriptions are how you identify which checks should run on which servers. You can think of this as a tagging system: it's just an array of arbitrary strings, you get to make up what they are, and if a check tag matches a server tag, that check runs on that server. That's it. So let's say you have some Puppet code and you just add a sensu::subscription resource named sensu-test. I could have picked a better name; B-minus for creativity. The Ansible, SaltStack, Chef, et cetera code looks really similar to this, but the idea is you should have some kind of automated configuration management system, and you just drop this in wherever you want the thing. So if we had a service called sensu-test in our configuration management, any server that has the sensu-test package installed on it gets this little line, which ends up creating a little piece of JSON. The data format for Sensu, what gets sent between the different processes, what gets stored in Redis, and the configuration files: it's all just JSON, which is really nice because every language reads it and every language writes it. So that's it: if you have that piece of configuration on a Sensu client, that client now runs any check that's tagged sensu-test.

So, checks. One thing Nagios got absolutely right is its check API. Super simple, super Unix-y. You just use simple scripts that have exit codes; every process that runs on a POSIX system has an exit code, and you can exit with an integer from 0 through 255. In Nagios land, 0 means everything's normal, okay. 1 means you've got a warning, something's a little wrong. 2 means you've got a critical, something's really wrong. And 3 means unknown, which you should probably be worried about, because we couldn't tell. One of the interesting things about being Nagios compatible is that Nagios has been around since '99, so there are thousands of checks written for it, and there are packages for every platform that pull down a bunch of Nagios plugins. You can just apt-get install those. Sorry, it's 2017, you can just apt install those. And you've got those checks and you can start wiring them up and running them on your boxes and reporting back on MySQL, TCP connections, memory limits, whatever. If you can think of it, there's probably a Nagios check for it. And what's cool about that is it means you can also just run those scripts from the command line and then run echo $?.
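If you've never written one, here's a minimal sketch of a check that follows that exit-code convention. The thresholds are invented for the example, and it assumes GNU df.

```bash
#!/bin/bash
# report root-filesystem usage using the Nagios exit-code convention
used=$(df / --output=pcent | tail -n1 | tr -dc '0-9')

if   [ "$used" -ge 95 ]; then echo "CRITICAL: disk at ${used}%"; exit 2
elif [ "$used" -ge 85 ]; then echo "WARNING: disk at ${used}%";  exit 1
else                          echo "OK: disk at ${used}%";       exit 0
fi
```

Run it by hand, then run echo $? to see which branch you hit.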
If you're not familiar with that one, it's worth coming to the training just to learn it: echo $? tells you the exit code of the last command that ran. So while you're working on your check, you just keep running your little script and running echo $?. If it's 0, everything's A-OK; 1 is a warning, 2 is a critical, 3 is unknown.

So, Puppet code to generate a check. Here's an example. I'm calling my check "success." Note here that my command is basically just some bash to run; this totally works. It echoes "something went right," which gets captured by Sensu and stored back in Redis, so you can see it in the Sensu dashboard, which I'll show you at the end, playing with fire, showing you something live on my machine during a presentation. Here we're saying the handlers are the defaults and the subscribers are sensu-test; we'll be looking at handlers in just a second. So what that means is: echo "something went right." And /bin/true, if you're not familiar, is a thing that just exits 0, so this will always exit 0, totally successful, we're all cool. Very helpful, very helpful. Replace all your tests with that, they'll always be green. The JSON configuration that results from that is, oops, I grabbed the wrong screenshot. I will show you the JSON for that in a minute. Sorry about that.

Handlers. I guess I should have mentioned the API for this: the checks, like I said, are just processes that get run, and those don't need to be Ruby, they're just shell commands. Like here, we were executing bash right inline. So you can write your checks in any language, and it's the same deal with handlers. Here I just have some bash that prints standard in and pipes it into a file, sensu-default-handler.log, and that's just going to end up being a line-break-delimited log file of JSON events of all the things that happen. Probably not the most useful handler; maybe you wanna tweak that. This would be the JSON configuration that results from that. So, there are different types of handlers, and the type "pipe" means fork: create a new process, essentially tell bash to fire up whatever you handed into the command, and send the message to standard in. So your handler is just responsible for reading the JSON report of what happened from standard in and doing something with it. That's all. Oh, and then obviously here, there are some other settings that got automatic defaults from our Puppet configuration. Filters would allow you to define reusable pieces of code that decide whether or not to send an event on. So maybe you wanna filter out messages during non-working hours, if your SLA says we respond between nine and six like ours does; good work if you can get it. And then the severities: for your handler, you can say okay, warning, critical, unknown, which of these statuses should we be sending on? So you can send only criticals to a particular handler, which is not a bad idea if it's something like Slack or HipChat and you're trying to keep the chatter down. We'll come back to that idea. Here is an implementation of a custom PHP handler. If you're not familiar, that shebang at the beginning, the hash-exclamation-point, basically tells whatever is launching the script to invoke it with the executable that I now specify.
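Before we get to that PHP handler: since the screenshot of the "success" check's JSON went missing, here's a sketch of roughly what that generated configuration would look like. The field values are guesses based on what he describes, not a copy of the slide.

```json
{
  "checks": {
    "success": {
      "command": "echo 'Something went right' && /bin/true",
      "handlers": ["default"],
      "subscribers": ["sensu-test"],
      "interval": 60
    }
  }
}
```

Now, back to that shebang line.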
Here we're saying: resolve PHP based on the PATH variable. Don't worry about that too much; throw that at the top of all your PHP scripts and it'll be great. Helpful tip: you can actually put /usr/bin/env drush in there, and if you invoke the script from inside a Drupal root, it will automatically bootstrap Drupal and you'll have the whole Drupal API available to you. We'll talk more about strategies for running Drupal checks, but that's a quick and dirty way to throw Drupal checks in and get them running. So here we've got a PHP script; I should mention I totally stole this example from Nick at Pantheon, who used something very similar in another one of his talks. Here we are reading the JSON blob from standard in, pulling out the check name, pretty-printing the raw data, and sending it out using PHP's built-in mail. We're not gonna get a super pretty, human-readable thing, but it gives you that first step and shows you what's available, so you could then build a nicer template and figure out what you want it to look like. So again, standard in is where you pull the JSON object from, and then we're just JSON-decoding it, extracting the check name, which we know will always be there, and sending it along.

Mutators, I'll just say, allow you to convert the incoming structure into the outgoing structure for the handler. This is most useful if you're turning something into, is there a name for Graphite format? The line-break-delimited key-value structure. So essentially some of your checks will generate a JSON blob, and you wanna convert that into the format that goes to Graphite. Mutators are really useful for things like that: you take a piece of JSON and convert it into the text that's expected by your handler. Your handler would look like what I just showed; without a mutator it would get the raw JSON, but you could prep it into something that could be piped directly into Graphite. With that, you could use netcat as your actual handler, because you pass the event through a mutator, which turns it into a text string, and then netcat sends that straight into Graphite. Ops stuff.

Okay, so again, just to review: you have a check script, and each of these scripts can be written in basically any language, which is what I'm trying to illustrate here. The agent itself is written in Ruby, and I should have drawn some squares around this, I guess. The agent is what runs your check script. So this all happens down in your client; the bottom part there, the agent and check script, happen down in your Ruby process, where the check script is a child process of the sensu-client that's running that check, and it can be implemented in PHP, bash, whatever, right? So say that PHP script bootstraps Drupal and checks to make sure that the admin user, user 1, isn't logged in, because we normally block that account, and if it is logged in, we probably just got hacked, so you should throw an error. Our PHP script can actually use Drupal's APIs to check that. I don't know if that's actually a good idea, but it's something you could do. The agent can then throw an alert saying something went wrong: the PHP script exited non-zero, so something is bad, send that back to RabbitMQ. And that goes to RabbitMQ, which again also decouples things.
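Backing up one slide: the stdin-to-mail handler he described a moment ago is short enough to sketch in full. The recipient address is made up, and the check name lives at check.name in Sensu's event JSON.

```php
#!/usr/bin/env php
<?php
// minimal sketch of the PHP mail handler described above
// (recipient address is an assumption)
$event = json_decode(file_get_contents('php://stdin'), true);
$check = $event['check']['name'];

mail(
    'ops@example.com',
    "Sensu alert: {$check}",
    print_r($event, true)  // pretty-print the raw event as the body
);
```

So: the check fires on the client, the result crosses RabbitMQ, and a pipe handler like this runs on the server side.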
So if you get a big flood of alerts, they can queue up in the message queue in Rabbit, which helps keep this scalable and decoupled and lets you scale your message queue independently of your other pieces. I should also mention the Sensu server is horizontally scalable: it does its own leader election, so you can run a cluster and make that HA as well. And that is where all the stuff at the top right part of this diagram happens: the filter, mutate, and handle steps are all sub-processes forked from the Sensu server, which then also writes back to Redis, like we saw. So the filter can say, should I send this on or not? The mutate can say, let me reformat this. And then PHP can say, let me take the text I just got from the mutator and do something with it. A lot of the time people just build the mutate right into the handler. Each of these steps, if you add filters or mutators, is an additional fork, so each one slows down your processing a little bit; there's a little bit of overhead. So sometimes, for expediency, people will build it right into the handler rather than having these chains, unless they're trying to keep their code DRY, so they can have reusable pieces and not have the same little snippet in ten places. And the handler is where you might write a little bit of code to say: let me look up, in some database or from the stash API or wherever, which users I should be sending this email to, rather than hard-coding it like you saw. You can add your own little layer to do something smart there with a little bit of code, to replicate some of the stuff that Nagios could have modeled for you. Again, if that gets too complicated, there are other ways to do it, which we'll look at in a second.

So, monitoring Drupal, right? This is DrupalCon, we gotta talk about that. There are great Sensu community plugins for lots of things, including all the ancillary services you need to run Drupal: great ones for Solr and Elasticsearch and MySQL and Apache and Nginx. I don't actually run any on Varnish, but there probably are good ones for Varnish too. A lot of those can also collect metrics, which I've alluded to before; we'll talk a little more about that at the end. But you probably also wanna keep an eye on Drupal itself: is it bootstrapping, are the pages loading, et cetera. Well again, Sensu is Nagios compatible, so you could use the Drupal Nagios module to do your monitoring. The thing I don't love about that is that the Nagios module needs to be installed on every single one of the Drupal sites you wanna keep an eye on, and then it works either over NRPE or you open up an HTTP page, and then you have to manage a client secret, and the decision on whether to alert is managed at the individual site level. I manage hosting for a whole ton of sites. Say I decide: you know what, the cron interval I was alerting on was 60 minutes, but I actually wanna make it 65, because I run cron every 15 minutes, and the Drupal cron semaphore means it's going to keep failing until after that point, so if it crashes once, maybe I'm not gonna worry about it unless it continues crashing after the expiry of the semaphore, right? This is a real example I was dealing with. So how do I reconfigure that with the Nagios module? I would need to touch every single one of those sites and update its configuration.
So I wrote this module called the Check module; I couldn't believe that namespace was still available in 2015. The Check module is actually a Drush extension, so rather than installing it on an individual site, you install it on the server, and then you just run drush check against each of the sites on that box. So now all of a sudden you can turn that into a super simple bash script: ls your /var/www folder and run drush check across each one of those sites. And you can pass in the threshold as a command line parameter. So if you wanna change it, you change your one Sensu client config in Puppet, in the one place you define that for your PHP heads, and that replicates out to all of your PHP heads, and every single one of them is running the Check module. The Check module ships with everything the Nagios module has, because I ported it over, because that's what I was replacing. Here's an example of how you run it; you can see this is me running it on zivtech.com. I just ran /usr/local/bin/drush, specified the path to my Drupal root, said check-all, and then you can see I'm doing the trick I was talking about before, echo $?, to see that the exit code was 0, everything's A-OK. So I just throw that into one of those check definitions and away we go.

Now we've got monitoring for that site, which covers the overall Drupal status check, the requirements report, the thing that'll warn you if you don't have GD installed, et cetera. Security updates that are available, specifically security ones. A patch that would be awesome, if anybody wanted to write it, would be some way to specify a blacklist, because one thing we've run into is security updates where we know we're not using the sub-module that's actually affected. Sometimes you're like, you know what, I actually know that I don't ever care about 8.1.3 of this module. And last successful cron run, so you can know if cron has been failing. Again, we have two layers of that, because we've got Jenkins, but then we also have a Sensu check, so we can get alerted rather than waiting for the client to go do a Solr search for their content, not find it, and realize that cron's been failing for three weeks and nobody noticed until they tried to do a search. I can't tell you how many clients I've heard that sob story from. Throw that stuff into your automatic monitoring and you don't have to worry about it anymore; you'll get told if your cron starts failing. And then there's a hook, so you can register your own plugins, and we'll write our own modules that keep an eye on their own things. So you could add to the core requirements check, or you can write a plugin, a hook, for your thing specifically, so it doesn't show up on the core requirements check but does check whatever you want on your site.

Another thing that's really cool and a little different from Nagios: Sensu has first class support for metrics collection. Your checks can also collect individual metrics and then pipe out to a handler that can do graphing. Getting into the details of that is a bigger topic. This is one of the differences with Sensu Enterprise: it ships with first class handlers for InfluxDB and Graphite, I believe. The community ones out there work, and that's what I use, but they're not as actively maintained.
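For the curious, that "super simple bash script" over every docroot might look something like this. The /var/www layout and the check-all command name come from the talk; the --root flag usage and threshold option are assumptions for illustration.

```bash
#!/bin/bash
# run the Check module's checks against every docroot on this box,
# exiting critical (2) if any site fails
status=0
for root in /var/www/*/; do
  if ! drush --root="$root" check-all; then
    echo "check failed for $root"
    status=2
  fi
done
exit $status
```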
The Enterprise ones seem super cool, though. InfluxDB is a newer time series database, and Grafana is a newer tool for building graphs that look like this against pluggable back ends; you can point it at, I think, Elasticsearch, InfluxDB, Graphite, et cetera. If you're already using Graphite and have it all set up, that's awesome. Last time I tried to install Graphite, it was still awful: there were no packages available, I needed a whole bunch of different tooling, and it was incredibly frustrating. I have gotten Graphite up and running once, and it was a painful experience. Maybe they've fixed it; I gave up and switched to InfluxDB and Grafana, which I believe are both written in Go, and there are packages for apt and yum. You just add the package, apt-get install it, sorry, apt install it, configure it, and get Sensu collecting metrics and piping them in there. What's really nice about that is, again, you already have that TLS-secured RabbitMQ infrastructure, where RabbitMQ is using certs and the RabbitMQ server isn't allowing connections from any client that doesn't have the corresponding cert, so you've got really nice public key crypto establishing this connection, and now you have this bus you can use to send all kinds of data. You can even run something like MCollective, Marionette Collective, which also has a plugin for running over RabbitMQ, so you can use that as your operations bus: all your data flows through there, Sensu collects your metrics, Sensu does your process monitoring, and then MCollective can do your remote execution. That's a little outside the scope here, but it's a really nice value add with Sensu that this RabbitMQ dependency is reusable for other really cool things too.

Now, you probably don't wanna write and maintain your own database, like I suggested before, of who should be contacted for which checks and at what times. That sounds simple but can get outgrown quickly, so if you do have a team and you wanna share the responsibilities, on-call scheduling is really nice. We've used PagerDuty. At this point we kind of have one main team that's responsible for the same stuff all the time, so there's just a whole bunch of people carrying the pager and we're not doing a bunch of sophisticated routing. But we have done it with clients before, where they were in a different time zone: if it happened at 9 a.m. Eastern, we got the call, because that would be 6 a.m. in San Francisco, and then starting at 6 p.m. they started getting the calls, because that was gonna be less disruptive until they headed home. And then it was their stuff, so they got to have it the rest of the night. PagerDuty is really cool; it's got really awesome apps, you can do escalation, and it will keep track of whether it's heard back from the initial person. So it'll send the alert to one person, and if they say, sorry, I can't fix this, it'll automatically go to another person. Or you can have it send you a push notification first, then a text message, and if that fails, auto-robocall you and read you the alert. It does all kinds of cool stuff. There's also a really nice open source project called Flapjack that allows you to do some of the same stuff.
Flapjack also has a really big focus on rollups, so you can aggregate alerts. You can do some of that with Sensu too; in Sensu it's called aggregation. You can say things like: of my Elasticsearch cluster of ten nodes, I need to know that six of them have good health before it's actually a problem. So it can check all ten and see whether 60% of them are okay, and if 60% of them aren't, call that a critical; maybe below 80%, call it a warning; above 80%, everything's fine.

So we're just about wrapped. I wanna make sure I've got time for questions, and I wanna show you the real dashboard; I'll also show you that check definition I had the wrong screenshot for. But first, a few tips. The biggest thing, and this is the catch with monitoring all the things: monitoring fatigue is really bad. If you start having alerts that go off all the time, and we had this problem for a while, the issue you get is the boy who cried wolf: everybody starts ignoring the alerts. The problem is you pull your phone out, look at it, and think: oh, I just got a Slack message from Sensu, I'm sure that's nothing, that goes off all the time. Now you're in a much worse situation than if you had no monitoring, because you have this false confidence that you have monitoring, but no one's looking at it, so you might as well not have it at all. You've created false confidence and no actual value add. The way to fight that is to consider every single alert that goes off an actionable thing, even if the action is deleting the alert because you realize it's not actually helpful and you don't really care. Because if you go down this route, I promise you, you'll get super excited about monitoring, you'll add some checks for things that seem like a good idea, they'll keep going off, and you'll realize they're kind of false positives and really don't matter all that much. Load can be one of those things. A lot of the time people set a threshold for load that's way too low to begin with, and it's super noisy, and that starts to make them ignore things. You do want to be watching your load, but there's a predisposition to be a little overly sensitive to it, I think, and some of the defaults tend to be a little low in my experience, if you're managing your capacity without a huge amount of headroom.

Another thing, for Drupal especially: watch out for caching. I've seen a lot of people create HTTP checks that don't have a cache buster attached to them, and now you're just making sure that Varnish is working; you're not actually loading Drupal. So watch out for that. Also, something I should note about the Check module: again, it's Drush, so it's not actually making an HTTP request, it's loading PHP through the CLI. So if your FPM cluster is down, the Check module won't tell you that. However, you can combine its core requirements checks with a simple HTTP check. What we do is append the timestamp as a query parameter, so it requests zivtech.com?time=<timestamp>. Every single time, that busts the Varnish cache and actually makes a request back to Drupal. So if you're getting a 200 on the homepage and your Drush checks are passing, you can be pretty confident that a lot of stuff is up and working the way it should be.
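Here's a minimal sketch of that cache-busting HTTP check as a plain curl script. The URL pattern is the one from the talk; treating anything other than a 200 as critical is my assumption.

```bash
#!/bin/bash
# bust the Varnish cache with a timestamp so the check actually hits Drupal
code=$(curl -s -o /dev/null -w '%{http_code}' \
  "https://www.zivtech.com/?time=$(date +%s)")

if [ "$code" = "200" ]; then
  echo "OK: homepage returned 200"
  exit 0
else
  echo "CRITICAL: homepage returned $code"
  exit 2
fi
```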
Another one of the DevOps things, just DevOps philosophy: everybody should do a round on rotation with the pager. If you're writing code, you should be one of the people answering for it. I'm throwing this in as a tip and a warning, because I don't know what your team's like; when I ask people to answer calls in the middle of the night, they don't get quite as excited as the people in this photo, but it's aspirational. I hope your folks will be excited like this too.

Here's one other catch that is a real concern and a problem we had to solve: what if Sensu goes down? It wasn't as bad for us as for the poor folks at GitLab. If you haven't read that story, please do: six layers of redundant backups all failed, and they lost hours of production data and had a horrible recovery time because someone rm -rf'd on production by accident. Computers are heavy equipment, folks; don't operate them tired. I'm tired. A similar thing happened to us, though. We suddenly got an alert, not from Sensu, but from the client, that their site was down. I jumped into their Sensu cluster, they had their own, and the Sensu cluster was down. How long has Sensu been down? Okay, okay, it's fine: MySQL crashed because we corrupted a table, and that's why we have a hot slave, so we jumped over. But the hot slave's replication had been failing, and we didn't know, because Sensu watches that. Luckily, the replication started failing when the corruption happened, not before, so we lost basically no data, and it was only like two hours, from three to five in the morning. But Sensu had been down for I don't know how long, because nothing was watching Sensu, right? Who watches the watchmen? You have to have an answer for this, and I'm sure there are more clever ones than mine. There's a service I've used for this and a few other things called Dead Man's Snitch. Dead Man's Snitch is supposed to be sort of a poor man's monitoring thing. The idea is they give you a URL you can send requests to, and you tell it a threshold for how often a request should come in. So you can just throw it at the end of your cron job, send a curl request to this place, and then if Dead Man's Snitch sees an hour go by without getting a curl request, it alerts you and sends an email, and I think it has a couple of other options too. Super simple.

Let's see an actual practical example of doing something with Sensu. Here's the Puppet code. Subscription: sensu-server, so I've created a subscription on the server itself. Check: deadman-snitch. I'm just running curl, and I redacted the URL there, but nosnch.in would normally have a little pseudorandom bit of gibber-jabber after it. The handler is default, so if it does see an error because Dead Man's Snitch is down, it sends the alert through our normal handler, which goes to Slack and email. Having different handlers can also be a nice way to cut down alert fatigue: you can have a dedicated email address that everyone's supposed to create an email rule for, so you have messages that go somewhere people can check periodically, but it's not constantly pounding you. There are other ways to solve that too, but that's a quick and dirty one. Subscriber: sensu-server, and at the interval, every 300, I believe that's seconds, it runs this specific command. That means we should never go too, too long without pinging the snitch; this check and the snitch each make up for the other. Here's the definition of the check.
That deadman-snitch bit is basically the name. Again, the subscribers are any server tagged sensu-server; we saw us tag it before. Standalone: true, although maybe this should be false. Standalone, again, means the client will run it even if the server is down; maybe we only want the Sensu server to kick it off, so maybe we should flip that. Handler: default, interval: 300, and there's the command. That standalone true is a catch with using this configuration management stuff: it sets defaults, and the one it set here might not be the one we want. But yeah, that's gonna run that curl command, and that's gonna let you know if something's gone down.

So let me take one second here, because I do wanna show you the dashboard. Right on time, that was my alarm telling me to make sure I show you the dashboard. So here is the Uchiwa dashboard. This used to be called the Sensu dashboard, but it got split into its own project. Let me zoom in. How's that? So here's the top level. You can see that I've got a major failing check, a minor warning, and a minor failing check; the minor ones are yellow, the major one is red. If I click on the individual node, you can see that I went right to this node. It's using, I think, the reverse DNS lookup, so I'm ending up with whatever this name is from my ISP here. You can see the command that we're running and what it's returning, which is why this is showing as a major failing thing, and you can see the output that was getting echoed there, right? And then here, too, is the history. So if this alert was flapping, that would read like 0, 2, 0, 2, 0, 2, and you can use that in filters, et cetera. I think there's a default filter, or one that ships with it, to keep you from sending an alert every time the check runs; by default, I think the Sensu handler plugin does that. There's a Ruby project that gives you some helper stuff to make writing extensions lighter: you just extend a class, and it will automatically do things like filtering on the fact that you'd already sent a message for this event. Basically, it will look at the history and see whether there's a change, right? So if the last result was also a 2, we probably already alerted, don't worry about it; but if this one now becomes a 0, send a message saying resolved.

This tab is where you can go to see all your hosts, whether they're alerting or not. Here you can see a list of checks. Here you can see any silenced checks. So that's one of the nice things: let's say this alert keeps going off. Actually, I can do it right from the dashboard; there are a couple of places to do it. I can click right here to silence the check, and I can click here to silence the host, and this stuff's also available on the individual check and host pages. So if I click this, it pops up and says silence for 15 minutes by default; one hour, 24 hours, custom, no expiration. And: expire this silence on resolve. Otherwise, the fact that it's been resolved won't reactivate the check; it'll still sit there not sending alerts for the rest of the two hours. And then you can also enter a reason that your other team members can see. And there is sometimes a little lag before stuff that you've written to Redis shows up.
The UI doesn't always update right away, which is kind of annoying, I will admit. But there you can see, here's the comment. So if somebody else on your team hops in here, they can say: oh, I see there's a silenced alert, let me click on that and see what happened. The stashes: silencing used to go into stashes, but newer versions of Sensu don't do that. Again, stashes are for arbitrary storage of data that your checks need to find. Aggregates. Oh, and the Uchiwa dashboard can connect to multiple data centers, meaning multiple Sensu clusters; you can see here I only have one.

Now, one really simple check that I wrote, which shows off that Unix-y baseline of return codes, is check-some-service. grep will search for the word "broken"; -q says don't print any output, just return 0 if we find it and 1 if we don't; and -v means flip it. So if we find the word "broken," it's a broken alert, and if we don't find the word "broken," everything's fine. And it's just checking /var/log/some-service.log. Now, you probably don't have a service that prints "broken" when it fails, but you might set up a grep looking for a stack trace, for example: if I find a stack trace in the PHP error log, or if I find that the PHP error log is not empty, throw an alert. And then I can basically silence the alert just by truncating that file. A super simple way to throw up a thing so that you get told any time something lands in your PHP error log. So now on my VM, if I echo "broken" into that path, it'll take a second because this runs every so often, but in a couple of minutes, that should show up here.

Now, if I don't want to wait a couple of minutes, I can also use the thing I told you we'd come back to. See, I made a promise, I'm keeping it. Here's a bash script that echoes a JSON blob describing a check result. This is the Alpo check, and it says "you are totally out of dog food," doggie fire, this is fine, status 2, which is the same as what the exit code would be. And then it just pipes that right into netcat. So if I run sensu-error.sh, you can see I got a result of "ok," and, whoa, there we go, the Alpo check: "You are totally out of Alpo," right there, emojis and everything. So let's say we got some Alpo, we borrowed some from our neighbor. We can send: "you are now running low on dog food," cry face, right? And just to look at that, you can see here we switched status to 1 and updated the message; otherwise the name is still the same. And if we wanted to say we made it over to PetSmart, we have now resolved the issue: "dog food looks great, plenty of dog food," status 0. So you can run a little script like that from anywhere. You can do it from PHP code, from Node.js code, right from bash; anywhere you can write a string to TCP, you can push stuff into Sensu, anywhere there's a client. That socket is listening on port 3030 by default, on localhost, on every agent. And you can see check-some-service now has a warning, and you can see in the history it was fine, but the check has run twice since I pushed that update by echoing "broken" into the log. Yeah, so those were the critical parts I wanted to show. I also promised to show you where all this lives: it's at github.com/ZivTech, in the devops-tools-example project.
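The real script lives in that repo's bin directory; here's a hedged sketch of the shape of it. The fields shown (name, output, status) are what Sensu's client socket expects in a check result, and the check name and message are the ones from the demo.

```bash
#!/bin/bash
# push a check result into the local sensu-client socket (TCP 3030)
echo '{
  "name": "alpo-check",
  "output": "You are totally out of dog food",
  "status": 2
}' | nc -w 1 localhost 3030
```

Change status to 1 for the warning variant or 0 for the resolved one, exactly as in the demo.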
I think I can update the session description, so I'll update that and provide a link to the repo. You should be able to follow the instructions and get this exact example, with all those checks and everything set up, and the bin directory that I ran these scripts from is in there too, so you can push errors in, push alerts in, et cetera. If I do want to delete a check, I can go to my checks, grab the Alpo check, and say: forget this one. It works just like forgetting servers, and now it's no longer in the list. So all right, I'll open it up for questions. We're just about out of time, but I wanted to thank you again for coming and hearing all about Sensu, and don't forget to evaluate this session. I'll hang out for any questions, thanks.

Does the Vagrant box have the Check module already enabled? It does not, because it also doesn't have a LAMP stack. Is it tightly coupled to RabbitMQ? No; I think Sensu Enterprise ships with some other transports, and the open source one ships with Redis and RabbitMQ, and that's it. There is a transport abstraction layer, and you could write your own transport, but it ships with the ones for Redis and RabbitMQ. Between the two, I'd recommend RabbitMQ, but Redis certainly simplifies your setup. What's the overhead of the monitoring? Very low. There's a little Ruby process that sleeps in between the things it needs to do, and it forks to run your script. The overall overhead is really low. I wouldn't try to run a check 60 times a second, but running a check every minute or every five minutes shouldn't be a big deal. Yeah, I mentioned PagerDuty. So what those two services do, and there are plugins for them already available, is routing, so you can have an on-call schedule: I'm on call this week, Lawrence is on call next week when I'm on vacation, and that means this week I need to be no more than ten feet from my laptop, and next week he does. PagerDuty allows you to set those things up, and then even say: if I can't solve it, escalate to him anyway, even if he's not on duty. Damn it, that's the second time, sorry guys. Just trying to wake you up at the end of the day. But yeah, Flapjack is open source and does a bunch of that stuff. PagerDuty is kind of expensive, but what it does is really cool, and the app is really nice. You do pay per team member per month, and it adds up for sure, I think, last time I checked their pricing model; I've never had to actually be the one paying the bill. Any other questions? Yeah. Right, so how do you make sure the checks are actionable for team members other than the person who wrote the check? I mean, a lot of ours are really self-descriptive, like number of Docker containers running, or percent disk space, or load, right? For those, they say what they are. For everything else, I know what Pantheon does. I didn't describe this, but you can add metadata to clients and to checks, and it gets passed through the whole pipeline. Their handlers look for a special key they've defined as metadata on the check, called playbook, which is a URL to a wiki.
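Custom attributes like that really do ride along with the check through to the handlers in Sensu. A sketch of the idea, with an invented wiki URL and thresholds; check-load.rb is from the community plugins, though the exact flags here are assumptions:

```json
{
  "checks": {
    "check-load": {
      "command": "check-load.rb -w 5,4,3 -c 10,9,8",
      "subscribers": ["all"],
      "interval": 60,
      "playbook": "https://wiki.example.com/playbooks/high-load"
    }
  }
}
```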
So if you see the load alert go off, there will be a link right in Slack, from where you saw the alert, that says click here to go to the playbook, and that's a wiki page that says: if you see high load on this kind of server, you should check this; if it's on that kind of server, you should check that; if it's Valhalla, just call David Strauss. I'm just guessing that's what that one says. So I think that's a really good strategy. Again, the commodity, pre-built handlers in the Sensu community plugins wouldn't know to look for that field; you'd need to take responsibility for your own handlers to handle it, but I think it's a really good solution. A lot of those handlers will just dump whatever output came back from the check, and some of them will also drop all the line breaks, which, if your checks output a lot of data, can make things really hard to read. The only way to solve that, I think, is to write your own handlers, and then hopefully contribute them back, so we can have a handler that does really nicely formatted HTML, or passes the ANSI escape sequences through into HTML email colors or something. But yeah, I don't have a super great strategy for that. Honestly, the text that comes back from our handlers tends to be a little mangled, and you sort of just learn how to read it, and we don't have anything as sophisticated as the playbooks. But again, most of our checks are self-descriptive, so there's really not a lot of playbook-y stuff that would be helpful. Anybody else? Yeah, right, right, yeah, right, right. What's the...