So is everyone here for Sensu and not prototyping, which was moved? Everyone is here for Sensu. Well, it is 2 a.m. in San Francisco right now, where I'm from, so I would usually be sleeping. But I guess we'll go ahead and get started. Glad to see everybody, and thanks all for coming. It's great to be at DrupalCon. I'm really excited to feel all the Drupal energy and the great DevOps track, and excited that I get to share a little bit about Sensu, which is a tool that I work with and work on. I'm Nick Stielau. I'm the director of operations at Pantheon, and nstielau on Twitter and d.o. So let's get into it.

Sensu is a monitoring framework. For people who have been around operations for a while, it's similar to Nagios, but it's reinvented to take advantage of the cloud, configuration management, and some of the other tools that are out there. Sensu is an open source framework. It's all hosted on GitHub, and it's based in Ruby.

The reason I wanted to present is that the Drupal community has so much energy and so many smart people doing various and different things. Everyone in this room is doing really cool, smart, and interesting things. And there are a lot of cool people in the Sensu community and the DevOps communities, and I really want to get those communities together and see what we can do when we're all together, because that's where the magic happens. So hopefully I can share a bit about Sensu, hear more about what you guys are working on and what your problems with monitoring are, and we can make more awesome.

To give a rundown: we're going to give a little context about monitoring, and about the monitoring journey that I've been through in the past couple of years at Pantheon and how we've been able to create our platform.
We're going to go through some of the basics of Sensu: how to get it installed and what the architecture looks like. And then we're going to dig into some use cases. I've sprinkled some David Hasselhoff photos through my slides, so if you're paying attention, there may be something in it for you at the end. Keep counting the David Hasselhoffs. Also, in the back of your mind while you're listening, think of your favorite dance move. That may come out later.

And also think about some of your own use cases. What's hard to monitor? What's hard about building a team that runs Drupal sites, or runs stuff that people come to rely on? What are some of the technical and human problems that you want help with? We can brainstorm together and maybe run through how Sensu might be able to help with some of those problems.

When I'm thinking about Sensu, I think about it in different ways. One thing I think about is all of the companies that are getting value from running Sensu. We don't do this just for fun. We do this because, at the end of the day, we're creating value for end users. We're creating systems and websites and platforms that people come to rely on and build their businesses on. Some of the companies that I work with in San Francisco, the guys I hang out with, use it: Pantheon, Paperless Post, Opower, HipChat, DNSimple, Pinterest, and a ton of places that I know in the Bay Area and from online, and also all over the world. At the local Prague DevOps meetup, there are people running Sensu on hundreds of nodes.
So it's pretty cool to see that companies all over the world, in Silicon Valley and in Prague and all over, are actually using Sensu to create platforms and systems that people rely on.

Another way I think about Sensu is as the community. It's the people that make this happen. Whether it's the Drupal community or Sensu or any other project, it's not really about the technology. Technology only goes so far. It's really about the people. These are some of the Sensu contributors. There are currently 92 contributors and I would love to get to 100. I just like the nice round number. So there are a couple of slots down here on the bottom, and if anybody wants to get their name and their photo up there, I'd love to make that happen. I help maintain the community plugins, which is a contrib space where people can put the different things they're working on. This is a GitHub site that's pulled directly out of the Git log for Sensu, using their Gravatars and stuff like that.

And then another thing I think of when I think about Sensu is the dashboard. It's a tool that you use every day, and the dashboard is the visual part of the tool that you're looking at. There's a lot going on behind the scenes. There are Ruby processes. There's the RabbitMQ AMQP bus. But at the end of the day, what it means to me in operations and to my team is loading this page a lot. There are a couple of different UIs that we'll get into, but this is a nice lightweight one that we like to use because it's so simple. And I think there's something about how the visual interface to a system mirrors the complexity or some of the ideals of the system itself. If you look at the dashboard, it's pretty easy to tell what's going on here.
You can click on it and get information. It's also nice and dynamic, so you don't need to reload the page and new alerts will pop up here. And I think that simplicity and elegance, that clean design that you see visually, is also mirrored in the code itself and in the architecture of the system.

So, starting at Pantheon and being an ops guy, it's pretty clear that you want to monitor all the things. You want to just go crazy and monitor everything. You want to know stuff is breaking way before your users do, at the slightest sign of anything going wrong. To me, it's not in production unless it's monitored. The only code that makes any money, that adds any value, is code that's running in production, and then code that's monitored running in production. So to me, without this final step, it's not really real. Sensu is something that really lets me cap off going from idea, through implementation and coding, into production.

How many people are familiar with Hubot, the chat room bot? We set one up and it has this little meme gen plug-in, so I made this slide by just typing "meme gen me monitor all the things." I was thinking about doing all my slides like that, but I thought it would probably go downhill, so I just left it for this one.

I think one thing monitoring gives us is that it helps us build conceptual models. Whether you're running one Drupal site or many, or you have one server or many, no matter what you're running, humans don't really understand code. Computers understand code.
And so it's important as a human to actively develop your conceptual models of what your code is doing, what your database is doing, what your application layer is doing, and all these different things. This is an idea I stole from Coda Hale and his awesome CodeConf talk a while ago: the map is not equal to the territory. So the thought experiment is, which one of these is San Francisco? Any guesses? Yeah, none of them, or all of them. The idea is that these are different representations of San Francisco. One has terrain on it. One has more of the streets. One is more of a photograph, a visual representation. They all have different layers of data associated with them, but none of these are actually San Francisco. If you are a hipster snob, you might want a neighborhood map: if you're on this side of the road, you're in Potrero Hill, and if you're on this side, you're in the Mission. That might matter a lot more to you than what the terrain is or what it looks like from space.

So Sensu is one tool for this, and the more complex the system, the harder you have to work to build these conceptual models. A couple of years ago, when cloud systems started being popularized, and now with containerization and all of the platforms, the territory got a lot harder to map. A data center is a relatively easy conceptual model. You can think of blinking lights and humming fans, and it's like a nice ops dungeon in there. Those are servers. You can visualize the network cables going to them. But in the cloud, it's harder to wrap your head around this.
The nodes might be coming and going, maybe autoscaling, or different teams might be managing different sets of nodes for different sites, and it's a lot harder to think about what that actually means. So the importance of monitoring increased recently with cloud VMs and containerization, because it got that much harder for operations staff and people running stuff in production to actually wrap their heads around what that means. Sensu was a tool born out of this need for better tools to create those conceptual models of what we're actually doing here.

Some more background. How many people are familiar with the "monitoring sucks" meme? A couple. This came out a couple of years ago. This guy, Lusis, who's been vocal in the ops community for a while, had just had it, and he wrote this blog post that said, basically, monitoring sucks, these tools do not meet my needs. I think it caught the operations and infrastructure community right at the point where a lot of people were encountering these same problems with the cloud and the lack of tools to support it.

Here are some other tweets. Sean Porter was like, Nagios sucks. We can make it bearable, but it's not something that I love. It's not a tool that really gives me the warm fuzzies and goes above and beyond. Jason Dixon, who now works at GitHub, was echoing some of those same thoughts, and James Turnbull from Puppet was again echoing those same thoughts: now's the time to do something about it. We need to create these tools and make it better.
So there was a GitHub repo born out of this that collated a lot of the ideas, with links to some of the blogs here. And then there's the top tweet here by Sean Porter, or portertech. He's the guy that actually created Sensu. I had never seen this tweet until I was researching for this talk, but I thought it was cool that some of the emotions that actually led to this great tool are memorialized on Twitter. Sean lives in Vancouver and was working at a company called Sonian, which, after realizing the need and realizing that the community needed a solution, was able to start investing in the idea that became what Sensu is today.

Around this time there was a Cambrian explosion of a lot of different kinds of specialization. There was an obvious market opportunity. Here are some of the projects that came out of that. Some of these are open source: Graylog, Logstash, that kind of stuff. Some of them are SaaS: PagerDuty, New Relic. Sensu is definitely an open source tool. Pantheon is more of a SaaS/PaaS model. A lot of people here are standing on the shoulders of giants but also making money.

At Pantheon I think we take a very pragmatic approach to some of this stuff. We invest a lot in Sensu and our infrastructure monitoring. We also have over a thousand Pingdom checks, which get programmatically created every time you create a paying site. We use New Relic on almost every site on Pantheon as well. So there's great stuff about open source, and there are a lot of really great services here too. I think the main point isn't necessarily open source versus SaaS.
It's just that there was a market opportunity that the community really came together around, and a lot of great tools came out of this space. In my particular context at Pantheon, Sensu makes a lot of sense, but it might not make sense for everyone. Or it might just be part of a hybrid solution that involves both open source and SaaS monitoring. There are a lot of good things about all of those.

With Sensu, I don't think monitoring sucks anymore, but monitoring is still really hard, especially at scale. When you create a system that's large enough, you start to see complex behaviors where you couldn't even really imagine that this part of the system would influence that part, but sure enough, somehow it happens.

Monitoring is not about the tool. It's really about creating a team that can deal with the escalations. If you could monitor stuff great and send the alerts into space, that's not really monitoring. You need to send those alerts to a human who can deal with them, fix the problem, get the site and the service back online, and get the end users receiving that value again.

Getting more people on call. I think every organization needs more people on call. This is a very difficult ask, and it's a hard problem to solve. Asking people to be on call, to wake up in the night, to really own the service that you're delivering, is a tough ask, and it's really important to create a system around that which makes it possible. There are a lot of great things about getting more of the company invested in the day-to-day stuff. Whether you're a Drupal shop or a platform or a host or anything like that, the world we live in is, as my colleague Scott Massey said in his talk, not like a fling with a site.
You're marrying these sites. You want them to be online, not just for a day or the day you deliver them. You want them to be online, responding, delivering value, and being performant from now until forever, until nobody cares about the site and it goes to where old sites go.

Making alerts actionable. Sensu makes it really easy to create alerts, and that doesn't mean they're necessarily actionable. "The sky is falling, the sky is falling, what am I going to do?" is not very helpful. It just gets your heart rate going, and you are no better off than you were before. So instead of just "the sky is falling," I would like someone to come to me and say: there's a meteor, it's about 500 kilograms, it's about 36 inches in diameter, it's hurtling towards Earth at about 1,000 miles an hour, and if you can just step to the left, yeah, that's good. So it's important to make sure that, if stuff's breaking, your monitoring alerts help you fix it.

Chris mentioned the other day that Sensu is probably a great way to create monitoring fatigue, and I'm like, it is an excellent way. Again, the technology does what it does, but you as a human and as part of a team really need to figure out how to make alerts actionable and make sure that you don't get alerts when you don't need to. I spend a large amount of my time just figuring out how to get one alert instead of two, or how to prevent that alert you got in the night that you really didn't need to wake up for. So later we'll go into some things in Sensu that help you fight monitoring fatigue, which is a constant problem when you want to be very finely tuned into what your system is doing.
And then lastly, the ops team, or the people on call, are just a small part of an organization, so you do have to create organizational change. We call it the "pull the red cord" metaphor, which comes out of the Toyota production system. As someone who's on call, who's doing the monitoring, you are hooked into the system in a way no one else is, and you need to be able to say, you know what, guys, we all need to stop what we're doing and think about this, and halt the production line.

That's really important because you're hooked into the system, and you might see a database query that's not doing well. Maybe you can fix that query, maybe you can restart MySQL, whatever it is. But more likely you're going to need to work with a client. You can't just stop that query. You need to figure out what data they are trying to access with that query. Maybe you need to buy new hardware. Maybe you need to switch hosts, and then that gets the business involved and the financial guys involved. So it's super important to create an organization where the ops team doing the monitoring is really the leading edge, where they have their finger on the pulse and they're the ones who are able to go through the organization and figure out how to make those changes.

Some things I do and don't do with Sensu, as a little disclaimer. I operate a bunch of Drupal sites, and I think there's a saving grace in that at Pantheon. Technically we have a two-person ops team, which I think is an impressive number, although all the rest of the technical team is really part of the ops team as well.
So in some ways it's actually easier to operate a lot of sites, because you really need to automate, you really need to have tight tolerances, and you need to be strict about this stuff. In addition to managing these servers and boxes and containers and Drupal sites, there are the human elements: I manage the operations team, the on-call team, and some of those processes. I was looking at our PagerDuty account, which is hooked into Sensu, and we've handled over 15,000 incidents in about two years, which if you do the math is basically constantly. So Sensu is something that connects me to these servers, this value, this platform. Sensu is really my conduit to what the platform's doing at any given time.

I also help maintain the Sensu community plug-ins, some of the contributions. A good note here is that I don't write much Sensu core code. It's actually really not that much code, a couple hundred lines. But community is a lot more than that, and contributing to a project is a lot more than writing code. I'm doing stuff like this talk. I think one of the most valuable things I can do, and one of the most valuable things anyone can do in open source, is just be excited, get other people excited, create some workflows, and point people in the right direction.

It's also fun in open source that nobody makes these roles. They don't assign you as this and that. You just step up and bite something off, and it's cool to see how those roles evolve. I forget the name of the law, but it says that the products created by an organization reflect the organizational structure.
I think that's very applicable in open source, and it's really cool that Sensu reflects that open source, distributed, volunteer mentality. This has been shaping my journey with Sensu, and my journey with monitoring and DevOps, over the past couple of years.

I don't manage thousands of servers. I work with people that do use Sensu at that kind of scale, at Pinterest and Cloud and places like that. And I don't operate life-or-death mission-critical services, and I'm really glad about that. I think there are a lot of parallels between operations and other on-call emergency services people. But at the end of the day, as much as high availability is critical to our end users, it's not a life-or-death situation. That's an important bit of context when you're doing this stuff, because it may seem like the end of the world sometimes.

So that's the context. Everyone can raise their hands up, take a little stretch in the back. Someone get them up. Okay.

Let's run through some of the architecture. Sensu is Ruby-based, and the key to it is using RabbitMQ to talk to all the clients. Up here you have a Sensu server, you have some Sensu clients, and RabbitMQ is the thing that's connecting them. RabbitMQ is a message bus, and it's cool because one way you can look at Sensu is as really just an operations router. In this case, in the queue model, there might be three queues: an all queue, a web servers queue, and a databases queue. The clients can connect to different queues. Client one would be connected to the web servers queue and the all queue. Client two would be connected to the databases queue and the all queue.
And with Sensu, you're able to publish a request for all servers to check on their disk, versus just database servers to check on the database, or web servers to check on Nginx and PHP. So that's some of the architecture that Sensu is based on.

It's pretty easy to get up and running. It's got what's called an omnibus install, which creates an entirely separate directory, /opt/sensu, and there are good packages for Debian, Fedora, Ubuntu, and so on. This is nice because the Ruby that you're using is entirely separate from your system Ruby. It's entirely self-contained, so it's easy to get started. This was designed by a colleague at Pantheon, Joe Miller.

One strong suggestion I'd have, if anybody's getting into this, is to use config management for Sensu and for monitoring. The whole reason you set up monitoring is to increase the reliability of your production systems, and if you can't rely on your monitoring, it defeats the purpose. Configuration management is something like Chef, Puppet, or Ansible. I don't really care which it is, but as long as you're not doing this by hand, that's what will give you the confidence to rely on your monitoring, on your systems, and on the value that you're giving your end customers. There's great Chef and Puppet support, so that can get you going, and those are also good ways to contribute as you're getting up and running. Every snowflake is unique, whereas when you're running servers, you don't want unique snowflakes, or Drupal sites for that matter. If you're going to run a lot of anything, you want them to look very similar, if not identical.

So here's a good way to get started with Sensu and get up and running. You can play around with Vagrant or EC2 and see how it feels to you.
If you already have monitoring, such as Nagios, a great way is to run Sensu alongside Nagios, and maybe don't actually send those alerts out, but just start using the dashboard and get a feel for it. As Chris said in his talk the other day, you want to make the other people on your team jealous that they're not using the cool dashboard and that it's really easy to write those checks. Once you have it running in parallel and you've got some buy-in and some interest from your team, then you can swap out your old setup and get going with Sensu.

I have a question. Are you basically using only Sensu at this point?

Yeah, I've never used Nagios, so Pantheon is only using Sensu. Pantheon has never used Nagios. We do use Pingdom and other third-party SaaS stuff for monitoring. It came at a good point in the Pantheon timeline: we knew we had this problem, we didn't really want to go down the Nagios road, and we had the ability to jump in with Sensu.

To give you a better feel for what's going on under the hood, we can write a check. A check is what checks on the health of a particular system. In this case, we're going to check that the directory /etc exists, which of course it always should, but we'll just go ahead and write the check. This one's in PHP. You can write them in really any language, which is pretty cool. A lot of the community-contributed ones are in Ruby, but you can write them in shell or PHP or Python or really whatever you're familiar with. Below is the configuration. You can see that it's going to be handled by the email handler. It has the command, the executable that we want to run. We want it to run every 60 seconds. And then in here you'll also see some funny things. Expect dancing: true. If you have an /etc directory, it's pretty much a dance-worthy occasion. And then yeti: true.
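The slide's check was written in PHP and is not reproduced in this transcript. As a rough stand-in, here is a minimal sketch of the same idea in Python, together with a check definition built as a dictionary and printed as JSON. The check name `check_etc`, the script path, and the `all` subscription are assumptions made for illustration; `expect_dancing` and `yeti` are the custom attributes mentioned in the talk.

```python
import json
import os

# Exit codes follow the Nagios plugin convention, which Sensu checks also use.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_directory(path):
    """Return (exit_code, message) describing whether `path` is a directory."""
    if os.path.isdir(path):
        return OK, f"CheckDirectory OK: {path} exists"
    return CRITICAL, f"CheckDirectory CRITICAL: {path} is missing"


# Check definition roughly as it might appear in a JSON config file.
# "check_etc", the script path, and the subscription name are hypothetical.
check_definition = {
    "checks": {
        "check_etc": {
            "command": "/usr/local/bin/check_etc.py",
            "subscribers": ["all"],
            "interval": 60,          # run every 60 seconds
            "handlers": ["email"],   # escalate through the email handler
            "expect_dancing": True,  # custom attribute, ignored by Sensu itself
            "yeti": True,            # custom attribute, ignored by Sensu itself
        }
    }
}

status, message = check_directory("/etc")
print(message)
print(json.dumps(check_definition, indent=2))
# A real check script would finish with sys.exit(status) so that the Sensu
# client can read the result from the exit code.
```

The message goes to standard out and the health of the system is carried by the exit code, which is all a check has to do.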
The point here is that the configuration is done with JSON, and you can really stick anything you want into the JSON. For example, this playbook attribute is something that's not at all part of Sensu, but it's something that we put in every check. The playbook is the URL of a wiki page that describes how to resolve, look into, or debug that particular check. That's a step we've taken to make the alert actionable.

I think of the checks as commodities of detection. Sensu is totally compatible with all existing Nagios plugins, which are battle-tested and have been in use for years. There are lots of Sensu community plugins, and you can even use any command line tool, in bash for example, that returns an exit code. It's really just going off of classic Unix principles: standard out, standard error, and the exit code.

Checking on time. So this is a box of smoke detectors. One of these smoke detectors isn't really better than the other. It certainly took some engineering work to figure out how to trap the molecules, detect that they're smoke, and sound an alert. But in Sensu, I think checks are commodities. There are checks both in Sensu contrib and in Nagios for checking on an HTTP response, and nobody's going to be like, hey, bro, my check for HTTP 200 is way better than yours. Those are all pretty much the same, even though there is some engineering going into them.
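To make the "any command line tool that returns an exit code" point concrete, here is a small sketch that runs an arbitrary command and interprets it the way a Nagios-style check is interpreted: 0 is OK, 1 is WARNING, 2 is CRITICAL, and anything else is treated as UNKNOWN. The `run_check` helper and the fake plugin command are assumptions for illustration, not part of Sensu.

```python
import subprocess
import sys

# Nagios-style meaning of plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command):
    """Run `command` (a list of arguments) and return (status_name, output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Any exit code outside 0-2 is reported as UNKNOWN.
    status = STATUS_NAMES.get(result.returncode, "UNKNOWN")
    # Prefer standard out, fall back to standard error if it is empty.
    output = (result.stdout or result.stderr).strip()
    return status, output


# Stand-in for a real plugin: a tiny command that prints one line and exits 1.
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('disk 85% full'); sys.exit(1)",
]
print(run_check(fake_plugin))
```

Because the contract is only standard out, standard error, and the exit code, the same wrapper works for a shell script, an existing Nagios plugin, or a community plugin.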
So in this image with the bucket of smoke detectors, it's less about what one individual smoke detector does or doesn't do. If you're managing a conference center like this, or a university, it's really about how you know which smoke detectors are going off, and how you make that actionable so you get someone there to deal with it if there is an issue.

Then there's the other side of it: there are checks and there are handlers. Handlers are the things that are on the escalation path. If a check marks something as unhealthy, a handler is how that gets to a human, or to another system, to deal with it. Again, we'll write one in PHP. This one's just going to read from standard in, again using some of those same Unix principles. It gets a JSON blob over standard in, we'll decode that, pull out the name of the script, and just use the mail function to mail it off to us. This is a simple example, but it gives you an idea of how easy it is to write a handler.

And the magic of Sensu is really in the handlers. Again, the checks are just checks. They're making sure websites are responding with the right code in the right amount of time, making sure that the disk is okay, and stuff like that. But with the handlers, that's where you're really able to implement something I've coined as ops logic. This is business logic for your ops team: for your business, what escalation paths make sense. This might be that you start at the CEO and go down. This might be that you send to everyone at once. This might be that you pick a random person. This might be that you send to multiple services. This might be that during the day it goes to the engineering team, and at night to the ops team.
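The PHP handler from the slide is not in the transcript, so the following is only a sketch of the same steps in Python: take the JSON blob, decode it, pull out the check and client names, and prepare a message. The sample event is trimmed to the few fields used here and its values are made up; `build_alert` is a hypothetical helper name.

```python
import json


def build_alert(raw_event):
    """Turn the JSON event a handler receives on stdin into (subject, body)."""
    event = json.loads(raw_event)
    client = event["client"]["name"]
    check = event["check"]["name"]
    output = event["check"].get("output", "").strip()
    subject = f"Sensu alert: {check} on {client}"
    return subject, output


# Sample event with made-up values, reduced to the fields this sketch reads.
sample = json.dumps({
    "client": {"name": "web-01"},
    "check": {
        "name": "check_etc",
        "status": 2,
        "output": "CheckDirectory CRITICAL: /etc is missing\n",
    },
})

subject, body = build_alert(sample)
print(subject)
print(body)
# A real handler would call build_alert(sys.stdin.read()) and then send the
# result, for example with smtplib, instead of printing it.
```

Keeping the parsing separate from the delivery makes it easy to swap email for a chat room or a paging service later, which is where the ops logic starts to come in.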
The magic of Sensu for me is that, given the difficulty of the human sides of monitoring, which are building the team, building the escalation policies, and making alerts actionable, the flexibility of Sensu handlers makes it really easy to develop interesting business logic and ops logic workflows.

In this example, the check is run on the Sensu client. It sends a result, which is just JSON, back through RabbitMQ. Then there's an opportunity to filter out results we don't want. Maybe those are ones where we've explicitly said, hey, I'm working on the server, I know it's bad, you don't need to alert me about it. Maybe they're ones that are subdued for off hours, that kind of stuff. Then there's an opportunity to mutate the JSON if it needs to be tweaked for the handler at all, which we'll talk about. And then it's sent to the handlers, which are the scripts like the one we just wrote, to actually trigger the escalation paths.

Before, we wrote a quick handler. Now we're going to write an awesome handler, and this is where the ops logic comes in. The top of this is exactly the same as it was before. But in the bottom, we're going to use some of those custom attributes that made sense for our business, for our use cases, to show off some of the flexibility of Sensu. Here we pull out of the JSON whether we're expecting yetis and whether we're expecting dancing. If there might be yetis, we trigger an automatic API-driven thing to go buy some party hats from Amazon, because yetis love party hats. And if we should expect dancing, we'll trigger the function to put on our dancing shoes. So if your /etc directory exists, we already have the shipment coming from Amazon and our servers are wearing their dancing shoes.
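The filter, mutate, and handle stages described above can be sketched as three small functions. This is an in-memory illustration under stated assumptions, not Sensu's implementation: it assumes the custom check attributes (`yeti`, `expect_dancing`) are carried along in the event's check data, and the function names and the silenced-client set are hypothetical.

```python
import json


def is_silenced(event, silenced_clients):
    """Filter step: drop events from clients someone is knowingly working on."""
    return event["client"]["name"] in silenced_clients


def mutate(event):
    """Mutator step: reshape the event into just what the handler needs."""
    check = event["check"]
    return {
        "client": event["client"]["name"],
        "check": check["name"],
        "yeti": bool(check.get("yeti", False)),
        "expect_dancing": bool(check.get("expect_dancing", False)),
    }


def plan_actions(summary):
    """Handler step: the 'ops logic' that decides what should happen."""
    actions = [f"email ops about {summary['check']} on {summary['client']}"]
    if summary["yeti"]:
        actions.append("order party hats")
    if summary["expect_dancing"]:
        actions.append("put on dancing shoes")
    return actions


def process(raw_event, silenced_clients=frozenset()):
    """Run one event through filter -> mutate -> handle."""
    event = json.loads(raw_event)
    if is_silenced(event, silenced_clients):
        return []
    return plan_actions(mutate(event))


# Made-up event carrying the custom attributes from the check definition.
sample = json.dumps({
    "client": {"name": "web-01"},
    "check": {"name": "check_etc", "status": 0, "yeti": True, "expect_dancing": True},
})
print(process(sample))
print(process(sample, silenced_clients={"web-01"}))
```

The handler only returns a plan of actions here. In a real setup each action would call out to a mail server, a paging service, or some other API.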
And so this is just some of the flexibility of how handlers in Sensu can work with your business's needs to escalate properly. Do you guys know what GIMP is, the open source image editing thing? I spent about two hours trying to make an animated GIF infrastructure diagram, and then it crashed. But I did manage to get this, and it still looks a little wonky, but I'm still pretty proud of myself. Thank you. I was even going to put them on GitHub, but now I'm not quite sure where this is going to go. I might need to take a break from GIMP. So this is illustrating the fan-out publishing model. Again, leveraging RabbitMQ, one of the things it can do is called fan-out. For the all-servers case, the check comes from the Sensu server, gets to RabbitMQ, and then RabbitMQ publishes that to everyone listening on that queue. So in the "all" case, it goes to both servers. If the server were to publish a blue dot, but instead GIMP crashed, that would only go on the web server queue, and only the web servers would run that check to check that particular part of the system. In some of these diagrams, the Sensu API is floating out in the middle of nowhere. At first I'm like, well, that's weird, why is that just floating out there? And the Redis too; that's a little Redis data store. But then I'm like, actually, I think that speaks well to Sensu. One of the great things about the Sensu architecture is that it's very decoupled. I mentioned the user interface before, and there's really nothing special about that user interface. It just happens to use the API, which is a REST API, to generate the dashboard, a visual representation of what RabbitMQ is doing and what data has been stored in Sensu, or Redis. So you'll see them floating out there, but they're doing their thing. Oh, another tidbit, and hopefully this will be an aha moment: sensu is a Japanese word for fan.
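The fan-out publishing described above can be sketched with a tiny in-memory stand-in for the broker. This is not RabbitMQ client code; the `Broker` class, the exchange names `all` and `webserver`, and the list-based "inboxes" are all invented for the illustration.

```python
from collections import defaultdict


class Broker:
    """Toy fan-out broker: every inbox bound to an exchange gets a copy."""

    def __init__(self):
        self.exchanges = defaultdict(list)  # exchange name -> bound inboxes

    def bind(self, exchange, inbox):
        """A client subscribes by binding its inbox to an exchange."""
        self.exchanges[exchange].append(inbox)

    def publish(self, exchange, check_request):
        """Deliver one check request to every bound inbox; return the count."""
        inboxes = self.exchanges[exchange]
        for inbox in inboxes:
            inbox.append(check_request)
        return len(inboxes)
```

A web server would bind to both `all` and `webserver`, a database server only to `all`, so a request published on `webserver` reaches just the web servers.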
And so the whole model of Sensu, from conception, was using this fan-out model with exchanges and queues to publish checks to multiple different clients. So we're going to get into some use cases for Sensu here. These are specific things you might encounter, or little cool things about why Sensu works well. So, auto-registration. If you're growing your fleet, maybe you're auto-scaling out, maybe you're just adding servers regularly, that kind of stuff, you want the server to be monitored as soon as it can be. As soon as it's online, you want it to start being monitored. With some traditional tools, that can be a little hard, because you maybe have to send back your IP address, or Puppet or Chef has to run and see what all your nodes are and then write out configuration files to start Nagios, or something like that. Because Sensu is using RabbitMQ, as soon as a client connects to the queue, it is being monitored. It uses SSL certificates to connect to the queue, and as soon as it connects, it's being monitored. I was thinking, what's hard about registration? And then I was thinking about DrupalCon, and this is the photo from Munich. And I'm like, yeah, registration is hard. Anybody with the Association running this stuff? Who are these people? What do they want? Are they authenticated? Are they supposed to be here? Why are they here? What specifically are they trying to get out of this? Do they have a role? That kind of stuff. And all of that is baked into just the very initial connection of the client in Sensu. And then the other one is like, "Hello, my name is I-284-3A." We're not even dealing with people with names here. I think increasingly servers, being VMs and containers, are nameless resources, entities, and that makes that problem even a little more difficult. Keepalive checks. So this is a default check.
So again, as soon as you connect to RabbitMQ, the Sensu server will recognize that and will periodically send a check to see if the node is alive or dead. This is configurable: how often it happens, when it sends a warning, when it sends a critical. And so again, as soon as it connects to RabbitMQ, its role is defined, it's authenticated, and it's made sure that it's up and online. There are a lot of different checks you can do on a server: I/O wait, whatever, inodes, CPU usage, on and on, right? But the most fundamental one is, am I able to communicate with this client? The most basic part of any system is communication. And so the keepalive keeps an eye on the clients to make sure they're alive. The keepalive will fire if the load average spikes so high that the client isn't able to actually handle the keepalive and respond back. It'll happen if the server goes down, of course. It'll happen if there's a netsplit or some other networking issue that's breaking communication. So auto-registration and keepalives are great when you're adding servers, but a thing that quickly becomes pretty noisy is deleting servers. Again, you want that server to be monitored right up until the point where you don't want it to be monitored. Typically what happens is, when you take a cloud node down or something, it can fire off a bunch of alerts, like, hey, I can't ping this node anymore. And getting back to alert fatigue and that kind of stuff, if you just deleted that server, or it auto-scaled down, you do not want to be receiving panicked, freak-out, "the server is down" messages, right? So one pattern in Sensu is to use the idea of a gold record, right? A canonical source of data.
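To make the keepalive and "gold record" ideas concrete, here is a small sketch. The 120 and 180 second thresholds are illustrative values, the 0/1/2 status codes follow the usual Nagios-style convention, and `expected_clients` is a hypothetical stand-in for whatever canonical source (Chef, an internal API) says which nodes should exist.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def keepalive_status(last_seen, now, warning=120, critical=180):
    """Classify a client by seconds since its last keepalive was received."""
    age = now - last_seen
    if age >= critical:
        return CRITICAL
    if age >= warning:
        return WARNING
    return OK


def keepalive_action(client_name, status, expected_clients):
    """Consult the source of record before paging anyone about a quiet node."""
    if status == OK:
        return "nothing"
    if client_name in expected_clients:
        return "escalate"  # node should be up, so a human needs to know
    return "deregister"  # node was removed on purpose, so drop it quietly
```

The design choice is that the handler, not the check, decides whether silence from a node is a problem.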
So this might be Chef, this might be an API, this might be a different source of record or some other API. And what you can do is, when the keepalive check fires, in the handlers you can check that gold record, that source of record, and say, hey, should this node be up or should this node be down? If the node should be up and it's alerting, you can escalate that. If the node was purposely taken down and it's alerting, you can just remove it from Sensu and not worry about it. Again, building the conceptual models with monitoring: metrics are a huge part of that, right? Visual representations of the performance and variability of the different systems. And another thing that's becoming more apparent is that whether they're metrics data points, log messages, check failures, that kind of stuff, they're all pretty similar, right? It's an event that's happening, data about your system. And there are lots of things you want to reuse across metrics collection, check failures, and logging. You might want to reuse the transport mechanism. In this case, that's RabbitMQ, and it works really well. You might want to reuse the authentication. Again, that's built in with SSL and RabbitMQ. You might want to reuse some of the handlers and the escalation, that flexible logic where maybe these metrics go to Graphite and these metrics go to Librato or Datadog or something like that. So we use Sensu extensively for this. If you look on Pantheon's public status page, there are some public metrics which are ushered out of this Sensu workflow. So this is a graph of our database, or sorry, Valhalla file system server-side successful requests. Another cool thing you can do once you're using Sensu in that fashion is when you have some data in Graphite or something like that.
So, Graphite has these awesome post-processing tools, right? The solid green and the dotted red lines are load average, one from today and one from yesterday. And then the blue area graph is on the other axis, and that's representing how much higher the load average is today than it was yesterday. And then it's really easy with Sensu to alert off that difference. When you're monitoring at scale, with all this stuff, it's hard to look at every single server, and that creates alert fatigue, tons of stuff to monitor. So what you can do is use Graphite as a way to collect metrics, do some nice little post-processing, and then you can get cool high-level checks like, hey, is this thing worse than it was yesterday? And actually alert on that through the escalation path: something weird is going on. And then another pillar of DevOps, or something, is that you want your monitoring to be hooked in with your metrics. This is about making alerts actionable. If you just get an alert, it's not nearly as rich as if you can get an alert, have it linked to this page, load this up, look at what the load average is now, look at what it was before, and see that visually, very clearly. Also, I put in a red line, and that's a very clear, understandable metaphor for what's going on here. If you got a page and opened this, groggy, late at night or something, you'd be like, oh, well, it did cross the threshold, but not by that much, right? And it's going down. So this graph is able to inform you of that, whereas if you just get the alert, maybe you don't know, and it's harder to get this data and harder to actually take action. So, getting a little bit more into this processing pipeline, the flexibility, the power of Sensu. So, the mutator function.
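The "is this worse than yesterday?" check described above could be sketched as follows. Fetching the two series from Graphite is deliberately left out; the function just takes two lists of load-average samples. The 50% and 100% thresholds and the use of a simple mean are assumptions for the example, not what Pantheon actually runs.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def day_over_day_status(today, yesterday, warning=0.5, critical=1.0):
    """Compare today's mean to yesterday's; thresholds are fractional increases."""
    if not today or not yesterday:
        return UNKNOWN  # no data to compare
    mean_today = sum(today) / len(today)
    mean_yesterday = sum(yesterday) / len(yesterday)
    if mean_yesterday == 0:
        return UNKNOWN  # a relative increase is undefined against zero
    increase = (mean_today - mean_yesterday) / mean_yesterday
    if increase >= critical:
        return CRITICAL
    if increase >= warning:
        return WARNING
    return OK
```

One check like this over an aggregated series replaces a separate load-average alert on every server.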
So when the clients send back the result of the check, it's a JSON representation. That'll have information about the client: what the IP is, what the name is, stuff like the timestamp. It'll have information about the check: which check it was, the check history, whether it's been passing or failing, that kind of stuff. And when you're using metrics, you want to get the metrics into Graphite, or wherever you want to get them. In this case, Graphite doesn't accept JSON in the Sensu format. It accepts a much cleaner text format. And so what you can use is a mutator, which will just take the JSON in, mutate it, and pass out something else. In this case, it's taking in the whole JSON check representation, pulling out just the output that the script we ran on each client spit out to standard out, and passing that on. What this does is allow the handler to just be a direct TCP connection to Graphite. It doesn't need to do anything fancy. It can just mutate the result slightly and then shovel it right into Graphite, which makes things scale and perform a little bit better. So, yeah, some other use cases, right? And this is what the filter does. I want to know about this check during the day, but it really should not wake me up. This should not annoy me and my wife when I'm at dinner. It should not go off while I'm jogging before work. So there's built-in subdue functionality, which is really just a filter in this workflow: if the current time is outside of these bounds, just drop it on the floor. Check dependencies. This is another important one, right? So Clippy over here is saying one alert is better than two.
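The mutator and the subdue filter described above are both small pure functions at heart. The sketch below assumes the event JSON carries the check's standard output under `check.output` and that the metric script already printed Graphite's plaintext format (`path value timestamp`); the function names are made up for the example.

```python
import json
from datetime import time


def only_check_output(event_json):
    """Mutator: take the full event JSON in, pass out just the check output."""
    event = json.loads(event_json)
    return event["check"]["output"]


def is_subdued(now, begin, end):
    """Filter: True if `now` falls in the quiet window, which may wrap midnight."""
    if begin <= end:
        return begin <= now < end
    return now >= begin or now < end
```

With the mutator in front, the handler has nothing left to do but write the resulting text to a TCP connection to Graphite; with the filter in front, a subdued event is simply never handed to a handler.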
So maybe you have multiple interfaces, like a private network and a public network, and you're monitoring both with pings. If that server goes down, you don't need the keepalive and the private ping and the public ping all alerting, right? So you can use a very simple dependency model to say, hey, if there's already a failure for ping or keepalive or something like that, don't worry about escalating this. I've got it. Otherwise it's more of a "the sky is falling" thing, where your phone is just hopping off the table, vibrating, and it's not really helpful. So this means you can see what the problem is, get actionable, good metadata to get more debugging information, and go on from there. Another cool Sensu use case that it was designed around is this idea of a push workflow. Each client listens on localhost, port 3030. It's all configurable. You can send in with TCP or UDP. And so what we're going to do here is use Netcat. We're going to spit a little JSON into Netcat there. So this is going to say, hey, just push this event in there, and status two means critical. This would push through the Sensu client, through RabbitMQ, back up to the server, and get handled appropriately. So this is cool stuff. And this is really where I want some of everyone's expertise here: integrating this with Drupal. This could be failed logins, or really any event, any action that happens. Anything that's going into Watchdog could maybe also go here and go through the escalation path. I'd love to get some help thinking through some of that stuff.
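The Netcat push described above amounts to writing a small JSON document to the local client socket. Here is a hedged Python equivalent: the `name`, `output`, and `status` keys match the event described in the talk, while the check name `drupal_failed_login` and the helper function names are invented for the example. `push_event` needs a Sensu client actually listening on the port, so only the payload builder is exercised here.

```python
import json
import socket


def build_event(name, output, status):
    """Encode an ad-hoc check result; status 2 means critical."""
    payload = {"name": name, "output": output, "status": status}
    return json.dumps(payload).encode("utf-8")


def push_event(payload, host="127.0.0.1", port=3030, timeout=2.0):
    """Send the payload over TCP to the local client socket."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For instance, `push_event(build_event("drupal_failed_login", "5 failed logins for admin", 2))` would hand the event to the local client, which forwards it through the same filter, mutator, and handler pipeline as any scheduled check.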
You know, maybe every time an administrator logs in, I don't know, whatever that workflow is, you detect those events and trigger that escalation workflow. And it's just really easy to do. So my colleague, Joe Miller, was at his bachelor party in Las Vegas, and I did this and set him as the escalation target. I said, "Critical: you are not having enough drinks," and sent it through the system. I'm not sure if he appreciated that, but I thought it was pretty funny. And so, yeah, Sensu going above and beyond. A loosely homogeneous fleet. So again, right, Drupal sites, servers, anything: it's better if they're more homogeneous instead of snowflakes. But I'm a practical man, and sometimes not everything looks the same, right? Maybe you have a couple of servers, and one client is paying for a much bigger server than the other ones. Maybe the servers you have, you've bought them every year, so they have slightly different performance characteristics, right? You don't want snowflakes, but you probably want every server to have a different MySQL password or something like that. And so Sensu has some nice ideas around client attribute substitution. A good example here: we could just put in that Yeti, and when it ran the check, it would pass that as the command line arg. So if you can't have everything identical, like all owls, and sometimes you have a cat in there, Sensu can handle that. You really want more homogeneity for your sanity, but in the real world, right, things aren't all exactly the same.
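Client attribute substitution, as described above, is essentially template expansion done on each client before the check command runs. The sketch below uses a `:::attribute:::` token form with an optional `|default`, which is my recollection of Sensu's syntax and should be treated as an assumption; the `mascot` and `mysql.password` attributes are invented for the example.

```python
import re

# :::name::: or :::nested.name|default::: (token syntax assumed, see above)
TOKEN = re.compile(r":::([\w.]+)(?:\|([^:]*))?:::")


def substitute(command, client):
    """Replace each token in the command with the client's own attribute."""

    def lookup(match):
        value = client
        for part in match.group(1).split("."):
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                value = None
                break
        if value is None:
            if match.group(2) is None:
                raise KeyError(match.group(1))  # no attribute and no default
            return match.group(2)
        return str(value)

    return TOKEN.sub(lookup, command)
```

The same check definition can then be shared by the whole fleet while each client fills in its own value.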
Yeah, I mentioned the API. So this is a little Adam Jacob from Opscode's head saying you need an API. And really, an API is a critical part of any infrastructure, especially open source infrastructure. What an API does is create an interface that people can build off of. So Sensu has this nice little REST API, very concise, very easy to use. You can publish check requests on specific servers; you can create, add, delete, and update clients; you can look at the histories; you can add little data snippets that the clients and servers use to talk to each other. And this is also what enables multiple UIs. There's a Hubot plugin that hits the API; we have our dashboard wallboards that just read from the API; we have the web interface that we load. And so the API really decouples innovation, and that's another core tenet of Sensu: the code base is very extensible and small, and it allows you to configure it and to add the flexibility without growing the core very much. Any system where the UI is very tightly coupled to the core purpose is really hard to innovate on, I think, in particular because visual representations are so different from what's under the hood.
Some more little Sensu snippets. You can aggregate across clients: you can mark a check as something you want to aggregate, and this would be like, 125 clients are okay, 10 are warning, and one is unknown. So one thing you can do, if you have events that happen all at once, or you really want a higher-level view, is that instead of having each client escalate through PagerDuty or whatever, you can use this facility to aggregate the results of all those checks across your entire fleet. And then you can have a check that says, hey, let me know if more than two of my web servers are not functioning, or something like that. If your load balancer is able to take out crippled web servers, it's easy to write a check that says, hey, as long as I have more than a couple of good web servers back there, let me sleep through the night and I'll deal with it in the morning. Yeah, so I think that's a pretty common theme here, right? Another one, and there are some good examples of this in the community, is using Sensu to actually trigger remediation programmatically. So this is a screenshot from our Kibana logging data. Some of this is done through Sensu and some of it is not, but if Sensu detects a failure, in certain cases it will itself trigger a remediation step for that, and then if that doesn't work, it will escalate it up to a human. This is for something that happens occasionally. In these cases it's restarting PHP or restarting a file mount, which happens occasionally. This is over several days, and you can see there are clusters of it. And so that's another use case where the flexibility of Sensu handlers can help you work that workflow in as well.
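The aggregate check described above ("let me know if more than two of my web servers are not functioning") reduces to a threshold over per-state counts. This sketch takes the counts as a plain dictionary; how you obtain them (for instance from the API) is left out, and the state names and the `max_unhealthy` parameter are assumptions for the example.

```python
OK, CRITICAL = 0, 2


def aggregate_status(counts, max_unhealthy=2):
    """Return (status, unhealthy) given counts of clients per check state."""
    unhealthy = sum(n for state, n in counts.items() if state != "ok")
    status = CRITICAL if unhealthy > max_unhealthy else OK
    return status, unhealthy
```

With the numbers from the talk, `aggregate_status({"ok": 125, "warning": 10, "unknown": 1})` counts 11 unhealthy clients, which is over the limit of two, so it returns a critical status as a single alert for the whole fleet.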
Some good resources online: there's a users group, all the code's on GitHub, and IRC is a friendly place. Would love to see you guys there. There are some good presentations if you Google around; they're on sensuapp.org. Yeah, so, contributing. If this is something you guys are interested in, it's definitely easy to contribute. Don't worry about the Ruby thing; we'll get some PHP checks in there, we'll get anything you want. I'm just really excited to get more people using this and more use cases fleshed out, business use cases or technical use cases or anything like that. And yeah, I really want some more Drupal-specific stuff in Sensu, because I think it could be a good match. So I'm excited to see what you guys have to say about that, and thank you guys so much. I'm happy to answer a couple of questions. I don't think we have too much time, but grab me after; I love to talk about this stuff. So thank you guys for listening so much. Appreciate it.
Yeah, so the question is, in addition to using metrics and Graphite to look at data over time, and more statistical processing, do we use any other tools? Currently we don't. I think the graphs and the metrics are very important data to have for a lot of reasons, and so that was just one example of leveraging the processing power of Graphite, which we already have in place, to do some nice alerting. I think there are great solutions, Riemann or more event-processing stuff, that could handle pretty complex use cases. And then another aspect is that, because of the push model, you can really have arbitrarily complex processing, and then if it detects anything, it sends an event through the push. Yeah, or external tools. And I think it is a hard balance between having features and being easy out of the gate, but also being very complete. So I think Sensu is very flexible and offers a good solution out of the gate, but by the time you're using this at scale and have those interesting use cases, you're probably going to want something that's a little more purpose-built, and Sensu can probably help you with some aspects of that.
Do you have a suggestion for learning the behavior of the machine?
No, I don't, and I think that's very interesting. Especially for bigger systems, there's very interesting emergent behavior and stuff like that. It's really beyond the power of simple graphing, or a human, to get their head around that, so you're going to need to leverage some external tools. We can chat about it. But yeah, anyone else?
I just wanted to know: you mentioned RabbitMQ and Redis. Do they need to run locally with Sensu?
So that's why this is an advanced talk. There's great Chef and Puppet support for it, and if you're not using Chef or Puppet, it is intimidating to set up, because there are those moving parts. Since it was so conceptually based on that fan-out model, you really do need to run RabbitMQ, even for a proof of concept. But I have a link on this chef-monitor one; there should be some pretty Vagrant-compatible stuff where you can pretty much just do "vagrant up" and get a running environment, a VM on your local box with the dashboard and RabbitMQ and Redis and a server and a client, and you're able to test it. So that's the route that I would go to get started.
Can you load balance the Sensu service?
Yeah, so you can. It's a little bit different from web load balancing, but you can run multiple. You could set RabbitMQ up in HA fashion, and then you can just run as many Sensu servers as you want. Because of the results queue that the clients send their data back on, one Sensu server will get one result and escalate it as it will, and the other will get the other result.
So is it going to work as a master-slave, or is it just...
It's active-active; they're both popping off the queue. We run more active-passive, but we'll just turn on the other one.
Looking at the RabbitMQ... The reason I mentioned Redis is that we basically have an application where Redis is getting used, and we always struggle to load balance Redis, because there is no load balancing as such in the stable version of Redis.
Yeah, so the question's about load balancing Redis, right? The way you'd set up HA for Sensu is just redundancy on each level. You'd set up HA on each level: you'd set up HA for the Sensu server with multiple servers, which is easy, and set up replication with Redis. And I think there is an aspect of this stuff that you certainly need to rely on, and it needs to be very consistent and reliable. But if Sensu goes down and I'm alerted to it, it's not customer facing. Although you want to be catching alerts, at that point you're already fixing it, logging on, whatever, right? So if Sensu goes down and I'm alerted to it, I'll fix it, it'll get back online, and it'll be running checks. So that's my approach to it. And you can use Pingdom to check Sensu, which is just a nice sanity check that it is running, that there are producers connected to it, that there are consumers connected to it, that messages are flowing through, and that kind of stuff.
Thank you.
Anyone else?
Very quick one. You talked about scheduling and also automatically fixing problems. So could you set it up so that it automatically, say, restarted Apache at night but not during the day?
Yes. I don't know exactly how to do that; I don't have a pattern, you know, code I can show you, but that's the idea. And another thing about this: these are great use cases, and I think, like all nerds, when someone comes with a cool use case, you're like, ooh, how would you do that, right? So people on IRC and the mailing lists love digging into this stuff. I think the one piece of feedback I've gotten is that you want to separate monitoring from supervision as much as you can. So it's tricky, and I think there's a spectrum of theories about that. But for me, sleeping through the night is so important that I would err on the side of, that's a safe operation. Or once you've fixed something like 20 times and you're like, all I do is log in and do this one thing, right, that's when you're like, okay, I can probably have this happen automatically. But you do want to log it, and that's what I showed, so it's not happening invisibly, so Apache isn't being restarted without you knowing. But yeah, that's a cool use case, and definitely something that the flexibility of Sensu could work with: restarting Apache in the middle of the night but not during the day.
Cool, guys, awesome. I appreciate any feedback you guys have, on the forum or grab me later. Excited to see what everyone else is up to. Did anyone count the David Hasselhoffs? All right, he said 3.