So I'm the last guy standing between you and the beers. Some of you might have expected someone else here talking about distributed systems, but he pulled his talk two weeks ago, and during one of the track chair calls I got picked as his backup. So that's why today I'll be talking about monitoring Drupal in an infrastructure-as-code age. If you feel like you're in the wrong talk and you don't want to see this, feel free to leave; I'd understand if you feel that way. In ten minutes you can also still do that, but please bring me a beer then.

So, yeah, let me first introduce myself. Let me get one fact out there: I'm not related to Dries, but we have known each other since before Drupal existed. Ages ago I used to be a software developer, I even wrote PHP, and then I became an operations person, because I needed to deploy stuff. I was the guy constantly racking machines and putting things into production. Today my role is CTO of one of the larger open-source consultancy firms in Europe. We are Inuits; we have offices in Belgium, the Netherlands and Kiev. We do a lot of Drupal work, but we're not a Drupal shop. If I look at what I do today, it's mostly building large-scale infrastructures, doing it DevOps style: continuous delivery and automated infrastructure.

Who can tell me what DevOps is? This is the DevOps track, guys, who can tell me what DevOps is? Are you all afraid to answer questions, or does nobody know? Who was in my talk last year? Okay. So nobody knows what DevOps is. Two people, okay. Are you sure? "Collaboration of operations and developers." Nah, I give that a 50% score. This is what DevOps is to me: it's about culture, about how developers and operations people work together. It's about automation, and when we talk about automation we mean automate all the things: we want test automation, we want to automate the builds, we want to automate the deployments.
We want to automate the monitoring. Basically, when we talk about automation, we think about infrastructure as code. And when we think about DevOps, we also think about monitoring, metrics and all that stuff. And we also talk about sharing; that's the CAMS, now CALMS, acronym, which was coined by Damon Edwards and John Willis, who have a podcast called DevOps Cafe. The L, the influence of the lean movement, was introduced by Gene Kim, because they figured out there's a lot of lean thinking in this. It's also convenient, because if you had just the CAMS without the L, lots of people figured out that if you make an anagram out of it, it's just a scam, which it absolutely is not.

So, how many of you are testing your Drupal code? Okay, some of you. You have continuous integration, you have an environment in place where software is pushed to production with a set of rules, with a set of checks which you want to succeed, and that's how big organizations work. They're afraid of pushing software into production, and they want tests in place that make sure things go the right way. But for some reason, on the infrastructure level, on the actual operating-system level, we've been tolerating people logging on to machines manually and deploying stuff there.
We didn't allow the software developers to do that, but for the infrastructure people, well, that was fine. With the advent of clouds and large-scale deployments, a lot of new tooling came on the market which allowed us to do this differently, and that's basically what infrastructure as code is about. Infrastructure as code makes us think again about how we build our infrastructures today: how we model them, how we can quickly reproduce the platform, and how we basically get disaster recovery for free, the same way we build software.

If I look at a platform these days, there's a large part of it which I want to be completely reproducible. Think about it like Drupal core: nobody's going to touch that part, and it's completely in source control. Everything there, whether I'm running the platform in production, acceptance or testing, is identical. On top of that I have a part where I give it an identity, like the sites directory, where you basically say: this is the vhost, this is the site. You give it an identity with some specific configuration, you enable these modules, and for the operating system, for the whole stack, it's pretty similar: you add some business rules. The whole lower block is something that lives in version control, which you can manage and which you can redeploy over and over and over again. You need some scale-out, maybe you have some custom parts, but it's all automated, and it's something you can rebuild. The thing you need to back up is the actual user-generated content, because that's something you cannot version; it's going to be volatile, it's going to change frequently. But the whole lower stack, you have to start thinking about that as code and make sure it can be reproduced. Infrastructure as code also means something for how we, as infrastructure people, do this.
We need to start thinking like developers. I mean, I come from a development background, and a lot of other people do too, but we really treat our infrastructure as code. We have quality checks in there. We also use version control, we also use testing, we also use continuous integration and continuous delivery. That means we build up our core infrastructure that way, and we add the middleware deployment the same way: Apache, Solr, nginx, all the components that are used to build a website, and we do this in an automated fashion. We do continuous delivery of the full stack. We enforce security rules in there, so we actually tune the parameters, the firewalls and everything, automatically. So when we deploy something, we deploy a host, a service and the application, with the monitoring, and that's the link to the monitoring part, configured automatically.

So, monitoring. Who likes monitoring? About two years ago John Vincent, lusis on Twitter, was really fed up with all the monitoring tools we were having, and he basically tweeted with the hashtag #monitoringsucks. He put up a git repository listing all the different tools he knew, and we pretty much started looking at what's out there: what tools are out there, what do we need to change, what's new, and how can we improve this thing which is monitoring. To me the monitoring-sucks movement is basically a sub-movement of the DevOps movement: it's people who care about open source and care about monitoring and improving the monitoring space in the open-source world. One of the reasons why monitoring sucks is that a lot of those tools are not built for scaling. They have a GUI which is, well, one page, and if I have five thousand nodes I'm going to go page, page, page. Monitoring, before the infrastructure-as-code age, was usually an afterthought.
It was not in sync with reality. You had added four new services, but you never modified the actual monitoring config, because it was all done manually. It was mostly targeted at monitoring a host. So you're monitoring your web server: it pings, Apache is running on it, nginx is running on it. But you're not actually monitoring all the different services on it, because you don't have time to map that. Maybe there's one default vhost which you are monitoring, but all the other vhosts might have broken database connections, and you're not monitoring that, because it's manual. So the services are sometimes monitored, the actual application almost never, and that basically ends up in chaos and monitoring not being done correctly.

Fast forward a couple of months, to DevOpsDays Rome. Ulf Månsson from Recorded Future gave an ignite talk about his newfound love for monitoring. He had found a way to integrate a bunch of new tools, in his case Sensu, which Nick will be talking about tomorrow. And he started to find a newfound love for monitoring. That ended up in us hosting monitoring hack sessions in our office, and in people organizing a dedicated conference about monitoring called Monitorama, which had its first event in Boston earlier this year; the European event was in Berlin last week. A couple of people from the Drupal community were already there, figuring out how to improve monitoring.

So quickly, what's wrong with the current tools we were using? I mean, a lot of the tools were just not built to integrate with configuration management. They either had no API to talk to at all, or the API was completely broken, or it changed with every release. They had absolutely no relation to scaling environments, and they were focusing on things
we don't care about, like auto-detecting new services. We don't care about auto-detecting new services, because we are defining them in code. Tools like Zenoss and Zabbix, you know, does anybody use those? Yesterday John Topper was talking about how he loves Zabbix, and I love Zabbix too if I have a 20-node environment, because once I reach 100 nodes I need a full-time DBA just to manage the database it's using. The same goes for Cacti, the same for other tools. Then there are tools that use round-robin databases, like RRDtool, which is an awesome tool if you live in the eighties, but time-series databases have evolved and it's now much easier to create metrics and monitoring; I'll come back to that later. So there are a lot of tools which have been trying to do things well, but there's a new generation of tools out there which is going to allow you to do this much more flexibly, much better.

So we've defined infrastructure as code, we've defined why monitoring sucks and why we really want to build something new. Now the next question is: where do you start monitoring? What systems are you monitoring?
I heard somebody say it: you monitor all the things. Indeed, you monitor from development on: you monitor your acceptance platform, your production platform. A lot of the organizations I work with only monitor production, because that's all they care about. They have no idea about the load their application is generating in the test environment, because they're not monitoring it. Also, they only have one monitoring platform; they don't have a test platform for their monitoring tools. So everything they try out in monitoring, they're actually doing it against production live, and they're missing out on metrics because they're playing with their monitoring platform.

So we really want a feedback loop between developers and operations people, as fast as possible in the development cycle. If a developer adds a new feature, if he creates a new view, I want to see the metrics of the database, and I want to see that he created a slow query, and I want to know that the moment he actually does the commit, the moment he creates that view. If I only see that when he pushes it to production, I'm going to be about a week too late to tell him that, well, we need to add some indexes, or: you know the way you created that view?
Maybe we should do it differently. That's part of what comes from the lean movement; that's what DevOps is about: creating a fast feedback loop between developers and operations, so you get to know how the system behaves before it reaches production, before you need to scale it to a thousand nodes.

If you look at it from an architectural point of view, lots of the tools we don't like are big, bloated tools that have everything in one and think they can do it all. Think HP OpenView, think Tivoli, the typical enterprise tools; and a lot of the promising open-source projects, like Hyperic and Zenoss, try to emulate that, but it doesn't work. We need to take a step back and think about the good old Unix philosophy, where you have a bunch of small components that are each really good at doing their job. In that philosophy we need tools that are capable of collecting metrics; tools for transporting metrics somewhere else, typically a queuing system; maybe tools to transform and change metrics, like dropping data we really don't care about; and we need a third part where tools do the actual analytics, analyzing what's really going on, and act on that. Of course, we also need to satisfy management, so we need to be able to build nice-looking graphs, so they can see: hey, we've got more users, and hey, we've got more revenue.

So how do we start doing that? The first thing we need to do when we build infrastructure as code is to start monitoring a baseline. Each time we deploy a new host which is going to be part of our infrastructure, we're going to automatically add the monitoring; we're going to automatically add all the tools to do the collection, so it starts collecting the baseline. And we're also going to add check definitions and update the monitoring tool: when we deploy a new node, the monitoring tool needs to know that a new node exists. So that kind of brings you to an architecture which looks like this. Who knows, or who uses, components in this architecture?
I don't know if you can read it from the back, but this is a general overview. There's a lot of Java stuff in there, which might interest you if you're using Solr or Elasticsearch in the back. If you look at the Apache part, there's log shipping, and there's monitoring with for example Nagios, as mentioned in one of the previous talks. The light blue part is basically data collection, the darker blue part is data shipping, the green part is the actual tools that store the metrics and maybe do something with them, and the red part is the reaction to what is happening on the platform. It's a whole set of tools which you need to integrate; a bunch of them are still old-school, like Nagios or Icinga, but a lot of them are from the new generation.

So if you think about that in the infrastructure-as-code age, this is an example of how you could configure that in Puppet. Can we turn down the lights, maybe completely, because the contrast is not good enough? Okay, much better. So basically, I have examples in Puppet, but you could do the same using Chef or CFEngine or any other configuration management tool you want to use. Like John Topper said yesterday: the question is not whether to use Chef or Puppet; the question is to use a tool and start automating.

So when I define a host which is going to run an Apache instance, I usually have a meta class which is my way, our way, of installing Apache. We include apache, we include mod_rewrite to do stuff, we install PHP, PHP with APC, we basically configure the logging to go somewhere different, and we make sure the PHP class is installed before we configure APC, because otherwise we cannot install it. And as I told you earlier, we have a baseline which already installs collectd to do monitoring; we use collectd for metrics collection. But on this host, since it's also running Apache, we configure the collectd plugin for Apache. Who knows collectd? Okay, now I can't see how many people are raising their hands anymore. So, collectd is one of the tools which allows you to collect all kinds of metrics, store them, centralize them, or ship them to a monitoring tool, to a metrics tool. It's really good and it's really scalable. So when we define our Apache, we pretty much define how we configure Apache, which parameters we want to get metrics from, and we also define the logging, because how many people actually configure log rotation when they add new vhosts, new log files? Usually everybody forgets that, and the disk floods. We have that in there, and we have the firewall in there: pretty much the full container of what the service is built on, defined in code, reproducible. And if I need to deploy another service, it's fine.

So that's the definition on one node. Now I've defined one config, and one node knows how this works. Does anybody have an idea how to make this distributed? You see, there is distributed content in this talk after all. The example is Puppet, yeah, but it works with pretty much any other tool. So the Puppet architecture can work in different ways; you usually have a puppetmaster where catalogs are built. There's a client which connects to the puppetmaster and says: hey, I'm node X, give me the configuration for node X, and the puppetmaster will then ship that. So, Ricardo says the puppetmaster will know. Well, yeah, the puppetmaster knows the definition of that node; it does not know what the node next to it is using. So we need a feature in Puppet which allows you to collect and store configurations for other nodes.
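A sketch of that export-and-collect pattern in Puppet might look like this (the class names, the check details and the 'monitoring' tag are made up for illustration; nagios_service is Puppet's built-in Nagios resource type, and this assumes stored configs / PuppetDB are enabled on the puppetmaster):

```puppet
# On each web node: declare Apache, and *export* (@@) a check describing
# this host. Exported resources are only stored in the puppetmaster's
# database until some other node collects them.
class profile::web {
  include apache

  @@nagios_service { "check_http_${::fqdn}":
    host_name           => $::fqdn,
    check_command       => 'check_http',
    service_description => 'HTTP',
    use                 => 'generic-service',
    tag                 => 'monitoring',
  }
}

# On the Nagios/Icinga server: the "spaceship" operator <<| |>> collects
# every service exported with that tag, so each new web node gets
# monitored on the next Puppet run without any manual work.
class profile::monitoring {
  Nagios_service <<| tag == 'monitoring' |>> {
    notify => Service['icinga'],
  }
}
```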
In Puppet language that's stored configs with exported resources. The idea is that you have a node running which is including the Apache class, which exports a config that gets stored in a database, and when for example the Nagios server connects, it learns from that database that there are five new nodes with the Apache class installed which it needs to start monitoring. Basically, you export a resource with the @@ notation and you collect it with the spaceship operator, and this piece of code shows exactly that.

We use Icinga, a variant of Nagios, for all kinds of fun reasons. Part of it is that the original Nagios project is pretty much going open core rather than open source, and development really isn't moving on. That's an example of how forks are sometimes bad. But forks can also be good: the fork, Icinga, is European-based, those guys throw pretty good parties in Nuremberg, and they're actually improving things and heading in a direction that people like. So on the server we basically collect all the resources we've exported, and on the client we export them, and we do that for check_ping, but also for the Apache checks and for monitoring the host. In my Apache config there's also this check, which we defined because it's in my Apache class: when to check, what to check, and that code is executed while we are collecting the data. So when I define a new Apache server, Puppet will automatically reconfigure Nagios and relaunch it, and the new node will immediately be monitored.

So infrastructure as code, what does it buy us here?
That we are now capable of knowing when new nodes appear, and we don't need to do manual work. So quickly, if you look at Icinga and at Nagios, and this I stole this afternoon, there's a bunch of plugins for both Drupal and Nagios where you can check whether updates are missing and see the health of your Drupal instance. Basically those need to be there. You also want to check when the cron job last ran, whether there are long-running cron jobs, and whether they're blocking the platform. That's a check we do by default when we export the Apache config. So the thing is, we are now monitoring not only Apache: every time we export a vhost, every time we deploy a new instance, we monitor the service.

Yes, there was a question there. Yes, you need to clean out nodes that disappear; in Puppet that's basically puppet node clean. And Nick Steele will be talking tomorrow about Sensu; Sensu takes a drastically different approach to that, but I'm not going to give his whole talk.

To me there are two ways of using the monitoring tools. You stick with Nagios if you have an environment which is pretty fixed, which is only going to grow, because it's easy to configure, there's a lot of community around it, and there are a lot of tools you can already reuse. If you are running the kind of environment where you've got 200 nodes today, 30 tomorrow, 16 new ones popping up the day after and 5,000 three hours later, then we start thinking about new architectures, and we start thinking about tools like Sensu. And if you're really interested in that kind of stuff, a lot of the approaches to configuring it, and the idea that you also need to do that as infrastructure as code, still apply; only the tool is replaced.
You actually choose another tool, like Sensu; I think Nick will explain how to configure it with Chef, but the ideas stay the same. So, yeah, that was a couple of slides further on.

So now we're monitoring both the actual application and Apache, and for a lot of organizations that is a huge achievement. Who is using Scrum in here? Oh, not too many. Weird; I'm used to talking to developers, where pretty much everyone uses Scrum as a development methodology. So you guys don't know what the definition of done is, right? Who knows what the definition of done is? Ricardo. For a lot of people, DevOps is about introducing Scrum to systems people. We were really happy that we could include in a development team's definition of done that the software was actually monitored and in production. Because now we had actual software deployed, and monitored; it's done, right? To me, a software project is not done until your last end user is dead, until you don't run that application anymore. So, in which time zone do you need a wake-up call at 5:28? Or is that alarm just not going off?

Software development has evolved; the way we deploy software is not in two-week sprints anymore, where new features arrive at the end of the sprint, with monitoring included. No, software is constantly on, and we need to be able to monitor it and make sure that as long as there are users, we know it's up and running. So I'm not a big fan of the definition of done in Scrum anymore; I actually want to get rid of it. But if you only stop monitoring and maintaining the application when your last end user is dead, you need to figure out whether he's still alive. So how do you do that?

I've been in a couple of setups where we were not even practicing, we were actually doing an upgrade. It's a distributed system with stuff coming in on one side,
there are API calls being made, and the server is sitting there and we need to upgrade it. This was pre continuous delivery. So you sit there next to the developers and say: yeah, well, we're going to plan the upgrade. Okay, sure. So when can we do it? Well, when there are no users anymore. How do you know? Well, we don't know; we check the Apache log. Okay, so you keep watching the Apache log for four minutes, and then there's one new hit. So there's one user: can we kill him off, can we shut him out? That's one approach: you wait until there's actually nobody there anymore. The other approach is: well, we put a firewall in place, and it's going to throw errors and it's going to break things.

But the actual way to do this is measure all the things, which is one of the DevOps mantras. We really want to measure all the things, and sometimes we even measure too much. But the thing is, you can throw metrics away afterwards; you cannot recreate them, you cannot reinvent metrics. So we start measuring and monitoring from the very beginning. We want to measure deployment statistics. Oh, I was actually going to replace this screenshot with the actual front-end deploys of one of the Drupal sites we're doing, but basically we use a tool called Graphite. Who knows Graphite? Okay, four people. For the others: Graphite is awesome. It's a next-generation time-series database, and it's scalable, unlike stuff like Cacti and Munin. And I mean every developer in this room can send metrics to Graphite: you open a socket to port 2003 and basically send a metric name, a value and a timestamp. That's it. Who cannot write that in PHP?
You're not a PHP developer? Okay, then write it in shell and use netcat; whatever, it's trivial. What we do when we deploy new software is basically send a metric at deploy time, and Graphite allows us to draw each of those datapoints as an infinite vertical line. What that gives you is the point in time when new software was deployed, and when you map that onto the actual behavior of the application, you can go back and talk to the developer and say: you know, last Monday at 5 p.m. we deployed this new feature from your git commit. And they'll say: yeah, that's right. And after that our database queries went up to 20,000 per second; what the fuck did you do? And they'll be like: well, I enabled the debug parameter. And they'll disable it, and you'll see exactly when they introduced the fix, and they'll all be happy again.

So what do you want metrics of? You want metrics of a lot of things. You want to be able to know when you can pull the application down, so you want metrics on the actual concurrent users, and not the ones you see in the Drupal dashboard. If you're running a service, you want to see the number of sign-ups you're having, the response time of your service. If you have documents being generated, or a queue, you want to see how fast items actually go through. You want to see how many times people have restarted nginx and PHP-FPM, because people do that behind your back, because it's unstable by nature. And you want to figure out, with your team and with your management, what's actually specifically valuable for your specific application. And that brings you to the next point, where you can build self-service metrics: if your developer is interested in the number of times he does a certain operation, you just let him send metrics.
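The protocol really is that trivial; here is a minimal sketch in Python rather than PHP or netcat (the metric names and the localhost carbon address are assumptions for illustration):

```python
import socket
import time

def format_metric(name, value, timestamp=None):
    # Graphite's plaintext protocol: one "name value timestamp" line.
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)

def send_metric(name, value, host="localhost", port=2003):
    # Open a socket to carbon on port 2003 and send the line, exactly
    # as described above.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric(name, value).encode("ascii"))

if __name__ == "__main__":
    # A deploy marker: rendering this sparse series with Graphite's
    # drawAsInfinite() function draws a vertical line at each deploy.
    send_metric("deploys.www01.frontend", 1)
```

Rendering the deploy series with drawAsInfinite() on top of the application graphs gives you exactly the "what changed last Monday at 5 p.m." overlay described here.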
You just let him build his own dashboard. He'll also want to look at all the Drupal error logs of over 25 sites, correlated, so we need to give him tools which allow him to learn from his existing platform. And there's another tool I like to use for that, called Logstash. Who knows Logstash? Who is using Logstash? Okay. So Logstash is a tool written by Jordan Sissel, and Jordan is a guy who does anger-driven development: if he doesn't like something, if he's frustrated with something, he writes a tool to fix it. And trust me, he has written a lot of tools, maybe a bit too many; every month I discover he's written another one, even last week. Logstash can collect log files from everywhere, and if you see the top corner there, it actually gets metrics from Drupal logs; it knows how to understand and parse them. By default it can also take log4j, Apache, syslog, whatever format; it knows them. It can do some filtering on that, so throw away stuff; think back to the queuing and the small tools I mentioned that are each good at one thing. And then it can store the result in a lot of tools. Now, the default tool people want to store things in is Elasticsearch. Who knows Elasticsearch? Who has used the tool? So, think about Elasticsearch as the next generation of Solr. Not really 100% correct, but it's in that direction: it allows you to search your data really fast, and it's scalable
and clusterable. If you have Elasticsearch with a Kibana dashboard for your logs, you can actually have people search a distributed set of log files much faster, and it will also give you the opportunity to build metrics out of those logs, because one of the outputs available here is Graphite, the exact same tool I used before. So if you want to know how many failed user logins there were on your site, you can find that. If you want to know how many times Mollom did not scan correctly and didn't catch your spam, you can get graphs of that, better than the ones on the Mollom site. If you want to see how many times a captcha was shown because somebody tried to spam something, you can actually build those metrics on your own dashboards, and you can also integrate that with all the other error messages and all the messages in your system. There's also a Drupal module which allows you to ship logs to Logstash directly. However you want to do it, send the log files from Drupal to syslog, use this module, or use the plugin that grabs the logs from the database; there's no excuse not to ship those logs centrally and to correlate them. People have been doing this on a lot of sites before.

So now you have metrics, you have numbers and figures, and you want to get statistics out of this. And then there's a tool called StatsD, basically a tool written by the guys from Etsy. It has since been written in all kinds of languages, and it does the simple math for you. This is an example of an Apache log and the events correlated to it; here you see the graphs of how many events there were.
StatsD will build that for you. It will ship it to Graphite, and your developers can then build new dashboards, because Graphite also has an API; the thing at the top is basically something you can call from within PHP code, and people can then start building their own dashboards and sharing what they want. There's, surprise surprise, also a Drupal module to send things directly to StatsD. It sends metrics to StatsD, and StatsD flushes its metrics to Graphite. So this is basically a metric from a Drupal site, delivered straight into my Graphite system. So now I have a lot of metrics, a lot of things to play with.

What's the next step? Do something with it. What do you want to do with it? I saw a slide this afternoon with a disk-usage metric and a threshold: well, if you reach this point, you're going to be screwed. Frankly, I don't care about that anymore. What I want to know is the speed at which my disk usage changes, because then I'll know upfront when my disk usage is going to reach a certain limit, and I'll know, when my site is active, that it's going to be in four weeks. But when suddenly activity changes, and there's a huge change in the normal growth, or even a huge drop in the normal load, that's when something's going wrong; that's when there's a real anomaly in your platform. So it's time to go back to our math books and reread all that stuff about acceleration, and even go a step further, to actual statistics. Graphite allows you to do a lot of these things.
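That rate-of-change idea is easy to sketch. This little Python example (samples and thresholds are invented) computes per-second deltas the way Graphite's perSecond() function does on a series, and flags a sudden jump instead of waiting for a fixed threshold:

```python
def per_second_rates(samples):
    # samples: list of (timestamp, value) pairs, e.g. disk usage over time.
    # Returns the per-second rate of change between consecutive samples,
    # which is what Graphite's perSecond() function computes on a series.
    return [(v1 - v0) / float(t1 - t0)
            for (t0, v0), (t1, v1) in zip(samples, samples[1:])]

def looks_anomalous(rates, factor=3.0):
    # A deliberately crude check: is the latest rate more than `factor`
    # times away from the mean of the earlier rates? Real setups would
    # reach for proper statistics (forecasting, deviation bands).
    history, latest = rates[:-1], rates[-1]
    mean = sum(history) / len(history)
    return abs(latest - mean) > factor * max(abs(mean), 1e-9)

# Steady growth of 1 unit/second, then a sudden 8x jump in the last interval:
rates = per_second_rates([(0, 0), (60, 60), (120, 120), (180, 600)])
print(looks_anomalous(rates))  # prints True: the jump from 1.0/s to 8.0/s
```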
It allows you to map things onto what happened last week; it allows you to do forecasts; it allows you to do all the stuff that statisticians really have fun with. And if there's such a huge drop in your system, you also want to get an alert on it. This is an example of how to do that, on a Java JMX metric if I'm not mistaken, but it could be used for any other metric: if you see a certain drop, you can check that on the graphs and trigger a Nagios alert based on that drop. Or you have a system where, as long as over the past 50 minutes we stay over that load, or over the past half hour we stay over those actions, it's fine; but if it lasts longer, and we've learned that from the graphs, then there's an issue, and in the same way we can trigger alerts on those graphs. This code snippet isn't that interesting for you guys.

So we have monitoring which is automated; we have metrics from the monitoring and from the usage of the platforms; and we build dashboards with that. Dashboards which the developers can self-service, where they can learn and add new things, but also things we care about, like MySQL replication lag. One of the tools that allows you to build those dashboards, and there are plenty of them out there, is GDash. It's just a simple templating engine which allows you to write the metrics you want, one per line; it's something you can template and hand to developers, and they can start building their own dashboards and sharing metrics.

So what's the next thing people need to do?
They want to start learning from that data; they want to start doing machine learning and big-data analytics on those platforms. A lot of the large-scale sites are already doing this: if they're running an e-commerce shop, they really want a forecast of what's going to happen, and that's where machine learning comes in.

And then we're down to the last part of CAMS: it's about sharing. Part of sharing is visualizing: visualizing the revenue, visualizing the sales turnover, visualizing the sign-ups. Part of sharing is basically doing that on dashboards, and this is a screenshot of one of those dashboards you can build with Graphite and GDash. And part of the story is sharing these dashboards, like in the coffee room or the lunch room where everybody sits together, where everybody meets. CEOs will walk by the dashboard, wonder what that spike is, and they'll talk to the operations people and they'll talk to the developers: what happened here? And you create discussion, you share experiences. Sharing experiences, that's also what DevOps is about, and what a lot of people are doing in this DevOps track. The thing I want you guys to share now is your feedback. That was basically my message to you today. Any questions? Nobody wants to know where the beer is? Okay, well, thank you.