Okay, please take your seats. Let's start. Welcome to my presentation, "Have you been stalking your servers?"

Two things I would like to say at the beginning. First, this is a beginner-level presentation, so if you already know about monitoring you will probably get bored. I won't take any offense if you decide to leave during the presentation, because time is precious. Second, my business partner and friend was presenting a session at DrupalCon Sydney this February, and he realized that there was one guy sleeping in front of him during the presentation. He got really nervous because he thought, "Oh my god, I'm being boring, people are sleeping." Once again, I know that Czech beer is beautiful, so if you fall asleep I will take it that you had a really good night last night, and I will take no offense. You need your beauty sleep.

So, just to introduce myself. I have been a sysadmin for 13 years. I'm from the Czech Republic, lived in Prague for many years, and moved to Sydney 10 years ago. I think it was when I started working for the biggest telco in Australia, about four years ago, that I became a DevOps engineer, because we were doing a lot of Puppet and continuous integration. Two years ago I set up a company called Morpht together with my colleague, and we do Drupal services, Puppet, deployment to the cloud, beautiful stuff. So monitoring is not my number one passion, but by talking to our clients I realized that it's very much needed. I don't like going to servers without having any data to investigate. So this is just the bare minimum you need to know if you want to start with monitoring.

This is my first big presentation like this.
So when I was doing a little research on how to start, I realized that there is this one rule everybody recommends, the rule of three. Apparently three is perfect: omne trium perfectum, everything that comes in threes is perfect. So even when presenting, we are supposed to give you three things. There are two approaches: "these are the three things I want you to remember when you are leaving this room", or "this is how it was done in the past, this is how it's done now, and this is what the future is". So first, I would like you to know what monitoring is and why you want to start monitoring if you haven't yet. Then we will look at what tools are ready for you to use, even if you haven't been exposed to them before. And third, I will show you how easy it is to start, and I believe that you will be able to do it yourself this afternoon.

So, part one: what is monitoring and why do you want to monitor?

Broken glass. I worked at a brokerage company here in Prague, I think it was in 2001, and they wanted to get access to the New York Stock Exchange to be able to offer Czech customers trading on the New York Stock Exchange. The bosses went to the US trying to get an API. They didn't get any API, so they got a web interface which was meant to be used by a person, brought it back to Prague and said, "Guys, you have to parse the HTML and basically hook our own website, which our clients trade through, to this system, to pass the trades to the New York Stock Exchange and back." So we spent maybe two months developing this: SSL sockets, JavaScript, parsing, it was a lot of fun. We started trading at about 2 p.m., because with the time difference the New York Stock Exchange was opening at 2 p.m. Prague time. And five minutes later I have these three sales guys behind my back saying, "We are not trading, something's broken, we are losing money."
So we looked at the website and it had completely changed. Apparently the provider changed their web interface without telling us. Why would they, right? We were just parsing their HTML. So I realized that we had completely underestimated the monitoring part. What we could have done is, for example, have a little robot trying to sell one little ticker every five minutes, just doing a little operation every five minutes. If we had learned in the morning that it was broken, we would have had much more time to fix it. So I had a really bad three hours of my life with these sales guys behind my back. I managed to fix it, but I don't want to do that again. So I started monitoring.

I don't want to scare you with this definition. This is the only definition I have, but I really like it, because it's actually not from the IT area, it's from the nature conservation area, and it says exactly what monitoring is: a series of observations in time, carried out to show the extent of compliance with, or degree of deviation from, an expected norm. I really like that one.

So I'm going to offer you a few reasons why you want to monitor. The first one is that you want to know the bad news before your customers do, or at least before your boss does, as in my story. You want to scale up your servers in advance: if you are going to run out of disk space, or if your CPU is getting hot, or you swap a lot, you want to know that. You want to tune your application: maybe extra modules were enabled recently, maybe you have more customers now and the application became slow and you don't know it, so you want to monitor the response time. You want to prove your uptime to your customers. Let me take just a one-slide detour. This is called five nines.
It's actually an unusual unit you can find on Wikipedia, together with Sydney Harbour as a volume measurement, or a bus, which is apparently used in London. So five nines would be five minutes of downtime per year, which is like six seconds a week. These numbers are very often in SLAs; sysadmins and management want to see them. And when we have our services in the cloud we see these numbers, but I think they are very often a guess, and the providers wish that we don't know what this really means. I read somewhere that five nines, the six seconds a week of downtime, is what is considered an uninterrupted service if you have a power grid supporting a city. But I think we remember Google being down for two minutes three weeks ago, or Amazon having some trouble, a 30-minute outage, recently as well.

So let's go back to the reasons. You want to minimize your downtime; it's expensive when you are not up and running. Also, maybe you want to capture your customers' behavior: maybe they trade or use your application during the lunch break or on Sunday afternoon, and you want to know that to be ready. Maybe you had an ad running and you want to analyze the success of the ad. You want to have data to diagnose: when something happens you want to be able to go back and see what actually happened.

So, having some data, I will show you a few examples I find interesting. You can watch out for trends. Here we have tape usage and you can see it's growing slowly over two or three years, so I can see the speed and I know when I will need another tape. That's a trend I can see from here. Then I can watch out for spikes. This is a monthly load average graph and I can see there is a spike every week. It's probably Sunday night, when the weekly cron job is running, maybe
Maybe Backing up data dumping my SQL database G zipping lock files I want to know that that are these CPU spikes because maybe my application is is hurt at that time And if there were some clients they would have below average performance You can watch out for irregularities. If you this is a example of a memory is each graph during a day You can see that the server has one gigabyte of memory and suddenly You have like a spike when it got into swap. So the server started swapping. So since you have gig is in swap and The application actually claimed much more memory because Linux overcomments memory. It gives you more memory that it has Hoping that the application won't use it all but the application actually asked for three gigabytes of memory you want to know about this because if You start swapping everything slows down in the better case and The another thing to watch for are thresholds. This is like this use it this usage in person So, you know, you don't want to get your file system full So there is a threshold of maybe 92% and you look at this graph and see hey, I'm reaching the threshold This graph actually makes me nervous. This is not good. It's good to have it Okay, so it was just an example how you can use data you have collected Let's have a look at areas you Want to monitor or can monitor. So the first one Would be network, right? Like do you have any reds in your data center today? Like the cables there What's your network connectivity? Are there any drops? Is it the speed? What do you expect? Is the pink time? What do you expect you want to know that? Then you want to monitor your server Is it performing as expected like file system CPU is it swapping? What what are the iOS app system doing? 
You know, probably all of us know S.M.A.R.T., the system for self-monitoring of hard drives, which was able to tell us, "Hey, I'm about to die." It's probably not used as much anymore with SSD drives, but that was a good example of that.

Then you want to monitor your services. Just because your services are running doesn't mean that they are replying the proper way. Maybe a 503 is getting back to the customer and you don't know about it. Even if it happens once a day, you might overlook it in the log file, but you definitely don't want the customer to have this experience. And you can, you should, monitor the application. Once again, just because your site is up and running and your health check gives you OK, it doesn't mean that the customer won't see something like this. Sorry.

And I think you also want to monitor users. Maybe they are misbehaving. Maybe there are too many of them. Maybe you want to set up password aging.

Okay, so these are the areas which I, as a sysadmin, would like to monitor, and now I'm going to look at it from the Drupal point of view. So, network.
I think I will skip network, because most of us, or maybe it's too much to say most, many of us have services in the cloud, and the network monitoring is not really in our hands. We basically decide to trust one of these providers and say okay. But we still want to monitor our server. We still want to monitor our services; this time we probably focus on the web server and the database, the two most precious parts of a Drupal stack. The application is our precious Drupal site. And the users, here we go. There are the system users. If you have a small server, a little LAMP stack somewhere, people usually log in just once a month or once a fortnight. I actually saw people setting up a little script which sends you an email every time somebody SSHes into the server. You don't mind the mail when it's actually you, but you definitely want to know when that happens and it is not you, because that means your server is compromised. That's one of the easiest, cheapest ways, but you want to know that. And users from the Drupal point of view: maybe suddenly there are too many users, you know, your site got compromised and there is a robot setting up user accounts just to use them for spamming later on.

Okay, so that's the first part, and now I would like to show you the tools which are available. Let me check the time. I have to go a little bit faster. So, Nagios and Munin are two tools which have been around for years and I believe they are the easiest to start with. I also added something which is not actually a real monitoring tool, but I will explain why I put it there in a second.
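Going back to the little script that emails you on every SSH login: the talk does not show it, so the following is only a sketch of how such a script might look. It assumes the usual OpenSSH log line format ("Accepted <method> for <user> from <address> port <port>"); the function names, the sender address and the sample host name are hypothetical. It builds the message but does not send it.

```python
import re
from email.message import EmailMessage

# Assumed OpenSSH log format, e.g.
# "Oct  1 12:00:01 web1 sshd[1234]: Accepted publickey for alice from 203.0.113.5 port 51234 ssh2"
ACCEPTED = re.compile(r"sshd\[\d+\]: Accepted (\S+) for (\S+) from (\S+) port (\d+)")


def find_logins(lines):
    """Return one dict per successful SSH login found in the log lines."""
    logins = []
    for line in lines:
        match = ACCEPTED.search(line)
        if match:
            method, user, addr, _port = match.groups()
            logins.append({"user": user, "addr": addr, "method": method})
    return logins


def build_alert(login, host, recipient):
    """Build (but do not send) an email describing one login."""
    msg = EmailMessage()
    msg["Subject"] = f"SSH login on {host}: {login['user']} from {login['addr']}"
    msg["From"] = f"monitor@{host}"  # hypothetical sender address
    msg["To"] = recipient
    msg.set_content(
        f"User {login['user']} logged in from {login['addr']} "
        f"using {login['method']}. If this was not you, investigate."
    )
    return msg


if __name__ == "__main__":
    sample = [
        "Oct  1 12:00:01 web1 sshd[1234]: Accepted publickey for alice from 203.0.113.5 port 51234 ssh2",
        "Oct  1 12:00:09 web1 sshd[1240]: Failed password for root from 198.51.100.7 port 2222 ssh2",
    ]
    for login in find_logins(sample):
        print(build_alert(login, "web1", "admin@example.com"))
```

On a machine that can deliver mail, the message could be handed to `smtplib.SMTP("localhost").send_message(msg)`; where the log file lives depends on the distribution.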
So, Nagios. Nagios is an application for system, network and infrastructure monitoring. It monitors and alerts when something goes wrong, and then it alerts again when it goes back to the normal state. It can provide monitoring of network services, all the protocols you can see there and of course more; host resources such as disk usage, load, logs; and basically anything else which can be executed by a script, so you can use it for temperature alerting. There are many plugins. Anything which is executable from a shell script can be a plugin you use from Nagios and get alerts based on it.

Why did I put the pronunciation in the brackets there? I really like this. Nagios used to be called NetSaint, but there was another project with the same or a similar name, which got copyrighted, so they had to change it. They decided to go for Nagios, "Nagios Ain't Gonna Insist On Sainthood", and agios is "saint" in Greek. So that's why it's Nagios. It's really nice.

So, Nagios. It alerts you when something goes wrong, then it alerts you when it goes back to the normal state again, via email, pager, SMS. Once again, whatever you can talk to via the command line, whatever is executable, can be your alert system. You can have different contacts, so all the Debian server alerts go to this person, all the Windows alerts go to that person. You have notification escalation, so if the sysadmin does not acknowledge a problem, maybe it goes to his or her manager an hour later. You can set up these rules. You can set up dependencies. If the server died completely, you get an alert about the server being dead, and there is no point in getting alerts about Apache not being available and MySQL not being available, because for those the server must be up and running, right? It's a dependency. You can do the same with the network. Say you don't have connectivity: why would you
monitor or alert on HTTP not being available? Makes sense. There is a concept of soft and hard states. If there is trouble and Nagios cannot reach a service, it tries several times before it actually goes to the hard state, just to prevent flapping.

This is what you get when you install Nagios, a kind of control panel, out of the box. You can see that localhost is the only host it monitors, because the server it's running on is what it monitors by default when you install it. There are a few services which you get straight away, like the current load of the server, current users, total processes running. Then the status: green is OK, yellow is warning, and red means something happened. You can see when the service was last checked, how long it has been in that state, and how many attempts were made. So you see this "1 out of 4" is OK, and this one was tried four times before it became a warning.

There are Nagios add-ons. The most important one, at least I believe so, is NRPE, the Nagios Remote Plugin Executor, which basically enables the Nagios server to connect to a server you want to monitor, maybe your LAMP stack somewhere. There is this daemon, and through it you can run any plugin. You can write plugins yourself; a plugin is anything executable: shell script, Perl, Python, binary. It's the way Nagios can connect to NRPE and then execute the checks you want locally on the server. But nothing is stopping you from having a plugin which actually checks your website somewhere else from this server, so it's very flexible.

The other important one is NSCA, which enables you to do passive checks. So if you have an asynchronous application, something which realizes, "Oh, I'm in a bad state,"
it actually has a way to tell Nagios. Nagios usually asks actively, but this time the service, when something is going wrong, has a way to push the message to the Nagios server.

There is an integration with Drupal. You can see this Nagios module from Drupal.org, and there are a few things it checks: is your Drupal core up to date, are the modules up to date, do you have an unreadable files directory, are there any pending updates you need to run. So this just behaves as another plugin for Nagios.

Okay, so that was Nagios, which is there to monitor and alert, and now let's go to Munin. Munin, in Norse mythology, is actually a raven, one of the two ravens which bring information to the god Odin. It's an application which provides network and system monitoring. This time the outputs are graphs which you access via a web interface. There are also many plugins available for Munin, and it has a master/node architecture. So we have one Munin server which connects to each node, each server you want to monitor. On each server you want to monitor there is a Munin node running, a collecting daemon which collects all the data about that server, and the master takes it, usually every five minutes, and graphs it. It uses RRDtool, a round-robin database tool designed to handle time-series data, which is very popular with other projects as well. Do I have this graph? I'll tell you why, because with RRDtool you always know what the disk footprint will be, because it always has daily... no, actually, this is not the right slide. It has daily, weekly, monthly and yearly data; you can see one year and nothing else. So it will never be bigger than this amount of file system data, regardless of how long Munin has been up and running.
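The fixed footprint comes from the round-robin idea: a fixed number of slots that get overwritten once they are full. Here is a toy sketch of that idea in Python. It is not RRDtool's file format or API; the class name and parameters are invented for illustration, and real RRDtool offers several consolidation functions while this sketch only averages.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive: keeps the last `slots` consolidated values."""

    def __init__(self, slots: int, steps_per_slot: int):
        self.steps_per_slot = steps_per_slot
        self.values = deque(maxlen=slots)  # the oldest value drops off when full
        self._pending = []

    def update(self, sample: float) -> None:
        """Add one raw sample; store an average once enough have arrived."""
        self._pending.append(sample)
        if len(self._pending) == self.steps_per_slot:
            self.values.append(sum(self._pending) / len(self._pending))
            self._pending.clear()


if __name__ == "__main__":
    # Three slots, each the average of two samples.
    archive = RoundRobinArchive(slots=3, steps_per_slot=2)
    for sample in range(1, 11):  # samples 1..10
        archive.update(float(sample))
    print(list(archive.values))  # only the three newest averages remain
```

With one such archive per resolution (day, week, month, year), each holding a fixed number of slots, the storage never grows no matter how long the collector runs, at the price of older data being kept only as coarser averages.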
This is an example of Apache graphs. This is the Munin dashboard: when you go to the front page of one server you can see how much time Apache spends responding, how many Apache worker processes are running and how many are free, the load average and the memory usage. There are about 30 of these; just by installing Munin you get 20 to 30 of these. And you get historical data. If I go to the detail of the memory usage, as I said, it's day, week, month and year. So, okay, this is a day. It's not that interesting, but here I can see little spikes. Is my application leaking, maybe? Or maybe there is a daemon being restarted every day and it's just normal that it keeps taking a little more memory as it's running. But what's interesting here, I can see from this yearly graph that this server had about half a gigabyte of memory, and the red one is memory in swap, and it was overcommitted, until in June this year somebody, or rather me, realized that it was running out of memory and doubled it, and then it became normal. So I can see what has been happening with the server.

Once again, there is an integration: there is a Munin module on Drupal.org which you can install, and Munin will talk to it. You can get, for example, how many users are logged in. There are many more. You can write your own plugins as well; I think this module lets you define your own plugins. This is my colleague playing with a plugin extension in his sandbox. He created graphs of how much content there is by content type, so he has this number of blog posts, pages and stories together, how many users, and you can see the comments.
For example, you can see they are growing slowly, so the site is probably being spammed slowly.

So these two, Nagios and Munin, are not the same kind of tool. They complement each other: Nagios alerts on thresholds, and Munin provides you with the metrics. You look at Munin and ask what's different today compared to yesterday. If you put them together, you get an alert from Nagios, then you can go and look at the Munin graph and see, "Oh, I see, the memory was slowly growing, growing, growing, and then it died."

And the last of these three I wanted to introduce to you is APC, which is not a monitoring tool. Probably most of you know what APC is. It's an opcode cache to speed up the generation of the HTML. Instead of compiling the index.php of your Drupal site every time it's accessed, you just keep the opcode in memory, so it isn't compiled again, it's just executed. It's inside your web server; it's not a web cache. People do have this installed, I'm pretty sure, it's a normal thing. But what people usually don't know is that there is actually this one little script called apc.php which comes with the APC PHP package. You can just copy it into the document root of your web server and you get these graphs straight away, and you can see how much memory your Drupal site is using and what the hit/miss ratio is. I'm telling you this because I know that people use APC, but very often they don't tune it. When you install APC on Debian and Ubuntu it comes with 32 megabytes of memory, and if you enable a few modules in your Drupal site you will go over that, to 40 or 50 megabytes, and then you end up with the cache filling up and flushing every few minutes, maybe, and you say, "How come my local machine is so slow?"
Well, because your APC cache is actually slowing it down by filling up, not having enough memory, purging the pages and then populating the cache again. So you want to have a look at this page now and then to find out what's going on there. Do I have any fragmentation? Has the cache overflowed? You get more statistics about fragmentation.

Because I realized that this APC, what I call the APC dashboard (I actually didn't find a name for it), is not a monitoring tool which would comply with my definition, I also provided a Munin plugin which monitors APC. Here you can see these spikes in memory. It's probably when it overflowed and got flushed, because you reached the maximum, and then it got populated again, flushed, populated again, flushed. I don't like this graph; it means it needs tuning. This is the same representation, but with the number of files as opposed to memory in the previous one.

Okay, so I wanted to introduce these three to you as the tools to start with, but there are also other tools which are very popular. collectd and Graphite, I believe, people use together. collectd collects the data on each server you want to monitor but doesn't care about graphing it, and then you use Graphite to graph the data. Shinken is a binary replacement for Nagios. I think Nagios has been around for 15 years and that interface looks like it's from the 90s.
I believe that Shinken is like a modernized replacement for Nagios. Sensu is a new monitoring tool which is meant to replace Nagios as well. New Relic is a commercial monitoring tool which can monitor your server, but also look inside your application by having a special plugin, even for Drupal, where you can see traces of what was happening. There are some freebies, but I believe you don't get the historic data. And Pingdom is an example of a service which can monitor your server remotely: is your web page up and running? It can send you an email when your page does not respond. You can go and register and get it for free, I believe.

So, two parts gone. The first one was what monitoring is and why we want to monitor; in the second one I introduced you to a few tools; and now I want to show you that it's easy to start. Let me check the time.

To install these is easy. I'm making an assumption here: I learned that most people running their dev and stage servers use Debian-based distributions, very often Ubuntu, so this assumes Ubuntu. Enterprise clients run Red Hat and Red Hat-based distributions, but this is for Ubuntu. Another thing I would like to say: when I talk about starting monitoring, I mean that you can actually start monitoring even your dev environment. Even if you are running your LAMP stack in a Vagrant machine, it still makes sense to put Munin there, maybe even Nagios, just because when the file system is running out of space you want to know that. You don't want to be getting these awkward responses from the website you're testing, not knowing what's happening. You might get an email before that happens.
You want to see that your APC cache is healthy and that you get maximum speed from your Vagrant machine. Of course, it's a little bit silly to have the Munin server on the same machine you want to monitor, on the LAMP stack, because when it dies, of course it's not being monitored. But you can still see from the graphs, from the metrics, what was happening before it died, so it still makes sense. I actually do it myself.

So you just install a few packages. munin is the server, the central node, and munin-node is the collecting one. For Nagios you just install this one. With the APC dashboard you just take this apc.php script, which comes with the package, and stick it in your website. But to get a little bit more out of it you have to play with it. There are many guides on the web which show you how to do it. But if, like me, you provision your LAMP stacks again and again, a new one for every project, you don't want to do it again and again. So how can we automate it?

Because Puppet is one of my passions, I found a way to put Puppet into this presentation. I'm pretty sure you guys know what Puppet is. In one sentence, it is a system for automating system administration tasks. That is the old definition. The new one is: an open source configuration management tool. In three steps: first, it's a declarative language for expressing system configuration. So you are not defining the steps, how you install this package and then that package, but you describe with the declarative language what state you want to have.
So this is what I want to get. I can define some dependencies, for example that I have to install the Apache package before I start the Apache service. But if you don't specify these dependencies, it just describes what you want to have: I want to have these packages installed, these services running, and these files deployed there. So that's the first part.

Second, it's a client and server for distributing that system configuration. These days there are more and more people, even enterprises, who instead of using the client and server model, where the client is each server you want to provision and the server is the Puppet master, just get the Puppet code and apply it locally on the machine. It's a perfectly valid case; especially in your Vagrant machines, that's how you would do it.

And third, there is a library which realizes the configuration, basically the executors, maybe one for Windows and one for Linux. You basically say, "I want this file to be in this directory," and there is a library which takes care of that.

An example of a Puppet manifest. Here I'm saying that I want the munin-node package: ensure that it's installed. Then I want the munin-node service enabled, so if the server reboots it will get started automatically, and I ensure that it's running at the moment I'm executing this manifest. But it requires the munin-node package to be installed first, before the service is handled.

So, besides telling you what monitoring is, I also wanted you to get your hands slightly dirty by playing with Puppet a little bit, if you haven't yet. So I created a little repo which you can clone to your dev machine. Even if it's already running your LAMP stack, it's designed not to clash with it. Then you run puppet apply on that code and you should get your monitoring tools. So, the first step: you clone this repo. You can go and have a look. You can download it as well, of course, but this is the good way. Then you run puppet apply on that code. You just go down.
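The two ideas from the manifest example, describing a desired state rather than steps and requiring the package before the service, can be modelled in a few lines of Python. This is a toy model only, not how Puppet is implemented; the resource labels and the `plan` function are made up for illustration, and `graphlib` needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Desired state: what we want, not how to get there.
desired = {
    "package:munin-node": "installed",
    "service:munin-node": "running",
}

# Dependencies: each resource maps to the resources it requires first.
requires = {
    "package:munin-node": [],
    "service:munin-node": ["package:munin-node"],
}


def plan(desired, current, requires):
    """List the changes needed, in dependency order, to reach the desired state."""
    order = TopologicalSorter(requires).static_order()
    return [(name, desired[name]) for name in order
            if current.get(name) != desired[name]]


if __name__ == "__main__":
    current = {}  # a fresh machine: nothing installed yet
    for name, state in plan(desired, current, requires):
        print(f"ensure {name} is {state}")
        current[name] = state  # pretend the change was applied
    print(plan(desired, current, requires))  # nothing left to do
```

Running the plan a second time yields no actions, which is the practical benefit of the declarative approach: applying the same description repeatedly is safe.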
So you say puppet apply, so it's applying the code which is in this directory; you can go and browse it. When you're done, and it will take a few minutes just to install these packages, you get Munin, you get Nagios, which will alert you at your email address (I will show you how to change it), and you get the APC dashboard you can have a look at.

In that repo you find this manifest, a .pp file, which is the main file describing what's happening there. I'll just go through it quickly with you. Here I'm provisioning Munin, and as a parameter I'm saying that I want basic authentication to use the user munin and the password prague2013. You can change it and apply it again, so you get your own password protection. The same with Nagios. Here you can put your contact email; provided that your machine is capable of delivering email, you can just put your Gmail or whatever you use, and you will start getting email alerts from Nagios. And this is the password you use when you connect to the web interface; nagiosadmin is the default user, I didn't change it, that's the default which comes with the Nagios package. And here is how I deploy the APC script: this is where I take it from, and I write it to the document root of the web server and restart Apache. I don't do it if it's there already, and I require the php-apc and apache packages to be installed.

So this is an easy way for you to start monitoring. If you want to, you can just run these three commands and play with it. Let me check the time. I think we have a few minutes left.

So, some fun. Can anybody tell me what's wrong with this? This is memory usage by day. What can you see? Shutdown? Memory leak? Yes, I would read it the same way. Shutdown... I would say it's more likely a memory leak, and then Munin itself got killed; the collector process might have been killed because of memory not being available.
Maybe the server even crashed before somebody rebooted it, and then it came back, and it's probably going to happen again. So you need more memory, or you're going to tune your applications.

And the next one. This is swap in and out during a day. Good: not enough memory. Once again, something is happening at 12 noon; suddenly you can see pages going in and out of swap, in and out, and then again in and out. You want to know that. This is too much; you want to fix it.

Okay, so this is the repo I was referring to in the Puppet code. I will try to show you what you get. I started a Vagrant machine just when we were starting. Let me see whether I can handle it. So here I started a Vagrant machine: I destroyed the old one and started a new one, so I got a basic Ubuntu machine with, I think, nothing on it. Then I went inside the machine, so I'm logged into my server; that's where you will probably start. I'm installing git to be able to clone the repo. I could have downloaded it via wget, of course, but I like having the history. And then, as I said, I'm running puppet apply, telling it where the manifest is, and you can see that Puppet is running, provisioning packages, installing Nagios, deploying the Nagios configuration. So you can play with this and see what's going on, and then go to the manifest, which is pretty simple. I tried to put comments in and keep it as simple as possible for you to get your hands dirty and play with it. And here it finished. So let's see. I did it exactly at 12. This is the repo on GitHub with the instructions, so you guys can just run them on your machine. I opened this page before it was ready.
So now there should be one hour of statistics. You see it started just an hour ago, so you will get more and more. And then I have Nagios. The localhost is here, and you can see you get some services: current load, current users, disk space. It's red because Nagios believes that there is not enough space, because of the way Vagrant shares the file system, but it's a false positive, I believe. HTTP, SSH. And it will alert you at the email address you can provide in the manifest. And then you get your APC. There is nothing graphed yet, because there is no website; it's an empty Ubuntu box. But you can see, I think, how much memory you have. So this is what you get. Any questions?

You're saying that Shinken has a newer UI than Nagios, right?

I was saying that Shinken is a binary replacement for Nagios. One of the developers basically decided to make bigger changes, the other Nagios developers didn't like that, so he forked the project and created Shinken. But I haven't touched it myself. From what I saw in the screenshots, it looks different, more modern than Nagios does. Nagios is kind of...

The question was, what are the downsides of Shinken?

I have no experience with Shinken. I just know that people use it.

And one more thing: are there any third-party services that can be used instead of maintaining a local installation?

Yes, I think I showed you New Relic, for example. It gives you a good start. You get about five different server graphs, like CPU, load, memory and file system, and you get insight into your PHP as well; you can see how much time was spent in the database. I don't use it myself, but I have used it before inside an enterprise organization. I believe that even as an individual user you can just register and get some freebie. You just won't have the historic data.
You will maybe have only the last 24 hours, but you can still connect your site to it and play with it. But this presentation was designed for your development servers, not for production, so that's why I didn't bother with third parties. I just wanted to give you a quick way to play with this locally. Anyone else?
Okay, is there a free, open source alternative to Pingdom? Pingdom is a great service, but it costs a little bit of money, and if you have lots of servers it costs a lot of money. And actually, if everyone in this room just monitored each other's servers, it would be kind of free. So I'm looking for an open source alternative to Pingdom.
So you can use Nagios exactly for that. Nagios has ping integrated in it, and Nagios has HTTP and HTTPS plugins. So you can have one Nagios server and just add another host to monitor. You can put the Nagios server on your LAMP stack: even if you have just one LAMP stack somewhere in the cloud, you can put Nagios on it and monitor other services from there. That's what I would do. I actually do that.
Thank you.
Anybody else?
Is there any way to monitor the update status of Drupal, like having different versions of modules that need to be updated?
Yes, on that slide covering the Nagios module. I'm nearly there, sorry for flipping. This is Munin... nearly there. So I believe that it's here. Yeah: pending version updates, pending Drupal module updates. That's probably what you are after.
So you basically get an alert if there's any update pending?
That's correct. It behaves as a plugin for Nagios, so it goes into a warning or critical state and sends you an email.
All right, thank you.
Anybody else?
Oh, just a quick question. I'm always paranoid about performance.
So I wanted to ask whether you have any experience with the daemon processes on the client servers somehow interfering with performance on those servers.
Well, of course it costs you something. The Munin monitoring daemon keeps collecting data, and Nagios keeps connecting and executing scripts every five minutes. But my point of view is that already when you are designing your application, you want to make this part of the application's resource budget. You need it. You don't want to save the 5%, or however much it is, on monitoring just to make the application run faster. Get a faster CPU or more memory, but don't be stingy on this. I think New Relic, which goes inside PHP, slows things down by a few percent. I think it's under 5%; I'm not too sure here, but you can measure that. But it's worth it: you get more data. So you shouldn't be stingy on that. Tune something else, but have the monitoring.
Is that it? Okay guys, it was my pleasure. Thank you. Please go and rate my presentation. I'm curious. Thank you.
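To make the Nagios plugin behaviour from the Q&A concrete, here is a minimal Python sketch of a check for the kind of swap in/out activity shown in the graph earlier. It is an illustration only, not code from the talk's repo: the function names and the thresholds (50 and 200 pages per second) are assumptions picked for the example. The exit codes 0 to 3 follow the standard Nagios plugin convention, and pswpin and pswpout are the swap counters Linux exposes in /proc/vmstat.

```python
import sys
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def parse_swap_counters(vmstat_text):
    """Return (pages swapped in, pages swapped out) from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    # Raises KeyError if either counter is missing from the input.
    return counters["pswpin"], counters["pswpout"]


def classify(pages_per_second, warn, crit):
    """Map a swap rate to a Nagios state (thresholds are hypothetical)."""
    if pages_per_second >= crit:
        return CRITICAL
    if pages_per_second >= warn:
        return WARNING
    return OK


def read_counters(path="/proc/vmstat"):
    """Read the current swap counters from the kernel (Linux only)."""
    with open(path) as handle:
        return parse_swap_counters(handle.read())


def main(interval=5.0, warn=50.0, crit=200.0):
    """Sample the counters twice and report the swap rate in between."""
    try:
        in1, out1 = read_counters()
        time.sleep(interval)
        in2, out2 = read_counters()
    except (OSError, KeyError, ValueError) as exc:
        print(f"SWAP UNKNOWN - {exc}")
        return UNKNOWN
    rate = ((in2 - in1) + (out2 - out1)) / interval
    status = classify(rate, warn, crit)
    print(f"SWAP {LABELS[status]} - {rate:.1f} pages/s swapped in+out")
    return status
```

Saved as a script that ends with `sys.exit(main())`, something like this could be registered as a Nagios command. Nagios would then put the service into a warning or critical state and send the email, the same way the Drupal update check described above does.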