All right, so I think it's 2:15, so we can get started. Hey everyone, thank you for coming out. To make sure you're in the right session: this is Open Source Logging and Monitoring Tools, part of the DevOps track. My name is Steven Merrill. Just to give you a quick introduction, I work at Phase2, and I'm also a volunteer on the Drupal.org logging infrastructure, and I'll be talking some about that in this talk.

So, generally about this talk: what I'd like you all to walk away with is an understanding of how you can visualize your data, which will probably be log data. We'll also talk some about actual application metrics, how we can combine those two things, and how you can do all this with purely open source tools. I will mention commercial alternatives, but there will be a big focus on open source tools like the ones that we're using. I also want to give you some information about how you can customize the output from certain common daemons that often host Drupal to get a little better information out of them. There's going to be a very strong focus on log aggregation, parsing, and search, which does include information about Drupal.org's own logging setup. There will be some information about performance metrics tools and how to feed those into monitoring. And then I have a two-machine demo of everything I'm going to talk about that we'll be using throughout the course of the talk.

The demo setup I have is two instances spun up on Google Compute Engine. They're both using collectd, which we'll talk about, to grab certain system metrics like the current load and the current CPU usage. There's a single Drupal instance that has a LAMP stack with Varnish in front of it, and that's using rsyslog to ship logs to the utility server, which is the other machine. The utility server is running rsyslog as a log collection agent, and it has everything else that we're going to talk about today. It's got Jenkins, although we won't look at that. It's got Graphite and Grafana. It's got the ELK stack (Elasticsearch, Logstash, and Kibana) for log aggregation and parsing, and also an implementation of the StatsD protocol to get actual metrics out of our log information.

So I wanted to jump in with a demo and show some of the things that we're talking about. I'll make this a quick demo, just a couple of minutes' worth of taking a look at what's going on in a particular site. I'm going to first go over and start some traffic against the sample Drupal site that we have, and then I wanted to give you a look at the tools that we're going to be using. So this is an instance of Kibana, which we'll talk about later. It's a search front end for Elasticsearch. It's generally used to show logs, but it can also be used to show any data that you happen to get into Elasticsearch itself; so if you are using Elasticsearch as a search engine, you can do that too. In this case, I've got a pre-configured dashboard that's taking a look at access logs. You can see here that I'm looking at all the logs from an hour ago to a few seconds ago. It's showing logs from three different daemons: Varnish in green, Apache in yellow, and Nginx in the light blue, DrupalCon-ish color. And the top view that we have is events over time, so we can see this is actually based on the header up here.
We're taking a look at all the logs in 30-second increments for the past hour. So I can see that at 1:45:30 there were 688 Apache events, 705 Nginx events, and 368 Varnish events. The ELK stack can also give you statistics, so I can say, for the particular timestamp that we've narrowed down to, what was the average duration in milliseconds of a request? In this case you can see my entire window includes about 1,000 Varnish requests, about 600 Apache requests, and about 800 Nginx requests, because Nginx is in front of all the daemons that I'm using here. I could also go back up and refresh, and you'll see that we're getting a considerable increase in the number of Varnish logs because I have that siege running in the background, and that's driving the traffic. So these logs are streaming in live. They're going from the Drupal instance to the utility instance, they're getting parsed by Logstash, sent to Elasticsearch, and this is the visualization on the front of it. We'll talk more through these later.

And then, of course, I also just have the entire raw data available. You can see this is the first 100 items in my set, and I've got the verb, the actual request, what page was requested, the response, the duration, the number of bytes, and you can see most of these are hovering around 0.1 milliseconds to be served out of Varnish. I could also drill down and get literally all the information about what's happening here. For example, I'm tracking which things are cache hits and cache misses, so I could filter down to say, show me only the Varnish cache hits. And when I do that, you can see that our mean duration in milliseconds goes down to basically zero, because of rounding. By filtering down I can see, okay, we have 1,300 Varnish hits, that filter has come up in my filter list, and I can remove it if I want to. Then we go back to showing the overall view, and of course I could again refresh this. The mean duration is going down because I'm lobbing Varnish softballs; we're getting Varnish cache hits.

Another thing that you can do, and that we do on Drupal.org, is parse Drupal messages. There will actually be precious few here. I'm again showing the same time picker on the top; this is, in this case, the past two hours of things that have come into Drupal and have been sent out as syslog messages. As you know, out of the box Drupal lets you use the database: you install the standard profile, you get the database logging module turned on. For high-performance, high-traffic sites, you generally like to keep load off the database, especially since if you start having a lot of errors, those will be logged, and that will increase database load when you may already be having some sort of problem. So for a lot of large sites, including Drupal.org, you can log to syslog instead. And then, of course, with rsyslog we can ship these over, and similarly we can parse out the plain text that Drupal gives us and get some more log information. There are, again, more fields than I'm showing here; I'm choosing to show the timestamp, the type of the log, the Drupal user ID that hit it (duid), and the log source, meaning which box this originated from. And, of course, we could again filter down, we could search, we could do any of these sorts of things, and I'll show how we use these on Drupal.org as well.
But this is an application of the same sort of stack to get these results. So that's a quick demo. In addition to that, I did want to show that there are many other things you can do with Kibana, Elasticsearch, and Logstash, the ELK stack. This is another case; this is actually from one of our internal Phase2 servers. In this case we have a seven-day window, again with the time picker on the top, showing all Nginx logs and when they happened. In addition to just showing all the logs, you can do specific term facets, for example. So we can say, okay, 88% of the requests that we received in the past seven days were 200s. Or on the right here, you can't really read the hosts, but this is broken down by host name too, so you could also ask, which are our most popular properties? Or you could filter down to say, show me only things that happened to come to our wiki.

Another example, and this has been anonymized, is that you don't necessarily even have to use this for live log collection. We had a case a couple weeks ago with a client who unexpectedly ran into a large amount of traffic, and they were trying to figure out exactly what happened. They didn't have any sort of automated setup, but they did have Apache logs, so we could just take the Apache logs, feed them in, use Logstash to add GeoIP information based on the source IP, and do the standard parsing. In this case we can see the traffic surge started up there, and then down over here, where it levels off towards the side, that's when we actually deployed a fix that increased the caching for a certain sort of page, and so the number of requests coming back to Apache lessened. And then finally, this is an example that happens to come off of the Kibana site itself. Kibana, which is a front end to Elasticsearch, can also show geo information on a map, for example, if you want to have some view of your data and say, show me all the Drupal page-not-found errors and what countries they came from. Or you can do things like, in this case, these are different file type extensions, and what were the top countries that viewed them. So there's really a decent amount of stuff you can do with this purely open source toolkit.

So let's jump back to talking about logs in general. Many different people have different definitions of logs. Generally logs are just time and data; the purpose of a log is to say that at a particular timestamp this thing happened. And the formats can be pretty different. A lot of newer applications can log to formats like XML or JSON, which can be read without needing any sort of intermediary parser. Some logs happen to be one line; Apache access logs are just one line long, but many, like MySQL slow query logs or Java stack traces, can be multiple lines long. And of course Drupal itself makes logs, and Nginx or Apache, Varnish, Jenkins, Elasticsearch or Solr, cron, sudo, MySQL all produce logs in whatever Drupal site you happen to be running. There are different types of logs. For things like Apache or for Drupal itself, we might care most about error logs: if there are any PHP errors happening, let's find them. Then you also have transaction logs, and that's what we might think of as Apache or Nginx access logs, right?
Just logs that a particular transaction happened and finished. The same goes for a MySQL slow query log, or even just turning on the general MySQL query log. These let you go back and do statistical analysis on what has been happening. And sometimes you might also want to ratchet up the logging level with trace or debug logs, like when you ssh -vvv to a server and get extra output; you can use that in some cases too. You might also have certain legal or other issues with logs. For example, if your site will have HIPAA data in the United States, you have to keep something like five to seven years' worth of logs available to be audited in the future.

And at least out of the box, if you don't have any sort of tooling, logs require shell access to view, right? The most common use case is to log them to disk, and then you use something like grep or sed or awk to look through them. And as I mentioned, they're not all necessarily human-parsable or designed to be used by humans out of the box; they're usually designed to be as concise as possible, and the tooling around them is more sysadmin-friendly than user-friendly. Of course, if you have these logs and you're keeping track of them, you can often solve problems with them. You could find slow pages or slow queries, or sort through the actions of a particular user on your Drupal site. You can get an average idea of traffic to one part of your site or one page on your site, like the anonymized example where a client was having a huge traffic surge and wanted to look through the referrers and figure out why and where it was coming from. And you can track new PHP errors as they come up.

So the first thing I want to talk about is what log formats we already have and how we could add some smarts to them to get more data. The first, most venerable one is syslog. These are two examples of different kinds of syslog messages. At the top, we've got the dnsmasq DHCP daemon with a message saying that it has acknowledged a particular IP address. So we've got a date, we have a host name, and then the rest of the line has the process name, optionally the process ID, and the message from it, and the message can be pretty much anything you want. Now, as I mentioned before, having Drupal log to syslog also looks like that: you can see that the program is drupal, and then it outputs all the rest of its fields in a pipe-delimited format. So in this case, someone installed the PHP module.

Now Apache logs, by convention, will generally be output in the combined NCSA format, which has been around for a couple of decades, and you can see it's got a lot of information. Out of the box, the request is going to be just what verb it was and what page was requested; in this case, someone was requesting /dashboard from this particular site. Nginx also chooses combined NCSA as the default, and so you can see similarly you have a GET request, it contains the exact path that was requested, and the same information about the number of bytes, response code, referrer, et cetera. Varnish has a daemon called varnishncsa that can output logs in the NCSA format, but it does something a little nicer for you in that it puts the actual host name and the scheme into the request, so that without needing to correlate a specific vhost's logs to the log entry, you know what was requested.
Now, I think they do this because with Apache or Nginx you might have multiple vhosts, and you can usually infer from the log's location on disk which virtual host or host name it was from, whereas you're probably only going to run one instance of Varnish on a server. So in this case you've got all that information right there, and you can easily tell from these Varnish logs exactly what the request was for.

Apache, Nginx, and Varnish all support additional types of output if you want to grab extra information and put it into the logs being written to disk. One thing that all three of them can keep track of is the amount of time taken to process the request. Varnish, if you're logging to NCSA format, can also track hits or misses, and all of these can put out any header that you want. That said, Varnish can be very, very verbose, especially if you're actually getting a 95 to 98% hit rate, so you may not always be able to capture Varnish logs on disk, but if you do, you can do this. And later on we'll look at how we can normalize all these things with Logstash to provide the same thing I did in the demo, where we get a consistent duration in milliseconds for each request, for example, or a consistent request field that always includes the host name like Varnish does.

So Apache has mod_log_config, which is included in the base distribution, and a single LogFormat directive in any Apache configuration file can be used to override the log format. The default NCSA format looks like this. Apache can output extra information: you can get the time taken to serve the request in microseconds, you can get any header you want by using the %{...}i directive, and you can also capture the port on which the vhost was listening. So if we wanted, for example, to reproduce what I showed, to pull out the duration and get a consistent request, we could have it do the combined NCSA format at the beginning and then also grab the duration in microseconds, the Host header, and the port. And again, with Apache you can just put this in any config file and it will override the default, so if you want to start storing more information, you can do it that way. An example of an old and a new log line, before and after the change, looks like this: time in microseconds, host name (or Host header, I should say), and then the port the request came in on.

Nginx has the http log module, again included in its core. With Nginx, though, you can't override the default format, so if you do want your own version, you have to define a separate one. The default definition looks like this, which is actually a lot easier to read than Apache's definition of the NCSA format. You can get the same information, although of course Apache and Nginx see things a bit differently: the request time is in seconds with millisecond resolution, and you can still get the Host header and the port. So similarly, you could define a logstash format if you wanted to track these extra things, and then before and after looks like this. And then varnishncsa, as we said, already gives us nice logs that include the Host header. You can pass a different log format to the daemon when it starts, and most of the escapes look the same as Apache's. Varnish, the third of the options we've seen, will give you seconds with microsecond precision, and it can give you hits or misses.
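To make that concrete, here's a rough sketch of what those log format overrides might look like. The exact strings on the slides may differ, and the format name logstash for Nginx is just an arbitrary label, but the escapes are the standard mod_log_config and ngx_http_log_module ones (%D is microseconds, $request_time is seconds):

    # Apache: redefine the "combined" nickname to append the duration (microseconds),
    # the Host header, and the port; this can live in any loaded conf file.
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D \"%{Host}i\" %p" combined

    # Nginx: the built-in "combined" format can't be overridden, so define a
    # new one and point access_log at it.
    log_format logstash '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                        '$request_time $host $server_port';
    access_log /var/log/nginx/access.log logstash;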
So again, if you have the disk capacity available to log Varnish logs to disk, you can also have a historical record of what your hits and misses were; in production you really may not. If you wanted to start varnishncsa in this mode, this is what you'd pass to it. And again, before and after: we already have a nice request field that's got everything we want, so we just add the duration information to it.

The next thing I want to talk about is how you can ship logs. The default for all of these logs is to be written to disk on the machine they came from, and if we want to collate them for analysis, display, and search, we'll need to get them somewhere else. A lot of different tools can be used to ship logs between servers. Both of the common syslog daemons, syslog-ng and rsyslog, can be used to do this. And rsyslogd can not only ship the syslog messages that daemons log, it can also be used to ship any other log file you have on disk. This may not always be optimal, but it does mean that if you've got either a Debian-based or a Red Hat-based system, you're probably already running rsyslogd, so you can ship log files if you want to. Logstash itself we're going to talk about mostly in terms of its ability to filter and parse and send logs to Elasticsearch, but it can also be used as a log forwarder if you want; you could tell it to monitor some log files and ship them off to another instance or somewhere else. Logstash Forwarder is a standalone Go daemon that can be used to forward logs. An interesting note about Logstash Forwarder is that it supports TLS encryption and compression, so when you connect it to a Logstash Forwarder input on your Logstash box, it will connect over TLS and compress logs as they're sent in transit. And there's another Python project called Beaver that has similar goals to Logstash Forwarder.

You also always have some concerns when you're shipping logs. One of them is what to do about queuing; in other words, if your centralized log server goes down, what's going to happen? If you have a box that is collating logs and you have 10 clients sending to it, and either its main network interface is overloaded or it needs to go down for maintenance, what happens? A lot of these log shipping daemons have configurable behavior to say, for example, it's okay to use up to a gigabyte of disk to spool messages, or to specify how many messages to keep. You can also configure most of them to say how many retries to make; in other words, if the server is not there, do I retry one time, 10 times, 20 times? And of course you have to think about security too, especially depending on what's in these logs. Hopefully your Drupal, Apache, and Nginx logs will be free from, say, personally identifiable information or cardholder information if you're an e-commerce site, but you have to be careful. Again, Logstash Forwarder and rsyslog can both be configured to ship over a secure channel using TLS encryption. And you might have to consider: do the logs need to be encrypted in transit? Do they need to be encrypted at rest? And who has access to them? I'm going to go through this next part very quickly; we'll put the slides up on the DrupalCon site.
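As a rough idea of the Logstash Forwarder side of that, its configuration is just a small JSON file; the host name, port, paths, and certificate location here are illustrative, and the Logstash box would need a matching lumberjack input configured with the same certificate:

    {
      "network": {
        "servers": [ "utility:5043" ],
        "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
        "timeout": 15
      },
      "files": [
        { "paths":  [ "/var/log/httpd/access_log" ],
          "fields": { "type": "apache-access" } }
      ]
    }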
But as I mentioned, you can ship logs with rsyslogd, and you're probably running a distro that uses rsyslogd already. So if you just decided, I want all my syslog messages to go to another server, you can put this in any of your rsyslog config files: you say all facilities and all log levels go to utility, the double @ means over TCP, and 514 is the standard syslog port. This will default to shipping anything that would normally hit either /var/log/syslog or /var/log/messages, depending on whether you're on a Debian-based or Red Hat-based distribution. Then, on the log host that's receiving all of these, you can say, if the source is not utility (I'm assuming a host name of utility here), then drop them. In other words, this helps you not flood your /var/log/syslog or /var/log/messages with messages from every other host, because you'll probably want to do something more like this instead, where you write each log out under a directory like /var/log/syslog/, then the host name it came from, then year-month-day, so that by default you get daily per-host logs in the default rsyslog file format on disk.

You can also, as I mentioned, use rsyslog to ship a file. In this case, if you wanted to ship the default Apache access log to another machine, you can use the imfile module for rsyslog, which again probably comes out of the box in your particular distro. You can tell it the input file name and what tag to give it, and that tag will end up as the program it shows up as on the remote side, because, as you'll recall from earlier, syslog logs have a program field, and this is what will show up there. Then you can control the severity and the facility, how often the file should be polled, and tell it to run the file monitor to watch things. And similarly, we mentioned queuing: rsyslog has the ability to configure all of these behaviors, so you can tell it where to place the spool files, what prefix to give them, and, most importantly, how much disk space it may use if the central log host is down, whether it should save them on shutdown, what type of queue to use, and how many retries to make if it cannot reach the main server it's shipping to.

Syslog-shipped log files look like this. The top entry on this slide is a normal local log file; this is what would be in your /var/log/messages or /var/log/syslog if Drupal logged to syslog on your server. The thing at the bottom is an example of shipping Varnish access logs from one server to another. You can see that the shipped syslog line looks pretty much the same: it's got the date, it has the host name, so this came from the host drupal, and, as I mentioned before when we set up the file shipping, it's coming in with a tag of varnish-access, and then it's got the full log content, so you can parse it out from there.

All right, so now we'll talk about automated log aggregation tools: how can you actually take these things and make sense of them? I do have to mention that, while we're obviously going to focus on the open source stack, there are a lot of competent third-party SaaS offerings that will do this. You've got things like Splunk, Sumo Logic, Loggly, and Logentries. So now we can start talking about the ELK stack itself. Logstash is a great tool to work with all sorts of logs. As I mentioned, it can be used as a shipper; you can run instances on your boxes to ship to a central one.
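Pulled together, the client side of that rsyslog setup might look roughly like this; host names, paths, and tags are illustrative, but the directives are the stock imfile and action-queue ones plus the forwarding rule:

    # watch a non-syslog file (the Apache access log) and tag it
    $ModLoad imfile
    $InputFileName /var/log/httpd/access_log
    $InputFileTag apache-access:
    $InputFileStateFile state-apache-access
    $InputFileSeverity info
    $InputFileFacility local7
    $InputFilePollInterval 10
    $InputRunFileMonitor

    # spool to disk if the central host is unreachable, then forward over TCP
    $WorkDirectory /var/spool/rsyslog
    $ActionQueueType LinkedList
    $ActionQueueFileName fwd_utility
    $ActionQueueMaxDiskSpace 1g
    $ActionQueueSaveOnShutdown on
    $ActionResumeRetryCount -1
    *.* @@utility:514

On the receiving host, a template plus a property filter along these lines would file everything per host and per day instead of dumping it into the local syslog:

    $template PerHostFile,"/var/log/syslog/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log"
    :fromhost, !isequal, "utility" ?PerHostFile
    & ~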
The main way that Logstash works is that it has three different pipelines: inputs, filters, and outputs. The inputs can also be specified to be in different codecs, so, again, a lot of the time you'll probably have to do some sort of parsing to make sense of the logs, like parsing out that combined Apache format, but if you happen to have logs that are already in a JSON format, you can just say that, and then they'll be parsed directly with their keys and values intact. The current version is 1.4.2, and the main page describes all the options for the input, filter, and output pipelines. In the context of this talk, we're mainly going to talk about shipping those to Elasticsearch.

So, how many folks here know of Elasticsearch? Who's already using Elasticsearch to do log aggregation? Anyone using it with Drupal as a search engine? All right, good, cool to see. Elasticsearch is similar to Solr: it's a web service that uses Lucene under the covers to do indexing. I personally think it has a much nicer REST API than Solr. For example, to create a new core, or rather a new index, in Elasticsearch, you can just post a JSON template saying what your field definitions are instead of having to specify a schema on disk. And to delete an index, you do a curl -XDELETE on the index, and it just works. So it's a nicer tool from the REST side of things, I think.

Kibana is what we looked at earlier; it's the third part of this. It's a viewer for logs. It does need direct HTTP access to Elasticsearch; in other words, all that stuff I was looking at worked because the browser I was in had the ability to make queries directly against Elasticsearch's REST API. So you might protect this with Nginx, Apache, something that ties in with your own internal security measures. It uses Elasticsearch features to show statistical information, and it's worth noting it's an Angular app: it's purely client-side, so everything just runs in the browser. As long as you have access to query the indices that contain the logs, it can show any Elasticsearch data, too. It has certain optimizations for looking through logs that have been sent by Logstash, but if you're already using Elasticsearch for search, you could certainly put together a couple of quick dashboards in Kibana to show that data. Logstash jars and packages also come with Kibana, so if you've already got Logstash on the box, you can just run it in web mode and give it an address and port if you want; by default it will listen on port 9292 on all interfaces.

All right, so the next part is a feature of Logstash called grok. Grok is more or less a standard library of regular expression patterns; it's a way to pull semantic data out of these standard log formats that we have, and it's implemented as a Logstash filter. An example of this is the NCSA format: it's got a standard shape, so you can literally just say, this stream of logs is coming in that format, parse out all these fields. There are two different things online you can use to take a look at the patterns: grokdebug.herokuapp.com and grokconstructor.appspot.com. So we can take a look here real quick at grokking; this is an Nginx log in the standard NCSA format. I'll just copy this, and maybe in the interest of time I'll just use Grok Debugger. Grok Debugger has three modes. You can go into the patterns view, and we can look through the built-in grok patterns; there are a large number of these.
So, as you can see, let's just take something like USERNAME. USERNAME is a named pattern in grok, and you can see that here's the regex that defines it as an acceptable standard Unix username. There are also things like the Linux syslog grok pattern, which actually combines several of these: SYSLOGBASE2, which is defined up here, and GREEDYDATA, which is .* and will match anything. And if you have a log line and you want to know what it matches against the standard library of grok patterns, you can put it in this and it will say, oh, okay, this is the combined Apache log pattern that it matches. So then I can go to the debugger and say, okay, here's my input and I want to use the combined Apache log pattern, and when I do this you can see that it's pulling out all the fields: it's grabbing the client IP, the ident and username, the timestamp, and so on. All of these are what will be sent into Elasticsearch for processing, and that's how we got the ability to segment down by request or by duration; we'll look at how we do that other bit shortly.

Just very quickly, Grok Constructor has a couple of different modes, so you can put a log line in here and say go, and it's going to work through it, maybe taking a little while. So there's the matcher here, or you can also say incremental construction, there we go, and this should come back. If you have a longer log line, this will basically show you every different pattern, so I could go back and say, okay, if I only took up to this part of the line, what patterns would match? Here it's showing you all the different patterns it's managed to find. Now, of course, the whole thing matches the combined Apache log pattern, or GREEDYDATA, because GREEDYDATA is .*, which matches everything, but it'll tell you it's found the common Apache log pattern, host names, IPs, and so on. So those are two things to use if you have existing logs and you want to know whether there's already a grok pattern: Grok Constructor will let you work through the whole line, and Grok Debugger is sort of a higher-level tool that will help you discover patterns in your logs.

So, configuring Logstash. As I said, Logstash just has three pipelines: the inputs, the outputs, and the filters. By default Logstash reads its configuration from /etc/logstash/conf.d, and you can put any number of snippets there, so it's also friendly to config management: if you know that on one server you're going to be parsing Apache logs and on another you'll have Nginx logs, you can just have snippets and put those in place with Chef or Ansible. You can use multiple files, and every file must have at least one of those input, filter, or output stanzas. And again, I'll be putting this up.

So, an example Logstash configuration to parse Drupal and Nginx logs. When we looked previously, we said send all these logs to /var/log/syslog/ and they'll be named with the host name and a date. So for the input we can just say, okay, Logstash will read from a file; it's going to look at all the files on disk that match this pattern. You can also use an exclude, so, for example, you might have logrotate or some other process bzipping files after 10 days, and it shouldn't bother to look through those. Type is a purely arbitrary tag you give it, so if you had multiple different inputs, like multiple file inputs or maybe a couple of logstash-forwarder inputs where you'd be shipping logs in, you could tell them apart.
You give it a sincedb path, and the sincedb is basically where Logstash keeps track of what line it's at in each file. You'll see later on that on Drupal.org we actually rsync and update files in place from the 10 or so servers that we've got, and with this it can still just keep polling, and when it sees new lines in a file, send them through for parsing. For the outputs, again, a lot of what we're talking about today is going to Elasticsearch, but later on we'll look at how we can output to, say, Graphite or StatsD to get some metrics out of our log information, and there are a great number of output plugins: you can output to S3, you can output to another file. You can use conditionals in the configuration language too, which we'll see in a bit, so you could have some tags and say, this is something we should index into Elasticsearch, and this is something we should just write to a file or send to Redis. Again, extensibility is one of Logstash's great assets.

Now, the filter pipeline is where all the interesting stuff comes in. Because these log files are being shipped by rsyslog, they're all going to have a type of syslog. So what I can do in here is go through and say, okay, let's take the previous message field and call it syslog_message, and the reason for that is that a lot of the grok patterns will also produce a message field, and you don't want that to be overwritten. Then we can say, okay, let's first do a grok matching the syslog line pattern; in other words, because in this example everything is being shipped over syslog, we can pull out the program, the time, and the log source. The date filter is something that lets you say, don't use the timestamp this log was received at as the timestamp, use the time that's actually in the log. In other words, if your log server was down, or if you suddenly have a large influx of activity, you might not index everything in the same second or millisecond that it came in; with this you can use any field that's already been matched and make that the real timestamp.

Then here's an example pattern: you don't have to use named grok captures, you can also use any regex pattern you want. So in this case, for that pipe-delimited Drupal format, we say, okay, in the message field, match this and grab the Drupal host name, timestamp, type, IP, request URI, referrer, and so on; in other words, pull all the fields out of that pipe-delimited format that Drupal's syslog module uses by default. Then, similarly, we can go through the Nginx logs: we can strip off the syslog bits, because we really only care about the access log bits, and again use grok. These lines are going to be in that same format I showed you, where we take the combined Apache log and add the duration, the host name, and the port to it. Then, to emulate Varnish, we can add the scheme and the host name to the request, and here's where you see those conditionals being used: if the port field equals 443, update the request to be https; if not, make it http. The other thing is that Logstash is just JRuby, so we can also put an arbitrary ruby filter in here: because Apache logs in microseconds, and Nginx and Varnish log in seconds with millisecond or microsecond resolution, we can normalize all of these to milliseconds.
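Put together, a trimmed-down version of that configuration might look something like this. Field names, paths, and tags are illustrative, and the grok patterns are simplified stand-ins for the real ones on the slide, but the overall shape (a file input with a sincedb, grok and date filters, conditionals, a ruby filter, and an Elasticsearch output) is the same:

    input {
      file {
        path         => "/var/log/syslog/*/*.log"
        exclude      => "*.gz"
        type         => "syslog"
        sincedb_path => "/var/lib/logstash/sincedb"
      }
    }

    filter {
      # pull the syslog envelope apart: program, timestamp, log source
      grok {
        match     => [ "message", "%{SYSLOGLINE}" ]
        overwrite => [ "message" ]
      }
      # use the timestamp from the log line, not the time it was indexed
      date {
        match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
      }

      if [program] == "nginx-access" {
        # combined NCSA plus the duration, Host header, and port we added
        grok {
          match => [ "message", "%{COMBINEDAPACHELOG} %{NUMBER:duration} %{IPORHOST:vhost} %{NUMBER:port}" ]
        }
        # emulate varnishncsa: fold the scheme and host name into the request
        if [port] == "443" {
          mutate { replace => [ "request", "https://%{vhost}%{request}" ] }
        } else {
          mutate { replace => [ "request", "http://%{vhost}%{request}" ] }
        }
        # Nginx logs seconds; normalize to milliseconds with a little Ruby
        ruby {
          code => "event['duration_ms'] = event['duration'].to_f * 1000"
        }
      }
    }

    output {
      elasticsearch { host => "localhost" }
    }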
And then again we'll grab the date field, and then we can do some extra parsing; some of the other filters built into Logstash include geoip and useragent, so you can do a very quick analysis of what user agents people use or what countries these requests come from.

Okay, so I have to hurry because we're moving quickly; we're talking about monitoring and performance metrics next. That previous part was all about how I can ship my log entries in. Logs serve one purpose, and you can also get performance counters out of them; logs, again, are just date and time. But you might also want to track more specific metrics about how your servers are performing and how your application is doing, so there's a separate set of tools that are usually used to help monitor and manage system performance. This data, as opposed to just being shown over time, can be trended to chart resource utilization or capacity. There are third-party tools for doing this as well, and generally, if you're sampling some kind of a performance counter at a regular interval, this is known as time series data.

Lots of open source tools exist to keep time series data. RRDtool is probably the classic one; there are also Whisper and InfluxDB, newer on the scene. RRD files are a way you can keep time series data at a particular resolution with a guaranteed file size. In other words, you can say, I'm going to keep 10 days' worth of this metric at a 5-second interval, and then I'm going to keep 5 months of this data at a 10-minute interval. One of the big advantages when this came out was that you always knew how much disk space it was going to take, because you could figure out what each resolution and retention adds up to. Tools like Munin, Cacti, Ganglia, and collectd can all take some sort of metric, like the current load average or the current user CPU usage, and put it into an RRD file, and if you've ever seen a graph like this, it's probably Munin or Cacti taking a performance counter from one of those RRD files and graphing it in this format.

Graphite is a newer tool that's really just designed to store and graph metrics. Graphite has a more advanced storage engine called Whisper. One of the big things about RRDtool is that it expects you to write something at every interval; with Whisper, in addition to being able to buffer and cache data under heavy write load, you don't have to insert into it all the time. So if you have a particular metric that you only need to track at an irregular interval, you can do that, and it's basically designed to take this time series data, store it, and then graph it or give you API access to it.

One last performance counter tool is called StatsD. It was originally created by Etsy and written in Node.js. It accepts data over TCP or UDP, and the main usage is to increment counters. There's the canonical implementation in Node, and if you don't want to deploy Node, there's Bucky, which is in Python, and statsite, which is in C. For each of the counters that you keep (and you can also keep gauges), you get a raw count of events within whatever interval you're collecting on, and you also get event counts per second. So it can tell you either that there were 200 hits to Varnish, or, given that these flush every 20 seconds, that that's about 10 requests per second, and by default it'll give you both of those metrics.
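For reference, the StatsD wire protocol itself is tiny. Lines like the following (the metric names are made up for illustration) are sent, usually over UDP to port 8125, and with the default Graphite namespacing each flush emits both a raw count (stats_counts.*) and a per-second rate (stats.*):

    # counter: one more 200 response
    drupal.varnish.response.200:1|c
    # timer: a request that took 42 milliseconds
    drupal.varnish.duration:42|ms
    # gauge: an absolute value
    drupal.queue_depth:12|g

    # sent from a shell, for example:
    echo "drupal.varnish.response.200:1|c" | nc -u -w1 localhost 8125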
And of course, one way that you can get information into something like StatsD is to use Logstash. If you're already parsing the access logs that matter to your application, you can do things like this. This is an example of the output section of a Logstash config: you can say, if the program is varnish-access, use a StatsD output, set the sender to the log source, which is another field, and increment, for example, a response counter to say I got another 200. That'll keep accumulating for 10, 20, 30 seconds, whatever the flush interval is, and then it'll be written out to Graphite, so you can actually use your log aggregation data to get some of these metrics. You can also add to a counter, for example, how many bytes have come in during the particular window of time you're tracking.

Then, similarly to Kibana, which is the fully Angular front end for Logstash, there's Grafana. If you have a Graphite or InfluxDB setup you can use this, and again, as long as you have direct HTTP access to the Graphite API, you can get at that information. I mentioned collectd too. collectd is a great tool for keeping track of all sorts of system statistics. It's written in C, it's quite efficient, it has plugins written in a variety of languages, and it can stay resident in memory and track things like system load and network usage.

So, a very quick demo of this stack as well. I'll do two things. I'm going to go to my PHP app here and generate a thousand nodes with devel generate. Again, this Drupal machine is Varnish fronting the LAMP stack, so we've got Apache with mod_php and MySQL on this machine, and at this moment I'm starting to load it up with a good amount of disk writes by MySQL, since it's doing some deletes and then inserts of a thousand different node objects. So let me close some of our other demo examples, and then I can go into Grafana. As I mentioned, Grafana is sort of the Kibana interface to Graphite. Graphite in this case is tracking system metrics from collectd, so, for example, I'm going to go to our other dashboard instead. In this case I'm using its ability to look at the memory usage, CPU usage, and disk reads and writes of particular processes. Here I can see that Apache at this moment on the Drupal box is using 319 megs of memory, Varnish is using 92, and MySQL is using 49. Or I can take a look at the CPU usage: Apache is using 50% of the CPU in user space as it goes through and writes these nodes, and MySQL is using 28% of the CPU to do user space work to write them. And this is very similar to the time picker we had in Kibana: I'm looking at the stats from the last 30 minutes, refreshed every 30 seconds, and I can zoom in a little further. And again we've got our graph. So this is just an example of how you could track these sorts of metrics over time. And of course we have MySQL doing a great amount of the disk writes at the moment; it's also doing a good amount of reads. And it does look like about 30 seconds ago we may have finished our devel generate of that content. This is another thing you can keep over time: it's not necessarily log data, but it does give you good insight into how your application or your servers may be performing.

And then a quick look at what StatsD can do. This sample uses that same StatsD output; I can go into my StatsD stats.
And you can see here that we've got Apache, Nginx, and Varnish, and we're tracking them. Let me go into the edit interface to show how this one works. In this case we're tracking stats and not stats_counts; in other words, this is not the raw number, this is the normalized number of requests per second. So for this time period, which is 15 minutes, we can see that Varnish is serving somewhere between 2 and 13 of those 200 responses per second. If I were to go into Grafana and change this to stats_counts instead of stats, it would display the raw counter values, not divided by the 10-second flush interval that I have. I can do that and see that it's not very exciting, because it's just multiplying the previous values by 10, but in other words, in this particular time frame, Apache served 210 responses with a 200 status. That was Drupal's batch process going back and forth. And again, this is all being shipped over, coming into Logstash, and going out to both Elasticsearch and to StatsD and then to Graphite.

Similarly, I can take a look at Varnish hits and misses. Again, you may not always be able to log varnishncsa logs in production, but I can see that I'm still running that siege command, so we have a fairly decent number of hits per second. And when I went through the Drupal interface, I was getting Varnish misses because I'm an authenticated user, so my number of misses shot up, and now we're back down to 0 or 0.1. This is still showing the values normalized per second, so there was an average of 0.1 misses per second. I could similarly go into this, let me go to edit; so this is just showing stats, and I could change it to stats_counts to get the raw numbers, which again will just scale this up by 10. And then, lest I totally neglect it, this is an example of using Librato Metrics, which is a third-party SaaS service, to take these same metrics and graph them over time; this could be used if you don't want to run all of this in your own infrastructure.

Right, so, monitoring and metrics; I think this is the interesting part. Like I said, I'm not going to talk a lot about monitoring necessarily. There are a variety of monitoring tools. The classic one is Nagios. You've got Icinga, which is a fork of Nagios that supports a few more templating features. One of the big complaints with Nagios is that it was not super friendly to config management out of the box, since you have to put every check on the Nagios master unless you go into very advanced configurations with things like Gearman; Icinga tries to solve some of that. Sensu is a very new one written in Ruby. It requires a lot more infrastructure, though: you have to run both RabbitMQ and Redis to make it work, and those will be the backbone of how your checks get run. And then there are also the monitoring check tools: the Monitoring Plugins, or Nagios Plugins, depending on who you ask (there's been a fork between those), have a variety of different checks that you can run. And of course, a lot of people think monitoring sucks, and some people love their monitoring; you may see those hashtags on Twitter. A lot of the tools are old and don't have interfaces that are as nice as Grafana or Kibana. So, monitoring checks themselves, and the reason I'm not really going to go into them a lot, is that they're generally just used to get server state. Can I ping the web servers? Is the disk more than 80% full on my servers? Is Redis responding? Is the load average over 4?
Or over the number of CPUs? Have any Drush processes been running for more than 10 minutes? These are all things with corresponding monitoring checks that you can define against a particular server. Metrics examples, though, like the very simple ones we looked at here, give you far deeper insight into the actual state of your application. In other words, if you have a setup like that StatsD one, or even just send metrics straight from Logstash, you could ask: how many failed login attempts have happened in the last 10 minutes? How many users have logged into the system this week overall? What's the average Varnish cache hit rate? How many 404s did we get yesterday? What's the average back-end response time for the last hour?

One really neat thing, and again there's an open source tool that lets you do this, is to bridge metrics and monitoring. One thing you might find is that a certain trend tends to show you that you're going to have some sort of problem in the future. There's a really great project written in Ruby by Heroku called Umpire. What Umpire does is watch Graphite or Librato Metrics, in other words whatever your metrics system is, and it responds with either a 200 or a 500, depending on whether the metric is in or out of the bounds that you specified. You do have to be careful with this one: saying, oh, all the Apache processes are in use may not necessarily correlate to a failure, and you don't want the alarm going off for no reason. Has anyone been reading about the breach at the White House about a week ago? They said that part of the reason for that was a noisy alert that they had muted. So you have to make sure that the alerts you put into your monitoring system actually have value, because if you just mute them all the time, you may miss something very important. With Umpire you can say something like, these are again Graphite metric names, check the StatsD Varnish hit rate, and you give either a minimum or a maximum and a range in seconds. So you could ask, did the average Varnish cache hit rate fall under 60% over the last five minutes? Or did UID 1 have more than five failed login attempts in the past hour?

So I can do a quick demo again. We've already got Umpire set up here against StatsD; this is my copy of Umpire running locally, and there's really not a lot to it. In this case, I'm taking a look at the stats_counts metric for the Drupal site's Varnish 200 responses. In other words, this is saying, look at the metric for the number of 200 responses from Varnish, and alert if that goes over 20 in the past 300 seconds, or five minutes. I can make this request and it returns JSON, but if I take a look in the network inspector and request it again, you can see it's sending me a 500; in other words, it's saying this is out of bounds. It's also telling me that over the last five minutes the average value in Graphite is 66, so I could say, okay, maybe we'll make the check only trip if that goes over 100. I can do that, and now I get a 200, and I still get the extra data back, so if I wanted to get a little more information out of it, I could. This will read from either Librato or Graphite and get us that information. And with that, you could feed it into something like a check_http check for Icinga or Nagios or whatever.
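As a sketch of what that looks like on the wire (the URL, credentials, and metric name here are illustrative; the metric, max, and range query parameters follow Umpire's documented check endpoint):

    # ask Umpire whether the 5-minute average of a Graphite metric is in bounds;
    # it answers 200 (OK) or 500 (out of bounds) with a small JSON body
    curl -s -o /dev/null -w "%{http_code}\n" \
      "https://user:pass@umpire.example.com/check?metric=stats_counts.logstash.drupal.varnish.response.200&max=100&range=300"

A check_http service pointed at the same URL then turns a 500 from Umpire into an Icinga or Nagios alert.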
You could even use it with something like Pingdom. It can run over SSL, and you can give it a basic auth username and password so that only your monitoring system can access it.

So let's talk a little bit about the Drupal.org logging setup. It's currently running CentOS 5, which is fun, and it's got dual quad-core Gulftown Xeons, so 8 cores, 16 threads. It has 16 gigs of RAM, and currently it's got a large RAID 10 array with 600 gigs of dedicated HDD storage for Logstash. There's a lot more storage just for historical logs as well; this is currently the final resting place for all Apache and Varnish logs for the drupal.org setup. We're currently running a fairly old version of Elasticsearch, Logstash 1.2, Kibana 3 milestone 3, and Curator, which I didn't actually mention much in the talk; Curator just lets you say things like delete all Elasticsearch indices older than four weeks. That said, I did some sprinting on Sunday and I'm going to keep on doing it: we're aiming to upgrade this to CentOS 6, which will be a vast improvement, and to get it up to the latest versions of all the software: Kibana 3.1, Curator 2, Logstash 1.4.

The setup that Drupal.org is using consolidates logs from about 10 web servers. It gets all the syslog logs from Drupal, as I showed in the presentation, and it's currently grabbing Apache and Varnish logs. Interestingly, it wasn't Varnish that broke the camel's back on drupal.org, but we do not currently have the capacity to do the Nginx logs, because Nginx is sitting at the front of everything on drupal.org and doing SSL termination, so that was a bit too much. We're looking to see if we could perhaps add EdgeCast logs in here, but again, I don't think we currently have the IOPS on the box to be able to do that parsing. As for how we're doing things: when I got in and started volunteering on the drupal.org log host, the log shipping was already in place, and it's actually just straight-up rsync. On each box there is a defined location where logs go, and then there's an hourly cron job on each of the machines that does an rsync, using the update-in-place functionality, to write the latest versions of all the files to /var/log/hosts on the log host. And again, because Logstash's file input has that sincedb, it will keep track and say, I was on line 58 million of this log file, and whenever new content comes in, on the next poll it will say, okay, there's new content, and start indexing from there. Currently we're processing a little over two billion log entries a month with that amount of hardware, and we're going to take a look and see what we can get to after the upgrade to Elasticsearch 1.3 and, of course, getting onto a slightly newer kernel; CentOS 6 is still fairly old in that regard, but we're going to see what sort of capacity we can get out of the box when we upgrade.

A simple example, which I won't go into deeply here, of finding a problem using the log host: we deployed the log host about two months before the D7 upgrade went live for Drupal.org, and there was an issue where xjm was having a problem updating an issue on drupal.org. She'd hit save and it would say the website encountered an error, and of course she couldn't see the real error logs. We're using the same schema I showed, so I could simply go into a query and say program drupal, look for xjm's UID, and logs of type php from Drupal, which means PHP errors.
It found a couple of PDO exceptions, and we found out that one of the fields was too small after the D7 upgrade, so we scheduled a downtime and fixed it. And of course, the nice thing is that drupal.org is a very highly trafficked site, and by using the log host for this we didn't have to use any production resources. We didn't have to look through watchdog and put strain, albeit mostly read strain, on the MySQL database; we could just use the log host. And we've done several other things with it. We've also gotten questions like, how many requests were there for a particular module? That one doesn't really work anymore: we've now got the CDN in place, so we might not be able to divine that information from the logs, but that's an example.

So, I apologize, we only have about seven or eight minutes left, but: questions?

All right, so the question was, when using rsyslog or Logstash to forward logs, can they find newly created files, or do you have to restart them? The answer to that is, with rsyslog I've actually only ever used it in conjunction with logrotate, and it will follow those, so in other words, if you're logging to myvhost.access.log and then at midnight the cron job runs and sets that file aside, that will work with rsyslog. Logstash absolutely will: you give it the wildcard pattern you want, a new file appears, and when it next polls it will pick up the new files, so you don't have to do anything special to get it to find new things. The only thing you might run into, and we actually ran into this a long time ago on drupal.org, is that you might need to make sure you have enough file descriptors for the Logstash process to be able to poll all those files and see if there's any new content.

Questions? Yes. So the question is, suppose you have a PHP memory error and you can't isolate it to one page, what would you suggest?
I think the main benefit of having a setup like this is exactly that: if you just got a report from a client saying, I hit a white screen every once in a while, you might not know where, whereas by default Drupal in its logging format is going to include the referrer, the current URL, the current UID, things like that. So I think it is actually super valuable for doing exactly that sort of thing. And then from there, if you see it's a memory error, you can probably just update your configuration and restart your daemons, or that's where you might take a look with XHProf or something like New Relic if it's more of a performance issue. But you can use this as a triage tool to say, where are our errors? Another thing we did after the D7 upgrade was to have a view in Kibana of the top PHP errors, and we just knocked those out. We could basically do a recurring query that said, for the past two days, show me, based on a term count, all the common threads, and go, oh, this error happens a lot, let's fix it. You could correlate that to a deploy and then see, in a day or so, that we've killed this one and it's gone.

I'll take a question over here. Oh, absolutely, so I think we may have some news about ELK from Leslie from Elasticsearch. "Hi everybody, I'm Leslie Hawthorn and I'm the director of developer relations at Elasticsearch. I've been hanging out with the Drupal community now for about eight years; some of you folks may remember me from Google Summer of Code if that was part of your world. We have a really cool, cute little stuffed toy elk if you folks are fans of the ELK stack. I wasn't able to bring them with me because I'm just hanging out at the conference, we don't have a booth, but if folks want Elasticsearch swag, just come and find me and I'm happy to bring it for you tomorrow."

All right, so maybe time for one more question if anyone's got one. Yes. That's a very good question; sorry, the question was, how do you detect and prevent tampering with logs? Well, there are many different vectors for this, right? If a lot of your developers have sudo on a machine, for example, they could just do what they want to, and of course you could potentially use the audit subsystem to look at that. The reason I mention this is very topical: how many people here have an opinion on systemd? Oh, it must be more than that. One of the big arguments around systemd is that it comes by default with journald, which is a binary format, and one of the stated goals behind making it a binary format was to make it tamper-evident, in other words, to provide a built-in ability to show that a log has not been modified. So I actually don't know that I have a great answer for your question. In the case of drupal.org specifically, we're not using these logs for any sort of internal audit function; we're using them for operational visibility. And especially in the case of something like rsync, right, the rsync process coming in is privileged and is updating those files in place. So it's a good question, and I don't have a great answer for you offhand; the systemd developers would say use journald, and then they can verify that the log has not been tampered with.
There is some support, I think, going into current versions of rsyslog to provide separate checksums to be able to verify log entries, but again, you'd have to stay within the rsyslog ecosystem; you couldn't necessarily rsync, because once a log leaves the purview of the logging daemon, it's just another file on disk.

All right, so we have one minute left. I did want to mention that I have a lot of resources here; again, I'll put this up on the DrupalCon site: links to all the different daemons that I mentioned, the sample config that I showed for Drupal and Nginx, and how to ship MySQL slow query logs into Logstash too. The slow query logs are a little tough because they're a multi-line format, and that resource shows how to use the multiline codec in Logstash to handle it.

Also, a very important thing: Drupal 8.0.0 beta 1 is out, which is excellent news, so we do have sprints going on. This slide mentions them; specifically, on Friday there will be sprints, with extended sprints through at least Sunday. And if you are infrastructure-type folks, there is also a public meeting of the infrastructure team going on tomorrow at noon, I believe; it's on the site. And if you know Puppet, or if you would just like to document or help with infrastructure issues, we're always looking for volunteers on the infrastructure team as well, so if you would like to help out with that, especially on Friday, you can come try and find me, or look for Rudy Grigar or Gerhard Killesreiter; there are lots of folks. We definitely need a lot of developer help to get Drupal 8.0 out the door, but having good volunteers to help with Drupal.org's infrastructure is also appreciated, so if you look at the sprints, we will also have a dedicated Drupal.org infrastructure sprint on Friday as well. And I think that's time. Please, after the fact, go back and evaluate the session at amsterdam2014.drupal.org/schedule. Thank you very much.