Hey, go ahead. What a stunning intro. Welcome. So I'm Bryan Davis. I'm a software engineer for the Wikimedia Foundation, and I'm going to give a little presentation on what the marketing term for it is, the ELK stack, which is Elasticsearch, Logstash, and Kibana, and how we use it at Wikimedia.

So first, it's not this kind of elk. It is instead an acronym, kind of like the LAMP stack. The E stands for Elasticsearch. Elasticsearch is a document-oriented full-text search engine built on top of Apache Lucene by Elasticsearch BV, released under an open source Apache 2 license. This is the same technology that we use behind the scenes to power CirrusSearch, which runs search on English Wikipedia. The L is for Logstash. Logstash is a pipeline processing system that takes data from inputs, passes it through filters, and then hands it off to outputs. You can think of it a lot like a Unix shell pipeline, where you tail a file, run it through grep and awk and sed, and then dump the output to another file (there's a tiny configuration sketch of this coming up in a minute). Logstash is also a product of Elasticsearch BV, released under an Apache 2 license. And then Kibana is the front end. It's a browser-based tool, mostly JavaScript, that lets you make dashboards and interact with a backing Elasticsearch index. It was originally invented pretty specifically to display the Logstash-created data stored in Elasticsearch. And yet again, this is an Apache 2 open source project by Elasticsearch BV.

So how does Wikimedia use the ELK stack? We gather log events into Logstash from various applications, and we'll get to a slide a little later that talks about some of the ways we bring things in. Once we have the events in Logstash, we process them to clean them up and normalize things, a lot like a grep-awk-sed sort of setup, to make logs from various applications look uniform where they can look uniform, which makes searching and dashboard building a little easier. Then we tell Logstash to store those events in an Elasticsearch index. We have a dedicated cluster specifically for serving Logstash that's separate from the Elasticsearch cluster we use to power CirrusSearch. We also have Logstash, and this is relatively new, like within the last week or so, send some metric data over to a StatsD instance, which then passes it on to our Graphite storage system, which lets us track some trending metrics. We're trying to figure out how to use this for some alerting now, to notice when particular log channels get noisier than they normally are and point humans at the data. We use Kibana to search the log events from Elasticsearch. We have an instance for production and an instance for the beta cluster, and there's some talk of working towards getting an instance up for our Labs infrastructure in general, but we have some technical hurdles to get over for that. And then we have some Icinga alerts based on the metrics that we push into Graphite, to watch for bad trends.

So, Logstash inputs. I don't know if this is a completely exhaustive list, but we take in Apache error logs and HHVM error logs using syslog forwarding: Apache and HHVM send to syslog on the local host, that syslog forwards a copy of the messages off to the Logstash cluster, and the Logstash cluster processes them from there.
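To make the input-filter-output pipeline concrete, here's a minimal sketch of what a Logstash pipeline configuration looks like. This is illustrative only, not our production config (that lives in the operations/puppet repository); the port, host, and field handling here are placeholders I've made up.

```
input {
  # syslog-formatted messages relayed from application hosts
  syslog {
    port => 10514
    type => "syslog"
  }
}

filter {
  # one small normalization so different inputs look uniform downstream,
  # e.g. rename the syslog "program" field to a "channel" field
  mutate {
    rename => { "program" => "channel" }
  }
}

output {
  # store the processed events in Elasticsearch (one index per day by default)
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```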
On the MediaWiki side, we use Monolog, which is an open source PHP library for log generation and processing. We plug Monolog into MediaWiki and attach a handler to it that sends syslog-formatted UDP datagrams off towards the Logstash cluster, where they can be collected (I'll show a small sketch of that idea in a moment). Our venerable scap deployment process uses udp2log forwarding: scap sends UDP packets towards fluorine, which is our log aggregation server, and when they hit fluorine, the udp2log software copies those messages back out and relays them to the Logstash cluster. We've got a growing number of Node.js services, Parsoid, RESTBase, and others, that are all using a logging library called bunyan and a plug-in for it called gelf-stream, which sends UDP datagrams using the GELF protocol, a particular structured logging protocol, over towards Logstash. Our Cassandra cluster that's behind RESTBase is now using a Logback plug-in; Logback is a Java logging library, similar to log4j but different, and we've configured a plug-in for the Logback stack there to send messages directly to Logstash.

In the beta cluster only, because we're not quite sure how to secure the data for production yet, every Puppet run sends the data from that run, via a custom Ruby report plug-in, into the Logstash cluster for beta. We also forward all syslog messages that happen on the beta cluster hosts into the Logstash there. And then I have one interesting new project that we'll see a couple of slides on towards the end, where I have a Labs project that is listening with an IRC bot in several channels, I think five or six channels right now. It collects all the IRC messages that happen, stores them in Elasticsearch, and lets us do some interesting things. And Logstash is pretty flexible; there are many, many other input plug-ins possible as we find new pieces of data and new applications that we want to collect information from and organize.

Yes? "Yeah, so there are two things I don't see on there. One is EventLogging, the sort of thing that calls logEvent. That doesn't go into Logstash, is that correct?" Yeah, that's correct. EventLogging is a separate system that stores into a MySQL database. "And the other thing we're also not using this for is straight-up Varnish and Apache requests." That's a huge-volume thing that is currently very, very difficult to manage. Yeah, we'd need to throw a whole lot more hardware at this pipeline in order to keep up with our front-end Varnish request traffic. That's processed via Hadoop right now: we take that data, route it through Kafka, and use Kafka to dump it into Hadoop, which can actually keep up with the log volume. So unfortunately, we don't have one magic tool to rule them all yet.
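Coming back to that MediaWiki input for a moment: in production the wiring goes through MediaWiki's logging configuration, but as a rough standalone sketch of the idea, here's plain Monolog pushing syslog-formatted UDP datagrams at a Logstash host. The channel name, host, port, and message are placeholders invented for illustration.

```php
<?php
// Illustrative sketch only: plain Monolog shipping a log event to a Logstash
// syslog input as a UDP datagram. Host, port, channel, and message are made up.
require_once 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\SyslogUdpHandler;

// The Logger "channel" corresponds to what shows up as `channel` in Kibana.
$logger = new Logger( 'memcached' );

// Forward records as syslog-formatted UDP datagrams to the Logstash cluster.
$logger->pushHandler( new SyslogUdpHandler( 'logstash.example.org', 10514 ) );

// PSR-3 style call: a message plus a context array of extra structured data.
$logger->warning( 'Memcached get failed for {key}', [ 'key' => 'enwiki:pcache:12345' ] );
```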
So once we get the data into Logstash, we have various filters set up, which is, again, like a shell pipeline: an event comes in, it gets turned into a piece of structured data inside Logstash, basically a dictionary, and then we can run various pre-built transformations on it. We have one generic filter that strips ANSI color escape sequences from the message portion of each event; they might look nice on the console, but when they end up getting re-rendered as escape sequences in HTML, they're kind of ugly. We have filters for a couple of different types that join multi-line events: one logical event that happens on the server but comes into Logstash as separate lines because we're using some line-oriented protocol. HHVM stack traces fall into this category, so we use a filter that tries to put those back together, instead of having a bunch of random lines in the Logstash output that could be intermixed with other things, and collect all the pieces of the stack trace back into one large message. We've got some normalization filters that try to populate common attributes that we can use across all types. We tag everything with a type, which we use as the origin of where the data comes from: MediaWiki, Apache, HHVM, scap, syslog, Parsoid, Cassandra, et cetera.

"Bryan, really quickly, speaking of Kafka, can Logstash ingest from Kafka easily?" In theory, yes. We haven't ever tried it out, but there is at least a third-party input plug-in for Logstash that knows how to read from Kafka topics; I think there's also a matching output. In general, almost anything you can do with Ruby we can figure out how to make into a Logstash plug-in, if somebody in the world hasn't done it already. Logstash is, interestingly enough, a JRuby stack, Ruby running inside a JVM, and the plug-in specification is pretty simple, pretty straightforward to deal with.

Yeah, so besides type, the normalizations set a channel, which is kind of an event type. The easiest analogy for anybody familiar with MediaWiki is that channels are log groups: a channel in our Logstash/Kibana output is the equivalent of a log group on the MediaWiki side. So if you put something in, say, the memcached log group by calling wfDebugLog( 'memcached', ... ), that'll show up as type mediawiki and channel memcached inside the Kibana interface. And we try to stick a normalized severity, a level, meaning the severity of the event, onto each thing that we store in Logstash, and we normalize those names to match the PSR-3 standard. PSR-3 is a PHP logging standard, and it turns out that the level naming used in PSR-3 is taken basically directly from the RFCs for syslog, so it makes a nice hierarchy from highest to lowest with consistent naming that we can apply across most input sources. We've got a few other filters in there that discard junk messages, things that are known to be incredibly spammy that we're never going to do anything about, or messages that come in empty or malformed somehow, and a few other things that we may touch on a little later. And again, there are many other possible filters that could be applied to transform content, depending on what it looks like when it comes into the system and what you want it to look like when it goes out. I think I was reading on Elastic's website earlier this week that there's something in the neighborhood of 150 filters currently available in Logstash, and writing new ones is pretty easy.
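Here's a hedged sketch of what a few filters along those lines might look like in Logstash's configuration language. Again, this is illustrative rather than our exact production filter set (that's in operations/puppet); the pattern and field names are assumptions.

```
filter {
  # Re-join multi-line events (e.g. stack traces) that arrive as separate
  # lines over a line-oriented protocol: here, continuation lines start
  # with whitespace and get glued onto the previous event.
  multiline {
    pattern => "^\s"
    what    => "previous"
  }

  # Normalize the severity field so every input type uses the same vocabulary.
  mutate {
    lowercase => [ "level" ]
  }

  # Discard junk: drop events whose message is empty.
  if [message] =~ /^\s*$/ {
    drop { }
  }
}
```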
So after you've taken the inputs and run them through the filters, now you're ready for the outputs. The outputs that we use send documents to a local Elasticsearch cluster, which is what powers the Kibana interface later. We also, in one instance, send documents to a remote Elasticsearch cluster: we have a feature that watches a particular log stream, the API feature usage stream, sanitizes that data to remove anything that might be considered identifying or private, takes the IP addresses out, et cetera, and then forwards those sanitized records to the same Elasticsearch cluster that's used to power CirrusSearch in production. That allows us to have a special page on our wikis that you can go to to look at this API usage information, which is actually powered by Elasticsearch searching on the back end. We also have, as I mentioned before, an output that's pointed at StatsD and generates metrics for storage in Graphite. Right now what we're doing with that one is taking the count of MediaWiki log events by channel and level and sending that into Graphite, so that we can then, say, watch trends on the MediaWiki memcached channel error-level messages and set Icinga alerts against them. So that pretty much ends the Logstash magic part of it.

Then Kibana is the front end that we use to look at this data after it's all been piled up in Elasticsearch. We have a production instance at logstash.wikimedia.org, access to which is controlled by username and password and requires membership in an LDAP group that indicates you've signed an NDA, because there's potentially sensitive information in the log events about activities that have happened on the wikis, so we can't just share them wholesale with the entire universe: IP addresses, pages edited, username-to-IP-address correlations, those kinds of things. But we have another instance for the WMF beta cluster at logstash-beta.wmflabs.org, and this one, as of last week, is actually wide open to the world. Woohoo! So anybody following along out there who wants to start jumping in and looking at Kibana as we go through these next parts of the slides, please don't crash my service, but go ahead and give it a shot. And thanks to Greg, who's in the room in San Francisco, for helping us figure out that it was okay to take the password authentication off of that beta version.

So Kibana looks something like this. I'm going to go through some slides here at the beginning rather than trying a live demo because, well, live demos. We'll run through some slides and then, if we have time at the end, we can hit the live demo, although looking at my clock it looks like we probably have some time.

All right, so this is a basic Kibana screen. I'm just going to go through a few of the UI elements here. Kind of in the upper middle there's a gadget that lets you select the time range that the search is going to cover. Right next to it, there's a little button that refreshes the current dashboard or the current search. This can be useful if your time range is something like the last five minutes or the last hour and you haven't turned on auto-refreshing; that data will go stale as time moves on, and you can hit the refresh button when you're ready to see new results. There's a cute little home icon that takes you back to the default dashboard. The screenshot I'm showing here is basically the default dashboard, actually I think the default dashboard from the beta cluster, so it gives you a nice anchor to go back to. There's a little folder icon to load a saved dashboard. When you click on that, it'll pop down a list of dashboards that have been saved. The list it shows is not exhaustive, I think it just shows 15 dashboards, so there's a little box you can type in if you know there's one you want that you're not seeing in the list; you can start typing and there's type-ahead auto-completion that'll show you the things that match your search.
The next icon next to that is the three-and-a-half-inch floppy disk that probably means nothing to anybody who is 20 years old or younger, but has become the universal "save this" icon. It allows you to save the current dashboard. If you're trying to create a new dashboard derived from an old one, please, please be sure to change the name before you hit the save button in there. When you hit this little icon, it'll pop up a save dialog, and one of the things you can do is rename the dashboard. The reason to make sure you change the name before you save over one that everybody loves and uses is that dashboards are not versioned on the back end, so as soon as you hit save, it overwrites the stored dashboard, and you might make somebody sad if you overwrite their nice dashboard with something horrible.

Right next to that is an easy-to-miss but very useful icon to share the current dashboard. This one can be especially useful if you're using Elasticsearch and Kibana to triage some kind of outage or problem, and you've invented a big query and zoomed in on the timeline and have just the interesting things up, and you want to show it to somebody else but don't think it's worthy of making a whole permanent dashboard. If you hit this little share icon, it'll give you a shareable link that's good for 30 days and points to the state of the dashboard at the time you made the link. That comes in handy, especially in the ops channel on IRC, when you're finding something and trying to show it to somebody. And the little magic gear icon up in the upper right-hand corner lets you configure global settings for the dashboard you're looking at; the most useful thing in there, most of the time when you're messing with dashboards, is adding a new row.

So let's move down a little bit. The next thing we see here is the query bar. This lets you enter an Elasticsearch query string query (Lucene-style search syntax) into Kibana to go against Elasticsearch and filter your results. The example I'm giving here, type:mediawiki AND channel:fatal, would basically show you the same things that fatal.log on fluorine shows you. I'm sneaking things in here for S. There'll be a little slide later with some more search tips too. Below the query there's a filtering tab, and in the screenshot I took it was collapsed, because it's collapsed by default. Click there and the filter section will open up and show you permanent filters, or cross-cutting filters, that have been set up and are ANDed with the query string. We'll see a little later some ways that it's easy to add new filters as you drill down through a report. In the events-over-time section you see a little histogram of how many events are showing up per unit time, and the unit time scale varies depending on how big a time window you're looking at. One of the neat things you can do here, which has maybe not the greatest discoverability in the world, is to zoom in and focus on a particular time range: you can click and drag on the histogram, it'll give you a kind of shaded box showing the start time and end time you're selecting, and when you let go it'll refresh the report focused in on that particular time window. There we go. Down below that, in the default dashboard, we have some widgets set up to show you graphs of events by type. These are the types we apply to things on the Logstash side that roughly correspond to the applications that generated the log events.
And one of the things you can do here is click on the bar for a particular type, say mediawiki or scap or syslog or whatever, and that will add a filter up in that collapsed filter section to restrict the search to only that particular type. There's a very similar gadget for level, set up so that you could pretty quickly go from seeing everything to seeing only error-level messages from MediaWiki, say. Down below we have the events table, which shows you, surprisingly enough, the events recorded by Logstash that match your searches and filters. The default home dashboard we have set up shows you when the event occurred, what type it's tagged with, what level it's tagged with, which wiki it was associated with if we have that information (these are the MediaWiki wiki database names, so enwiki or zhwiki or enwiktionary or whatever), the host that generated the log message, if we have it, which we usually do, and then the message itself. You can click on these rows and they'll expand and show you more detail about that one particular log event: a whole list of the fields available in that log event, and then some more gadgets for applying filters. With each field there's a magnifying glass you can click to add an "only show things that match this value in this field" filter. There's a little ban symbol that does the opposite, to basically say "I don't want to see any more log messages that are of type mediawiki" or whatever you're clicking on. And there's a little grid icon you can use to add a field as a column in the table view, which can be nice after you've drilled down into something particular, if there's some data item, like say the process ID, that you want to see correlated across certain messages.

All right, that was my quick and dirty... yeah, yes? "Can you go back to the main screen? I was wondering what the time frame of the types and level is. In other words, it just shows this massive count, but is that over the last day, or...?" Sure, so it's over whatever time range your report is showing, and it'll follow along as you zoom in on the histogram to a smaller time window, or do something else to expand out to a larger one. The types and the level gadgets on the dashboard are driven off of those same search results, so basically everything you see on the screen correlates together: the histogram of events over time, the types, the levels, and the events down in the table below all stay in sync.

Yeah, so, some search tips. When I was preparing for this, I got some lists of things that people wanted to know how to do and hadn't necessarily figured out themselves yet in Kibana. So if you want to find the equivalent of the fatal log on fluorine, that's type:mediawiki AND channel:fatal up in the query bar, and you should have it. That gives you basically the equivalent of tailing fatal.log on fluorine, if you keep hitting refresh on it. And in general, it's type:mediawiki AND channel:<whatever your log group is>: memcached, resourceloader, et cetera, et cetera; there are lots and lots of different log channels.
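A few example query-string searches along those lines. These are illustrative; the field values follow the normalization described earlier, the exact level casing depends on how your events were tagged, and the host name and message text in the last two are made up:

```
type:mediawiki AND channel:fatal
type:mediawiki AND channel:memcached AND level:error
type:hhvm AND message:"request timed out"
host:mw1017 AND NOT type:syslog
```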
If you do a search and get no results, you might want to try a larger time range; you might be searching for something that doesn't happen often enough. Our default dashboard only looks at the last 15 minutes, so if you haven't expanded that out, you might be looking for an event that only happens a couple of times an hour and just hasn't shown up in the last 15 minutes. The other thing to do is expand the filtering section by clicking on the word "filtering" and see if there are some filters stuck in there that are keeping you from seeing what you want to see. In general, shorter time ranges lead to faster searches. Elasticsearch is pretty good at this, but especially if you stay within the last day, searching is faster than if you expand out to, say, show me everything that's happened over the last seven or thirty days, which requires loading a whole lot more indexes into memory to look through. Log retention, actually, I didn't talk about anywhere in here, but we only keep 30 days plus today's worth of indexed logs, so today and 30 days into the past. At, I think, around 4 AM UTC we drop the oldest day, so we only ever have 31 indexes around at most.

So when you start playing with this, looking at things and especially expanding out to the actual details of a log event, one of the things you might end up desiring is better log output from MediaWiki. Things that are logged via wfDebugLog all come in at level info and really don't have a lot of data beyond the message that was sent, what wiki it was on, and what host it was on. But we now have the ability in MediaWiki to use a fancier logging layer. Yes, yes? "Right, I'm sorry to interrupt, but you said they come in as type info; I thought you said the type of them all was MediaWiki." Ah, I probably meant to say level info: type mediawiki, channel whatever, and then level info. Sorry about that.

So when you use LoggerFactory directly, you get a PSR-3 logger interface, which allows you to specify the log level. Debug, info, warning, and error are the basic ones; there are a few others, but these four should cover you most of the time. Debug is good for messages that are too spammy for production logging, but that you'd want to see if you were doing some local development or a detailed debugging session on some host. Info is your basic valuable state-change information: hey, this thing happened, it should be in the logs because I might want to find it later. Warning is for soft error conditions, meaning this thing probably shouldn't happen, but it's not the end of the world that it did. And error is for harder error conditions, like I caught this exception and that's bad and we should try to figure out how to make it never happen again. In addition to being able to set the level that way, you can also add additional context to a message: by passing another array, a dictionary really, of data along with the message, you can add new named things. Some examples here on the slide are tagging the method and the line, and these actually end up being surfaced in the Kibana interface, so you could search for method:<my method name here> and find all the logs that were recorded with that method name.
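Here's a hedged sketch of what that looks like in practice using MediaWiki's LoggerFactory; it also shows the exception convention that comes up next. The channel name, message, and context values are invented for illustration.

```php
<?php
// Illustrative sketch, assuming it runs inside MediaWiki (1.25+), where
// MediaWiki\Logger\LoggerFactory is available. Channel, message, and
// context values are made up.
use MediaWiki\Logger\LoggerFactory;

// Get a PSR-3 logger bound to a channel (the equivalent of a wfDebugLog group).
$logger = LoggerFactory::getInstance( 'memcached' );

$cacheKey = 'enwiki:pcache:12345'; // hypothetical cache key

// Pick a severity and attach structured context; these context keys become
// searchable fields in Kibana (e.g. method:MyClass::fetchThing).
$logger->warning( 'Memcached get failed for {key}', [
	'key' => $cacheKey,
	'method' => __METHOD__,
	'line' => __LINE__,
] );

// Per PSR-3, an exception should be passed under the 'exception' context key;
// the logging pipeline can then attach a stack trace to the stored event.
try {
	throw new RuntimeException( 'simulated failure' ); // stand-in for a real error
} catch ( RuntimeException $e ) {
	$logger->error( 'Unexpected failure: {exception}', [ 'exception' => $e ] );
}
```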
And another fun one there at the bottom of the slide: exceptions. If you want to attach an exception to a log message, it should, according to the PSR-3 standard, always be sent as a piece of context data named exception. There's special handling in both the MediaWiki default logger and in the Monolog logger that will notice that there's an exception in the context and attach stack traces and other information to the message that finally ends up going into Logstash and Elasticsearch.

All right, so that's basically the end of what I had prepared about the ELK stack in general and the use of Kibana. I've got a couple of other things here at the end that are more fun toys I've been working on. I mentioned that there's a Labs project that is feeding IRC inputs into Elasticsearch. One of the things this powers is a tool called SAL, at tools.wmflabs.org/sal, that shows the !log messages, the server admin log messages sent out in the various IRC channels, in a way where you can search through them, page through them, and filter them by day; kind of a fun little thing that I made. Some day maybe this will be the replacement for wm-bot writing things on the wiki, maybe not, I don't know, it's in trial right now. And another one that I built is the tool named bash, at tools.wmflabs.org/bash. This one is those great collections of quips that we all knew and loved from the top of our Bugzilla searches, imported into Elasticsearch, plus an IRC bot that lets you add new quips by being in a channel with the bot and typing !bash and the quip you want to have stored. There's more information about that on the info page for the tool if you're interested.

I've got the obligatory learn-more links at the end of the slide deck. The slides are actually up on Commons already; if you search for "elk tech talk" they should show up, so you can get these links out of there. Elastic.co is Elasticsearch BV, the company that makes all of these tools possible and releases them under the Apache license. They've got a webinar up with an introduction to the ELK stack that covers a lot of the ground I covered, plus some additional things. The manual for Kibana is out there. We use Kibana 3; 4.1 or 4.2 is the current version, I think, but we haven't switched to it yet. There are a few changes in the 4.x series that I'm not in love with, so I'm waiting for those to get fixed upstream before we cut over to the new user interface. There's a page on Wikitech about our Logstash setup in general, a Phabricator project for Logstash-related, or ELK-stack-related, things, things in the operations/puppet repository that show how our Logstash is configured specifically, if you want to see the exact filters, inputs, and outputs that we use, and then some things on mediawiki.org and Phabricator about structured logging use in MediaWiki: how to use it, the different levels, things like that.

That's all I had prepared, folks. Do we have questions, comments, or concerns? You won't make me feel stupid by asking them. "I'm not going to embarrass myself; I'll just say I learned something today. I like the drag-and-zoom-in feature; I didn't know that existed. I was always fussing with trying to type in the right date range, and that was annoying as heck, so now I'll save like hours out of my life. So, well, I'll start."
"So the SAL thing you built is cool, but as far as I can understand you can get the same thing by going to the Kibana on beta labs and just choosing SAL, right? I mean, it's just one view of the same log messages." Yeah, the SAL link on beta labs is actually a link to my tool. Right, it's a link to the Elasticsearch that's behind my tool, or rather a link to the Kibana that's hooked to the Elasticsearch behind my tool, yes. At one point I was storing all that stuff inside the beta cluster Elasticsearch, but once I got the separate tool, the separate system, up and running, I detached the IRC listeners from the beta cluster one.

"Bryan, what are the 4.0 things that we don't like?" Awesome; I said that out loud hoping somebody might notice. The biggest one that's annoying to me right now is that in 4.0, all dates are always shown in the browser's local time. There's no way to pin the timestamps to show in UTC, which is actually what's recorded in the Elasticsearch data, so it's basically just a user interface thing in what they've built. In the existing Kibana 3 stuff that we use, we can pin times to UTC, which makes things a lot easier for, say, talking to the operations team and correlating data among different data sources. There's an open ticket for that in Kibana; it just hasn't been fixed yet. I think it got bumped to their next version milestone as a maybe.

Anything else? Okay, thanks so much, Bryan. Thank you.