Okay. So welcome to this talk about Elasticsearch, Logstash, and Kibana, and the rest, of course. We have today Honza Kral. He's a Python developer and Django core dev. Okay. So let's go. So good morning, everyone. So I'm here to talk about logging all the things. And while that seems somewhat obvious, I want to take it a little bit further and explore what logging is, what it can be, and what the important aspects are to keep in mind when you're doing logging, especially centralized logging, and also the motivation for it. Every good talk begins with a definition. So what is a log? What do we mean by logs when we say it during this talk? Well, essentially, when you come to think about it, a log is any sort of message, any sort of document, any piece of data that has a timestamp and doesn't change after it's been created. So it applies to many more things than just logs as you think of them, as lines in a file. It can be a Twitter feed — that's essentially also a log, any stream of events as they happen — but also things that happen in your organization on the business side. All the invoices that you send out, all the money that you get back, all these transactions can also be considered logs and can be treated in very much the same way. You can actually use the same system to handle them, and I'm trying to convince you here that it would be beneficial. And metrics can also be viewed as logs. Your CPU usage, your free memory — all of this information is traditionally stored next to the logs in some separate system, but in reality it's kind of the same data: instead of a textual representation of what's going on, we have a numerical one. But it's really the same thing, and you want to treat it and work with it exactly the same way as you would with logs. So I will probably keep saying logs throughout this talk.
Just keep in mind that it applies to anything that has a timestamp and doesn't really change much once created. So that's what logs are; that's what logging means for us today. So why should we care? Why do we talk about these logs and metrics so much? Well, currently any company out there generates huge amounts of data. All the different events that are happening: an incoming request from a user hits the load balancer, then the web server, then the server serving the static files, then the database. And this is just a simple website. Imagine that you have anything more complicated. Imagine, God forbid, that you have something like microservices and you need to track all the different services and all the different requests that are going around. That's a lot of data. And that's just the technical data. You also have a lot of business data. How is your business doing? How is your traffic? What is happening on the business side of things? And a lot of the time, this data just goes to waste, if it's recorded at all. Some of it you always have to track because your business depends on it, but a lot of it goes to waste when it could be used. But let's start with the simple questions. What happened last Tuesday? Preferably at three in the morning. If a customer comes to you and says, hey, I use your service, I really like it, but last Tuesday at three in the morning this annoying thing happened to me, and it really bugs me — what do you tell them? How do you find out what actually happened? Well, the first approach is you typically try to grep it. You have some log files somewhere, you try to grep them, and that breaks down pretty quickly. It's fine on your local machine if you're looking for something that just happened. But on a production system, that's typically not so nice, because you would have to go to multiple machines.
Sure, we can all do SSH scripts and stuff like that. That's fine. We can go through multiple log files — even that we can do with grep. It's getting a little hairy, but you can still do it. What's harder to do is any sort of analysis or discovery, because with grep, you already have to know what you're looking for. If you don't know what you're looking for, there is no way grep will help you. And lastly, the crucial part of the question is Tuesday at 3 a.m. Who here can grep all the different log files for what happened on Tuesday at 3 a.m.? I see no hands raised, and that's probably fairly accurate because, you know, time is fun, especially when you're dealing with computers. And the nice thing about time formats is that everybody likes their own, and people don't really like to share. So here is just a sample of some of the formats that you might see in very common log files. Some of them are quite interesting. For example, Postfix just assumes that it will never run for more than a year — it doesn't log the year at all. That's not really the confidence I'm looking for in my systems, but okay. And you can see that they're all very different. Some of them are not even sortable. So how does this work if I want to grep for something that happened Tuesday at 3 in the morning? Well, the obvious answer is: you cannot. So we need something else instead of grep. What's it going to be? Let's see what we need. But also, we want to be able to do more than just look at individual log files or individual events from individual sources. We want to be able to correlate different events. And that's why I gave you such a broad definition of what a log actually is and what it can be, because only once you get logs or data from multiple sources do you really get to see some interesting stuff. For example, if you compare the logs from your load balancer and from your web server, and just look at the raw numbers, you can immediately deduce certain things about their behavior.
For example, if the traffic on your load balancer is going way up and the web server traffic is still steady, that's probably a good thing. It probably means that you have some sort of caching on the load balancer and it just works. But if you see them rising together, that means the caching you have in place doesn't work. And that's something that's nearly impossible to discover without having both of these data sources together in one system where you can compare the numbers. The same goes for web server versus database. There, too, you want some sort of caching; you definitely don't want to scale linearly, where the more requests you serve, the more database queries you make. That doesn't scale well. So you need to be on the lookout for these kinds of patterns, and this is, again, something very difficult to discover otherwise. Also, when you see a rise in errors on your web server, does it maybe correlate with a new deploy, or with a new employee getting on board? Or a new client? Or something like that. And you don't have to stop there — you can go into the more businessy sort of thing. We bought these ads, we bought this traffic from someone; do we really have something to show for it? Sure, for a lot of these things, you can go to external services. But external services are external — they don't know your system, so they might be difficult to tie in to the rest of your infrastructure. So this is everything that we want. We want to be able to look at what happened on Tuesday at 3 a.m., and we want to be able to answer all of these questions and do the correlations. So how will it look? What's the ideal state of this proposed system? We need a central storage — something that can handle the different data coming from different sources, and that can handle the amount of data. We also need the data to be enriched. We don't just want the raw data, the raw text from the log file. That's not interesting.
We want to parse it, and we also want to do some enriching. So for example, if we stick with the example of running a web server, we have a URL in there. We want to map the URL to the article and the author who wrote it, or to the product in our eShop and the category of that product, because once we have that, we can immediately see much more in our data. The same with the client IP or the user agent: we might want to see which country they came from. And other stuff, too: we see some cookie in there — that's cool, but was that user logged in or not? What was the username? All of this information. And once we have it, of course we want to be able to search on it, to filter on it, to get results back. So if you have an annoying user called Honza and he bugs you, hey, I cannot find anything on your website, you can easily do a search and say, hey, from this user, did I see any 404s? Maybe there is something wrong with his browser or something. So this is what we want. We also want to be able to analyze all of it — not just look at individual records, but look at patterns, visualize the data, and be able to discover some interesting stuff. So essentially what we've designed here, what our wish list amounts to, is centralized logging. That's the technical term for this kind of system. And it consists of several steps, and those steps will not surprise you at all by now. We need to collect the data. We need to parse the data in case it's in a textual format — extract the different fields that are otherwise hidden in the text and create structure from it. Then we need the enriching step: do the GeoIP lookup on the IP address and so on. We obviously need to store the data somewhere that's capable of doing search and aggregations. And finally, and most importantly, we need to visualize the data.
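To make the collect–parse–enrich steps concrete, here is a minimal Python sketch: it parses one access-log line with a regex, normalizes the timestamp into sortable ISO 8601, and enriches the event with a country lookup. The tiny GEOIP dict stands in for a real GeoIP database, and the log format and field names are illustrative, not any particular tool's schema:

```python
import re
from datetime import datetime, timezone

# Hypothetical enrichment table standing in for a real GeoIP database lookup.
GEOIP = {"203.0.113.7": "NL"}

# A simplified combined-log-format pattern; real access logs carry more fields.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_and_enrich(line):
    """Collect -> parse -> enrich: turn one raw log line into a structured event."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    event = m.groupdict()
    # Normalize the timestamp into sortable ISO 8601, in UTC.
    ts = datetime.strptime(event.pop("ts"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Enrich: look up the country for the client IP.
    event["country"] = GEOIP.get(event["ip"], "unknown")
    return event

line = '203.0.113.7 - - [12/Jul/2016:03:00:02 +0200] "GET /talks/42 HTTP/1.1" 404 512'
print(parse_and_enrich(line))
```

The resulting dict is ready to be stored as JSON and searched on by field — which is exactly the point of the whole pipeline.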
Because we as humans are pattern recognition machines. It's very easy for us to spot an anomaly in a pattern. It's very hard for a computer to do so — you would have to instruct the computer specifically what to look for, or you would have to have a very, very smart computer. And smart computers are expensive, especially in time. So how can we accomplish this using the Elastic Stack? Elastic is the company I work for; we produce all of these things to do all of that. And don't worry, this is not a sales pitch — everything is open source. So this is how it maps. In the center of everything, to store the data and do the search and analysis, we have Elasticsearch, the data store that can handle this amount of traffic. For visualizations, we have Kibana; we'll see pretty screenshots later. And for collection and parsing, we have two products: Beats and Logstash. They are a little bit different. Beats is more of a lightweight agent that sits on your machines, collects the data, and sends it somewhere — either into Logstash for further processing, or directly into Elasticsearch. Logstash is more heavyweight. It has many more options, but it's also much heavier to run. Just to demonstrate what I mean by that: Beats are small agents written in Go, single statically compiled binaries that you can just upload somewhere and they will work. Whereas Logstash runs in JRuby — so it's written in Ruby and runs on the JVM. I'm pretty sure that's fairly popular in this crowd. For more sophisticated setups, or if you really need more from your system, this is typically how it would look: you use Beats to actually collect the data and then Logstash to do the parsing and enriching, because that's what this is all about. So this is the overview; now let's get into it. The first step in the process is Beats. And Beats is really a family of products. There are several different Beats.
And most importantly, you can create your own Beats. Beats are written in Go, and we even have a Beat generator, so you can just run a command that creates all the scaffolding, all the boilerplate code for you, and you essentially just have to write the one function that actually collects the data. And we have several different types of Beats out of the box, so let's see some examples. The first one is Metricbeat. Metricbeat is something that regularly runs and collects some data. It has different modules. This is an example of a working configuration where we want to monitor Redis: every second, we want to essentially capture the info from host one. And we also have one for Apache, where every 30 seconds we want to do the same thing. Metricbeat has these modules for Redis and Apache; it knows how to go in and fetch the information. We also have Filebeat, which is essentially just: here is a log file, keep tailing it, and optionally, if you see a line that doesn't actually begin with a hash, merge it with the line before. That probably means there is a stack trace or something that spans multiple lines, so we can group those lines together already at the Beats level, when we first collect the data. Doing it later is a hard problem when you have data coming from multiple sources — how do you identify which lines actually belong together in a single message? And then my favorite Beat is Packetbeat. All you have to do with Packetbeat is say: I have this protocol running on this port. And then Packetbeat will just keep monitoring the network and logging what's going on there. And because it understands the protocol, it can give you more information. For example, it understands the Postgres protocol, so it can tell you: this is a SELECT, this is a transaction, this is a SELECT going to this table — and log all of that information in a structured manner.
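The multiline grouping that Filebeat does can be sketched in a few lines of Python. This is just the idea, not Filebeat's actual implementation — Filebeat matches a configurable pattern, while the stand-in heuristic here is simply that continuation lines start with whitespace:

```python
def merge_multiline(lines, cont_prefix=" "):
    """Group continuation lines (e.g. stack traces) with the event that started them.

    Any line that does NOT look like the start of a new event is appended to the
    previous one. Real Filebeat uses a configurable regex pattern instead of
    this simple leading-whitespace check.
    """
    events = []
    for line in lines:
        if events and line.startswith(cont_prefix):
            events[-1] += "\n" + line  # continuation: glue onto previous event
        else:
            events.append(line)        # start of a new event
    return events

log = [
    "2016-07-12 03:00:02 ERROR something broke",
    "  Traceback (most recent call last):",
    '  File "app.py", line 1, in <module>',
    "2016-07-12 03:00:03 INFO recovered",
]
print(merge_multiline(log))
```

The key point is exactly what the talk says: this grouping has to happen at collection time, while the lines are still adjacent and from a single source.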
And finally, after all the inputs you can have, you have an output. The output is either Elasticsearch, or a file, or standard out. Or, in this example, it's Logstash: we just take the data and send it to Logstash for further processing, over a custom TCP protocol, so we can do some more stuff with it. So let's follow our data and go to Logstash ourselves. What Logstash is, is a data ingestion pipeline. There are inputs, then a bunch of filters, and then some outputs. It's really just that — a pipeline. So what are the different options? There are many, many different inputs. The most interesting ones, at least for me, are all the different queues: Redis, Kafka, RabbitMQ, ZeroMQ — all the different ways to get data out of a queue. Also ways to get data straight from the network: you can just open a TCP socket and listen to whatever comes in. Or there are specialized ones like the Beats input, which is pretty obvious, or syslog or log4j. You can even go to S3 or SQS or other systems. So there are many, many different types of inputs for getting data into the pipeline. Then the meat of it is all the different filters that you can apply to your data. This is just a small sampling of the filters that exist, and highlighted are the ones that, again, I personally consider more interesting. For example, anonymize: if you consume data that can potentially contain sensitive information, like email addresses, that you don't necessarily want to expose to everyone in your company, but you still want people to be able to inspect the logs, you can just anonymize those fields, which runs them through a one-way hash. All occurrences of the same email will map to the same hash, so events still correlate, but nobody will be the wiser as to which actual user it is.
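A minimal sketch of what such an anonymize step does (in recent Logstash versions this functionality lives in the fingerprint filter; the secret key and email values below are made up):

```python
import hashlib
import hmac

# Keyed hash, so the digest can't simply be brute-forced from a list of known emails.
SECRET = b"rotate-me"

def anonymize(value):
    """One-way hash a sensitive field.

    The same input always maps to the same digest, so all events from one user
    still correlate -- but the original email is not recoverable from the logs.
    """
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()

a = anonymize("honza@example.com")
b = anonymize("honza@example.com")
c = anonymize("other@example.com")
print(a == b, a == c)  # same email -> same digest; different email -> different digest
```

Searching and aggregating on the hashed field works exactly as it would on the raw email, which is the whole trick.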
I've talked about the GeoIP filter: it will take an IP address and give you back the country and the city the user came from, so you can visualize very nicely on a map where your traffic is coming from. You can even learn, for example, which users around the world have the best experience — the best latency — or the worst. Grok is for when you want to parse text. JSON is kind of obvious: if you have data in JSON, just parse it as JSON. User agent: if you've ever seen a user agent string in your log files, it's a nightmare to make sense of, even for people. The user agent filter will parse it into structured information: this is Chrome version 7.72, and it's running on Windows. And then, lastly, there are, again, a number of different outputs. The crucial one is probably Elasticsearch, but there are many others. You can write the data to a different queue to be processed by another system. You can write it to a completely different storage. If you're so inclined, you can even write it to MySQL or something like that. I don't know why you would do that, but you can. And multiple outputs actually make sense for some of the data, because what you can do with Logstash quite easily is say: put all the data in Elasticsearch, and if you see some critical error, send it to me over email. And if it's really, really critical, just ping PagerDuty and have my pager go off so I can jump on it right away. So we can have multiple different outputs guarded by filters, and you can be alerted immediately, in real time, about what's going on. So that's Logstash. It's really not that hard in concept: you have inputs, you have a number of filters, and you have some outputs. The only interesting part is that you can have multiple outputs, and obviously multiple filters and multiple inputs. And then the data gets into Elasticsearch. So what is Elasticsearch? Again, just a super high-level overview: it's a distributed search and analytics engine. It's open source.
It's document-based, and by document what we mean is anything that you can express as JSON — we can index it, search on it, and analyze it. It is based on Apache Lucene, the library that does all of the heavy lifting. And it is very friendly: it speaks JSON over HTTP. There are also clients in any of your favorite languages. My guess is that your favorite language is Python, and we do have a Python client for Elasticsearch that you can use. The nice part about Elasticsearch is that it is distributed, and it has some qualities that make it very well suited for the logging use case. So how does it look inside Elasticsearch? Again, the highest possible level of overview. Elasticsearch is a clustered solution, so you have a number of nodes that work together. From the outside, it's completely transparent — you don't really care what's happening inside, on which node. As a client, you can talk to any of the nodes in your cluster, and they will all answer the same questions in the same way. So you don't have to worry about any of this, but it's nice to know how it works so you can reason about what your expectations should be. In the cluster, the data is stored in indices, and each index is essentially a collection of shards. So what we do is say: we have this index, which is just a logical grouping of documents, and we'll split it five ways — into five shards. And each of these shards we'll store twice, in case we lose one node, so we can still keep going. These shards are the unit of scale in Elasticsearch. So in this case we have a cluster with two indices: one with four shards and one replica each, so two copies of each shard, and one with only two shards and no replicas — we don't care that much about that index. And those shards are what actually lives on the nodes, and the cluster keeps rearranging them.
So if I were to add one more node, the cluster would say: oh, I have a free node — and it would move some of the shards over to that new node. You will have a primary and a replica, but that's just a logical distinction; it really doesn't matter, because the shards look exactly the same and do exactly the same amount of work. So again, something you typically don't have to worry about. But what this means is one very important thing: when you search through the orders index, in this case, you have to go out to four shards. And that's okay — we can absolutely do that. And it is the exact same operation if I want to search four shards no matter where they come from: they can be inside one index or spread across four indices. The only thing that really matters is the number of shards. And this allows for some interesting things, where we can create a new index every day with any number of shards. Typically, you would start with one shard when you're starting out, and grow the number of shards as one shard becomes not enough. And then when you search, you just search over as many indices as you need data for. So if you want data for the last seven days, you just search over the last seven daily indices. This also means that you can treat the indices differently. For the current index, the index for today, you will have more replicas, and you will keep it on the nodes that live on stronger boxes — the boxes with SSD drives and everything — because those indices are doing the most work: they're actively indexing new data. And as the data gets older — say, a week-old index — you will just back it up. You do a snapshot with Elasticsearch, store it on S3 or something like that, and remove the replicas. That means that at this point, if you lose a node, you will lose some data. But that's okay: you have a backup, and this data is not that important — it's a week old.
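This age-based treatment of daily indices can be sketched as a simple policy function — roughly what a curator-style tool automates for you. The thresholds and action names below are illustrative, assuming indices named like logs-YYYY.MM.DD:

```python
from datetime import date

def index_action(index_name, today):
    """Decide how to treat a daily index (named like 'logs-2016.07.12') by age.

    The thresholds and actions are a made-up example of a retention policy,
    not any tool's defaults.
    """
    y, m, d = index_name.split("-", 1)[1].split(".")
    age = (today - date(int(y), int(m), int(d))).days
    if age <= 1:
        return "hot: extra replicas, fast SSD nodes"
    if age <= 7:
        return "snapshot to S3, drop replicas"
    if age <= 30:
        return "move to boxes with big spinning disks"
    if age <= 365:
        return "close: stays on disk, not searchable"
    return "delete"

today = date(2016, 7, 12)
for name in ["logs-2016.07.12", "logs-2016.07.01", "logs-2016.01.01"]:
    print(name, "->", index_action(name, today))
```

Because each day is its own index, every one of these actions is a cheap whole-index operation instead of an expensive delete-by-query inside one giant index.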
It's okay to save a little money sometimes. Then month-old data you might want to move to weaker boxes — boxes with just huge spinning disks where everything will live. Then you can even close the indices: they will still live on disk, but they will not be in memory and will not be available for search. But you can make them searchable again very easily, just by opening them. And finally, after some amount of time, you can just delete the data. So you have a very clear plan for how to degrade your data and make it use fewer resources while still keeping it. Sure, it means that if you search older data, it will be slower. But that's okay: 90% of your users will probably just want to search today or yesterday — typically, actually, just the last hour. They just want to see the dashboard for the last hour that they can put on the wall, auto-refreshing every minute. So that's one nice feature of Elasticsearch that's very relevant to the logging use case, and this is how you can make use of it. And speaking of dashboards: the last part of the Elastic Stack is Kibana. Kibana is a small JavaScript application that provides visualizations for your data in Elasticsearch. It doesn't have to be log data, but log data is how Kibana started, and that's really where it shines. And you can see immediately here what I talked about: you can immediately see a gap here in the data. And you can see it because, again, you're human, I assume. That's why visualizations are important. And in this lower one, the data is split by country. For each country, we split the users of our website again by whether they're authenticated or not. And for each of these two groups, we ask what browser they're using. And immediately, you can see very different things for different countries. So we have China here, where we have mostly authenticated users and some unauthenticated users.
And at the end we have — I don't know what country — but nobody there is logged in. And you can see that immediately, because it just pops out. Because, you know, again, the human thing and all that. If you use the GeoIP filter in Logstash, you can see where your users are coming from, just by clicking on a map. And you're not limited to pretty pictures: you can actually drill down to the individual records, and you can do searches. In this case, I'm looking for responses that went to IE6 and are 400 to 600 kilobytes in size, and I can see the individual records, the individual URLs. You can see that we are using data from the U.S. government — they actually publish this data set publicly. So you can drill down, you can click into it, and you can see all the different values. So putting all these things together, this is how it looks logically: you collect the data with Beats, you send it to Logstash to enrich it, you store it in Elasticsearch, and you visualize it using Kibana. Well, the ultimate architecture would be that instead of each of these arrows, you would have a queue — a Kafka between Beats and Logstash, and between Logstash and another Logstash that would then put the data into Elasticsearch. But that's only once we're talking hundreds of thousands of requests per second, or millions per second. If you only care about thousands per second, you can just wire it up directly like this and you will be perfectly fine. If you need more capacity, you just add more machines, more nodes, at each level. You can have more than one Logstash, obviously, and more than one Elasticsearch — you should have more than one Elasticsearch to get any sort of high availability and resiliency. So this is how it works. It's really not that difficult to set up. You can just start with everything on one machine.
When you want to start, I recommend you use just Beats and Elasticsearch alone. No Logstash — it will just work. And only when you discover you need more, like doing the enriching and so on, you can introduce Logstash in the middle. It's a minimal change in your configuration, and you can grow from there. The first thing you would do is probably separate Elasticsearch and Logstash onto their own machines, and then keep growing from there. So how does Python come into it? What are the concerns when you're logging from Python? The first important one is to enhance your logs. Don't just log "well, this happened." Also log, for example: how long did it take? How many database queries did it take, and how long did they take? Also include some sort of metadata: who is the user who requested this? What page are we currently on? Again, speaking of the web example. And ideally, log it as JSON. Because if you log it as text, you will have to parse it later — so you're both serializing it into text and then parsing it back out of the text. Both of those are pretty error-prone, and they take a lot of CPU. And no human is going to look at the individual message; we will be looking at it through Kibana. We care about the individual fields, not about one textual representation containing all of that. So log as JSON. And the way to do it is a Python package called structlog, created by Hynek — he's somewhere around; I believe he's giving a talk in one of the other tracks right now. What structlog enables you to do is exactly that: add structured info to your logging, qualified fields with names and values. With that, you can track information through your services. So if you, for example, have a load balancer, you can attach a session ID as an HTTP header.
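A minimal sketch of the log-as-JSON idea, using only the standard library rather than structlog (which gives you this and much more out of the box). The extra field names here — session_id, duration_ms, db_queries — are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line, ready for Filebeat to ship."""

    def format(self, record):
        event = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Qualified fields attached via `extra=` become top-level JSON keys.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Field names here are illustrative metadata, per the advice above.
logger.warning("slow request", extra={"fields": {
    "session_id": "abc123",
    "duration_ms": 1840,
    "db_queries": 12,
}})
```

Each line is already a structured event, so nothing downstream ever has to grok free-form text back apart.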
And then track it: even if the request has to go to two different web servers, you can put the pieces back together in the end and track the one request through your different systems. You can add a little comment to each of your SQL queries, again to match them back to the request that started them. So you can track one user action on your front end to everything that happens on your back end. Then, ideally, you want to log that into a file. You could send it directly to a queue, or to Beats, or to Logstash, or something like that. But then, what happens if your logging infrastructure goes down, or if you want to upgrade it? The worst-case scenario is that it actually impacts your production application. That's not really acceptable. So what you want to do is use some sort of buffer, and the easiest, most universally supported buffer you can find is a file. So just log into a file, and then you can have Filebeat sitting there, watching it, and sending the data either directly to Elasticsearch or to Logstash for further processing. And you will be perfectly fine: if your logging system goes down because you're just playing with it and you're still not committed, that's fine — your application will still run, you will not lose any data, and you can backfill it later. And it gives you a lot more flexibility. So I think that's it for our overview of what's possible, why you should do it, and what the key concepts are to keep in mind when designing a system like this. And now we have some time for questions. Can we get the mic? Yeah, it should work. Okay, so any questions? A question about an open source solution for user authentication in Elasticsearch. Currently, no. But Elasticsearch speaks HTTP, so what you can do is stick an nginx in front of it and do HTTP auth and SSL there. It's very difficult to do different levels of access that way — it's possible, but there might always be weird corner cases.
But that will get you 80% of the way there, and it's very easy to set up. If you need more than that, unfortunately, currently you have to pay us money — we do offer commercial plugins for Elasticsearch, and the one doing security is one of them. Hi, great talk. What's your suggestion for an environment where we are installing our full stack of products at the client's site and we don't have access, so everything is there? Sometimes we can just pull the log files if we ask the client. So what's your suggestion for this scenario, to effectively get the logs stored in Elasticsearch? So there are two different ways to do this. One is that you install Elasticsearch and Kibana at the client with every installation, but that's probably only worth it if the client will get some value out of it as well. If that's not the case, just create a pack of the logs, ship it over, and then have your own stack configured to take that pack, read through all the logs, run them through the entire pipeline, get them into Elasticsearch, and then visualize them. And at that point it's only up to you whether you create a temporary stack, for example on AWS, for each one of these packs that you receive, or you have one big one and collect the data from all the clients. Makes sense? Cool. Okay, another question? One here. So you mentioned that Beats understands different protocols and we can configure it to listen on a TCP port and send logs down the pipe. Let's say I have my services running in Docker. How do Beats play with that? So with Docker there are several ways to do it. Docker listens on the network interface, so the easiest way for this particular Beat, which is Packetbeat, is to run it inside the Docker container with the application you're trying to monitor, or to run it in a separate container with the networking configured so that it is able to listen to that traffic.
Alternatively, you can skip Packetbeat and just log things directly using Metricbeat, which can live in its own container and just keep pinging the other services. Or, Docker has its own logging functionality that you can feed into Logstash. So you can use Docker to collect all the logs from all your containers, aggregate them together, and send them to Logstash for processing and for loading into Elasticsearch. So there are many, many different approaches; it depends on exactly what you're trying to do. Okay, thank you. Another question? Yeah, two. Okay, we have two minutes. Sure. It's not really a question, but a pet peeve of mine is that people log what they're doing — and I can see from other things what you're doing. Please also log why you're doing things. If something is secret sauce, I don't care, but for the things that you do want to log, if you are able to log why you're doing stuff, please log the why. You heard the man. About this flow of log messages through time — are there some hooks in Elasticsearch for that? You know, like: after one week, do this with the data. Or do you have to just write scripts? Thank you for that question — that's what I forgot. Yes, there is a tool. It's called Curator. It's written in Python, actually, and it allows you to do just this. And a new version of Elasticsearch — Elasticsearch 5, which will hopefully come later this year — has it already built in, as an API inside Elasticsearch. So, Elasticsearch Curator: if you just search for it, or if you install Curator, it has a command-line interface, so you just stick it into your cron, and periodically Curator will run: everything older than five days, remove it — or any other actions that you might have. Okay, no other questions. Thank you so much for the talk.