I like to start almost every project with a design doc, and so there's an eight-page design doc that really goes into why we've done it and how it's different from existing log aggregation systems. So go and read the design doc, send me comments — it's all open source, it's all Apache 2 licensed. And today I'm just going to pick on three topics, three things that I think make Loki special. I promise — but seriously, keep me honest: I do not mean to trash-talk Elastic. We've tried to take a different trade-off to Elastic, and I want to emphasise the differences while being respectful. But if I can't do that, then pull me up on it.

So first we're going to talk about how Loki is simple to scale. I'm not an expert on Elastic, by the way, or Splunk or other log aggregation systems, but my understanding of how they work is that they take your log event and split it up into a bunch of entries in their index — this is called an inverted index. And so what you can see in this very crude diagram is they've gone into the log line and broken it out into this set of tuples — there are four of these tuples at the bottom — and they point back to the log entry. What happens when you do this is that your index becomes your data store: your index is the majority of your nodes, and you very quickly get to the point where it's hard to scale. Your index becomes the size of your logs, and these inverted indexes are pretty tricky to scale. In my opinion, at least, all of the operational trouble people have with Elastic is because it's fundamentally trying to do a very difficult thing. And so I wanted to avoid a lot of these problems.

One of the things you'll also notice is that the second part of the tuple — the key in the inverted index — tends to be very high cardinality, and it's free text; it can be anything. And high-cardinality indexes are difficult to implement: they tend to get very big, and when you query them you have to use special techniques. If you come from the Prometheus world — a lot of you seem to know about Prometheus — you'll know it's always banging on about not using high-cardinality label values. So those are some of the problems.

What we've done is try to build Prometheus for logs. So instead of indexing the log entries, we index a little bit of metadata about each log stream, and you can think of a log stream the same way as a time series. We take time-indexed binary data, build it into chunks, and then we have a small amount of metadata pointing to those chunks. And that small amount of metadata is in fact exactly the same as your Prometheus labels. The nice thing about this is that the metadata is small and low cardinality — generally many orders of magnitude smaller than your raw log data. And so scaling that is easy: you can fit it all on one machine, even at a very large scale.
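To make that contrast concrete, here's a toy Go sketch — nothing like Loki's actual code, just the shape of the two ideas: an inverted index grows with the content of every log line, while the Loki approach indexes only a small label set per stream and leaves the log bytes opaque.

```go
package main

import (
	"fmt"
	"strings"
)

// Elastic/Splunk-style (simplified): every token of every line becomes an
// index entry pointing back at the line, so the index grows with the size
// and cardinality of the log content itself.
type invertedIndex map[string][]int // token -> IDs of lines containing it

func (idx invertedIndex) add(id int, line string) {
	for _, token := range strings.Fields(line) {
		idx[token] = append(idx[token], id)
	}
}

// Loki-style (simplified): a stream is identified by a small,
// low-cardinality label set; the content is just compressed chunks
// that are never indexed.
type labelSet map[string]string

type chunk struct {
	from, through int64  // time range covered by this chunk
	data          []byte // compressed log lines — opaque to the index
}

type stream struct {
	labels labelSet
	chunks []chunk
}

func main() {
	idx := invertedIndex{}
	idx.add(1, `level=error msg="request failed" status=502`)
	fmt.Println(len(idx), "index entries for one short log line")

	s := stream{
		labels: labelSet{"job": "default/promtail"},
		chunks: []chunk{{from: 0, through: 60, data: []byte("...")}},
	}
	fmt.Println(len(s.labels), "label to index for the whole stream")
}
```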
And then your log data is huge, but we're not indexing the internals of it. So the idea here is that this makes it super easy to scale: the streams themselves are just sharded out across the nodes in the cluster. This is done with an internal DHT and Dynamo-style replication logic and all the usual stuff, but I'm not really going to focus on that too much. So that's me trying to give you a bit of an idea of how we've tried to make Loki easy to scale.

And come on, do you guys want to come in? This is a good point to come in. There's space up here at the front. No worries.

Okay, so the second point I want to focus on is integrating with existing tools. I am on call — I'm actually on call right now; I just got paged. But luckily Goutham's here, so he's dealing with the page. My usual on-call routine is: I get a page — I use Slack, unfortunately, so I get the page in Slack — and I go to a dashboard. The dashboards are in Grafana; I work for Grafana Labs. I go and fiddle with those dashboards, and normally I end up copying the query out of the dashboard and putting it into the Prometheus UI and fiddling with it, because I like to fiddle. And then once I've found the query that really highlights the problem — maybe it's high latency, maybe it's a high error rate, maybe I've narrowed it down to a particular service or host — I then want to go and look at the logs for that service, that host, that time range. And normally I'm using a separate log aggregation system — I can't remember for the life of me which one it was. But you had to translate the query into that log aggregation system's query language. So I've now had to copy the query once, translate it, and maybe I'm going to do distributed tracing and translate the query and the selectors yet again. Finally I'm going to do a fix. It's all of this context switching that's expensive; in my opinion, it's what slows me down when I'm responding to alerts. So I wanted to fix that.

So, a quick overview of Prometheus — you're all very familiar with it, but a quick overview. Prometheus has identifiers and time-indexed values. The timestamps are millisecond int64s; the values are float64s. Prometheus emphasises simplicity — a really easy-to-understand data model. The identifier is this bag of label/value pairs; you can think of it as a map. Then when you want to do a query, you use selectors on those labels to pick the time series you want to use. So you might say: I want all of the time series for this job, for this metric, coming from this host. Or all of the time series for this job, for this metric, that have status code "5.." — so all of the 500s.

The way Prometheus gathers these labels — this is a Kubernetes example; if you're not using Kubernetes, it works similarly with Consul, DNS, you name it, there are loads of service discovery integrations — is that Prometheus will talk to Kubernetes and gather a big list of all of the pods or services or nodes, most of the objects in the Kubernetes object model. You then get to do some munging. I've been told I use too many colloquial English terms — munging is just fiddling, but in code. In Prometheus, this is called relabeling.
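As a toy illustration of that data model — my sketch, not Prometheus's implementation; real matchers support =, !=, =~ and !~, whereas here equality is just an anchored regex — this is roughly how a selector like {job="default/prometheus", status_code=~"5.."} picks series:

```go
package main

import (
	"fmt"
	"regexp"
)

// A series (or log stream) identifier is just a bag of label/value pairs.
type labels map[string]string

// One clause of a selector, e.g. status_code=~"5..". Plain equality can be
// modelled as a regex anchored at both ends.
type matcher struct {
	name string
	re   *regexp.Regexp
}

func (m matcher) matches(l labels) bool {
	return m.re.MatchString(l[m.name])
}

// A series is selected only if every clause matches.
func selects(sel []matcher, l labels) bool {
	for _, m := range sel {
		if !m.matches(l) {
			return false
		}
	}
	return true
}

func main() {
	series := labels{"job": "default/prometheus", "status_code": "502"}
	sel := []matcher{
		{"job", regexp.MustCompile(`^default/prometheus$`)}, // job="..."
		{"status_code", regexp.MustCompile(`^5..$`)},        // all the 500s
	}
	fmt.Println(selects(sel, series)) // true
}
```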
There's a series of rules you execute against the data you get back from service discovery, where you can mung it into the thing you like. A good example: in all of the Prometheus installs that I've worked on or installed myself, I tend to include the namespace in the job name. This is because when I first started using Prometheus and Kubernetes, I would quite often want to see the error rate of a particular service, so I'd do job="service" — and I'd accidentally aggregate together the error rates from my dev namespace and my prod namespace into a single thing. I didn't want to do that. So I started to include the namespace name — dev or prod — in the job name, and it stopped me making those mistakes. Now, I'm not advising everyone to do that; this is what I call an opinion. And what I really like about relabeling is that it allows me to express my opinions and consistently apply them within my organisation.

So you end up with a set of targets and a set of labels after the relabeling. Metrics get pulled from those targets, and those labels get added to the labels of the metrics you've pulled from each target. So when you hear us say "target labels" and "metric labels", that's what we mean.

So let's have a look at Loki. Well, this looks familiar — I kind of just copied the previous slide. Loki is time-stamped log streams. The difference is the timestamps are a bit more precise — actually, I'd have to check; I can't remember what we actually did — and the values are byte arrays. But the identifiers are the same: they're bags of label/value pairs. And how do we build these labels? We have this job called Promtail, and it's a really descriptive name, because it's the Prometheus log tailer. It uses the Prometheus service discovery code to talk to Kubernetes. It uses the same relabeling code — it uses Prometheus as a library — to build a set of labels for your logs. And then the quick little hack we do is: inside that label set you put a file glob that specifies the files on the local file system that you want Promtail to tail. Promtail will then go and tail those files — assuming, you know, your apps write their logs there — add those labels to those streams, and send them to Loki.

Now, the nice thing about this is you end up with exactly the same label set for metrics coming from a target and logs coming from that target. This is not arranged by — I can see someone's already really happy about that — this is not arranged in some Fluentd config to look similar. These are identical. They're using the same service discovery code; they're using the same relabeling code. And this enables really cool workflows. Oh, that's a terrible resolution, isn't it? I'll show it in a minute. But basically, on the left you've got some Prometheus data, and the labels are job equals blah-blah-blah and name equals that. And on the right we've got the same label: job equals tempo-dev/promtail — Loki was called Tempo for a while. And you can see the labels are the same. And now you can see, side by side, logs and metrics from the same target for the same time range.

And so, coming back to the slide I started with, you can now see how Grafana is encroaching into the rest of this workflow. We can handle our alerts in Slack, go to a dashboard. Within the dashboard now in Grafana — we're going to launch this in Grafana 6; I think the beta came out last week.
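Here's a toy Go version of that relabeling idea. The real relabel_configs in Prometheus and Promtail support regexes, keeps, drops and more, and the Kubernetes meta-label names below are illustrative — but the shape is this, including the __path__ file-glob label Promtail uses:

```go
package main

import "fmt"

type labels map[string]string

// One toy relabel rule: join some source labels into a target label.
type relabelRule struct {
	sourceLabels []string
	separator    string
	targetLabel  string
}

func (r relabelRule) apply(l labels) {
	out := ""
	for i, src := range r.sourceLabels {
		if i > 0 {
			out += r.separator
		}
		out += l[src]
	}
	l[r.targetLabel] = out
}

func main() {
	// Labels as they might come back from Kubernetes service discovery.
	discovered := labels{
		"__meta_kubernetes_namespace":      "prod",
		"__meta_kubernetes_pod_label_name": "api",
	}

	// The opinion from the talk: include the namespace in the job name, so
	// dev and prod never get aggregated together by accident.
	rule := relabelRule{
		sourceLabels: []string{
			"__meta_kubernetes_namespace",
			"__meta_kubernetes_pod_label_name",
		},
		separator:   "/",
		targetLabel: "job",
	}
	rule.apply(discovered)
	fmt.Println(discovered["job"]) // "prod/api"

	// Promtail's little hack: a reserved label holding a file glob that
	// tells it which local files to tail for this target.
	discovered["__path__"] = "/var/log/pods/*/api/*.log"
	fmt.Println(discovered["__path__"])
}
```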
Full launch will be in a month or so. You can now click on a panel and go to what we call explore mode, which extracts the query into a more Prometheus-like UI with tab completion and all those fancy things, for you to fiddle with your queries. And then once you've fiddled with that query, with a click of a button — because we've got the consistent labels — we can see the logs for those jobs. And this also gives you an idea of what I want to do next.

So, we've covered how I've tried to make Loki easy to scale. We've covered how I've tried to integrate it with existing tools — with Prometheus and Grafana. And now I've got a grab bag of other stuff, which I've called "cloud native". Lots of logos. It's shipped as a container, although actually I do most of my development on my Mac. We've focused with Loki on making it easy to run locally — on making it not depend on cloud storage. Even though it says here it depends on cloud storage, there's also a local mode using BoltDB, just storing the logs in files on the file system. So you can run it on your laptop; you can run it on Linux, on your Mac. There's even a PR to make it compile on Windows Subsystem for Linux.

Kubernetes native: our target is Kubernetes, so a lot of this has been focused on gathering logs from pods. We've used all of Prometheus's service discovery integrations. We've built a Helm chart. We use something called ksonnet, which is a really cool config management system for Kubernetes. And we've basically made it really easy to get started on Kubernetes. One of the things I want to talk about in a bit more depth is this optionality around microservices, so I'll come back to that.

And then finally, cloud storage. We've built Loki to use things like S3 and GCS: you run it, and it will store the chunks — the chunked-up, compressed, time-indexed logs — in that object storage. Right now we also need a place to put the index, so we use Bigtable or DynamoDB; if you run it locally, we can use BoltDB. And my plan in the next few weeks is to have a version which writes the index to BoltDB and then flushes that BoltDB file to S3 as well, to completely cut out the dependency on DynamoDB and Bigtable. There's a sketch of this chunks-plus-index layout below.

So let's come back to this optionality around microservices. This is a big slide. Loki under the covers is actually based on a previous project that I've been working on for about three years called Cortex. Who's familiar with Cortex? I'd hope I am. Okay, not too many people. Cortex is a horizontally scalable, microservices-oriented Prometheus implementation, I guess. We took Prometheus, broke it up into a bunch of microservices, and used different techniques to make each of those services horizontally scalable, highly available and so on. That's the Cortex architecture on the right. And then when we came to build Loki, we took exactly the same code base — we vendored Cortex in, did a few refactorings, and effectively just did a sed to replace every mention of "time series" with "log stream". And it worked. So because Loki is based on Cortex, you still have this underlying microservices architecture that you can deploy. I mean, at Grafana Labs we run, right now, I think two big Loki clusters — one that's free for you to sign up for and give a go, and one that we use internally for dev and test — and they run in this format.
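To make that storage split concrete, here's a toy Go sketch. The key format, the index shape and the fingerprinting are all illustrative, not Loki's actual schema — the point is just that big compressed chunks go to object storage while a small index maps label sets and time ranges to chunk keys:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type labelSet map[string]string

// fingerprint derives a stable ID for a label set (toy version; keys are
// sorted so the hash is deterministic).
func fingerprint(ls labelSet) uint64 {
	keys := make([]string, 0, len(ls))
	for k := range ls {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(ls[k]))
	}
	return h.Sum64()
}

type chunkRef struct {
	objectKey     string // where the chunk lives in S3/GCS
	from, through int64  // time range the chunk covers
}

// The "index": label fingerprint -> chunk references. In the cloud this
// would live in Bigtable/DynamoDB; locally, in BoltDB.
var index = map[uint64][]chunkRef{}

func writeChunk(ls labelSet, from, through int64, compressed []byte) {
	fp := fingerprint(ls)
	key := fmt.Sprintf("chunks/%d/%d-%d", fp, from, through)
	// An object-store put, e.g. uploading `compressed` under `key`,
	// would go here.
	index[fp] = append(index[fp], chunkRef{key, from, through})
}

func main() {
	ls := labelSet{"job": "default/promtail"}
	writeChunk(ls, 1000, 2000, []byte("...compressed log lines..."))
	fmt.Println(index[fingerprint(ls)])
}
```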
They run lots of different services. The nice thing about running in a microservices way — and I'm sure you've all heard many talks about microservices — is that we can independently scale the read path or the write path. And we can ship new, exciting features to the read path because it's all stateless: if they panic, which they do quite often, we just revert them. But on the write path we would never dream of doing that, because that could lose data. So I quite like the microservices way of working, but one of the things we learned with Cortex is that it makes things really hard to get started with. If you've got to launch seven or eight services, wire them all up, and tell them where they all are so they can talk to each other, you're never going to get it running on your laptop. So one of the things we put a lot of effort into when we built Loki was making a version which takes all these services, puts them in a single binary, runs them in a single process, and does all the internal service discovery — so you can run a single binary but still have a horizontally scalable, highly available version of Loki. It's still a bit of a work in progress. The single-node version works very well — that's how I run it on my laptop, and that's how a lot of the community run Loki. And I really want to make the single-binary, single-process version easier to scale horizontally: currently you kind of need a Consul service, and I want to replace that with gossip and things like that.

So that's Loki: simple, cost-effective to operate, integrating with existing observability tools, and cloud native through and through. I thought, what better than to give you a bit of a demo? Who wants to see a demo? Good. Live demos always go well. I'm just going to mirror my screen if I can figure out how... displays, window arrangement, mirror... there you go. Right then. Can everyone see — I don't know what's going on up there. Right then.

I'm just going to cheat and use that internal dev cluster at Grafana Labs that I talked about. So this is our Grafana. What was the page we had earlier? I'm not going to use that one, because I have no idea what caused it. I'm going to use this one — a page we got a few days ago. Promtail is the agent that runs on every node, that collects all your logs, that does the service discovery and so on. This is the dashboard. This dashboard is actually completely unrelated — it's a mixin, a Prometheus monitoring mixin, which is this format for reusable groups of dashboards and alerts written in this language called Jsonnet, which is really cool — a project that Matthias and Frederic and myself work on. There are also Kubernetes mixins and Consul mixins and etcd mixins; we're trying to build a bit of a community there.

So this is showing — let's pick the right cluster — this is showing the Prometheus running in our default namespace, and look, there are some errors, conveniently. So I don't really know what's going on here. I'm going to explore this panel, and now I've got the query, I can fiddle with it and pick it apart a bit. I don't want all these label_replaces — that seems tedious. And then... yeah, and then that's wrong now. Yeah. So this looks a bit more interesting: we can see status, status, status. I think that's a bug. That worked this morning — we continuously deploy Grafana master to this Grafana, so we find these things. Are there any Grafana developers in the audience? Yes.
Carl, I think that's a bug. So — I want to know what this spike of errors was here. I'm particularly interested in the 502s. We can see there was a brief blip. And if we go and look at... oh, actually, the screen's not really big enough. I'll show you the split view, but the screen's a bit small. So we go there, and now we've got the same selector we had before, the same time range we had before, and we can see the logs are... yeah, well, something's going on here, isn't it? And if I click this — this is some cool magic they've added on the Grafana side — we can now see that of the 19 errors we've got in this time window, 79% were Bad Gateway and 21% were Internal Server Error.

And that's kind of it. That's Loki. Super simple. We're trying to make it really easy. It's obviously not a solution for analytics on log streams; this is more of a DevOps thing: let's gather your logs, give you a basic index over them, and hopefully make it easy for you to operate and run. I think that's all I've got. Thank you very much.

Now, no leaving — we're all going to sit through the questions, because it really annoys me when everyone leaves. So who's got a question? Hands up. We'll start with the one in front and then we'll come to you. Go on, it's on.

"How do you deploy Promtail on the cluster — as a DaemonSet?" The question was how do I deploy Promtail. As a DaemonSet in Kubernetes — that's what we've focused on. But because it supports Prometheus's static targets and DNS service discovery, you could actually deploy it using Chef or Puppet and configure it with static targets if you wanted to; it's already extensible enough to do that. There's a PR to do journald/journalctl support. And there's a Fluentd integration as well — a lot of people already use Fluentd, so we've got a Fluentd integration if you want to send logs to Loki that way. With Fluentd, you have to manage the labels yourself, because Fluentd doesn't have the Prometheus service discovery.

Cool, there was a... oh, go on then.

"Can you tail the live logs coming in — like, literally see them scroll past?" Good question: can I tail the live logs? Two answers — one is yes and one is no. There is a CLI that ships with a fake tail: it will ask for logs, stream them, then ask for the next set of logs, and simulate tailing, Comet-style. The challenge is — internally, Loki is actually all built with gRPC streaming, so it would be relatively straightforward to do this properly. The problem is that it's really hard to get the logs delivered in the right order, and so I've actually much preferred the approach of "here's a tail, but by the way, I'm only giving you logs that are a minute out of date", and using that minute window to reorder things. I mean, I do want to do this — it's something everyone asks for, and we definitely want to put it in Grafana as well — but finding the right solution has been a bit tricky. If you've got any ideas, I'd love to hear them. Cool.

"So, two questions. First, is it possible to search for content in the logs?" I completely omitted the fact that we have distributed grep as well. So yes: you can push down a regular expression, it will get pushed out to all of the replicas and executed locally, as close to the data as possible, and then the results are streamed back. So you can do it with grep, you can do it with regular expressions — RE2s. Roughly, it has the shape of the sketch below.
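A toy Go version of that "distributed grep" idea — the regex (RE2 is exactly what Go's regexp package implements) is pushed to every shard holding chunks, executed next to the data, and the matching lines are streamed back and merged. The shard layout here is obviously illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// grepShard stands in for one replica: it runs the query locally against
// its own chunks and streams matches back to the querier.
func grepShard(lines []string, re *regexp.Regexp, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for _, line := range lines {
		if re.MatchString(line) {
			out <- line
		}
	}
}

func main() {
	// Two "replicas", each holding a slice of the log data.
	shards := [][]string{
		{"level=info msg=ok", `level=error msg="bad gateway" status=502`},
		{`level=error msg="oom" status=500`, "level=debug msg=gc"},
	}
	re := regexp.MustCompile(`status=5..`) // the pushed-down RE2 query

	out := make(chan string)
	var wg sync.WaitGroup
	for _, s := range shards {
		wg.Add(1)
		go grepShard(s, re, out, &wg) // fan out: one worker per shard
	}
	go func() { wg.Wait(); close(out) }()

	// Merge the streamed results as they arrive.
	for line := range out {
		fmt.Println(line)
	}
}
```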
We've also got a design doc at the moment — I don't think it's public yet, but it will be as soon as I get around to clicking that button — about a more interesting query language for these things. But at the end of the day, it's always going to be brute force. We're not going to index the content of the logs, so any querying you do is going to mean distributing the work over gigabytes or petabytes of data and really churning through it.

"Second question, related to bits you mentioned just now: how do you distribute it, especially if you have a setup consisting of multiple data centers?" I don't have a great answer to that — I don't believe many great answers exist. I'm a big fan of centralising everything: pick one of the data centers and push all your logs to it. That's not a great answer, I understand. This is one of the things that Thanos, the Prometheus query system, does really nicely, and we could probably do something similar with Loki, but it's so early days right now that I haven't given it a huge amount of thought. We just push all of our logs from — I think we've got about 15 clusters — we just push them all to one cluster.

Cool. Any more questions? There was one at the front; we'll come back to you.

"Is there any kind of access control integrated already?" No. So, one of the things I didn't really touch on is that it's multi-tenanted: internally, a user ID is propagated with all of the requests through all of the services, and it isolates different tenants for you. But you're supposed to do the authentication in some kind of reverse proxy in front of it. At the moment, we just ship it in a way that injects a fake user ID. Obviously, when we run the hosted version, we've already got a multi-tenant authentication layer that authenticates you against the right instance and sends the right instance ID. There are tons of these kinds of reverse proxies that will do authentication. If you want to run it in a multi-tenant way and have access controls over who can access what log data, we kind of leave that to you — but I think we've given you the basic tools to do it (there's a toy sketch of the reverse-proxy idea below). The problem is trying to find a standard for doing more than that; it's just not something I want to get involved in.

"Just on the Grafana side, it would be interesting to give developers the logs of their applications, and not all the logs." Actually, Frederic opened a PR about that recently — figuring out a way to effectively promote the namespace label to the tenant ID. So there's an issue; go and put your comments on it, and we'll probably do something to make that happen.

Let's keep going back, and then we'll come back to you.

"Thanks, Tom. Have you run any benchmarks comparing the performance of Loki to other solutions — Elastic, for example?" It's too early to run benchmarks at the moment. We've got some compression benchmarks from when we were designing the chunk format, so we know we get about 8x compression — that rounds up to 10x. The thinking behind the TCO argument is pretty thorough: we've modelled how much it should cost to operate in terms of GCS operations, S3 operations and things like that, so we're pretty sure we're well below a dollar per gigabyte in TCO. But we've not done any benchmarking of read/write speeds yet — the project's too early, really.
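Going back to the access-control answer: here's a minimal sketch of the "do the auth in a reverse proxy in front of it" model. The tenant lookup is a placeholder — real deployments would check OAuth, basic auth, mTLS or whatever — and the Loki address and listen port are assumptions; the X-Scope-OrgID header is how Loki/Cortex carry the tenant ID.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// tenantFor is a stand-in for real authentication: it maps a bearer token
// to a tenant ID. Entirely illustrative.
func tenantFor(r *http.Request) (string, bool) {
	if r.Header.Get("Authorization") == "Bearer team-a-token" {
		return "team-a", true
	}
	return "", false
}

func main() {
	loki, err := url.Parse("http://localhost:3100") // assumed Loki address
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(loki)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		tenant, ok := tenantFor(r)
		if !ok {
			http.Error(w, "unauthenticated", http.StatusUnauthorized)
			return
		}
		// Set the tenant ID header so Loki isolates this tenant's data.
		r.Header.Set("X-Scope-OrgID", tenant)
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```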
"I think you mentioned this in your talk, but how do you deal with high cardinality from a log line — for example, a URL path or something?" Yeah, so how do we deal with high cardinality? We don't. The whole point of Loki is that we don't have that problem. We've built it so your labels have to be the same as your Prometheus labels, in a world where we say: don't use high-cardinality labels. You wouldn't put a URL or a user ID or an IP address into a label, because that would just explode the index. And in fact, in Loki — because it inherits from Cortex, which is this big multi-tenant, horizontally scalable Prometheus thing — we've got limits. You're only allowed a certain number of values for a given label; I think it's 100,000 or something. If you try to write more values than that in a certain time window, we'll just reject the writes. So the high-cardinality data should go into the log stream itself, and then you can pick at it with regexes and things like that.

Right, shall we — Carl, shall we come back? Oh, after this one we've got to come back to the front.

"Is there a way to mask data, like passwords or customer names, that kind of thing?" Not with Promtail and Loki right now, but it's a commonly requested feature. I know there's the ability to do it with Fluentd — you can put filters in Fluentd to do that kind of thing. I think we'll probably end up having to do something like that in Promtail, because it's so commonly requested. But not yet. A good question, though.

I think we've answered this question already, so shall we do this one? "Having metrics and logs in the same place sounds like a great idea — it's clearly saved you a lot of time and effort and energy. Are there any drawbacks to having these two things in the same place? Or is that where this piece of technology is heading?" So, let's define "in the same place". These are not stored in the same processes; they're not even necessarily stored on the same machines. The unification is more at the index level. When I've talked about using the same code as Prometheus, the same index as Prometheus — it's not literally the same index, it's another instance of the Prometheus index. By doing that, you can still have them completely isolated from each other, and you can switch between them really easily because they've got the same metadata. Obviously, it's a company — we offer hosted versions of all of this — and if you centralise your data and a meteorite hits that data centre, of course you're going to lose all your data. So there are some arguments against centralising it all in one place, for sure.

Any more questions? No? Right at the very back — last one. Go on, run, Carl.

"Thank you for your talk. What about alerting?" What about alerting? Alerting! We're quite excited about doing alerting on log data, to be honest. Are you familiar with a project called mtail? It's by Google; it effectively allows you to write regular expressions that match on log lines and then get exported as metrics. A lot of our ideas around alerting in Loki are basically to do that: allow you to give patterns that match log lines, turn the matches into counters, and then you just use Prometheus alerts. The other nice thing about all of this is that you can do it in the same query language as Prometheus, and you can combine it with existing time series metrics. So for instance, you can alert if the number of error log lines per request goes over a certain number or a certain ratio — the requests coming from a metric, the error log lines coming from the logs.
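A minimal sketch of that mtail-style idea, using the Prometheus Go client — the metric name, the error pattern and reading from stdin (as a stand-in for tailing a file) are all made up for illustration:

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"os"
	"regexp"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter of log lines matching the error pattern, exported so ordinary
// Prometheus alerting rules can fire on it.
var errorLines = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "log_error_lines_total",
	Help: "Log lines matching the error pattern.",
})

func main() {
	prometheus.MustRegister(errorLines)
	errRe := regexp.MustCompile(`level=error`)

	go func() {
		scanner := bufio.NewScanner(os.Stdin) // stand-in for tailing a file
		for scanner.Scan() {
			if errRe.MatchString(scanner.Text()) {
				errorLines.Inc()
			}
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With a counter like that scraped by Prometheus, an alerting rule can combine it with whatever request metric you already have — something along the lines of rate(log_error_lines_total[5m]) / rate(http_requests_total[5m]) > 0.05, where http_requests_total is an assumed, pre-existing metric.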
So I think that's something I'm really excited about; I hope to come back next year and talk about how we're combining them in single queries, maybe. Thank you very much.