We're going to go ahead and get started. So, I'm Sarah Moore. And I'm Derek Kavanaugh. This is our first talk at KubeCon, so we're super excited to be here. Today we're going to take you with us on a journey about the default configuration in your logging stack, and how shipping the defaults can cause unforeseen challenges.

So let's start with a little story. The date was November 9th. We received a message from a fellow engineer saying, "I can't seem to find my logs in Grafana." Huh, that's not good. So we started poking around a little bit, and we confirmed the problem: the logs had gone missing. Today we're going to walk you through the case of the missing logs and how we solved it.

Before we get too far into the details of solving this problem, let's go through an overview of the PLG stack. The PLG stack consists of three main components: Promtail for shipping logs, Loki for aggregating and storing logs, and Grafana for interfacing with logs. Promtail runs as a DaemonSet in any cluster you pull logs from. By default, Promtail will collect pod logs written to standard out or standard error and ship them to Loki. Loki is the log aggregation component. Loki exposes an API for Promtail to push logs to, and it ships both the raw logs and an index to a storage backend, typically a storage bucket. We'll talk more about indexes in a little bit. Grafana is the UI for interfacing with Loki. We can create a Loki data source in Grafana in order to run queries against the logs. This gives us the ability to interact with the logs in a number of ways, from simple ad hoc queries to complex dashboards with integrated alerting.

All right, so back to the case of the missing logs. At this point, what we've done is confirm the problem: the logs are indeed missing. So what's next? Well, Derek, you're a good SRE. What would you do in this situation? Well, Sarah, I would probably query Loki to look at the logs for Loki. Did we mention we work at a company called Recursion?

So looking at the pod logs for Loki, we see a few errors that look like this: "Maximum active stream limit exceeded, reduce the number of active streams, or contact your Loki administrator to see if the limit can be increased." Hey, Derek, do you know who the Loki administrators are? I thought that was you. Wait, I thought that was you. Shit, I guess it's us.

So as talented Loki administrators, we decide to start by rolling the Loki pods. I mean, let's just see if these errors resolve themselves. You can guess how that went. When we continue to see these errors, we do a bit of searching and decide, hey, if we've hit the maximum active streams, let's just increase that limit. If you don't know what streams are, we'll talk about that in a minute. So we take the action to increase the maximum active streams, and everything starts working again. Logs are showing up in Grafana. Developers are joyous. We're not seeing this error from Promtail anymore. Okay, sweet, we're gonna wash our hands of that and mark this as another case closed.

Yeah, just kidding. It only took a few days for the problem to start happening again. For us, increasing the limit on active streams masked the real issue, and it only worked for a short amount of time. At this point, we knew we needed to dig in and find the true root cause. But before we do that, we need to go over some fundamentals about how Promtail and Loki work. We're going to cover indexes, cardinality, active streams, and chunks as they relate to the PLG stack.
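One quick aside before the fundamentals: the stream limit we bumped lives in Loki's limits_config block. Here's a minimal sketch of that kind of override; the exact keys and defaults vary by Loki version, and the numbers below are placeholders rather than our production values:

```yaml
# Hedged sketch: raising Loki's per-tenant active stream limits.
# Verify key names and defaults against your Loki version's docs.
limits_config:
  # Per-ingester cap on active streams for a tenant.
  max_streams_per_user: 20000
  # Cluster-wide cap on active streams for a tenant, across all ingesters.
  max_global_streams_per_user: 50000
```

As the story shows, raising these numbers only buys time if the underlying cardinality keeps growing.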
So let's start with indexes. This is an example of what a log line might look like. On the left, we have some labels, and on the right, we have the content of the log line. Unlike other log aggregators, such as Elasticsearch, Loki does not index the entire text of the log. Instead, Loki only indexes the labels associated with the log. This means smaller indexes, which in turn make Loki consume less memory and run more efficiently. So to repeat: Loki only indexes the labels of the log line, creating smaller, memory-efficient indexes that increase scalability and performance.

Now let's define cardinality. By definition, cardinality is the number of unique elements in a group. Here we are looking at some logs. Each label set is unique because the pod label is different in each line. This means we have high cardinality. If we were to look only at the app label, the cardinality would be low, because that label is less dynamic. So cardinality can be measured by how static or dynamic your label sets are.

Next, let's take a look at active streams. Active streams are log lines being received by Loki with the same index. So here we have a pod whose logs are being sent to Loki. We can see the index for each log line is the same, so these will create one active stream. If we add a few more deployments, we get a few more active streams. So here in this picture, we have three. Active streams are logs with the same index being received by Loki.

So finally, that brings us to chunks. As streams of logs are sent to Loki, it chunks up groups of logs and pushes them to long-term storage. This is often just a bucket. So in this picture, this group of logs goes into chunk one for this index. As more logs come in, more chunks can be stored. When the active stream is closed, no more chunks for this index are added to the bucket. Additional or new active streams result in new chunks being written. So chunks are blocks of log data aggregated for a subset of logs with matching indexes.

So let's get back to our case of the missing logs. Where were we? We were seeing an error in Loki that looked like this. We tried rolling the pods. We tried increasing the number of active streams. Now what? Previously, we were looking at the first part of the log line, the part that says, hey, you've reached your maximum stream limit. Now that we understand some of the basics, we can better understand the second part, which says to reduce the active streams, aka the labels.

Great. So what labels do we have that are causing this problem? Well, at Recursion, we do tech-enabled drug discovery, which means we have machine learning models looking at images of cells that come out of our lab at random times of the day. This means we have very bursty patterns of usage. To handle this, we've built very dynamic clusters. For example, our machine learning clusters can scale up to thousands of nodes and tens of thousands of pods. So for us, what this means is that if Loki indexes our pod labels, which it does by default, and we have tens of thousands of pods running at any given time, that's going to lead to a lot of indexes and a lot of active streams. This is problematic, but hey, we're engineers, right? Derek, how do we fix this?

So let's take a look at how we can fix a simple high cardinality issue. The following log line is indexed off of four labels: app, cluster, pod, and filename. For the most part, we would consider the app and cluster labels fairly static.
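The slide itself isn't reproduced here, but the log lines in question look something like this; all label values are hypothetical:

```text
# Two log lines from two pods of the same deployment. Each unique label
# set gets its own index entry, i.e., its own stream:
{app="ml-worker", cluster="prod", pod="ml-worker-7d9f-abc12", filename="/var/log/pods/default_ml-worker-7d9f-abc12/app/0.log"}  INFO processing image batch 41
{app="ml-worker", cluster="prod", pod="ml-worker-7d9f-def34", filename="/var/log/pods/default_ml-worker-7d9f-def34/app/0.log"}  INFO processing image batch 42
```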
On the other hand, depending on how often the pods of this workload are created and destroyed, the pod and filename labels can be fairly dynamic. In order to decrease cardinality, it may be a good idea to drop the pod and filename labels. So how would we go about doing that?

This is where Promtail pipelines come into play. As mentioned earlier, Promtail's responsibility is to pull logs from a cluster and ship them to Loki. Before Promtail ships logs to Loki, it has the ability to transform them via pipelines. There are four stage types that a pipeline can leverage. The parsing stage type is used to parse and extract data from the current log line; extracted data is available for use by other stages. The transform stage type is used to transform data extracted by the parsing stage. The action stage type is used to take extracted data from the previous stages and perform one or more modifications to it. The filtering stage type is used to run custom stages or drop log entries based on a specified filter.

So thinking about our log line, we want to drop the pod and filename labels. Based on what we just discussed, there should be a stage type we can leverage to drop labels. Luckily for us, there is a built-in action stage called labeldrop. Hey, Sarah, what do you think this stage does? I don't know, Derek. Maybe it drops labels. How did you know? So this stage allows us to pass a list of labels we would like to drop. Let's take a look at some Promtail config and make this happen.

So here we are looking at a scrape_configs block from a Promtail configuration file. If you've worked with Prometheus scrape configs, this should look pretty familiar. This is where we can define pipeline stages by leveraging the pipeline_stages block. You can see that along with pipeline_stages, there is a job_name and a kubernetes_sd_configs block. There can be multiple jobs in a scrape config performing different tasks, so the job_name block uniquely identifies each job. Based on the name of this job, kubernetes-pods, we can assume it's used to scrape pod logs. The kubernetes_sd_configs block is the service discovery block; it's informing Promtail to scrape logs from pod objects.

So let's dive into the pipeline_stages block. The first item in this list is a parsing-type stage. We don't need to worry too much about this, but the gist is: if you're using the containerd runtime, use cri to properly parse the log line. If you're still using the Docker runtime, this would be docker instead. Finally, we can see the labeldrop stage. This accepts a list of label keys that we would like dropped from the label set before the logs are shipped off to Loki. So if either the filename or pod label exists in the label set, drop it.

With our labeldrop configuration in place, our logs now have the pod and filename labels dropped. Our label set now has much lower cardinality and should give us better performance and resilience. OK, so now we've gotten rid of the labels that were problematic for Loki, but what if we still want those values in our logs? Instead of dropping the labels outright, we can move them into the log content, where they're not indexed. Again, we can leverage the Promtail pipeline to accomplish this. We use the built-in parsing stage called replace, which is used to manipulate the content of the log line, the part that's not indexed. Let's build on the previous configuration to make this happen. So here's our previous configuration.
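Since the slide isn't shown here, this is a hedged reconstruction of roughly what that configuration looks like; the job name matches the one mentioned above, but the relabeling that attaches labels like app, cluster, and pod is omitted for brevity:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod        # service discovery: tail logs from pod objects
    # relabel_configs omitted; they attach labels such as app, cluster, pod
    pipeline_stages:
      - cri: {}          # parse the containerd/CRI log format (use `docker: {}` for Docker)
      - labeldrop:       # drop these labels before shipping to Loki
          - filename
          - pod
```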
We're going to inject the replace action type between the cri and labeldrop blocks. The replace block accepts two arguments: expression and replace. The expression argument allows us to use regex to pull out the specific parts of the log entry we would like to manipulate. Here we pass a dot-star expression wrapped in parentheses, which means we're capturing the entire log line. The log entry is then saved as a capture group, which we can reference in the replace argument with the dot-Value variable. In the replace argument, we're prefixing the pod and filename labels onto the log entry. So let's go back to the log entry and see what effect this has. You can see that the pod and filename values are now part of the log line, while the labels themselves still get dropped by the labeldrop action type. We'll sketch the full pipeline again in a moment.

So with this newfound knowledge, we decided to run an experiment and gather some metrics on just how impactful these configuration changes can be. We deployed two instances of Promtail as well as two instances of Loki. One Promtail instance, Promtail A in the diagram above, used the default scrape configs, while the other, Promtail B, was configured to drop the pod and filename labels. We then created a deployment with 5,000 replicas. This was a simple workload that just spit out some random logs. So that's 5,000 different pods with different pod and filename labels.

Now that we have 5,000 pods running, how can we analyze that data? Grafana offers a CLI tool called LogCLI, which allows us to interact with the Loki API in a number of ways. With this tool, we can leverage the analyze-labels option to audit labels and stream counts. Here's the full command. What it translates to is: pull all the labels from the past hour from the Loki endpoint and run an analysis. If we query the Loki cluster with no configuration changes, you can see that the total number of streams is over 5,000. This is due to the pod and filename labels being highly dynamic. Now let's take a look at the other Loki cluster, where Promtail is configured to drop the pod and filename labels. Keep in mind that both clusters are aggregating the exact same logs. As you can see, by just dropping those two labels, the number of active streams dropped from over 5,000 to just 54. That is an improvement of about 100x. This illustrates how quickly cardinality can get out of hand.

Now let's run some log queries and take a look at performance there. Again, we can leverage the LogCLI query command against Loki. For both Loki clusters, we are running the same query over the same set of logs. First we can see the number of chunks downloaded from our data bucket. The default Loki had to download a lot more chunks than the optimized Loki: 730 chunks versus only four. We can also see how much faster the optimized Loki is: close to two seconds versus just milliseconds. So, big improvements here with just this one little config change of dropping the pod and filename labels. It's about 40 times faster, with 97% less data being fetched.

So to close out the case of the missing logs: we reduced the cardinality of our log labels, and in turn Loki was able to ingest logs again. This not only closed the case of the missing logs, but as you can see from those numbers, it increased our performance and efficiency as well.
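Putting it all together, here's a hedged sketch of the final pipeline_stages block with the replace stage injected between cri and labeldrop. One caveat: whether the pod and filename values are addressable in the template as shown depends on their being present in the pipeline's extracted data at that point, so treat the {{ .pod }} and {{ .filename }} references as assumptions rather than a copy of our production config:

```yaml
pipeline_stages:
  - cri: {}
  - replace:
      # '(.*)' captures the entire log line; the capture is available to the
      # replace template as {{ .Value }}.
      expression: '(.*)'
      # Prefix the pod and filename values onto the (unindexed) log body.
      # {{ .pod }} / {{ .filename }} assume those keys exist as extracted data.
      replace: 'pod={{ .pod }} filename={{ .filename }} {{ .Value }}'
  - labeldrop:
      - filename
      - pod
```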
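And for reference, the LogCLI invocations from the experiment look roughly like this; the endpoint URL and the app label value are placeholders, and flags can differ slightly between LogCLI versions:

```sh
# Point LogCLI at your Loki endpoint (placeholder URL).
export LOKI_ADDR=https://loki.example.com

# Audit labels and stream counts over the past hour.
logcli series '{}' --analyze-labels --since=1h

# Run a query with --stats to see chunks downloaded and execution time.
logcli query --stats --since=1h '{app="log-generator"}'
```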
So here are some of our key takeaways. First off, every cluster is unique. What worked for us might not necessarily work for you. We recommend running the LogCLI cardinality analysis against your clusters before you try to make any changes. Upping your max active stream count isn't always the right solution; try looking at cardinality first. Finally, observability is the lifeblood of your systems. It's your ability to diagnose and fix problems, so it's worth spending the time to fine-tune your configuration for your use case.

Before we go, we want to say that we don't believe the defaults for the system are bad; in many ways they make a lot of sense. And honestly, we want to give a shout-out to the awesome developers who are building, working on, and supporting the tooling in the PLG stack. Genuinely, thank you. Also, we just have to say how amazing this error message is. It's pretty much the best error message ever. It's not very often that we get error messages that describe exactly what the problem is and how to fix it, so mad props there. We also have an awesome team here who came all the way from Toronto and Salt Lake City to be here for us, and then stayed up really late with us last night as we were rehearsing and practicing. They're awesome, so thanks to our team as well. And thank you all for attending our talk. We hope you learned something along the way, and we hope you enjoy the rest of the conference.

If there are any questions, feel free to ask at the mics, or if you're more comfortable coming up after the talk, you can do that as well. Yeah, so there are mics on the side of the room. This is being recorded, so if you want to step up to the mic so that we can hear you on the recording, we would love to answer questions.

Can I ask a question? How does reducing the number of active streams help you reduce the amount of queried log data? The number of log lines, I would expect, is the same either way. So is it about the chunk size, then? If you can explain a little bit.

Yeah, so the question was: if you decrease the active stream count, how does that proportionally affect the chunk size and the number of chunks? So when Promtail goes to compute the index and the stream, it's creating a hash of the labels. So every different label set creates a different stream, which means it's written to a different chunk. The actual volume of logs may be the same, but the number of chunks will drastically increase, which is actually less efficient, because ideally you want chunks to be full of logs before they're shipped off to your bucket. That's a lot more efficient for Loki to query. So when you have a lot of active streams, you have a lot of chunks with very few logs in them, and that creates additional overhead for queries.

Yeah, thank you. So basically, if our chunks are already full most of the time, this won't improve query speed; it would only help us reduce cardinality. Yeah, yeah, for the most part. Yes, thank you very much. Yes, thank you.

Hi, if I had to write some queries that were based on the labels that were previously in the index, the ones we took out, how would that affect my query speed now that they're no longer indexed?

So the question is: we moved a part that was previously indexed into the content of the log line. How does that impact performance if we actually want to query on that piece? Yeah, so of course, if your logs are behind a full-text index, those queries are going to be fast, right? But you have to think about it: with a platform like Elasticsearch, your index sizes are going to be massive, sometimes even larger than the data set itself. And a lot of the time that index is stored in memory, so it's going to cost a lot of money.

So when you take the label sets out of the index and you want to query on those values, what Grafana has come up with is query parallelization. When you query, you can add the value you're looking for to your query string as a text filter. In the backend, Loki parallelizes those queries across multiple backend pods to make the performance a lot faster. And you can customize that parallelization yourself; you can split it up by time. So you could say, for each query you're splitting up, do it by every hour that you're querying. The nice thing is that you can horizontally scale to your heart's content to make your query performance very fast, and then scale back down. So instead of holding all of this data in memory as an index and wasting money 24/7, you could, for example, use something like the KEDA autoscaler, scaling up a bunch in the morning and scaling down at night, or you could use an HPA if you wanted to. It just makes a lot more sense from a cost perspective to scale in and out and only have the resources available on demand.
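As a concrete sketch of that answer: once the pod value lives in the log body instead of the index, a LogQL line filter can still isolate one pod's logs, with the label selector keeping the query narrow. The label names and values here are hypothetical, matching the earlier examples:

```logql
{app="ml-worker", cluster="prod"} |= "pod=ml-worker-7d9f-abc12"
```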
Thanks so much. Thank you.

Thanks for the presentation. May I ask, when you started seeing this kind of failure, how high was the unique value count? Was that measured over the past seven days or the past one day? I'd also like to know how many resources you use to run Loki. Is that possible to share?

You're, it's a little hard to hear you. You're asking how many resources you need to run Loki?

Yes, how many resources do you use to run Loki? How many logs do you ingest per day? And when you started facing this kind of issue, as you mentioned, how many different unique values did you have?

Yeah, so, the first thing is that we deploy Loki through their Helm chart. And if you're using it in production, you definitely want to use the distributed model, where each component of Loki is split out into microservices. For me, I would start with the defaults, kind of like what we talked about, and see how things perform. Are you having, like, OOMKills with your pods, or any sort of CrashLoopBackOffs? Maybe you need to start by just adjusting and tweaking some limits or the replica counts on the cluster, and then go from there. And like we said, using the analyze-labels option with LogCLI is really helpful for seeing whether you're running into these cardinality-type issues. And it's OK to have a lot of active streams, as long as that cardinality isn't affecting things in too negative a way. So like I said, every use case is different, and we're more than happy to talk more about this and your specific use case if you want to chat with us after.

So when you say defaults, where are the defaults coming from? We're just talking about deploying the Helm chart, just doing, like, a helm install of Loki without overriding any values. So, like, the open source Helm chart? Yeah.
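For clarity, "shipping the defaults" here means something like the following; the repo and chart names are the community ones, given as an example, and may differ from what you use:

```sh
# Install Loki from the community Helm chart with no value overrides.
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-distributed   # microservices ("distributed") mode
```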
Do you attempt to monitor, for example, from Prometheus, whether you're approaching some limits? Or do you have some kind of proactive approach, so you don't get into the scenario where you have to come back and work really fast to fix the issue, and maybe lose logs? What kind of monitoring, or which parameters, do you monitor?

Yeah. So the question was: what kind of monitoring do we have in place to make sure that Loki is healthy? Loki actually provides a really nice dashboard that gives you a lot of different metrics; you should be able to find it in their repo somewhere. So we just have that, and it has a lot of great information. And you can definitely use Grafana alerting to alert on any metrics that you're concerned about. It also does a really good job of showing you how your chunk data allocation is looking, so you can make sure you're optimized there. It shows query latency, all of the goodies that you would expect in order to proactively respond to these types of issues. Also, we have engineers who use the system. So who needs monitoring when they can just tell you when your stuff's broken, right? Like, just let us know eventually. I'm kidding, by the way.
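If you wanted to go one step further than dashboards, an alert on stream growth is one option. This is a hedged sketch of a Prometheus alerting rule, not something from the talk: the metric name loki_ingester_memory_streams and the threshold are assumptions to verify against the metrics your Loki version actually exposes:

```yaml
groups:
  - name: loki-cardinality
    rules:
      - alert: LokiActiveStreamsHigh
        # Assumed metric: in-memory stream count per ingester; the threshold is
        # a placeholder, ideally set relative to your configured stream limits.
        expr: sum(loki_ingester_memory_streams) > 8000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Loki active stream count is approaching the configured limit"
```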
I was amazed by this presentation, in particular because I also attended Cloud Native Rejekts this weekend, and there was a similar talk about, basically, similar issues that we observe with Prometheus and metrics in general. The high churn rate, and the number of labels generated by Kubernetes in general because of this layering model, is huge. And it's amazing to see that, basically, in the same field, well, in the similar field of observability, similar tools face the same issues. And I guess my question is: for this KubeCon, for example, observability and OpenTelemetry are the hot topics, and we see tools for metrics, logs, and tracing trying to get as close to each other as possible. So are there any future developments which allow, well, at least Prometheus and Loki, to address the same problem in similar ways, or maybe working together?

Yeah. So are you asking how to combine those two types of tools? Or are you saying? Well, combining those tools, and basically, both tools, although working in slightly different areas, face the same issue of having too many labels sometimes, and basically, we just get flooded.

Yeah, you can most definitely have very similar issues with Prometheus in that regard. And the solutions are also kind of the same. It's quite funny to see that you're also dropping. Yeah, for sure. So yeah, same thing: you should definitely audit your cardinality with Prometheus as well to ensure you don't run into those issues. I'd also say take a look at Tempo, which is another Grafana tool that integrates distributed tracing with your logs and metrics and allows you to quickly jump between all of your metrics, logs, and traces. So I think that's a pretty cool tool that we're currently investigating.

And actually, that was one of the questions to the current presentation, because with metrics, among other things, I see the pod. Because I see that this particular pod, with this particular pod ID, is misbehaving. And from Loki, I usually expect to be able to find that particular pod with that ID in the logs and actually trace what's happening. So the fact that you kind of put it aside was like, why? Why would you do it? I mean, the filename is OK, but the pod ID is quite essential.

I think for us, it ends up being that we don't really care that much about the pod label, because the more interesting thing in our queries is the app label. We'll just look back through the app's logs until we find something. So for us, getting rid of the pod ID doesn't really harm us much. We don't find that our users have dashboards built off of it; it's, I would say, rather uninteresting in our use case. And so I think when we made this change, we wanted to make sure we were doing it with as little friction as possible for our engineers. So if there were one or two engineers who actually cared about that pod label, or the pod ID, it was still there in the log content for them if they needed it. I think we're going to wrap up, but thank you so much again.