Good morning, everyone. Today I'll talk about Loki, which is Prometheus-inspired logging. My name is Suresh. I work as a senior software maintenance engineer at Red Hat, and I mostly work on the OpenShift product. You can visit this GitHub URL to see the projects I normally work on, and after the session, if you have any questions regarding the Loki logging stack, you can drop me an email at this email address.

This is the quick agenda for today. First, I'll give you a quick introduction to Prometheus and Grafana, and the challenges we currently have with the logging stack. Then I'll talk about how Loki overcomes those challenges, the features of Loki, a brief look at Loki's architecture and the components it includes, the query language we use to query Loki, the features Grafana provides in its UI for Loki, and finally how to install Loki and troubleshoot common issues.

To understand Loki better, it's important to know what Prometheus is. Prometheus retrieves metrics from collection points using exporters and stores them in a time series database. While collecting these metrics, it attaches metadata in the form of labels. It talks to the Kubernetes service discovery, gets information about all the pods running in Kubernetes, and fetches the metrics from those. It's simple to install and operate, and because of Prometheus it's now really easy to monitor what's going on with your applications. Grafana is a dashboarding solution where you can see how many resources your application is consuming, like CPU, memory, or storage. You can integrate more than 30 data sources with Grafana.

As we have Prometheus for monitoring, what solutions do we have today for the logging stack? Either we use a centralized log service in the cloud, like AWS or Azure, or we set up a centralized service like Elasticsearch, which captures all the logs from your cluster and creates an index out of them.

The problem is this. Let's say I have a three-tier application running in my Kubernetes cluster. It has a load balancer that balances traffic to my application server, and my application server writes data into the backend database. Once it is deployed, how do we track how it is performing in the Kubernetes environment, or investigate an issue with the application? In the current scenario, we get an alert in Slack for the particular issue, and we open Slack to read about it. Then we open the dashboards to see how the resource consumption looks. To get a better understanding, we open the Prometheus UI and enter a PromQL query to learn more about the consumption. To correlate it with the logs, we open some log aggregation solution like Kibana, pull the logs for that particular timestamp, and verify what was happening with the application during that time. If it is, let's say, a high-latency issue, we may use a distributed tracing tool like Jaeger to see whether a particular service is causing the issue or it's the entire application. And once we have a better understanding, say my service is overloaded, we just scale it up and fix it.
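To make that middle step concrete, the kind of PromQL query you might type into the Prometheus UI looks like this; it's only a sketch, since the namespace value is a placeholder and the metric assumes cAdvisor container metrics are being scraped:

```
sum(rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m])) by (pod)
```

That gives you per-pod CPU usage over the last five minutes, which you then have to line up by hand against the logs in Kibana for the same time window.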
So we have to do a lot of context switching just to understand the issue before fixing it, which wastes a lot of time. How does Loki overcome this? Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Unlike other log aggregation systems, it does not create a full-text inverted index the way Elasticsearch does. Whenever Elasticsearch receives a log, it uses the entire log line to build its index, putting all the metadata, labels and everything, into the index. So your index grows to roughly the size of your logs: if your logs are around 100 GB, your index may be close to 100 GB as well. Loki does not do inverted indexing. Instead, it uses the labels and creates a small index based on that metadata alone.

The project started around March 2018 and was launched at KubeCon in December 2018. As of now, we have nine releases of Loki. On the right side of the slide you can see the configuration file for one of the components, Promtail. It looks just like a Prometheus configuration itself: you have scrape configs and relabel configs in there.

Now, the features Loki provides. With other log aggregation solutions, people tend to log less, because putting each and every thing into the logs consumes a lot of storage. But ideally logging should be cheap; we should not log less just to save storage. Since Loki does not do inverted indexing and only indexes the labels, appending the log lines to chunks identified by those labels, it is very cheap and does not consume much storage. If I have a log volume of 100 GB, the Loki index is only around one or two GB.

It is easy to deploy. With every release of Loki we get a Docker image; we can just pull that image and run a Loki stack, or we can use the Helm charts to install Loki. A single command can deploy the entire Loki stack. Its architecture is simple; I'll go through it in detail in the coming slides. It's easy to scale: since it is not doing inverted indexing, and two of the three Loki components, the distributor and the querier, are stateless, it is very easy to scale out your Loki instances. And we can easily correlate the logs with the data we are getting in Prometheus. There is a feature in Grafana where we can split the window, run a PromQL query in one pane and have the logs in the other, and correlate the logs for that particular timestamp in a single view.

This is how a typical logging architecture looks: you have a Kubernetes node running some pods, a logging agent capturing those logs and forwarding them to a logging service, and you visualize the logs using some logging data source. In the case of Loki, we have three pieces: Promtail, Loki, and Grafana as the data source. Promtail captures the logs from your nodes and forwards them to the Loki instance; Loki does some processing on those logs, and you visualize them from Grafana. To capture the logs, instead of Promtail you can also use Fluentd or the Docker logging driver plugin.
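The Promtail configuration slide isn't reproduced in this transcript, so here is a minimal sketch of what such a config might look like; the Loki hostname and ports are placeholders, and the push API path differs between Loki releases:

```yaml
server:
  http_listen_port: 9080

# Records the last line read per log file, so Promtail does not
# re-send old lines after a restart.
positions:
  filename: /tmp/positions.yaml

# Where to push the logs. Older releases used /api/prom/push,
# newer ones /loki/api/v1/push; check the docs for your version.
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the pod's namespace into a "namespace" label.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```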
Let's talk about Promtail. Promtail runs as a DaemonSet: it runs on each and every node of your Kubernetes cluster and captures the logs from all the pods running there. The primary responsibility of Promtail is to discover the targets, attach labels to the log streams it receives from the pods, and push those logs to the Loki instance. To avoid sending duplicate logs, it keeps a positions file where it records the last line it has read from each log file. It also adopts a retry-then-discard strategy: if for some reason your Loki instance is not available, or Promtail is not able to push the logs to Loki, there is a configuration in Promtail where we can specify the maximum number of retries, and it will retry that many times before discarding the logs.

Promtail has the following ways to discover and scrape logs. One is file target discovery, where we specify the log file names to scrape from. We can use Kubernetes service discovery to get information about all the pods running on that node and fetch the logs from those pods. If you are using the journal daemon on the node, you can use journal scraping to fetch all the logs going to the journal service. And if you have a centralized syslog server where you forward all the logs of your environment, you can configure Promtail to pull the logs from that syslog server. Then there is relabeling: every scrape config can have a relabel config where we can transform the labels. For example, if I don't want specific logs from a specific pod, I can configure that in the Promtail file to drop those logs.

The pipeline is used to transform a single log line, its labels, and its timestamp. It has several stages: a parsing stage, where it parses the incoming log and extracts data from it; a transform stage, where it transforms that extracted data; an action stage, where it performs actions like attaching labels or a timestamp to the log line; and a filtering stage, where we can filter the logs, deciding which lines to accept and which to discard.
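To illustrate those stages, here is a hedged sketch of a pipeline inside a Promtail scrape config; the JSON field names (level, time) are assumptions about the application's log format, and the available stages depend on your Promtail version:

```yaml
pipeline_stages:
  # Parsing stage: extract fields from a JSON log line.
  - json:
      expressions:
        level: level
        ts: time
  # Action stages: promote the extracted level to a label and
  # use the extracted time as the entry's timestamp.
  - labels:
      level:
  - timestamp:
      source: ts
      format: RFC3339
  # Filtering stage: drop debug-level lines entirely.
  - match:
      selector: '{level="debug"}'
      action: drop
```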
So in a nutshell, this is how the Loki architecture looks. In Loki we have three components: the distributor, the ingester, and the querier. The red lines on the slide show the write path of Loki and the blue lines show the read path. Whenever you are writing data into Loki, the distributor is the first component to get the logs. Once the distributor receives the logs as streams, it hashes each stream's labels using a hash ring. For example, if I'm getting a log with the labels component="printer" and level="error", and the log line is "printer is not supported", it will create a hash based on those labels, derive a stream ID, and forward all the incoming logs with those labels into that stream. The next time a log with the same labels arrives, instead of creating a new stream ID, or creating a new index entry with all the labels again, Loki just uses the same stream ID and forwards only the log message instead of everything. Once it gets the logs, it compresses them and creates chunks.

The distributor sends each stream to the appropriate ingester. The ingesters sit in a hash ring, and each ingester owns a range of stream IDs. Whenever the distributor receives a log, it checks the stream ID and forwards the log to the ingester whose range matches the incoming stream ID; it uses consistent hashing to assign a log stream to an ingester. The ingester is the component responsible for writing your logs to the backend storage. It has several states. PENDING means a new ingester is ready to join the ring and is just waiting for a handoff from an ingester that is leaving the ring. The second state is JOINING, where the ingester is inserting its tokens into the ring and preparing to receive streams. The third is ACTIVE: an active ingester is writing the current log data into chunks. This is what it looks like on the slide: Loki creates a chunk per set of labels and keeps appending logs to that chunk; when the chunk is full, it flushes it out to the persistent storage. JOINING and ACTIVE are the states in which an ingester can receive writes. The fourth state is LEAVING: when an ingester is shutting down, it goes into the LEAVING state and hands off all its in-memory data to the ingester that is ready to join the ring. Ingesters in the ACTIVE and LEAVING states are the ones that respond to incoming queries.

As I mentioned, chunks are compressed and marked as read-only when the current chunk reaches its capacity. If a chunk has reached its capacity it is flushed out, or if it has not received any logs for a particular amount of time, it also gets flushed out to the persistent storage. For the persistent index we can use, for example, DynamoDB or Cassandra.

The last component is the querier, which is again a stateless component. The querier receives an HTTP request for data. It forwards that request to the ingesters to get the recent data which is still in memory and not yet flushed to persistent storage, and it also queries the index to get the historic logs matching that particular query. Since Loki shards your data and replicates it to avoid data loss, we may receive duplicates in that query; once the querier receives the data from the ingesters, it deduplicates it and then returns it to the data source.

To query the logs, we have LogQL. You can use the CLI or the LogQL query language to query Loki. One way is a log stream selector, where you specify labels to fetch the logs matching those particular labels. For example, if you want to fetch the logs for the label job with the value mysql, it will fetch all the logs matching that particular stream; you can specify multiple labels in the query as well. You can also filter the logs with LogQL: if you want all the log lines containing, let's say, the error keyword, you specify that keyword in a filter expression after the query, and if you want only the log lines that do not contain, let's say, the timeout keyword, you use the same query and just add a negation to the filter. You can also use aggregation operators with LogQL: if you want a count of the log lines over the last five minutes, you can do that. It's the same as your PromQL.
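Putting that together, the queries from this part of the talk look like the following; the text after each # is annotation rather than part of the query, and the instance label is a placeholder:

```
{job="mysql"}                        # stream selector: all logs with job="mysql"
{job="mysql", instance="db-01"}      # multiple labels in one selector
{job="mysql"} |= "error"             # keep only lines containing "error"
{job="mysql"} != "timeout"           # keep only lines NOT containing "timeout"
count_over_time({job="mysql"}[5m])   # count of log lines over the last five minutes
```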
Now, Grafana's UI for Loki. There are a few features Grafana provides here. The first one is no pagination: by default it shows you a thousand log lines in a single window, so you don't have to switch between multiple windows to see the historic logs. It shows up to a thousand lines by default, and that limit is configurable. You can also filter the logs: since you can query Loki from the Grafana UI, you can narrow things down using filter expressions. Grafana also provides a nice ad-hoc statistics feature. Let's say I am fetching the logs for my kube-system namespace, I have three pods running there, and somehow it is logging a lot; I want to find out which particular pod is producing most of the logs. Using the ad-hoc statistics in Grafana, you can see which pod is logging the most.

Then performance: since it shows a thousand log lines by default, your browser may freeze for a moment, because it takes at least around 500 milliseconds to fetch a thousand log lines. To avoid the performance impact, Grafana first shows you the graph while it renders the logs; in the graph you can see the trend over time of how many logs it has received. Then it renders the first hundred lines, and after that it renders the rest. We also have the explore and split view feature, where you can split the window in Grafana: in a single view you can see the logs from Loki and also query Prometheus to get the metrics. And you can watch live logs as well, just like doing tail -f on a log file.

To install, you can use the Helm charts; it's very easy, a single Helm command installs the entire logging stack. If you have a Jaeger instance running in your environment, you can integrate that Jaeger instance with Loki. You can also just pull the Docker image and use the docker run command to run a Loki instance. Or you can install it locally by cloning the source code and building it yourself.
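For reference, the install paths I just mentioned look roughly like this; the Helm repo URL and chart name have moved around between releases, so treat this as a sketch and check the current Loki docs:

```sh
# Helm: add the chart repo and deploy the whole stack in one command.
# Newer releases host the charts at https://grafana.github.io/helm-charts.
helm repo add loki https://grafana.github.io/loki/charts
helm upgrade --install loki loki/loki-stack

# Or run a single Loki instance straight from the Docker image.
docker run -d --name=loki -p 3100:3100 grafana/loki:latest
```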
Common issues you can face with Loki: the first is a Loki bad gateway 502 error, which means Grafana is not able to talk to your Loki instance, maybe due to a network issue, or because the Loki instance itself is not running. The second common error is "data source connected, but no labels received," which means Grafana is able to connect to Loki, but Loki has not received any logs; it may be a Promtail configuration issue where the required logs are being filtered out. And missing logs are the same: you need to check Promtail to see if it is filtering out the logs, or if there is a configuration issue in the scrape configs.

These are some references I used for this talk. There is nice documentation on the Loki GitHub page; you can just go there and you'll get all the information about Loki. Thanks a lot. Any questions?

Performance-wise, Promtail is better; Promtail can ingest up to around 17 MBps of logs at a time. Sorry, I missed your question. It depends on how many logs Loki is getting. If you have a cluster of, let's say, 100 nodes, a single Loki instance may not be sufficient, and it also depends on how much you are replicating your data. If you are creating, say, five or ten replicas of your logs, you may need more than one instance.

We don't have that kind of access control with Loki as of now. The tendency is that we can isolate the logs by namespace, but from the user's perspective, we don't have any access control as of now. Yes, we can. Yes, that is configurable: you can configure for how long you want that data or that index to be available. By default it's around seven days, and you can tweak it to more or less than seven days. It's strictly on the size. If the Loki instance is not able to process the logs in time, it will stop receiving them, because unless Promtail gets a success response from Loki, it will not send new log lines. So if your Loki instance is not able to reply to Promtail with a success status code, Promtail will not send any new logs; it will retry until it reaches the maximum retries you have configured, and if Loki has not been able to respond by then, those logs will be lost. As of now, no; as far as I know, we don't have any plans for that downstream. Can we take it offline? We are already running out of time. Thank you. Thank you.