Hey everyone. So we'll just start the session. First, let me introduce myself. I'm Rohit, co-founder and VP Engineering at Facets Cloud. Facets Cloud is a platform engineering product: you get a canvas where you can drag and drop components to build a blueprint of your software architecture, and using that architecture you can create and manage environments. That's a short introduction to our product; you can check out our website for more. But today we are here to talk specifically about Loki.

To introduce the other speakers: we have Srijith, a DevOps lead at Capillary Technologies. He has been at Capillary for nearly a decade and has seen through all sorts of modernization efforts, and the Loki adoption is just one among them. Fortunately, we could work with Capillary Technologies as one of the early adopters of the Facets implementation of Loki. And Pramod, a lead at Facets, was responsible for developing the Loki offering in Facets. We co-built it with the Capillary folks, Srijith was very heavily involved, and a lot of his feedback and learnings from taking it to production at a large scale helped us mature our offering greatly. Let me just share my screen. Yeah.

I've already introduced the speakers, so I'll also briefly introduce Capillary Technologies. Capillary Technologies is a customer engagement leader in the CRM business, serving a massive user base. If you have ever been to a retail store and been issued a coupon, or earned points from a transaction, there is a fairly good chance you have been served by Capillary Technologies. They empower a lot of major global brands, and with GDPR and similar restrictions they have deployments across the globe: Europe, the US, Southeast Asia, and India specifically. Moreover, they have a large engineering operation of around 200 developers, which is why today's talk should be especially interesting: switching to a new logging solution is not just a technical problem, it's also a cultural switch.

So, as I said, we co-built the Loki offering with Srijith and Capillary Technologies, and in today's session we'll share some of our learnings; we'll basically walk through what we went through. We won't go into deep technical detail right now, because the audience may have varied levels of exposure to Loki, but we are happy to take any questions, so do put them in the chat. At the end of the session we'll have a Q&A where we'll try to answer as many as possible. That said, I'll hand it over to Srijith. Srijith, you can take over.

Thank you, Rohit, and thank you for having me. Let me share the screen here; I hope it's visible. Hello everyone, and thanks for joining. My name is Srijith. I'm working as a DevOps engineer at Capillary Technologies and have been with the company for around nine years now. All these years I have been working with infrastructure and cloud resources across technologies, and recently, while upgrading our observability platform, we decided to move our logging strategy to Loki.
In this webinar I will be walking through the journey and experience we had with Loki. When it comes to logging at Capillary: we have customers across the globe, we have infrastructure across the globe, and we churn out around 1.5 TB of logs a day on average, which needs to be retained for about a year due to compliance requirements. Logs are referred to for backtracking, for investigating real-time issues, and for going back to specific dates when a customer reports a problem, so to a large extent these logs are used for investigation purposes.

The legacy mechanism we had for handling and storing logs used a Fluentd collector that sent all our log data to EFS as temporary storage. On top of that we attached Wetty, an open-source web terminal, through which developers, or anybody really, could run Linux commands on those log files and grep for the strings they wanted during an investigation. After three days we pushed those logs to an S3 bucket as our archival strategy and kept them there. As time went by, the number of applications grew, the volume of logs written by those applications grew with every feature addition, and EFS became an IOPS bottleneck. We had to increase the IOPS from time to time, and later we had to create an EFS replica to separate reads from writes to deal with the slow IO. Beyond that we were not able to scale up any further, and as log volume increased the whole setup needed scaling; we had to put additional effort into managing all the different moving components, creating alerts, and monitoring them, which became a tedious job at some point.

Now, what is the ROI on this? Only a fraction of these logs are ever used for troubleshooting. There are billions of log lines being written and terabytes of data being stored; it is a large chunk of data from which we could identify application performance, see how the API calls are behaving, and much more, if we could leverage it. This is the architecture diagram of our legacy mechanism: a Fluentd pod fetches the log files from the different applications and sends them to EFS, and the web terminal sits on top of that EFS volume, through which developers run their Linux commands to grep the data.

Considering all these issues, we were looking for a better solution that would improve our developer efficiency, solve these problems, and give us more insights from the log data we have. These were the four candidates we picked for our evaluation: ELK, Parseable, Loki, and New Relic. Taking each in turn: ELK is scalable, and we used to run it in our smaller, subsidiary application clusters, but in our experience it is expensive to operate, you need more nodes as the log volume grows, we hit availability and storage issues, and retrieving archived logs was quite a problem for us.
Then comes Parseable, which was the simplest of them. It didn't have HA, but it was very inexpensive; however, it was relatively nascent, and some of the use cases we required were not quite available yet. Then New Relic, which is well known in the community and popular, but it is expensive: we have a huge amount of data going in that we would need to pay for, and while it is a completely managed solution, the cost is something we had to weigh before moving ahead.

Then comes Loki, which is scalable, open source, widely accepted in the community, and has HA. It integrates natively with the Grafana stack, so whatever Grafana Labs has built can be integrated easily with Loki, and it is not as expensive as ELK or New Relic. Considering all these factors, we decided to go ahead with Loki: it is scalable, we can build alerts and dashboards for our insights, it is cost-effective, it runs entirely within our own infrastructure, and it is popular in the community. We were already used to Prometheus and Grafana for metrics, so why not have the logs in the same Grafana as well, which makes it easy to handle both together? These points helped us finalize Loki as the solution to go for. At this point I will take a pause and invite Pramod from Facets Cloud to explain the Loki architecture. Welcome, Pramod.

Hey, thanks, Srijith. I'm sharing my screen; I hope you can see it. Hi everyone, this is Pramod. I have been part of Facets for more than a year now as a technical lead, and today I'll give an overview of Grafana Loki. So, what is Loki? Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. All these buzzwords are what make Loki stand out from other logging solutions. It is developed by Grafana Labs, it has a lot of popularity and is widely accepted by the community, and, just as Prometheus is for metrics, Loki is entirely for logs. It is very cost-effective, open source, and still actively maintained.

Next is minimal indexing. Unlike other logging solutions, Loki does not index the entire log content. Instead, entries are grouped into streams, and only the labels are indexed; those labels are Prometheus-style labels, which I will talk about on the next slide. There is a small diagrammatic representation here where the log data is about 10 TB and the index is only about 200 MB. This improves performance, and query times become much faster because only around 200 MB of index data needs to be searched. This is not an exact representation; it will vary from environment to environment and configuration to configuration, based on the logs you push into the system, so keep that in mind. Next is how logs are indexed: you can see what is indexed and what is not. Only the timestamp and the Prometheus-style labels are indexed; they are called Prometheus-style labels because Loki uses exactly the same label model as Prometheus, and the log content itself is not indexed.
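As a purely illustrative sketch of this split (the label names and log lines below are invented for illustration, not taken from the talk), a single Loki stream can be pictured like this:

```yaml
# Illustration only: one Loki stream, written out as data.
# Loki indexes the label set (and timestamps); the log lines themselves are
# stored as compressed chunks in object storage and only scanned at query time.
stream:
  labels:                         # indexed, Prometheus-style key/value pairs
    namespace: payments
    app: checkout
    level: info
  entries:                        # NOT indexed, stored in chunks
    - ts: "2024-05-01T10:00:00Z"
      line: "GET /api/v1/orders 200 35ms"
    - ts: "2024-05-01T10:00:01Z"
      line: "GET /api/v1/orders 200 41ms"
```

Because only this small label index is consulted up front, every new label value creates a new stream rather than enriching an existing one, which is exactly the cardinality problem discussed next.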
One thing worth noting is that when you use fewer labels, you get better performance out of Loki. Next I will talk about what a log stream is, and about the cardinality problem. A log stream is a set of entries with exactly the same labels. Here we have three different lines, but if you look at the labels they are all the same, so this is considered a single stream; similarly, the other two lines share another set of labels, so they form another stream. In total we have two streams here.

Now, the cardinality problem. Let me introduce one more label, called node, for the first stream. We have three different lines, and we add a node label with values node1, node2, and node3. What happens is that the first line becomes its own stream, the next line becomes its own stream, and the third becomes a different stream again. If this happens in the logging system, say we introduce a label called IP and there are many users hitting a URL, all of those values get turned into labels, and that leads to the high-cardinality problem: too many unique streams. There is an example here: from one log line we have log levels, statuses, and paths. For each log level we can see the different statuses from that same log line, and for each status three different paths, so if you do the quick math, four times three times three, that leads to potentially 36 streams, and that is what causes the high-cardinality problem. We have actually handled this in our product by removing high-cardinality labels and packing them into the log line; I will talk about that on a later slide.

Next are the deployment modes. We have three: monolithic mode, simple scalable mode, and microservices mode. In monolithic mode, all the Loki components are packed into a single process as a single binary. This is useful only for getting started and doing some experimentation; monolithic mode handles roughly 20 GB per day of logs on average, which is not really adequate for production-grade systems. In simple scalable mode, the execution paths are split into three target groups, write, read, and backend, and each has its own use case: you can scale each target individually and increase performance, and it supports up to a few TB of logs per day (a small sketch of this split follows below). If you go beyond a few TB, you move to microservices mode, where all of Loki's components run separately; you can customize and configure each component independently and, based on the load, scale one component or another. Those are the advantages of microservices mode.
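As a rough sketch of what that read/write/backend split looks like in practice (this is not any particular Helm chart's values schema, just an illustration of the `-target` flag the Loki binary accepts):

```yaml
# Illustration: in simple scalable mode the same Loki image runs three times
# with different -target flags, so each path can be scaled independently.
loki-write:       # distributors + ingesters (write path)
  args: ["-config.file=/etc/loki/config.yaml", "-target=write"]
loki-read:        # query frontend + queriers (read path)
  args: ["-config.file=/etc/loki/config.yaml", "-target=read"]
loki-backend:     # ruler, compactor and other background components
  args: ["-config.file=/etc/loki/config.yaml", "-target=backend"]
```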
Now, how does Loki actually work? Unlike Prometheus, Loki does not pull; logs are pushed to it. For that we need an agent that collects logs from the nodes and sends them to Loki. In this case we are using Promtail. Promtail uses the same service-discovery library as Prometheus, as well as Prometheus-style labels, which are created by Promtail itself, and it is also a Grafana Labs product. Promtail pushes the logs to Loki. To visualize the logs you can use Grafana, querying with LogQL, or you can use a command-line tool called LogCLI, which also takes LogQL queries. With Loki you can set up alerts through Alertmanager, and you can define recording rules; all of that is possible from the Loki side. Here is a sample LogQL query, something like {namespace="loki", container="query-frontend"} |= "error", which filters on the query-frontend container in the loki namespace and greps for "error"; it returns all the error log lines from that container.

Next is the architecture. We have a write path and a read path. On the write path we have the distributor and the ingester; on the read path we have the query frontend, the querier, and the ruler. Whenever a log enters Loki, the distributor is the first component to receive it. It performs some validation checks, and it is the service responsible for determining which ingester the log needs to be pushed to. Once it reaches an ingester, the ingester sends the logs to long-term storage: an object store such as S3, GCS blob storage, or whatever your cloud provider offers. On the read side, the querier is responsible for handling all the queries coming from the API or from Grafana. The querier first hits the ingesters and looks for in-memory data; if that does not cover the requested time range, it falls back to the object store and fetches the data from there. The query frontend is essentially an optimization in front of the querier: it splits a very large query into smaller chunks, executes them concurrently, stitches the results back together, and returns them to the user, whether they come through Grafana or API calls. On the left-hand side you can see the ruler. The ruler is the component responsible for alerting rules as well as recording rules; whenever we create a rule we can specify the target, for example Alertmanager for alerts, and the ruler takes care of all of that. So that is the architecture.

Now let's talk about hashing. The distributor is the primary service responsible for determining the ingester, and it does that using consistent hashing plus a configurable replication factor. A stream is hashed using both the tenant ID and the label set, and Loki maintains a ring for each service. Ingesters register themselves into the hash ring with a set of tokens. Let me quickly show an example. In this diagram you can see a circle, which is the ring, and each ingester on the right-hand side registers itself with a set of tokens, which defines the range of the hash space it owns; those ranges continue around the ring up to ingester 4.
Whenever logs come in from Promtail to the distributor, the distributor hashes the labels of the stream and gets a hash value. Based on that hash, it selects the ingester whose token range contains the value, going clockwise around the ring. In this example the value falls into the range owned by ingester 2, and similarly, as you can see from the colors in the image, the other streams are placed onto two or three different ingesters. Now let's say we have a replication factor larger than 1; the example so far uses replication factor 1. With a replication factor of 2, the first copy of the stream is placed on ingester 2, and then, continuing clockwise, the next ingester is selected, in this case ingester 3, so two copies of the first stream are stored on ingesters 2 and 3. That is how the hashing works.

Next, the integration of Loki into the Facets platform. With these learnings, as well as Capillary's insights, we were able to integrate Loki into Facets with the right production-grade configuration. We did load testing using Loki Canary and also tested with actual data from various deployed applications. From within Facets, we support the storage backends, including credential handling (Loki needs credentials for a cloud storage backend, and that is handled automatically), and we also provide a Kubernetes-native MinIO storage option. Along with that, we automatically create a Loki data source and integrate it into Grafana, and we deploy a quick log search dashboard so logs can be accessed easily through filters. We also drop high-cardinality labels to keep Loki efficient and performant, and we pack those dropped labels into the log line itself, so that at query time, after the results come in, we perform a transformation and recover them as filters within Grafana (a sketch of how this can look in Promtail follows below). There is one more option we have enabled, called auto-forget unhealthy instances: whenever an ingester goes down abruptly without doing any cleanup, it stays in the ring in an unhealthy state, and because of that Loki will not accept any more logs; normally it has to be removed from the ring manually through the API. That is why we went with this approach and enabled auto-forget, which removes unhealthy instances automatically. There are a few more optimizations, tuned from Capillary's insights, and with all of those we now have a very solid Loki integration in the Facets platform. On the right-hand side you can see an example from a Facets resource: we deployed an application, and from that application we can get the logs and list all the log lines for that particular service. That's it from me; Srijith, you can take over. Thank you.

Thank you, Pramod, for the detailed overview of Loki.
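As a sketch of the "drop and pack" idea described here (the label names are assumptions for illustration, not Facets' actual configuration), Promtail's pipeline stages can embed high-cardinality values into the log line instead of indexing them as labels:

```yaml
# Promtail scrape config sketch: keep the indexed label set small without
# losing data. The `pack` stage embeds the listed labels into the log line as
# JSON and removes them from the label set, so they no longer create new
# streams but can still be recovered at query time (for example with LogQL's
# `unpack` parser in Grafana). Assumes relabeling has already attached
# `pod` and `node` labels to the targets.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - pack:
          labels:
            - pod     # high-cardinality: one value per pod
            - node    # high-cardinality: one value per node
```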
A note for the audience: you can post your questions in the chat box; we will take them up at the end of the session and answer them. Thank you.

So, we had selected Loki to be implemented in our Capillary clusters, and we had to decide how to implement it and how the deployment strategy should be defined. We came up with this plan: first deploy it in the non-production clusters; once it is stabilized and signed off, move to a production cluster, watch its stability, and then roll it out to the rest of the environments. After that, slowly phase out the legacy setup.

This is the timeline we had while deploying Loki in the non-production clusters. Since Facets Cloud had integrated Loki as a module, it was just a matter of changing the configuration in the blueprint and deploying Loki to our clusters, so the release and installation were seamless. Later we conducted load tests for Loki in the non-production infrastructure, within whatever limits it had, to understand Loki's behavior and scaling patterns. Along with the load tests, we wanted to see how the other metrics behave as the load varies, so we defined metrics that Loki already exposes to Prometheus, such as log drops, resource usage, scaling patterns, chunk size, and pushes to the S3 bucket, and added alerts on top of them to see how the log patterns vary and how Loki behaves (a sketch of one such alert follows below).

From the developer perspective, we were planning to completely remove the old legacy setup and bring in an entirely new Loki-based logging strategy, and we wanted that transition to be seamless for the developers and to ease adoption of the new technology. On that basis, we created reference documents and ran KT sessions and workshops on how Loki works. We then asked developers for feedback, asking them to file Jira tickets telling us how Loki was behaving and what sort of issues they were facing, which gave us an idea of how effective Loki is and how it improves the efficiency of developers' day-to-day work. Once we decided the timeline for removing the legacy setup in the non-prod clusters and got it signed off, we uninstalled the old setup and ran entirely on Loki in non-production.

Later, when we decided to go for the production release, we did capacity planning. That planning was based on the log volume, the load tests we had conducted, and a linear projection on top of them, and from that we came up with the configuration we would take into the production cluster. As I said, we have clusters across the globe: some generate a heavy volume of logs, some not so much, and there are clusters sitting in the middle with a medium amount. Considering that, we decided to start with a medium-volume cluster, which would give us a balance between the two and let us see how effective Loki is in that kind of cluster.
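As a hedged example of the kind of Prometheus alert meant above (the metric names are standard Promtail/Loki metrics but are worth verifying against the versions you run, and the thresholds are illustrative, not Capillary's actual rules):

```yaml
# Prometheus alerting rule sketch for the Loki write path.
groups:
  - name: loki-health
    rules:
      - alert: PromtailDroppingLogs
        expr: sum(rate(promtail_dropped_entries_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Promtail is dropping log entries"
      - alert: LokiDiscardingSamples
        expr: sum by (reason) (rate(loki_discarded_samples_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki is discarding samples ({{ $labels.reason }})"
```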
Having decided on the cluster, we planned the release and went ahead with rolling out Loki there, and we deliberately kept the legacy setup running in the same infrastructure, since we might need some time to stabilize Loki in production.

Then came the monitoring phase. As soon as we deployed Loki into the production cluster, we ran into different roadblocks, and we categorized them into four buckets: rate limits, where the ingestion rate we had calculated was very different from what we saw in real time; ingester issues, where the ingester pods went into out-of-memory kills and restarts; S3 rate limits, where the chunk pushes from the ingesters to S3 were getting throttled; and finally query problems, where we started getting timeouts whenever we ran a query, which was an issue on the read path. Let me go through each of these issues and how each was handled.

First, rate limiting. The ingestion rate from Promtail through the distributor to the ingesters was something we had sized entirely from the non-production clusters. In the production cluster, the configured value was never enough, and the distributor started throwing errors that the incoming log volume was too high and needed to be controlled. We identified this quickly and started increasing the limit. Looking more closely at the ingestion rate: Loki is designed for a multi-tenant environment where you can send logs from many clusters to a single Loki installation. In that scenario, one tenant's ingestion rate shouldn't affect log collection for the other tenants or bring the Loki environment down for everyone, since all the tenants share one Loki installation; the ingestion rate limit is what keeps the volume at a constant, optimal rate. But at Capillary we run one Loki per production cluster, so we don't have a multi-tenant environment; there is only a single tenant whose rate can be adjusted. Even so, you can't raise the limit beyond a point, or the ingesters get overloaded with a huge amount of logs and you keep adding more resources to them. It is always better to find a sweet spot that keeps the logs within limits without overloading the other components. So we set a safe upper bound, and with the combination of the ingestion rate limit and the ingestion burst limit we have around 100 MB per second in the majority of the clusters. At scale, most of the clusters have a similar configured limit, but the actual rate is much lower than that, which keeps ingestion under control so the ingesters aren't overloaded and can easily keep up, which in turn lets them flush chunks to S3 properly.

Next, we received a lot of out-of-memory issues after increasing the ingestion rate. A large number of streams were being created on each ingester, and on investigating further we saw that the number of nodes in production and in non-production was different; we obviously have more nodes in production, and the node name was one of the labels we were sending to Loki and having indexed.
The number of streams increased automatically for the same reason. So we decided to drop the node label, after concluding that in our usual day-to-day troubleshooting process the node label isn't of much importance. We dropped the node label directly in Promtail, which reduced the number of streams and brought down the memory usage of each ingester. But then, to our surprise, we got out-of-memory kills on the ingesters again, and this time it wasn't all the ingesters going OOM, just one or two of them. That is specifically because streams are allocated per ingester: since a given stream always goes to a single ingester, if the application generating that stream has a high volume, and at some point produces a sudden burst of logs, the memory usage of that one ingester shoots up and it gets OOM-killed. We wanted something to control the size of streams, and that is where Loki's automatic stream sharding comes in. We enabled it with a limit of around 3 MB per stream, so anything above 3 MB gets sharded, with a shard key added to that stream, which distributes those streams across different ingesters rather than a single one. After that, we saw all the ingesters using a similar amount of memory, consistently.

Next, out-of-disk issues. We noticed on different occasions that when an ingester goes into a restart, WAL replay kicks in, and once the replay starts, the ingester's disk starts filling up with WAL files. I reported this issue in the Loki community but haven't had an update on it so far. It doesn't follow a pattern: the files get cleared after some time, but there is no fixed size, count, or time at which they clear, and that was causing problems. As a workaround, I had to over-provision the disk a bit and keep alerting on those disks, so that when such an issue happens we get an alert and can decide whether to expand the disk online. That is the only workaround I could put in place. Since completely releasing everything into our production clusters I haven't faced this issue again; it was something we noticed during our testing and stabilization phase.

Unhealthy ingesters: as Pramod said, auto-forget is not enabled by default. When an ingester restarts incompletely, or a node goes into a hung state and the ingester doesn't respond to heartbeat requests, it is marked unhealthy; the ring is then considered unhealthy and the distributor cannot commit logs to it. To remove these unhealthy ingesters automatically, the auto-forget flag was enabled (a configuration sketch of these write-path settings follows below).
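To make the write-path fixes above concrete, here is a minimal sketch of the relevant Loki configuration; the numbers are illustrative rather than Capillary's exact values, and the option names come from recent Loki releases, so check them against the version you run:

```yaml
# Loki config sketch: the write-path knobs discussed above.
limits_config:
  ingestion_rate_mb: 100          # per-tenant steady ingestion limit (MB/s)
  ingestion_burst_size_mb: 150    # allowed burst above the steady rate
  shard_streams:
    enabled: true                 # automatic stream sharding
    desired_rate: 3MB             # split streams pushing more than ~3 MB/s
ingester:
  autoforget_unhealthy: true      # drop unhealthy ingesters from the ring
                                  # automatically instead of via a manual API call
```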
Now, S3 rate limits. The ingesters push the log chunks into the S3 bucket, and as the log volume grew, the number of API calls grew with it, and S3 started throttling our requests and returning errors. In this scenario we had to limit the number of API calls going to the S3 bucket, so we adjusted the chunk size, the max chunk age, and so on, to minimize the API calls.

Then comes the querying issue. Everything I have discussed so far concerns Loki's write path. On the read path, where the queries come in, there are a lot of components sitting in front of the querier, such as the query frontend, the Loki gateway, Grafana, and the ingresses, that the query passes through, and by the time the querier responded with the result we would hit a timeout. Some of those gateways had timeouts of only 30 or 60 seconds, so to solve this we had to go into each and every component and reconfigure the gateway and proxy timeouts.

As I said, we had launched Loki in the medium-sized cluster and came across this bunch of issues, and with the learnings from them we reconfigured our Loki blueprint according to how each issue was solved. With those fixes baked into the Facets Cloud blueprint, we were confident that after releasing this configuration in another cluster we would not face the same issues, and with that confidence we went ahead with the other clusters as well. At most we had to adjust some CPU and memory configuration; beyond that, we did not see the same issues in the other clusters, and they are running fine. Once Loki was stabilized across clusters and our developers started using it in the production clusters, we slowly started removing the legacy setup. First we made the logs unavailable in Wetty, while Fluentd was still pushing the logs to EFS and the S3 bucket; once we were completely confident and everything was in the green zone, we decided to phase out the whole legacy system. Thus Loki was successfully rolled out in our production clusters.

Now, what is the business impact we could get from the logs? The logs are available in Grafana, and LogQL queries, similar in style to PromQL, can be used to query all the logs from Loki. The business metrics we can generate from the logs, such as how many requests are coming in, which API calls are happening, and what latencies we are seeing, we started recording using Loki recording rules; they are pre-computed and sent to Prometheus, and that is where we look at the metrics generated from the log lines (a minimal recording-rule sketch follows below). Alerts can also be configured, both for Loki itself, where we monitor Loki's performance using Loki alerting rules, and for applications: different applications throwing exceptions, errors, and so on are alerted on using Loki's alerting rules as well.

If you look back at the whole journey: Loki is an excellent product and it does its job very well, but you won't get there with just the default setup; you have to make your own adjustments for the load, the volume, and the cluster you are running in. Yes, there are tons of configuration options in Loki that need to be handled and understood.
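As a minimal sketch of the recording-rule setup described above (the stream selector, metric name, and Prometheus URL are placeholders, not Capillary's real ones, and the `ruler` option names should be checked against your Loki version):

```yaml
# 1) A rules file loaded by the Loki ruler: a LogQL metric query evaluated on
#    a schedule, with the result written out as an ordinary Prometheus series.
groups:
  - name: business-metrics
    rules:
      - record: app:http_errors:rate5m
        expr: sum by (app) (count_over_time({env="prod"} |= "ERROR" [5m]))
---
# 2) The ruler section of the Loki config (a separate file), telling it where
#    to remote-write the resulting samples.
ruler:
  remote_write:
    enabled: true
    client:
      url: http://prometheus.example.internal/api/v1/write
```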
That requires an engineering investment, and extensive domain knowledge is needed as well, because there are a lot of components running across the board that have to be managed and understood. With that, we'll conclude the session. Thank you again, everyone, for joining, and I'll be happy to take questions.

Okay, so I'll quickly go through the questions; maybe first we'll address the ones you have to answer, Srijith. Ashutosh, your question is more directed at Pramod, so we'll take that at the end. Vishnu Raj has asked: what are the read latencies you are seeing, Srijith? Read latencies: I just looked it up, and in your largest cluster I do see a few hundred seconds at P95; does that sound about right? I don't remember the exact numbers, Vishnu, but that could be right. So Vishnu, I do see P95 at around a few hundred seconds there. Did you do anything to address the latency itself, did you try any optimization to reduce the query latency, Srijith, or are you living with it? We changed the compression, which was Snappy, to gzip, and we also changed the chunk size and query timeouts, as well as the querier sizing and scaling, to handle the query performance.

In general, Vishnu, from my point of view, for a specific, localized time window the query performance is not an issue, but when you are looking at events over a day or more, that's when Loki starts acting up, and you probably need to live with some amount of query latency, because there are round trips to S3 and so on. Also, as Srijith mentioned, there are only a few hundred queries happening; the more important thing was to collect all the data and to create recording rules and dashboards, so that people do not need to run large analytics-type queries, because those get converted into metrics in Prometheus and answered from there.

Vishnu had a couple more questions: what caching system is used, and can you share the specifications of the cache size? I think Memcached is used for caching; I don't have the sizing with me, but it's there. What is the quick log search in Grafana? Vishnu, that is a custom dashboard we ended up creating so that people don't have to write the common LogQL queries, at least for searching by application, searching by time range, and searching for certain patterns within the selected applications. For those we created a custom dashboard in Grafana, so it's just the name of a dashboard; it is not something packaged with Grafana. And what is the daily compressed log volume Capillary is running with? I think it's over a TB; that is the compressed day-to-day increment in S3, on average. And Vishnu, yes, the S3 throttling was removed by increasing the chunk size; the idea is basically to reduce the number of API calls. Recording rules can create metrics from logs and send them to Prometheus, yes. Can you explain how you configure a recording rule, Vishnu asks. Basically, with specific patterns you can extract metrics and send them to Prometheus; maybe you can elaborate, Srijith.
Yeah, so basically, Vishnu, you are just configuring the same LogQL query in the recording rules. The ruler is what is used for this: in the ruler's ConfigMap you define your rules, and a rule is basically the LogQL query you want to generate metrics from, on which you run your aggregations, like sum, count, and so on, which produce metrics with labels. Then in the ruler config you specify which Prometheus this data should be sent to, and the ruler takes care of the rest. Maybe if you can find the documentation and put the link in the chat, Pramod, that would be helpful.

Vishnu also asked how many concurrent users are actively using Loki, and what chunk size you have configured to work around the rate limit; do you know the exact numbers? I don't have those numbers with me. Okay, we can get back on that; we'll plug our Slack channel at the end, so Vishnu, maybe we can connect there and discuss. I can answer in the Slack channel once I get the correct numbers, and share them there.

Now let me go back. Suresh Jain has asked: I was going through Loki and the analyses done by different people; one of the issues with Loki is that since it does not index all the text but rather performs a distributed grep, the search time is long. Do you have information on the best ways to deal with it, and how much were you able to improve it? Basically, the querier side is what you need to tune. Depending on the time window, the querier downloads the chunks from the S3 bucket, loads them into memory, and that is where the grep and searching happen. So what you should look at is how effective the chunk size is, how large the querier's memory is, and how the querier scaling is set up; those are the things that should help make queries faster (a small read-path tuning sketch follows below). I think there is autoscaling set up there, right, Suresh, it has an HPA? Yes, we have a querier HPA.

And Jojan Arthagiri has asked about the auto-forget-unhealthy setting: when running ingesters as StatefulSets, this flag helps remove bad ingesters from the ring, but (he continued in another message) the attached PVC seems to end up with a corrupted WAL when the node terminates abruptly, and when the new ingester attaches itself to the PVC it cannot read the corrupted WAL files, resulting in a failure. Is this something you have experienced? Corrupted WAL is not something I have experienced in my testing until now; the WAL filling up the disk is something I have faced. Let me see if something comes up around it. What we experienced is that the WAL files, which should be cleaned up periodically, do not get cleaned up in the case of such unclean terminations, and when the ingester comes back up we have seen a pattern where the WAL tends to explode all of a sudden and fill up the disk, which is what Srijith mentioned: he had to over-provision the disk for the time being, and he has raised an issue with the Loki community for that.
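For the read-path tuning just discussed, here is a sketch of the kind of Loki options involved; the values are illustrative (Capillary's actual numbers were not shared), and some of these options have moved between config sections across Loki versions, so verify them for the release you run:

```yaml
# Loki config sketch: read-path / query-performance knobs.
ingester:
  chunk_encoding: gzip             # compression codec (the talk moved from snappy to gzip)
  chunk_target_size: 1572864       # ~1.5 MB compressed target per chunk
  max_chunk_age: 2h                # flush chunks to object storage by age
limits_config:
  split_queries_by_interval: 30m   # let the query frontend split large queries
querier:
  max_concurrent: 8                # concurrent query workers per querier
```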
Next: how did you achieve no log drops, did you use any queuing service? I guess there is no queuing service as such; it's mostly by tuning the distributor and ingesters, making sure they are always available to accept requests from Promtail, and there is also the write-ahead log, so nothing is lost and you are not relying on the ingester's in-memory data alone. Anything else to add, Srijith? You can tune the Promtail retry mechanism, reconfiguring how long it should keep retrying, and along with that there is a maximum age setting: suppose the ingesters are down for the last 15 minutes, then any log older than that might not be accepted if that configuration is in place, so you can increase those settings with no log drops in mind. Otherwise, once the ingester comes back up, if the timestamp of a log is very old, Promtail has to drop those logs, because the distributor will not accept them due to that setting. And monitoring this was an important aspect too: Promtail gives a metric on dropped entries, and there is also Loki Canary testing, where a small application generates a log and then makes sure it is searchable through Loki within a certain period. This is something we actively made sure is in place in all systems; it is one of the big red flags if it is not working. And then of course you have component-level alerts that help you isolate where the issue lies. Over a period of a few weeks I think we could optimize all the scaling parameters: chunk sizes, queue sizes, all of those.

Can Loki handle a monolithic application that generates a few TB of unstructured logs, logs where we do not have an intelligent way of creating labels? Okay, so this is technically possible, and you can do it in two ways. One is at the Promtail level: you could configure a log pattern and extract labels from it, which will be a bit more CPU-intensive at the collection end. The other way, which I would suggest: do not index anything extra in the labels, make sure that when you are searching you are searching a reasonable time window, and use LogQL itself, which has support for specifying a log pattern and extracting fields from it, so you don't really need indexed labels for this. You can extract the fields and even filter by them. What happens is that the log lines are fetched via the index and filtered by the time window, and then LogQL's pattern option extracts the fields and filters on them; that happens at the querier, in memory, and doesn't have to hit S3 again, so it's a more reasonable balancing act, unless of course you need to build dashboards keyed on a label or something like that.

I think I have gone through most of the questions. Does it support GCS as well, apart from S3, for storage? Yes, it supports GCS as well. And Umesh has one more question.
How many nodes have you deployed to handle 1.5 TB of logs every day? That's an interesting one; can you give a ballpark of how much you are spending on the ingesters and on the whole setup? I think I am running around seven to eight nodes on average. What configuration? Eight core, 32 GB, and around seven to eight nodes of that. And do you retain the logs somewhere in local storage before you flush them out? How long do you retain them in the ingester, 24 hours? No, no; maybe two hours or less. Thirty minutes or less, actually, because with the max chunk age it keeps pushing to S3. So roughly 30 minutes stays in the ingester; whichever comes first, the 30 minutes or the size limit I have configured, it is pushed to S3.

If we are not storing the logs locally, then how does a 24-hour query behave, since it has to fetch everything from the object store? To search, you actually don't need to fetch all the data; you only need to fetch the indexes back from S3, which is usually a smaller file, and then you fetch the exact log chunks you need. But yes, for a query range of 24 hours you are definitely looking at probably a few seconds of wait time to get the results back; it is not going to be instant, Umesh. With Loki, do we need compute-intensive instances, or memory-optimized ones? No, we are not running on compute-intensive instances; we are using M-type, general-purpose instances everywhere. Loki's CPU usage is not that intensive, I haven't seen it go beyond two CPUs, but memory is where the demand is. He has also asked which components, ingester or querier, need to be more memory-intensive, and which need to be more compute-intensive. Ingesters and queriers are both memory-heavy, considering the number of queries and the volume they handle.

Cool, I think we have covered most of the questions. There is one question from Ashutosh: he wanted a little more clarity on the consistent hashing ring. Pramod, if you can share the ring slide, maybe I can take a dig at it; meanwhile, keep the questions coming, guys. You can see my screen, right? Yeah. So, Ashutosh, I'll try to explain this. This is a general principle, not just a Loki thing; you would see it across distributed data stores. When you have multiple instances across which data is spread, the strategy for deciding which data goes to which instance is determined by this method called consistent hashing. What you do is take an integer range, in this particular case zero to, I think, the max of an unsigned int, and place it on a circle like this, and you place the ingesters spaced around it. You say zero to x is handled by ingester 1, x to y by the next one, y to z by the one after that, and so on. That data is stored in Consul, I think, in the case of Loki. Then the distributor that receives the log line determines the stream, and based on the stream and the tenant it computes a hash; the hash function spits out a number in this particular range.
Then it figures out which range that number belongs to and selects the ingester adjacent to it; that is how you assign an ingester to a stream. And if you have a replication factor of more than one, you also assign it to the next neighbor of that ingester. The idea is that even if one ingester goes away, you have a simple strategy for who needs to take care of what data, and you have redundancy: if data needs to be copied over from one instance to another, it is still available in the neighboring instance. So that's the hashing ring, which is where the interesting thing happened for Srijith. One app label, I think the InTouch API, started generating a lot of logs, and everything was going to one particular ingester, which caused it to go out of memory. But once he enabled automatic stream sharding, the streams got split up, and now the load is much more evenly distributed across all the ingester instances. So this hashing ring concept is something everyone needs to be aware of while using Loki, because you can narrow down many issues by understanding this part of the architecture. I suppose that is clear; if you have any specific questions on it, let me know.

Also, Adi has linked our Slack community in the chat, so for those who wanted exact numbers, sizing, and things like that, we can connect over Slack and discuss; and of course there will be more issues coming up with Loki and whatnot, so it will be a good place to share knowledge and our experiences as well. There is also an AI bot there that answers questions about Facets, so those of you who use Facets, do try it out and see whether it gives you good responses. With that, I think we can wind up the session. Thanks, everyone, for joining; see you around in the community, and we'll have more of these sessions. Thanks, everyone.