A famous Formula One driver once said that if everything is under control, you are not moving fast enough. Good morning everyone. I am going to talk about logging at scale using Graylog: 100,000 messages per second, a billion-plus messages per day.

So what is Graylog? Graylog is open-source log management software that actually works. It has search, analysis, alerting and a lot more. But before getting into Graylog, how many of you here are Ola users? Anybody? Okay. Ola is an app-based taxi booking platform: you can install the app and, anywhere in India, in 100-plus cities, book cabs using your mobile phone. It has millions of users, it is the majority market leader in India, and it does a million-plus bookings per day. All these good things eventually translate into technical scalability challenges on the engineering side, including our centralized logging platform, which has all three V's of a big data problem, that is volume, velocity and variety, because Ola is internally powered by a lot of microservices, each generating its own data and its own logs.

But why Graylog? Let's look at some of its features. It has a great UI, built for viewing logs; it is designed for logs. As you can see here, on the left-hand side you have the different keys present in the log, if the log is structured; the query goes in the search bar, and the search results are on the right-hand side, of course. We can build beautiful dashboards, and it has support for a lot of different widgets, like the count widget, the quick values widget and so on. But the best thing, which is the USP for me, is that this is one single place where we can do everything. If I'm a user, I can log in and do my RCAs based on the logs generated by an application for an issue. If I'm an admin, I can do a lot of my management work on Graylog as well.

Graylog is internally powered by Elasticsearch, where all the data is stored, and it queries Elasticsearch for the results. Managing Elasticsearch at scale is itself a huge pain, if you have ever handled Elasticsearch, and Graylog gives me some manageability on top of it. In this example, I have a configuration for keeping 100 billion messages in Elasticsearch at any given point in time; when that limit is crossed, the indices get rolled over and the oldest ones are deleted accordingly (a rough sketch of this kind of rotation appears below).

Graylog can take inputs from a variety of sources and send output to a variety of destinations. I can take input from a TCP port, an HTTP listener, Kafka, syslog; there are various input sources and various output destinations as well. It can do real-time log analysis and alerting with a feature called streams. You can think of a stream as a separate pipe attached to your input where you add some filter, and you get this filtered stream to do your log analysis, dashboards, alerting, whatever you decide to do.

With all these features we were very happy, and we decided to go ahead and give Graylog a try. But of course, we had a learning curve, scaling from almost zero to what it is today. It's a journey, and I want to share the challenges and the other things we faced along the way.
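To make the rotation idea concrete, here is a minimal conceptual sketch of message-count-based index rotation, the kind of housekeeping Graylog performs against Elasticsearch on your behalf. This is not Graylog's actual implementation; the index naming scheme, thresholds and Elasticsearch Python client usage are assumptions made purely for illustration.

```python
# Conceptual sketch of message-count-based index rotation (what Graylog
# manages for you). Index names and thresholds are made up for illustration.
from elasticsearch import Elasticsearch

MAX_MESSAGES_PER_INDEX = 20_000_000   # roll over the write index past this count
MAX_INDICES = 20                       # keep at most this many indices overall

es = Elasticsearch(["http://localhost:9200"])

def rotate_if_needed(prefix: str = "graylog") -> None:
    # Assume indices are named graylog_0, graylog_1, ...; the highest number
    # is the active write index (at least one index is assumed to exist).
    indices = sorted(es.indices.get(index=f"{prefix}_*").keys(),
                     key=lambda name: int(name.rsplit("_", 1)[1]))
    active = indices[-1]
    if es.count(index=active)["count"] >= MAX_MESSAGES_PER_INDEX:
        next_num = int(active.rsplit("_", 1)[1]) + 1
        es.indices.create(index=f"{prefix}_{next_num}")   # roll over
        indices.append(f"{prefix}_{next_num}")
    # Delete the oldest indices once the retention limit is exceeded.
    for old in indices[:-MAX_INDICES]:
        es.indices.delete(index=old)
```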
So what was our initial design? We were just getting started with microservices. Docker is what powers the containers, and it is managed using a Mesos/Marathon cluster. Our applications run on Docker, which generates logs, and the final destination of our logs is the Graylog cluster, which is the rectangle you can see over here. It is tightly coupled with Elasticsearch, and it has an API and a UI. But this was a new design and we wanted more reliability; we had not yet tested Graylog. So for added reliability we added one more layer, Kafka. These are the three main components: my log source is Docker, Kafka gives me reliability, and the final destination is Graylog.

Now, Docker applications generate logs on standard out, and we collect the logs using a Docker log driver. We chose the Fluentd log driver so that we can send the logs directly to the next component, a Fluentd producer. Fluentd is a log aggregation platform: it can collect messages from a variety of sources and send them to a variety of destinations. It receives the log messages from Docker on a TCP port and sends them as-is to Kafka. Kafka is a pub-sub system which can receive messages from a variety of producers and serve them to a variety of consumers. All the messages are received at Kafka, and we have a retention period on them. Since it is a pub-sub system, somebody needs to consume the messages, so we have a Fluentd consumer which consumes them from Kafka and sends them to Graylog.

A bit more about Kafka: we chose Kafka for multiple reasons. One, it is a tested, reliable system. Second, if I want to do maintenance on the rest of the cluster, I can just pause consumption, do the maintenance, and start consuming again. If I lose messages for any reason, I can replay them. If I want long-term archival in some other storage, I can take another pipe from Kafka and store the data in a cheaper storage system like, say, S3. And if in the future I want to move on from Graylog to something else, I can again take another pipe from Kafka and build a completely new log search platform while keeping the existing system running. So Kafka gives us a lot of flexibility.

Finally, the Fluentd consumer also does one more job: it formats the messages in GELF (Graylog Extended Log Format) before sending them to Graylog, so that Graylog can index them properly. So this is the full pipe of the initial design; we will simplify it a lot.

This was done, we tested it on staging in a week's time, and it was moved into production. All good. But then we ran into the first problem: there was a huge lag in the Graylog cluster. Why was that? We found that the Fluentd consumer had huge log buffers. Fluentd itself is a fast system, but we were seeing a huge buffer building up there. That didn't seem right, because Fluentd sends the messages to Graylog over UDP. We assumed the network to be reliable, so we used UDP; Fluentd should not have been the bottleneck, because UDP doesn't even require acknowledgements and sending should be fast. So what we did first was upgrade everything around it: we upgraded Kafka to the latest version at that time, 0.9, Graylog to the latest 2.0 beta at that time, and Elasticsearch to the latest 2.3.1, but it did not solve the problem. It turned out that the Fluentd plugin itself was slow, so adding more CPU or RAM, or changing the environment, did not help. So what we did to solve this problem was remove that component altogether.
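For context, here is a minimal sketch of what that consumer stage was doing conceptually: reading raw log lines from Kafka, wrapping them in GELF, and pushing them to Graylog over UDP. This is not the actual Fluentd plugin; the broker address, topic name, Graylog host and the kafka-python client are assumptions made for the example.

```python
# Minimal sketch (not the actual Fluentd plugin) of the consumer stage:
# read raw log lines from Kafka, wrap them in GELF, send them to Graylog
# over UDP. Broker, topic and Graylog host names are made up for the example.
import json
import socket
import time
import zlib

from kafka import KafkaConsumer   # pip install kafka-python

GRAYLOG_ADDR = ("graylog.internal", 12201)   # GELF UDP input port

consumer = KafkaConsumer(
    "docker-logs",
    bootstrap_servers=["kafka1.internal:9092"],
    group_id="gelf-forwarder",
)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

for record in consumer:
    gelf = {
        "version": "1.1",
        "host": "fluentd-consumer",
        "short_message": record.value.decode("utf-8", errors="replace"),
        "timestamp": time.time(),
    }
    # GELF over UDP is a zlib-compressed JSON datagram (very large messages
    # would additionally need GELF chunking, omitted in this sketch).
    sock.sendto(zlib.compress(json.dumps(gelf).encode("utf-8")), GRAYLOG_ADDR)
```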
Instead of a Fluentd consumer sending the messages to Graylog, we now format the messages in GELF at the source itself and send them to Kafka, and Graylog consumes them directly through its GELF Kafka input. So problem one solved.

Now more apps got onboarded, everything is running, and suddenly we get an alert: the Docker service crashes. The daemon process itself crashed, and when the Docker service crashes, all the apps running on that box crash with it. This is a huge problem; we are serving live customers. The reason behind the crash was a full buffer. Docker was sending the log messages to Fluentd, and as more and more apps got onboarded, the log volume increased and the buffers at Fluentd were getting full. So Docker was crashing, complaining that the buffer was full, but that shouldn't happen: it should perhaps log that the buffer is full, not crash. Crashing is something very serious. So we did some research online, read a lot of articles, and decided to upgrade Docker to the latest version at that time, 1.11, and also started using the latest kernel, 4.2. Did it help? Yes, it did help: instead of our Docker service crashing every four hours, now it was crashing every five hours. The problem was not solved.

So what did we do? Let's remove all the fancy stuff. Instead of sending messages to a TCP port, we log natively using the default json-file log driver, and Fluentd tails the log files instead of receiving the messages over TCP. So Fluentd is now tailing the log messages generated by Docker, not receiving them over TCP, and the Docker daemon is no longer directly coupled to Fluentd or the rest of the pipeline. With that dependency removed, the problem was solved.

A few more apps got onboarded, and we ran into the next problem: huge lag in the Graylog UI again, this time due to a 3 MB log message. A single log line was 3 MB in size. What was the problem with this? Elasticsearch is internally powered by something called Lucene, and Lucene has a field value limit of 32 KB. So when it received this 3 MB message, it threw an exception saying the maximum field size had been reached, and this exception was caught by the Graylog server, which retries the message five times with a wait of 30 seconds in between before discarding it. Until this message gets discarded, more messages cannot come in and the pipeline is blocked, leading to a huge lag. Of course, we did send the offending log line to the developers and they fixed it, but we wanted to solve it at our end as well, to be safe in the future. The solution was simple: we started truncating the message at the source. If the message field is greater than 8 KB, we truncate it to 8 KB. Now we have smaller messages, Elasticsearch is able to index them, and the problem was solved.
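A minimal sketch of that source-side fix, assuming a kafka-python producer: the log line is wrapped in GELF and the message field is truncated to 8 KB before it ever enters the pipeline, so an oversized line can never block indexing downstream. The topic, broker and host names are illustrative.

```python
# Sketch of source-side GELF formatting with 8 KB truncation; names are
# hypothetical. Graylog's GELF Kafka input reads these messages off the topic.
import json
import time
from kafka import KafkaProducer   # pip install kafka-python

MAX_MESSAGE_BYTES = 8 * 1024      # truncate anything above 8 KB

producer = KafkaProducer(bootstrap_servers=["kafka1.internal:9092"])

def to_gelf(raw_line: str, host: str) -> bytes:
    truncated = raw_line.encode("utf-8")[:MAX_MESSAGE_BYTES]
    gelf = {
        "version": "1.1",
        "host": host,
        "short_message": truncated.decode("utf-8", errors="replace"),
        "timestamp": time.time(),
    }
    return json.dumps(gelf).encode("utf-8")

def ship(raw_line: str, host: str = "app-box-1") -> None:
    producer.send("docker-logs-gelf", to_gelf(raw_line, host))
```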
What was the next problem? Graylog is getting popular; there are multiple teams using it and they are seeing value in it. Some of the non-microservice developers also wanted to use Graylog. So we deployed Fluentd on the traditional boxes, which are not running Docker, tailing the application logs directly, and we started sending those messages into the same pipeline. But we ran into an issue. Since we are now tailing application logs directly, which are in JSON format, applications can format their fields differently. For example, one application may report status as "success", another may say true, and some application may use integers like 1. These JSON keys are mapped in Elasticsearch with a data type: the first time Elasticsearch receives a key, it attaches a data type to it, and the next time it receives the same key with a different data type, it throws an exception. Graylog is again retrying the message before discarding it, and the pipeline chokes. So what we did was convert everything into strings: right at the source, Fluentd sends only string values into the pipeline, so there are no conflicting data types, no exceptions, and the problem is solved.

Now, every time we had a problem we had to dig into multiple places: Fluentd had a buffered Kafka plugin, Kafka itself has retention, and Graylog also has something called the journal, again a form of buffer. So whenever there was a lag, we had to look in different places to figure out where the contention was. But is a buffer required at every level? We asked ourselves this question and the answer was no. Fluentd is tailing the log files and stores its read offset, so the file itself can act as a buffer; we disabled buffering at Fluentd. Kafka is a pub-sub system and we need it for a variety of reasons, as I said, and of course we need retention there; we configured a two-day retention. Graylog has a throttling option: if Elasticsearch cannot keep up with the speed at which Graylog is sending messages, Graylog can slow down and consume from Kafka more slowly to match Elasticsearch, so we could disable journaling there as well. With this, we simplified our debugging process and also reduced the load on our servers.

Now more apps got onboarded and we are dealing with higher scale. We started facing missing or deleted logs. What was the problem? A slow Fluentd Kafka plugin, again. The Fluentd plugin for Kafka was designed to be simple, not fast. Adding more CPU or RAM did not help, so we were looking for alternatives. Luckily, I found a project called Heka. Heka is a Mozilla-sponsored project written in Go. With it, we were 5x more CPU friendly and 10x more memory friendly, without losing messages. So the missing/deleted Docker logs problem was solved.

Finally, we have a centralized logging platform, but every time any problem happens, everybody is blocked: a centralized solution is also a centralized source of problems. We wanted to decentralize that part. Contention can happen at any level, maybe at Kafka, maybe at Graylog or Elasticsearch. So what did we do to decentralize the problems? Heka is already distributed, because it runs as one process per box; if we want more throughput from Heka, we can give it more CPU cores. We designed a plugin to add log bucket support: one group of applications writes to one topic in Kafka, another group of applications writes to another topic. In that way the logs are divided at the source and sent to different Kafka topics (a rough sketch of this bucketing appears below). Kafka is a horizontally scalable cluster, but we decided to split it topic-wise: in a big Kafka cluster we chose, say, five servers for topic one and rebalanced its partitions onto them, five servers for topic two and rebalanced its partitions, and so on. So we have topic-based sub-clusters, and if any contention happens on Kafka, it is limited to that particular topic.
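Here is a rough sketch of those two source-side ideas together: coercing every JSON value to a string so Elasticsearch never sees conflicting data types, and routing each application group to its own Kafka topic. This is not the actual Heka or Fluentd plugin; the app-group-to-topic mapping, topic names and broker address are hypothetical.

```python
# Sketch of value stringification plus "log bucket" topic routing; the
# mapping, topics and brokers below are made up for the example.
import json
from kafka import KafkaProducer   # pip install kafka-python

TOPIC_BY_BUCKET = {                # hypothetical app-group -> topic mapping
    "booking": "logs-bucket-1",
    "payments": "logs-bucket-2",
}
DEFAULT_TOPIC = "logs-bucket-default"

producer = KafkaProducer(bootstrap_servers=["kafka1.internal:9092"])

def stringify(value):
    """Recursively turn every leaf value into a string."""
    if isinstance(value, dict):
        return {k: stringify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify(v) for v in value]
    return str(value)

def ship(app_group: str, record: dict) -> None:
    topic = TOPIC_BY_BUCKET.get(app_group, DEFAULT_TOPIC)
    producer.send(topic, json.dumps(stringify(record)).encode("utf-8"))

# e.g. {"status": 1} and {"status": "success"} both reach Elasticsearch as strings
ship("booking", {"status": 1, "booking_id": 12345})
ship("payments", {"status": "success", "amount": 250.0})
```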
We dockerized Graylog. Since it is now a stateless application, not storing any journal, we could dockerize it and achieve elastic scalability: I can just scale it up to 50 instances when I want more resources and scale it back down to 10. Elasticsearch, again, we separated as per requirement: there can be multiple Graylog clusters, and through Graylog configuration we separate the Elasticsearch clusters where required. With this, we achieved horizontal scalability not just for the applications but also in the design of our pipeline. Who likes it? Developers, because they are able to do quick RCAs. DevOps, because we get more time to sleep. And management loves it because of the lovely dashboards. So that's it. Do you have any questions?

You said it was slow; could you give some metrics on why it was slow, what was the maximum it could handle per second, or something like that? I would say Fluentd itself is not slow, it is super fast; the problem was with the plugin. The Fluentd Kafka plugin is written in Ruby and has to talk to Kafka, which is more Java friendly. There were two plugins, and one was written in Java as well, designed to be fast, but it did not work for us: it did not support producing messages, only consuming them. On the metrics side, with Heka I have done a million requests per minute from a single box, and it can do more if I allow it to use more CPU cores. Fluentd was probably getting limited at around 5,000 or 10,000, something like that, per second or thereabouts, but I don't remember exactly.

Any other questions? Yes. Talking about Kafka, do you scale it across data centers? No, actually we have not; we have kept it in one data center. But yes, we can definitely do that, say for DR requirements, if you want it across regions. We can also configure the replicas in Kafka so that the primary is in data center one and the replica is in data center two. But of course, this also adds cost: within a data center, network bandwidth may be free, but if you are doing it across zones or across regions, at this humongous scale of terabytes of data, you may incur a huge bandwidth cost. So we are not doing that as of now.

Any other questions? Okay, I have one myself. How hard have you pushed this in production, and what kind of logs are you passing through it? The scale I am talking about right now is actually six months old; since then it has grown 10x. So in terms of the volumes you are ingesting through Heka and ultimately storing in Elasticsearch, what sort of volumes are we talking about? On a daily basis? Okay, on a daily basis it is in the tens of terabytes coming into Kafka and being stored in Elasticsearch, across different clusters with different retentions. We configured retention in Elasticsearch based on message count instead of time, because message count is more absolute in terms of size; if I configured it in terms of time, it would be variable, because new applications may get onboarded and my logging volume could double. The size stored in Elasticsearch is in the hundreds of terabytes.

In this scenario you just mentioned, how big is the Kafka cluster and how big is the Elasticsearch cluster? Okay, the Kafka cluster we are using right now is a 15-node cluster, divided into three sub-clusters by topic: the first five nodes are used for one topic, nodes five to ten are used for two or three topics,
and the last nodes, 11 to 15, are used for another set of topics. As for the servers, it is dockerized, and collectively there would be more than a terabyte of RAM. Elasticsearch, again, runs on a lot of boxes, anywhere between 50 and 100, something like that, and all of them are big boxes. It is a big cluster, and we have tuned a lot of different settings at each level to run it at this scale.

Can you share a little bit more about the use case, the kind of logs that actually add up to tens of terabytes? What is generating tens of terabytes here? Okay, as a platform provider I don't really worry about what kind of logs the applications bring. But the applications are of different types: there are booking applications which take bookings, service discovery applications which discover which cabs are around a particular location at a given moment, and there can be payment-related applications as well. All of them are generating application logs which are useful for debugging, not really data-related logs. It is more aligned towards application logs, and there is a variety of them.

One last question; sorry, I have to ask so many questions because I'm actually also going to face a similar problem. This last one is about the Heka side, the producer side. Would it have the same effect if you used Node.js with clustering, a clustered Node.js? Because Heka, being written in Go, uses multiple CPUs, and I think Node.js clustering also uses multiple CPUs and could achieve similar performance. Yes; actually, the real pipeline starts from Kafka, so the inputs can come in multiple forms, and there are already multiple inputs. For example, I am also receiving input over TCP via another plugin, and some applications are sending directly to Kafka. As long as you send GELF-formatted messages to Kafka, all is good. You can replace the first component with anything you like, anything you are comfortable with.

You have mentioned multiple applications sending these kinds of log files. During the setup of this infrastructure, did any of those applications need to change their log structure? No, because it was transparent. In the microservices model, we just instruct all the applications to log to standard out; that is all they need to do. The log driver collects the standard out and writes it to a file in a structured format with some added metadata; we handle those parts. So for application developers, all they need to do is log to standard out, and everything else is taken care of.

So filtering is being done? Yes; the structured message is finally sent to Graylog, and once it is stored in Graylog, we can do filtering there. Thank you so much for having us.