Hello, yeah, thank you. Thanks to the organizers and volunteers for having me here. Good evening, everybody. Yeah, let's try once more: good evening, everybody! Awesome. So I'm Rohit, I work at Ola, and today I'm going to share our experience of setting up a centralized logging platform, something we do at Ola scale: a billion-plus messages a day and 100k requests per second.

So what is Graylog? Graylog is log management software that actually works. It has lots of features, like search, analysis, alerting, pipelines, and a lot more that we will see later.

Now, before we look into this: how many of you are Ola users? Well, a good show of hands. So that basically explains why we have volume; all our logging requirements exist because all of you are using it. We have a huge volume, a billion-plus messages per day. And since all of you are simultaneously booking cabs, we have velocity. Internally, Ola is powered by a lot of microservices, so we have hundreds of applications pumping logs at the same time, and all the applications may not log exactly the same way; they may have their own requirements. So we have variety. This is basically a typical big-data statement.

Okay, how many of you have heard about ELK? A lot of you. ELK is a very popular stack and very widely used for log management.
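As a quick sanity check on those scale numbers, the arithmetic below is mine, not from the talk; it only illustrates what a billion messages a day means per second, and what headroom a 100k/s peak implies over the average:

```python
# Illustrative arithmetic only: the figures (1B+ messages/day, 100k req/s)
# are from the talk; the peak-to-average ratio is derived here.
MESSAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

average_rate = MESSAGES_PER_DAY / SECONDS_PER_DAY
print(f"average: ~{average_rate:,.0f} messages/second")

# A 100k/s peak over an ~11.6k/s average means the pipeline must absorb
# bursts well above the steady-state rate.
PEAK_RATE = 100_000
print(f"peak-to-average: ~{PEAK_RATE / average_rate:.1f}x")
```

In other words, the interesting constraint is not the daily average but the burst rate the pipeline has to absorb without dropping messages.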
Even at Ola we used to use ELK a lot, but we found that it is a bit maintenance-intensive, and we often end up building on top of it. So we wanted something simpler, something which already has logging capabilities and doesn't require day-to-day maintenance. We were looking for an alternative, and we found Graylog.

But why Graylog? First of all, if you look at the name, Graylog itself has the word "log" in it. It is designed for logging; it is designed for solving just one problem. All its features are catered towards logging and nothing else, while ELK may be used for other things as well; it is a generic platform.

One of the features is a beautiful UI which can be used for log searching. You can also see, on the top right, the throughput at that moment. We can search; in our case we're searching based on application name. I don't think that is visible. We can have beautiful dashboards. In this case the dashboard is for an application called Heka, and the messages streamed over the last one day are shown here. Then there's a message count for one day, and the cardinality, the number of unique messages. "Quick values" is a pie chart showing the messages which occur most frequently, like the top one, which occurred 3.28 percent of the time. This is a small application, so it is not logging heavily.

Okay, lastly, Graylog is powered by Elasticsearch. It uses Elasticsearch to store all its data and query the results. The good thing about Graylog is that you can manage all the Elasticsearch indices from the UI itself, and it gives you a lot of flexibility to manage indices based on time, message size, or even message count. This is a sample configuration in which each index can store a maximum of 40 million records, and 250 of them can be present at any point of time.
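A rotation setup like the one just described (40-million-document indices, at most 250 of them) can be sketched roughly as the following fragment in the style of Graylog's `server.conf`; this is an illustrative sketch, not our exact configuration, and in recent Graylog versions the same settings are managed per index set from the UI, as mentioned above:

```ini
# Illustrative sketch of count-based index rotation:
# rotate the active write index after 40M documents...
rotation_strategy = count
elasticsearch_max_docs_per_index = 40000000
# ...and keep at most 250 indices, deleting the oldest beyond that.
elasticsearch_max_number_of_indices = 250
retention_strategy = delete
```

At 40 million documents per index and 250 retained indices, the cap works out to 40M x 250 = 10 billion records.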
So this is basically a configuration for storing 10 billion records.

Graylog has several input and output options, and pipeline support. It can take messages from a variety of sources. In this example you can see I've just searched for "Kafka", and there are three different modes of taking input from Kafka. We can take input from syslog, from TCP; there are other options as well, like RabbitMQ. Output can also go to a variety of different plugins. Some of the plugins are built into Graylog itself, some are community-supported, and some you can write on your own.

Real-time log analysis and reporting: we can create streams based on some criteria; for example, an application name can be a criterion. We can create alerts on streams. So as we are getting the data for an application, we might grep for some particular line and trigger a paging utility if it is important, or alert on message count. So there can be a variety of real-time log analysis and alerting.

Now, from the features it looks very good. We did a POC, and we had a journey. A few months back we started exploring the alternative. We came up with a pipeline, and it was great, but we ran into a lot of problems and learnings which I am going to share today.

So this was the initial pipeline. We tested it on staging and it worked fine. Docker is powering most of our microservices; there are tons of microservices, and all the applications are logging to standard out, which can be captured by a Docker logging driver. We are using the Fluentd log driver instead of the default one, which writes to a file. This in turn sends the traffic to Fluentd on a TCP port. Fluentd is a data collector: it can collect data from a variety of sources as you configure it, process it in real time, and send it to various destinations.
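The forwarding role Fluentd plays here can be sketched as a minimal config. This is an illustrative sketch, not our production config: the tag pattern, broker addresses, and topic name are assumptions, and the Kafka output assumes the `fluent-plugin-kafka` plugin is installed:

```
# Minimal illustrative Fluentd config: accept logs forwarded by the
# Docker fluentd log driver and hand everything to Kafka unmodified.
<source>
  @type forward            # Docker's fluentd driver speaks the forward
  port 24224               # protocol; 24224 is its default port
</source>

<match docker.**>
  @type kafka2             # from fluent-plugin-kafka (assumed installed)
  brokers kafka1:9092,kafka2:9092
  default_topic app-logs
  <format>
    @type json             # ship each record through as JSON
  </format>
</match>
```

The point of the config is that Fluentd does no parsing or enrichment here; it is purely a transport hop between the Docker log driver and Kafka.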
Here we are using Fluentd as a dumb box: it just receives the input on TCP and forwards the messages to Kafka. Why not directly to Kafka? Because a Kafka logging driver doesn't exist for Docker; if it existed, we would probably have used that. Kafka is a publish-subscribe system. Fluentd writes the messages to one of its topics; any number of applications can write to a topic at any point of time, and there can be a number of applications consuming from it. So Kafka receives the messages, and you have to subscribe to them. We have another Fluentd which subscribes to the messages from Kafka, then formats them in GELF (Graylog Extended Log Format) before sending them to Graylog on a UDP port, because we believed the network would be reliable and UDP is fast.

Graylog had two components: the Graylog server and the web server. The Graylog server has all the functionality, while the Graylog web server is just a wrapper on top of it which calls the REST APIs to show the UI, and Elasticsearch is used for storing the messages.

Now the first problem happened: the consuming Fluentd container was dead slow. But we believed that since it was sending the messages over UDP it should be fast, and it was consuming from Kafka in real time, so Fluentd should not have had any problem. What we did: we upgraded all the versions of the different components. As you see, in Graylog the two components are now merged into one; in the new version we got this upgrade. We tried out the beta, where the REST API and UI are in the same box. But the problem was still not solved. The problem was the Fluentd GELF plugin: it was slow.

What we did: in the new version of Graylog we saw that it already has Kafka support. We started formatting the messages at the source itself, in GELF format, before sending them to Kafka, and Graylog is able to collect the messages from Kafka and index them very well. So the first problem was solved. What could go wrong next?
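A GELF payload is just a small JSON document with a few required fields, so formatting at the source is cheap. A minimal sketch of wrapping a log line in GELF before publishing it to Kafka might look like this; the helper name, host, and application field are illustrative, and the Kafka-producer step is shown only as a comment:

```python
import json
import time

def to_gelf(host, message, level=6, **extra):
    """Wrap a log line in a minimal GELF 1.1 payload.

    'version', 'host' and 'short_message' are required by the GELF spec;
    additional application fields must be prefixed with an underscore.
    """
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational
    }
    # Extra fields (e.g. application name) travel as _-prefixed keys.
    payload.update({f"_{k}": v for k, v in extra.items()})
    return json.dumps(payload)

record = to_gelf("web-42", "booking created", application="booking-service")
# A Kafka producer would then publish the record to the log topic, e.g.:
# producer.send("app-logs", record.encode("utf-8"))
print(record)
```

Because Graylog's Kafka input understands GELF directly, no intermediate consumer has to reformat anything, which is what removed the slow plugin from the path.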
Our Docker daemons started crashing. We used to get paged that in a production box the Docker daemon had crashed, and when the Docker daemon crashes, all the containers crash. Why? For a simple reason: we onboarded a few more applications, they were logging heavily, and the buffer of the Fluentd log driver filled up, so it crashed. But why should a Docker daemon crash because of something less important, like logging? We thought we should probably upgrade. We read some articles and upgraded to the latest version of Docker. We also read a few articles which suggested upgrading the kernel to 4.2. We did all that, but now, instead of crashing every four hours, it was crashing every five hours. The problem was still not solved.

So forget the fancy terms: don't use the Fluentd log driver. Use the native log driver, which writes to file in JSON format, the json-file log driver, and use the tail plugin in Fluentd to tail the files and then send them into the pipeline. So this was the solution. I've just removed that component because it is now independent. Fluentd streams all the logs by reading from the files, and it automatically detects new files as new containers are created. The pipeline goes well; we are happy.

Then we got into another problem: huge log messages. Any idea what could go wrong when we receive such huge messages? Elasticsearch indices are powered by Lucene, which supports a maximum of 32 KB per field. I saw a single message which was 3 MB. Of course it will raise an exception, and Graylog will try to send the message again and again, five times, before finally discarding it. Graylog will not send the other messages until this message is either sent successfully or discarded, so there is a huge lag. Solution: either truncate the message or divide it into multiple parts. We decided to just truncate the message. We came up with 8 KB as the max field size which the JSON will support, and the
problem is solved, because exceptions will not happen if the message size is small.

Problem number four. Now our platform is kind of popular; more developers are interested in onboarding their applications to Graylog. But they are not on the modern Docker platform; they are running directly on the box. Their applications generate messages in JSON format in files. We need to tail the files and send them into the pipeline. So that's what we did: new apps onboarded which were not on Docker.

Now what happened? Some of the applications may have similar keys. For example, "status" could be "success" or true, and it could also be an integer like 1. This detail is sent to Kafka, then Graylog, and finally it reaches Elasticsearch. The first time a field arrives, Elasticsearch associates a data type with it. If the first message received has status equal to true, it will map the field as boolean; when the next message comes with "success", it throws an exception. That's what was happening. So what we did: the logic is simple, and we applied it at the Fluentd level. Since we are powered by hundreds of microservices, with different teams sending the details to us, we wanted to solve it at our end. So we convert everything to strings before sending it into the pipeline. Since all the data types are strings, it will not throw an exception.

Now, we had many components, but we realized that we were using a buffer at each level, so whenever some problem happened we had to look at each of the components: where is the lag? There can be lag at the first component, the second, or the third. Is a buffer at each level necessary? We found the answer is no. The first component, Fluentd, is actually tailing the log files, so the files can act as a buffer. We have Kafka, and we want Kafka to store the data for two to three days, so it acts as a buffer. At Graylog we can throttle the input to match the
output, so a buffer at Graylog is not necessary; it can always replay the messages from Kafka.

A final problem. Fluentd is tailing the files and sending to Kafka, but we experienced message loss and sometimes delay. Why? Because the Fluentd Kafka plugin was not fast enough. It was written in Ruby, and it was designed to be simple but not performant; adding more CPU and RAM will not help. So we searched for an alternative, and luckily we found a project called Heka. Heka is a Mozilla-sponsored project, written in Go, and very performant. What we experienced is that it can do all the things which Fluentd can do, plus it has more features, though it is complicated. With Heka we are now 10x more CPU-friendly and 5x more memory-friendly.

Now, finally, who liked it? Developers, because we have a centralized setup with user authentication: they can log in and check the logs directly, they can do quick RCAs and debugging in staging as well as production, and we can sleep for some more time. And the management can see beautiful dashboards and get more insights and visibility.

You can also try it yourself. This is something which I did a few hours back: we have uploaded a Graylog Docker image. You can just run these three commands and you will have a Graylog setup ready to use.

I think that's it. I'll just explain one more thing, the Kafka layer. The reason why Kafka exists is that with Kafka we can attach another pipeline and use it for long-term archival purposes, like Secor; we are using Secor for pushing the messages to S3, while at the same time Graylog holds the messages for, say, seven days. Yeah, I think we should open up for Q&A.

[Audience] Hi. It looks like you pretty much replaced the L and K of the ELK stack with the Graylog server, right? So what were the main benefits you got out of removing them? Because it seems like you're pretty much doing the same thing.

[Rohit] Yeah. In my experience, most of the problems that happen are because of
Elasticsearch.

[Audience] That's correct.

[Rohit] So the main advantage of having Graylog is that you get everything in a central place. When you're using ELK you often end up building on top of it: you will be configuring the Logstash shipper, writing all those configurations, using Kibana for viewing the logs. Here you get all those features in a central place; that is the biggest benefit. And of course it has tens of other features.

[Audience] I mean, what is the feature? That's what I was looking for. A feature like streaming support; I think Logstash has that support, so it is similar. The basic idea, as I understand it, is that it's an alternative, but is there any major benefit you derived out of it which was not there in ELK?

[Rohit] The major benefit is that today I may want to manage Elasticsearch indices based on message count; tomorrow I may want to manage them based on time, and all of that I can do with a single click of a button. That is the major benefit I have experienced. Second is that with ELK we would be building on top of it to actually make a mini Graylog, so the Logstash shipper, RabbitMQ, and so on all come into the picture, while here Graylog covers it. From the CPU-cycles perspective it will be similar; ELK can also do similar things, but we found this a better alternative.

[Audience] Hello, yeah, so I have a question. There is actually a GELF driver from Docker itself, right? It also has the same buffer problem which you had with the Fluentd driver, no?

[Rohit] Actually, we wanted Kafka to be in between. If there had been a GELF Kafka plugin, we would probably have used that. We did try it out; it works beautifully and can send messages directly to the Graylog server. There's no buffer issue like the one we had with Fluentd. We were using it with UDP, and with UDP there's no problem with the buffer; if the destination is not receiving the messages, the messages are simply lost. TCP is also there, but I think
UDP is more commonly used than TCP.

[Host] Hello, I'm afraid that's all we have time for now; the next speaker has to get on, so you can take your questions offline. Thank you.