Welcome back to the PyCon India 2020 Hyderabad stage. Thank you all for attending the event and joining us for another day of talks. An important announcement before we begin with our next talk: workshop tickets are still up for sale, so grab them while they last. Please welcome our special guest, Devi ASL, from Razorpay. Devi has been a long-time Pythonista and has delivered many talks at various conferences, including PyCon India and RootConf. She has over a decade of experience in building software solutions in different domains on a variety of tech stacks. Please help me in welcoming her to the stage. Over to you, Devi.

Hi, everyone. Good evening. Thanks for staying back for the talk, and let's get started. I'll go through my intro a bit. I'm Devi, and thanks for the introduction that has already been done. I have around 15 years of experience in the industry, and I'm working as a staff engineer at Razorpay, on observability and other cool products. Before my stint at Razorpay, I spent five years as an architect at Power2Clyde.

With that, I'll get to the talk. The topic of the talk is logging. Logging is such a basic thing. All of us developers put log lines in our code to record what it is doing, and we all know that without logs we are just blind. Logs are the only way to find out what happened in a given time frame in any service or application that is on production. Without them, there's no clue, right? And not just to find out that something happened: we want to find out what exactly happened, and we want to do that quickly. When an issue occurs, we want to find the culprit, exactly that log line which shows us what happened, and its root cause, and what happened afterwards.
I mean, what followed after that issue happened, right? We all want to do that, but it's not easy. Many a time I have seen developers putting in more and more log lines and then deploying their code; if you look at the diff, it would be just log lines and nothing else. And most devs still struggle to find the log line they are looking for. Sometimes that takes on the order of minutes, and sometimes, in the microservices world, on the order of hours to find the root cause. The mean time to detect an issue can be really, really large.

How do we make it short, by making logs easily searchable, filterable and aggregatable? And if we keep increasing the log lines at every point in our code, the logging cost soars, because maintaining logs is costly. How do we balance these two aspects, debugging and logging cost, and how do we optimize them? Lastly, we'll also see how we can trace a request across microservices in the logs.

With that, let's attack problem number one: how can we make logs easily searchable, filterable and aggregatable, without depending on the log management system we are using? We have all written hundreds of log lines like this, or seen thousands of them. It just calls the standard logging library of Python: logger.info with a sentence, "transaction succeeded with id blah, amount blah and payment mode blah", and I'm filling in the id, amount and payment mode from my transaction. This is how a log line typically looks, and we are all well aware of it. One part of it is called the log level, and there are many standard log levels; we can add more too. The log levels you would have seen, or the most commonly used ones, are debug, info, error and critical, things like that.
And you can add your own as well. On production, we would switch the log level to error so that we see only error and above, which is error and critical. On developer systems or dev environments we would start at debug, so that we can see all the logs from debug upwards: debug, info, error, critical and so on. Most of us diligently set the log levels appropriately so that we don't incur too much logging cost.

Now let's start with problem number one. How fast can I search for a particular log line that I'm looking for? Sometimes I don't even know what has happened; that's the reason I'm reading the logs. Typically the log line will look like the example we took, from when my request is flowing: "transaction succeeded with id" such-and-such, an amount, and the payment mode could be NEFT. This is how a typical log line looks in our log files or in the log management system.

Now, if I have a bunch of these, can I answer a few questions? Like, give me all logs where the amount is more than 10,000 bucks. Or, give me all logs where the payment mode is RTGS. How would that work? Look at this log line carefully: this part is the application emitting it, this is the log level, and this is the timestamp. And from here to here, it is all one piece of text, a sentence in English, which is trying to convey to the developer that the transaction has succeeded. But is this possible for a machine? Does the computer understand that whatever follows "amount" is really an amount, a float on which we can do filters like less than or greater than, or sums and aggregates, while the id is always a string? There's absolutely no clue for the machine to understand that. The same goes for what follows "payment mode" here, NEFT: the machine has no clue that it is a string.
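To recover, say, the amount from such a plain-English line, the machine needs a hand-written regular expression, which is essentially what a grok pattern in Logstash or Fluentd does for you. A brittle sketch, with a made-up line format:

```python
import re

# A hypothetical unstructured log line, as a developer might write it.
line = "INFO Transaction succeeded with id TX42, amount 10500.0 and payment mode NEFT"

# Hand-written pattern; it silently breaks the moment the wording changes.
match = re.search(r"amount (?P<amount>\d+(?:\.\d+)?)", line)
amount = float(match.group("amount")) if match else None
# amount is now a float, but only because the regex happens to fit this sentence
```

The regex works only for this exact phrasing; rewording the sentence, reordering the fields, or localizing the message all break it.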
We, the developers, are not leaving any clue for the machine about how to extract or parse this; we are just dumping it in the logs. We have all been doing that. So this is human readable, but not machine readable. We can't do filters based on it, for the same reason: the machine does not know how to extract the number out of it, unless we put a regular expression somewhere, in our grok patterns on Logstash or Fluentd or elsewhere, or our log management system does it for us without us being explicit. Aggregates, too: if I want to know how many transactions succeeded in a particular window of time, I have no way to figure it out from these logs.

How do I correct it? Treat logs as event streams. This is one of the twelve factors that 12factor.net tells us about. What do we mean by that? What is an event? Here comes the JSON part of it; this is called structured logging. What are we replacing? Instead of a plain English sentence where we write "transaction succeeded with" such-and-such, putting every detail we know into that sentence, here we are saying: look, this is a new transaction which has happened, with id equal to this. If you look carefully, this is a dictionary kind of thing; these are all keyword arguments going into logger.info. It's a message plus key-value pairs, basically. So whenever I say amount equals so-and-so, I, as a developer, know that this is always a float or integer field on which I can do filters like greater than or equal to, and that this other one is a string field. This information is really important for any search engine to index and to serve searches based on our filters.
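The idea above can be sketched with plain dictionaries and the stdlib; structlog renders events in a similar JSON shape, though the field names here are made up. Once each record is a typed key-value event, filters like "amount over 10,000" and aggregates need no regular expressions at all:

```python
import json

# Hypothetical structured events, one dict per log record.
events = [
    {"event": "new_transaction", "id": "TX1", "amount": 10500.0, "payment_mode": "NEFT"},
    {"event": "new_transaction", "id": "TX2", "amount": 900.0, "payment_mode": "RTGS"},
    {"event": "new_transaction", "id": "TX3", "amount": 25000.0, "payment_mode": "RTGS"},
]

# Each record serializes to one JSON line, which is what lands on disk
# and what the log management system indexes.
log_lines = [json.dumps(e) for e in events]

# "amount" is a number and "payment_mode" is a string, so filters and
# aggregates are straightforward dictionary lookups.
big = [e["id"] for e in events if e["amount"] > 10000]
rtgs_total = sum(e["amount"] for e in events if e["payment_mode"] == "RTGS")
```

The types travel with the data, so the search engine can index amount as a number and payment_mode as a string without any parsing hints from us.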
Can I do this with the standard library of Python? No, unfortunately. This functionality of being able to pass key-value pairs comes from a wonderful library called structlog. And how does a log line look with this in place? Like this: message, "new transaction", an id, an amount. Notice there are no quotes around the amount, it's a number; there's a boolean success flag, as we expect, and a timestamp. Each of these is now clear, for the human being, the developer, as well as for the machine, to do aggregates, filters, or whatever we want. This is the first step in logging it right: you can do filters and aggregates, and we are headed in the right direction of getting to that log line we are looking for, faster.

And how do we do it? Structured logging with structlog, which is a simple drop-in replacement for the standard library: wherever you have logging.getLogger, you simply replace it with import structlog and structlog.get_logger, nothing else. It simply works. structlog also offers plenty of powerful processor pipelines. Say you want to redact passwords, or add more metadata to every event that gets logged, at a central place, like a middle layer; that is also possible with these pipelines. structlog offers quite a number of useful processors out of the box, and adding new or customized ones is very simple too: a processor is simply a function that takes a dictionary and emits a dictionary, nothing else. So we can use structlog to make our logs structured streams of events rather than English sentences meant only for developers.

With that, the first problem is solved. Now comes the other aspect of the talk today: logging is not cheap.
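The processor idea described above is just a callable; in structlog the actual signature also receives the logger and method name along with the event dict. A minimal redaction sketch with a made-up field name, shown standalone rather than wired into structlog.configure:

```python
# In structlog, a processor receives (logger, method_name, event_dict)
# and returns the event_dict, possibly modified. This one redacts a
# hypothetical "password" key before the event is rendered.
def redact_password(logger, method_name, event_dict):
    if "password" in event_dict:
        event_dict["password"] = "***"
    return event_dict

# Normally this would be registered via
# structlog.configure(processors=[redact_password, ...]);
# calling it directly shows the transformation.
event = {"event": "login", "user": "devi", "password": "hunter2"}
cleaned = redact_password(None, "info", event)
```

Because processors are plain functions over dictionaries, they compose into a pipeline and are trivial to unit test on their own.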
As developers, we only see the logs in our code, but if you look at the servers, at how many logs pile up and the network and storage cost incurred to maintain them, it's not really cheap. The number of logs, as you might expect, is proportional to the number of requests you're serving. You want the number of requests to your website or web application to grow, but not the cost. And the cost is proportional to the number of logs multiplied by the network and storage costs and the retention period. How long do I want to keep my logs? Do I want to see last year's logs too, or am I fine throwing away logs older than, say, three months? It depends on that as well. So though they are directly proportional, we want the requests to grow but not the cost. How can we make that happen? Python has a magical, well, I shouldn't say magical, it has a standard way of doing things.

At Razorpay, we actually process two terabytes of logs every day. Two terabytes. And it costs a huge amount of money, believe me. People who have applications that churn through lots of requests per day, per minute, have terabytes, petabytes of logs flying in, and it costs money. So what do we optimize for? Cost, or debuggability?

Enter dynamic logging. What is dynamic logging? Suppose I can somehow turn the log level from, say, error to critical, or from info to error, at runtime, without restarting my application. That's the important part: without restarting my application, at runtime, I want to change the log level. Say the requests to my application are soaring and I'm totally confident that my application is handling the traffic well; I don't need logs of every detail, I just want to switch the logging level from info to error. How do I do that without restarting the application?
The standard logging library of Python provides a way to do that: a way to change the config by listening on a port. You simply write another small application to send your log config to that port, and you're done; it simply works. I have a link to a full code snippet that makes this work, from the Python docs. This is one way of doing it, but when our application is deployed behind something like Gunicorn, having all the worker processes listen on that port may or may not be possible. Then how do you do it?

There is another way, which is the standard way: you get a logger from getLogger and call logger.setLevel. This level, whether it is logging.INFO, logging.ERROR or whatever, can be changed at runtime based on the metrics you collect, or the errors or latencies, whatever you want. Say you store the desired level somewhere and pull it out dynamically based on the traffic; anything can be done, because this is dynamic. Once you figure out where to tweak this level, and how and when you want to change it, it's just a matter of tuning that value in Redis, or posting a request to your config changer, whatever it is. Python allows it, so I don't lose my debuggability, but at the same time I'm being cautious about the volume of logs I'm incurring. That's dynamic logging.

Now for the last and final aspect of the talk today: how do we make more sense of logs in a microservices world? Almost all of you who are working on projects or applications on production would have heard of, or be working on, microservices.
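The setLevel approach described above can be sketched like this; where the desired level comes from (Redis, a config endpoint, a feature flag) is up to you, and the logger name and helper function here are made up:

```python
import logging

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)

def apply_log_level(level_name: str) -> None:
    """Apply a level fetched at runtime from config, no restart needed."""
    logger.setLevel(getattr(logging, level_name.upper()))

# Traffic is soaring and the app is healthy: cut the log volume.
apply_log_level("ERROR")
assert not logger.isEnabledFor(logging.INFO)

# Debugging an incident: open the taps again.
apply_log_level("DEBUG")
```

The same process keeps running throughout; only the threshold that decides which records get emitted changes.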
You don't have one single app; you have a number of microservices talking to each other and together delivering a piece of functionality to the end user. What are the challenges involved in logging there, and how do we make more sense of the logs? Let's see that aspect of it.

Let's take a typical, simplified example of microservices. Say I have a frontend service which talks to service A, which in turn talks to service B. This is how the request flows. The frontend is the user-facing service, let's assume: users like you and me hit this frontend service, the request touches service A, which sends a request to service B, gets some response back, and the frontend service returns the response to the end user. Now, all of these services would be emitting logs: the frontend service, service A and service B are all emitting logs, and your log management system is taking all of them, storing them and making them searchable for you.

Just for example, assume that at a given point of time the developer sees something like this. It's a very simplified example of log lines; because of space constraints on the screen I had to put just one message. So there is a log line from the frontend service, actually a couple of them, a couple of log lines from service A, and one log line from service B. As a developer, I'm trying to debug something: OK, this is the log line from the frontend, and it occurred at such-and-such timestamp. Now I want to find out whether, after this log line, the request was handed over to service A, and then handed over further to service B. Is this log line the successor of the one at timestamp 91, or is that one? How do I figure that out?
How do I stitch together, in my head, the logical flow of the request through these microservices? Do I have any clue here? Everything is ordered by timestamp, but I have no way to say that the log line with timestamp 93 follows this one, or that one. I can't say with any certainty. And suppose this is an error and I want to know whether it was this user who was getting the error, or that user; how do I relate back and forth? There's simply no way.

This wouldn't have happened if we had one single monolithic application, but we are not living in that era where everything is served by a single monolith. Most of the time we have microservices, and plenty of them, not just two or three like this, and developers who have debugged without any clue like this know what kind of hell it is. So what if I have a request ID? Here I don't have any, but what if I have a request ID originating from the frontend, a unique identifier for each request? If I have some way to put in a request ID which uniquely identifies each request across all the services, what would it look like? This one originates from the frontend, and that other request coming from the frontend is labeled something other than ABC, say XYZ. And here, sorry, this should have been A and this should have been B; sorry for the confusion, just take it that this is a log line from A and this is also a log line from A. If I see ABC here, it means this is the part of the request which has come to my service A.

Hi, Devi. Sorry for interrupting; this is a reminder that five minutes are left for the talk.

Sure, sure, we are good. Thank you. Yeah. We can simply see that the request ID is just the same, right?
And I can clearly figure out that this log line from my microservice A follows that request, so I can simply group them: all of these log lines having the same request ID, ABC, across the different components, the frontend, service A and service B. Makes sense? I can simply filter on request ID ABC and get that whole thread. Say an error is occurring here; how do I trace back to the frontend request that caused it? I simply search for request ID DEF, because I know this is the cause and I want to see what triggered it before it reached service A. I do one search with the request ID and I get log lines from all the services. That's the key point here.

But how do we do that? This technique is called header propagation. One simply puts a header in the originating service, wherever the request touches first, X-Request-ID or whatever you want to call it, with a value like ABC, and each downstream service takes the header, consumes it, and puts it back in the headers of any request it makes in turn, so that the request IDs flow across the services and into the log lines. That's it. You have a request ID in each of these log lines and you can trace it back.

Cool. So we have covered how to set log levels appropriately, how to make logs better with structured logging, how to control log volume using dynamic logging, and how to put a unique correlation ID across all the microservices, connecting them all. These are the references I was talking about, and thank you all for staying back. If you have any questions, I'll stay back on the Hyderabad stage and take them up. At Razorpay, we have been working on amazing products and cool technologies, and we have several openings; if you would like to know more about the opportunities, you can reach out to me on Twitter or LinkedIn at ASL Devi. Thank you, everyone.
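The header propagation described above can be sketched with plain dictionaries; X-Request-ID is a common convention rather than a standard header, and the helper names here are made up:

```python
import uuid

HEADER = "X-Request-ID"

def ensure_request_id(incoming_headers):
    """At the edge (frontend): reuse the incoming ID or mint a fresh one."""
    headers = dict(incoming_headers)
    headers.setdefault(HEADER, uuid.uuid4().hex)
    return headers

def propagate(incoming_headers, outgoing_headers):
    """In every service: copy the ID onto each downstream request."""
    headers = dict(outgoing_headers)
    headers[HEADER] = incoming_headers[HEADER]
    return headers

# The frontend mints the ID; service A forwards it when calling service B.
edge = ensure_request_id({})
to_service_b = propagate(edge, {"Content-Type": "application/json"})
```

Each service would also attach the same ID as a key-value field on every log line it emits, which is what makes the single-search, all-services trace possible.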
All right, that was really an amazing talk, Devi. I would like to mention a famous quote I once read online; it went something like this: documenting is to building feature-rich software as logging is to maintaining bug-free products. Moving ahead, let's get you some burning questions from our dear attendees. The first one: how does GDPR affect logging? For example, can we log user IDs or anything that can identify a user?

Yeah, that's a good question. We shouldn't be logging any PII or PCI kind of data, personally identifiable information, and because I am in the payments domain, card information also shouldn't be logged. That's where the redacting of sensitive information that I showed you comes in, and structlog provides that kind of pipeline to do it easily and effectively. So no, we shouldn't be logging anything which identifies a user; PII shouldn't be logged.

OK, looks like we are running out of time here. Before we wrap up, on behalf of the PyCon India 2020 team, I would like to thank you again for sharing your expert knowledge on the subject of logging. For any questions we could not feature during the talk, please feel free to catch Devi on the Hyderabad stage stream on Zulip, and we'll meet you again in five minutes. Cool. Thank you. Thank you. Bye-bye.