We'll be talking about debugging a performance issue in a distributed system. How many of you have ever been in a place where you needed to debug the performance of a distributed system? Right, quite a few of you, and we know the pain of looking into it. First things first: I'm expecting the audience to know a bit about distributed systems, microservices, and message streaming platforms.

Looking at the architecture we were developing on from thirty floors above, this is the bird's eye view. There is an Oracle DB. We pull data from the Oracle DB, aggregate it, and put it onto a Kafka platform for streaming. The publisher is the one that consumes from Kafka and publishes to the API services, and finally we write that data into a Redis DB. Redis here is not a cache: it's the enterprise version, with disk persistence. Everyone is aware of these systems, right? Any doubts here? Okay, fine.

You may have heard the poem by Dylan Thomas, "Do not go gentle into that good night." If you have seen Interstellar, you know it; it's a great poem, I like it a lot. But when you're actually debugging a distributed system, I say: do go gentle into the deep night. Capture the metrics of the existing system first. Leverage all the dashboards and the logs you have written in your code. We all write logs, so use them to see the I/O operations, thread pooling, and API calls happening across your services. Keep a note of the memory usage and CPU usage of each instance; you can use JVM profilers and similar tools to get these profiles. And dashboards (we just had a talk about dashboards) drastically bring down your debug time. So I would say invest time in dashboards while developing: once the MVP of your system is built, your immediate next task should be building those dashboards.

How many of you are aware of traces for the data flow? Every message entering and leaving the system should be tagged with a trace ID. There are many open source libraries available that will produce a unique ID for each message as it enters and exits the system. Catching hold of this trace ID, you can follow a message throughout the system (there's a minimal sketch of the idea below).

And think before you mess with the default configurations of a system. In our case we had an SLA, a service level agreement, that data being pulled from Oracle should reach Redis within one minute. The aggregator was able to pull around 200,000 (two lakh) payloads per minute from Oracle, but the publisher was able to consume only around 2,000 payloads per minute. So we tried spawning up more publisher instances and reducing the instances of the data aggregator. Again, we had challenges scaling up, because the Kafka topics we were using had 25 partitions, which means the number of consumers in the publisher's consumer group should evenly divide 25: one, five, or twenty-five. With any other count, the consumers don't get equally distributed among the partitions, so resources are wasted. So we had these challenges: we couldn't simply scale up, and we couldn't auto-scale.
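To make that partition arithmetic concrete, here is a minimal sketch (a standalone illustration, not our production code) of how an even, range-style assignment spreads 25 partitions over a consumer group of a given size:

```java
// Illustrates why a consumer count that evenly divides the partition
// count (here 25) gives a balanced assignment, and why other counts
// leave some consumers overloaded or completely idle.
public class PartitionMath {
    public static void main(String[] args) {
        int partitions = 25;
        for (int consumers : new int[] {1, 5, 10, 25, 30}) {
            int base = partitions / consumers;   // partitions every consumer gets
            int extra = partitions % consumers;  // consumers that get one more
            int idle = Math.max(0, consumers - partitions);
            System.out.printf(
                "%d consumers -> %d get %d partitions, %d get %d, %d idle%n",
                consumers, extra, base + 1, consumers - extra - idle, base, idle);
        }
    }
}
```

With 10 consumers, five of them own three partitions while five own two, so lag builds up unevenly; with 30 consumers, five sit completely idle. That is why we were effectively stuck with one, five, or twenty-five.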
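And circling back to the trace IDs mentioned a moment ago: as a minimal sketch, assuming the plain Apache Kafka Java client rather than any particular tracing library (the header name and topic here are made up for illustration), you can stamp each outgoing message with a UUID in a record header and pull it back out on the consuming side:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class Tracing {
    // Producer side: tag every payload entering the system with a trace ID.
    static void send(KafkaProducer<String, String> producer, String topic, String payload) {
        String traceId = UUID.randomUUID().toString();
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, null, payload);
        record.headers().add("traceId", traceId.getBytes(StandardCharsets.UTF_8));
        producer.send(record);
        System.out.printf("trace=%s published to %s%n", traceId, topic);
    }

    // Consumer side: read the same ID back and put it in every log line,
    // so one search on the trace ID follows the message across services.
    static String traceIdOf(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("traceId");
        return h == null ? "unknown" : new String(h.value(), StandardCharsets.UTF_8);
    }
}
```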
We need to scale them as the requirement comes: if you imagine, the data is split per market, so we spawn up the instances that are pulling for a particular market to balance the load.

So yeah, coming back to the default configurations. What we did was play with Kafka's default configurations, like changing the retention time, and changing the number of records the publisher consumes per poll. We can control that: we can have 10 records consumed per poll and take it up to 500. We can also configure the consumer's maximum wait time, and if you take a payload, you can hold back the acknowledgement to Kafka until you have finished processing it. That's what we were playing with, and then we felt we were actually getting into a deeper mess. So I would say default configurations are default for a reason: they have been tested properly, and then they were made the defaults. Without properly understanding your system, do not mess with the default configurations of any distributed system. (There's a sketch of the consumer settings we were tuning at the end.)

And I'm almost done. Quickly: sketch the architecture out on paper and look into each service. It's not always about the count of instances, so your system should be tested per instance first, and only then scaled out to the full set of instances. And divide and conquer: kill each part of your system, get the logs, see where the problem is, and then fix it. Yeah, thanks.
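For reference, here is a minimal sketch of the consumer knobs discussed above, assuming the plain Apache Kafka Java client; the broker address, group ID, topic name, and values shown are illustrative, not a recommendation:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PublisherConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "publisher");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // The knobs we were experimenting with (Kafka's defaults: 500 records, 500 ms):
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");   // records per poll
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");  // max wait for a fetch
        // Hold back the "acknowledgement" until processing succeeds:
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payloads")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // publish to the API services, then Redis
                }
                consumer.commitSync(); // commit offsets only after processing
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```

Committing manually like this trades some throughput for at-least-once processing, which is exactly the kind of trade-off you should understand before touching the defaults.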