Good afternoon, and good morning or good night if you are attending this virtually. My name is Pandu Ajee. I work for the Azure Machine Learning Infrastructure Team at Microsoft. We build, run, and maintain Kubernetes clusters for the Azure Machine Learning team, and we build tools to improve the speed and reliability of deployments and to increase the observability of the applications running in the clusters.

About five years ago, we started to explore logging solutions for Kubernetes. Two open source projects caught our attention, Fluentd and Fluent Bit. We went with Fluentd at that time because Fluent Bit was missing some key features: filesystem buffering and the scripting plugin. Fast forward to about a year ago: our business was growing and we were looking for a more efficient logging solution. So we revisited Fluent Bit, found that both missing features are now supported, and saw that many people have adopted Fluent Bit as their Kubernetes logging solution. So we made the switch early last year, and overall we've been in the Fluent ecosystem for about five years.

This is the current state of our production environment. Some of the interesting numbers are the overall log volume, which is somewhere between 750k and 850k logs per second in our busiest region; the log volume per cluster, which is about 80k per second; and the log volume per node, which varies between 1k and 2k per second. Today I'm going to briefly talk about our logging architecture, just for a little bit of context, and then I'm going to show you how we monitor the health of our logging pipeline using one single metric. So let's get into it.

We have two logging solutions. The first is the sidecar pattern. It's fairly straightforward: the application writes logs to the logging agent sidecar through the forward or MessagePack protocol, and the logging agent sidecar then sends those logs to the storage backend. The logging agent sidecar is automatically injected by a mutating admission controller; the application can enable sidecar injection by adding an annotation in its deployment spec. In our case, the logging agent sidecar is a Microsoft-internal agent, but technically you could replace it with Fluent Bit. To send the logs to the logging agent sidecar, we instrument the application with a logging library. One downside of a logging library is that when you have applications written in different languages, you have to support multiple logging libraries. But we needed the logging library for two reasons. One is that we want all our application logs to have a predefined schema, just to simplify the log queries. Two is that we want the application to have the option to write to standard out as well as directly to the sidecar. So when the logging library sees, hey, there is a sidecar in my pod, it writes directly to the sidecar. And if the application disables sidecar injection, then all the logs go to standard out, where they are handled by the second solution, the forwarder and aggregator pattern.

The forwarder and aggregator pattern is the most common pattern out there, I believe. The forwarder is a very lightweight DaemonSet that tails all the container logs on the node and routes them to the aggregator Service, which then distributes the logs among the aggregator pods. The aggregator writes those logs to the storage backend, in our case Azure, and it does all the business logic to transform the logs before sending them on to the backend. For data safety, we enable filesystem buffering on the forwarder, and to adapt to the load, we enable HPA on the aggregator.
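As an illustration of the forwarder side, here is a minimal sketch of a tail-to-forward Fluent Bit configuration with filesystem buffering enabled, written in Fluent Bit's YAML config format. The paths, ports, and the aggregator Service name are assumptions for the example, not the actual production config from the talk.

```yaml
# Minimal sketch of a forwarder (DaemonSet) Fluent Bit config.
# Paths, ports, and the aggregator Service address are assumptions.
service:
  http_server: on                        # expose the monitoring HTTP API
  http_port: 2020
  storage.path: /var/log/flb-storage/    # enable filesystem buffering
  storage.metrics: on                    # include storage metrics in the API output
  storage.max_chunks_up: 128             # cap on "up" chunks held in memory

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log    # tail all container logs on the node
      tag: kube.*
      storage.type: filesystem           # buffer this input's chunks on disk

  outputs:
    - name: forward
      match: '*'
      host: log-aggregator.logging.svc.cluster.local   # hypothetical aggregator Service
      port: 24224
```

With filesystem buffering enabled like this, the storage endpoint described next can report how many chunks are held "up" in memory versus "down" on disk.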
Next, let's talk about Fluent Bit metrics. The HTTP server in Fluent Bit exposes multiple endpoints, and two of them are the ones we are going to talk about. The first is the metrics endpoint, which exposes metrics for each running plugin in Prometheus format. The second is the storage endpoint, which exposes the storage information, but in JSON format.

This is the output of the storage endpoint. We see chunks. What are chunks? Chunks are how Fluent Bit groups and stores logs in memory and in the file system. There are two types of chunks. Up chunks exist in the file system as well as in memory; they are the chunks that are currently being processed, and you can configure the maximum number of up chunks allowed in memory. Down chunks only exist in the file system. If your logging pipeline is healthy, the down chunk count is typically zero or very close to zero. If your pipeline is congested or slow, the up chunks start to grow, and when they reach the maximum limit, the down chunks start to accumulate. This is actually the metric that we use to monitor our logging pipeline. Before moving to the next slide, pay attention to the JSON path of fs_chunks_up and fs_chunks_down: it is storage_layer, then chunks, then fs_chunks_up or fs_chunks_down.

To export these storage metrics, we use the JSON exporter, also deployed as a sidecar. You can find the JSON exporter in the prometheus-community GitHub organization. This is an example of our config. The name here, fluentbit_storage_layer, is the metric name prefix, and the values field lists the metrics you want to export, in this case fs_chunks_up and fs_chunks_down. To get the metric values, the exporter follows the JSON paths, the parts in curly braces: storage_layer, chunks, then fs_chunks_up or fs_chunks_down.

We use kube-prometheus-stack to deploy our Prometheus, and this is our ServiceMonitor configuration. How many people are using kube-prometheus-stack here? Okay, cool. A ServiceMonitor is a CRD; it basically tells Prometheus where to scrape metrics. The first endpoint here is just the metrics endpoint on the Fluent Bit container. The second endpoint is the probe endpoint on the JSON exporter sidecar container. When Prometheus scrapes this probe endpoint, the exporter fetches the JSON metrics from the target, converts them into Prometheus format, and sends the result back in the response. To test this visually, you can use kubectl port-forward, as you can see on the screen. If you look at the URL, it is the probe endpoint with the target query parameter, and the result is the Fluent Bit storage metrics in Prometheus format: fs_chunks_up and fs_chunks_down.
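The slide configs themselves are not reproduced in this transcript, so here is a hedged sketch of what the two pieces could look like: a json_exporter module that maps the storage endpoint's JSON into fluentbit_storage_layer_* metrics, and a ServiceMonitor with the two endpoints described above. Names, labels, and port names are assumptions, not the actual production manifests.

```yaml
# json_exporter config sketch (prometheus-community/json_exporter).
# Produces fluentbit_storage_layer_fs_chunks_up / _down from the storage endpoint JSON.
modules:
  default:
    metrics:
      - name: fluentbit_storage_layer
        type: object
        help: Fluent Bit filesystem buffer chunk counts
        path: '{ .storage_layer.chunks }'        # JSON path to the chunks object
        values:
          fs_chunks_up: '{ .fs_chunks_up }'      # chunks held in memory (and on disk)
          fs_chunks_down: '{ .fs_chunks_down }'  # chunks that exist only on disk
```

```yaml
# ServiceMonitor sketch for kube-prometheus-stack.
# Endpoint 1 scrapes Fluent Bit's own Prometheus metrics; endpoint 2 scrapes the
# json_exporter probe endpoint, pointing it at the Fluent Bit storage API.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluent-bit                  # hypothetical name
  labels:
    release: kube-prometheus-stack  # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: fluent-bit               # hypothetical Service labels
  endpoints:
    - port: http                    # Fluent Bit HTTP server port on the Service
      path: /api/v1/metrics/prometheus
    - port: exporter                # json_exporter sidecar port on the Service
      path: /probe
      params:
        target: ["http://127.0.0.1:2020/api/v1/storage"]
```

To test it manually, as in the talk, you can kubectl port-forward to the pod and hit the probe endpoint with the target query parameter, for example http://localhost:7979/probe?target=http://127.0.0.1:2020/api/v1/storage (7979 is json_exporter's default listen port).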
So let's take a look at our live dashboard. Let me switch for a second here. It's still loading... okay, it loads now. The top panels here are the global summary of the down chunks across all the clusters, so it's kind of nice to be able to see the overall health on a single pane. You'll see that the down chunks are pretty much zero. There's a small spike, maybe five, and then it drops back down to zero. These are from Japan East, and you can see this one is from Germany West Central. Because this panel is global, the dropdown here doesn't affect it; ideally you could put this on a separate dashboard with a drill-down, but we just consolidated everything together.

The panels in the file storage group are the down chunks and the up chunks by pod. This is our busiest region, which is East US. We can switch to Brazil South: there are no down chunks, it's super healthy. If we switch back to East US, then in the I/O group, the panel on the left is the input versus the output rate, and if you look at it, they pretty much overlap each other, which is what you expect from a healthy pipeline. Actually, since we cannot switch the camera here, I'll just go ahead and use the screenshots I prepared. Can we get back to the presentation, please? Okay.

So this is the dashboard. The top panel is the global view of down chunks and up chunks. Next is the input and the output rate. If you remember the overall log volume we looked at earlier, that's actually taken from this graph, so it doesn't include the volume from the logging agent sidecar; it's only from the forwarder. The next two panels in the I/O group are the input rate and the output rate by pod. The last group is the errors and the retries. We seldom use it, but it's nice to have as an additional diagnostic tool.

Next, we'll take a look at a couple of case studies. In case study number one, you can see the down chunks are growing, but they are growing on only a single pod. The input and output rates are pretty stable, but if you look at the input rate by pod, there are two outliers with a significantly higher input rate than the rest. The last panel on the bottom right is actually taken from another dashboard; I just added it here as a screenshot. It's the CPU usage of the forwarder, and similar to the input rate, there are two outliers. The forwarder pod that has the highest CPU usage and the highest input rate is the same pod as the one with the accumulating down chunks. So what happened here? There is one tenant that generated too many logs, and it had very few replicas, so it only happened on one or two nodes. It generated too many logs and was hogging resources, and as a result our forwarder was struggling and failing to keep up. Our mitigation was basically to add the annotation to the app and onboard it to the sidecar solution.
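A per-pod buildup of down chunks like this is exactly the kind of condition you would want to alert on, not just dashboard. As a hedged sketch (not the team's actual alerting config), with kube-prometheus-stack a PrometheusRule along these lines could fire when fs_chunks_down stays above zero; the thresholds, duration, and labels are assumptions.

```yaml
# PrometheusRule sketch: alert when a Fluent Bit pod keeps "down" chunks on disk,
# i.e. the pipeline is congested and chunks are spilling past the in-memory limit.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluent-bit-buffer-health    # hypothetical name
  labels:
    release: kube-prometheus-stack  # must match the Prometheus ruleSelector
spec:
  groups:
    - name: fluent-bit.storage
      rules:
        - alert: FluentBitDownChunksAccumulating
          expr: max by (pod) (fluentbit_storage_layer_fs_chunks_down) > 0
          for: 15m                  # assumed tolerance before alerting
          labels:
            severity: warning
          annotations:
            summary: "Fluent Bit {{ $labels.pod }} is accumulating filesystem (down) chunks"
```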
This is case study number two. Now we see that the down chunks are growing too, but they are growing across all pods. If you look at the input rate, there is a very big jump, around maybe 16:45, from maybe 20k-30k up to 800k. If you look at the aggregator replicas panel, the aggregator seems to try to scale up, but then it scales back down again. And if you look at the crashing pods, most of the forwarder pods are crashing. This actually was a pretty bad incident.

So what happened here? For some reason the applications were scaling up rapidly, and while the aggregator was adding replicas and waiting for the pods to get ready, the existing replicas were overwhelmed and pods were crashing. And when your pods are crashing, the HPA is not going to work properly. Our mitigation was, at around maybe 18:45 there, to manually scale up the aggregator, and if you look at the down chunks, the forwarder then starts to recover.

There are a couple of fixes. One fix is to make the HPA more aggressive, so the aggregator scales up earlier. But this only alleviates the problem; it's not actually bulletproof, because there is always a time delay from when the load starts to increase until all the new aggregator pods are ready, so theoretically we can still hit this problem. So we also made a second change, which is to onboard the few applications that are both very chatty and have a tendency to scale up rapidly onto the sidecar solution.

What are the key takeaways here? First, there is no perfect solution for everyone, so choose a solution that works best for your environment. We have the forwarder and aggregator pattern as our default logging solution, but we also have the sidecar solution to handle the heavy applications. So be creative. The second point I want to make is that Kubernetes and Fluent Bit continue to progress and your production environment changes over time, so your solution must evolve and adapt to those changes. For example, we added the sidecar solution just about a year ago, and we are exploring adding persistence to the aggregator, or the idea of merging the forwarder and the aggregator so that we only have a DaemonSet without the aggregator. And finally, last but not least, although Kubernetes simplifies deploying containers in the cloud, it also adds infrastructure complexity. So having visibility into infrastructure components such as Fluent Bit, and choosing the best signal for your monitoring, is crucial and fundamental in production.

With that, I'll open up for questions. We have about six minutes. Yes? I couldn't hear you. Finally, we have a question.

Hello. Sorry. I wanted to ask: you said you're currently using the forwarder-aggregator approach, but you're switching to DaemonSets?

Not switching; we are actually exploring merging the forwarder and the aggregator. If you merge them together, it's expensive on the node, right? Because your node then has to do all the business logic that the aggregator does, so it takes resources from the node. We're still exploring that. Does that answer your question? Yeah. Thank you. But I also want to mention that the sidecar, although it is expensive, I think works in certain scenarios, like for us in the second case study, because when you have the logging sidecar inside your pod together with your application, if your application scales up, your logging agent also scales up with it at the same time. So I think that's the advantage, but it is expensive.

Okay. Thanks for attending my session. If you have any further questions, feel free to talk to me afterwards. I'll be around today. Thank you.