Hi everyone, this is Drew from Amazon AWS. I'm currently focusing on container metrics and logging-related work, both inside AWS and in open source Fluent Bit. Hi, I'm Nitesh Kumar Murcharla, a senior cloud support engineer with AWS as well. My major area of focus is containers and container-related technologies and services on AWS.

So today we're going to talk about a new feature we just finished in open source Fluent Bit: scaling the Fluent Bit Kubernetes filter in very large clusters. This feature addresses a scalability issue in the Kubernetes filter. Some customers using Fluent Bit see the API server struggle and become unresponsive when they try to scale up their Kubernetes clusters. This is because Fluent Bit is hammering the API server with list-all-pods calls. These are expensive calls that can bring down the Kubernetes API server and control plane, and they make the API server the bottleneck for scalability.

So let's look at the architecture. The requests Fluent Bit sends out fetch pod metadata from the API server, but this information is available not only from the API server but also from the kubelet, the primary node agent that runs on each node. Fluent Bit could, instead of calling the API server, simply call the kubelet to get the same information. So we proposed giving customers an option to fetch the metadata from the kubelet instead of the API server, removing the API server bottleneck. With this solution, we reduce the requests to the API server to zero, and in exchange add one request per node to its kubelet. Now let me hand it over to Nitesh to talk about the test results for the feature.

Thanks, Drew. For this, I baselined Fluentd first. With Fluentd, we saw over 67,000 API calls. I created a cluster of 2,000 nodes with over 30,000 pods, and every node had a Fluentd agent running on it. With those 30,000 pods, I was churning at a rate of 1,000 pods per hour. At that rate, the chart you're seeing on the screen shows the Fluentd agents making 60,000 to 70,000 API calls, collected over three hours. Apart from that, we also saw a lot of watch and get API calls made to the API server by Fluentd. You can also see that the P99 latency is high whenever list pods is called by the Fluentd agents or other agents. These are some internal metrics collected from the EKS service; I just wanted to show, for your understanding, how we know the list API calls are taking a lot of time. These are a continuation of the same metrics; you can see the list latency is spiky in the third row.

Because we are internal to AWS, we were able to see those metrics on our internal dashboards. Customers, however, might not be able to view metrics at that granularity. So I also set up Prometheus on my EKS cluster to show the number of API calls being made, along with the etcd request latency and API server request latency metrics. I saw over 800 list calls whenever I churned my cluster with a lot of pods in my deployments.
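As an illustration of the kind of Prometheus queries involved (the exact expressions and label values here are assumptions, not taken from the talk), queries like the following against the standard apiserver_request_total and apiserver_request_duration_seconds metrics surface the list-call volume and P99 latency that a customer can check on their own cluster:

    # Rate of LIST/GET/WATCH calls against pods, broken down by verb.
    sum by (verb) (
      rate(apiserver_request_total{resource="pods", verb=~"LIST|GET|WATCH"}[5m])
    )

    # P99 API server latency for pod requests, broken down by verb.
    histogram_quantile(
      0.99,
      sum by (le, verb) (
        rate(apiserver_request_duration_seconds_bucket{resource="pods"}[5m])
      )
    )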
You can see there were a huge number of API calls being made to the API server, which slowed down my kubectl commands; that is what this particular screenshot is trying to show. Apart from that, what I also noticed with Fluentd: I was running a thousand clients of a log-generating test agent, each sending about 600 lines of logs per minute, with over a thousand of those clients running on my cluster for scale testing. As you can see, multiple pods landed on a single instance, and Fluentd was taking over 1,500 megabytes of memory, which is a lot. Also, at the bottom of the screen, you can see that the kubectl calls I made to the API server took over 15 seconds. And that was not even the maximum; this data was recorded right after I made an API call or initiated an update to my cluster. Whenever I listed all the pods in the cluster, it sometimes took over 60 seconds and timed out.

Later I switched to Fluent Bit with the kubelet feature that we developed. As you can see on the screen, I'm querying data from over a week; I've been running the feature for over a week in my performance cluster and ran quite a few tests on it. I haven't seen a single API call made to the API server by Fluent Bit with the Use_Kubelet flag set to true. This significantly reduced the workload and memory pressure on the API server, since Fluent Bit is no longer making the list calls that Fluentd made to fetch metadata.

Yep, that's all for our sharing. Thanks everyone for attending. Thanks, all.
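For reference, the feature described here maps to the Use_Kubelet option of the Fluent Bit kubernetes filter. A minimal sketch of the filter section with the feature turned on (the match tag is illustrative; per the Fluent Bit documentation, the DaemonSet also needs hostNetwork with the ClusterFirstWithHostNet DNS policy and RBAC access to nodes/proxy for the kubelet call to work):

    [FILTER]
        # Fluent Bit kubernetes metadata filter
        Name            kubernetes
        Match           kube.*
        # With Use_Kubelet enabled, pod metadata is fetched from the local
        # kubelet's /pods endpoint instead of the Kubernetes API server.
        Use_Kubelet     true
        Kubelet_Port    10250
        # 0 removes the read buffer limit for the kubelet response.
        Buffer_Size     0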