Hello everyone, my name is Ugur Kira. I am a Senior Specialist for Containers here at AWS. Today, together with my colleague Wesley, we will talk about the scaling aspects of Fluent Bit and share some insight based on our experience. Here is our agenda: we will go over scaling considerations and scaling best practices, discuss the worker concept in Fluent Bit, and present some benchmarking statistics. So, to start with, when we talk about scaling your log delivery pipeline, you need to consider three main focus areas. First, you need to consider your source capabilities, which means the total CPU and memory capacity of your worker node and the network throughput your worker node has. Second, you need to consider your destination capabilities: what is the ingestion rate your destination can sustain? This destination could be Kinesis Data Streams, CloudWatch Logs, or any other destination, but you need to understand its throttling limits and ingestion rate, because they have a direct impact on the throughput and performance of your log delivery pipeline. And third, you need to consider the performance and actual throughput of the pipe itself, which in this case is the Fluent Bit daemon. So you need to understand Fluent Bit well, optimize its configuration, and actively monitor your pipeline. Another aspect of scaling is that you need to choose the right filter and understand its characteristics, and this is really vital when we are talking about scale. Let's take the Kubernetes filter as an example. As you know, one of the characteristics of this Kubernetes filter is that, for each pod, it queries the API server to get extra metadata to enrich your logs, and this metadata provides the pod's labels and annotations.
However, this enrichment work, querying the API server for metadata so it can put more context into your logs, adds additional load on your API server, which may impact the performance and scale of your log delivery pipeline. This is the default behavior. One thing I want to call out: there is a new proposal for this. Within this conference series there is a lightning talk called "Scaling the Fluent Bit Kubernetes filter in very large clusters". Our colleagues take a different approach to get the same metadata: rather than querying the API server directly, they rely on the kubelet to obtain that information, which reduces the load on the API server. It's really interesting to watch if you haven't seen it yet. Another thing you need to consider is that, whenever possible, you should offload work from your pipeline by applying filtering at the source. How can you do the filtering? You can use Fluent Bit filters. In a successful log delivery pipeline, you should only ship the logs and data that really have value for your business; there is no point in streaming a log that doesn't have any business value. So you can use Fluent Bit filters to filter your logs at the source. Fluent Bit also has a stream processing feature which can be used to filter and process your data and logs at the source. When we come to scaling best practices: first, as mentioned, you need to understand the behavior of the filters and plugins you are using. We already mentioned that the Kubernetes filter generates extra load on your API server; you need to know that. Moreover, by default it appends both labels and annotations to your log records as context.
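As a minimal sketch, a Kubernetes filter definition that relies on the kubelet instead of the API server, and that drops labels and annotations, might look like the following. The option names (`Use_Kubelet`, `Kubelet_Port`, `Labels`, `Annotations`) are real Fluent Bit kubernetes filter settings; the match pattern is illustrative.

```ini
[FILTER]
    Name                kubernetes
    Match               kube.*
    # Query the kubelet for pod metadata instead of the API server,
    # reducing load on the API server in large clusters
    Use_Kubelet         On
    Kubelet_Port        10250
    # Skip metadata that usually carries no business value
    Labels              Off
    Annotations         Off
```

Note that `Use_Kubelet` assumes the DaemonSet pod can reach the node's kubelet endpoint, which may require additional RBAC and hostNetwork settings depending on your cluster.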
And most of the time these labels and annotations don't carry any business value; they are quite Kubernetes-specific, yet they add extra load on your log delivery pipeline. So you can consider disabling them whenever possible, as you see on the slide. The next thing you can do in a large cluster is use the Mem_Buf_Limit option, which is unset (unlimited) by default. Based on our testing, we recommend setting it somewhere between 50 megabytes and 100 megabytes, depending on the workload. There is a nice article called "Backpressure" in the Fluent Bit documentation which explains this in detail and helps you determine the right value. Along with the memory buffer limit, you can also enable filesystem buffering. This helps at large scale and ensures data and log integrity in case your log destination is not able to cope with the load, or there is a technical problem on the destination side. For CloudWatch Logs, we recommend using the high-performance CloudWatch Logs plugin, called cloudwatch_logs. Based on all of this, we have an optimized Fluent Bit configuration on our Container Insights page. I recommend you review that configuration and try to keep your own Fluent Bit configuration as close as possible to the one we share in the Container Insights documentation. The last best practice I want to recommend is monitoring the health of your log delivery mechanism; this is really important but often an ignored aspect. Fluent Bit comes with a built-in HTTP server, and you can enable it in your Fluent Bit configuration as you see here. You can then build dashboards and graphs from the metrics it exposes. Now I am handing over to my colleague Wesley, who will talk about the worker concept and how it helps with the scaling aspects of Fluent Bit. Okay, hello everyone, I'm Wesley.
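Putting these recommendations together, a hedged sketch of the service, input, and output sections might look like this. The option names are real Fluent Bit settings, but the limit, paths, and log group names are illustrative values, not the exact Container Insights configuration.

```ini
[SERVICE]
    # Built-in HTTP server exposing metrics (e.g. /api/v1/metrics)
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    # Location for filesystem buffering chunks
    storage.path    /var/fluent-bit/state/flb-storage/

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    # Cap memory used by in-flight chunks (50-100 MB recommended)
    Mem_Buf_Limit   50MB
    # Spill to disk under backpressure instead of dropping data
    storage.type    filesystem

[OUTPUT]
    # High-performance C plugin for CloudWatch Logs
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    /example/log/group
    log_stream_prefix from-fluent-bit-
```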
I am one of the co-maintainers of Fluent Bit, and I've contributed a lot of code to it, in particular the AWS plugins. So I'm going to explain how concurrency works in Fluent Bit, which is important for understanding scaling with Fluent Bit, and the new worker feature that was introduced. Prior to workers, this is how concurrency worked in Fluent Bit. Fluent Bit uses a model called coroutines, which I believe is short for cooperative routines. Technically Fluent Bit has multiple pthreads in use, but without workers only a single thread is active at a time, and the threads cooperatively pass execution between each other. You can think of it like this: the Fluent Bit core engine gets some task, let's say a chunk of data that needs to be sent to an output destination. It invokes a coroutine for the output that runs the output code and passes execution to it, which means the engine thread hands control over to the coroutine; now only the coroutine is active. The coroutine performs some work, and when it hits a network IO call, rather than blocking on that call and stalling all execution, it makes the network call but then yields execution back to the engine, which takes control and can run other coroutines to complete other tasks. When the kernel notifies the engine that the network call has completed, the coroutine can be unpaused; basically it has been sleeping and can now be activated again. So the way it essentially works is that there is concurrency, and in some sense there are multiple threads, but only one of them is active at a time, and they cooperatively and intelligently turn themselves on and off so that network IO calls never slow down the execution of the program. Essentially, the mechanism exists to do asynchronous network IO.
But now, as of the Fluent Bit 1.7 series, we have multi-workers. With workers you can have multiple threads active at a time. The simple story is that workers are just dedicated threads for outputs. Here we have an example where an output sets workers to one, which enables a single dedicated thread just for that output. Workers can still have coroutines running inside them, which means you are still doing non-blocking network IO to keep things efficient and fast; it's just that the coroutines for the output all run in the context of a dedicated thread just for that output. And of course you can enable more than one worker if you want, but I think in most situations folks' throughput needs can be satisfied with a single dedicated thread. Now, how are concurrency and workers implemented in the actual plugin code? All output plugins must implement a flush callback function, which is called when they need to be given logs to send. In order to have concurrency, the calls to this function must be fully independent; basically, it can't be storing state. There is a context object available to this function that allows it to store state between calls, but if you are storing state in your output plugin, that prevents you from having concurrency, because it means the engine can't independently have you send multiple chunks at the same time.
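As a minimal sketch, enabling a dedicated thread for an output looks like this; `workers` is the real configuration key, and the output plugin shown here is just an illustration.

```ini
[OUTPUT]
    Name    stdout
    Match   *
    # One dedicated thread for this output;
    # coroutines still run inside it for async network IO
    workers 1
```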
Another key thing, which follows from that, is that the API you call can't have any sort of required ordering, where calls must be made in a series, because there's no guarantee of ordering when you're running concurrently: one coroutine might be sleeping on a network call while another coroutine starts sending before the first coroutine's network call finishes. So that is one issue. With workers, I think pretty much any output plugin should be able to support workers with only minor code changes. Interestingly, concurrency support is actually not needed in order to support a single worker, that is, the ability for users to configure one worker. The reason is that a worker is just a dedicated thread: even if your output is synchronous and does blocking network IO instead of non-blocking coroutine IO, you can still have a dedicated thread and do all of those synchronous calls in it. So workers can still help. Now, if you want to let your users enable multiple workers for an output, then you also need to support concurrency, because having multiple workers is at least equivalent to having coroutines, except now they're running on multiple threads. Interestingly, as a side note, if folks are familiar with Golang and goroutines: multiple workers, meaning multiple threads, with coroutines operating inside them is, I'm pretty sure, fundamentally a lot like the Go scheduler multiplexing goroutines onto real threads. Anyway, let's look some more at the code: how do you enable concurrency and workers?
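For plugin authors, a hedged C sketch of the two pieces involved, based on the AWS Fluent Bit developer guide. The exact flag and function names (`FLB_IO_ASYNC`, `flb_output_upstream_set`) come from the Fluent Bit codebase but may differ between versions, and `ctx->u` / `ins` are the plugin's upstream and output-instance handles as typically named in the plugin init callback.

```c
/* In the output plugin's init callback, after creating the
 * upstream object that represents the destination endpoint: */

/* Disabling async (coroutine) networking turns off concurrency,
 * so flushes do blocking IO on a dedicated worker thread instead. */
ctx->u->flags &= ~FLB_IO_ASYNC;

/* Register the upstream with the output instance so that users
 * can configure the `workers` option for this plugin. */
flb_output_upstream_set(ctx->u, ins);
```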
So, if you have an output that you maintain for Fluent Bit: concurrency is enabled by default for all outputs, and you can turn it off with a single line of code. I wrote a whole developer guide that explains this in more detail if you really want to understand what's going on. There is a concept called an upstream in Fluent Bit, a core networking abstraction representing an endpoint that you're connecting to, and you can disable asynchronous calls on it, which disables concurrency; and then there's another call that you have to make, which allows people to configure workers if they want. Okay, now let's talk some more about worker and concurrency support in the plugins that I maintain. I maintain the AWS plugins in Fluent Bit, and there are four of them: S3, CloudWatch, Kinesis Streams, and Kinesis Firehose. Support for workers and concurrency varies. The Kinesis Streams and Kinesis Firehose plugins support everything: any number of workers, and full concurrency. The S3 plugin can support concurrency and multiple workers, but only if you enable PutObject as the API it uses to upload to S3; I'll show that in a moment. And then there is CloudWatch Logs, which unfortunately doesn't support concurrency, but it can support a single worker, because, as I noted, basically any output can support a single worker, which is a dedicated thread in which operations happen synchronously. So here are some example output definitions to show what this all looks like. I'm omitting some of the required fields that you have to configure for these outputs. You can see that for Kinesis Streams and Kinesis Firehose you can enable any number of workers that you want.
For S3 with the default multi-part upload mode, which only kicks in if you have a file size above, I think, 12 megabytes, you can enable a single worker. The calls will be synchronous, but they happen in a dedicated thread. The reason S3 multi-part is non-concurrent is that multi-part uploads are a series of calls and they require storing state; the plugin is not stateless. This is something I've thought about: there probably is a way to allow it to work with multiple threads, but I'd need to rewrite the code and think a little more about that. Then there's S3's PutObject mode. PutObject is an API where you send an entire object in a single request, so it can be done concurrently, which means you can enable any number of workers and it's fine. And finally, at the top, we have CloudWatch. Unfortunately, it doesn't support concurrency. The reason is the concept of a sequence token in the PutLogEvents API, which basically means that calls to the API have to be ordered: something obtained from the response of the previous call must be sent in your next call. This means you have no concurrency; you can only enable one worker, a dedicated thread for the output, in which it will run synchronously. In a moment we're going to show some benchmarking results; surprisingly, despite the fact that there's no concurrency, it actually still performs pretty well. Anyway, wrapping up concurrency and worker support: these are the remaining open problems for the Fluent Bit community and Fluent Bit core to solve. Now I will turn it back over to Ugur, who will go over some benchmarking results for Fluent Bit with workers. Thank you, Wesley, for your insight around workers and concurrency.
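To make that support matrix concrete, a hedged sketch of the four output definitions might look like this. Required fields such as region and stream names are omitted, as in the talk; `workers` and `use_put_object` are real option names, but the worker counts are illustrative.

```ini
# CloudWatch Logs: no concurrency (sequence token), at most one worker
[OUTPUT]
    Name    cloudwatch_logs
    Match   *
    workers 1

# S3 in PutObject mode: single-request uploads allow concurrency
[OUTPUT]
    Name           s3
    Match          *
    use_put_object On
    workers        4

# Kinesis Streams: full concurrency, any number of workers
[OUTPUT]
    Name    kinesis_streams
    Match   *
    workers 4

# Kinesis Firehose: full concurrency, any number of workers
[OUTPUT]
    Name    kinesis_firehose
    Match   *
    workers 4
```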
Now I will provide some more insight on how different destinations and log delivery pipelines perform with and without workers enabled. In this session we will consider Amazon CloudWatch Logs and Amazon Kinesis Data Streams as destinations. Here you see the details of our benchmarking setup, and we will start with the workload on CloudWatch Logs. Imagine you have a production environment that generates around 8,500 logs per second; this is without the worker enabled. Keeping all other variables more or less the same, and by just enabling the worker in your Fluent Bit configuration, we see increased throughput in terms of the processing rate of Fluent Bit. As we see here, enabling workers while keeping all the other variables stable has a really major impact and improves log processing performance dramatically. When we look at logs delivered in terms of network performance, the same workload generated around two megabytes per second of logs streamed over the pipe. In parallel to the increase in the number of logs processed on the previous slide, after enabling the worker, the delivery throughput in terms of networking also increased, as we see here. One additional thing we did: we wanted to see the impact of enabling the worker on resource consumption, and what extra load it brings in terms of CPU and memory to our cluster. So here are the numbers we got while testing Fluent Bit without the worker at 1,000 and 5,000 logs per second; these are the CPU and memory values we captured on the Fluent Bit pods that streamed the logs. When we ran a similar workload with the worker enabled, what we see is that memory utilization is more or less the same, with no increase, and CPU utilization increased slightly.
And when we look at the same figures at 5,000 logs per second with the worker enabled, we see an increase in CPU utilization, whereas memory utilization stays more or less stable. So this means enabling workers has a minimal impact on resource consumption: memory utilization at the pod level remains more or less stable, while CPU utilization increases somewhat. Next we benchmarked Kinesis Data Streams. We built a log pipeline including Kinesis Data Streams, using Kinesis Data Firehose to stream the data to S3; so logs are generated, go through Kinesis Data Streams and Data Firehose, and are shipped into S3. Here is the view in the AWS console. Imagine we have this setup running, and in our production environment we generate around 4,000 logs per second; then, keeping all other variables the same and just enabling the worker, we see that our incoming processing rate increased dramatically. When we look at the increase in terms of networking throughput, we experience the same: in parallel to the increasing processing rate of the logs, our throughput also increased. One thing to note here: these numbers are not the maximum throughput limits of the setup; they just show how workers can impact your log delivery performance simply by being enabled, rather than the maximum capacity or limits. And with Kinesis Data Streams, you also need to think about the number of shards that you have, and about throttling at the Kinesis level, as this also contributes to the performance of your log delivery pipeline.