Hi everyone, welcome to Petabyte Scale Logging with Fluentd and Fluent Bit, a use case from Intuit. My name is Anurag, I'm part of the company Calyptia and I help run product there.

So who are we? Today it's going to be myself and Hansel, who's joining me from Intuit, where he's on the cloud observability team as a senior software engineer. As for myself, I'm part of the OSS group that maintains the Fluent Bit project, and I help drive product for both Fluent Bit and Fluentd.

For those who are unfamiliar with Fluentd and Fluent Bit: they are Cloud Native Computing Foundation projects, graduated projects alongside Kubernetes, Prometheus and others. They started 10 years ago, and the primary problem they solve is how you take data from point A and send it to point B. If we look at this problem over the last 10 years, data sources have continually evolved: we have more things like mobile applications, system logs, Kubernetes, microservices, containers and application metrics, and the destinations just keep compounding. You might have Amazon S3, you might have Splunk, Elasticsearch, Loki, Azure, GCP, Kafka, and those destinations continue to evolve and expand. You might need to send data to several of them, to all of them, or to a single one, but in any case Fluentd and Fluent Bit are vendor-neutral solutions that let you collect data once and send it to as many destinations as you require.

Now when we look at the challenges of logging at scale, we have to look at them from a broad perspective. One is that high scale can mean very high costs. There's reliability and buffering: how do you make sure logs get sent where they're supposed to go, without opening a new request for every single message? Think about that when you're sending 10,000 to 100,000 messages per second. Networking is always an issue, with things like ephemeral workloads in Kubernetes or in the cloud, and you might have air-gapped environments, so how do you make sure networking doesn't hurt you when you're logging at scale? Event throughput: you might need to maximize how many messages per second you're sending, you might have latency requirements, and you might have to deliver messages so your developers and operations folks can go debug and diagnose with that information. Security: in a large-scale environment, sensitive information might make its way through, so how do you secure the data in transit? And last but not least, when you're running this in production, how do you minimize the operational and performance impacts at larger scale?

So what we'll talk about in this session is how Intuit solved many of these challenges, but also how Fluentd and Fluent Bit have solved them over the 10 years those projects have been running. One is filtering, parsing and compression: support for compressing payloads, being able to take data and throw away what isn't used, and parsing data so you can make real sense of log messages. For example, taking an Apache HTTP server log and extracting fields like the source IP and the request type, which you can then enrich with things like geo information.
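To make that parsing point concrete, here is a minimal Python sketch of the kind of structured parsing a log parser performs on an Apache HTTP server access log. The regex and field names are illustrative, not Fluent Bit's actual built-in parser definition:

```python
import re

# Regex roughly matching the Apache "combined" access log format (illustrative).
APACHE_COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_apache_line(line: str) -> dict | None:
    """Turn one raw access-log line into a structured record."""
    match = APACHE_COMBINED.match(line)
    return match.groupdict() if match else None

sample = ('203.0.113.9 - frank [10/Oct/2023:13:55:36 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/8.0"')
record = parse_apache_line(sample)
# Once the line is structured, fields can be enriched further,
# e.g. a GeoIP lookup keyed on record["host"].
print(record["host"], record["method"], record["code"])
```

Once the raw line becomes a structured record, downstream filters can drop unused fields or enrich the remaining ones before the event is shipped.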
Reliability and buffering: within Fluentd and Fluent Bit there are file system and memory buffers. If you need really fast performance you can use the memory buffer, which is the default, but both projects also support file system buffering, which is great in case something crashes. For example, a server might go down, but because you're storing everything on the file system you're able to recover all of that information and none of the data is lost. Both projects follow an at-least-once delivery mechanism.

That leads into networking: when networking goes down, we don't lose data. We have a configurable retry mechanism as well as back-pressure handling, so as the load increases from, say, a file or an application, Fluent Bit is able to understand that the destination may not be able to receive that many messages per second and will retry at a suitable time in the future. And especially when you're logging at scale, you might have thousands and thousands of endpoints or applications logging data; how do you make sure all those retries don't fire at exactly the same time and inadvertently cause a denial of service? These are places where we've built a lot of that functionality into open source Fluentd and Fluent Bit, as illustrated in the sketch below.

Event throughput: if you need additional performance and you have the resources to dedicate, you can use multi-worker configuration in both Fluentd and Fluent Bit, which gives you the ability to utilize the server resources at your disposal.

With security there are two pieces. If you're protecting sensitive information, you can use anonymization filters or parsing to remove sensitive fields, and of course both Fluentd and Fluent Bit support TLS for encryption in transit.

Operationally, we can use different architectures; both projects have been designed with ease of use and flexibility in mind. You might want to minimize the resources spent at your source, within containers, Kubernetes nodes or servers, and do more of the processing at a centralized layer. We call that the forwarder-aggregator architecture.

So, some common architecture patterns. We have forwarder-only, which lets you take all of these sources as independent pieces and send data; each one individually handles back pressure, so you're not relying on a single point to gather all the data, enrich it and send it out. The disadvantages are that you have to do a bit more configuration management, and it is sometimes hard to add more destinations, which ties back to configuration: if I'm sending data to Splunk, Elasticsearch and Kafka all at the same time, I might have to configure that individually on each forwarder. A common architecture pattern we see at large scale is forwarders with aggregators. That means less resource utilization on the edge devices: all they do is collect the data and forward it on to an aggregator. It allows you to process and scale the aggregators independently, and you can add more backends in a single tier. The disadvantage is that you do have to dedicate resources to the aggregator instances.

Now, that's all for the general overview of how the Fluentd and Fluent Bit projects are placed within the Cloud Native Computing Foundation and how they address some of the challenges they solve for at scale.
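To illustrate the jittered-retry idea mentioned above, here is a small Python sketch of at-least-once delivery with capped exponential backoff and full jitter. It is a conceptual sketch with hypothetical function names, not Fluent Bit's actual retry scheduler:

```python
import random
import time

def send_with_retries(send, record, max_retries=5, base=1.0, cap=30.0):
    """At-least-once delivery sketch: retry with capped exponential
    backoff plus full jitter, so thousands of forwarders that fail at
    the same moment do not all retry at the same moment and overwhelm
    the destination."""
    for attempt in range(max_retries + 1):
        try:
            send(record)          # e.g. an HTTP POST to the backend
            return True
        except ConnectionError:
            if attempt == max_retries:
                return False      # caller keeps the record buffered on disk
            # sleep a random time in [0, min(cap, base * 2^attempt))
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# A flaky destination that succeeds on the third call:
calls = {"n": 0}
def flaky(_record):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend unavailable")

print(send_with_retries(flaky, {"log": "hello"}))  # -> True
```

The randomized delay is the important part: without it, a fleet of agents that all lost connectivity at once would all retry in lockstep and hammer the destination the moment it comes back.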
With that, I'm really happy to pass it over to Hansel, who will talk about the specifics of the use case at Intuit and give us an idea of how petabyte scale can be achieved in reality. So let me hand it over to Hansel.

Thank you, Anurag. Now we'll get into the use case at Intuit. Intuit provides financial solutions for consumers, small businesses and self-employed individuals; major products from Intuit include TurboTax, QuickBooks and Mint. At Intuit, Kubernetes is adopted widely across four business units. It's used extensively in over 30 segments, with more than 5,000 engineers onboarded to it, and we have over 80 engineers developing our Kubernetes platform. We'll see how Fluentd and Fluent Bit are used to handle logging from Kubernetes clusters at Intuit.

The use case is to transport logs generated within a container, written to its stdout or stderr stream, to our persistent log store. To enable this, we have a DaemonSet deployed in the cluster, which ensures that a logging pod is present on each and every node. This pod contains Fluentd or Fluent Bit, which is responsible for collecting and forwarding the logs from that entire node. The local logging plugin writes the logs into files, which are tailed and processed. The log events must also be enriched with metadata that lets us identify the cluster, namespace, pod and container where each event originates when running search queries at the log store. And all of this has to be fast and efficient: we want to transport huge volumes of data with as low an end-to-end latency as possible.

We have a centralized persistent log store where all customers can see and query log events, and we wanted a common pipeline to transport all the logs. This simplifies the connectivity between the VPCs containing the Kubernetes clusters and the VPC containing the log store: you only have to expose the log store to the pipeline instead of to all the clusters. It also helps us avoid sending the log events over the internet; sending data over the internet is very expensive, and we can cut the cost drastically by sending data through AWS's network only. During any connectivity issues with the log store, once the buffers fill up, the logs are gone for good; an intermediate store for these logs prevents that. It also provides an opportunity for other applications to read the raw log data, which is especially useful for security, compliance and analytics applications.

This is our naive approach, which is a streaming pipeline. There is a DaemonSet deployed in the cluster, so each Kubernetes node will have a logging pod running in it. This pod tails the log files from the host, mounted as a volume. There is a Fluent Bit plugin that gets the Kubernetes metadata from its API and enriches the log events with it. These enriched log events are then sent to a Kinesis data stream, and the events are routed to Kinesis Firehose. Kinesis Firehose has a native capability to write to the log store, and that's how the enriched log data reaches the persistent store. This provides a reliable, scalable mechanism to transport the logs from hundreds of Kubernetes clusters onto a centralized persistent log store. We only need access to the log store from Firehose, and the data transfer happens within AWS's network. Also, the logs can be read from the Kinesis data stream by just using a consumer.

This pipeline was working well for us initially, but as we started supporting more and larger clusters, we started facing some challenges with it.
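As a concrete illustration of one piece of that metadata enrichment, here is a hedged Python sketch of how pod, namespace and container can be derived from the kubelet's container log path before enriching further via the Kubernetes API. The regex is illustrative; the real Fluent Bit Kubernetes filter's implementation differs:

```python
import re

# The kubelet symlinks container logs under /var/log/containers/ with
# names of the form <pod>_<namespace>_<container>-<container-id>.log.
LOG_PATH = re.compile(
    r'/var/log/containers/'
    r'(?P<pod>[^_]+)_(?P<namespace>[^_]+)_'
    r'(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log'
)

def metadata_from_path(path: str) -> dict | None:
    """Derive basic Kubernetes metadata from a container log file path."""
    match = LOG_PATH.match(path)
    return match.groupdict() if match else None

path = ("/var/log/containers/payments-7d4b9_prod_app-" + "a" * 64 + ".log")
print(metadata_from_path(path))
# -> {'pod': 'payments-7d4b9', 'namespace': 'prod', 'container': 'app',
#     'container_id': 'aaaa...'}
```

The path alone yields pod, namespace and container; labels, annotations and other fields then come from a Kubernetes API lookup keyed on that information.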
We had a hard limit on the number of events that could be transported every second by each pod. With the solution that used Fluentd as the logging container we were capped at 2,500 events per second per node; once we switched over to a Fluent Bit based logging container we were able to double that limit to 5,000 events per second per node. However, this was still not enough, especially for larger nodes with a lot of containers packed inside. Also, as you saw, there are two hops between the cluster and the log store, and each of these hops introduced some latency, which quickly added up. We started seeing the median end-to-end latency getting as high as 30 seconds, which was not acceptable for our most time-critical applications, so we decided to reduce the number of hops and hence the time the transport takes. Furthermore, to maintain this streaming pipeline we had to keep some persistent resources running, while the log load that needed to be transported was extremely elastic, varying across different times of the day. Even with autoscaling enabled, these resources were not able to scale efficiently to match that log load. This made the pipeline inefficient in terms of cost, and it ended up being too expensive; we wanted to cut down costs as well.

So in simpler terms, the targets for the new pipeline were: increase the throughput dramatically without increasing the resource consumption of the pod, reduce the end-to-end latency of the log transport pipeline as much as possible, and minimize the cost needed to maintain it.

This is where the S3 pipeline comes into play, and this is how its architecture looks. The space enclosed by the blue dashed line represents a node running in the Kubernetes cluster. Each node contains a DaemonSet pod containing Fluentd. The logs written by a container to its stdout and stderr streams are written as files that are rotated on the node, all done by the Docker daemon. These files are exposed to the container by using a volume mounted directly onto the host. The Fluentd container then tails these files and listens for new log events, and for each new log event Fluentd pushes it onto a buffer specific to that container. At a fixed interval, 10 seconds in this case, the buffers are flushed to an S3 object after being compressed with gzip. So each S3 object contains the logs generated by one container over 10 seconds, compressed as a gzip file. In this pipeline Fluentd is not responsible for enriching the log data with metadata; rather, it writes the logs as-is to the S3 object.

Now, we have a separate deployment in the cluster which ensures that a single pod is running across the cluster. This deployment communicates with the Kubernetes API and gets information about any new containers created in the cluster; for each new container it fetches its placement and other metadata, including its labels and annotations, and pushes that onto a separate S3 bucket.

Within the log pipeline VPC we have a deployment running on AWS Fargate. This deployment listens for new objects in the log S3 bucket. For each new object in that bucket, the object path indicates which container the logs are for. The log pipeline deployment then fetches the corresponding metadata from the metadata bucket, and both are sent over together to the log store; this communication happens over an AWS transit gateway. The log store is then able to apply the metadata to each and every event in the log S3 object.
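Here is a minimal Python sketch of the DaemonSet side of this design, assuming boto3 is available and using a hypothetical bucket name and key layout: raw lines are buffered per container and periodically flushed as one gzip-compressed S3 object whose key identifies the container, mirroring the 10-second flush just described. This is an illustration of the pattern, not Intuit's actual Fluentd configuration:

```python
import gzip
import time
from collections import defaultdict

import boto3  # assumed dependency; bucket name and key layout are hypothetical

FLUSH_INTERVAL = 10  # seconds, the fixed flush interval from the talk
s3 = boto3.client("s3")
buffers: dict[str, list[bytes]] = defaultdict(list)

def append(container_id: str, raw_line: bytes) -> None:
    # Buffer raw log lines per container; no enrichment at the edge.
    buffers[container_id].append(raw_line)

def flush() -> None:
    # Gzip each container's buffer and write it as one S3 object whose
    # key encodes the container, so the downstream Fargate reader can
    # look up the matching metadata object for that container.
    for container_id, lines in buffers.items():
        if not lines:
            continue
        body = gzip.compress(b"".join(lines))
        key = f"logs/{container_id}/{int(time.time())}.gz"
        s3.put_object(Bucket="example-log-bucket", Key=key, Body=body)
        lines.clear()

# In a real agent, flush() would run on a timer every FLUSH_INTERVAL seconds.
```

The design choice to highlight is that the edge does nothing but buffer, compress and upload; everything expensive happens downstream, off the node.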
The key takeaways in this architecture are that the log data is compressed in the cluster and only decompressed at the log store, that all the data transfer happens within the AWS network, and moreover that the transfer happens via a transit gateway created specifically for that purpose.

As you have now seen, Fluentd is deployed in the cluster as a DaemonSet. This ensures that there is a single pod on every node, and each pod is responsible for transporting the logs from all the containers on that node, regardless of how big the node is. From this we can see that this pod becomes the bottleneck and the weakest link in the chain: the throughput is set by the amount that can be transported by a single pod. The biggest gain in throughput can therefore be achieved by making the DaemonSet pod transport more and more events, and one mechanism to achieve this is to do as little processing as possible in the DaemonSet pod. That is the key idea in the new pipeline design. So next we wanted to identify everything that could be offloaded from Fluentd, and how to perform those functions elsewhere. After multiple iterations, we were able to pick two CPU-intensive processing steps that could be offloaded without any change to the events at the log store, and these were what gave us the massive jump in throughput.

The first thing we were able to offload is timestamp parsing. A timestamp has to be parsed from each event for two reasons: to attribute a timestamp to each event, and to identify the starting line of a new event in the case of multi-line events. This was very CPU intensive, but we were able to offload it onto our log store without any change in the end-user experience.

The second thing we offloaded is the metadata enrichment. In the previous pipeline, each event was enriched with the required metadata by Fluentd itself. This has now been offloaded and is done by the metadata deployment, the external log pipeline deployment and the log store. The metadata deployment communicates with the Kubernetes API to get the required container metadata into an S3 object. The external log pipeline deployment then collates this metadata with each S3 log object, which is also container specific, and both are sent together to the log store, which can then apply the metadata to all the events in the file. The additional advantage is that the metadata enrichment happens in batch, across all the events in an S3 object at once, which reduces both the processing needed and the amount of data transferred.

When looking at the cost breakdown for the pipeline, we found that data transfer was the highest cost component, higher than even compute and storage. So if we could reduce the amount of data sent over, and use a more cost-effective way to transfer it, we could make huge savings, and as an added benefit, as the amount of data transferred goes down, the latency goes down too. As mentioned before, Fluentd compresses the buffered data with gzip when it's flushed to an S3 object; at normal log load we were getting around 10x compression. This gzip file is then written to the log store as-is. What that means is that the log data stays compressed from the moment it leaves the cluster until it reaches the log store; in other words, the log data is always compressed in transit. Just this one change cut the data transferred down to one tenth of its previous volume. Also, as we apply the metadata in batch for an entire S3 object, each event need not carry the metadata along with it; this also gave us large savings in the amount of data transferred by cutting down the size of each log event compared to the previous streaming pipeline.
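To show why timestamp parsing matters for multi-line events, here is an illustrative Python sketch of the grouping logic that was offloaded to the log store: a line that starts with a timestamp begins a new event, and any other line (a stack trace, for example) is treated as a continuation of the previous one. The timestamp format here is an assumption for the example, not Intuit's actual format:

```python
import re

# A line beginning with an ISO-8601-style timestamp starts a new event.
STARTS_EVENT = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}')

def group_multiline(lines: list[str]) -> list[str]:
    """Group raw log lines into events using leading timestamps."""
    events: list[str] = []
    for line in lines:
        if STARTS_EVENT.match(line) or not events:
            events.append(line)                 # new event
        else:
            events[-1] += "\n" + line           # continuation line
    return events

raw = [
    "2023-10-10T13:55:36 ERROR unhandled exception",
    "  Traceback (most recent call last):",
    '    File "app.py", line 12, in main',
    "2023-10-10T13:55:37 INFO recovered",
]
print(len(group_multiline(raw)))  # -> 2 events, not 4 lines
```

Running a regex like this against every single line is exactly the kind of per-event CPU cost that adds up at tens of thousands of events per second, which is why moving it off the DaemonSet pod paid off.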
We also set up an AWS transit gateway between our log pipeline VPC and the log store VPC, and all the data transfer happens through it. This gave us around 60% cost savings compared to sending the data over the internet using a NAT gateway.

To demonstrate how this new pipeline performs, I set up a test Kubernetes cluster with 3 worker nodes. Each node contains the DaemonSet pod and a few other pods that generate sample log events, and the cluster also contains the metadata deployment. As you can see from our metrics dashboard, across the 3 nodes it's able to transport 90,000 events per second without any errors, which means that each node can support up to 30,000 events per second. We also see that the CPU of the pod stays stable and that it can sustain this log load over a period of time. The log generators on each node were collectively generating 30,000 events per second, and we have 3 such nodes, so the total log load was around 90,000 events per second. I ran this load test for an hour, which should give us 324 million events at the log store, and that's exactly what we see in our log dashboard. We can also see the end-to-end latency, which is the time difference between the container logging an event and the log store indexing it: the median end-to-end latency is just around 8 seconds, and the 99th percentile is less than 14 seconds. That's a massive drop from our streaming pipeline.

Let me give a summary of the improvements of this S3 based pipeline over the streaming pipeline, as we saw in the demo. In terms of throughput, we saw a massive 6x jump in the number of events that can be transported, from 5,000 events per second per node to 30,000 events per second per node; moreover, we have measured transporting over 1 gigabyte per second from a single Kubernetes cluster. Next, the median end-to-end latency drops to less than one third, from around 30 seconds in the streaming pipeline to just over 8 seconds in the S3 based pipeline, and the latency drops even further at the 99th percentile, which is cut down by more than 75%. Finally, looking at cost, we see a cost saving of more than 92% compared to the streaming pipeline. This is due to not having to keep provisioned resources running, the compression of the log data, and the use of the transit gateway; it comes to more than 50,000 US dollars saved for every petabyte transferred. And with that, we were able to create a petabyte scale logging pipeline at Intuit using Fluentd and Fluent Bit. Thank you all so much for participating in this session, and I would like to open up the forum for questions at this point. Thank you.