Hey folks, welcome to Fluentd at Scale: keys to successful logging syndication. This lightning talk is being given at FluentCon 2021. Let's get started. I'm Fred Moyer and I work at Zendesk, providers of awesome customer service software. At Zendesk, reliability is feature number zero: if we can't serve our customers, they can't serve their customers. At Zendesk, I spend a lot of time working on our telemetry firehose of metrics, logs, and traces. Today I'll be talking about how we use Fluentd to handle the ever-increasing volume of logs that our distributed microservices generate. I've been running production web applications for a while, so part of my reflections come from a developer perspective and part from an operational perspective.

So first off, let's take a look at Zendesk's architecture and how we use Fluentd. Zendesk started off over ten years ago as a monolithic Ruby on Rails app, part of which is still running today in a modernized form. Over the years we've built out additional products as standalone services alongside our original support ticketing app, and we've added other products through acquisitions. Zendesk started moving out of co-located data centers around five years ago and into AWS. Nearly all services are now delivered from what we call a pod, which stands for point of delivery. Applications are delivered via Kubernetes clusters across multiple availability zones and regions. Our usage of Kubernetes has grown significantly over the past several years, which has driven an equally significant increase in the logging volume we have to handle.

Logs are written to standard out in Docker containers via the Docker JSON logging driver. These logs are read from the file system by Fluentd, which is deployed in the pods as a Kubernetes DaemonSet. The log records are tagged with metadata, such as the Kubernetes cluster details, pod ID, and other metadata, through a number of Fluentd filter plugins. As we have a number of log consumer stakeholders, we include as much metadata as possible on each log entry so that end users can take advantage of it in any of the monitoring tools we use, which are listed on the right. Finally, a set of match directives sends each log entry to one or more output destinations. Currently those destinations are Datadog, Splunk, and Kafka, and we have work in progress to add S3 as an additional output destination. Fluentd's output plugins made the actual mechanics of sending logs to their destinations relatively easy and painless. (A sketch of this pipeline follows below.)

So now that we've seen a high-level view of how Zendesk uses Fluentd to transport logs, let's take a look at the challenges we've faced at scale. Zendesk has hundreds of services generating logs to deliver the products shown on the previous slide. Those are distributed across multiple AWS regions to serve customers in the Americas, Latin America, Asia-Pacific, and EMEA (Europe, the Middle East, and Africa). To manage these business rules, there are multiple engineers contributing to the configuration and operation of Fluentd. Like in any large organization, the business rules implemented in Fluentd configuration can get a bit complex, and they're not something one can easily fit into a mental model in one's head. Anyone who's worked on a high-scale web service will understand this. Now, to up the stakes even more, we require a very high level of reliability.
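Before digging into those challenges, here's a minimal sketch of the kind of DaemonSet pipeline just described. This is not Zendesk's actual configuration: the paths, tags, hostnames, topic names, and environment variables are placeholders, and it assumes the open source fluent-plugin-kubernetes_metadata_filter, fluent-plugin-datadog, fluent-plugin-splunk-hec, and fluent-plugin-kafka plugins.

```
# Tail the Docker JSON log files on each node
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>

# Enrich each record with cluster, namespace, and pod metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Fan each record out to every destination
<match kubernetes.**>
  @type copy
  <store>
    @type datadog
    api_key "#{ENV['DD_API_KEY']}"
  </store>
  <store>
    @type splunk_hec
    hec_host splunk.internal.example.com
    hec_token "#{ENV['SPLUNK_HEC_TOKEN']}"
  </store>
  <store>
    @type kafka2
    brokers kafka.internal.example.com:9092
    default_topic platform-logs
  </store>
</match>
```

The `copy` output is what lets one log entry reach multiple backends at once; each `<store>` keeps its own buffering and retry behavior.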
To put that reliability requirement in perspective: achieving three and a half nines at startup-level traffic might mean a few hundred lines of lost telemetry, but at much higher traffic levels, whole swaths of telemetry are at risk if a deployment fails. There are multiple teams who are stakeholders in logging telemetry: any number of feature teams, security teams, customer advocates, who use logs very heavily, and others. The business operates 24/7 across multiple time zones, so there are few opportunities for planned downtime for major changes. These challenges at scale have motivated the observability team, which I work on, to make a number of trade-offs in running Fluentd in our environment. Let's take a look at those and see why we chose them.

One of the common questions we get asked is why we haven't deployed Fluent Bit instead of Fluentd, since it is appealing from a resource and performance standpoint. Our initial Fluentd deployment was four years ago, and at that time Fluent Bit didn't have the production maturity it does now. There have been a number of performance improvements in Fluent Bit since then, which we've learned about from the Calyptia folks. Additionally, Zendesk engineers have a significant amount of experience in Ruby, which made the Fluentd implementation appealing. The number of output plugins available was also a big factor; being able to get things up and running without having to write a lot of custom code was a very big draw. Reducing complexity was one of our main goals, as it allowed us to scale our observability team economically without requiring a lot of toil for Fluentd configuration and operations.

All that being said, as our traffic levels have grown over the past four years, we have revisited this decision and may look at transitioning to Fluent Bit. The business rules we have implemented in Fluentd are one area where we would have to ensure compatibility in a migration to Fluent Bit. Because of our large footprint, a change like this would take time and the input of many stakeholders, but it's one we may pursue, since our log delivery rates continue to increase as the business grows. Our SVP of Engineering tweeted that it takes six months to make a major infrastructure change at a company as large as Zendesk, and I've found that to be true across the small number of large companies I've worked at. There are so many more details to manage when dealing with a global distributed system than when working at a startup with a small number of customers. So that's trade-off number one.

Another trade-off we have made is implementing simple business logic in our Fluentd plugins instead of complex logic. The JSON logs produced by our services contain a cornucopia of rich metadata, and the Ruby-based plugin architecture makes it possible to do a number of complex and potentially valuable transformations in Fluentd. However, that would require parsing the JSON logs, which is computationally expensive, and when you have multiple committers on a critical piece of code, readability and maintainability are paramount to reliability. Most of our filter plugin logic consists of record inspection and conditionals, followed by assigning tags to route the logs to a destination; a sketch of that pattern follows below. The destination services, such as Splunk, Datadog, and the Kafka consumers, handle the complicated transformations. This has the added benefit that those transformations are implemented by the stakeholders who understand the business needs at the endpoint.
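Here's what that "inspect, branch, retag" pattern can look like, as a minimal sketch using the open source fluent-plugin-rewrite-tag-filter. The field names, patterns, and tag prefixes are hypothetical, not our actual rules.

```
# Inspect a field on each record and reassign the tag;
# downstream <match> blocks then pick a destination per tag prefix.
# Note: records matching no rule are dropped by this plugin.
<match app.**>
  @type rewrite_tag_filter
  <rule>
    # Send authentication events to the security pipeline
    key     channel
    pattern /^auth$/
    tag     security.${tag}
  </rule>
  <rule>
    # Everything else goes to the general pipeline
    key     channel
    pattern /.*/
    tag     general.${tag}
  </rule>
</match>

# Example: security-tagged records go to Splunk
<match security.**>
  @type splunk_hec
  hec_host splunk.internal.example.com
  hec_token "#{ENV['SPLUNK_HEC_TOKEN']}"
</match>
```

The routing decision stays cheap and readable in Fluentd, and each backend's owners implement their own transformations on arrival.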
SREs managing log indexing in Datadog have different needs than security engineers using Splunk. Essentially, we've chosen to implement Fluentd as a router and to embed business-specific logic in the monitoring backends, and we feel that this has paid off. We've had to implement enough routing rules, just as a result of the size of our systems, that trying to add complicated business logic in Fluentd as well would have significantly increased deployment risk and hurt reliability. It's a bit of the Unix philosophy: use small tools that have a dedicated purpose. As you scale out, the usage of those tools will accrue complexity of its own over time as the systems they support grow. Complexity in systems at scale is like an iceberg: the base business requirements may fit in one's head, but all of the small details that come with supporting dozens of teams and products aren't visible above the surface.

Lastly, let's talk about performance tuning. When you have Fluentd configured as mostly a high-volume router without a lot of computational overhead, it spends a lot of its time in IO operations sending logs to each destination. There are a number of settings in Fluentd that let you tune its flush behavior, and you may want to adjust them based on how your own environment behaves. Two changes we found helpful were decreasing the flush interval, to keep buffers from growing too large, and increasing the number of flush threads. We found that as the buffers grew, Fluentd's memory usage would grow with them, so more frequent flushes kept memory usage under control. Now, these settings are specific to each output plugin: buffer flushes to Datadog behave differently than buffer flushes to Kafka, and they were different again for Splunk. Tuning these settings for best performance is a bit of an art form, but we were lucky to have some advice from the folks at Calyptia on this. The best values for these settings will change over time as your traffic levels change. (A sketch of these settings follows below.)

I hope this information is useful. Feel free to reach out to me on Twitter and talk about how you use Fluentd. Oh, and also: Zendesk is hiring SREs. Look us up if you'd like to work with a group of awesome folks. And that's it. Stay safe out there, folks, and remember to wear your mask.
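Here is that flush-tuning sketch, appended for reference. The values are purely illustrative, not Zendesk's production settings, and the match pattern, broker host, and topic are hypothetical; in a real deployment each output (Datadog, Splunk, Kafka) carries its own <buffer> section tuned independently.

```
<match app.**>
  @type kafka2
  brokers kafka.internal.example.com:9092
  default_topic platform-logs
  <buffer>
    @type memory
    # Flush more often so chunks stay small and memory stays bounded
    flush_interval 5s
    # More flush threads parallelize the IO-bound sends
    flush_thread_count 8
    # Cap chunk and total buffer sizes as a safety net
    chunk_limit_size 8m
    total_limit_size 512m
    overflow_action block
  </buffer>
</match>
```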