Hello everyone, I'm Dhruv. I'm a software engineer on the Marketplace team at DigitalOcean.

Hello, my name is Boyan and I'm a senior software engineer on the App Platform team. Today we're going to be talking about integrating Fluent Bit into a PaaS. When the context is a platform as a service, the challenges are different, and hopefully we'll be able to go through some of them today. As for the agenda, we are going to first provide some context on what our platform is and what we are working with. Then we will move on to internal logging, external logging, the issues we faced, and the scope for future improvement.

DigitalOcean App Platform is a fully managed platform as a service that lets customers build, deploy, manage, and scale all kinds of different applications. These applications can be web apps, static sites, APIs, or background workers, in a number of different languages such as PHP, Python, Ruby, JavaScript, and so on. Users create applications via our cloud dashboard, command line tool, or API. Regardless of the mechanism used, the application ends up represented as a manifest specifying the different components in the application. Each application may have one or more components; for example, an app may have an HTTP web API and a background worker. We take these components, stitch them together, build them, and deploy them on our platform.

App Platform is built on top of existing DigitalOcean products. Primarily we are built on the managed Kubernetes, container registry, and Spaces offerings, but underneath we leverage almost all of our infrastructure products. App Platform consists of numerous Kubernetes clusters in various DigitalOcean regions and data centers. We treat our clusters as cattle, not pets, and can reconfigure, destroy, and deploy clusters as needed. At a high level, there is a cluster reconciler component in our management control plane that's responsible for managing the clusters in the data plane. Each cluster has several node pools: some are dedicated to running customer application workloads, some are dedicated to hosting application builds, and some host our internal in-cluster control services such as Prometheus, Istio, and Cilium. All of this reconciliation is done programmatically, using Go libraries and a combination of templated manifests and Go code. Similarly, there is a control component responsible for managing the builds and deployments of the actual customer applications. These are orchestrated completely dynamically, as needed, via Go code as well.

Within each cluster, we need logging visibility for our in-cluster control components. These logs need to be ingested securely, via a WebSocket, into a central DigitalOcean logging system in our main management control plane. We use FluentBit, deployed as a DaemonSet within the cluster, to collect, process, and send logs into the central logging system. We use a custom WebSocket output plugin to send the logs. Here is an example of what our custom WebSocket plugin configuration looks like: we have options to specify the endpoint and any additional headers, for example an authorization bearer token.
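As a rough sketch of the idea (our WebSocket output plugin is a custom internal Go plugin, so the plugin name and option keys below are placeholders rather than the real options), the output section looks something along these lines:

    [OUTPUT]
        # Hypothetical name and options for our internal WebSocket output plugin
        Name     websocket_custom
        Match    kube.*
        # Endpoint of the central logging system in the management control plane (placeholder URL)
        Uri      wss://logging.internal.example.com/ingest
        # Additional headers, e.g. an authorization bearer token
        Header   Authorization Bearer ${LOGGING_TOKEN}

The actual plugin name, endpoint, and header handling differ in our deployment; the sketch just shows where the endpoint and the extra headers would be specified.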
While we just covered internal logging, a different use case is where we want to send actual customer application logs to other logging systems. We want to support sending logs using the syslog or HTTP protocols. Applications run in a secure, isolated sandbox environment using the gVisor runtime, and the FluentBit pods that perform the actual log collection and processing also need to run in that sandbox environment.

We need dynamic and fine-grained control over logging configuration as applications change, are deployed, or are relocated around the cluster. For security and performance purposes, we need to be able to throttle and dynamically disable logging for applications. We also need detailed observability and insight into the health and metrics of each application's logging behavior. For example, we need to track how much logging data each app is outputting, we need to know whether logging is working and healthy for a specific application, and we need to know the overall health of the logging system itself. All of this needs to happen dynamically and programmatically as part of the normal application reconcile flow covered earlier.

For application logging, we chose to use the FluentBit operator to orchestrate all the necessary components and configuration. We deploy one instance of the FluentBit operator in each of the clusters. The operator manages the FluentBit DaemonSet, which controls the FluentBit pods on all the relevant customer workload nodes. As applications are deployed by the apps reconciler discussed earlier, Input, Filter, and Output custom resources are created for each application and logging destination endpoint. These custom resources get reconciled by the FluentBit operator into a Secret containing the FluentBit configuration file. This gets mounted into the FluentBit pods, which then perform the actual log processing. One initial problem we had was that the FluentBit operator did not expose a Go client. We had to add Go client packages to the operator in order to use the custom resources programmatically.

This is an overview of a single application's logging pipeline. We collect application logs using the tail input. Since we are running on containerd, we use the CRI parser to parse the log records. Next, we throttle the log processing. We also use the modify filter to set some additional application-specific metadata on the log records. Finally, we route the records to the appropriate output based on the application's logging configuration.

Here is an example of an input configuration. Applications live in their own namespace, which is determined by a unique identifier. We have to be very specific about the input path we specify: we only collect the currently running containers in the app's namespace, we ensure we do not collect any Kubernetes system logs, and we tag the app very specifically so that we can control the routing. By default, the tail input uses inotify system calls to collect data from log files as they are written to. Because both the application pods and the FluentBit pods run in separate gVisor environments, the file system call restrictions enforced by gVisor make this type of log collection non-functional. Fortunately, FluentBit 1.8 came with a tail input plugin option to use a stat watcher to collect log data, which works around the sandbox issues.

The modify filter is used to insert additional metadata into the log records, including the component name and the app name. For outputs, we match exactly the component within the app whose logs need to get sent to the log destination. For the syslog output, we enrich the transmitted record with the metadata values inserted by the modify filter. These can then be parsed at the destination to associate logs with the actual app and components for nicer aggregation.
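To make the pipeline above more concrete, here is a minimal sketch of roughly what the rendered configuration for a single application component could look like. In reality the FluentBit operator generates this from the Input, Filter, and Output custom resources; the tag scheme, paths, identifiers, throttle values, and destination below are placeholders, and the sketch assumes a parser named "cri" is defined in the parsers file.

    [INPUT]
        Name              tail
        # Only containers in the app's own namespace; never Kubernetes system logs (placeholder path)
        Path              /var/log/containers/*_<app-namespace>_*.log
        # Parse containerd (CRI) log lines; assumes a "cri" parser exists in parsers.conf
        Parser            cri
        # Tag the app very specifically so routing can be controlled per component
        Tag               app.<app-id>.web
        # Use the stat watcher instead of inotify, which is restricted under gVisor (FluentBit 1.8+)
        Inotify_Watcher   false

    [FILTER]
        Name              throttle
        Match             app.<app-id>.*
        # Throttle log processing for the application (example values)
        Rate              1000
        Window            5
        Interval          1s

    [FILTER]
        Name              modify
        Match             app.<app-id>.*
        # Insert application-specific metadata into each record
        Add               app_name my-app
        Add               component_name web

    [OUTPUT]
        Name              syslog
        Match             app.<app-id>.web
        # Short generated identifier; the alias is how this plugin shows up in the metrics API
        Alias             out-<short-id>
        # Customer-configured log destination (placeholder)
        Host              logs.example.com
        Port              514
        Mode              tcp
        Syslog_Format     rfc5424
        # Map the enriched metadata into syslog fields so the destination can
        # associate records with the app and its components
        Syslog_Appname_Key   app_name
        Syslog_Msgid_Key     component_name
        Syslog_Message_Key   log

A destination using the HTTP protocol would swap the syslog output for an http output block while keeping the same tagging and metadata scheme.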
For monitoring all of this, we have some specific needs for detailed metric collection in the log processing pipeline. We need to be able to tell, on a per-component level, how many log records and how much bandwidth have been processed. We need to be able to collect all of this data in the Prometheus exposition format, and we need to be able to connect the metric values to specific application and customer identifiers. We would like to have a time series with metadata such as the app name, app ID, and customer ID, like the one seen here. FluentBit provides an API for exposing plugin metrics; however, the API is limited in the data it exposes. It only provides the name of the plugin. Furthermore, even when using an alias to uniquely identify plugins, the name exposed in the metrics API is truncated to 32 characters, so even UUIDs cannot fit nicely in the alias name. Ideally, we would like a solution similar to this, where we can specify extra monitoring labels for the plugin, which would then be exposed in the monitoring API by FluentBit. We've created a GitHub issue that hopefully can be addressed at some point in the near future. As a workaround, we dynamically generate Prometheus recording rules for enhanced metric collection. This enables us to add the desired metadata to the ingested metrics as they are scraped from the FluentBit API. The name is a short unique identifier that we generate for each plugin so that it fits in the alias field, and the recording rules connect its metrics to the app and customer IDs.

This is a summary of some of the issues we encountered when working with FluentBit. The tail input plugin did not work under gVisor, and we had to use the option to disable the inotify watcher, which only became available in 1.8. Dynamic configuration of FluentBit within Kubernetes is still cumbersome and requires some workarounds. The FluentBit operator uses a custom process called fluentbit-watcher that watches for changes in the configuration file and restarts the actual FluentBit process when changes happen. The fluentbit-watcher also uses inotify system calls, which do not work under gVisor in this context, so we had to add support for polling to the fluentbit-watcher component within the operator to enable dynamic configuration under gVisor. As mentioned earlier, the FluentBit operator did not have a Go client, and we had to add that as well. Finally, with some investigation, we found an issue where the FluentBit operator was using a lot of CPU rendering the configuration file into the Secret. We made a fix for this and submitted a PR upstream.

For some future work, we want to explore the native FluentBit WebSocket output plugin introduced in 1.7 and see if it can fit our needs like our custom one does. gVisor incurs a performance cost when working with file system I/O, and we want to investigate this further with respect to tail input plugin performance. Additionally, we want to optimize the FluentBit operator. The FluentBit operator stores a single config file for all the FluentBit DaemonSet pods in a single Kubernetes Secret. Data in Kubernetes Secrets has a one-mebibyte size limit, which can be problematic if we store large amounts of configuration data. One relatively easy improvement might be to use compression such as gzip to shrink the configuration file. Additionally, with some more work, we could split the config files as they grow and rejoin them when they are read by the FluentBit pods. Finally, we would like to solve the metrics metadata labeling issue that we mentioned earlier for our monitoring functionality. Thank you, and we can answer any questions.