Thank you. Well, my name is Eduardo. Today we are in this session called About Fluentd and Fluent Bit. Originally, in the description of the session, we talked about our new journey into metrics. But before jumping into metrics, we thought, OK, maybe we need a deeper explanation of why this is a good deal and how things work with logs, and how we handle them, before jumping into the metrics space. So, is anybody here new to Fluentd and Fluent Bit? If so, please raise your hand. OK, perfect. No problem.

As I said, my name is Eduardo Silva. That's my handle on GitHub, Twitter, and most other places. I'm the creator and maintainer of this project called Fluent Bit and a maintainer of the Fluentd ecosystem, and the founder of this company called Calyptia, which we call the first-mile observability company. It was started on top of the open source ecosystem that we created, which has been in the CNCF for a couple of years already and came from the Treasure Data team, where we created all this technology.

So, the Fluentd ecosystem. We have to start by saying that Fluentd is a graduated project from the CNCF. That means vendor neutral, wide adoption in the market, and companies contributing to the project on a daily basis. Fluentd started 10 years ago. After that, around 2016, we wanted to have a more lightweight solution than Fluentd. Not because Fluentd was bad, but because we were targeting embedded Linux at that time. But at the same time, around six years ago, Kubernetes was taking off, the container space was growing, and people started trying out Fluent Bit in containers because it was a more lightweight solution than Fluentd. Fluentd is written in Ruby; Fluent Bit is written in C. Fluentd has around 1,000 plugins available; Fluent Bit has just around 100. Both projects are part of the same ecosystem, under the same license, and both are considered graduated in the CNCF. And we are now at a point where, for some of the workloads people have in their environments, Fluentd is not capable of solving all the challenges, and that's why people are migrating from Fluentd to Fluent Bit nowadays.

So there's one big observability challenge, right? It doesn't matter if you are doing traces, logs, or metrics: everybody wants more performance, but at zero cost. Get it for free. And such a thing does not exist; there's nothing for free. This is quite an interesting challenge, because in your environment you always get more data and more workloads. Things grow on that side. But the infrastructure, the way you handle that information, is hard to scale, and that's why we focus a lot on performance.

So let's talk a bit about logging and how it works. In general, if you are just learning about this, a log message is an event emitted by the application. The purpose of a log message is to say: I'm working fine, I'm doing this, I got this request, I got this response, and to report any kind of problems or exceptions. In a container environment, that information usually comes as a raw text message, something like an Apache-style log line, quite old-fashioned. Then it hits disk as a file where you have multiple records per file. I'm not saying this is the only way to do logging: some applications ship logs over TCP, or you have firewalls that ship log messages over UDP using syslog as the payload format.
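To make that concrete, here is a minimal sketch of how an agent such as Fluent Bit can pick up both of those cases, a log file on disk and syslog messages over UDP. The paths, port, and tags are placeholder values for illustration, not something from the talk:

```
[INPUT]
    name   tail
    path   /var/log/containers/*.log   # placeholder path, adjust to your environment
    tag    app.logs

[INPUT]
    name   syslog                      # e.g. firewalls shipping syslog over UDP
    mode   udp
    listen 0.0.0.0
    port   5140
    parser syslog-rfc3164
    tag    firewall.logs
```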
And when you do this and store all this information, it's not just about having the information there, because your goal, as I always say, is to perform data analysis. It's not about logging, it's not about the tool; you want to do data analysis. You want to know how your applications are behaving, any kind of exception you have, or anything that can give you some insight into what's going on with your applications. But in order to do that, you need to read this information, process it, and send it out to a central place where you can run your own queries and analysis. You can send it to Elasticsearch, Stackdriver, or anything like that.

Now, this kind of engine that collects the data and sends it out has a couple of roles and tasks it has to accomplish. It collects the data and performs data transformation, because you want to parse the information and unify it into some format, which is quite a challenge. If you have three developers in your environment working on different projects, I'm 100% sure all of them will use a different structure for their logs, so when you want to do analysis on top of that data, you have to figure out how to approach each of them, and it's really hard. Also, there are many applications that are very noisy and generate messages you don't want. What is the problem? The problem is that if you're using a backend, a common example is Splunk, they're going to charge you by the amount of data you ingest. If you're sending a bunch of data you don't need, you're still going to pay for that data. This is where people start thinking about cost reduction: we don't want all the data, we need something that can filter out information that is not required. So this little engine needs to be smart enough to provide all of these features.

It also has to do buffering, because if I'm going to send this information to a backend database, to the right side of this pipeline, things can go wrong. I don't know if you've faced it, but we just lost Wi-Fi for two minutes here at the conference. The same thing happens in real environments: network outages, power outages, disk failures. If we don't save the data and keep a state of what we are processing so we can recover, we are not able to resume, and we are in a problem. That's why buffering is really important. And we should have the ability to send the data to different backends. You have to be able to avoid the trap of vendor lock-in, where you get married to a vendor. Sometimes that's fine, with a database or a provider or any kind of service, but you should have the option to say, OK, I'm going to send this data to this backend, and at some point to another one, because of cost reduction or any other strategy. As I said, the whole point of data centralization is data analysis, and you can choose whatever you want. That is kind of the key idea behind these tools in the Fluent ecosystem. And pretty much, from a global perspective, this is input and output: you have an engine that processes all the input, processes the data, filters, does buffering, and sends the data out to your preferred destination.
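A minimal sketch of that whole pipeline as a Fluent Bit configuration, assuming a container log file as input, a grep filter to drop noisy records, and two destinations to illustrate routing; the hosts, paths, and patterns are hypothetical:

```
[SERVICE]
    flush 1

[INPUT]
    name tail
    path /var/log/containers/*.log          # placeholder path
    tag  kube.*

[FILTER]
    name    grep
    match   kube.*
    exclude log healthcheck                 # drop noisy records you do not want to pay to ingest

[OUTPUT]
    name  es                                # one destination...
    match kube.*
    host  elasticsearch.example.local       # placeholder host
    port  9200

[OUTPUT]
    name  stdout                            # ...and a second one, to show routing is not tied to a single vendor
    match kube.*
```

The match rules against the tag are what decide which output each record is routed to.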
Now, understanding this concept, which sounds very simple, is important for understanding how to tackle the problems you get later in production: you can get performance problems on the input side, performance problems when filtering and processing the data, and when sending the data out. On the input side, you always deal with I/O, disk and network. The input is how we extract or receive this data. Inside the agent, it's about parsing data, filtering, and serializing data, because, yes, you pretty much consume and generate JSON. But when we talk about an agent like Fluentd or Fluent Bit, we have to serialize this data into a binary format so we can do really smart optimizations on how we handle it internally. Because at the end of the day, you want to keep CPU usage low and memory usage low, and if we don't optimize in those terms, we end up in a really tough scenario. Then there is buffering, routing, scheduling, and retries: if I'm going to send the data and something happens and I could not send it, I have to have some kind of logic that allows me to send it out later. And on the output side, there are other kinds of challenges. Network setup: some of the backends use TLS, some don't. Every backend expects a different payload for the data. Splunk expects a JSON payload that is totally different from the one expected by Amazon S3. So we have to take the binary representation that we have and convert it back to the expected payload. It's about unifying into a format internally and then encoding the data for the right backend. And this is what Fluent Bit does: it takes care of all this complexity of I/O, network, collection, processing, and filtering, and makes sure your data gets to your backends reliably.

So, as I said, Fluent Bit is part of the Fluent ecosystem. It's under an Apache license, and now we have more than 200 contributors in total. The good thing is that Fluent Bit is being used widely nowadays. AWS, Google Cloud, Microsoft Azure: all of them are deploying Fluent Bit at a really heavy scale. And just from the public stats of our Docker Hub repositories, which don't account for how they deploy Fluent Bit from their own registries, we have an average of around two million pulls, or deployments, per day. Which is an insane amount of deployments, right? It's Kubernetes clusters going up and down, and the majority of them are using Fluent Bit. We also have a couple of companies using it for different purposes. So for people who run in the cloud, if you spin up, for example, a Kubernetes cluster in Google right now, you will get Fluent Bit on it. This is quite a challenge from a maintenance perspective: this project has to be so stable that we cannot mess anything up. If we write something bad, we mess up the whole infrastructure.

And as I said, Fluent Bit was designed with performance in mind. It was not that we thought, we're going to write something for the cloud native ecosystem that will be around in five years. It was not like that; we just optimized it for embedded Linux. We made it very efficient in how it manages everything. And it was good timing: the container space came along and we had all the pieces in place. Buffer management is also really important: how do we optimize when we group the data?
How do we store this data between memory and the file system? I'm going to explain that in a bit; it's really important. So let's talk about more internal details. How do we serialize the data? I got the question, I think it was yesterday: why JSON? Well, the answer is we don't use JSON. JSON is a human-readable structured format, but for a computer it's just an array of bytes. When we talk about a binary format, we mean that we have some kind of structure: not quite an index, but a way to say where the data is, what type of data we have, and what the length of that data is. You can see in the table the difference between using JSON and MessagePack from a payload-size perspective. Also, if I have an array, it's easy for me to jump between positions, between one key and the next and its value. Instead, if I do that in JSON, I have to analyze everything and go byte by byte. Yes, there are quite a few optimizations around, but you cannot compare the performance of plain JSON with MessagePack. Now, this is not saying JSON is bad; we're just not going to use JSON internally if we can use MessagePack. We use JSON only to communicate with backends that speak that format.

When we get the records, for example, if you think of a file with multiple lines, every line is considered a record, or an event. These events get grouped into buffers, which are called chunks. A chunk can have multiple events associated under the same tag. Tags are useful to say: all this data is coming from this source. Because in the configuration, at some point, you will say: I want to send all the data that matches this criteria to this backend, and all the data that is tagged with other criteria to a different backend. That's how people split data between Elasticsearch and Splunk with different routing rules.

Now, when talking about buffering, it's really important to understand that we have limited resources. There's a bad practice in the industry in general where nobody thinks about memory and nobody thinks about CPU usage. But for a service that has to run 24/7, that is not short-lived and is always running and processing data, we have to keep optimizing how many resources we consume. So we have a hybrid approach. By default, all the data goes to memory: all the chunks go to memory. Every time we're getting records, we group them into chunks, and the chunks keep being filled in memory. But we can also say: OK, we have this amount of memory, but let's use only up to this amount of memory for chunks. Otherwise, your service can explode, and the kernel is going to kill it. Now, in-memory buffering is fine. It's pretty good, it's the fastest mechanism, but it's not persistent. If you restart the agent and you have data that has not been flushed to the destination, that data is going to get lost. Also, at some point, if you rely only on memory, the kernel could say: oh, you're using too much memory, we're running out of it, it's time to sacrifice someone. Guess who will be the first target? The agent that is consuming all this information. And this is why we came up with a hybrid mechanism. None of these techniques are new; they're quite common in database design, so there's no black magic here. It's just that having file-system buffering helps a lot.
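As a rough illustration, and assuming the classic configuration format, enabling this file-system buffering in Fluent Bit looks roughly like this; the path and limits are placeholder values:

```
[SERVICE]
    # file-system buffering settings (placeholder values)
    storage.path              /var/log/flb-storage/
    storage.max_chunks_up     128      # how many chunks may be loaded "up" in memory at once
    storage.backlog.mem_limit 5M       # memory allowed for replaying chunks left on disk after a restart

[INPUT]
    name         tail
    path         /var/log/containers/*.log
    storage.type filesystem            # chunks from this input can spill to disk instead of living only in memory
```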
So it's always good to have this kind of hybrid mechanism where, if the data that is coming in cannot be put in memory, it goes to the file system. Usually, your file system will be many times bigger than the capacity you have in memory. You might say, but the file system is slow. Yes, it's slower than memory. But you are not processing all the data at once; you're always processing data in fractions, a couple of chunks at a time, not all of them. Otherwise, you would end up in a really bad spot with CPU spikes, heavy I/O, and so on. So when data comes in, we have the same mechanism to send it to the file system. And the design we ended up with is this hybrid mechanism, where the data that is being processed and is about to go out is in memory, but all the other chunks are in the file system. We have this concept of chunks being up or down. A chunk that is down exists only in the file system. A chunk that is up means it's loaded in memory. But they're not separate: what is in memory is also a reflection of what we have in the file system, and we use memory-mapped files. With this approach, we reduce the number of read/write calls, so we reduce the number of syscalls involved in the whole operation, and it's a very scalable design. So from a buffering perspective, it's very reliable.

Now, as the description said, we wanted to talk about metrics today. Fluentd and Fluent Bit were always about logs, so what's the story behind metrics? For years, even at KubeCon, this conference, users approached us and said: hey, why do I need to have one agent for metrics and one agent for logs? Why can't I have a more unified experience? And at some point this year we said: hey, we already have a bit of metrics experience, so why not extend our scope as a project to handle metrics? You might think: hey, do you want to replace Prometheus? No. And now I'm going to explain our journey into the metrics work; it has been quite interesting. I found that doing metrics is more fun than logs, or maybe that's because we have been doing logs for so long. The thing is, we are not new to metrics. Fluent Bit has done metric collection for many years, since the beginning, because it was built for embedded Linux. We have input plugins to gather CPU metrics, disk I/O, network, thermal and temperature data, and Docker metrics, but we always handled that as structured logs, not with a metrics payload or a fixed data model. In logs you can have anything, a bunch of key-value pairs where nobody cares about the structure, what comes first or second. But when you have a metrics payload, you have something very well defined: a metric name, a value, maybe a description, a namespace, and things like that.

So this is a comparison between logs and metrics. Logs can be unstructured or structured messages, but metrics have a fixed data model. In logs you need to do filtering; in metrics you may need some kind of aggregation, which is quite optional. In logs you cannot predict the size of the data; it's really hard to reduce the data in logs because you don't know what is coming in. In metrics, since you are in control of that, you can pretty much manage or predict how many metrics you need to store, or you can aggregate them and reduce the samples.
In terms of types, logs can have maps, booleans, integers, floats, and so on, while metrics are pretty much counters, gauges, and histograms, so it's a simpler implementation, I would say. So we started doing our own research. With this kind of project, in Fluentd and Fluent Bit, we always try to be vendor agnostic. That means we always try to integrate with others, not just go ahead and replace every single standard; that's not the way to go. So when we said we're going to integrate with metrics, the first question was how. OK, let's see what the industry is running, and the industry today is running Prometheus. So the best way to approach this challenge of getting into the metrics space was to integrate with Prometheus and the OpenMetrics spec. Prometheus is quite stable: there is a metrics spec, there are collectors and exporters, it's network friendly, and we can transfer metrics from one endpoint to another without any problem. So we decided, OK, let's get started with Prometheus; this will be our first journey into metrics.

And now, how do we get started with this? We didn't have any handler for metrics. If you're coming from Prometheus, you know that Prometheus has this kind of implementation with SDKs. So we created our own library called CMetrics. CMetrics allows us to do pretty much what you can do right now with the Prometheus Go client, which is manage all these kinds of metrics: create counters, create gauges, create histograms (which is a work in progress). You can do all the atomic operations, create labels, query by labels, or change all of them. Then this got integrated into Fluent Bit. One of the goals of CMetrics is that it's not just metrics handling. You can manage metrics, but then you have another problem: how do you send these metrics to Prometheus? Or how do you send these metrics if, say, I'm using InfluxDB; I don't care about Prometheus, I want to send to InfluxDB. So CMetrics separates the concept of content versus transport. Content is what the payload looks like; transport is how I deliver that payload to a different place. In CMetrics we implemented a way to transform the content from a CMetrics context to InfluxDB, to Prometheus Remote Write, and to the Prometheus exposition format. We don't do transport, just content. And for each metric we support namespace, subsystem, name, description, and labels.

This is a simple example of how to create a metric using CMetrics. I think if you compare this code with what you have in the Go client SDK, it's pretty much the same number of lines. No joking, it's the same amount of lines. Actually, we pretty much ported the Prometheus Go client to C and created CMetrics. You always have to take the best practices; there's no need to reinvent the wheel, and the Prometheus team is just great, it has really good specs for everything. Also, converting this metrics context can be just one function call: with just one function you can create a payload for InfluxDB, and we have an example below. We can say, OK, I'm going to expose this CMetrics context over HTTP so Prometheus can scrape these metrics; we created an extension to generate that same payload. And we also support Prometheus Remote Write. Remote Write is a protocol that allows you to ship metrics from one endpoint to another.
By design it was not created for that purpose, but the industry, the vendors that listen for metrics payloads, started creating these kinds of Remote Write endpoints. So if you're using Prometheus, yes, you let Prometheus scrape all the data, but you also have this Prometheus Remote Write ability to send the data out.

Fluent Bit plus CMetrics has been quite interesting, because the first thing we did was say: OK, we want to get into metrics, so we created CMetrics. Then it was about creating the use case. Most users wanted to stop running a separate Node Exporter, which is one of the Prometheus programs for gathering metrics from Linux boxes. So they asked us to re-implement Node Exporter as an input plugin for Fluent Bit, meaning that the current version of Fluent Bit can gather the same metrics as Node Exporter, which allows users to stop using Node Exporter and just use Fluent Bit for the same purpose. Of course, we have not implemented all the collectors that Node Exporter has; that is based on demand, but we support CPU, CPU frequency, disk stats, VM stats, memory, and load average. That is the collection point. Now, if we think about the output, how do we expose those metrics? We implemented our own Prometheus exporter plugin. All of this is written in C and uses very little CPU and memory, and it lets your backends just come to Fluent Bit and scrape these metrics for you. Actually, if today you're using Node Exporter and you have all your Grafana dashboards and you just switch to Fluent Bit, there's no breaking change; you will get all the metrics and all the stats right away. Unless, and I'll get back to that, you're using some Linux collector that is not implemented yet, some fancy feature that most people are not using. But if there's something missing, just let us know and we will implement it.

From a configuration perspective, it's pretty simple. You just enable an input plugin. This is a typical Fluent Bit config file where you say: enable the node exporter metrics plugin, scrape the metrics every two seconds, and then just have an output plugin called Prometheus exporter that opens an HTTP endpoint and exposes those metrics. Internally, all of this is handled by plugins: we have a bunch of input plugins for metrics and different ways to expose these metrics or send them out to a different place.

OK, I think we have eight minutes, right, Randy? OK, we can do a quick demo. Sorry, my co-founder is writing to me. So, for example, I'm going to show you this config example here. What I'm doing is just starting the service, flushing data every one second. In this part, we just enable the node exporter input plugin in Fluent Bit. We send the information to two output plugins: one of them is the Prometheus exporter, which opens an HTTP port, and the other is the standard output. Something like this is pretty simple. Then this command is going to start shipping the metrics to the standard output, which is useless for production; it's just for demo purposes. But the good thing is that on the other side we can query those metrics using curl, and you will see that we get the usual Prometheus-style format. So this is the same Fluent Bit shipping the metrics to standard output in one format, and exposing them through the Prometheus exporter in the other. Now, what possibilities does this open? I think it's really interesting: a lot, a lot, a lot.
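Going back to that demo for a second, a rough reconstruction of the configuration just described might look like the sketch below; the two-second scrape interval and port 2021 come from the talk, the rest are placeholders:

```
[SERVICE]
    flush 1

[INPUT]
    name            node_exporter_metrics
    tag             node_metrics
    scrape_interval 2          # scrape host metrics every two seconds

[OUTPUT]
    name  prometheus_exporter  # open an HTTP endpoint so Prometheus can scrape
    match node_metrics
    host  0.0.0.0
    port  2021

[OUTPUT]
    name  stdout               # also print the metrics, just for demo purposes
    match node_metrics
```

On the other side, the scrape in the demo is simply: curl -s http://127.0.0.1:2021/metrics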
Now, I also have here a plugin that Phil Willen, one of the contributors, just sent out. Do you use NGINX, the web server? OK, you're likely using it, right? What this plugin does is expose metrics from NGINX. If you're using NGINX, what you usually do is deploy a program called the NGINX Prometheus Exporter, which is a Go program made by the NGINX company that exposes NGINX metrics to the Prometheus world. So we did the same: we implemented the same functionality as an input plugin for Fluent Bit, and it was just merged today. Here we are just printing those metrics to the standard output. But if I'm not wrong, this is the endpoint. I don't remember what port this is supposed to be on; let me check here in the config. This is pausing. OK, it should be 80. Yes, we're going to connect to 80 and, oh, 2021. If I'm not wrong, it should be something like this. There you go. So on the left side we are collecting from an NGINX service, doing the same thing, printing metrics to the standard output, and on the right side we're simulating a Prometheus that is scraping those metrics using curl.

Now, there's a bunch of things we can continue doing in the metrics space, and we got more ambitious: hey, let's implement more metrics things, more collectors, because there's a huge benefit and we get great support from the Prometheus community. So, what is ready now? Node exporter metrics, and Fluent Bit metrics, which are internal metrics. On the output side, we can ship metrics to InfluxDB, the Prometheus exporter, and Remote Write. The ongoing work is that we are also replicating node exporter metrics for Windows, which is another exporter from the Prometheus ecosystem whose functionality we're replicating in Fluent Bit. NGINX metrics is ready. There's also the ability to convert logs to metrics, because there are many applications that ship logs as JSON, but you want to get them into Prometheus, so how do you handle that conversion? This filter called logs to metrics will be a good place for that. On the output side, we're going to implement Splunk metrics, Datadog metrics, and CloudWatch. Also, we are working on a kind of metrics processor, because as I said at the beginning, you don't always want to ship all your metrics. What you want is to process them, maybe take the average of a value every minute and send that average, but not send every sample every second. Because at the end of the day, with hundreds of nodes, your storage starts to grow, and that is more expensive. And the future? Right now we standardize on OpenMetrics and Prometheus because that's what the industry is running now. Once OpenTelemetry metrics gets to GA, which I think should be shortly, we're going to start implementing the OTLP protocol for those needs.

OK, I think we have some minutes for Q&A. Thank you. I see that the CMetrics library is built into Fluent Bit now. Do you plan to break that out for separate users, or is it always intended to be used directly inside of Fluent Bit? CMetrics is a standalone library. It's hosted in the Calyptia open source organization; it's a library that we wrote as a company, but it's distributed as a shared library, so you can use it. I can share the link with you after this talk; it's calyptia/cmetrics. You can get it, and we have the unit tests there, so you can see some examples of how this works.
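For reference, and roughly following the examples shipped in the CMetrics repository, creating a counter and encoding the whole context looks something like the sketch below. Exact header names and function signatures may differ between versions, so treat this as an approximation rather than the authoritative API:

```c
#include <stdio.h>
#include <cmetrics/cmetrics.h>
#include <cmetrics/cmt_counter.h>
#include <cmetrics/cmt_encode_prometheus.h>

int main()
{
    uint64_t ts;
    cmt_sds_t text;
    struct cmt *cmt;
    struct cmt_counter *c;

    /* a context holds all the metrics (this is the "content" side) */
    cmt = cmt_create();

    /* counter with namespace, subsystem, name, description and two labels */
    c = cmt_counter_create(cmt, "kubernetes", "network", "load",
                           "Network load", 2, (char *[]) {"hostname", "app"});

    ts = cmt_time_now();

    /* increment without and with label values */
    cmt_counter_inc(c, ts, 0, NULL);
    cmt_counter_inc(c, ts, 2, (char *[]) {"localhost", "cmetrics"});

    /* encode the whole context in Prometheus text format; other encoders
     * (InfluxDB, Prometheus Remote Write) work the same way */
    text = cmt_encode_prometheus_create(cmt, CMT_TRUE);
    printf("%s", text);
    cmt_encode_prometheus_destroy(text);

    cmt_destroy(cmt);
    return 0;
}
```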
So yes, it's kind of agnostic; we tried to make it agnostic.

So, Eduardo, we've got a couple of questions online. One of them is: can you differentiate Fluentd and Fluent Bit a little bit more? Yeah, Fluentd and Fluent Bit: Fluentd is 10 years old, Fluent Bit is six years old; Fluentd has 1,000 plugins, Fluent Bit around 100. Now, in terms of performance, Fluent Bit is about 10x more performant in our latest benchmarks. So if you have a good reason to use Fluentd, use it. But if you want more performance and to optimize your resource usage, just switch to Fluent Bit. They're the same family of projects, so it's fine to stick with either of the two, because both are vendor neutral. And metrics? Metrics meaning? Right, Fluentd versus Fluent Bit. OK, I didn't mention this, but CMetrics, we used it to power Fluent Bit for metrics, and shortly after, other contributors took CMetrics and built all the metrics handling in Fluentd too. So by creating just one tiny library we empowered both projects, and you can forward metrics from Fluentd too. Fantastic. Great.

Any other questions locally? Yeah, okay. What in your opinion would cause somebody to stay on Fluentd or go to Fluent Bit, or vice versa? If I answer that based on my personal experience with customers, everybody cares about performance now, and Fluentd for some intensive cases is not good enough, so I would say Fluent Bit. Thank you. Just to do another one online here real quick: can you speak to mTLS implementations in the two? We have TLS, and I think mTLS is not something we can handle today, but it's on the roadmap.

So, you talked about converting logs to metrics. Does that support fingerprinting namespaces for Kubernetes, so that you can say these logs are coming from this namespace, so you can actually figure out who's the noisy neighbor in your cluster? Yeah, so when we talk about Kubernetes, we always have to consider the context, and what gives us the context is the API server. So we have those filters to enrich the records with that information. And what you care about are labels, because you want to query by labels. So the goal, and this is not fully designed yet, is to take the record, which is already enriched with Kubernetes metadata, and translate those labels into CMetrics labels, so you can get kind of the same experience. Awesome. Okay, well, I think we're about time. Eduardo, awesome talk. Thank you. Thank you. Thank you.