Welcome, everybody, to the session about logging and best practices. We're really happy to be here at KubeCon, I don't know how many times we have been here, but really happy to share what's new and how you can deal with scalability problems. Today we're going to try to do a more hands-on session, more than just going through slides. I know we're going to get a lot of Q&A. We'll try to fit all the Q&A during the session, but if we don't have enough time, we can just go out of the room. We can be there for an hour talking about best practices, issues, or anything that you need to solve. So feel free to raise your hand at the end of the talk, or just talk to us at the end. And as I said, we got access to the early version of the Fluent Bit book for Kubernetes, so just scan the QR code and fill out the form if you want to get access to it. We'll start talking about the challenges that we have in logging. The main purpose of logging is not to do logging or extract data; actually, it's to do data analysis. But if you want to do data analysis, you need to extract the information that gives you the relevant insights into what's going on in your applications or services. And with this, I'm talking about everything from the kernel to your user-space applications, containers, and everything in the middle. And when you get an issue and you want to troubleshoot, everybody asks, hey, where are the logs? Give us the logs. But sometimes having more logs doesn't give you more value, right? Actually, more telemetry data can add a lot of noise, which makes it really hard to extract the insights from your environment. Services write logging information to different places, and usually in different formats. For example, NGINX has the access log, and you can configure that for different virtual hosts in different paths, while other applications prefer to use syslog, where those syslog messages might end up in a different location. But as you can see, the content is quite different. And if you are a lovely user of Windows, you might end up with the events in a different place that can only be accessed through some weird API in the system. And I say weird because it's really complex to do it. Okay, and logging is not just about the file system. Logging information can exist anywhere. It could be /var/log/messages or /var/log/containers, but also, if you want to read the logging information from the kernel, you want to open the kernel messages device (/dev/kmsg), which is like a pseudo-file. Or maybe you want to read a terminal or serial interface to collect data from, for example if you are running some embedded Linux environment or you're playing with constrained devices; you need to collect the information from different places. And if your application is just connecting to systemd, journald also exposes its own way to consume and read the data from there. And the information around logging exists not just inside the same machine, but also on remote endpoints. If you think about Kubernetes, you have a distributed system, and if you want to access the Kubernetes events, likely you need to go and connect to the API server, which is on a remote machine, not a local service inside your node. And the same when you go to pull information for logs: sometimes you open up a TCP port or a UDP port to receive syslog messages or others.
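(To make that concrete, here is a minimal Fluent Bit configuration sketch, in its classic INI-style format, collecting from several of the places just mentioned. The paths, tags, and port are assumptions for illustration, and the syslog parser named here is assumed to come from the stock parsers.conf shipped with Fluent Bit.)

```
[SERVICE]
    flush         1
    parsers_file  parsers.conf   # stock parsers file shipped with Fluent Bit

# Local file: an NGINX access log (path is an assumption)
[INPUT]
    name    tail
    path    /var/log/nginx/access.log
    tag     nginx.access

# Local service: the systemd journal
[INPUT]
    name    systemd
    tag     host.journal

# Kernel messages device (/dev/kmsg)
[INPUT]
    name    kmsg
    tag     host.kernel

# Remote sources: receive syslog messages over UDP
[INPUT]
    name    syslog
    mode    udp
    listen  0.0.0.0
    port    5140
    parser  syslog-rfc3164
    tag     remote.syslog

# Print everything to stdout, just for the example
[OUTPUT]
    name    stdout
    match   *
```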
So you might guess that having all this information in different formats, from different sources, dealing with local services and remote endpoints, gets really complex, because at the end of the day, you have to deal with all of that at the same time. You just cannot rely on a single file, right? And if you've been doing logging in this space for a while, you might understand what I'm referring to. But your final goal is always to do data analysis, right? Don't forget the goal. Nobody wakes up in the morning, I always say this, and says, oh, I'm going to do logging, I'm going to do metrics. No, you don't care about that. Nobody likes it, but you have to do it. So the approach of doing data analysis gets hard when you face the truth. You go out there and see different operating systems and different ways to collect the information, but your goal is to do data analysis. And at some point, you start realizing that data comes in what we call an unstructured format. Yes, you can understand that the first fraction of this web server log entry is an IP address, because your brain has been trained to recognize an IP address, right? But not the computer. And you understand that later you have a timestamp, then you have some HTTP components, like the request method, protocol version, status code, and so on. If you have an unstructured log, and there is no real structure for the computer, this gets really complex to parse when you are processing thousands and thousands of messages per second. Ideally, you want to get something like this. I'm not saying that JSON is the best structure, but I'm trying to say that you should have a notion for the computer that says this key equals this value, and you don't want to parse all the content over and over. But if your data comes in different formats in an unstructured way, how do you solve that problem at scale? Ideally, you want to do something like this: you take your unstructured data, you go through a parsing process, and that gives you a final structured version of the information. But most of the time you're doing this: you're writing your custom script on a machine to extract specific information, and tomorrow you create a different scraper, and a different scraper, and so on. And I know people are laughing, because everybody has been in this place. And then you end up doing something like this: picking just one single database. There are many, right? And then you start putting more information into that database, and you end up paying a ton of money. And remember, you want to do data analysis, but you're not consuming all the information that you're sending to your database. Maybe you are consuming 20% of that information and you are paying for 100%. Yeah, if all of you are operators and you are not paying the bills in your company, good, it's not your problem. But then you go ask for a raise: hey, we're spending so much on infrastructure, blah, blah, blah. Make it cheaper, and we can give you a raise next year. It happens, right? So what is your strategy to deal with this? This doesn't scale. And sometimes we move companies, right? And then the next developer or the next operator comes in and needs to solve the problem again, and again, and again. Or worse, you join a new company and you have to fix the mess that already exists, right? And without a strategy, it gets really complex.
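(As a sketch of that parsing step: this is roughly how Fluent Bit turns an unstructured access-log line into structured key-value pairs at collection time. The regex is a trimmed-down, illustrative version of a typical Apache/NGINX-style access-log parser, so treat the exact pattern and field names as assumptions.)

```
# parsers.conf -- a simplified access-log parser (named groups become keys)
[PARSER]
    name         simple_access
    format       regex
    regex        ^(?<remote>[^ ]*) [^ ]* [^ ]* \[(?<time>[^\]]*)\] "(?<method>\S+) (?<path>[^"]*) (?<protocol>\S+)" (?<code>[^ ]*) (?<size>[^ ]*)$
    time_key     time
    time_format  %d/%b/%Y:%H:%M:%S %z

# fluent-bit.conf -- parse once at collection time, not over and over at query time
[SERVICE]
    parsers_file  parsers.conf

[INPUT]
    name    tail
    path    /var/log/nginx/access.log
    parser  simple_access
    tag     nginx.access

[OUTPUT]
    name    stdout
    match   *
```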
So the logging challenges, as a summary and recap, are around different sources, different formats, different endpoints, and volume. Companies generate 20% to 30% more data every year, so what works today might not work next year unless you have the right strategy to deal with all these components around logging. And this is where Fluent Bit comes in, as a scalable solution that was started around 2014. Fluent Bit is from the Fluentd family. Fluent Bit is a graduated project within the CNCF, and it's written in the C language. Yeah, we're not going to debate today about Rust versus Golang versus XYZ, right? But it has a pluggable architecture. We have more than 100 connectors between input sources and destinations, and it's cross-platform: you can run it on Linux, Windows, macOS, BSD. And it's highly customizable, meaning that if you want to extend it, not just to move data but to do processing, and the features that we have to process the data in the middle are not enough for you, you can extend it by using Wasm, by using Lua, or Golang, which is quite powerful. And I'm proud to say that it's a fully vendor-neutral project. It's a project within the CNCF. It's not that a company will buy Fluent Bit and Fluent Bit will go away; it's not like that. Who contributes to Fluent Bit? Amazon, Google, Oracle, Microsoft. There are a couple dozen companies investing engineering resources to make Fluent Bit even better. And Fluent Bit is not just for logs. Today, Fluent Bit handles logs, metrics, and traces. And I know you will have a ton of questions around this. But what I can tell you is that when you think about logs in the Fluent world, logs are what we call schema-less. We have a schema, but it's schema-less, meaning that it's quite flexible and can contain all the data that you can imagine. We can also manage binary data, not just text. And a schema-less log is just a ton of key-value pairs, such as JSON, but we use something called MessagePack internally. In the metrics world, this is a simplified version of our schema: we support different types of metrics like counters, gauges, and histograms; we can have dimensions or labels; you can have the description, the value for the metric, and so on. And for traces, this is an even more simplified version. Actually, for traces, we use a very similar structure to OpenTelemetry. And everything that I'm saying here about logs, metrics, and traces is compatible with Prometheus and compatible with OpenTelemetry. So that means that Fluent Bit is a very versatile tool that can just plug into your architecture and solve all these problems. And now I'm going to hand over the microphone and start talking about the main topic here, because I know that we are all suffering from logging and we're looking for best practices. So, Anurag, please, welcome to the stage. Okay, so we've now talked about the problem. How do we go about solving it? And really, when we think about how we're going to solve these logging problems, these data problems, I like to think of it in three different buckets that we'll run through. So the first is processing, right? We're collecting all this information, and as we're collecting it, we can do stuff with it as we go. That's enrichment, reducing log volume, redaction, conforming to a schema: making it more useful as we go and search or do data analysis. And even, you know, sometimes we just don't need to deal with logs at all.
Let's convert that to metrics. We'll talk a little bit about some architectures, right? So Fluent Bit and how you can collect that data. We'll talk a little bit about how you can pair that up with the OTel SDK, the OTel Collector, all of the good stuff that's within the CNCF family. And then last, we'll run through some of the monitoring and operations side of logging, right? There's a ton of information going through; we wanna make sure that it runs as needed. So what I'll do is I'm gonna go ahead and switch my screen, and what we'll do is walk through a couple of examples here. So first we'll walk through the first example. In this example, everything here is in Fluent Bit. We're gonna be generating a message, so think of it as logging something that says hello world once per second. And what we're gonna end up doing here is some enrichment. So why is this useful, right? Many times logs can be missing a bunch of information that might be relevant for someone debugging. Especially in a containerized world, we might have 50 logs that look exactly the same; we might have 50 versions of the app running. And without understanding where it's running, how it's running, what host it's running on, it might be really hard to figure out and pinpoint what happened. So let's go ahead and run that configuration. And what we'll see is that while we're running that hello world, we're also enriching it with the hostname. So we've taken something from the machine, from the box, and enriched the log. So from a best-practices standpoint, sometimes it might feel counterintuitive, but we've got to add information to make things faster: add some volume, add some additional structure to understand what's going on. So next let's go to the second example. And in the second example, we're gonna take a very common use case, right? Someone accidentally turned on debug and trace, and all hell is breaking loose within the backend. It's taking in that data like a champion, but it's also charging us for that data. And sometimes best practices can involve: let's save on data volume, let's get rid of stuff that's not gonna be useful, let's make sure our queries run really quickly so that we're not searching through a bunch of junk. And here what we have is a filter that says, if I find something that says debug or info, I'm gonna go ahead and remove it. So we're gonna completely remove that log that's coming in as debug. And if we go and run this, now only the errors are gonna show up, right? So a very simple example: we have a ton of stuff coming in, and as a best practice, we don't have to just collect everything, match star, all of it; we can really go and filter it as we collect it. Why not? So next let's go to our third example. Oops, I almost gave it away there. And this is doing a little bit of business logic or processing as the log is being generated. So this is my credit card number. You can take a picture, whatnot, but as we want to go and show this to the whole session, what we're gonna do is process it out: let's redact it. So traditionally when we're doing logging, there can be times when we have sensitive information. And if we are collecting that sensitive information, and it's routed and it's gonna get highly available with two nines of availability, well shoot, it's gonna be really hard to go delete that. So why don't we change the data itself: keep the structure, but redact it in place?
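(The demo configs themselves aren't reproduced in the transcript, but a rough sketch of the three ideas, enrichment, volume reduction, and in-place redaction, using stock Fluent Bit filters could look like this. The dummy record, tag, field names, and the inline Lua function are invented for illustration.)

```
# A dummy record standing in for an application log (level/cc fields invented)
[INPUT]
    name   dummy
    dummy  {"log":"hello world","level":"error","cc":"4111-1111-1111-1111"}
    tag    app.demo

# Example 1: enrichment -- attach the hostname to every record
# (here taken from the HOSTNAME environment variable)
[FILTER]
    name    record_modifier
    match   app.demo
    record  hostname ${HOSTNAME}

# Example 2: volume reduction -- drop records whose level is debug or info
# (the dummy above uses "error", so it passes through)
[FILTER]
    name     grep
    match    app.demo
    exclude  level (debug|info)

# Example 3: redaction -- overwrite the sensitive field in place (inline Lua)
[FILTER]
    name   lua
    match  app.demo
    call   redact
    code   function redact(tag, ts, record) record["cc"] = "REDACTED" return 1, ts, record end

[OUTPUT]
    name   stdout
    match  app.demo
```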
So this is another best practice where, hey, we've got this information, it contains a lot of vital stuff, but we don't necessarily need to collect it or store it in a highly redundant environment. Now, another one that's really great to see is logs to metrics. And what this is: logs contain a ton of information, and sometimes, as this message kind of shows, it's very, very heavy, right? I've got the namespace, I've got all this container stuff, and in the end, all I care about as an operator is how many times this dummy message actually showed up. So sometimes the best way to handle a log is to just convert it to a metric, right? We've got alerting, we've got all sorts of processes that are built on top of this, and here we're gonna export it so that we can query it with the Prometheus exporter; there's a sketch of this config just below. If we wanted, we could use OTel and send it to OTel metrics. So let's go ahead and run this. This is gonna give us a count of all those messages, but let's forget the logs for a second and turn our attention to the right-hand side. What we're doing is querying the Prometheus exporter endpoint, and as you can see, it's incrementing the number of messages that show up, right? So all that stuff on the left-hand side, we don't care about any of it. We just wanna know this number, 17, 15, whatever it is at that time. So the best practice in this sense is really: how do we just get rid of the log and extract the information that we want? Now I'll run through this a little quickly, because I think we have another demo that is really great within the operations side of logging, which is architecture patterns. And sometimes we feel we have to do every single thing within the place where we collect that information, but what's really fantastic about cloud native infrastructure, let me move this thing out of the way, what's really great about cloud native infrastructure is that we can chain these things together. We can collect logs on our servers, on our IoT devices, our MacBooks, whatever, and ship that data up to a more centralized or operationalized central place to do that processing. That can run our scripting, that can run our redaction, that can run our removal of those logs. And if for some reason we're getting terabytes and terabytes of data a day, we can scale that up, right? We can scale it up using some of the, hopefully, good stuff you've learned at this conference: bump the replicas up, make sure everything runs in parallel, and you have a bunch of tools within the CNCF itself that work together almost magically. So with that, let me go ahead and hand this off to LaCara, who'll talk a little bit about the operational side of this. Thank you, Eric. Okay, hello everyone. I hope you're enjoying the conference. Gonna move to, okay. Okay, so when you're running your observability in production, you would like to monitor certain aspects of it. You would like to see how your ingestion is going into your pipeline, how it is actually being processed, and later being sent out to your output destinations. So now what we're going to see is how we can detect and handle some particular common issues, like the errors that may occur when you are sending your data to your destination, and how you can also deal with back pressure. Let's say that you're sending a lot of data and your endpoint is not capable of handling the amount of data that you're sending.
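(Going back to the logs-to-metrics demo for a moment, a minimal sketch of it with Fluent Bit's log_to_metrics filter and Prometheus exporter output might look like this; the dummy record, tags, and metric name are assumptions for illustration.)

```
# Heavy Kubernetes-style log lines in, one Prometheus counter out
[INPUT]
    name   dummy
    dummy  {"message":"dummy","kubernetes":{"namespace_name":"default"}}
    tag    app.logs

# Count matching log records instead of storing them
[FILTER]
    name                log_to_metrics
    match               app.logs
    tag                 app.metrics
    metric_mode         counter
    metric_name         dummy_messages_total
    metric_description  Number of dummy log messages observed

# Expose the counter on a Prometheus scrape endpoint (default port shown)
[OUTPUT]
    name   prometheus_exporter
    match  app.metrics
    host   0.0.0.0
    port   2021
```

(Querying something like curl http://127.0.0.1:2021/metrics then shows the counter incrementing, which is the number on the right-hand side of the demo.)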
The buffering comes to the rescue to deal with this back pressure. So this is an instance where we have Fluent Bit running. Sorry. And we have an output server here; this is for benchmarking and demo purposes, it's just a simple HTTP server written in Go that will help us see how we're sending data. So we run Fluent Bit here, and this particular configuration is just sending a dummy message to our endpoint. And this is one of the Grafana dashboards that we usually deploy to look into the Fluent Bit metrics. As Anurag showed before, Fluent Bit exposes its metrics on an HTTP endpoint; there's a built-in HTTP server where you can grab the metrics from Fluent Bit. In this case, we're using Prometheus to scrape those metrics, and then we are showing them here in Grafana. As I said, this is a very, very small sample, just sending one dummy message per second, so you will see that it's pretty well-behaved here and nothing is wrong. Everything looks good: the number of connections, et cetera. So what happens if your endpoint goes down? We should start seeing immediately on Fluent Bit that this is failing. It says that the endpoint is not available, and some chunks, which are the basic data structure that we use, will be dropped. So we should see that we keep ingesting data into our pipeline, but we are no longer capable of sending the data to the endpoint. And if we start the endpoint again, yeah, we're good; if we start the endpoint again, we will see how quickly Fluent Bit recovers. And we don't even need to go to the actual log file of Fluent Bit in this case to see that it's recovering; we will see pretty fast that Fluent Bit gets to a good pace on the output side. So what if we put a little bit more load here to better see what's going on, right? To better see how Fluent Bit is able to recover. So this will generate a fair amount of load. This is a small machine, so not that much. But with this load, if our endpoint goes down, we will notice a lot of warning messages saying that it could not send the data to the output, right? And we will notice that. But we already saw the errors there. So what if we don't have that many errors, but we have a delay in our endpoint? Okay, the input and output are pretty much the same now. We can identify some dropped records here, but input and output are pretty much the same, right? So what if my endpoint has some delay? So we're going to add some delay here. So our output is up, but it is introducing some delay. It's not that it's not responding, it's not that it's down, so there's no error in the other metrics that we may have, but it's taking longer to reply back. So we can see that this line, which was pretty much the same a few moments ago, now the output is very low, right? Our endpoint is replying back, it's receiving the data, but at a lower rate, because we introduced some delay. So what can we do to deal with this? Well, we can have our file system used as a buffering mechanism. So when it starts failing, Fluent Bit will save our chunks into the file system, it will use whatever space you define to be used there, and then it will try to resend this data again. By default, we will only retry once, right? But if you're not allowed to lose any data, well, you have to configure a good amount of space for your storage, and also configure the retry limit either to unlimited or to a higher number. And you will see that Fluent Bit will keep retrying until the number of retries is reached.
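(A sketch of the resilience settings being demonstrated, assuming the demo's Go HTTP server listens on localhost port 8080; the storage path and size limits are placeholders.)

```
[SERVICE]
    # Where filesystem chunks live; enable storage metrics for the dashboard
    storage.path     /var/lib/fluent-bit/storage
    storage.sync     normal
    storage.metrics  on

[INPUT]
    name          dummy
    tag           app.demo
    # Chunks for this input are backed by the filesystem, not only memory
    storage.type  filesystem

[OUTPUT]
    name   http
    match  app.demo
    host   127.0.0.1
    port   8080
    # Default is a single retry; raise it (or set no_limits) if losing data is not acceptable
    retry_limit               3
    # Cap the disk space this output's queued chunks may consume
    storage.total_limit_size  2G
```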
Let's say you set three retries there: you will have the first attempt and three more retries, and after that it will not try to send it again. So now these numbers are pretty far apart, right? We could restart the output server and see how quickly it recovers and gets to a better pace of sending the data to your output. We have to take into consideration that some of those were dropped, the dropped records, because we configured it to only retry once, which is the default, okay? And now we can see this was the input line, right? And the output, you can see that as soon as the endpoint became available again and was not showing any delay in replying back to our requests, it recovers immediately, right? So depending on whether you're allowed to lose data, or whether you want to keep up with your latest records instead of just sending everything, you will have options to configure Fluent Bit, options to deal with the back pressure from your outputs. And yeah, that's it. Now back to Eduardo. Thank you. Now, oh, we put this in hiding mode. Okay, maybe I can go through this without showing the content here, but let's talk about it anyway. There you go. One of the keys in Fluent Bit is how it deals with back pressure. You understand the concept of back pressure, right? Even in your house, right? If you have a pipe, you have water flowing through it. At some point, if you put in more water, the water will not flow faster, because the pipe has a capacity. The same happens in computers. Everything is buffers and pipes, right? So if you have more data, it doesn't mean that the data will flow faster. And at some point, maybe your pipe could be really, really fast, but when you're sending that water, or that data, the remote endpoint might not be able to receive it at the speed that you are sending it. Example: you have 100 nodes sending data to one endpoint, for example, Elasticsearch. And now you want 200 nodes sending the same amount of data to that endpoint. It might not work. Indexing will be delayed, which adds back pressure. It cannot keep processing the data, and you start accumulating. And this is where buffering comes in. Buffering is the capacity to store the data in a very efficient and safe way, so if you cannot deliver the data fast enough, at least you can keep the state. And if the agent fails or your machine gets restarted, you can recover, and your data can be reprocessed when it's time to go. By default, you use memory buffering. But there is something that, well, everybody in production enables, right? Everybody in production enables what we call the hybrid mode, which is file system buffering. If you're familiar with memory-mapped files, or if you're familiar with databases, you know there are very efficient ways to store data in the file system that are really cheap from a system-calls perspective. Normally, when you work with a file, you invoke system calls: to open, to read, then to write, then to close the file, right? You invoke four system calls. But when you're processing data at this scale, invoking all those system calls is really, really expensive. And that's why we need another mechanism that can deal with the file system but is cheap, and this is memory-mapped files. Memory-mapped files allow us to open a file and get kind of a mirror of the content of that file right away in memory, so we can reduce the number of system calls involved in reading or writing content.
And the other thing is, and we were giving this example yesterday: if you're using just memory buffering, maybe you can tell me what will happen if I have back pressure and I'm just accumulating data in memory. What happens? You run out of memory. What does that mean? The process stops. Well, the kernel will kill it. That's it. It will not let you eat all the memory that you have in the system just because your process keeps accumulating data that it cannot flush out, right? And this is back pressure. If you're just using memory, that will happen. But if you use the file system, and everybody in production uses it, Microsoft, Google, Oracle all run with this mechanism enabled, what it does is create the chunks in memory, but they have a reflection in the file system. And at some point, if you face back pressure and your memory usage is going up, you can set a limit. For example, I don't want to use more than 500 megabytes. And when you hit this limit, we have this concept of up and down, right? All the chunks that are not being used but are up in memory will go to a down state in a very safe way. And of course, everybody has more storage than memory, so you can assign a couple of terabytes for this. And even if your endpoint is down for hours, your data will be safe, and you can deal with back pressure. But if you don't enable this, yes, you might get into some trouble, right? So, a couple of the best practices that we were talking about today. The first one, as Anurag mentioned, is about processing the data. You don't need all the data that you are collecting. Deal with that. You don't need everything. It's not like at home, right? Oh, we'll save this for later. No, it doesn't work like that for data, because you pay for it, right? Also, there are ways to exclude data based on patterns, and you can process the data before you send it out, right? It will be cheaper. And the other aspect is always monitoring. It's not about just running the agent and letting the data flow. Hey, are we facing latency? Are we facing back pressure? Because if you notice that you're facing back pressure, hey, you might consider adding more storage to the pipeline, right? Because you want to be more resilient. And there's one thing: everybody prepares for when things work. But in this world of Kubernetes, cloud native, and moving data, you have to prepare for when things fail. What if you get a DNS issue? What if the network goes down? Yeah, you are in the worst-case scenario, and you want to make sure you do your best to avoid losing data. Okay, so maybe we have some minutes for Q&A. I know the microphone is there. So if you have any questions, we'll be happy to answer about back pressure, performance, logging, OpenTelemetry, metrics, anything that comes to your mind. It's better if you go to the microphone, so that we get the recording of the question for people who are watching online. Hey, Bradon. Hello. Hello. I have a question about back pressure. Yeah. In terms of a back pressure strategy for pausing input, to make sure that you're not overloading while your output is over its limit or something like that. When you're in memory, the memory version, there's a mem_buf_limit, and if you're over that mem_buf_limit, the input will pause. Yeah. That doesn't work for file system storage. And there is a close equivalent, which is pausing when too many chunks are up. So you can set your max chunks up and pause on going over.
But is there a possibility to introduce a concept where you can pause input based on your output's total limit size being exceeded? Yeah, there's one option to pause the ingestion based on the storage limit that you assign to an output plugin. It's called storage.pause_on_chunks_overlimit, something like that. It looks like that's not based on the output's total limit size; it looks like it's based on the max chunks up configuration. It should be on the queue size, but let me check before giving you a full answer. Yeah, because I'm trying to nail that down and I can't figure it out. Okay, thank you. Thank you. And they're just telling me in the back that we have to stop right now, but I'm happy to answer all the questions in the back. Thank you so much.
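(For reference, the buffering and pausing options discussed in this Q&A look roughly like this in a config; the path, limits, and input are placeholders.)

```
[SERVICE]
    storage.path           /var/lib/fluent-bit/storage
    # How many filesystem-backed chunks may be "up" in memory at once;
    # chunks are around 2MB each, so this is effectively a memory ceiling
    storage.max_chunks_up  128

[INPUT]
    name                               tail
    path                               /var/log/containers/*.log
    tag                                kube.*
    storage.type                       filesystem
    # Memory-buffering ceiling for this input (memory mode pauses on this)
    mem_buf_limit                      50MB
    # Pause ingestion when this input exceeds its chunk capacity;
    # per the discussion above, this keys off the chunks-up limit,
    # not the output's storage.total_limit_size
    storage.pause_on_chunks_overlimit  on
```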