Hello everyone, I'm Jonathan Gonzalez from Chile and I'm currently the maintainer of the PostgreSQL output plugin for Fluent Bit. For the last 10 years I've been working with data processing and log processing for reports, stats, analysis, and site reporting. So this time I'm going to talk to you about the PostgreSQL output plugin for Fluent Bit. What does that mean? It means I'm going to explain a lot of stuff related to how to process logs on the Fluent Bit side and push them into PostgreSQL. So let's start the talk.

One of the motivations to start this plugin, a year and a half ago, was to play with the PostgreSQL type JSONB, which is a binary representation of JavaScript Object Notation. We basically needed to process JSON logs and find one line among millions of files, and that wasn't easy. At that time we needed to generate reports from JSON logs for, I don't know, administration and management, stuff like that, and maybe create some stats starting from those logs. Like, "in summer we have a lot of visits on this site", or "this summer users started to log in more", stuff like that, so people can ask questions of the data.

In the beginning the idea was to follow a simple process: Fluent Bit collects logs from different sources using different input plugins and sends them directly into a table in PostgreSQL, and from there you store the logs. But with time, and talking with my colleagues, we decided a better idea would be to keep Fluent Bit sending logs into PostgreSQL, but then process them using the PL/pgSQL language. That allows us to separate the records between different tables, do some preprocessing, or just apply some rules once the logs are inside PostgreSQL.

Fluent Bit sends the log records as JSON objects, which means we get a big JSON document plus some metadata; in our case each record contains a timestamp and a tag, plus the JSON object with everything else. We store the raw data in one main table (that's the table option in the plugin), and we use PL/pgSQL to process the data and split it into many tables, into one table, or keep it in the same table; that's not an issue. We can use any field inside the JSON object to decide where to store the data, how to split it, or any special handling like that. So we decided to base everything on the JSON object.

But what is JSON, or JSONB, in PostgreSQL? Well, JSON is just JavaScript Object Notation, and PostgreSQL implements the SQL technical report on JSON, which you can check out on the slide right here. You can use either JSON or JSONB, but we use JSONB because it supports indexing: it's easy to store the data and keep it indexed, and we know indexes make queries faster. It's also easy to query: you can select a field inside the document, in this case a date, straight from the table and you get the value. But one of the biggest reasons to use JSON is that it's really, really easy to export and use in other applications: we can export JSON to any application we want, to process the data or the logs from time to time, or anything like that.

So how do we configure this plugin? In the main configuration you will obviously need the host, because you may not have PostgreSQL running on localhost, and the port.
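To make that concrete, here is a minimal sketch of what the output section can look like. The values are placeholders, and the exact option names can vary between plugin versions, so check the plugin documentation for your release:

    [OUTPUT]
        # pgsql output plugin; host, password and friends are just examples
        Name      pgsql
        Match     *
        Host      192.168.2.3
        Port      5432
        User      fluentbit
        Password  YourCrazySecurePassword
        Database  fluentbit
        Table     fluentbit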
People usually don't change the port, but maybe you will need to. User and password: you can put them in the configuration, but we strongly recommend you use a pgpass file, because it's the better and more secure option. Then there are the database and the table. The table will be created by the plugin if it doesn't exist when it starts up and connects to PostgreSQL, so you don't need to create it yourself. We also support CockroachDB using cockroachdb true: CockroachDB speaks the PostgreSQL protocol, but it doesn't support some functions and some queries change, so we created a special option for it. There's a full list of options for the plugin, some interesting, some not; you can test them, and please report any bug you find.

So, how do we query the data? As we said before, each record is stored with the fields tag, time, and data, with data containing the JSON object. Here is an example of one record from Fluent Bit using the tag cpu.0, which is just the CPU input plugin from Fluent Bit. Limiting the query to one row, you can see the data over there; it's nice, easy and simple. With Apache logs you can see something a bit more interesting: in this case we use the tag to separate the logs, that's apache, and you can see that the record contains the response code and everything else that comes out of the apache2 parser from Fluent Bit. This is using the tail plugin to do some processing of Apache logs.

Speaking of years of logs: we needed to analyze three years of Apache logs, which was a really, really big task, because we had around one terabyte of data and we needed to process it in just two weeks. For this we decided to use the tail plugin, which needed some updates, and we added some options and documentation to push the Apache logs into PostgreSQL. We used the PL/pgSQL language to process the data and split it into different tables, because otherwise we would have billions of rows inside just one table. In our case we used partitioned tables, one partition per month, which allows us to create proper indexes and query just one month. A query for one month took less than one second, which is really, really amazing, because other databases that store logs would take more than one second with this amount of data.

So, let's deploy Fluent Bit plus the PostgreSQL output plugin inside Kubernetes. We provide a URL on GitHub (it's open source) and you can use Kustomize to deploy it. So, let's have some fun. In our configuration you can see the simple Fluent Bit configuration with the include statement that pulls the .conf files into the Fluent Bit configuration. The PostgreSQL customization is just a ConfigMap generator that uses the merge behavior to add the output PostgreSQL .conf file into the Fluent Bit ConfigMap. And in the output PostgreSQL .conf file you can see the host, which points to the PostgreSQL service of our Kubernetes deployment, using a dummy password, the user fluentbit, the database fluentbit and the table fluentbit, with the output plugin matching everything.

Okay, that was easy, but let's use some PL/pgSQL, because we want the data in separate tables; obviously we get logs from many places. So let's partition that table, because maybe in the future we will need to partition by month, year, week or day, and use some conditions to fill in the empty fields. So, let's see how this will work.
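The actual SQL lives on the slides, so what follows is only a rough sketch of the kind of setup described next. The table and column names (fluentbit, tag, time, data) follow the record layout mentioned earlier, but treat them as illustrative rather than the exact code from the demo:

    -- Main table the plugin writes into, partitioned by the record time.
    CREATE TABLE fluentbit (
        tag    varchar,
        "time" timestamp without time zone,
        data   jsonb
    ) PARTITION BY RANGE ("time");

    -- The table must be owned by the user doing the inserts.
    ALTER TABLE fluentbit OWNER TO fluentbit;

    -- Default partition: rows that match no other partition land here.
    CREATE TABLE fluentbit_default PARTITION OF fluentbit DEFAULT;

More partitions (per month, week or day) can be added later without touching the plugin configuration.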
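And here is a rough sketch of the trigger function that the walkthrough below describes, under the same assumptions. The kubernetes_logs table, the function name and the JSON paths (which assume the standard metadata added by the Fluent Bit kubernetes filter) are my own illustration, not the exact code from the slides:

    -- Destination table for records that carry a Kubernetes object.
    CREATE TABLE kubernetes_logs (
        "time"          timestamp without time zone,
        container_image text,
        container_name  text,
        namespace       text,
        host            text,
        labels          jsonb,
        annotations     jsonb,
        data            jsonb
    );
    ALTER TABLE kubernetes_logs OWNER TO fluentbit;

    CREATE OR REPLACE FUNCTION split_kubernetes_logs() RETURNS trigger AS $$
    BEGIN
        -- Not a Kubernetes record: return the full row so it stays in the default table.
        IF NOT (NEW.data ? 'kubernetes') THEN
            RETURN NEW;
        END IF;

        INSERT INTO kubernetes_logs (
            "time", container_image, container_name, namespace, host,
            labels, annotations, data
        ) VALUES (
            NEW."time",                                  -- timestamp column provided by the plugin
            NEW.data #>> '{kubernetes,container_image}',
            NEW.data #>> '{kubernetes,container_name}',
            NEW.data #>> '{kubernetes,namespace_name}',
            NEW.data #>> '{kubernetes,host}',
            NEW.data #>  '{kubernetes,labels}',          -- keep labels as JSONB
            NEW.data #>  '{kubernetes,annotations}',     -- keep annotations as JSONB
            NEW.data
        );

        -- Returning NULL skips the insert into the default table.
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    -- Recreate the trigger on the default partition.
    DROP TRIGGER IF EXISTS split_logs ON fluentbit_default;
    CREATE TRIGGER split_logs
        BEFORE INSERT ON fluentbit_default
        FOR EACH ROW EXECUTE FUNCTION split_kubernetes_logs();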
In our example code, we first create our table with all the data we want to store, with the partitioning option; that will be useful in case we want to partition the table later. It's important that the table is owned by the fluentbit user, since that's the one inserting the data. After that we create a default partition, which will be the one used by default; we can create more partitions later if we want. We then create the function that will do the final insert into the table. It's important that if the JSON object doesn't come with a kubernetes object, we skip that row and return the full row, since we still want to store the data even if it's not a Kubernetes record. Then we go on with the insert into our table. It's important to note that we want a proper timestamp type, and to make sure of that we use the default time column that comes with the plugin. Then we insert the fields we want; in our case that's the container image, container name, namespace and host. We also add the labels and annotations, just to prove that we can store JSON data too. We return NULL, because we don't want to store the row in the default table but in the one we decided on in our trigger. Finally we drop the trigger if it exists and create our trigger on the default table; this trigger will be executed for each row inserted into that table. For the record, I will leave the PL/pgSQL functions on the slides so you can take them and use them later as an example.

As you can see here, we can query the data using a SELECT DISTINCT on the container field in the Kubernetes logs, and we see that we have five records, but there's one with an empty field. Maybe we need to look into this later and see why some objects don't come with a container name or something like that; maybe that's a bug, but you can take a look, that will be your task.

A PL/pgSQL function can be more complicated, because sometimes we may want to send the data into more than one table. Let's see a function that splits the data depending on the tag: if the input doesn't match a tag, it goes to a default table, so we may end up with three tables. The function on the slide shows how to split between tags used for Apache and records that come with a Kubernetes object inside. You can use this function as an example later, but go and have some fun, because there are a lot of options you can use here to split your data and query it later.

Since this is a pre-recorded video, you will see some examples like this one, where we select all the fields from the kubernetes_logs table, which is not ideal; but we can SELECT DISTINCT the container image from the Kubernetes logs. In this case we select the container images and count them (obviously we need to group by a field, in this case the container image) and we see that we have a lot of containers. But maybe we want to know what happens if we split the containers and count them by host. That's easy: just add host and group by host. Then we have all the containers, split by host. So yeah, this is nice, this is cool, but let's see something else. In this case we are going to select all the fields from the Kubernetes logs where the container image is fluent-bit. Okay, there's a lot of data; maybe that's not so useful, you know, since it's selecting everything. So let's do something different: what about adding the host and counting all the fluent-bit images running on the different hosts? Okay, we have some good numbers here.
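That last query can look roughly like this, assuming the kubernetes_logs table sketched earlier; the image pattern is just an example:

    -- Count fluent-bit containers per host.
    SELECT host, count(*)
      FROM kubernetes_logs
     WHERE container_image LIKE '%fluent-bit%'
     GROUP BY host
     ORDER BY count(*) DESC;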
I would ask why we have 13 just on master02, but that's another topic. In the following case we will look at the Kubernetes logs and filter everything that is running on master01 with the container image fluent-bit. Okay, that's a lot of information again, but maybe we need to split it or work on it later. So let's see something else: what about the Apache logs? As you can see, the example we use has just four fields, host, path and code, plus the timestamp, which is maybe not so much data, but it contains different fields. There's a lot of data but it's not so useful, so let's group by the path and count. Okay, there's still a lot of data here, and that amount of data doesn't tell us anything, so let's do something different. What about adding the host? Yeah, let's work with the host. Okay, now we have the host, but still not much useful information. So let's use something to split the data by path, like, I don't know, WordPress. There's a lot of data with WordPress; well, that's something that will happen. But we want everything where the code is 200. No, there are a lot of 200s; as you can see, there's a WordPress running there. So let's see everything that is "not found". Okay, this is nicer. But you know, with Apache logs you will have a lot of data here, so let's see some other examples with different queries; I will also leave a rough sketch of these queries at the end of this transcript so you can adapt them.

But let's see now what is in the default table, fluentbit. Oh, there are a lot of records. Okay, let's see the different tags we got there. As you can see, we have Apache logs that maybe didn't fit the filter, plus the CPU records. So let's see all the data that has the tag cpu. Okay, you can see there's a lot of data that didn't fit our functions, so it was stored in the default table. That was the point of having a default table: you may not want to lose any data at any moment of your process. So let's look at some of that data, for example the CPU records; as you can see it's just everything that came from Fluent Bit.

As you can see, we can do a lot of stuff with this plugin once the data is in PostgreSQL. But there are some ideas you may want to experiment with right now: read the data using Grafana, which is a very useful tool to create graphs; split the data per week, not just per month, or maybe per year if you have that amount of data, or per day, which would be useful to start testing stuff; or create some scripts for automatic reports per month or per year. Those are some ideas you can take from here. Maybe there's something else you would add to the list, so please drop me a message.

After this talk we are now able to use Fluent Bit plus PostgreSQL and its capabilities. Next we are going to add SSL connection support for PostgreSQL and schema support, which means you will be able to use SSL to connect to PostgreSQL. In the meantime we hope to get some feedback from you, the users, and any other ideas to add to the plugin. So please contact me on Twitter, and let's go to the questions.
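As promised, here is a rough sketch of the kind of Apache log queries shown in the demo. It assumes an apache_logs table with host, path and code columns split out by a function like the ones above; the names are illustrative, so adapt them to your own schema.

    -- Requests per path.
    SELECT path, count(*)
      FROM apache_logs
     GROUP BY path
     ORDER BY count(*) DESC;

    -- WordPress-related requests that were not found (code stored as text here).
    SELECT host, path, count(*)
      FROM apache_logs
     WHERE path LIKE '%wordpress%'
       AND code = '404'
     GROUP BY host, path
     ORDER BY count(*) DESC;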