Hello everyone. How are you? Good. We're going to start in about one or two minutes. Who of you is working in the data or stream processing area? Just to get a notion. Oh, it's time. Well, my name is Eduardo Silva. I work at this company called Arm. We used to be Treasure Data, a company that was acquired by Arm one year ago. I work as a principal engineer, but my main role is maintainer of this project called Fluent Bit, which is part of this presentation, but not the main topic in general. The main thing about this presentation is all about data. And the goal of having data is that we aim to extract value from it, right? We can have a lot of data, but if we are not able to extract its value, there's nothing we can do with it; we are just wasting space in memory or on the file system. To extract value, what we need to do is perform analysis. If we do analysis, we can extract the value, but we assume that the data exists somewhere: in the file system, a database, or some kind of centralized service. And there are quite a few challenges here, because data now comes in multiple ways and multiple formats: over TCP, the file system, systemd, sensors, mobile applications, or whatever. There are many sources of data, including hardware; a lot of old hardware, like firewalls, ships data over UDP, over the network. And how do we accomplish this? Basically, a piece of hardware or software emits some record, which is an event, some bytes; this data gets centralized in a database, and then we access that data so we can perform our goal: analysis. With analysis, we can extract the value. But one of the problems is that we don't have just one single unit sending data, right? Nowadays, this is increasing a lot. We have a lot of software and hardware sending data, and centralizing all this information becomes more complex. And if we think about analysis, it will get delayed, right?
Because the more data you have on the ingestion side, the more time data processing takes, and your results and analysis take more time. So extracting value takes time. In the normal workflow, we collect data, we filter the data, then we aggregate the data somewhere, usually a database, then we index that information, because without indexing it's sometimes hard to find the specific information that I really care about. And then we can perform data analysis. But data indexing is sometimes becoming the bottleneck. And it doesn't matter which kind of database you're running; it can be Elasticsearch or, I don't know, MySQL — you have indexing. Without indexing, it gets hard. But if you have more data, your feedback loop takes time. And here's where we start talking about stream processing; we start connecting the dots here. Stream processing is the ability to perform data processing while the data is still in motion — likely, still in memory. There are different variations of what stream processing really means, but basically it's the ability to perform all this data processing before the data hits the file system, hits the hardware side, and gets stored. Because what we want to do is avoid indexing. In stream processing, for example, we assume that everything is a kind of event. Records are emitted by applications, but we also assume that events are structured messages. We have unstructured messages and structured messages. And the composition of every event in logging is always pretty much the same: you have a timestamp, meaning when this message was created or generated, and you have the message. Consider this kind of JSON as a pseudo format: you have the timestamp, and you have a temperature value of 53. One is the key, the other is the value. And we can have many keys and many values.
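As a sketch, the pseudo-JSON event just described might look like this (the timestamp representation and key names are illustrative):

```json
{
  "timestamp": 1561474800,
  "temperature": 53
}
```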
And if you have different applications, the records will be different, right? So how do you unify all of these? How do you process all this data with a low penalty? Because our general goals are fast data processing, and avoiding tables, indexing, and all those things. I'm not saying that we want to deprecate databases, because that's not the goal. Our goal is to get faster data insights. But for that, we need different workarounds and different ways to solve the problem. So how do we do it? One approach is: okay, we put some kind of stream processor on top of a database, or anything that can get the data right away, and perform all the data analysis in memory. And then we get some kind of real-time analysis. I know that "real-time" depends on who you ask. For example, for the embedded people, real-time is something that happens in the order of fractions of a millisecond. For us, it could be 20 milliseconds and so on. So the stream processor aims to sit in the middle, like a middleware or a layer that allows you to query this information and get the results back faster. One of the things we need is that we receive structured events. And if you're going to query the information, we need to expose some kind of query language, right? Because if you want to query, you have to select keys, you want to do filtering, you want to aggregate certain records and write results somewhere. And of course, all of this happens in memory. And when we talk about the edge and the cloud, the edge can be anything, right? It can be an embedded device; in the cloud space, the edge can be a Kubernetes node. So "the edge" is quite a particular term. But in most deployments of stream processing, we can say that the stream processor runs on the cloud side. And that is good — that is not bad, I mean. But we can also do it better.
In our scenario, what people usually do is put some kind of log collector or middleware layer on the edge side, where it aggregates all the information from the different hardware and software, and then ship the data out to the cloud, where you do another kind of stream processing. This is a common approach. You can use this one or that one; that is pretty normal. And these are just models — I'm not talking about which product you should use. But how can we make it better? Can we do a stream processor in a different way? And this is where we start talking about logging. Logging has similar challenges. Logging deals with data coming from TCP, the file system, systemd, sensors, and so on — pretty much the same things we are dealing with in the other scenario. And if we consider, for example, data from log files, you can see that the Apache log format is different from a JSON map and many others. All of them are raw text formats; you, for example, can understand what an Apache log means, but if you give that value, that string, to a computer, for it, it's just a raw array of bytes, and that's it. There's no structure. Maybe you can say that JSON has a structure — yeah, and we have a parser that we apply on top of it so we can read all the information. So for a simple message like "open source summit", logging gets the message, gets the timestamp, but also the stream where this message was generated. This is the case, for example, for a Docker container: the application generated "open source summit north america", and it created a JSON map, a kind of structured message, with more data. And, for example, if you are using Kubernetes or running a big cluster, usually what you want to add is some kind of context about where this message is coming from. It's not enough to say, oh, this message is coming from this application, because that application can be replicated across 10 nodes and you can have many replicas of it.
And that's why a log processor also needs to take that information and append metadata, saying, for example: this message is coming from this node, from this pod — the pod name, the pod ID, the namespace, and all that kind of information that gives us context. Because when you do analysis, you want to say: hey, please show me statistics of all the records coming from this specific pod, and maybe that pod exists on many nodes. This is from a Kubernetes, or classic, point of view. So logging becomes complex too. And in general, to solve all these data management problems, what we do is: you collect the data from an input, you parse the data because you want to convert it from unstructured to structured format, then you filter the data — maybe you care about all your data, maybe you want to filter some data out — you buffer your data because you don't want to lose it, and then you ship your data out to your database or cloud service, like InfluxDB, MySQL, Amazon S3, or any kind of backend. This pipeline is very well known, and this is where we start talking about the project called Fluent Bit, which is a full logging solution, part of the Fluentd project, that aims to solve all these logging challenges. Fluent Bit started in 2015; it began as a solution for embedded Linux, but in a few years it evolved into a solution for the cloud space, and of course it's fully open source, under the Apache license. It's written in C, and it's tuned for low memory and CPU footprint — we say that we use around 500 kilobytes, but if you're ingesting thousands of records, your memory will go up, so it depends on how you look at it. It has a pluggable architecture — we support more than 50 plugins, and plugins means inputs, parsers, filters, or places where you can send your data out — and it has built-in security with TLS for network I/O.
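That collect → parse → filter → buffer → output pipeline maps directly onto a Fluent Bit configuration. A minimal sketch — the file path, parser name, and filter rule here are illustrative, not from the talk:

```
[SERVICE]
    Flush        1

[INPUT]
    Name         tail
    Path         /var/log/app.log
    Parser       json

[FILTER]
    Name         grep
    Match        *
    Regex        level error

[OUTPUT]
    Name         es
    Match        *
    Host         127.0.0.1
    Port         9200
```

Each section corresponds to one stage of the pipeline; the buffering happens internally between the input and output stages.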
And in logging, in our case, it's pretty much the same: the application generates an event, or a record for us. That record has a timestamp, we get the message, and that message becomes serialized in binary. We don't use some kind of internal JSON, because it's quite expensive to compute; we use MessagePack, which is really performant. And when we get this data into Fluent Bit, we perform all this aggregation internally: we group all the records by tags, they go to the storage interface, and they're routed out to any kind of output plugin — because maybe you want to send the data that belongs to Apache to Elasticsearch, and maybe you want to send all your syslog data, I don't know, to a custom HTTP endpoint. So you decide how to collect the data, and where to send the data out. Nowadays, Fluent Bit is deployed more than 200,000 times every day, which is a lot in the cloud space, and it's being adopted widely by the AWS platform, and Datadog is contributing back because they want Fluent Bit to be able to talk to the Datadog cloud service — that is upcoming work. So how does this correlate — stream processing, logging, and the edge? The proposal here is pretty much connecting all the dots, because we're solving the same problem over and over in the cloud and on the edge. Given that the edge can now be anything — and if you consider embedded Linux or any kind of gateway, we have enough power to do a lot of processing — we can defer some processing: you can do it in the cloud, or you can do it on the edge. And the big reason for this is performance — and by performance, I mean getting our insights, or extracting the value, faster. So if you look at the logging pains, most of the pain is accessing the file system, collecting the data, doing the parsing, and then data indexing, which is more expensive. And this is where Fluent Bit 1.2, which was released about a month ago, comes with full stream processing capabilities.
By "full" I mean a basic set of stream processing capabilities for the edge, and growing. So this is the model that we saw a few slides ago. And what we actually do, since the new version, is integrate a full stream processor into the same Fluent Bit project, which runs as C code in a few hundred kilobytes, and it's very reliable. And why I say this is really important: if you want to do stream processing for small things, you'd otherwise have to deploy something like Kafka or Apache Spark, or something that you don't want to deal with. Maybe that's very useful for large amounts of archived data, when maybe you don't care too much about the response time of the data processing. But when you are on the edge, when you have thousands of nodes, and maybe you want to get some alerts — trigger something if the CPU is going too high — in a fraction of a second, you need something like this. You cannot wait minutes. And how does this work? In the same pipeline that we had, after we store the data in the storage engine, we send the data to the stream processor, and the stream processor is able to ingest results back into the same pipeline. We provide a way to run your own SQL queries to select data, filter, and create new streams of information. For example, you can do stream selections. You can select keys. I think you're quite familiar with this, but here there's no database, there's no indexing. You get the same fields, because all of them are key-values, and you query the streams — which would be your "tables" — where the data is coming from. The difference is that the data is always flowing, and you perform the queries as the data flows through the engine. We can also do aggregation functions, because if you're doing stream processing, you want to aggregate the data and perform calculations right away — so you can calculate average, minimum, count, and so on.
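As an illustration, a stream-processor query using aggregation functions might look roughly like this — the stream name and key are made up here, and the exact SQL dialect may differ between Fluent Bit versions:

```sql
SELECT AVG(temperature), MIN(temperature), COUNT(*)
    FROM STREAM:sensors;
```

The `STREAM:` source takes the place of a table: records are matched as they flow through the engine, with no index behind them.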
And for embedded, this is really useful too: you can do tumbling windows, meaning, for example, every five seconds calculate the average temperature of this specific device. And this is really useful because for edge things — where you want to measure CPU, measure memory, or measure anything that generates information on your device — you can do it, because you can connect to internal metrics of the system, collect data from log files, or listen for any kind of, I don't know, systemd information, and perform actions with the results you get from your queries. And this is a way to create streams. We're going to see it in a demo right now. Okay, let's do a demo; I think that's easier. The first demo is just a log file with a million records, and I only want to retrieve all the records where country equals United States and num equals 80. So let me show you here. Can you read it? It's okay. Okay. Let me show you the data file; I'm going to cat it. This is a plain file that has many records — I just applied jq to see how the data is formatted. We have a date, an IP, a word, a country, a flag, and a num. And this is where we're going to apply this kind of query. The query will be pretty simple: first, we're going to create a new stream with just the data we care about, right? Let me show you how we configure all of this, and we can start working. We're going to do basic stream processing first and then something more advanced, like data forecasting. Fluent Bit runs as a service, and it has a configuration file, or you can run it from the command line. Basically, you define an input and you define an output. You can add filters in the middle, but we are not using filters here. What we're doing here is using the tail input plugin.
Tail just follows a log file and consumes the new entries. We're setting an alias called open-source-summit, we're going to read the file called data.log, and apply a JSON parser. Because if we look at the data we have here — for example, I'm showing the first line of data.log — you will see that it's a string of bytes. But internally, you want to have this kind of map in a binary version, with keys and values. And for that, you need a parser. Okay? That's why we're using JSON. And are you familiar with Elasticsearch? All of you? Okay. Let's do a test with Elasticsearch to see how it operates — comparing how this works with Elasticsearch and without it. Let me check that it's running. Okay, Elastic is stopped; let me start it. And we are going to wipe Elasticsearch, because I don't want to have any old information. Don't do this at home. Okay, just wipe the database. And if we query the database using the HTTP endpoint, we can see the data that we have in the database. There's nothing — just headers. When you have data, you get more fields saying the number of documents, bytes, and so on. So what I'm going to do here is query the HTTP endpoint of Elastic every second on the right. Okay? And on the left, I'm going to flush data every one second, to be more friendly. And we're going to send the data not to the standard output but to Elasticsearch, and format the data for it: Logstash_Format On. You can have many outputs, but now we are just going to take the data from one log file and ingest the information into the database. We're not going to use the streams yet — comment that out. Okay, should be good. And now we are seeing the documents. Let me increase this one. Oops — some terminal thing happened. It's running, but it's not responding on the HTTP endpoint. And this was expected.
When you're sending too much data to a database like Elastic, sometimes it takes so much time to index the data that it stops responding to data ingestion. And some of you are smiling because you know this is true. The goal of this is not to explain how you can crash a simple database, but how data indexing can hurt your whole data ingestion pipeline. In computing, everything is about buffering. If you have a huge buffer of data, and the output cannot handle it, you're going to get backpressure; but you keep sending more data, and sometimes, if this guy cannot handle it, it will stop working, just like that. So Fluent Bit is trying to send the data, says, hey, I got a connection problem, and stops working. We just tried to send some amount of records; it happens that we did it too fast, we filled their buffers, and it crashed. Okay. Let's stop it — well, I don't need to stop it. What I wanted to do is replicate this: I wanted to insert the million records into Elastic and then be able to query the data. Now, if I want to do that, maybe we can try this. This is the friendly Elasticsearch setup, where we decrease our buffers and we put in our limits. This is the important line here: Mem_Buf_Limit means that I'm not going to ingest more than 10 megabytes at a time. So I read 10 megabytes, try to ingest those 10, and I keep waiting. That way, Elastic might succeed. Let me start it again. Let's wipe the database. Oops — let's not; it's running. Okay. And we're going to ingest now. I just started Fluent Bit again, just with some limits, and you will see that the number of documents here is growing. So if I decrease the rate of data ingestion, it's able to work. If you call any kind of database company, of course they will tell you: hey, put in two more nodes, three more nodes, four more nodes. And sometimes you can handle that in the pipeline. And this is working.
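The "friendly" setup described here boils down to capping the input buffer so the input pauses instead of overwhelming the output. A sketch of the relevant input section, assuming the tail input from the demo:

```
[INPUT]
    Name          tail
    Path          data.log
    Parser        json
    Mem_Buf_Limit 10MB
```

With `Mem_Buf_Limit` set, once 10 MB of records are in memory the input is paused until the output drains some of them, which is what keeps Elasticsearch responsive in the second run.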
But as you can see, it's taking some time, because we are trying to index a million records. Okay, so if that took about a minute, it will take about two minutes in total; we are halfway. And this is working. I'm going to stop it, because we don't really care about it. The thing is, I don't want to wait until that finishes to perform my query, because I want to have my information now. Okay, I'm going to stop Elastic — I don't want to run out of memory — I will stop the engine here, and let's do something else right now. The stream processor lets you run your own SQL queries on top of your data. So we are going to create a new stream of data with the query that we have there. We are going to create a new stream called results, tag those records with the name results, and select just specific fields of my original data, from a specific stream called open-source-summit — which comes from tail — with that specific condition. Nothing fancy. And if I'm not wrong, that query is in this configuration file, which is here. The only bug is that we only support one-line configuration; we are going to fix that, because otherwise it becomes unreadable. I'm going to enable the streams file; I don't need the limits. And I'm going to send the data to the standard output. I'm not going to send the data to Elasticsearch, because I want to get my results right away, so this is not needed. I'm going to flush the data faster. Let me do it this way. I forgot to do something: when you are using the standard output, it's always good to format the data — Format json_lines — so you can read all your data. Oh, but we are not getting the — oh, we are getting all the data. My fault, because we just want to match results. Streams — that should be fine. So now we are getting the results from the stream processor, and we are printing them to the standard output. What I'm trying to say with this is that you can do a lot of processing — it just finished — a lot of processing on the edge.
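The query just described might look roughly like this — the field names come from the demo data, but the exact stream-processor syntax may vary across Fluent Bit versions, so treat this as a sketch:

```sql
CREATE STREAM results
    WITH (tag='results')
    AS SELECT date, ip, country, num
       FROM STREAM:open-source-summit
       WHERE country = 'United States' AND num = 80;
```

The new stream is tagged `results`, so an output plugin with `Match results` receives only the filtered records.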
And you don't need to defer that to the cloud. What are the benefits? One is performance, because you get your data faster. Second, you don't saturate your network bandwidth, right? You just send the records that you really care about to your backend, to your database — not thousands of records. Now think: if you are in the embedded space and you have a thousand sensors, do you want to send every data sample from every sensor to the cloud? No. But that's what we are doing, right? With this, we can do it better: we can have a gateway, or the sensor device itself, deciding how much data we are really sending to the output destination, and we get the data faster. Right now, I'm sending the data just to the standard output — this is just for the demo — but we could send it to an HTTP endpoint; maybe you could connect to Slack or whatever, and send only the results you really care about. That is one of the use cases for stream processing on the edge side. And the other: if you like machine learning and all that stuff — and honestly, I'm very new in that area — one contributor from Arm just provided a linear regression implementation, where you can do some forecasting for time series. It's a quite simple implementation, but let me show you how it works. The query is like this. We have one plugin in Fluent Bit that just collects memory: how much memory we have, how much is used, and how much is free on the system it's running on. So we created a new stream called forecast, we are adding a new tag, and we are selecting the data using a time series forecast function, which is a linear regression internally. We pass the timestamp, the key for used memory, and the number of seconds to forecast ahead. The stream — and this is important — has a tumbling window of 10 seconds. That means: do all this forecasting every 10 seconds. In reality, you might want to do it every minute, or every 30 minutes, or you might want some kind of hopping window.
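A rough sketch of the forecast stream just described — the exact `TIMESERIES_FORECAST` argument order and window syntax here are assumptions, so check them against the stream-processor documentation for your Fluent Bit version:

```sql
CREATE STREAM forecast
    WITH (tag='forecast')
    AS SELECT TIMESERIES_FORECAST(timestamp, used, 10)
       FROM STREAM:mem
       WINDOW TUMBLING (10 SECOND);
```

Every 10 seconds, the window closes, the regression runs over the samples it collected, and a forecast record tagged `forecast` is emitted back into the pipeline.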
The first window would be 30 seconds, the next one 40, then 50 — you can configure that. Okay, so let me show you. This is the streams file. Now it gets more complex, because we first want to create a new stream with just the memory data, because at the moment we don't support nested functions. For example, here — this is something I wanted to explain — we are using TIMESERIES_FORECAST, but we still don't support putting one function inside another. So we create two streams, and it works like a chain. It's the same thing, but it would be beautiful to be able to nest something like a Unix timestamp function. That's not supported yet, but you can send it as a patch. Okay. And the configuration file for this one is here. The difference is that I'm loading the other streams file configuration. I'm collecting data from the mem plugin, I'm tagging that with mem_stat, and I'm putting an alias on it. And I'm going to send all the results of the memory calculation to the standard output, but everything that matches forecast, I'm going to send to a remote endpoint. In this case, I'm going to use one Fluent Bit and another Fluent Bit, where the second one will receive the results over the network. And this is pretty simple: I'm going to launch Fluent Bit here from the command line. I'm going to receive messages — collect data — over the forward protocol, which is a TCP protocol that we use in Fluentd and Fluent Bit, and I'm going to send the results to the standard output, formatted as JSON lines, every second. Okay. So on the right side, I'm just listening for the forecast messages, and this guy is going to connect to this one. And if I'm not wrong, the forecast will run every 10 seconds. So right now on the left, you are seeing snapshots of memory, because the input plugin collects the data and just prints that information to the standard output.
But every 10 seconds, I'm getting a forecast of how much memory will be used in the next 10 seconds, using linear regression. We're not going to explain the whole algorithm right now — I didn't write it, honestly. But this is the idea. And the plan is to add many kinds of machine learning algorithms on top of this. And of course, you are more expert than us on this; if you have ideas or use cases, we would like to hear about them, because I think this is really useful — this just runs on the edge, in a few kilobytes. There's no database; there's no Java environment. And I think you get my point at this moment. That was the presentation, and thanks so much for attending. We have time for questions. If you have any questions, comments — what would you like to see? For the parsers, we have different ways to parse the data. One of the backends we have uses regular expressions. We also support a JSON backend, and we support logfmt, which is kind of the logging format the Go community created, and so on. So the thing is that you configure your own parser for your own data type. We have more internals of Fluent Bit, if you want to see how it works. Okay, so we have some time before we go get some beers. So basically, it runs as a single engine, with asynchronous I/O — inputs, filters, everything is asynchronous. There are no blocking calls internally in the code. And this is really important to say, because people ask: hey, how many threads are you running? We are not running threads; we are running coroutines. So every time we are going to do some kind of I/O to the network or to a service, we defer the task to the kernel, go back to the event loop, and just wait for the notification to wake the function up.
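The forecast described here is conceptually just a least-squares linear regression over the samples collected in the window, extrapolated some seconds ahead. A minimal Python sketch of that idea — illustrative only, not Fluent Bit's actual C implementation:

```python
def forecast(samples, seconds_ahead):
    """Fit y = a*x + b over (timestamp, value) samples, then
    extrapolate the value seconds_ahead past the last sample."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    # closed-form least-squares slope and intercept
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    future_t = samples[-1][0] + seconds_ahead
    return a * future_t + b

# Memory usage growing by 2 MB per second over a 10-second window:
window = [(t, 100 + 2 * t) for t in range(10)]
print(forecast(window, 10))  # → 138.0
```

The real stream processor does the same kind of fit over whatever key you select, once per tumbling window.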
And if you consider, for example, the output plugins — we are going to ship data out, which is a more complex task, I would say — there's a common workflow: create a TCP connection, convert the internal representation (do you remember that we talked about MessagePack? Elasticsearch is expecting a specific JSON payload, but we have MessagePack, so every output plugin needs to format the data for its output), write the data, wait for a response, and report the final status. And most of those steps are blocking calls. In any programming language, unless you are doing asynchronous I/O, you need to perform some wait in that code, or a busy wait. This is a plugin — this is a fraction of the Elasticsearch plugin. What it does, basically, on line eight, using the Fluent Bit API, is connect to the remote endpoint; on line 14, it formats the data from MessagePack to JSON. On line 17, we create an HTTP client, but that is not blocking, because it's just a context; on line 21, though, there are network operations. And what we do in the code, if you use this API: if you look at the function, you will see, oh, this is doing step A, B, C, D, and so on. But internally, when it hits line eight, it tries to connect, and it suspends that function at that moment, returns to the event loop to collect data, transform, and parse, and at some point that function is resumed, once the connection has succeeded or got an exception like an error. And we did it this way because we as developers are very good at messing things up. When you're dealing with networking, you don't want to defer that job to the developer, because a socket can fail, maybe you get a server status you're not handling, and you want to abstract all of those things. And it's the same with line 21, when we send the data to the server.
And when we send the data out, we can return a status saying: hey, we flushed the data, okay; or we need a retry — a retry means the database was down, or I was not able to make the connection; or an error, which means I'm not going to retry this data — maybe the data was malformed, or the server cannot handle it. We have built-in plugins — sorry, built-in helpers — like an HTTP client, upstream connections with TCP, TLS, and OAuth2 (the only plugin that uses OAuth2 is the Google Stackdriver plugin), timers, and we support LuaJIT, so you can also write your own filters in Lua and load them just through the configuration. We have more than 50 plugins; this is just a sample of the ones available: inputs to collect with tail, kernel log messages, the serial interface; we can filter the data and put in some limits; if you're using Kubernetes, you're going to use the Kubernetes filter; and outputs like Splunk, Azure, Kafka, or anything you are running. This is a simple Lua filter: with some filters, you say, I would like to remove some specific entry from the record, but you're not going to write a C plugin for that. We wanted to have something simpler that you can drive from the configuration: you write a simple Lua function, the record is exposed like a Lua table, and you just add or remove fields easily. If you like to do monitoring — and I'm sure everybody does — we have a custom HTTP endpoint, so you can get metrics of how Fluent Bit and the pipeline are working, and we can also export the data in Prometheus format. So if you use Prometheus, you can pull the metrics. That's the last slide. Well, that was the presentation, and I will be around; if you have more questions, feel free to ask. We have project stickers if you want. We don't have t-shirts this time, but we hope next time. That's it. Thanks for coming.