Welcome to my talk about Apache Kafka as a monitoring data pipeline. My name is Jakub Scholz and I work as a software engineer at Red Hat. I'm one of the maintainers of a CNCF project called Strimzi, which focuses on running Apache Kafka on Kubernetes, and I'm also a contributor to Apache Kafka itself.

Let's start by quickly defining what kind of monitoring I will be looking at here. If you Google it, you will find many different definitions — trust me, I tried when writing the slides. In this talk I will mostly focus on collecting, analyzing, and using all the different information we can get from a system and which helps us understand it. That typically includes logs, metrics, traces, and so on. These data have different use cases. For example, we usually need to understand the current state of the system: whether it's running okay, whether it has problems, whether we need to do something about them. But the data are also useful later, for example to analyze past issues, or maybe to investigate possible future improvements in performance or functionality.

Usually there is some monitoring pipeline used to collect these data: a collector or agent, running inside the application or on the same server as the application, gathers the data. Then there is usually some parsing step, because the data come in different formats and might need to be normalized. Then you typically filter out which of them really matter and which are useless, and finally you route them to some other application — maybe you just want to view the data, maybe you want to analyze them a bit deeper, maybe you just want to store them for later or do some further processing.

One of the best-known examples is the so-called EFK stack — Elasticsearch, Fluentd, Kibana — which is used for collecting logging data. Sometimes Fluentd is replaced with Logstash; that's then the ELK stack. In this stack, Fluentd is the collector: it collects the log records, parses and filters them, and then forwards them to Elasticsearch, which stores and indexes the logs. Kibana is then used for browsing through them, visualizing them, building dashboards, and so on. When these components communicate, they by default usually use HTTP (or HTTPS, of course): Fluentd sends the data over HTTP to Elasticsearch, which stores and indexes them, and Kibana again uses HTTP to request these data and display them.

In this talk I want to focus on how Apache Kafka can improve pipelines like these and what value it can add. But before we deep dive into that, let's first have a quick look at what Apache Kafka itself is and what its main features are. Apache Kafka is a distributed event streaming platform which combines delivery, storage, and processing of records — or messages, if you want. It's designed for high performance and high scalability — you can very easily scale it horizontally — and it's also designed for availability, reliability, durability, fault tolerance, and all these things. It also has a huge ecosystem of tools, client libraries, and connectors which you can use to work with the data you have in Kafka.
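To make the EFK flow described above a bit more concrete, here is a minimal, hypothetical Fluentd configuration sketching the collect, parse, filter, and route stages — wrapped in a Kubernetes ConfigMap, since everything later in this talk runs on Kubernetes. All names, paths, and the filtering rule are illustrative assumptions, not taken from the talk's demo:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    # collect + parse: tail the container log files and parse them as JSON
    <source>
      @type tail
      path /var/log/containers/*.log
      tag kube.*
      <parse>
        @type json
      </parse>
    </source>

    # filter: keep only warnings and errors (purely illustrative rule)
    <filter kube.**>
      @type grep
      <regexp>
        key level
        pattern /warn|error/
      </regexp>
    </filter>

    # route: index the remaining records in Elasticsearch over HTTP
    <match kube.**>
      @type elasticsearch
      host elasticsearch
      port 9200
      logstash_format true
    </match>
```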
There are several main features which Kafka can bring into a monitoring pipeline. One of them is that Kafka uses an efficient protocol based on TCP, which gives you very good performance in delivering the data. Kafka also doesn't really care about the messages it's getting: it treats all messages as byte arrays. It just takes the bytes from the clients — the producers — and dumps them on disk; later, when a consumer connects, they are either read from disk or are maybe still in the disk cache in memory. It doesn't do any parsing, and that allows for high message throughput, which is useful for collecting monitoring data, because that usually means a lot of messages and fairly large amounts of data, and Kafka has no problem handling that.

The protocol also supports configurable reliability, so you can define how much reliability you want: to how many different broker replicas the data should be written, and how reliably they should be produced. Interestingly enough, in my experience, users running monitoring data through Kafka have very different requirements here. Some treat the monitoring data as "if I lose some log records, who cares", while others are very strict about them, need to keep all of them, and need very high reliability. Both are completely fine for Kafka, because you can really configure it and use it as you want — a sketch of these knobs follows below.

Another nice thing about Kafka in your monitoring pipeline is that it decouples the different components. It can basically serve as a buffer between Fluentd, which collects the logs, and Elasticsearch, which stores and indexes them. Fluentd really just connects to Kafka and sends the logs there; it doesn't care what happens next. Asynchronously from that, Kafka can take these data and deliver them to other systems such as Elasticsearch. So if the end system is unavailable — because of a crash or some disaster situation — the data will be stored in Kafka and will be available later.

Similarly, Kafka can help with peak situations, which is very important because quite often, when your system has problems, there are a lot of log messages floating around, and you don't want to lose them because some component somewhere in the pipeline gets overwhelmed. Kafka can ingest huge amounts of data and then, when the peak is over, deliver them to the other systems — basically evening out the peak.

So what we will really look into in this talk, and what I will show in the demo, is something you could call an EFKK stack, where we put Kafka into the middle of the EFK stack. It works very similarly: Fluentd collects the records and sends them to Kafka; Kafka stores them; and Kafka Connect — Kafka's integration framework — together with connectors provided by the Apache Camel project forwards the logs to Elasticsearch. Then, as in the original stack, Kibana connects to Elasticsearch to browse through the logs. So it looks something like this: one part of the pipeline consists of Fluentd sending the messages to Kafka, which stores them, and the second part gets the data out of Kafka and pushes them into Elasticsearch, from where they are used by Kibana.
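Here is the reliability sketch mentioned above, expressed as a Strimzi KafkaTopic resource, since that is what the demo uses later. The names and values are illustrative assumptions:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-logs                  # hypothetical topic name
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3                    # each record is stored on 3 brokers
  config:
    min.insync.replicas: 2       # producers using acks=all wait for 2 replicas
    retention.ms: 604800000      # keep the data for 7 days
```

The other half of the reliability configuration lives on the producer side: the client chooses how reliably to produce with the `acks` setting — `0` (fire and forget), `1` (leader only), or `all` (all in-sync replicas).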
So yeah, that's how the whole pipeline looks. And to make things more interesting, in the demo we will run all of that on Kubernetes, because it's today's very popular platform, so using it in the demo seems quite obvious.

I already prepared some things for the demo up front. First of all, I have here my Kubernetes cluster, using the latest Kubernetes version, running in AWS with several nodes, and I have already deployed some pods. First I deployed the Strimzi cluster operator, and using this operator I already deployed the Kafka cluster and the Kafka Connect cluster. Deploying the Kafka cluster is super easy: with the operator, you just create a resource with the kind Kafka and specify all the details there — resources, number of nodes, listeners; I enabled TLS authentication and authorization and Prometheus metrics — and the operator uses this as a blueprint and deploys the Kafka cluster. (A rough sketch of such a resource follows below.)

I did the same for Kafka Connect. If you don't know Kafka Connect, it's Kafka's framework for integration between the Kafka brokers and other systems. To deploy Connect, I first used the KafkaUser resource to create a user which the Connect pods will use to authenticate when connecting to Kafka, and then I used the kind KafkaConnect to actually create the deployment. Kafka Connect uses different connectors to connect to the different systems; for the demo I will use connectors from the Apache Camel project, and I defined them here in the YAML as well — this one is the one we will use for Elasticsearch, and I have another one for Slack, for example, and another one for Amazon S3 storage. Again, when I did kubectl apply on this, it did all the work for me and deployed the Kafka Connect cluster including these connectors — it's really easy. What I also did is deploy Elasticsearch and Kibana, so that's running here in my browser; it doesn't have any data yet, but we will change that soon. So that's everything I prepared up front, really just to save some time in the demo.

Now, the first thing to do before we start collecting the logs is to create a topic in the Kafka cluster where the logs will be collected. I do that through the operator, using the KafkaTopic resource. I will name the topic fluent-bit-logs, and I just do kubectl apply, and that does all the magic and creates the topic in Kafka for me.

Next I can actually deploy Fluent Bit, which is what I'm using here. Fluent Bit is a more lightweight alternative to Fluentd — it comes from the same project, it has all the features we need, and it consumes fewer resources than Fluentd, which is why I'm using it. To deploy it, I first create the service account and some cluster roles and cluster role bindings. Then I again create a KafkaUser, because Fluent Bit will need to authenticate and be authorized when connecting to Kafka. And then in this ConfigMap I have all the configuration: in the input section I basically tell Fluent Bit where to scrape the logs from; then I add a filter which enriches the logs with additional information from Kubernetes, such as the names of the pods the logs belong to; and finally, in the output configuration, I configure the Kafka side of things — the bootstrap address where it can connect to the Kafka cluster, the topic to send the logs to, some additional configuration such as buffering, and, importantly, I enable TLS encryption and TLS client authentication here and pass in the certificates created for the KafkaUser (this section is also sketched below).
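For illustration, the Kafka resource mentioned above might look roughly like this. This is a minimal sketch, not the exact YAML from the demo — the cluster name, sizing, and storage are assumptions:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls               # internal listener used by Fluent Bit and Connect
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external          # load balancer listener used later from the laptop
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: tls
    authorization:
      type: simple
    storage:
      type: persistent-claim
      size: 100Gi
    # (the Prometheus metrics configuration from the demo is omitted here)
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

And the Fluent Bit configuration could look something like the sketch below — again an assumption-laden outline rather than the demo file; the certificate paths and parser name in particular are made up. Fluent Bit's kafka output passes `rdkafka.*` properties straight through to librdkafka, which is how the TLS encryption and client authentication are switched on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Parser  docker
        Tag     kube.*

    [FILTER]
        Name    kubernetes
        Match   kube.*

    [OUTPUT]
        Name                              kafka
        Match                             *
        Brokers                           my-cluster-kafka-bootstrap:9093
        Topics                            fluent-bit-logs
        rdkafka.security.protocol         ssl
        rdkafka.ssl.ca.location           /opt/certs/ca.crt
        rdkafka.ssl.certificate.location  /opt/certs/user.crt
        rdkafka.ssl.key.location          /opt/certs/user.key
```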
Fluent Bit itself runs as a DaemonSet. For those who don't know what that is: it's a Kubernetes resource which creates one pod on each node of your cluster. So let's do kubectl apply -f on this; it creates all the resources, and when I now watch the pods with kubectl get pods -w, we can see there are already several Fluent Bit pods starting, and some of them are already ready. There should be seven of them, because my cluster has seven nodes — so let's check: one, two, three, four, five, six, seven. That's the right count, so that seems to be running. Great.

So now we will connect to Kafka and check whether it's actually producing any logs. To do that, I again create a console user, which I will use from the console and which is allowed to read the messages from this topic. So again I apply the resource, and that creates the secret with the certificates. Because I will connect from my local laptop, I have to extract the certificates from the secret — I will cheat a bit and copy-paste these commands here, because otherwise I would probably make ten different typos while typing them. Now all we need is the address of the cluster, so kubectl get kafka -o yaml, and that shows me the address of the load balancer listener which I can use to connect. As the client I will use kafkacat — again copy-pasted to save some typos. kafkacat is a simple command-line utility for sending messages to Kafka and receiving messages from it. So let's run it and see whether we are getting some messages... and you can see that we are getting a whole stream of JSON documents. Let's kill it and look at them. Each of these is a single message, and it's always JSON: it has the timestamp of the log, and then it has the message itself — that's the important part. This one, for example, is complaining about something with a load balancer configuration. These are all the logs generated by the different pods and by the system, collected by the Fluent Bit pods, and they are now arriving in the topic.

So let's do something with them. What we will do is deploy this Elasticsearch connector into my Connect cluster: I tell it how to connect to Elasticsearch, I tell it how the index should be called, I configure the converter for the values — it's already JSON, so I don't need to do any real conversion — and I specify the topic it should use as the source. And now, as many times before, I apply this, and that — with the power of the Strimzi operator — creates the connector in the Kafka Connect cluster. I can now do kubectl get kctr -o yaml, and I can see that the connector is running; the tasks are still empty, so they are probably still being created.

What I can do meanwhile is switch to Kibana. I dismiss the security message, zoom the window a bit, and go to add data — actually, I should already have the data there. So let's go to home, then to index management, and we should be able to create the index pattern already: logs*, and that works. Next we select the field with the timestamp — which is timestamp — and create the index pattern. Then I can go to the Discover page, and I already see a bunch of my logs here from different components. And what I can do, for example — I have here the pod name, so I can get logs only from one of these pods, or search the logs, or do whatever I want.
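For reference, the Elasticsearch connector resource I applied looks roughly like the sketch below, based on the Camel Kafka connector for the Elasticsearch REST endpoint. The cluster name, host address, and index name are assumptions, not values copied from the demo:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: elasticsearch-connector
  labels:
    strimzi.io/cluster: my-connect-cluster   # must match the KafkaConnect deployment
spec:
  class: org.apache.camel.kafkaconnector.elasticsearchrest.CamelElasticsearchrestSinkConnector
  tasksMax: 1
  config:
    topics: fluent-bit-logs                  # read the logs produced by Fluent Bit
    camel.sink.path.clusterName: elasticsearch
    camel.sink.endpoint.hostAddresses: elasticsearch-es-http:9200
    camel.sink.endpoint.operation: Index
    camel.sink.endpoint.indexName: logs
    # the records are already JSON strings, so no real conversion is needed
    key.converter: org.apache.kafka.connect.storage.StringConverter
    value.converter: org.apache.kafka.connect.storage.StringConverter
```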
So that's great: it means we are now receiving the messages in Elasticsearch and can see them in Kibana, and everything seems to be working fine. This was the first demo, where we got the logs from Fluent Bit through Kafka into Elasticsearch.

But that's really just the beginning, right? Usually we need to do a lot more with the logs than that. You probably need to do some archiving — in some cases, depending on the industry you work in, there might even be legal obligations to store the logs for, I don't know, five years or however many. But even without that, you may want to store the logs for future analysis: future issues, or ideas you want to check later. Another thing is that you probably need some kind of alerting, so you get alerts when problems are happening — when your users are getting too many 404 errors, or when someone tries to break into your system. And you probably want to do all kinds of analytics, or maybe even some machine learning or artificial intelligence processing, to get additional information out of the data.

The good news is that Kafka can do it all, because Kafka is more than just a pub/sub messaging platform. It has the Kafka Connect component, with many different connectors to integrate with all kinds of systems. It also has the Kafka Streams API, which is basically a stream processing API and can be used to process and analyze the logs — things like windowing, aggregations, or joining metrics with logs to maybe find some correlations. And then there is a huge number of third-party integrations, clients, stream processing tools, and, as I already mentioned, machine learning and AI tools which can also be used to process the logs.

So what we will really do is take the simple pipeline and extend it with some additional integrations. For example, we will do some stream processing to get alerts sent to Slack, so that we know about problems. And we can take all the logs, metrics, and monitoring data we might have in our Kafka cluster and store them in Amazon S3 or Amazon Glacier for archiving; if we later need these logs, we can easily go there and just recover the files, even after several years or however long we want to store them.

The next thing we will try is to get the log messages into the Amazon S3 bucket which I have already prepared — you can see when I refresh it that it's completely empty now. To get the log messages there, I will deploy another Kafka connector, this time an S3 connector using the Camel AWS2 S3 sink connector. Again I specify the topic I want it to read from, I specify the name pattern for how the files should be named, and I specify the aggregation, which is important: I do not want a million individual messages sitting in the S3 bucket; instead I tell the connector to batch the messages into bigger files, each containing either 1000 messages or being closed when nothing is received for 5 seconds. And because I'm uploading this into AWS, I have to specify the credentials — but I do it securely: I load them from a secret instead of having them hard-coded in the connector configuration. (A sketch of this connector follows below.) So we apply it, and when it gets created we can do kubectl get kctr s3-connector -o yaml and see that it's running — the tasks are still being created — and hopefully, when we switch to the UI, we should soon see some new files there.
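The S3 connector I just described could look roughly like this. It's a sketch under stated assumptions: the bucket name, region, file-name pattern, and credentials paths are illustrative, and the aggregation and `${file:...}` secret-loading properties follow the camel-kafka-connector and Kafka Connect FileConfigProvider conventions as I understand them, not the demo file verbatim:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: s3-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SinkConnector
  tasksMax: 1
  config:
    topics: fluent-bit-logs
    camel.sink.path.bucketNameOrArn: my-log-archive
    camel.sink.endpoint.region: us-east-1
    # file name pattern starting with date and time (Camel simple language; assumed)
    camel.sink.endpoint.keyName: ${date:now:yyyyMMdd-HHmmssSSS}.json
    # batch messages into bigger files: close after 1000 records or 5s of silence
    camel.aggregation.size: 1000
    camel.aggregation.timeout: 5000
    # credentials loaded from a mounted secret instead of being hard-coded
    camel.sink.endpoint.accessKey: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:aws_access_key_id}
    camel.sink.endpoint.secretKey: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:aws_secret_access_key}
    key.converter: org.apache.kafka.connect.storage.StringConverter
    value.converter: org.apache.kafka.connect.storage.StringConverter
```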
And you can already see I got a bunch of files here. The names follow the pattern we specified — they start with the date and timestamp, which is important when looking for the logs. This file has about 680 KB and was updated a minute ago, and if I downloaded it, you would really see all the JSON messages stored in this single file. So yeah, that's how we can take care of the archiving — and we did not need to reconfigure anything in Fluent Bit: we already had the data in Kafka, so we just deployed another connector to do something else with them.

The next thing I will show you is a very simple example of the other things you can do. First I create a new topic, and into this topic the alerts will be sent. Next I create a Slack connector which reads messages from this logging-alerts topic we just created and pushes them into Slack; I again have to provide the secret webhook URL which it uses to publish these messages. All I need to do is kubectl apply, and that creates the connector. So that's now ready: when a message is sent into this logging-alerts topic, it will show up in my Slack channel, where I can see it, take a look at it, and maybe do something with it.

But we need to generate some alerts first. What I did for that is create a very simple Kafka Streams API application which reads all the log records — in this case the ones coming from one of the Kafka pods — and checks whether the log message is about an unauthorized operation in Kafka. When that's the case, it means someone is trying to do something bad in my Kafka cluster, so the application generates a nice "Error found: Kafka authorization error" message which repeats the log message, and it sends these generated messages into the other topic — the one we just created and which the Slack connector is reading. It's really simple, not the most sophisticated app, but it should be enough for the demo purposes; a rough sketch of it follows below, together with the Slack connector. I now just deploy it as a Deployment running inside my Kubernetes cluster: I create the KafkaUser again, with all the rights which are needed, and then I create the Deployment which runs this alerting application I just showed. So, kubectl apply — that creates the application. The application gets started, it starts reading all the different logs which Fluent Bit is publishing into the Kafka cluster, and when it finds the right messages, it generates the alerts and sends them into the other topic, where the connector picks them up.

And when I go to my Slack here, you can already see a bunch of new messages with some authorization errors — it looks like Fluent Bit was trying to do something that's not allowed; that was maybe at the beginning, in the first demo, before the authorization propagated. But we can see that this works. Let's mark them as read, and that way I know there's something wrong in my cluster and that I should look into it. Of course, for production use, you can definitely improve this alerting application and make it a bit more sophisticated.
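The Slack connector from this demo could look roughly as follows — again a sketch based on the Camel Slack sink connector, with the channel name and secret path being assumptions:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: slack-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: org.apache.camel.kafkaconnector.slack.CamelSlackSinkConnector
  tasksMax: 1
  config:
    topics: logging-alerts
    camel.sink.path.channel: "#alerts"
    # the webhook URL is secret, so it's loaded from a mounted file, not hard-coded
    camel.sink.endpoint.webhookUrl: ${file:/opt/kafka/external-configuration/slack-credentials/slack.properties:webhook_url}
    key.converter: org.apache.kafka.connect.storage.StringConverter
    value.converter: org.apache.kafka.connect.storage.StringConverter
```

And here is a minimal Java sketch of what such a Kafka Streams alerting application can look like — not the exact demo app. The topic names and the matched substring follow the demo narration; the application id, bootstrap address, and security settings (omitted here) are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LogAlerting {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-alerting");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // read the raw log records collected by Fluent Bit
        KStream<String, String> logs = builder.stream("fluent-bit-logs");

        // keep only log messages about unauthorized operations and turn them
        // into alert messages for the topic the Slack connector reads from
        logs.filter((key, value) -> value != null && value.contains("unauthorized operation"))
            .mapValues(value -> "Error found: Kafka authorization error: " + value)
            .to("logging-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```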
In the demos we mostly focused on logging, but the same model can really be used beyond logging, for other monitoring data as well. You can have metrics or, for example, traces collected and delivered through Kafka using the same architecture — the components might differ, but the principles will be exactly the same, and the benefits will also be similar to what you get with logging. As an example, Jaeger, a popular project for tracing events across your applications, is another system which supports Kafka directly out of the box: you can have the Jaeger collector send data to Kafka, with Kafka acting as the middleman, decoupling the collector from the ingesters, which ingest the data and push them into the storage, and from the query service, which is used for querying the tracing data. So that's another example of this pattern, used for tracing.

Before the end of the talk, we should probably also talk about when not to use Kafka in your monitoring pipeline. What's important to understand is that it's another step on the critical path for your monitoring data, so someone has to operate it and understand it. Even though operators such as Strimzi try to do most of the heavy lifting, you should still have a clue about how Kafka works if you want to rely on it for your monitoring data. And even though Kafka is reliable, fault-tolerant, durable, and so on, there is of course always some probability that it will fail and cause problems in your monitoring pipeline. In addition, Kafka is not always cheap to run: it needs resources, so you either need to buy some additional hardware or buy more virtual resources in the cloud — you have to account for additional resources and costs. That's why, in the end, you need to think about whether you have enough traffic, and whether your monitoring is critical enough, to make using Kafka worth it — because if not, Kafka would cause more problems and worries than it would bring benefits. That's something you have to think about and decide based on your environment and your use cases.

So that's it for my talk. Thanks for watching — I hope it was interesting and useful for you. The files for the demos and the slides are available at this URL, which will redirect you to the very long name of my GitHub repository, so if you are interested in trying this on your own or going through the slides, feel free to have a look. And thanks for watching.