All right, and welcome to another DevNation Live. We're here for another great tech talk, and you're going to hear a lot of good content today. We're going to talk about Apache Kafka. I know everybody's super hot for Apache Kafka; it's one of the hottest topics in the world right now when it comes to technical messaging capability that we can add to our applications. We're super lucky today to have Jacob. He is actually the Strimzi project lead, and I posted the link to that open source project in the chat already. For those of you on the chat right now, please make sure to tell your friends and neighbors to refresh the browser if they can't hear or can't see; that's always the trick for this. But we're ready to get started and dive in. We'll hold questions towards the end. I'll try to take some questions and answers in real time in the chat tab, so just put your questions in right there and we'll try to get to them as quickly as possible. We also have a link to Jacob's GitHub repo with the demo, so you can run the demo on your own laptop at some point in the future. All right, we're ready to turn things over to Jacob. You're ready to go, Jacob?

Thanks for the introduction, and let's get started with Apache Kafka. So what is Apache Kafka actually? There are many different definitions. One of them is a publish-subscribe messaging system. Another one is a streaming data platform. Yet another one is a distributed, horizontally scalable, fault-tolerant commit log. A lot of interesting words, and if you search the internet, you will actually find a lot more. But basically, Apache Kafka is a messaging system. It's a very popular project and it's well known mainly for its scale. Unlike the more traditional JMS-based messaging systems, it's very good at scaling horizontally, allowing a huge throughput on ingress as well as egress, and handling a large number of connected clients.

In Apache Kafka, the messages are always sent to or received from something called a topic. But the topic is always split into one or more partitions, and these partitions act as shards. So every message which you send to the topic will always be written into only one of these shards, and the messages will be consumed from these partitions as well. That's basically one of the things which allows Kafka to scale so well, because most of the actual work is done at the partition level and the topic is more or less just a virtual object. When the producer is sending messages, it always decides into which partition it will send each message, usually based on the message key. Every Kafka message, or record, can have a key. The key will be hashed, and based on this hash, the clients will always know exactly into which partition the message should be sent. Thanks to that, all the messages with the same key always end up in exactly the same partition. If your message doesn't have a key, that's not a problem; in that case, the messages are basically distributed round-robin into the different partitions. But once the messages are written into a partition, their order is fixed. They cannot really be reordered once they are written into the partition, and the consumers can rely on that, which is another one of the interesting features of Apache Kafka.
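To make this key-to-partition behaviour concrete, here is a minimal sketch using the standard Java producer API. It assumes a broker on localhost:9092 and uses the DevNation topic from the demo later on; the key names are purely illustrative.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are hashed to the same partition,
            // so both events for "customer-42" keep their relative order.
            producer.send(new ProducerRecord<>("DevNation", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("DevNation", "customer-42", "order paid"));

            // A record without a key is spread across the partitions instead.
            producer.send(new ProducerRecord<>("DevNation", "no key, any partition"));
        }
    }
}
```

Because the first two records share the key customer-42, they hash to the same partition and are stored there in the order they were sent.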
While I'm talking about consumers: Apache Kafka has something called the Consumer API, but the consumers don't actually consume the messages in the sense of removing them. Instead, the messages always stay in the partitions even after they are consumed, and they are removed from the partitions only based on something called a retention policy. This policy can be defined based on message age, so you can say something like: please keep the messages in the partition for one week, or two days, or one day, depending on the use case. Or it can be done based on the partition size, so you can say something like: please keep the messages until the partition is 100 gigabytes big, and when it's that big, start dropping the oldest messages. You can actually even combine these two retention policies, so you can say: keep the messages for one week or up to 10 gigabytes, whichever limit is reached first. And then there is something called a compaction policy, where the Kafka broker actually tries to always keep the last message for a given message key.

Thanks to that, you can use Kafka not only as a messaging system, but also as a kind of storage, because with these policies you can very easily keep all your data in the system, in the topics, in the partitions, for years, for basically unlimited time, and then just use it as a store to recover the data from, for example when your application starts. And that's another very interesting feature of Kafka, because more traditional messaging is about getting the messages in and getting them out to some other form of storage, but with Kafka you can actually store the messages inside Kafka for quite a long time.

While I talked a lot about the partitions, the partitions do the sharding, so the partitions alone do not offer any form of high availability; they do not really keep copies of the messages. For that, there's something called replication, where each partition can actually exist in one or more copies in your Kafka cluster. These copies can be used to achieve high availability and failover and to make sure that you don't lose any data and that the messages will always be available to your clients.

On this picture, let's have a better look at how the partitions work when the producer produces messages. You can see a topic with three different partitions and the producer which is sending the messages there. And you can see that the partitions really are a very simple commit log. When the producer sends a message into the partition, it's basically appended, written at the end of the commit log, and it always gets something called an offset, which is an increasing integer number. So if you imagine on this picture that the producer sends the next message into partition one, it will be appended to the end of the log at offset seven, the next one will be eight and so on. And when the consumers are consuming the messages from the partition, it's not actually the broker which decides which messages have already been consumed by the consumer and what should be the next one and so on. Instead, it's the consumer coming to the broker and saying something like: give me the messages from offset number four. And the broker finds the message at offset four, passes it to the client, then five, six and so on.
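Going back to the retention policies for a moment: these are just per-topic configuration settings. As a hedged illustration (the topic name and values are made up for this sketch, and the same settings can also be passed on the kafka-topics.sh command line), this is roughly how a topic with a one-week / 10 GB retention policy could be created programmatically with Kafka's AdminClient API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetainedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Keep messages for one week OR until a partition holds ~10 GB, whichever is reached first.
        Map<String, String> configs = new HashMap<>();
        configs.put("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000));
        configs.put("retention.bytes", String.valueOf(10L * 1024 * 1024 * 1024));
        // For a compacted topic you would instead use:
        // configs.put("cleanup.policy", "compact");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("devnation-retained", 3, (short) 1).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("Topic created");
        }
    }
}
```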
And when the consumer, or the operator of the consumer application, for example, sees that something was wrong, that there was some bug in the algorithm with which the messages were processed and we need to reprocess all the messages again to do it correctly, that's quite easy with Kafka. The consumer basically just changes the offset and decides to start reading the messages again from the beginning, so in this case from zero, tells that to the broker, and the broker will basically replay all the messages to the consumer from the beginning. This replay feature is another cool feature which most messaging systems don't really offer, and it can be used for a lot of interesting use cases and tricks.

Now, when you use replication, each partition will have some copies. In the case on this picture, you can see that we have a Kafka cluster with three brokers, and inside this cluster we have one topic with three different partitions, and each of these partitions has replication factor three. So you can always see three different replicas, one replica on each of the brokers, and one of these replicas will always be elected as the leader replica while the other replicas will be the follower replicas. The leader replica will be the one which communicates with the producers and consumers, and the follower replicas will just be sitting there, copying all the messages from the leader replica and waiting so that if something happens to the leader, they will be ready to take over. When that happens, like in this case, imagine we lost broker three, then one of the other replicas which were originally followers will be elected and will start acting as the new leader. So you can see that after the crash of broker three, what changed is that the leader replica for partition three, which was originally on broker three, has now moved to broker two, and all the clients can keep producing and consuming the messages now from broker two.

There is also something called consumer groups. The consumer groups are used to group together different instances of the same application which want to consume the messages in parallel and kind of share the messages, or receive them concurrently if you want. So if you have an application which is reading the messages and you want to run this application in multiple instances, you would configure this application to always use the same consumer group, and all the consumers within the consumer group will always distribute among each other the partitions which they are consuming, so that each consumer has a unique subset of the partitions. And at the same time, each partition always has only one consumer from a given consumer group. If you put it together with how the messages are distributed into the partitions, we said that the messages with the same key are always delivered into the same partition and they are always kept there in the order in which they were sent. So combined with these consumer groups, it actually means that a single instance of your application, a single consumer, will always read at least one partition, will receive all the messages for a given key, and will receive them in order. And that's very important for most of the applications.
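Coming back to the replay feature mentioned a moment ago: resetting a consumer to the beginning is just a matter of seeking on its assigned partitions. A minimal sketch with the standard Java consumer might look like this (the group ID is illustrative; it again assumes the local broker and the DevNation topic from the demo):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromBeginningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "devnation-replay");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("DevNation"));

            // Poll once so the consumer joins the group and partitions get assigned,
            // then jump back to offset zero on every assigned partition.
            consumer.poll(Duration.ofSeconds(1));
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("replayed partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```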
There are a lot of people who might be wondering whether the ordering per partition is enough, whether the messages don't need to be ordered within the whole topic. But in most cases you anyway consume and process the messages in multiple instances, because of the scale which Kafka delivers and the total throughput for which you usually use Kafka. And thanks to that, the total ordering doesn't really matter; all that matters is the ordering within the single partitions.

You can of course have multiple consumer groups. So if you have different applications, not just multiple instances of a single application, you can just assign them different consumer groups, different consumer group IDs, and they will receive the same copies of the messages. That's how you can do publish-subscribe with Kafka: the different applications use different consumer group IDs, and then the same messages will be delivered to the different consumers. To maybe help explain this a bit better, some pictures. In this picture, you can see a topic with four partitions and two consumer groups. You can see that in the first one, the consumers are each consuming two partitions, while in the second group, where we have three consumers, there are two consumers which have only one partition and one consumer which has two partitions which it's consuming and processing. Now, let's see what happens when we kill one of the consumers. So we killed the consumer in group two, and what happens is something called a rebalance, where the remaining consumers in the consumer group will get the partitions reassigned, and after one of them crashed, they will now both be processing two different partitions. Now, the way these consumer groups work means that if you have a topic with four partitions, you can always have only up to four consumers which are actually receiving some messages and actually doing some work. And if you have a consumer group like here with five consumers, the fifth one will basically be sitting there idle, doing nothing, because it doesn't have any assigned partition. That's not necessarily wrong, because what can happen is that one of the consumers doing some work might suddenly crash, and in that case the fifth consumer can basically act as a kind of warm backup and can immediately take over without waiting for some hardware reboot or anything like that.

So that was a more theoretical introduction into how Kafka works, and now let's look at some code. I'm going to share my screen. When you want to start with Apache Kafka, it's super easy. You can just go to the Apache Kafka website, which is kafka.apache.org, click the download button, download the binaries, and on your local computer just unpack them and start using them. To make this a bit faster, I already downloaded and unpacked them, so I have here my directory with Kafka installed. If you were setting up a real production cluster with multiple nodes, you would actually have to do some editing of the configuration files, entering some IP addresses and so on. But if you just want to start developing with a single node, then all the configuration files are already prepared there and you can just start the software. Kafka doesn't really run on its own, though. Kafka actually uses Apache ZooKeeper as an external dependency. ZooKeeper is used to synchronize the brokers and bootstrap them, elect the leaders, and store metadata for the topics and clients and so on. And that's why we have to first start a ZooKeeper server. You don't have to worry about installing anything else.
ZooKeeper is packaged there together with Kafka, and all you need to do is run the ZooKeeper server start script and pass it the configuration file, config/zookeeper.properties, and start it. It will start listening on port 2181, and that means it's ready and waiting for Kafka to connect. In the same way as I did it before with ZooKeeper, I can now use the kafka-server-start.sh script with the config/server.properties file, and that will start my local Kafka server. Now the last log message says Kafka server with ID zero started. And that's my development environment, and I can now start developing applications with Apache Kafka. So you don't need any editing; everything is super easy for you to start with.

Before we actually start using some clients, let's create a Kafka topic. Again, there is a script called kafka-topics.sh which makes this easy. You just point it to the ZooKeeper server, so I will tell it to connect to localhost on port 2181. Of the management tools for managing Kafka configuration, consumers and topics, some connect to ZooKeeper and some of the newer ones connect directly to Kafka, so don't be confused by it; the help will always tell you which server to use. And to this kafka-topics tool, I will tell that I want to create a topic. Let's call the topic DevNation, and let's create a topic which will have three partitions and replication factor one. I have just one Kafka broker, and that's why I'm using replication factor one, because if you want to create more replicas, you will actually need to have multiple brokers running, since the replicas have to be located on different brokers. But for development purposes that doesn't really matter, because I don't need replicas anyway. So let's create it with replication factor one. And the topic was created.

Now, as a real quick test that Kafka works fine, there are also some console producer and consumer tools which we can use to send or receive a simple message, to make sure that the topic in our Kafka cluster is working. I will use this kafka-console-producer script, point it to the broker on localhost 9092, and tell it that I want to send the message to the topic DevNation. Now it connected, so I can send the message. Let's say: Hello world from command line. The message was sent. We can now close the producer and have a look at the consumer. Again, there's a kafka-console-consumer script which can do the consumption for me. I again point it to localhost 9092, and I say that I want to get the messages from the beginning. If I did not specify this from-beginning option, it would start waiting for any new messages which would be sent after the consumer was started. Instead, when I specify from-beginning, it will start really from the beginning of the topic, and I expect it to show me the message we have just sent. I have started it, and as you can see, we have received the message: hello world from command line. So now we can see that everything is working fine, and let's do something a bit more complicated.

As part of the Apache Kafka project, there is a consumer and producer API in Java which you can use to send and receive messages. So let's have a look at what such a message consumer looks like. First, I use the properties object to configure where the consumer connects and how it connects. So again, localhost 9092.
And in this group ID config, I actually specify what should be the consumer group ID which I want to use with this client, and in this case I will use something like DevNation Java. Then, in this first simple application, I will just let it automatically commit the messages which I receive every thousand milliseconds; later we can have a look at an application which actually does the commit differently. And then I specify the key and value deserializers as the string deserializer class. That's because inside, Kafka actually doesn't really understand or work with any sophisticated message formats; everything is just a byte array. But because you don't really want to work with byte arrays in all the consumers and producers, you want to work with some better formats. In my case, I will be using strings, but you can use JSON, Avro, Google Protobuf and many different formats; you can even write your own. Setting these deserializers will make it easy for you to use these formats, because now when I configure them, I can actually create a consumer which will be receiving string messages, so I don't have to do any decoding. To create the consumer, I will pass the properties object into it. Then I will subscribe to the topic which we created, which is DevNation. And then, in an endless loop, I will be calling this consumer poll method again and again, which will return me zero, one or more records, or messages, from the Kafka broker. If I'm at the end of the partition or the end of the topic and there are no new messages, it might happen that the records will actually be empty, but when there are some new messages, then in this for loop I will basically go through all the records which we received, one by one, and I will just print some message to the output.

So let's try to run this, and let's zoom it. What you can see is that the client started and already received the message which we sent from the command line. You can see that there are some more details: that the message was sent to partition number two and that the offset was zero. And now, when the consumer is running, let's have a look at the message producer to send some more messages. The configuration is exactly the same, just a bit simpler. I now configure the serializers as strings and I again configure localhost 9092 as the Kafka address. Then I create the producer using this properties object, and then again, to make it easy for me, in a loop I will first create a new record. The record is basically the message which we are going to send, and I want to send it to the topic DevNation, and the value should be just something like hello world from Java one, two, three and so on. When I have this record ready, I will just call this producer send, and that will send the message. The send method will return a future, and once the future is completed, once the message I sent is acknowledged by the broker, it will actually give me the metadata with the result, which will say where it was sent, what the offset was and so on. In this producer, I'm actually sending the messages synchronously, so I immediately call this get method, which will wait until the future finishes and gives me the result, and I will again just print the result of the send command.
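For readers who are not running the demo repository, a minimal sketch of a synchronous producer along these lines could look like the following. It assumes a broker on localhost:9092 and the DevNation topic from the demo; the exact code in the GitHub repo may differ slightly.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleSyncProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("DevNation", "Hello world from Java " + i);
                // send() returns a Future; calling get() blocks until the broker
                // acknowledges the message, which makes this a synchronous send.
                RecordMetadata metadata = producer.send(record).get();
                System.out.printf("sent to partition=%d at offset=%d%n",
                        metadata.partition(), metadata.offset());
            }
        }
    }
}
```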
So when I run it and get back to the consumer, you can see that we received some new messages, and they say hello world from Java nine, hello world from Java 10 and so on. In the consumer, you can actually do the commits manually as well. In this one, I just disabled the auto commit, set it to false, and now, after receiving one or multiple messages, I can call this consumer commitSync, which will basically store the offset of the last message which I processed inside a special topic on the Kafka broker. Thanks to that, when I restart the client, it will actually be able to recover and continue from the last position where it ended. In the previous example, where this was done every second, the Kafka client basically did this for me. And I can make it even a bit more sophisticated: I can use this consumer which logs the rebalancing for me. I created here a special rebalance listener, and thanks to that we can have a more detailed look at how these consumer groups work. When I start the first instance of this application, what actually happens is that at the beginning you can see that this client was assigned responsibility for partitions two, zero and one. Now, when I start another instance, the rebalance happens, and this new client which I started is now assigned partitions zero and one, whereas the old client is now assigned partition two. And if I started a third and a fourth one, the rebalance would happen again and the partitions would be reassigned, but because I have only three of them, the fourth one would actually not have any partition assigned. Another thing which you can do is use an asynchronous producer. So instead of just waiting for the future to finish, in this one I specify a producer callback, and the callback is called every time a message is delivered, asynchronously, so your client can actually send at a much higher speed.

And it's not just the Kafka clients in Java. If you are, for example, fans of Spring, I have here a Spring application, and if you follow the link to the demo repository on GitHub, you can find all the source code there; Spring has great support for Apache Kafka. When I start this application, I have a consumer there, but I also have a very simple REST API, and when I post to this API, it will actually send the message. So I will send the message hello world from Spring, enter, and it did the post to the REST API, the Spring application sent the message, and then it again received it from the broker, and here you can see that it received the message hello world from Spring. But if you are not a Java fan at all, you don't have to be worried; there are Kafka clients for pretty much every language and every platform. Just as an example, here I have a very simple Python consumer and Python producer. For each of these, I use a different consumer group, so you can actually see the publish-subscribe functionality: the Python one now received even the old Java messages, because it has its own consumer group. And another example which I have here, and which is in the demo repository as well, is one using JavaScript and Node.js, so I have a consumer and a producer there.

So that's it for the demo. Let's go quickly back to the slides. I hope these examples showed you that Kafka can be very versatile and that it has some very interesting features, like the replay, the ordering and the scalability.
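To make the manual commit and the rebalance listener from the demo concrete, here is a minimal sketch along those lines. It is only an approximation of what was shown, not the exact code from the repository, and the group ID is again illustrative.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "devnation-java");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit offsets ourselves
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // The rebalance listener prints which partitions this instance owns, so you can
        // watch the assignment change as more consumers join or leave the group.
        consumer.subscribe(Collections.singletonList("DevNation"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Assigned: " + partitions);
            }

            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Revoked: " + partitions);
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
            if (!records.isEmpty()) {
                // Store the consumed offsets in Kafka so a restarted instance resumes here.
                consumer.commitSync();
            }
        }
    }
}
```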
Some of the use cases for which Kafka is used very often are, for example, the more traditional messaging and data integration use cases, but it's very often used also for metrics and log aggregation, so to deliver, for example, logs from the actual system where they are logged to some Elasticsearch or something like that, in a way that decouples the log collector from the system where you store the logs. It can be used for website activity tracking. It can be used for something called event sourcing, which is basically using Kafka to send a stream of state changes, so that, combined for example with the replay feature, you can go through the stream and reconstruct the state of some system at a point in time. It can also be used for stream processing; I think two weeks ago Marius Pogovic had a nice talk about stream processing, so if you haven't seen that one, you should find it on YouTube and have a look at it. And it can also be used really as a commit log, as distributed storage.

If you want to run Apache Kafka on OpenShift or on Kubernetes, that's actually quite easy as well, thanks to a project called Strimzi, which delivers operators for managing and configuring not only the Kafka clusters, but which can also help you manage the Kafka topics and the Kafka users in a Kubernetes-native way, and which provides all the containers for Kafka and ZooKeeper. And at Red Hat, we also have something called Red Hat AMQ Streams, which is our enterprise distribution of Apache Kafka. It supports either running Apache Kafka on Red Hat Enterprise Linux, on bare metal or virtual machines, or, based on the Strimzi project, you can actually run AMQ Streams very easily on OpenShift as well. And that's it for the presentation. I hope we still have some time to answer some questions.

Okay, yeah, there are a lot of questions, and actually Paulo was able to show up and he's been trying to answer questions in real time, because there are a lot of questions. We're out of time, but I do want to cover a couple of things real fast. One key question is: what is the difference between Kafka and a traditional AMQ broker like ActiveMQ or RabbitMQ? That question came up several times. I like to think of a traditional message broker as something almost like a database, right? It's writing your message to disk and then figuring out at some point later how to get that message routed to the next person. But what would you say the difference is between Kafka as a message broker versus what we think of as traditional message brokers?

So the traditional brokers are approaching the whole thing with the paradigm that the broker is smart and the clients are fairly dumb, and it's always the broker which is deciding about a lot of these things, like which connected client gets this message and which connected client gets a different message, and the clients just get the messages. Whereas, as you have seen with Kafka, that's a bit different: the broker is kind of dumb, in a good way, because that allows the scalability and the performance, but it doesn't really decide that many things. It's the client basically saying: I want the messages from this offset, I want the messages from that offset, this message should be sent into this partition, and so on.
One of the other main differences is that the more traditional JMS-style brokers are designed to really get the data in and get them out relatively quickly, whereas with Kafka, with the way it doesn't really store anything in memory and everything is on disk, you can actually really use it to store the data for months or years if you want; it's just about the amount of disk space which your servers have. And one of the biggest differences is that the traditional brokers allow you to do a bit more when you are interested in the individual messages, so you can use things like selectors to select messages based on some headers or properties, whereas Kafka is more about the streaming. The individual message is not really that interesting; what's interesting is reading the messages as a stream, one after another, and doing some processing and analytics, where the individual message on its own doesn't really mean that much to you.

Yeah, and Kafka supports time windowing from that perspective with the streams.

Yeah, so that's a feature of the Streams API, which is a special API. What I was showing in the examples were just consumers and producers, but Kafka also has something called the Streams API, which basically wraps around the consumers and producers and allows really advanced stuff for processing the stream data, like aggregation and time windowing, really analyzing the data and processing them from Kafka, and quite often returning them back to Kafka.

So one very popular question was: where is the data stored, how is it stored, and what are the different types of storage mechanisms you could have underneath? Where is the data going, if you will, underneath that broker?

So the data with Kafka is always stored on disk, and that's basically the only mechanism it has; there's no in-memory storage or anything like that. When the broker receives a message, it really just appends it at the end of the commit log on disk, and when it's reading the messages, it just reads them from the commit log. That's quite important, because not keeping them in memory is one of the things which makes it so fast and so scalable. But it relies quite heavily on the operating system disk cache, which actually caches a lot of the data being written to the disk, so often, when passing them to the consumers, it doesn't really have to read them from the disk but actually reads them only from memory. And even if you read some old data which are not in the cache anymore, in most cases you maybe start from the beginning with several years of old data, but you read them from the beginning sequentially, so even with old-school magnetic drives, for example, you can get very good performance when doing this.

Okay, and in that case, there were some questions around storage technologies like Ceph or Gluster. Have you seen those being used with Kafka?

So one of the problems with these things is that they usually share the same network as the Kafka application, and they usually don't rely on a dedicated storage network. One of the challenges there is that if you have a Kafka cluster which has a throughput of several hundred megabytes of messages per second or even more, then you actually do this multiple times, because maybe each partition has one or two replicas, and at the same time you are writing it to the storage, and that's something that can create quite a load on the storage system.
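As a brief aside on the Streams API mentioned a moment ago, a windowed aggregation can be expressed in just a few lines. This is a minimal sketch that counts messages per key in one-minute windows; the application ID, topic name and window size are illustrative and not from the talk, and records need a non-null key to be grouped.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "devnation-streams");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> messages = builder.stream("DevNation");

        // Count messages per key in one-minute windows and print the running counts.
        messages.groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
                .count()
                .toStream()
                .foreach((windowedKey, count) ->
                        System.out.println(windowedKey + " -> " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```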
Okay, and we are out of time, but those were some of the hottest questions that Paulo did not quite get to; he got to a lot of questions in the Q&A. I apologize for not getting to all of them, but do check us out on the Strimzi project; I've provided that link in the chat a couple of times already. You can reach out to the team, and if you have more Kafka questions, specifically about Kafka and Kubernetes and optimizations for how to run Kafka in an OpenShift and Kubernetes architecture, you can hit us up on that project. But we are out of time for today. Jacob, thank you so much for that presentation. I really enjoyed it, because Kafka is a really net new thing that's obviously very exciting. We had hundreds and hundreds of people here today, lots of questions, lots of activity on the chat. We will make the recording available; you'll see it on that YouTube playlist, and you'll also get an email with the link back to the webinar as well. And do look forward to more DevNation Lives as we bring out more content. There will always be a new show; we try to do about two shows a month, except during the holiday period. So more cool stuff happening from that perspective as well. Jacob, thank you so much.

Thanks, bye.

Bye now.