Okay, so welcome to my talk about making your Kafka cluster production ready. My name is Jakub Scholz. I work as an engineer at Red Hat and I'm also one of the maintainers of the Strimzi project, which is all about running Apache Kafka on Kubernetes. And of course, according to my talk, the best thing you can do afterwards to run your Kafka cluster in production is to use Strimzi. But the things I will be talking about are actually applicable to other operators as well, or even if you use a Helm chart or some self-made deployment.

Now, who has ever seen this animal? This is actually a Kafka monster. And if you look into its eyes, you will see that all it has on its mind is eating Kafka clusters that are not production ready. The only exception is Sundays, because on Sunday it eats red pandas for lunch.

Unfortunately, in the Kafka community I quite often see that people get some operator, get some Helm chart, take some simple example, deploy the Kafka cluster, and it's up and running, it's sending messages, so let's move to production. And as a result, the clusters are often not really production ready, and that is just asking for issues. A similar problem is that users sometimes start with some simple use case, like just delivering metrics or logs, where they don't care that much about reliability or availability. And then suddenly, without really any change, new projects are onboarded, and now the cluster is handling banking transactions and customer data, and nobody has given any thought to whether it is ready for that. So unfortunately it's quite common, and then this animal lives in your Kubernetes cluster and takes your Kafka cluster down. So what can you do so that this doesn't happen and the animal doesn't eat it?

First, you should think a bit about infrastructure. Obviously, there are different kinds of Kafka clusters. Some are small, some are big, but they usually have something in common: they can use quite a lot of resources. Networking resources in the first place, because it's not just the data you are sending to the Kafka cluster from the producers; there are also the consumers reading that data, and there is the replication between the brokers. So it adds up and multiplies, and it can create quite a substantial network load. You should make sure that your network is ready for it.

The next thing is storage. Kafka doesn't work well with things such as NFS, but in general it works quite well with most types of block storage. It can even work with what Kafka calls JBOD storage, just a bunch of disks: you can take multiple disks and use them in a single broker to increase the capacity. Or, what's great especially if you are running on premise and don't have some super expensive storage appliance, you can also use local persistent volumes with it. Just keep in mind that all of these options don't have only benefits; they also have some downsides and issues you should be aware of, so make the choice carefully. There's a small sketch of what this can look like below.

The next thing to do is to configure your resource requests and limits. That's not really just a Kafka thing or a Strimzi thing; you should actually do it for every application you deploy on your cluster. But with Kafka, some of it is a bit more special. One thing is memory: because Kafka is a Java application, its memory has different parts. When you talk about a Java application, the first thing that comes to people's mind is the heap memory.
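Before we get into memory in more detail, here is the storage sketch I just mentioned. It's only an illustrative excerpt of a Strimzi Kafka custom resource, assuming JBOD storage backed by persistent volume claims; the cluster name, sizes, and the local-storage storage class are made-up placeholders, not something Strimzi ships.

```yaml
# Illustrative excerpt only - not a complete Kafka resource.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    storage:
      type: jbod                  # "just a bunch of disks" - several volumes per broker
      volumes:
        - id: 0
          type: persistent-claim  # regular block storage through a PVC
          size: 500Gi
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
          class: local-storage    # e.g. a storage class backed by local persistent volumes
    # listeners, config, resources and so on continue here
```

Whether you go for a single volume, JBOD, or local persistent volumes, each option comes with the trade-offs mentioned above, so treat this purely as a starting point.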
Now, back to memory: the reality is that the Java heap is just one part of the whole memory footprint of a Java application. So you need to be careful here, because if you give a Java process one gigabyte of memory, you can't tell it that the heap size should also be one gigabyte; there will be other things in there as well.

But Kafka also relies quite a lot on the disk page cache. That means that when the producers are sending data and that data is written to the disks, it also stays for some time, as long as there is free memory for it, in the page cache maintained by the operating system. And when consumers read data that was produced only recently, the brokers don't really need to read it from the disks; they just grab it from this cache, which makes things much faster. So when planning the memory, keep in mind to leave some space for this cache as well.

The important part is also how you configure the request and the limit. For production-ready Kafka clusters, I would normally recommend to simply set the memory request and limit to the same value, or to set only the limit, which then means the request defaults to the same value, because that's how you get the best performance and avoid issues. But if you do want a limit higher than the request, you should make sure the Java memory itself fits into the request and that the room between request and limit is only used by the page cache, because the page cache can scale up and down flexibly; the Java part itself will not really do that. For CPU, this is much easier: you just give it the CPU it needs for the performance you want from it, and the request and limit are handled quite nicely because CPU can be used dynamically.

The next thing to talk about is security. Some of this is pretty generic and not really Kubernetes-specific, so I will run through it a bit quickly. You should, of course, use TLS encryption, you should use authentication, and you should use authorization. In Strimzi, for example, we automatically secure the internal inter-broker connections used for replication out of the box, using TLS and mTLS authentication. But for the listeners where your applications connect, it's configurable, so it's up to you what kind of security you choose. What is maybe a bit special about Strimzi is that we also have this thing we call the User Operator, which allows you to manage Kafka users through custom resources as well. So you can create a Kubernetes resource called KafkaUser, either directly or through Argo or whatever you are using, and there you say this user should use mTLS authentication or SCRAM-SHA authentication and should have such and such ACL rights, and the operator goes and creates it. Or, if you are in a more enterprise-like environment, maybe you want to use things such as OAuth or Open Policy Agent, which give you a central place to manage these security concerns.

There are also some Kubernetes-specific things you can think about. One of them is network policies, which can easily improve your security: you configure, on the network level, which applications are allowed to connect to the Kafka brokers and consume or produce messages. In Strimzi, again, we automatically set this up for the internal connections.
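To pull the resource and security pieces together, here is another illustrative excerpt, not a complete or authoritative configuration: memory request equal to limit with the heap kept well below it, a TLS listener with mTLS authentication and a network-policy restriction, plus a KafkaUser managed by the User Operator. All names, sizes, and labels, including the my-producer application, are made-up placeholders.

```yaml
# Illustrative excerpt only - resource and security related parts.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    resources:
      requests:
        memory: 16Gi        # memory request == limit to avoid surprises
        cpu: "4"
      limits:
        memory: 16Gi
        cpu: "8"            # unlike memory, CPU can safely be used more dynamically
    jvmOptions:
      "-Xms": "6g"          # keep the Java heap well below the container memory,
      "-Xmx": "6g"          # so plenty is left for the OS page cache
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true                  # TLS encryption on the client-facing listener
        authentication:
          type: tls                # or e.g. scram-sha-512
        networkPolicyPeers:        # which applications may connect, on the network level
          - podSelector:
              matchLabels:
                app: my-producer
    authorization:
      type: simple                 # ACL-based authorization
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: my-producer
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: tls                      # mTLS; the User Operator issues the client certificate
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: my-topic
        operation: Write
      - resource:
          type: topic
          name: my-topic
        operation: Describe
```

With mTLS authentication, the User Operator also creates a Secret with the client certificate for the my-producer user, which the application then uses to connect.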
But for the consumers and producers in your own applications, Strimzi doesn't really know out of the box which ones are allowed to connect and which are not. So you need to give it the guidance for how to set up those network policies, which is what the networkPolicyPeers field in the sketch above is doing. And then, especially if you are writing your own deployment or using some Helm chart, you should be careful about what RBAC rights it gives to the Kafka brokers, because if someone somehow hacks into a broker, you don't want them to be able to take over your whole Kubernetes cluster. Similarly, you should make sure the security context the brokers run under doesn't, for example, allow privilege escalation and doing something harmful to the worker node.

Another area to think about is monitoring. This is, again, pretty common: you should have some log collection stack in your Kubernetes cluster, some Elasticsearch or OpenSearch or whatever is used these days, and you should have metrics, dashboards, and alerts. In Strimzi we provide some basics and examples, but this is normally quite easy to set up even if you use something else. Tracing is interesting as well for monitoring Kafka-based applications. But these are the usual things you will hear about a lot at KubeCon. What is maybe a bit more special for Kafka is consumer lag monitoring. Consumer lag is the gap between what your producers have written and what your consumers have read. It's important because if you, for example, use Kafka for credit card fraud detection, you want to detect fraud on the transactions that happened a few seconds ago, not on the transactions that happened two weeks ago. So the lag gives you an idea of how far behind your consumers are and whether they are working with near-real-time data or with old data. In Strimzi we integrate with a tool called Kafka Exporter for this, but there are many other tools you can use as well.

Now, another part, and here we get a bit more into the Kubernetes details, is availability and reliability: not losing messages and having the cluster always available so that your clients keep working. But before we get to the Kubernetes details, this always starts with the Kafka configuration itself. You have to make sure your Kafka topics are properly configured, that they use a sensible replication factor and min.insync.replicas, and you also have to make sure your clients are correctly configured so that they don't lose messages when a broker happens to be restarted at the same time, and so on. That is at the Kafka level, and you always have to get it right, because there are many users who have a topic with replication factor one and a client that doesn't care whether the broker accepted and acknowledged the messages, and then they wonder where their messages went. That is something you have to fix in the Kafka configuration.

But then there are some Kubernetes-specific things, such as how to do rolling updates. Kubernetes has this great tool called kubectl, or kube-control, or however you pronounce it, and it has this magical command, I think it's kubectl rollout restart, which does a rolling update of your Deployment or StatefulSet and rolls the pods one by one. But kubectl doesn't have any knowledge of the Kafka cluster itself, so there are several issues you can easily run into. For example, if you have a Kafka cluster with four brokers, as I have here, then one of these brokers will be designated as the controller.
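Before we look at what that controller means for rolling updates, here is a quick sketch of the Kafka-level availability settings and the Kafka Exporter integration mentioned above. It's illustrative only: the topic name, partition count, and regexes are placeholders, and it assumes you manage topics through Strimzi's KafkaTopic resources.

```yaml
# Illustrative excerpt - consumer lag metrics and topic-level availability settings.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
  kafkaExporter:
    topicRegex: ".*"     # export consumer group lag metrics for all topics and groups
    groupRegex: ".*"
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: payments
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3                  # replication factor 3 - survives a broker going down
  config:
    min.insync.replicas: 2     # how many replicas must confirm a write
```

On the client side, producers that must not lose messages should use acks=all (and ideally enable idempotence), so that a write is only acknowledged once it has reached the in-sync replicas.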
So, back to that controller. It has a bit more responsibilities, a bit more tasks to do, and when the controller role changes, it always means some disruption for the clients using the Kafka cluster: a new controller has to be found, everything has to reconnect to it, and so on. Now, if you do the rolling update incorrectly, something like this can happen. You have the controller on the first pod, and you restart the first pod, so the controller role shifts to the second pod. Then you restart the second pod, so the controller role shifts again, and so on and so on. At the end you have restarted all four pods, so the rolling update is complete, but the controller role changed four times during the update. That's not the end of the world, but it definitely caused some disruption that was not necessary.

If you have an operator like Strimzi, or if you do it manually but in the better way, you can do it like this: you first roll the second pod, then the third pod, then the fourth pod, and only at the very end, when you are done with everything else, you go and roll the first pod, which was the controller. Again, we rolled the whole cluster, but we did only one controller change, so a lot less disruption for the clients.

And because Kafka is a distributed streaming platform, there is a bit more to it than just the controller role, because you will have the partitions and their replicas distributed across these brokers. When the producers are producing data, that data is always copied to all of the replicas. Now, when you restart one of these brokers, it will be down for some time. It might be just five seconds to shut down and twenty seconds to start up, but even in that short time there will be new messages produced to the other brokers. So when the broker starts up again, the replicas it hosts will not be in sync with the rest of the cluster. And if you now go and restart one of the other brokers, some partitions might become under-replicated and your producers might stop working. So it's important that the operator, when doing the rolling update, waits for these replicas to become in sync again, to re-sync with the other brokers and catch up with the latest data, and only then proceeds and rolls the next broker.

Another thing to make sure you are using is rack awareness. Most Kubernetes clusters run across multiple zones. These can be the availability zones if you run in AWS or another cloud; if you are on premise, maybe it's just separate rooms in your data center with a fireproof wall between them. But this concept is quite common to all Kubernetes clusters. And when you deploy your Kafka cluster on top of it, you definitely don't want it to end up like this, where you have nine brokers but six of them are in zone A, two of them are in zone B, and just one is in zone C. Moreover, it's not just about the brokers, it's also about the partitions and the replicas on top of them, so it can also happen that all the replicas of a partition sit on brokers in zone A. Now, what happens if you lose zone A for some reason? Well, that's bad, because you have basically lost most of your brokers and most of your data, and you have to wait for the zone to recover to continue.
So what you want to do is use the Kubernetes APIs, such as pod topology spread constraints or pod affinity and anti-affinity, to make sure the pods are nicely spread across the availability zones. But you also need to propagate this information into the Kafka brokers, to basically tell each broker which zone it is running in. That helps Kafka, and the other tooling around Kafka such as Cruise Control, to distribute the replicas of your partitions across these zones as well. And when your Kafka cluster is nicely balanced like this and you lose one of the availability zones, it's completely fine, because the other two zones have enough replicas to keep running until it recovers.

There are some other things to keep in mind as well. You should think about voluntary disruptions. That's, for example, when you kubectl drain a worker node to move the applications off it and run some upgrades. For that you can use pod disruption budgets to control how many of the pods can be down at any moment. But what's important is that you also need to understand how the readiness probes work for your Kafka brokers. In Strimzi, for example, we follow the original Kubernetes semantics for the readiness probe: the pod is ready when clients are able to connect to it and use it. But in Kafka, that doesn't necessarily mean that all the partition replicas on that broker are in sync and that you can restart the next broker without hurting availability. That's why in Strimzi we have a tool we call the Drain Cleaner, which lets draining of the nodes be handled by the operator's logic for rolling updates and restarting the pods, instead of leaving it just to Kubernetes, so that all this knowledge about how the Kafka cluster works is used when doing it. That's how it works in Strimzi; in other operators or in your Helm chart it might work differently, so it's important to understand what you are using and how it works so that you can handle these situations. And of course, you should also handle things such as disaster recovery and backups. Unfortunately, in Kafka that's a topic on its own, and quite a hard one, but you should at least think about it if you want your cluster to be production ready.

And then, once you have the cluster deployed on good infrastructure, secured, monitored, and available, you can start thinking about things such as performance. You can tune the JVM, the Java virtual machine. There's always one or the other user who is 100% convinced that they know which garbage collector is best for their Kafka cluster, and to be honest, I don't understand it deeply enough to fight with them. But in Strimzi we actually have this configurable, so if you think something else is better, you can change it.

But then there are some more Kubernetes things as well, such as problems with noisy neighbors. Today, here, we heard a lot of talks about databases, and there will be other talks at KubeCon about cloud native storage and about other messaging systems such as NATS. These all might live in the same Kubernetes cluster, and they often use the same resources: they are all typically network heavy, they are heavy users of storage, and so on.
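Before we deal with those noisy neighbors, here is roughly how the zone-related pieces above can look in a Strimzi Kafka resource: the rack configuration that tells each broker its zone, a topology spread constraint across zones, and the pod disruption budget for voluntary disruptions. The zone label is the common default and the other values are placeholders; your cluster may well use a different label, which is exactly the kind of thing you have to adapt.

```yaml
# Illustrative excerpt only - availability-related parts of a Kafka resource.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    rack:
      topologyKey: topology.kubernetes.io/zone    # propagated to the broker as broker.rack
    template:
      pod:
        topologySpreadConstraints:                # spread the broker pods across zones
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                strimzi.io/name: my-cluster-kafka # label carried by the broker pods
      podDisruptionBudget:
        maxUnavailable: 1                         # at most one broker down at a time
```

Strimzi creates a pod disruption budget for the brokers on its own; the template above just shows where you would tune it. If you use the Drain Cleaner, check the Strimzi documentation for the recommended maxUnavailable value, since the idea there is to let the operator, not Kubernetes, restart the pods.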
So, if you share a Kubernetes cluster between these kinds of deployments, you typically want to make sure that, for example, your Postgres pods don't run on the same worker nodes as your Kafka brokers, because they might fight for the resources. You can use pod anti-affinity rules to tell Kubernetes not to schedule them on the same nodes. And if you have a really big Kafka cluster, I would normally even recommend setting up dedicated nodes, and basically using the Kubernetes taints and tolerations APIs to have nodes that are dedicated only to your Kafka cluster. That, of course, makes sense only once your Kafka cluster gets big.

Another thing to think about is cluster balancing. When you start with a fresh cluster, everything looks nice and simple. You have three brokers, you start creating some topics and partitions, and they are nicely created and distributed across the brokers. Everything is balanced and equal. But as time goes on and you use the cluster, things change. Some of the projects you deploy there will be a huge success, much more than you expected, and they will get a lot more traffic, so their partitions will grow. Some others will maybe be a complete disaster and you will just delete them because nobody is using them. It might also be that some customer gets bigger and bigger, and the transactions of this single customer are half of your business and create a huge partition in your Kafka cluster. So over time, the way the Kafka brokers are used can change, and the cluster can get out of balance. It might look something like this: the first broker is overloaded, running at 110% of its capacity, the second broker is doing pretty much nothing, and the third broker is not overloaded, but quite busy. Balancing means you want to bring these back into balance. In Strimzi we integrate with the open source tool called Cruise Control, originally from LinkedIn, which does exactly this cluster balancing. You tell it which criteria are important for you, such as the best utilization of the available disk storage, the best latency, or the best throughput, so you basically tell it what balanced means for you, and it gives you a proposal. When you say yes, it will kind of magically reshuffle the partitions, move them around the brokers, and make sure you have nicely balanced nodes in your cluster that are all working well. There's a small sketch of both of these things at the end of this section.

So, that's performance. Now, if the words you heard here, the Kubernetes APIs and so on, were not something you already know, don't be afraid. This is my hierarchy of production-ready Kafka needs, which shows you the things you should consider. It basically has all of them, and the slides are uploaded to the KubeCon schedule as well, so you can just go through it and look everything up in the Strimzi, Kafka, or Kubernetes documentation, and that should get you started.

Now, I have some other resources as well. Ah, that was one slide too far. Obviously, the Strimzi, Kafka, and Kubernetes documentation are a good source, but we also have all kinds of examples on the Strimzi GitHub, and on the blog we have posts about tuning the performance of Kafka clusters and how to configure them.
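And here is that sketch of the scheduling and balancing pieces just mentioned. Again purely illustrative: the app label, the dedicated=kafka taint, and the particular Cruise Control goals are placeholders you would replace with whatever fits your environment.

```yaml
# Illustrative excerpt - keeping noisy neighbors away and asking for a rebalance.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    template:
      pod:
        affinity:
          podAntiAffinity:        # don't co-locate brokers with e.g. Postgres pods
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: postgresql
        tolerations:              # together with a matching node taint (and usually
          - key: dedicated        # node affinity), this gives Kafka dedicated nodes
            operator: Equal
            value: kafka
            effect: NoSchedule
  cruiseControl: {}               # deploy Cruise Control next to the cluster
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
spec:
  goals:                          # what "balanced" means for you
    - DiskCapacityGoal
    - NetworkInboundCapacityGoal
    - NetworkOutboundCapacityGoal
```

Cruise Control then writes the proposal into the status of the KafkaRebalance resource, and the "say yes" step is approving it, for example by annotating the resource with strimzi.io/rebalance=approve.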
Finally, one more resource: on our YouTube channel I created a series, which I think has nine or ten parts by now, named the same as this talk. It goes into a bit more detail on some of the things I talked about that we didn't really have enough time for today, so if you are interested in something in particular, you can go and watch a bit more there.

Now, the obvious question might be: why should you care about all of this? Why doesn't an operator like Strimzi do all of it out of the box? It's a good question, but unfortunately it's not that easy, because there is no one size fits all. There are many different users with many different requirements. Not everyone uses Strimzi or Kafka in production; a lot of people use Strimzi just on their laptop for development, others use it in CI, and others, of course, use it in production. Some use it for all of these things, but each of these environments has different requirements. You probably want the security in your CI for testing, but maybe you don't want to spend the money on high availability and all the replicas and so on. Similarly, the infrastructure differs between users. It can be basic things, such as some users running with three availability zones, some with just two, and some playing a tricky game with two and a half availability zones. But it can also be more trivial things, such as not everyone using the same label for the zone identification on their worker nodes. So it's not that easy to produce one configuration that fits everyone, and that's why Strimzi expects its users to think about it a bit and make sure the configuration is tailored to their environment.

Now, that's it. Thanks a lot for listening. If you have any questions, I think we have a few more seconds for that, but I will also be around here at KubeCon, so feel free to stop me if you have any questions about this, or if you just want to talk.