Thanks for coming along, everybody. This is a talk I volunteered at the last minute because we had a spare slot. It's based on a 40-minute talk I gave at ApacheCon last year in the US, so I hope it fits in the time we've got. I'm the open-source technology evangelist at Instaclustr, so I talk about quite a lot of open-source technologies, but Kafka is certainly my favourite and the first one I actually learned from scratch when I started at Instaclustr about six years ago, so it's the one I've built up the most experience with over the years. The change from ZooKeeper to the new KRaft protocol is one of the more significant changes under the bonnet for Kafka in my experience, and I think it fits in quite well with the previous talk about platform engineering. Kafka is certainly one of the more pervasive pub-sub middleware technologies and the best-known open-source one, used by just about everybody around the world.

So, Kafka abandons the ZooKeeper for KRaft; that's more or less the story. Apache ZooKeeper is quite a well-known Apache open-source project for distributed systems coordination. It's used by a lot of Apache projects, including Kafka, and it's actually really good. At ApacheCon last year I gave a talk about why Apache ZooKeeper was still so good, and then the next talk I gave was about why you don't actually need it for Kafka anymore, which was potentially a bit confusing. This story is going to be assisted by a few train pictures; this is one I took in New Orleans last year at ApacheCon.

Instaclustr provides a managed platform for big data open-source technologies: storage, streaming, analysis, search, and orchestration. I've talked about Cadence before; today I'm talking about Kafka, our main streaming technology, which up until now has been three components, Kafka itself, Kafka Connect, and ZooKeeper, all of which we run as managed services.

Briefly, Kafka is a distributed stream processing system that allows distributed producers to send messages to distributed consumers via a Kafka cluster with multiple nodes and multiple things going on in those nodes, primarily topics and partitions. Kafka topic partitions enable massive consumer concurrency. Essentially you've got producers sending messages to one or more topics, and consumers consuming from one or more topics; the workload is balanced over consumers by having more than one consumer in a consumer group, with each consumer consuming from one or more partitions. This graph shows throughput on the x-axis versus partitions on the y-axis, which is in millions. The catch with Kafka consumers is that the default Kafka consumer is single-threaded, which means you need to increase the number of consumers as the consumer response time, or latency, increases. For example, if you want to achieve a throughput of 10 million messages a second and your consumer has a latency of 100 milliseconds, you're going to need one million partitions in that topic, which is a lot of partitions. Partitions are expensive: they have replication and metadata management overheads associated with them.
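To make that concurrency arithmetic concrete: the default Java consumer does all its processing on a single thread inside the poll loop, so per-message latency directly caps per-consumer throughput. Here's a minimal sketch of that loop; the broker address, topic, group id, and the 100 ms of simulated work are all made up for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SingleThreadedConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Simulate 100 ms of per-message work: at this latency one consumer
                    // handles ~10 msg/s, so 10M msg/s needs ~1M consumers, and therefore
                    // ~1M partitions, since each partition feeds at most one consumer
                    // in the group.
                    Thread.sleep(100);
                }
            }
        }
    }
}
```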
Without going too much into the detail, the real problem is that the Kafka cluster itself has to manage the topic partition metadata, and it also has to cope with the replication; a replication factor of three is quite common in production Kafka systems. That means each message you send to a topic has to be replicated to two other brokers, so it's stored on three brokers in total, which introduces quite a significant overhead to the whole system.

So how does Kafka work? Basically, it has a controller. The Kafka controller does all the control plane stuff, all the metadata management. There's only one active controller at a time: each broker can potentially act as the controller, but only one is active across the whole cluster at any point in time. The Kafka controller manages the broker, topic, and partition metadata; it's basically the brain of Kafka. Most people don't even know it's there. If you're using Kafka as a managed service, you don't need to think about the controller side at all; that's normally all handled by Kafka and the managed service provider for you. But that's how it works.

So which controller is active, and where is the metadata stored? Up until a few years ago, the answer was Apache ZooKeeper. ZooKeeper is used for the controller election and for storing the metadata. ZooKeeper has the concept of an ensemble, which is really just a cluster of ZooKeeper nodes; you typically have three, and you can have more, but probably not more than seven. Again, only one of those is active at a time, the ZooKeeper leader. The active controller in the Kafka cluster communicates with the ZooKeeper leader, which keeps track of the election data and the metadata. So it's pretty slow; that's quite a big bottleneck. ZooKeeper isn't great in terms of write scalability: metadata changes and recovery from failover are quite slow, although reads are pretty fast due to caching in the Kafka cluster itself. It worked okay, though; you could get some pretty big clusters up and running with reasonable numbers of partitions.

The new KRaft mode is something that's come along in the last couple of years. KRaft is Kafka plus the Raft consensus algorithm, abbreviated to KRaft. The Kafka cluster metadata is now stored only in Kafka, so it's fast and scalable, because Kafka is fast and scalable. The metadata is replicated to all the brokers, so failover is a lot faster as well. The active controller is just the quorum leader, elected internally with the Raft protocol (you can even inspect the quorum through the admin API; there's a sketch below), so there's no longer any dependency on an external ZooKeeper.

We had a couple of hypotheses we wanted to test when we started looking at the new KRaft mode in Kafka last year. We were interested in whether there'd be any impact on the actual data workload performance, and we assumed there probably wouldn't be: we guessed we'd see similar results between ZooKeeper and KRaft for the pure data workload. For metadata changes and recovery from failover, metadata performance in other words, we knew ZooKeeper was pretty slow for some of these operations and we assumed KRaft would be faster; that was the promise, anyway. We knew that with ZooKeeper there were quite low limits on how many partitions per cluster you could get out of Kafka, and we hoped that with KRaft there'd be a lot more. And on robustness, we knew ZooKeeper was pretty robust.
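Here's that quorum sketch: the Java Admin client gained a describeMetadataQuorum call around Kafka 3.3 (KIP-836) that reports the quorum leader, i.e. the active controller. Treat this as illustrative; the bootstrap address is assumed, and the accessor names are from my reading of that API:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

public class ShowQuorumLeader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Ask the cluster for the state of the KRaft metadata quorum.
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Active controller (quorum leader): " + quorum.leaderId());
            quorum.voters().forEach(v ->
                    System.out.println("Voter " + v.replicaId()
                            + ", log end offset " + v.logEndOffset()));
        }
    }
}
```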
On the robustness question: there weren't many situations where ZooKeeper caused any problems, and Kafka itself was pretty reliable using it. But this was an unknown for KRaft, because it's a new feature. It has only recently become production ready; it had been available for a while before that as a sort of proof of concept that people could try out in a development context. We weren't sure whether it was going to work particularly well in a production environment or not.

So I did a couple of experiments using a fairly early version that we had available in our managed service. The first experiment was whether the data workload performance would differ at all. We assumed there'd be minimal or no difference between ZooKeeper and KRaft message throughput; this was purely looking at the producer workload at this point. Why? Because ZooKeeper and KRaft are only concerned with metadata management, not the actual data workload, and Kafka producers only need read-only access to partition metadata, so they should be just as fast.

How did we do it? We set up Kafka 3.1.1 on identical AWS nodes with a replication factor of three. And what did we find? It confirmed our suspicion: there was no difference in throughput whatsoever, which you can see because the blue line is completely hidden behind the orange line. Partitions are on the x-axis, and we managed to get up to about 10,000 partitions in the cluster; throughput is on the y-axis. The throughput is basically identical. There is a cliff, though: this is something we've seen before with Kafka, where past a certain number of partitions the throughput drops. It's not too bad compared to some tests we ran about two years ago, where throughput started dropping off at about 100 partitions; now you only start seeing the drop-off at around 1,000 partitions. It's still something to watch out for. Looking at partitions versus latency in milliseconds: not surprisingly, as the throughput drops, the latency starts skyrocketing to the point where the cluster would become unusable, which for a low-latency system like Kafka is a real issue. So you want to avoid that if you can.

For the second experiment we looked at partition creation performance. How long does it actually take to create a large number of partitions? How many partitions can we create, and can we create more on a KRaft cluster than on the older ZooKeeper cluster? A few simplifications for this experiment: we used a replication factor of one, because with the replication that goes on for topics and partitions, the background CPU can be quite high when you have a high partition count and a high replication factor. With a replication factor of one there's no replication and very little background CPU. We could have used a bigger cluster, but for the sake of time we just set RF to one for this one; even then there was still about 50% background CPU load on the clusters with a hundred partitions and no workload running. We tried a few different things to create a large number of partitions. The first approach was just using the standard Kafka tools to create a topic with lots of partitions straight off.
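For reference, that first approach through Kafka's Java Admin client looks something like this. A minimal sketch, assuming a local broker; the topic name and partition count are illustrative, and RF is one as in the experiment:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBigTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // One topic with a very large partition count, replication factor 1.
            NewTopic topic = new NewTopic("partition-stress", 10_000, (short) 1);
            long start = System.currentTimeMillis();
            // At large partition counts this call is where the timeouts show up.
            admin.createTopics(List.of(topic)).all().get();
            System.out.println("Created in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```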
Then we tried using the alter command to increase the number of partitions on an existing topic incrementally (there's a sketch of this approach below). We tried curl with our in-built provisioning API, because we were having problems with timeouts on the first two options. And finally we tried a fourth approach, which was just a simple script to create multiple topics with a fixed number of partitions each. However, there was a problem with all of these: every one of them eventually failed, and the only real difference was how soon the failure occurred. And after some of the failures, disturbingly, the Kafka cluster was basically unusable, even after restarting the Kafka process on each node. So something was a bit strange.

This graph shows the number of partitions on the x-axis versus the creation time in seconds on the y-axis. ZooKeeper takes basically linear time to create partitions: the more partitions you want, the longer it takes, up to the point where an eventual timeout stops you creating any more. With KRaft it's constant time, so it's a lot faster; that's the orange line on the bottom. So it's very easy with KRaft to create lots of partitions, but the eventual failure still occurred, mostly due to timeouts, which was interesting.

The other approach we tried was the incremental one. This is the time per thousand-partition increment, which does increase with the total number of partitions in the topic. ZooKeeper is the blue line, so it's a lot slower than KRaft, but KRaft still takes some time; it's a slow process to create a large number of partitions, and we were still getting the eventual failure. So the initial conclusions: it's certainly faster to create more partitions on KRaft than on ZooKeeper, but we were hitting a limit of around 80,000 partitions on both ZooKeeper and KRaft clusters, at which point Kafka ended up failing. It's actually very easy and quick to kill Kafka on KRaft: just try to create a topic with 100,000 or more partitions and something inevitably goes wrong.

Another experiment we did was on the performance of the metadata workload, for example reassigning partitions. That's a common Kafka operation: if a server fails, you can move all of the leader partitions on it to other brokers, and there's a simple command to do that. Our tech ops people do this all the time for customers; you run it once to get a plan, and then again to actually move the partitions. Moving the partitions off one broker to the other brokers was the experiment we performed, with 10,000 partitions and a replication factor of two this time. And we got the answer to life, the universe, and everything: 42. But what's the question? The question is how many seconds it takes to reassign 10,000 partitions using KRaft; 42 is the answer. It takes quite a significant amount of time using ZooKeeper to do the same operation, at 600 seconds, which is quite a significant overhead when one of your nodes isn't available on the cluster. 42 seconds is certainly a lot better. Although, just to note, there was no data in the partitions for this experiment, so in real life it's quite likely that the time to move the data is going to dominate.

Experiment four: this is the one I was really excited about. Everyone had been saying you can get millions of partitions with KRaft.
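Here's the sketch of that incremental approach: the Admin client can raise a topic's partition count in place, and timing each step reproduces the time-per-thousand-partitions measurement. The step size, target, and topic name are all made up for illustration:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitions;

public class GrowPartitionsIncrementally {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Grow an existing topic in 1,000-partition steps and time each step.
            for (int total = 2_000; total <= 80_000; total += 1_000) {
                long start = System.currentTimeMillis();
                admin.createPartitions(
                        Map.of("partition-stress", NewPartitions.increaseTo(total)))
                    .all().get(); // this is where the eventual timeouts/failures hit
                System.out.println(total + " partitions: "
                        + (System.currentTimeMillis() - start) + " ms for this step");
            }
        }
    }
}
```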
As I was saying, millions of partitions is probably a silly thing to be trying to achieve, but from the point of view of science and doing an experiment, we still had that goal in mind: if we could hit a million partitions, we'd have proved that KRaft was doing what it was advertised to do. So this was my final attempt to reach a million or more partitions on a cluster, again with RF equal to one; I didn't want to need an enormous cluster to do it. To achieve it we had to cheat a bit: we used a manual installation of Kafka 3.2.1 on a large EC2 instance, so it's not actually a cluster at this point either. And we were still hitting limits at around 30,000 partitions.

This is the sort of error we were getting. It was a bit of an odd one, not one I'd seen before: a 'Map failed' error, something to do with the Java runtime and the available memory. A slightly odd error, because we knew we had lots of spare RAM, and we hadn't encountered this sort of error before on very large clusters either. So we tried increasing the amount of RAM. We tried things like increasing the number of file descriptors, which we knew was quite important; and because I was doing this myself as an experiment, rather than relying on our managed service version of Kafka, which has all these settings carefully configured for customers, I had to recreate some of our settings. Typically we do have lots of file descriptors: 65K is the default on Linux, and for Kafka you need a lot more than that. But it still didn't work; we had plenty of spare RAM, yet we were still getting this out-of-memory error.

Googling this type of error did reveal the problem. On Linux there's a setting, vm.max_map_count, which determines the maximum number of memory-mapped areas a process (including the JVM) can have, and again the default is only about 65,000. Because every partition uses two mapped areas, that limits you to about 32,000 partitions, which was roughly where we were hitting the limit. So we set it to a very large number and tried again. Did we reach a million? Well, sort of. We managed to hit around 600,000 partitions on the one Kafka broker we were experimenting with, and by inference, for a three-broker cluster, we'd be well over the one-million-partition mark, at an estimated 1.9 million partitions. We've actually redone this experiment since (this talk is a bit out of date) with our new managed service with KRaft, and that's exactly the sort of number you can get in KRaft mode. So that's a lot of partitions on Kafka. Typically our production customers only have hundreds of partitions on a cluster; millions is interesting theoretically, but I don't think there's much practical application at this point in time.

Okay, what about the batch error? It's still painfully slow to create this many partitions, due to a batch error when you try to create too many partitions at once. This turned out to be a real bug: the quorum controller has a limit on the maximum batch size it can handle. The promise is that it will be fixed in 3.3.3, but we haven't tested that yet. So, some takeaways about KRaft: it's fast for data workloads, with no difference in fact from ZooKeeper mode; it's certainly faster for some of the metadata operations we tried; and you can have clusters with more partitions, potentially more than one million.
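Going back to the vm.max_map_count point for a second, the limit is easy to check yourself. A small sketch, Linux-only, using the two-maps-per-partition rule of thumb from our experiment (an observation, not a hard guarantee):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MapCountHeadroom {
    public static void main(String[] args) throws Exception {
        // vm.max_map_count caps the number of memory-mapped regions per process;
        // the Linux default (~65K) is far too low for very high partition counts.
        long maxMapCount = Long.parseLong(
                Files.readString(Path.of("/proc/sys/vm/max_map_count")).trim());

        // Rule of thumb from the experiment: ~2 mmap regions per partition,
        // so the per-broker partition ceiling is roughly half the map count.
        System.out.println("vm.max_map_count = " + maxMapCount);
        System.out.println("Rough partition ceiling per broker: " + (maxMapCount / 2));
    }
}
```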
You still have to watch out for some of the operating system and JVM configurations to support clusters with more partitions, or trust a managed service provider, because they've probably done the experiments, as we have; we're constantly learning. So should you use Kafka KRaft mode yet? Well, yes: 3.3.1 is production ready, and by Kafka 4.0, which will be next year, there will probably be no ZooKeeper mode at all. So it's probably time to start exploring KRaft mode. We provide Kafka 3.3.1 in public preview at the moment, I think, and you can test out a free trial for a couple of weeks using some quite reasonably sized Kafka clusters on various cloud providers; we support AWS, Google, Azure, and a few others. We're providing it as a public preview rather than final general availability because we're still learning how KRaft behaves, mainly for our tech ops people: it gives them some experience, and it also gives our customers experience in using it, initially just in a development context.

And that's it. That was a 40-minute talk shrunk down to 15 minutes, so there's a lot in there, but hopefully it gives you a bit of an overview of the new KRaft mode, which will be the future of Kafka; I don't think there's much choice about that. So thank you very much. Any questions? I guess it's lunch next, so ask me over lunch if there's anything you'd like to know about Kafka in general or KRaft in particular.