Hello, everyone, and welcome to this session on monitoring Kafka without instrumentation using eBPF. If you are using Kafka or operating Kafka, this session should be quite interesting for you. We're going to show a different, complementary way to monitor Kafka. If you are interested in eBPF, we are going to show what is possible to do with eBPF, especially a real use case based on protocol tracing, which is a new capability of some of the open source projects we are going to show.

My name is Anton Rodríguez. I work as a principal software engineer at New Relic. New Relic provides observability as a service. We are basically a monitoring company, and because of that, we are very heavy users of Kafka. Almost all of our services use Kafka.

If Kafka is something new for you, allow me to do a very short introduction. It's a distributed system to store information in a temporary way, and it's really performant at ingesting data. Indeed, the name comes from the famous writer, Kafka, because Kafka is excellent at writing. Why use Kafka? For example, if there is an application producing a lot of data, instead of sending it directly to another application, we store it in Kafka so it can be consumed independently. If there is a problem in the producer or in the other applications consuming the same data, it shouldn't affect the rest of our applications. We are decoupling them with Kafka, and that's really good for distributed architectures and event-driven architectures. This is also a very important use case for monitoring data, and that's the reason why we are heavy users of Kafka.

Just to give you some numbers, we ingest around 125 petabytes of data per month, and it's growing every day. More than three billion data points per minute. Our biggest Kafka cluster has around 273 machines, brokers in Kafka terminology, with an average traffic of 20 gigabytes per second. And that's only the traffic through Kafka; there is also the internal traffic, so it's much more than that. Being completely honest, operating something so big was quite painful. We had the feeling we were pushing the technology to the limit with such a big cluster, so we split the cluster into smaller ones as part of our cloud journey.

One of the good things of working in our company, providing monitoring, is that we can monitor everything we want, and we don't have to pay for it. It's basically free, because it's the service we provide to our customers. But even with that and our experience with Kafka at scale, we find monitoring Kafka pretty challenging, and there are several reasons for that. Ryan, next slide, please.

The first one is because we need information from different places. We need metrics from the operating system, things like the CPU and the memory, which are extremely important, but even more important for us, as Kafka operators, are the network throughput and the disk I/O latency. They are the typical bottlenecks working with Kafka. We also need metrics from the Java virtual machine, for example, to know when and for how long the garbage collector has been running. In a system processing data in near real time, the Java garbage collector may introduce latency and delays in our applications. Finally, and probably most important, we need Kafka metrics. Kafka exposes them using JMX, so we use something called the JMX exporter, deployed alongside our Kafka services. It exposes an endpoint, so we can have the metrics in Prometheus and, from there, export them to our observability platform, so we can create dashboards, alerts to see what is happening, and all of that stuff. And if you look at the metrics catalog, there are really a lot of them.
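To make that pipeline a bit more concrete, here is a minimal sketch, not from the talk, of reading a single broker metric from the JMX exporter's Prometheus endpoint in Python. The port and the metric name are assumptions that depend entirely on how the jmx_exporter is configured; in practice Prometheus scrapes this endpoint for you, the point is simply that everything downstream is plain Prometheus text format.

```python
# Minimal sketch of pulling one Kafka broker metric from the JMX exporter's
# Prometheus endpoint. The port (7071) and the metric name are assumptions
# that depend on your jmx_exporter configuration.
import urllib.request

EXPORTER_URL = "http://kafka-broker-0:7071/metrics"  # hypothetical host/port
METRIC = "kafka_server_replicamanager_underreplicatedpartitions"  # hypothetical name


def scrape_metric(url, name):
    """Fetch the Prometheus text endpoint and return the first sample of `name`."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            line = raw.strip()
            if line.startswith(name):
                # Prometheus text format: "<name>{labels} <value>"
                return float(line.rsplit(" ", 1)[-1])
    return None


if __name__ == "__main__":
    value = scrape_metric(EXPORTER_URL, METRIC)
    print(f"{METRIC} = {value}")
```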
As you can see, we need to operate the different pieces and to know exactly what metrics we need and how to interpret them to operate Kafka. If there are new metrics, we have to add them and make some changes, and that's a lot of work. And even doing all that is not enough. Next slide, please.

One of the particularities of Kafka is how much it relies on the clients. Both Kafka brokers and clients work together to achieve better performance. That's one of the key components of Kafka, one of the reasons why Kafka is so good. It works really well, but it also makes monitoring Kafka much more complicated. Big organizations, as in our case, have thousands of different clients created by different teams, and they typically use different technologies and libraries. In our case, most of the teams use Java, but there are also Python and Go clients. And when there are problems, we need to know what's happening also on the client side. For that, the only way is to instrument those clients. That requires a lot of standardization and governance, which leads to frustration. Never tell a data scientist they can't use Python; they're going to hate you. Also, the Kafka platform team has to validate clients and their use cases, and it is really hard to automate all that stuff.

A good example of this is how to know the versions of the clients using Kafka. It's important to know them when we are upgrading our clusters. In general, Kafka provides very good backward compatibility, but still, we have had problems in the past because we had some clients running very old versions, and they were impacted when we upgraded to a newer version of the cluster. And we are not alone in that, so someone contributed a feature to Kafka to allow the clients to report their versions. Unfortunately, old clients and some frameworks don't provide that information in any way, so we don't have a good picture of what our clients are using, and that makes our life harder. There is a Kafka Improvement Proposal, a KIP, to send all the client metrics to Kafka and make them available with OpenTelemetry, but it isn't ready yet. It goes against one of the fundamental design principles of Kafka, which is basically to keep the broker light and simple so it can be easily evolved and stay performant. We'll see what happens if it is merged into Kafka, but right now it's not ready yet. Next slide, Ryan, please.

A great example of why monitoring clients is important are consumer rebalances. We have here a producer sending data to a Kafka topic. A topic is basically a logical grouping of information, like a database table, but in Kafka terminology. Topics are divided into partitions, so they can be distributed across several machines, what we call brokers. This is how Kafka provides high availability and how it can scale. Now, we have here an application, a deployment in Kubernetes with two pods, or, as we say in Kafka terminology, two consumer instances: consumer instance one and consumer instance two. And this is important: we can't have more than one instance in the same group reading from one partition. Basically, the partition is the unit of parallelism in Kafka. We can have a consumer instance reading from multiple partitions, but not the inverse.
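To make the consumer-group model concrete before the rebalancing example, here is a minimal sketch of one consumer instance written with the kafka-python client; the topic name, group id, and bootstrap servers are hypothetical placeholders. The rebalance listener simply logs when partitions are revoked and reassigned, which is exactly what happens in the scenario described next.

```python
# Minimal sketch of one consumer instance in a consumer group (kafka-python).
# Topic, group id, and bootstrap servers are hypothetical placeholders.
from kafka import KafkaConsumer, ConsumerRebalanceListener


class LogRebalances(ConsumerRebalanceListener):
    """Logs partition revocation/assignment so rebalances are visible."""

    def on_partitions_revoked(self, revoked):
        print(f"rebalance: revoked {sorted(p.partition for p in revoked)}")

    def on_partitions_assigned(self, assigned):
        print(f"rebalance: assigned {sorted(p.partition for p in assigned)}")


consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id="shipping-service",  # all instances with this group id share the work
    enable_auto_commit=True,
)
consumer.subscribe(topics=["orders"], listener=LogRebalances())

for record in consumer:
    # Within a group, each partition is read by at most one consumer instance.
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
```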
Now, if we launch a new pod, so we have three different instances, in this case consumer instance three, Kafka will detect the new situation. It will assign partitions to this new instance to make everything more efficient and balance the traffic across all the consumer instances. But in order to do that, it will do a rebalance: it will stop the consumers to negotiate and tell them which partitions they should consume from. That's exactly what a rebalance is. The problem with it is that it has to stop the consumers, and that introduces latency, and latency in general is a huge problem for our clients. Problems with rebalances are common and hard to debug, because we need to know what's causing the rebalance, and there are several possible factors, like new consumers, restarts, or problems in the brokers, many things. So it requires metrics from the consumer instances and the brokers to really understand what's happening and how to solve it. Next slide, Ryan. Thank you.

There is one metric we didn't mention yet, and it's fundamental: consumer lag. Consumer lag tells us how much data is pending to be consumed by a particular application. If there is no difference between the last produced data and the last consumed data, as you can see in the first row of the diagram, we are good: that means there is no consumer lag. But if there is a difference, as we can see in the second row, there is consumer lag. This can happen for different reasons. There may be problems in the consumer, maybe it is slow. Problems in the broker, maybe there is not enough network bandwidth or disk I/O. Or it may even be a problem in the producer: it may be producing more data than we were expecting. In any case, consumer lag is an excellent indicator to know if there are problems, and it's the main metric we use for alerting. Most of our incidents start with consumer lag. That's also a problem, because it's hard to know initially which team should be paged: maybe the consumer team, maybe the Kafka platform team, maybe the producer team. And again, we need an external component to expose this metric. There are some good open source projects, like Burrow from LinkedIn or the Kafka Lag Exporter from Lightbend. The benefit of the Kafka Lag Exporter is that it's able to report consumer lag also in seconds, and that's much more useful than the number of messages. Consumer lag in seconds gives an idea of the latency introduced by the problem and how long it will take to get back to a normal situation. It gives us a measure of the latency, and that's very important.
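As a rough illustration of what tools like Burrow or the Kafka Lag Exporter compute, here is a sketch that derives per-partition lag in messages for one consumer group using kafka-python. The group, topic, and bootstrap servers are placeholders; real exporters do this continuously, across all groups, and also estimate lag in seconds.

```python
# Sketch: per-partition consumer lag (in messages) for one group and topic.
# Group, topic, and bootstrap servers are hypothetical placeholders.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "orders"
GROUP = "invoicing-service"

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

for tp in partitions:
    committed = consumer.committed(tp) or 0     # last committed offset for this group
    lag = end_offsets[tp] - committed
    print(f"partition {tp.partition}: lag={lag} messages")

consumer.close()
```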
So, okay, those are the challenges of monitoring Kafka. I think they're pretty clear. Next slide. So, how can eBPF help us with those challenges? First of all, in case it's new for you, let me introduce eBPF. It's a new feature of the Linux kernel. It allows us to tell the kernel to run a small eBPF program and notify us when something happens. It's like a breakpoint in a debugger: it also interrupts execution when that point is reached, but unlike a breakpoint, a small program is run. It inspects and collects any relevant state and then immediately resumes execution. One of the benefits, compared with other ways to obtain the same data, is that it's safe. We don't want to break things in the kernel. eBPF programs are verified, so they don't interfere with the kernel or create any security risk. The other benefit is that the overhead is really low, so we can trace a lot of things, and we aren't stealing resources from the programs running on our servers, which, in the case of Kafka, is the main priority.

How do we create an eBPF program? First of all, we use the C programming language, we compile it to bytecode, and we send it to the kernel. We specify when that program should be executed, and the kernel is going to validate all that. If there is a problem or a potential security risk, it's going to reject it. Those programs have very strong limitations, just to make sure they're safe. Once it is ready, when the specific condition happens, a network packet arrives, a new connection is opened, a program opens a file, whatever is happening in the kernel that we selected, the kernel will execute the program in a very efficient way. Typically, the program will return data to user space, where we have a lot more freedom to do whatever we want: we can show the data to the user, or we can send it to an external system, or whatever we want. Next slide, please.

Here we have some examples built with BCC. BCC is a toolkit which provides helpers and other useful components to build eBPF programs. We can do things like trace the disk I/O latency, which could be very useful for Kafka. We can list the processes running both in the kernel and in user space. We can list the TCP connections open, or we can define whatever other trace we are interested in. BCC also provides frontends for Lua and Python, but in general, we need to know how to code in C and a bit of the Linux kernel's internals to be able to build programs with it. It's interesting and very powerful, but most of us probably don't want to do this. We just want to monitor our services, and this would be overwhelming. Next slide, Ryan.

A simpler, more popular alternative is to use bpftrace. It's a very popular command-line tool. Basically, we can define, with a domain-specific language, a DSL, what we want to monitor, and bpftrace will show it for us. And as you can see, it's very powerful: with just one-liners, we can see things like the files opened by a process, the syscalls made by a program, the bytes read, and many, many other things. We can even execute it in Kubernetes with kubectl trace to obtain information from our pods, which is very handy. But to do this, or to modify these one-liners, you need to have knowledge about what you want to monitor and the ability to translate it to the DSL. And the output is something you read in the command line, so analyzing the data is not so easy. Next slide, please.

Another option, maybe easier to use, is ebpf_exporter. It is built on top of BCC and was open-sourced by Cloudflare. It allows us to export metrics to Prometheus, so then we can query them with Grafana or any other monitoring tool. It's much simpler than the other tools and yet very powerful. But if you want to use this in the context of something like Kafka, then you need to modify it, and you need to understand how to retrieve those metrics, and that's not easy at all.
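To give a feel for what the raw BCC workflow looks like, and why most teams would rather not maintain this themselves, here is a minimal sketch using BCC's Python frontend: a tiny C program attached to the openat syscall that counts how many files each process opens. It needs root and the bcc packages installed, and it is intentionally far simpler than anything you would need for real Kafka monitoring.

```python
#!/usr/bin/env python3
# Minimal BCC sketch: count openat() syscalls per process id.
# Requires root and the bcc Python bindings.
from bcc import BPF
import time

program = r"""
BPF_HASH(counts, u64);

int trace_openat(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat")

print("Tracing openat() for 5 seconds...")
time.sleep(5)

# Dump the in-kernel map, busiest processes first.
for pid, count in sorted(b["counts"].items(), key=lambda kv: kv[1].value, reverse=True):
    print(f"pid {pid.value}: {count.value} openat() calls")
```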
So now let me introduce Pixie, a Cloud Native Computing Foundation project which makes it much easier to work with eBPF. Next slide, Ryan. To introduce Pixie, I want to say hello to my colleague, Ryan. He's a Pixie committer who's going to show us Pixie in action with a very cool demo, specifically in the context of Kafka.

Yeah. Thank you so much, Anton. I'm Ryan. I'm a software engineer at Pixie Labs in New Relic, and I work on the Kafka tracing capabilities of Pixie. So what is Pixie? Pixie is an open source CNCF observability platform targeted at Kubernetes applications, and Pixie collects all the data with auto-instrumentation from eBPF. Pixie's vision is to help developers understand what's happening in their Kubernetes clusters. When something goes wrong in their clusters, with all the microservices, it's often very difficult to debug. Pixie provides a set of tools for developers to figure out what's going on in their clusters when there's a performance or functional issue.

One of the most important features of Pixie is network traffic tracing. When there are different services interconnected in a cluster, we want to know what services are talking to each other, when they're talking, and what data they're sending. Pixie started with tracing HTTP traffic and expanded into other protocols like gRPC over HTTP/2, database protocols like MySQL, Postgres, and Redis, and streaming platforms like Kafka. And there's one key requirement we had in mind when building Pixie: from the user's perspective, there should be no instrumentation. Pixie handles all the instrumentation required automatically with eBPF. That means, for our users, there's no code modification, no recompiling, and no redeployment of your application. You simply turn Pixie on, and it automatically collects the data on your running cluster. This makes debugging, for example, large-scale Kafka systems in production much easier, because code modification and redeployment can be very inconvenient and costly. And tracing with eBPF also has low overhead and allows Pixie to always stay active.

I would also like to give an overview of Pixie's approach to protocol tracing. Basically, there's a pod called the Pixie edge module, or PEM, deployed on every node of your Kubernetes cluster. On each node, the Pixie edge module captures the network traffic of all the other pods with eBPF kprobes in the kernel space. Every time a network-related syscall happens, such as the send and receive syscalls, the kprobes get triggered and capture the data and the metadata of that connection. The traffic is then classified into specific protocols, such as HTTP or Kafka, and shipped into user space. In user space, the protocol parser sorts, understands, and parses the data into more structured messages. These messages are then stored into tables for querying in the future. A user is able to come online, browse the UI, and query the data with Pixie's query language, which we call PxL, using PxL scripts. The queries are sent to a powerful query engine which retrieves the requested data from the data tables. And this is basically a high-level overview of how Pixie uses eBPF to trace network traffic.

But you might still be wondering: how exactly is this data traced? Basically, you could think about Pixie as very similar to Wireshark, which snoops all the network traffic happening on the host. The difference is that Pixie traces the standard Linux syscalls, much closer to the application, whereas Wireshark traces at the data link layer. By tracing the Linux syscalls, we're able to skip over the complexities of parsing IP/TCP packets and achieve low overhead. This also allows us to trace at a standard interface, which means that tracing will work regardless of the target application. For Kafka, this means that it doesn't matter what components are connected to your Kafka brokers or what clients are used, whether it's Python or Java or Go: as long as they use the Kafka wire protocol, Pixie will be able to trace it.
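To give a sense of what "parsing the Kafka wire protocol" from captured syscall data involves, here is a small conceptual sketch in Python, not Pixie's actual parser (which lives in the PEM and is not Python), that decodes the standard request header every Kafka client sends: message size, API key (the opcode, such as Produce or Fetch), API version, correlation ID, and client ID.

```python
# Conceptual sketch of decoding a Kafka request header from raw bytes captured at
# the syscall layer. It mirrors the wire format (size, api_key, api_version,
# correlation_id, client_id) but is not Pixie's implementation.
import struct

API_KEYS = {0: "Produce", 1: "Fetch", 8: "OffsetCommit", 11: "JoinGroup", 14: "SyncGroup"}


def parse_request_header(buf):
    size, api_key, api_version, correlation_id = struct.unpack(">ihhi", buf[:12])
    (client_id_len,) = struct.unpack(">h", buf[12:14])
    client_id = buf[14:14 + client_id_len].decode("utf-8") if client_id_len >= 0 else None
    return {
        "size": size,
        "api": API_KEYS.get(api_key, f"key_{api_key}"),
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
    }


# Example: a Fetch (api_key=1) v11 request header from a client called "shipping".
raw = struct.pack(">ihhi h", 60, 1, 11, 42, len(b"shipping")) + b"shipping"
print(parse_request_header(raw))
```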
And for example, in this diagram, we see that the Kafka broker is making send and receive syscalls back and forth to the Linux API. All that information is being captured with eBPF and sent to user space, where a protocol parser processes the raw data into fetch and produce messages for Kafka data events, and join group and sync group messages for consumer rebalancing events. These messages are stored in the table store for querying later.

So with that being said, I would like to move on to a very simple demo scenario, where we'll see how Pixie works in reality to debug a live Kafka cluster. In this example, we have a very simple kind of e-commerce site with one topic in a Kafka cluster, the order topic, with one producer, the order service, producing to the order topic every second or so. And we have two consumers, a shipping service and an invoicing service, both consuming from the order topic. The difference between these two services is that the shipping service is normal and able to keep up with the load as the orders are produced. However, the invoicing service is very slow and is actually unable to keep up with the load. We'll see how Pixie is able to discover this issue and provide us information about the consumer-producer latency, et cetera. I've already deployed Pixie very easily with a one-line bash command on my cluster, and I also have the application deployed already, so allow me to move on to the demo.

So the first thing I want to show is the Kafka overview page. Right, so first I would like to introduce a little bit of the Pixie UI: at the very top, you can select the cluster we're interested in. And here there is basically a list of different PxL scripts that we can select from. Different PxL scripts will give us different data views. We can see that we have a bunch of scripts, from HTTP to Kafka to MySQL, Postgres, et cetera, and also specific scripts for specific nodes and pods. On the right side, we have this start time button, and right now we're basically looking at information from the past 15 minutes.

So if we hit run, we actually see that in the middle of this view, we have a Kafka flow graph. This basically captures a high-level view of the data in our Kafka cluster right now. We see that in this flow graph, we have one topic, the order topic, we have one producer and two consumers. The producer is producing to the order topic at about 64 bytes per second. The invoicing pod is falling a little behind: it's consuming at about 33 bytes per second, whereas the shipping consumer is completely normal at about 64 bytes per second, so it's able to keep up with the load. And below we have a table that gives us a summary of all the Kafka topics. In this case, we only have one topic, the order topic, and we see that it has five partitions, one active producer, two active consumers, and we also have the total number of bytes produced and the total number of bytes consumed in this view as well. All of this information is being captured live with eBPF in my cluster right now and compiled with PxL scripts.

So we just noticed that the invoicing pod is not consuming messages as fast as we want it to. Maybe there is something wrong with this invoicing pod. To investigate, we can go to the Kafka stats page. The Kafka stats page provides us some basic metrics, including the latency of our Kafka messages, the request throughput, the request throughput per command, et cetera.
It also provides us with some information on the pods in our Kafka cluster, including request throughput, latency of the messages, and the total number of requests. We can already see that the shipping pod has sent about 2,500 requests, while the invoicing pod has sent much fewer. So this is also an indication that something is wrong. We can also click here to view some specific metrics of the invoicing pod. On this page, we see information like the CPU usage, network traffic, network throughput, disk usage, and also memory usage. If we scroll to the very bottom, we actually see that Pixie provides us a flame graph, basically showing us what the CPUs in this cluster are currently spending time on. The flame graph feature currently works for C, C++, and Go, and support for Java, and therefore Kafka brokers, will come very soon in the future.

So I would now like to go to the Kafka data view. Basically, this Kafka data table contains all the raw Kafka messages captured by eBPF in the past few minutes. If we set the maximum records field here, we actually see that there have been 4,678 records in the past 15 minutes. For each record, we see the source and destination pod, as well as the Kafka command. If we sort by time, we see that there's a whole bunch of Kafka opcodes traced by Pixie, such as produce, fetch, offset commit, heartbeat, et cetera. Pixie also supports full-body tracing of these opcodes. If we click on this one, the produce request, for example, we can actually see that Pixie parses the request body and gives us information on the name of the topic, what partition this is producing to, and the total size of the message set. If we look at the response, you can also see very easily if there is any error coming back from the Kafka broker, the base offset, the log append time, et cetera. So this view is very useful if we're looking for one specific message or if we want to filter by a specific command.

And on the right side, I would like to introduce the PxL script. This is the PxL script that actually powers this Kafka data view. Basically, PxL is Pixie's data programming language, and it resembles Python. It's very easy to write and allows users to apply customized transformations to their data, such as filtering or joining tables. If we look at this PxL script right here, we can see that in this line, we define a data frame based on the data in the Kafka events table. We add the source and destination columns here to the data frame. We can also filter the data frame with customized source and destination filters. And at the very bottom, we can select the columns that we want to show in this view. Yeah, so this is the PxL script.

The next thing I want to show that's pretty cool is the Kafka producer-consumer latency. This is one very unique feature of Pixie's Kafka tracing capability: we're able to show the consumer-producer latency in wall-clock time. Basically, this is the time between when a message is produced and when it's fetched. If we come into this view and enter the default namespace, immediately we see that there's one topic, order. And if we enter the order topic here, we see below a plot showing us the delay between the producer and the consumers. You see that we have one producer and two consumers, and there's something really weird going on here, especially for the invoicing service. If we filter by shipping, we can actually see that the plot looks perfectly normal.
All five partitions are able to keep up with the load, and latency is almost zero the entire time. However, if we go to the invoicing consumer, this is very weird. We can clearly see that for all five partitions, latency has been creeping up slowly over time, now into the 30, 35, or even 40 seconds range. And there's also a zigzag pattern as latency increases. This is because we've intentionally made the invoicing service very slow, so that it's actually taking a very long time to process each message. Every time it consumes a batch of messages from the Kafka broker, it takes an even longer time to process them, and by the next fetch it has fallen behind even more. That's why there's a zigzag pattern. But the overall trend also shows that we have a big issue in terms of increased latency, specifically for the invoicing pod. So being able to measure the latency in wall-clock time is very important, because it's very indicative of any problem in our cluster.

The other view I would like to show is the Kafka consumer rebalancing events. These rebalancing events happen when a new consumer comes online or an existing consumer goes offline, and the consumers in the consumer group are assigned new partitions. It's important to monitor these events because either some or all of the consumers are stopped from consuming messages while the rebalancing is in progress. It will also cause consumers to lag behind if these consumer rebalancing events are happening too often. So in the background, I've just rescaled the consumers to trigger a rebalancing event. We can see that now we actually have three shipping consumers in the shipping consumer group and two invoicing consumers in the invoicing consumer group. If we look at the table down below, these are the join group and sync group records collected with eBPF live on my cluster. Each consumer rebalancing event consists of one join group and one sync group request. This view gives us information on the generation ID, the group ID, and the specific member ID for each consumer in a consumer group. And more importantly, in the table above, we can actually visualize the delay between the join group request and the sync group response. So if we just look at the newest consumer rebalancing events, we see that there are a couple with very low delay, where the delay is the time between the join group request and the sync group response. And there's this one, specifically for the invoicing pod, that has shown an especially high delay of 13 seconds. This is concerning and also indicative of a problem in the invoicing service.

So to come back from the demo, we just saw how Pixie can work in action to debug some Kafka issues. To summarize, it's challenging to have good visibility into a Kafka cluster because of the different layers involved, from the operating system to the JVM to the Kafka broker itself. There are also different components connected to the Kafka system and different clients used with Kafka, from Java to Python to Go. And this is what makes Kafka observability very challenging. Pixie uses eBPF to automatically trace network traffic, with no instrumentation needed. Pixie requires no code modification, no redeployment, and it's really easy to use. And lastly, Pixie is also an open source project, so if you're interested, I would really encourage you to try it out, and we always welcome feedback and contributions.
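For readers who want to reproduce the kind of query Ryan walked through, here is a rough sketch of a PxL script along the lines of the Kafka data view. It follows Pixie's Python-like DataFrame style, but the table name and column names (such as 'kafka_events.beta' and 'req_cmd') are assumptions that may differ between Pixie versions, so treat it as illustrative rather than copy-paste ready.

```python
# Sketch of a PxL script in the spirit of the Kafka data view shown in the demo.
# Table and column names are assumptions; check the table schemas in your Pixie version.
import px

# Pull recently traced Kafka messages.
df = px.DataFrame(table='kafka_events.beta', start_time='-5m')

# Add the source pod from Pixie's tracing context.
df.source = df.ctx['pod']

# Keep only traffic seen by pods in the demo namespace (hypothetical filter).
df = df[px.contains(df.source, 'default/')]

# Select the columns to show in the view.
df = df[['time_', 'source', 'req_cmd', 'req_body', 'resp']]

px.display(df)
```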
So this concludes our talk. Thank you very much. We'll take some questions.

Thank you very much for the talk. I see there is one question in the Q&A: is it possible to get the total number of TCP connections between the producers, consumers, and the broker cluster? It could be helpful for centrally spotting heavy Kafka Streams topologies. So, is it possible to get the total number of TCP connections between producer/consumer and the broker cluster? If you open the Q&A tab, you can see the question as well. And Anton, we cannot hear you.

Sorry. I can answer and then Ryan can complete it, right, Ryan? The answer is yes, it's possible, because you have access to all the TCP connections open between the brokers and the clients. The thing is how to filter that information, and right now that's a bit more challenging, but by modifying the PxL scripts it would be possible. You can, for example, select a specific consumer: because Pixie has access to the internal Kafka protocol, it can filter those consumers and map them to specific TCP connections. So it should be possible, but it's not provided out of the box right now, as far as I know, by Pixie.

Yeah, we don't have a view for explicitly seeing the total number of TCP connections, but it is definitely possible with PxL scripts.

And what tooling and frameworks are you using to define the BPF probes? We use BCC.

Okay, great. Thank you for your answers. Are there any more questions? It seems no questions are coming. So if you would like to discuss further, you can move to Discord, or you can also use the virtual platform, Work Adventure. So feel free to go there and discuss the related topics or anything else. Thank you again for your presentation. And that's all for now from my side. Thank you very much.