Hello, I'm Kornilios and I will be talking about measuring Kubernetes network performance. I'm a software engineer at Isovalent, where I work on Cilium, a networking, observability, and security platform for container environments such as Kubernetes. Cilium is implemented by making extensive use of eBPF, but this talk is not about Cilium or eBPF. In the Cilium team we spend a lot of time evaluating network performance, identifying bottlenecks, and fixing them, so in this talk I would like to offer some practical advice for practitioners who also want to evaluate the network performance of their Kubernetes clusters.

I will start by discussing potential motivations for doing so. Performance evaluation is really useful for confirming or disproving assumptions. For example, it can be used to verify that using more expensive machines with better network cards will actually improve application performance, and, going even further, that the performance benefit applications will get actually justifies the additional cost of acquiring the new hardware. On a related topic, evaluation can be used to offer concrete and objective arguments for making technical choices. For example, if an organization wants to select a cloud provider and network performance is a factor in this decision, evaluation can be used to quantify the trade-offs between the different cloud providers, ideally with experiments that reflect the specific needs and workloads of that organization. And finally, evaluation can help you better understand your platform. For example, if there are applications that exhibit random delays in communication, you can use evaluation to figure out whether this is a network problem or whether something else is going on. Generally speaking, playing with benchmarks is a great way to understand your infrastructure.

The problem is that benchmarking is a really hard process, and Kubernetes network benchmarking even more so. The reason is that there are many different layers involved in making, for example, two Kubernetes pods talk to each other. First, there is the network itself, which in many cases has an unknown topology and might be shared by multiple users with different workloads. This means that performance might vary over time or even depending on where a machine is located. Furthermore, the NIC, or network interface card, which is the piece of hardware that connects a machine to the network, is a very complicated device, and it includes a number of optimizations that are responsible for providing good performance. Just to give some examples, these include multiple NIC hardware queues with receive-side scaling, protocol checksum offload, hardware segmentation offload, interrupt coalescing, and a bunch of others. All of these optimizations need to work properly in order to get the best performance out of your Kubernetes network. Another factor is that the host machine might not be a physical machine but a virtual machine, a VM, which shares the node hardware with other virtual machines; this introduces additional complications when trying to evaluate performance. Moving further up the stack, there is the host network stack, which includes everything from the NIC driver to the protocol processing stack.
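As a side note, here is a minimal sketch of how some of the NIC-level features mentioned above can be inspected on a node. It assumes ethtool is installed and that the device is named eth0 (an assumption; the exact feature names and which queries are supported depend on the driver):

```python
#!/usr/bin/env python3
"""Print NIC driver info, offload features, queue counts, and coalescing
settings. Assumes ethtool is installed; the device name defaults to eth0,
which is an assumption and may differ on your system."""
import subprocess
import sys

def run(cmd):
    # Run a command and return its output; some ethtool queries are not
    # supported by all drivers, so failures are reported instead of raised.
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except subprocess.CalledProcessError as e:
        return f"(not supported: {' '.join(cmd)}: {e.stderr.strip()})\n"

dev = sys.argv[1] if len(sys.argv) > 1 else "eth0"
print(run(["ethtool", "-i", dev]))  # driver and firmware version
print(run(["ethtool", "-k", dev]))  # offloads: tso, gso, gro, checksumming, ...
print(run(["ethtool", "-l", dev]))  # hardware queue (channel) counts
print(run(["ethtool", "-c", dev]))  # interrupt coalescing settings
```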
The host network stack is also a layer with a lot of different optimizations, such as polling, segmentation offload implemented in software, and fast packet forwarding between different devices, which is really important for Kubernetes networks. Again, all of these optimizations are needed to guarantee good performance, and how they interact with the hardware or with the other layers is not always obvious. And finally, there is the piece of software that implements the Kubernetes network abstractions, so things like pods and services, load balancing, security policies, and so on. This piece of software uses mechanisms such as connection tracking or address translation that complicate the packet processing path. So, to summarize: the hardware and software stacks that implement a modern Kubernetes network are extremely complicated, and this makes performance evaluation a really challenging process.

Now that I've talked about the challenges, I'll talk a bit about the evaluation process from a high-level perspective. I find it useful to think of this as a series of steps, starting with setting the goal for the evaluation, then defining the setup, which includes things like the workload, the metrics we want to monitor, and, generally speaking, the system configuration. Once we have defined the setup, we can run the experiment and get results that we can evaluate. An important thing here is to treat this as an iterative process, where you evaluate results and then go back and do further experiments to refine and improve your understanding, rather than something you do once and are done with.

Now, a bit about setting an evaluation goal. While this might sound simple, it is really important that each evaluation strives to answer a question, and the more specific the question, the better. It's even better if the answer to the question will lead to specific decisions that move you forward toward whatever your goal is. Another thing that I find useful is to set an expectation for the results of the evaluation. This expectation might be verified or not, but I find it really useful to think about these sorts of things proactively.

During this talk I will mention a couple of pitfalls that I think people should keep in mind when trying to evaluate Kubernetes network performance. One of them is that it's really easy to get bogged down in details due to complexity. As I've mentioned before, the hardware and software stacks are really complicated, and it's easy, and sometimes also interesting, to try to figure out exactly what's going on at every different layer. It's really important to focus your efforts and keep things as simple as possible so that you can get meaningful results and then iterate and refine your understanding. So this is something to watch out for.

Now, I will give some examples of what I mean by a question that an evaluation should try to answer. An example of such a question might be "how does Cilium perform?", but this is not a very useful question because it doesn't really say much about how to design the experiment or what to do with the answer. Instead, if I ask "what is the performance overhead introduced by Cilium for two pods exchanging TCP messages?", that is a much better question and helps guide the design of the experiment.
One big part of benchmarking networks is the workload, and specifically the network traffic that we apply in the experiment. This includes different dimensions. For example, it includes the communication pattern; it includes the protocol, which can be UDP or TCP, or even a layer-7 protocol such as HTTP; but also which endpoints talk to each other: is it pod to pod, pod to host, pod to service, or what have you?

Another common pitfall that is really important to keep in mind is that while benchmarks with artificial workloads are definitely useful, and I'm going to talk more about this sort of benchmark in the subsequent slides, they do not paint the full picture in terms of production workloads. In production, things are much more complicated. So when doing these artificial-workload benchmarks, there is a delicate balance between keeping things simple so that you can make assessments, and not getting too carried away in terms of the expectations you form about actual production workloads.

I'll now talk a bit about different communication patterns. To make the examples more concrete, I'll talk about pod-to-pod communication using TCP, but many of the patterns I describe here can be extended to other protocols and other situations as well. Probably the most popular benchmark is the one where one endpoint, a pod in this case, tries to push as much traffic as possible to another pod. In this case, traffic goes in only one direction, at least the user traffic; there are also acknowledgment packets going the other way, but they are not part of the user traffic. The metric for this benchmark is how many bytes we can push per unit of time, for example bytes per second. There are many optimizations that target this particular workload, such as segmentation offload, interrupt coalescing, and also things like jumbo frames. While this is one of the most common benchmarks people tend to use, it's useful to keep in mind that it might not be representative of the workload that your applications have. If your applications frequently exchange large files, then sure, this is a very useful workload to study; but if your applications exchange small messages, as many microservice applications do, then this is probably not a good workload for understanding the performance of your network.

Instead, for applications that exchange messages, a request-reply pattern is much better. In its simplest form, this is a pod sending a message to another pod and waiting for the reply, then sending another message, and so on. We do this a number of times, record the round-trip latency of these requests and replies, and then we can compute meaningful statistics on them, such as the median latency or even the tail latency, for example the 99.9th-percentile latency. One thing to keep in mind here in terms of optimizations is that many of the optimizations we discussed for the streaming benchmark not only do not help in this case, but can actually be detrimental.
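As a rough illustration, here is a minimal sketch of how these two patterns map onto netperf invocations. It assumes a netserver instance is already running in the target pod, that the server IP below is a placeholder you fill in, and that netperf is recent enough to support the `-o` output selectors:

```python
#!/usr/bin/env python3
"""Minimal sketch: run a streaming and a request-reply benchmark with netperf.
Assumes netperf is installed in the client pod and a netserver is listening
at SERVER_IP; SERVER_IP is a placeholder you must fill in."""
import subprocess

SERVER_IP = "10.0.0.2"   # placeholder: IP of the pod running netserver
DURATION = "30"          # seconds per test

def netperf(test, selectors):
    # -t selects the test type, -l the duration; after `--` come
    # test-specific options, here `-o` to choose which metrics to print.
    cmd = ["netperf", "-H", SERVER_IP, "-t", test, "-l", DURATION,
           "--", "-o", selectors]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Streaming benchmark: one-directional bulk transfer, reported as throughput.
print(netperf("TCP_STREAM", "THROUGHPUT,THROUGHPUT_UNITS"))

# Request-reply benchmark: round-trip latency statistics (microseconds)
# plus transactions per second.
print(netperf("TCP_RR",
              "MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P99_LATENCY,TRANSACTION_RATE"))
```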
From a high-level point of view, the thing to keep in mind is that it's not uncommon for optimizations aimed at bandwidth and throughput to hurt performance for latency-critical workloads, and vice versa. Whether an optimization helps depends, in many cases, on the type of workload that you apply to your network.

A generalization of the request-reply workload is, instead of one, to have multiple requests in flight. This is sometimes referred to as the burst parameter, and in this example I'm using five, which means that at any point in time there are five requests or replies in flight. Using this variation, we can now start thinking about throughput in terms of requests per second, in addition to the latency that we discussed before. And again, now that this is a more throughput-oriented benchmark, other optimizations come into play, such as polling, for example.

This particular benchmark allows you to draw a graph that I find very interesting, which is this one. The specific numbers don't really matter that much; this is from an experiment I did about a year ago, or maybe less. The idea is that on the x-axis you have throughput and on the y-axis you have latency. In this case I'm using median latency, but you can also use tail latency, for example. What we do here is run different experiments for different values of the burst parameter; in this particular example the burst sizes increase in powers of two. The interesting part, as we plot the different points, is that there are different areas with very different behavior. In the beginning, we see that as we increase the burst size, throughput increases without affecting latency. But after a point, which here is around a burst size of 128, we see that adding more packets in flight saturates the system: throughput does not increase much anymore, and latency rapidly increases. This is interesting because it's really important to know which part of the curve your benchmark or your applications are in, and what the point is where your system saturates.

So far I've talked about basic point-to-point communication patterns, but there are also many reasons why you would want to scale your workloads. One specific reason is that NICs are getting faster while cores are not. This means that for fast network cards, such as 100-gigabit Ethernet devices, a single TCP stream, that is, a single core, cannot saturate the link, and stressing the system requires exercising multiple CPUs with multiple streams. In the graph here you can see that a single stream can achieve about 20 gigabits per second. Again, the specific numbers don't matter so much; different configurations or different hardware will produce different numbers. But the basic idea is that if you want to stress the system, you need multiple CPUs, and this can be done by using multiple processes, multiple threads, or even multiple pods, depending on exactly what you want to do. In addition to that, you can also think about using multiple node pairs for evaluating the full cluster, rather than just two nodes. This makes the experiment much more complicated, because you have to track multiple things, but it's also really useful because it's a much more realistic workload, and you can start evaluating things like quality of service and fairness.
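Going back to the throughput-versus-latency curve, here is a minimal sketch of how such a burst sweep could be scripted with netperf. It assumes netperf was built with burst-mode support (--enable-burst) and that a netserver is reachable at the placeholder address below; it prints one (transaction rate, median latency) pair per burst size. The same idea extends to the multi-CPU case by launching several such netperf processes in parallel, one per stream.

```python
#!/usr/bin/env python3
"""Minimal sketch of a burst sweep with netperf TCP_RR.
Assumptions: netperf is built with --enable-burst, a netserver runs at
SERVER_IP (placeholder), and each test runs long enough to be stable."""
import subprocess

SERVER_IP = "10.0.0.2"   # placeholder: pod running netserver
DURATION = "30"

for burst in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    cmd = ["netperf", "-H", SERVER_IP, "-t", "TCP_RR", "-l", DURATION,
           "--",
           "-b", str(burst),                       # extra transactions in flight
           "-o", "TRANSACTION_RATE,P50_LATENCY"]   # throughput and median latency
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # The last non-empty line holds the comma-separated values requested with -o.
    values = out.strip().splitlines()[-1]
    print(f"burst={burst:>4}  rate,median_latency = {values}")
```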
I've already talked a bit about metrics, but I've mostly mentioned basic performance metrics: things like throughput, whether in bytes per second or packets per second, and also latency. One thing to note here is that tail latency, that is, the worst-case latency, becomes really important as scale grows, and there is an interesting paper from Google folks called "The Tail at Scale" that you can look at for more information.

In addition to basic performance metrics, when doing an evaluation you should also consider overhead metrics. This includes things like CPU utilization or even memory utilization. To give an example of why this is useful, there might be an optimization, or a different system, that achieves a 1% performance improvement at the cost of doubling CPU utilization. If you consider the overhead, this is not necessarily a very attractive optimization. So it's also worth keeping in mind what the overhead of a certain approach is.

Another set of metrics are those that have to do with performance interpretation, or identifying bottlenecks. To give some examples, one thing that you can measure is how the network card interrupts are distributed across the different cores, so that you can make sure the traffic is actually balanced. There are also things like BPF flame graphs, which give you a really nice way of looking at where time is spent when the kernel processes packets. And there are many BPF-based tools that can help you figure out specific things, such as the size of the packet buffers processed by NIC driver functions.

With respect to system configuration, the basic goal is to control the environment. To give an example, if you want to do a pod-to-pod benchmark, you need to ensure that the pods are scheduled on different nodes, because otherwise it's a very different situation. Generally speaking, the system configuration should try to isolate the system to avoid unwanted interference. For example, when we benchmark Cilium, we try to remove all unwanted interference: we run on bare metal, we disable power-management features in the kernel, we disable core frequency scaling, we pin NIC interrupts to specific cores, and we do a bunch of other things, just so that we can focus on Cilium's performance and not external factors. Obviously, these specific decisions might not be appropriate for the goal that you have set. This is a process where, given your specific goal, you figure out the most important things you need to focus on and configure the system so that you exclude everything else that might mess up your experiment.

When the time comes to actually perform the experiment, I would definitely advise you to record system information in addition to the results. This includes the hardware and its configuration, for example the number of CPUs, the NICs and their configuration, memory, and things like that, but also the software side. For every software component that you feel is relevant, record its version and configuration; this includes things like the kernel, the Kubernetes version, or the CNI that you are using.
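To give a feel for what recording system information can look like, here is a minimal sketch that snapshots some of the hardware and software details mentioned above alongside the results. The NIC name eth0 is an assumption, and ethtool and kubectl must be available where the script runs:

```python
#!/usr/bin/env python3
"""Minimal sketch: snapshot system information alongside benchmark results.
Assumes it runs on the node under test with ethtool and kubectl available;
the NIC name (eth0) is an assumption and may differ on your system."""
import json
import subprocess

def run(cmd):
    # Capture command output; record the error text if a tool is missing
    # or a query is unsupported, instead of failing the whole snapshot.
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError) as e:
        return f"unavailable: {e}"

NIC = "eth0"  # assumption: adjust to the interface you are benchmarking

snapshot = {
    "kernel": run(["uname", "-r"]),
    "cpu": run(["lscpu"]),
    "memory": run(["cat", "/proc/meminfo"]),
    "nic_driver": run(["ethtool", "-i", NIC]),
    "nic_features": run(["ethtool", "-k", NIC]),
    "interrupts": run(["cat", "/proc/interrupts"]),
    "kubernetes": run(["kubectl", "version"]),
    "nodes": run(["kubectl", "get", "nodes", "-o", "wide"]),
}

# Store the snapshot next to the benchmark results for later reference.
with open("sysinfo.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```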
In addition to recording system information, running just one experiment is typically a bad idea, because you don't really know whether the result you got is representative. You are strongly advised to increase your confidence in the results by performing multiple experiments, I would argue at least three, if not more, just to make sure that what you're seeing is actually something that happens consistently.

To give you a war story about how we do network benchmarking in Cilium: during the 1.9 release cycle, we did a very extensive performance evaluation, and one of the things we found was that native routing, as opposed to tunneling, had reduced performance. This was unexpected, because tunneling, for example, adds encapsulation, so we would expect native routing to perform at least as well as tunneling. The way we tried to address this was to figure out whether the problem was in Cilium or in something else. What we did was replicate the setup, but this time without Cilium, so we did everything without the eBPF datapath that Cilium introduces and ran the same benchmark. It turned out that the performance problem was still there, which led us to conclude that the issue was in the veth (virtual ethernet) device handling rather than in Cilium itself. Once the issue was pinpointed, members of our team contributed modifications to the Linux kernel that address it, so if you have a recent enough kernel and a recent enough version of Cilium, this problem is no longer there.

I also want to talk a bit about tooling. There are many tools you can use for benchmarking. My personal preference is netperf, which is an old tool, and maybe not very user-friendly by today's standards, but it's very powerful, and it can implement all the different workload patterns I mentioned before, and many more. We also developed our own tool, called kubenetbench, which uses netperf underneath, but deals with things like spawning pods in a Kubernetes cluster, making sure that they run on different nodes, and some automation, for example automatically gathering configuration data, or even the raw data needed for BPF flame graphs. There is also a wide variety of BPF tools that you can use to understand performance, and I highly suggest looking at Brendan Gregg's resources for that; both the book and the web pages are really great, and they can help you use BPF to better understand performance. Another utility that we frequently use is perf, which is very common and popular among developers. It's also not very user-friendly, but its documentation has been improving over the last years.

Finally, some additional resources for people who want to dig deeper into performance evaluation. One of them is a book, again by Brendan Gregg, about systems performance, which discusses all the different aspects of systems, not only networking. Another book is a bit older, and maybe not so relevant these days, but I still find it useful, at least in terms of the principles it discusses; that's The Art of Computer Systems Performance Analysis. And finally, if you want to understand how the internals of the kernel work, whether with respect to networking or something else, I highly recommend the kernel articles on LWN, Linux Weekly News.

To conclude, I'd like to summarize with three takeaway messages that I hope people will consider from this presentation. The first is to design your benchmarks so that they answer a specific question.
The second is to start simple and iterate. And the third is more generally about performance. I would like to thank my colleagues, Daniel, Gilbert, Paul, and the rest of the Cilium team, and also mention that you can find us on the Cilium Slack, where we can talk about benchmarking and performance. Thank you very much for your attention, and I'll be online to take your questions.