Hi there, welcome to our talk. Kubernetes is experiencing exponential growth, but the benchmark tools to evaluate Kubernetes infrastructures are still catching up. Do you wonder how to evaluate and compare the performance of two Kubernetes platforms for a target use case? Do you wish there was one tool that you could point and shoot at a Kubernetes cluster to get all the key performance metrics summarized in a nutshell? Well, then you are at the right place. In this talk, we present Kbench, a framework to prescriptively benchmark a Kubernetes platform. I am Karthik Ganesan and my co-presenter is Yong Li.

When it comes to performance, there are multiple aspects to look at. Some aspects relate to the control plane. First is responsiveness: how long does a target platform take to respond to my change requests? Next is scalability: how well does the control plane of this platform scale? Say I want to deploy 1,000 pods concurrently; can this platform handle that? And then, how resilient is this platform, and how well does it recover from failures? On the other hand, when it comes to the data plane, or application performance, there are basic things like the core infrastructure capabilities: how powerful are the CPUs, and what are the I/O and network capabilities of the underlying infrastructure? These can have a significant impact on application performance. Say one has a network-intensive application and wants to know how it is expected to perform on a particular platform, or wants to compare multiple platforms for this use case to choose one. Then come aspects like resource efficiency: can I pack more pods on one platform than on another without a performance impact? And lastly, performance isolation characteristics are becoming more and more important: how will my performance be affected when there are noisy neighbors?

There is a multitude of small tools to evaluate some of these aspects individually. Some are reliable, some are not; some are accurate, some only provide a very coarse-grained picture. And it can get quite tedious if one wants the full performance picture for a particular platform. Having gone through some of these pain points, we rolled up our sleeves and started putting together a versatile framework that lets us benchmark all these various aspects with ease. That is how we got to Kbench.

So what is Kbench? Kbench is a configurable framework to prescriptively deploy and manipulate Kubernetes objects. In the process, it lets us benchmark both the control plane and the data plane performance aspects with ease. For example, on the control plane side, you can ask Kbench to deploy 1,000 nginx pods concurrently and observe pod startup latencies at millisecond granularity. On the data plane side, you can deploy a persistent volume and record the synchronous read bandwidth to it from a pod. Kbench has an extensible design and provides a multitude of configuration options, so one can orchestrate pretty much any workflow with ease using your target workloads. Each use case in Kbench is represented by a simple, prescriptive configuration file. At the end of your runs, Kbench provides an intuitive set of performance metrics along with detailed diagnostic data that can be very useful for resolving performance issues. So let's take a closer look at the features offered by Kbench, separately for the control plane and the data plane.
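To make the idea of a prescriptive configuration file concrete before diving into details, here is a minimal hypothetical sketch of what a run description for the nginx example might look like. All field names here (Timeout, Operations, Act, and so on) are illustrative assumptions for this writeup, not necessarily Kbench's exact schema; the project documentation has the authoritative format.

```yaml
# Hypothetical sketch (field names illustrative): create nginx pods
# concurrently and let Kbench record per-pod startup latencies.
Timeout: 300               # abort the whole workflow after 300 seconds
Operations:
  - Pods:
      Count: 1000          # how many pods to create
      Concurrency: 100     # assumed knob: how many creations run in parallel
      Actions:
        - Act: CREATE
          Spec:
            Image: nginx
            ImagePullPolicy: IfNotPresent
```

At the end of such a run, Kbench would summarize the pod startup latency metrics discussed later in the talk.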
For the control plane, Kbench provides accurate, fine-grained critical-path breakdowns of the startup latencies for your workloads: think pod scheduling latency, pod initialization latency, startup latency, et cetera. But consider this: Kubernetes is a declarative system where you specify the target state of the cluster, and the cluster eventually reaches that state through a series of asynchronous events. So what is the latency of my triggered change? It is certainly not how long kubectl took to submit the request; it is how long the cluster took to reach the final state I desired. So to measure the latencies of triggered changes, we need to keep track of the various events in the system until the cluster reaches the target state. In Kbench, we use a novel methodology that keeps track of both the client-side and server-side timestamps of all these events to tackle this problem and provide meaningful control-path latencies for you.

For the data plane, one can evaluate application performance using real containerized benchmarks in Kubernetes pods. While one can bring their own workloads, they can also leverage the built-in workloads provided by Kbench to stress specific infrastructure resources. For instance, you can use Redis Memtier, which comes pre-built in Kbench, to stress the compute and memory aspects, or the FIO benchmark to stress the I/O aspects, et cetera. Using these artifacts, one can scale up or scale out resource usage to study infrastructure performance: scale up, meaning use a single pod to increasingly stress a particular resource category, or scale out, meaning add more pods to the system to consume resources in that category. Kbench also includes built-in blueprints of workloads that take advantage of these benchmarks to evaluate different aspects of data plane performance, for example, what pod density a particular platform can achieve. Now, having given an overview of Kbench, Yong will discuss the different elements Kbench provides to construct a workload and evaluate the control plane aspects. Over to you, Yong.

I'm happy to be here presenting the Kbench control plane features. First, let's start with some basic terminology. When we talk about Kbench actions, we are referring to Kubernetes object lifecycle and DevOps operations such as create, delete, list, and scale. Actions can be specific to resource types: you can create any type of resource, but you can only scale certain types of objects, like a Deployment. Actions run with action- and resource-type-specific options. For example, if you are creating a Service, you can specify the service type and ports as options; if you are scaling a Deployment, you can specify the number of replicas as an option. Multiple actions that run one after another on the same resource object form an action chain. An action chain is useful when you want to define a list of actions that depend on each other, for example, creating a set of pods before updating and then deleting them. On top of actions, we define operations: an operation in Kbench contains a collection of action chains, where each action chain is executed for one particular resource type. Action chains for different resource types in one operation run in parallel, as sketched below.
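As a concrete illustration of these terms, here is a hypothetical sketch of a single operation containing two action chains, one per resource type. As before, the field names are assumptions for illustration rather than Kbench's exact schema.

```yaml
# Hypothetical sketch: one operation, two action chains.
# Chains for different resource types (Services vs. Deployments)
# run in parallel; actions inside one chain run sequentially.
Operations:
  - Services:
      Actions:
        - Act: CREATE
          Spec:
            ServiceType: ClusterIP   # service-specific option
            Ports: [80]
    Deployments:
      Actions:
        - Act: CREATE
          Spec:
            NumReplicas: 2           # deployment-specific option
        - Act: SCALE
          Spec:
            NumReplicas: 8           # scale the same deployment
        - Act: DELETE
```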
We also define predicates in Kbench, which are conditions under which a particular operation is triggered. There are many different types of predicates: you can define a predicate that checks the status of a running object, a predicate that runs a command and checks its return status so that the operation only runs under certain scenarios, or even a predicate that examines the runtime environment inside a container. Kbench also attaches a set of metadata labels before running your workload, such as the operation and thread ID that manipulated a given set of objects, and you can later refer to those metadata labels for filtering and selection purposes. Of course, you can also pass in your own custom user-defined labels and use them for selection later.

Now let's look at the overall design of the framework. Picture a running Kubernetes cluster with a bunch of objects. Kbench accepts a Kbench config file along with standard Kubernetes YAML files. The config parsing and dispatch logic is responsible for parsing those config files, and depending on which resource types you configure to measure and benchmark, it generates a list of resource managers. Each resource manager handles the lifecycle, events, and resource metric collection for one particular resource type; for example, the pod manager is responsible for managing pod resources. For each resource type, you can specify which actions and operations to run. Here, for example, for pods you define a list of actions to first create a pod, then update it, then run a command inside it; for other resource types, you can define entirely different action lists. You can put them all together in one operation: the actions for different resource types run in parallel, while the action chain for one particular resource type runs sequentially. You can also specify concurrency, and the resource manager will check your concurrency configuration and maintain a thread pool of the appropriate size for that resource type. You can use Kbench's pre-attached labels and your own custom-defined labels for resource selection, and you can define predicates to guard your operations. All of this comes together to form a very flexible execution plan, where you define a workflow to be executed on certain resource types, at a certain concurrency, under a given condition, against pre-selected resources. In addition, Kbench exposes a container interface to outside users, so users can run commands and micro-benchmarks inside containers and measure their performance. We also integrate Prometheus and Wavefront for resource monitoring and triage purposes. The framework is based on client-go and can run on different platforms, such as the vSphere, OpenShift, and GKE Kubernetes offerings.

Let's look at some example resources, actions, and configuration options. For pods, you can perform create, list, get, run, copy, update, and delete actions, and for the different actions, as mentioned, you can configure different options. For all actions, you can specify the concurrency and the sleep time, which is the time in seconds to sleep after each action. For create, you can specify the image pull policy, the image to use, and the YAML spec. For the run action, you can specify the command to run. For copy, you can specify the paths, that is, the local path and the container path, et cetera. And for other actions, you can specify labels, label keys, and label values, et cetera. A sketch of such an action chain follows below.
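Putting a few of these options together, here is a hypothetical pod action chain in the same illustrative schema: a create with an image pull policy, a copy with local and container paths, a run with a command, and a label-selected delete. Field names remain assumptions, not the exact Kbench format.

```yaml
# Hypothetical sketch of a pod action chain with per-action options.
Pods:
  Concurrency: 4            # thread pool size for this resource type
  SleepTime: 5              # assumed knob: seconds to sleep after each action
  Actions:
    - Act: CREATE
      Spec:
        Image: redis
        ImagePullPolicy: IfNotPresent
    - Act: COPY
      Spec:
        LocalPath: ./run.conf          # file on the Kbench client machine
        ContainerPath: /etc/run.conf   # destination inside the container
    - Act: RUN
      Spec:
        Command: "redis-benchmark -n 10000"
    - Act: DELETE
      Spec:
        LabelKey: kbench-run           # select target objects by label
        LabelValue: demo
```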
For Deployments, in addition to all the above actions, Kbench also supports the scale action, and the create action has some additional options, like the number of replicas. For ReplicationControllers and StatefulSets, we provide a similar set of actions and configuration options. Keep in mind that Kbench also supports other resource types like Namespace, Service, ConfigMap, et cetera.

Here's a use case example laid out as a configuration file. You want your workflow to run for no more than 60 seconds; that's the timeout at the top. Below it, you define a list of operations. The first operation is guarded by a predicate that checks the running status of your pod initializer. After that, you define the body of the first operation: here, you create Deployments from a given YAML file at a concurrency of four, and in parallel with that, you create pods, run a shell script inside them, and attach labels, at a concurrency of two. After the first operation, you can define more operations, which run one after another. A sketch of this configuration follows at the end of this section.

Kbench produces SLI-compliant API metrics: it reports the 50th, 90th, and 99th percentile API latencies and API response times. For certain operations, it also breaks down and reports fine-grained critical-path components; for pod creation, for example, it reports the scheduling, initialization, image pulling, and on-node startup latencies. It has improved accuracy compared to some existing benchmarks. Currently, the Kubernetes server side gives very coarse-grained timing information, at one-second granularity, which is not sufficient when you want to study metrics at the millisecond level. So we rely on both server-side and client-side timing information: we use an event callback mechanism that checks the reason and condition under which each event is triggered, and measures and records timing information accordingly. For example, here we want to study the pod scheduling latency at different concurrencies. An existing benchmark just gives us all zeros, because pod scheduling is a very short-running operation; Kbench, on the other hand, reports detailed client-side metrics at millisecond granularity for each concurrency. This is pretty much it for the Kbench control plane features, and now I'm going to hand it over to Karthik to talk about the data plane features.
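Here is the use case described a moment ago, rendered in the same illustrative schema: a 60-second timeout at the top, and a first operation guarded by a predicate that waits for the initializer pod to be Running, which then creates Deployments at concurrency four while, in parallel, creating pods, running a shell script, and attaching labels at concurrency two. The predicate and field names are assumptions for illustration, not the exact Kbench schema.

```yaml
# Hypothetical sketch of the example workflow (illustrative schema).
Timeout: 60                    # the whole workflow times out after 60 s
Operations:
  - Predicate:
      Resource: pod
      LabelKey: app
      LabelValue: initializer
      Expect: Running          # trigger only once the initializer pod runs
    Deployments:
      Concurrency: 4
      Actions:
        - Act: CREATE
          YamlSpec: ./deployment.yaml   # reuse a standard Kubernetes YAML
    Pods:
      Concurrency: 2
      Actions:
        - Act: CREATE
          Spec:
            Labels: {kbench-run: demo}
        - Act: RUN
          Spec:
            Command: "/bin/sh /scripts/warmup.sh"
  # ...subsequent operations run one after another
```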
Thanks, Yong. Let's take a closer look at benchmarking the data plane using Kbench. With the rich container interface that Kbench offers, one can orchestrate pretty much any workflow. Say you have a use case in mind and want to evaluate how it will perform on a target Kubernetes cluster; Kbench can help you with that. You can start with the create operation, with which you deploy your Kubernetes resources with your target containers and assign them labels; you can simply reuse the YAML spec of any Kubernetes artifact you might have. Labeling these resources helps us filter and select the specific resources on which we want particular actions to be performed, so we can orchestrate different workflows with ease. Next, you can copy any workload artifacts or workload-specific configuration files for the current run into your containers, using the labels to selectively choose the Kubernetes objects to which this action applies. Then you can run commands inside the pods to trigger your workflows. One of my favorite features of Kbench is the ability to trigger actions using conditional predicates. These predicates can be Kubernetes-system-based or evaluated inside a container. Say you have a server pod deployed and want to generate load against it from a client pod: you can wait until the server pod reaches the Running state, or even better, wait until the server process inside the container reaches a particular state, using these predicates. Afterwards, you can copy your results out of the containers back to the client from which you are orchestrating all this, and finally use the delete action to clean up your cluster.

Having looked at how to orchestrate a workflow using your own workloads in Kbench, let's take a look at some of the pre-integrated workloads that come with Kbench to stress different infrastructure dimensions. These pre-integrated workloads are ready to use out of the box. As you can see in this table, there is Redis Memtier, which can be used to stress the CPU and memory aspects; it gives you the aggregate transaction throughput across all the pods in a cluster, as well as transaction latencies for your deployments. With FIO, you can get read/write bandwidths for various read/write ratios and block sizes on ephemeral and persistent volumes. IOping is integrated with Kbench to measure I/O latency on ephemeral and persistent volumes. iperf3 provides inter-pod TCP and UDP bandwidth information, with blueprints that vary pod placement across nodes, zones, regions, et cetera, to give you more specific information. And qperf is integrated into Kbench to provide inter-pod network latency, again with blueprints that orchestrate it under varying pod placements; a placement sketch follows below.

Typically, the end results, the generated performance metrics, only paint the final picture. If you notice an anomaly or a performance issue, that data alone is not enough; analysis and iterative improvement need deep infrastructure diagnostics. Kbench provides support for injecting performance and diagnostic data into dashboarding services like Wavefront and Grafana. This is done using distributed Telegraf data collectors, which can be configured with a supported output plugin to view results on the dashboarding platform of your choice. These Telegraf collectors are fed with thousands of handcrafted performance metrics that can be monitored for Linux and ESX hosts. This data can be invaluable when correlated with the actual performance results to deep-dive into performance issues and resolve them.

Having talked about the data plane features of Kbench, let's take a look at some example use cases where we have put Kbench to good use. Kbench was used extensively in evaluating and improving VMware's Kubernetes products. In this use case, we deployed a standard Java benchmark inside multiple Kubernetes pods and used Kbench to find the maximum cluster-level aggregate transaction throughput this Java workload could achieve on different Kubernetes clusters with the same hardware resources. Kbench showed how a virtualized Kubernetes cluster can actually beat the performance of a bare-metal cluster on the same hardware, and it enabled us to get deep insights into the root causes of these performance differences. Some of the results generated by Kbench were so impactful that they were used by VMware CEO Pat Gelsinger for Kubernetes product announcements in the opening keynote of our annual conference.
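The inter-node network blueprints mentioned above rely on standard Kubernetes pod placement controls. As a preview of how a blueprint like the one described next can force its two pods onto different nodes, here is a minimal plain-Kubernetes sketch of pod anti-affinity; the pod name, labels, and image are illustrative, not Kbench-specific.

```yaml
# Minimal sketch: schedule this pod away from other pods labeled
# app=netperf, forcing inter-node traffic for iperf3/qperf runs.
apiVersion: v1
kind: Pod
metadata:
  name: iperf3-client            # illustrative name
  labels:
    app: netperf
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: netperf
          topologyKey: kubernetes.io/hostname   # at most one such pod per node
  containers:
    - name: iperf3
      image: networkstatic/iperf3     # illustrative image
      args: ["-c", "iperf3-server"]   # connect to the server's service name
```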
Here is another example of a pre-integrated data plane blueprint, called DP-Internode. This blueprint automatically deploys two pods on two different nodes using anti-affinity rules. It uses a headless service to run iperf3 across the pods to provide TCP and UDP bandwidth information, and qperf to provide the network latency information. All these different blueprints, for both the control plane and the data plane, can be run as a suite to get all the key metrics in a nutshell.

In summary, we presented Kbench, a highly configurable and easy-to-use benchmark framework for evaluating Kubernetes performance. It can be valuable for competitive benchmarking across multiple platforms: you can identify performance issues, root-cause them using the diagnostics, and iteratively improve your platform's performance. Kbench is open source, and we would like to welcome everyone to use the tool, provide us feedback, and contribute to the project. We thank you for your time, and we are happy to answer any questions you might have.