Good morning and good evening to everyone. This is Rucheng Zhang, and this is joint work with my colleague, David Chen. We both work at Arm. Today, we will share some Kubernetes performance tests and tools on cloud instances. Our motivation has three points. First, as hardware platforms keep improving, with higher CPU performance, increased core counts, and better IP integration, cloud-native deployments can benefit a lot. Secondly, we want to explore performance across clusters of different sizes; here we use AWS Graviton 2. In particular, we'd like to figure out the performance bottlenecks of large-scale Kubernetes deployments. And here is the question: we don't have 1,000 or more physical nodes, so how do we test and collect performance data? Last, we'd like to analyze the performance difference between different architectures.

Let's start from the lifecycle of a pod. A pod is a set of containers, and it's the smallest unit that can be created or deleted in Kubernetes. On the left is a typical deployment config file; this config will create three httpd pods. On the right is the pod creation process. First, the user sends a create-pod request to the API server. Then the API server writes the information to etcd, which is the cluster's key-value store. Next, the scheduler notices that there are new pods and binds each pod to a node that fits its requirements. Then the kubelet notices that a new pod has been bound to its node and calls Docker or another container runtime to create the containers. At last, the kubelet updates the pod status to the API server, and the API server writes it to etcd. Pod termination is roughly the reverse of creation: the user sends a delete-pod request to the API server, the kubelet notices the deletion, performs it through the container runtime, and informs the API server at the end.

We divide the tests into two parts. The first part is workloads. Some typical workloads are listed here, such as web servers, middleware, and databases. We chose Nginx and Redis as the workloads; the bottleneck of a database is usually IO, so we didn't test any database. A lot of tools are available. Here we use sysbench to test CPU and memory, and for the workload tests we chose wrk and redis-benchmark. Both of them can measure multiple items, such as requests per second, average latency, and percentile latencies. The number of client threads can be adjusted to apply different levels of pressure. Last but not least, they are easy to use. We test two situations: one is with Kubernetes, where the workload is deployed in a pod; the other is without Kubernetes, where the workload runs directly on the OS. In both situations, the test client runs on another instance and sends requests to the workload service.

The other part is testing scalability. We chose ClusterLoader for the scalability part. ClusterLoader is the official Kubernetes scalability and performance testing framework, and it has a long list of supported providers. It is also decoupled from the Kubernetes source and comes with predefined performance test cases. In addition, it provides fine-grained control over the number of pods and nodes. There are also some other good tools; the most important reason we didn't choose them is that we need simulation, and ClusterLoader with Kubemark is the only option that can simulate a large cluster.

These are the instances we use. All the instances are on AWS, and we use two different architectures, ARM and x86. Both have the same vCPU count, the same amount of memory, and similar cost.
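To make the "with Kubernetes" setup a bit more concrete, the workload lives in pods behind a Service that the client instance can reach. The sketch below is illustrative only; the replica count, image tag, and Service type are assumptions, not our exact manifests.

```yaml
# Sketch of the "with Kubernetes" setup: the nginx workload runs in pods and is
# exposed through a Service so the client instance can reach it.
# Replica count, image tag, and Service type are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: NodePort        # assumption: how the client instance reaches the workload
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
```

In the "without Kubernetes" case, the same workload simply runs directly on the instance's OS and the client talks to it over the network in the same way.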
First of all, we test CPU and memory with sysbench. On the left is the CPU performance, single-core and multi-core. For the CPU test, sysbench calculates prime numbers with a given number of threads, and the result is events per second; we test two situations, one thread and all 64 threads. On the right is the memory copy rate: the memory block size is 8 kilobytes, and the total size is 64 gigabytes. As the graph shows, ARM is better on CPU and x86 is better on memory.

Here is the Nginx throughput. We test four situations: ARM with and without Kubernetes, and x86 with and without Kubernetes. The four lines on the right correspond to these four situations. We also test different levels of pressure by increasing the number of client threads from one up to five hundred. At the very beginning, the four lines show the same trend; before reaching the top, the instances are not yet at full load. When the thread count reaches a certain number, the throughput stops increasing, which means the instances are fully loaded. Then the lines fall, which means the instances spend too much CPU time on context switches, interrupts, and other overhead. From the graph, it's easy to see that, first, Kubernetes decreases throughput; second, ARM shows higher performance, thanks to its faster CPU, especially in high-concurrency scenarios; and third, the performance of the x86 instances starts to degrade first.

This graph shows the 50th-percentile latency. As the pressure goes up, the median latency of Nginx goes up too, which is not hard to understand: more requests arrive at the same time, and some of them have to wait until the others complete. From the graph we can see three points: first, Kubernetes adds extra latency; second, ARM performs better here; and third, all latencies are under 4 milliseconds, which is tolerable. The next graph shows the 99th-percentile latency, and it is similar to the 50th-percentile case, except that x86 performs better here, and all latencies are under 40 milliseconds.

The last one is the Redis test. redis-benchmark exercises some Redis operations, such as GET. As expected, Kubernetes also reduces Redis performance: by about 7% on ARM and about 17% on x86. To summarize: first, the ARM instances show better results under higher load, particularly in high-concurrency scenarios; second, the extra latency caused by Kubernetes is acceptable on both ARM and x86; third, Nginx on x86 with Kubernetes only reaches about half of its peak performance under high concurrency.

The next part will be presented by Dave. Hello, everybody. This is Dave Tran from Arm. Today, I'm going to introduce a couple of tools used for scalability testing in Kubernetes: the first one is ClusterLoader, and the other one is Kubemark. What is ClusterLoader? ClusterLoader is the official Kubernetes scalability and performance testing framework. We can use the framework to profile CPU and memory, sample the scheduling metrics, check the startup latency of pods, and so on. The tests are defined in a YAML file using declarative statements, for example how many nodes the pods will be scheduled to, or how many StatefulSets will be created in the cluster. Currently, ClusterLoader supports Kubemark, kind, local, and other providers. Kubemark is a tool that allows users to run performance tests on fake nodes. It simulates a node by creating a pod in which hollow kubelet and kube-proxy services are running. In this way, you can simulate a cluster with thousands of nodes by creating thousands of pods instead. The primary use case of Kubemark is scalability testing: as the simulated cluster can be much bigger than the real one, the purpose is to expose problems of the control-plane components, for example the API server, on a big cluster.
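As a rough illustration of how a hollow node works, it is just a pod whose containers run the kubemark binary morphed into a kubelet and a kube-proxy. The sketch below is heavily simplified; the image reference and flags are assumptions, and the real manifest carries kubeconfig secrets, resource settings, and more.

```yaml
# Conceptual sketch of a Kubemark "hollow node": one pod, with the kubemark
# binary pretending to be a kubelet and a kube-proxy. Image, tag, and flags
# are simplified assumptions; the real manifest is more involved.
apiVersion: v1
kind: Pod
metadata:
  name: hollow-node
  labels:
    name: hollow-node
spec:
  containers:
  - name: hollow-kubelet
    image: registry.k8s.io/kubemark:latest   # placeholder image reference
    command: ["/kubemark", "--morph=kubelet", "--name=$(NODE_NAME)"]
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  - name: hollow-proxy
    image: registry.k8s.io/kubemark:latest   # placeholder image reference
    command: ["/kubemark", "--morph=proxy", "--name=$(NODE_NAME)"]
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
```

Each such pod registers itself with the API server as a Node, so the control plane sees a thousand-node cluster while only a handful of real instances host the hollow-node pods.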
In our experiments, ClusterLoader combined with Kubemark is used to create a big Kubernetes cluster on top of multiple AWS Graviton 2 instances, and the CPU and memory profiling is done in this cluster. Here are some notes on ClusterLoader's configuration. Kubernetes documents the ports and protocols used by its components; be aware that ClusterLoader will access those ports to collect metrics and profiling data, so we need to open these ports accordingly. You might need to manually enable profiling for components like etcd or the controller manager, as it's possible that profiling is not enabled by default. etcd-specific settings, like the listening port or the certificate location, should also be specified to make sure the metrics data can be collected successfully. Finally, ClusterLoader has a test to measure the performance of binding a persistent volume to a pod, but that test assumes your provider supports dynamic provisioning, which is not always true. So you might need to either implement that feature yourself or disable persistent volumes in the config if you don't want to evaluate storage performance.

Basically, we will talk about three kinds of tests: pod startup throughput, density, and load. We check the metrics when 15% of the pods and 90% of the pods are scheduled, respectively. Firstly, let's go through the indicators we use here. schedule_to_watch is the time from when the pod is scheduled to when the event showing the pod is running is received. pod_startup is the time from when the pod is created to when the pod is running. run_to_watch is the time from when the first container starts to when the event showing the pod is running is received. There are other indicators as well, such as create_to_schedule and schedule_to_run.

Let's check the throughput first. This measurement gives the scheduling throughput. For example, we can try to schedule 2,000 pods onto 1,000 nodes and then collect the data to analyze the startup latency of these pods. The data here shows there is no major difference between the two platforms. The difference in schedule_to_watch is relatively big, but if we run the test multiple times, we see that it is still within the margin of error.

Density test. We try to schedule 30,000 pods onto 1,000 nodes, which means roughly 30 pods are scheduled onto each node. The resource we create is only a Deployment, and then we run CPU and memory profiling against each of the Kubernetes components. The scheduling metrics collect the metrics for the different scheduler plugins. We can see that bind is a possible bottleneck on both platforms: it takes nearly one hour to finish binding all the pods, which is not acceptable. The bind plugin interacts with the API server to bind a pod to a specific node, and the problem here is that in our experiment we only have one control-plane node. So the possible solution is: if your cluster is very big, you need to consider a hot-standby high-availability mode to spread the load of the API server across other control-plane nodes.
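For reference, the declarative test definition that drives a run like this is a YAML file along the following lines. This is a heavily simplified sketch; the names, QPS value, replica counts, and template path are illustrative assumptions rather than our exact configuration.

```yaml
# Heavily simplified sketch of a ClusterLoader2 test definition.
# Names, QPS, replica counts, and template paths are illustrative assumptions.
name: density
namespace:
  number: 1
tuningSets:
- name: Uniform5qps
  qpsLoad:
    qps: 5
steps:
- name: Start measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
- name: Create test objects
  phases:
  - namespaceRange:
      min: 1
      max: 1
    replicasPerNamespace: 1          # one Deployment per namespace
    tuningSet: Uniform5qps
    objectBundle:
    - basename: density-deployment
      objectTemplatePath: deployment.yaml   # template of the Deployment under test
- name: Gather measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
```

The deployment.yaml referenced here is just another manifest describing the Deployment to create, and the tuning set controls how quickly ClusterLoader submits objects to the API server.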
Load test. In the load test, we try to schedule even more pods onto each node. The difference compared with the density test is that in the load test we deploy DaemonSets, StatefulSets, and other kinds of Kubernetes resources onto each node, instead of just a Deployment. In other words, the density test is a simplified version of the load test. The data here is similar to the density test: again, there is not much difference between the two platforms. ClusterLoader also contains other tests that measure the functionality and performance of networking and storage, so you can play around with it to evaluate networking and storage as well.

Profiling results. ClusterLoader can collect profiling data for the scheduler, the controller manager, the API server, and so on, and we can export a call graph or a flame graph from this profiling data. The resource consumption and the method stack are easy to see in the call graph; in this example, we can see that the prioritize method itself costs close to 20 milliseconds, and its cumulative cost is about 110 milliseconds. Besides that, we can also generate a flame graph, where it's very easy to see how much CPU or memory is occupied by each method.

Here is one example of how to find potential issues based on the density test. We can list the top entries with the pprof command line, and we found that the preemption logic was being called. Preemption is not part of the normal pod scheduling process: it only happens when there is no node available for the pod in the cluster, either because of resource exhaustion or because the pod's topology constraints cannot be satisfied. But in this test, the framework creates 1,000 nodes and we only have 30,000 pods to schedule. The kube-scheduler has a plugin named topology spread, which tries to spread the pods evenly across the cluster, and one node can run 110 pods by default. So the preemption logic in the scheduler should not be triggered in this test at all. Based on this, we suspected that some change in the codebase had reduced the influence of the topology spread plugin, and eventually we found that this is a regression. The reason is that, in this specific case, when there are a lot of pods with no CPU or memory requests, the change in the source code gives the plugin named balanced resource allocation a higher impact on the final score. So a node that already has lots of pods with no CPU or memory requests ends up with a higher score, which leads to uneven pod distribution.
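To make the "pods with no requests" point concrete, what matters to that scoring plugin is simply whether the pod template declares resource requests at all. A minimal sketch of a pod that does declare them is below; the image and the values are arbitrary and only for illustration.

```yaml
# Illustrative only: a pod that declares explicit CPU and memory requests.
# The regression described above was triggered by large numbers of pods that
# omit the resources.requests section entirely; the values here are arbitrary.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-requests
spec:
  containers:
  - name: app
    image: nginx:1.21        # placeholder image
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
```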
Garbage collector. Going through the data collected from the load test, we found that the memory footprint of the garbage collector in the controller manager is a bit bigger than we expected; the peak value is around 16 megabytes. We can sort the sources with the pprof command line and check where the memory is being consumed. Based on the data collected by the Golang pprof, we find that about 18% of the memory is taken by updating the map defined in the graph builder. This is understandable if we dive into the source code and notice that the test creates different kinds of resources and then scales them up and updates them. For example, the test creates a Deployment whose replica count is defined as 250, which means 250 pods will be created for this Deployment. When the Deployment is updated or deleted, all the pods managed by it will be updated or deleted as well, and both operations trigger the garbage collection process and update the graph builder's map accordingly.

And last, let's take a quick look at the Golang runtime. This slide shows that the Golang runtime behaves a little differently, and looks a little better, on x86_64, if we trust the data collected by the Golang pprof.

Let's wrap up this presentation. In this presentation, we have compared Kubernetes performance between different architectures. We have observed that the Golang runtime might show some differences, but the impact is not that big, as we have seen from the scheduling metrics; software like Kubernetes doesn't seem to show much performance difference between the two architectures. If we don't have enough physical nodes, we can use ClusterLoader and Kubemark to simulate a large cluster with 500 or even 1,000 nodes for performance profiling. The next point is that if you are planning to use ClusterLoader, please handle the configuration carefully: the real environment is complex, and the default configuration might be tied to a particular provider, like GCE for example. We can collect various metrics and profiling data with these tools, and analyzing that data will help reveal potential issues which can only be found in a large cluster. And finally, let's improve the Kubernetes community together: upstream your enhancement PRs. Okay, thank you. This is our email address, so if you have anything you want to discuss, please email us. Thank you for your time. Bye-bye.