to manage 60,000 nodes in a cluster. That also means we run about 1.8 million containers, or pods in the Kubernetes world. We will show you how we achieve that. Of course, not everyone needs a cluster that large, but for our cases we need a large cluster to manage VMs, containers, and other worker nodes.

Here is the agenda we have prepared for our talk. I will start with the introduction and the background, and tell you why we are doing this and what the Arktos project is. Then Ying Huang will take over and talk about the changes and optimizations we made to Kubernetes to achieve 60,000 nodes in a cluster. She will also show some of the performance test results and analysis. Hopefully, at the end, we will have time to answer any questions you may have.

With that, let me give you some of the background: who we are, and what the Arktos project is. We are part of the cloud lab at Futurewei. Futurewei Technologies is an R&D organization that focuses on open research, open innovation, open standards, and of course open source projects. I am currently leading the cloud lab, and Ying Huang is a principal cloud architect in our lab. As the name of the cloud lab suggests, we focus on how to build, manage, and optimize open source cloud software, such as Kubernetes, OpenStack, and cloud container projects; we do a lot of work on containers.

One of the projects we started a couple of years ago is called the Centaurus project. Centaurus is a Linux Foundation cloud infrastructure project; we donated it to the Linux Foundation around December 2020, so it is about one and a half years old. Internally, we actually started the project at the beginning of 2019, so it is about three years old. Today, the Centaurus project consists of a group of sub-projects. The first one is called Arktos, and that is the project this talk is mostly about. Arktos is a large-scale compute cluster management system, and we will talk more about it. The second project is Mizar. Mizar is a cloud networking project based on eBPF and XDP technology, and we use it to support network configuration and network provisioning for Arktos. We actually have two sessions related to Mizar: one is tomorrow, and one is on Thursday. If you are interested in those two sessions, you can look them up in the schedule by the name Mizar. The other two projects in the Centaurus project group are for AI and for edge, and those are out of scope for this talk. But if you are interested in the Centaurus project as a whole, you can go to the website, centauruscloud.io, where we have a lot of information available for you. We also have a booth on the first floor, with a demo where we build a Centaurus cluster and manage VMs and containers.

So why are we doing the Centaurus project, and what is the Arktos project? This is mainly driven by the challenges with our current infrastructure platforms. Like many companies, we use customized OpenStack to manage VMs, and we use Kubernetes to manage and orchestrate container-based applications. However, we encountered scalability issues with both platforms.
There are two challenges that stand out and got us working on the Centaurus project. The first is that we want to manage clusters with more than 10,000 compute nodes, at a minimum. Today, our current platforms manage one to two thousand compute nodes per cluster. For our public cloud services, that means we have to deploy and manage a lot of OpenStack clusters, which is a high cost from a management perspective. The second challenge: a customer asked us to provision more than 5,000 VMs within three or four minutes, and we cannot do that with our current platform. That is where we got started: we need to build a new platform, or unify our platforms. That is how Centaurus began.

Secondly, we also want to unify our platforms: the platform that manages VMs, the platform that manages containers, and the platform that manages serverless. We want to combine them into one. Not only does that reduce cost, it is also driven by customer requests: they want to deploy hybrid applications with a single API, which we cannot do today with our current platforms. We currently have three platforms, one for VMs, one for containers, and one for serverless. When I say a hybrid application, I mean an application designed so that some components, for example the back end, run in dedicated VMs, and other components run as containers or as WebAssembly, which can run anywhere. Customers want to deploy such a hybrid application with one single API, in one shot, which we cannot do today. So that became a goal for the Centaurus project.

Additionally, we have customers who expressed that they do not want to manage Kubernetes clusters themselves. Kubernetes has become very popular, everybody knows it, but in our experience it has also become very complicated. They do not want to manage the cluster themselves, but they want to use the API, a declarative API that is very easy for customers to use, to manage and orchestrate their applications. For us, that means we have to build a truly multi-tenant Kubernetes and provide isolation between customers when multiple customers, multiple tenants, use the same cluster. Standard Kubernetes today does not really support that kind of isolation.

Last but not least, we are driven by our AI platform. For coordinated ML training, for federated AI training, whatever it is, they require a platform that manages nodes not only in the cloud but also on the edge, so that they can push their training jobs to the cloud or to the edge depending on the scenario and the application. That also became a goal for the Centaurus project.

So that is a highlight of why we are doing this project and some of the background. Now, let me introduce the Arktos project itself. As I mentioned, within the Centaurus project, Arktos is the compute project: a cloud infrastructure platform for managing very large compute clusters and orchestrating different types of applications. It is based on Kubernetes, we derived it from Kubernetes, but we made a lot of fundamental design and architectural changes to it.
The result of that is Arktos. The Arktos project is based on Kubernetes. I won't go through each bullet point here, because I will talk about each of these high-level changes in the following slides.

The first one is the architecture. Here is the well-known Kubernetes architecture that everybody knows. You have the API server; you have etcd, which stores all the objects created by the customers; you have a scheduler that decides which pod runs on which node; and you have a bunch of controllers running to finish your deployments or your workflows. And of course, you also manage a list of the nodes in the cluster.

We take this architecture and split it into two parts. The first part we call tenant partitions, or TPs. A tenant partition has its own API server and its own etcd running, but it only stores the tenant-related objects. It runs most of the controllers, like the replica set controller, jobs, daemon sets, stateful sets. That is what we call a tenant partition, and it is the partition that receives requests from customers to deploy and orchestrate their applications.

The second part of the split is what we call resource partitions, or RPs. A resource partition does not have any scheduler running, but it has its own API server and its own etcd, which only stores the resource-related objects, in this case mainly the nodes. It only runs the node controller within the partition, so it manages all the node information. All the node status is reported back to the API server of that resource partition, not to the TPs.

That is the high-level idea of the scale-out architecture: we split the cluster into these two kinds of partitions. Of course, there are many ways to increase the scalability of Kubernetes, but we took this approach because we want to build a platform that manages not only containers but also VMs, and is able to scale to more than 50,000 compute nodes. Ying Huang will talk in more detail about this design and the optimizations we made to it. At a high level it is actually very simple; of course, there is a lot of detail to figure out, and Ying Huang will share some of that detail with you.

The second change we made to Kubernetes, which is now part of Arktos, is multi-tenancy. The first thing we did was introduce the tenant concept: tenant objects and tenant controllers. Before any customer uses a shared Arktos cluster, we have to create and register a tenant for them. When you create a tenant, we create a logical space, and all the objects created by that tenant live in that tenant's tree of objects stored in etcd. This is how we isolate resources between tenants, so we provide resource isolation for the tenants. We also added a metadata field called tenant to almost all the objects in Kubernetes, including the pod. So when you deploy a pod or an application now, you have to specify which tenant it belongs to, and that distinguishes that pod from other pods.

Similarly, we introduced the network concept, a network CRD, and network controllers for network isolation. Most of you already know that the network design in standard Kubernetes is a flat network: every pod in the cluster can talk to any other pod.
By default, every pod can talk to every other pod because they are in the same IP range; of course, you can use network policies to specify which pods can talk to which pods. We changed that model and introduced network objects. We allow a tenant to create its own networks, and each tenant can create as many networks as it wants. Each network has its own IP address range and its own DNS server. When you create a pod, you now have to specify which network it belongs to, and the pod is placed in that network. By definition, different networks, and different tenants, are not in the same IP range, so they do not talk to each other. That is how we isolate pod communication at the network level, which provides stronger isolation than network policies. Of course, within each network, you can still use network policies to specify which pods can talk to which pods.

The third major change we made, to support a unified system, is what we call the unified runtime. We abstracted the pod concept further to include VMs. Now we have VM pods and container pods, and in the future we plan to support WebAssembly, so we will be able to support WebAssembly pods. With that in the pod definition, the Kubernetes scheduler, controllers, and all the workflows just work with VMs as well; you can orchestrate VMs just like containers. But in order to do that, we had to unify the runtime on the node agent side. We extended the node agent to support not only the container runtime but also a VM runtime, so we added VM runtime support to the node agent on every node. That is how we are able to unify this, and it is transparent to the customer. We extended the CRI, the Container Runtime Interface, to include the fields that VMs require. We also introduced action objects; these are new, and they are for VMs rather than containers, because there are operations you can perform on a VM that are not available for containers: you can reboot a VM, take a snapshot of a VM, attach or detach a device to a VM. So we introduced what we call action objects, so that customers can manage VMs through the Kubernetes API.

Those are the main changes we made to Kubernetes, and the result is the Arktos project, which is able to manage 50,000 or 60,000 compute nodes in a cluster. With that, I will hand this over to Ying Huang. She will talk about how we made these changes and how we support 60,000 nodes per cluster. Thank you for the introduction.

OK. So I am going to talk about how we scale out Arktos. It is a scale-out architecture built on top of Kubernetes, and I will cover the optimizations we have made so that the cluster is able to manage 60,000 nodes. This is the Kubernetes architecture we are all very familiar with; Dr. Xiong already showed it before, and this is the same. We split it into two kinds of partitions: tenant partitions and resource partitions. The resource partition specializes in managing nodes. You can see here it has its own dedicated etcd and API server. The kubelet talks to the API server in its own resource partition and reports its node healthiness there. So we leverage the resource partition to manage the node healthiness of the cluster.
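To make the split concrete, here is a minimal sketch, assuming client-go and two hypothetical kubeconfig paths: one client talks to a resource-partition API server for node objects, while the other lists pods assigned to the host from a tenant-partition API server. The endpoints, file paths, and node name are illustrative assumptions, not Arktos's actual configuration.

```go
// Minimal sketch of the TP/RP split, assuming client-go. The two kubeconfig
// paths below are hypothetical: one points at a resource-partition API server
// (node objects), the other at a tenant-partition API server (pod objects).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func mustClient(kubeconfig string) *kubernetes.Clientset {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	return kubernetes.NewForConfigOrDie(cfg)
}

func main() {
	nodeName := "worker-0001" // hypothetical node name

	// Node status lives in the resource partition.
	rpClient := mustClient("/etc/arktos/rp-kubeconfig") // hypothetical path
	// Pod specs and customer requests live in the tenant partition.
	tpClient := mustClient("/etc/arktos/tp-kubeconfig") // hypothetical path

	// Report node healthiness to the RP API server.
	node, err := rpClient.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("reporting status for node", node.Name, "to the resource partition")
	// ... update node.Status.Conditions and call rpClient.CoreV1().Nodes().UpdateStatus(...)

	// Fetch the pods the tenant partition has assigned to this host.
	pods, err := tpClient.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods assigned to this node by the tenant partition:", len(pods.Items))
}
```

The only point of the sketch is the direction of the traffic: node health flows to the RP, pod assignments come from the TP; in Arktos the node agent and controllers handle this internally.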
In the tenant partition, we focus only on customer workloads. It has its own etcd instance and API server, and it has a scheduler. The scheduler gets the pod information, the customer requests, from the API server of the tenant partition, and it gets the node information from the API server of the resource partition, and then it decides which pod it wants to put on which host. The kubelet in the resource partition reports its node healthiness to the API server in its own partition, the resource partition, and it gets the pod assignments for its host from the API server of the tenant partition. In this way we successfully split the different kinds of workloads by their nature: the node management requests go to the resource partition and the customer requests go to the tenant partition.

The relationship between tenant partitions and resource partitions is many-to-many, and they can scale independently. For now, we support multiple tenants per tenant partition, but we do not yet support a tenant spanning multiple tenant partitions; each tenant has to belong to one tenant partition.

What is the benefit? A tenant partition, which we call a TP, and a resource partition, which we call an RP, each have a more specific, focused workload, so it is easier to make individual performance improvements in the different partitions, and they are independent of each other. Beyond that, they can scale independently: we can have ten tenant partitions and only one, two, or three resource partitions. And each tenant partition can potentially take all the resources from multiple resource partitions, as long as a single tenant partition is able to handle that workload. In addition, each tenant partition has its own scheduler, and the schedulers work independently in their own tenant partitions, so together they multiply the scheduling throughput of the system.

We have also done some optimization, mainly focused on several areas. We try to eliminate unnecessary etcd reads from the API server; we try to regulate client list operations, and we will tell you later why this is important; and we also made improvements to the watch mechanism. For those of you who are familiar with Kubernetes, list and watch are very important operations in Kubernetes, and they also become a major bottleneck for performance. We will present our improvements together with the performance test results.

First of all, this is what we are going to talk about: the performance data, the benchmark tools we are using, and the critical changes we made for performance improvement. This is our final performance data: Arktos 1.0 and 0.9, compared with Kubernetes 1.21. The first line tells us the cluster size, the number of worker nodes we are managing during the performance test. The second line tells how we partition the cluster: the first number is the number of tenant partitions, the second the number of resource partitions. In Arktos 1.0, we were able to go to 60,000 nodes with three tenant partitions and three resource partitions. Before that, we tried various combinations of TPs and RPs, trying to see what a good combination is and how we can reduce our management cost. The third line is the pod creation request rate during the test, which is given as two numbers.
For whoever is familiar with the perf test tool, there are two phases during the perf test. The first phase is called the saturation pod creation phase. In this phase, it uses a large number of small pods to bring the cluster to full capacity, meaning 30 pods per host. The second phase uses a relatively small number of pods, but each pod takes a lot of system resources, so it is almost guaranteed that one host can only host one pod in the latency test phase. The first QPS number is the request rate we send per second in the saturation test phase; the second number is the request rate in the latency test phase. In Kubernetes, the default test QPS is 20 and 5: in the saturation test phase they create 20 pods per second, and in the latency test phase they create 5 pods per second. During our tests we saw that this QPS really matters for the actual latency data being reported: the more pods you create per second during the test, the higher the latency you get, which is not ideal. Because you want to support a big cluster, you are supposed to get more customers, right? So you cannot have a very low QPS when your customer base is growing.

The pod startup latency data here has P50, P90, and P99. Those are the latency numbers for the pods created in the second phase, because this is the criterion used by the Kubernetes perf test to determine whether the latency test is successful. They use five seconds as the standard. Some of our numbers are higher and some are lower, but roughly, as long as it is around that five seconds, we consider it a successful test case. From this you can see that for Arktos, after we scaled out and after our several optimizations, we have a much better QPS and a comparable pod startup latency.

For the benchmark tools, we use the built-in Kubemark to simulate a real cluster, because a real cluster of 50K nodes is very expensive. Each hollow node uses 0.1 CPU, so we are able to reduce the size; in the real test, we used 505 machines to simulate 50K nodes. We adopted the Kubernetes perf test tool, the industry-standard perf test tool from Kubernetes called the cluster loader tool; there are many tools, but this is the one we are using, and it is from Google. If you need more explanation, please let me know.

We had to make some changes to the cluster loader tool. First, because we support multi-tenancy in Arktos, we made changes so that in each of our tests, for each tenant partition, we use one separate perf test process. That reduces the impact where, if your perf test tool is slow, you cannot drive high throughput in your actual cluster. So we use multiple test processes. And we increased the QPS; after increasing it, the total running time was reduced to about 15% of the original runtime, which hugely reduces our testing cost. The money you spend on Google GKE is huge if you have to run those tests again and again. We also had to make minor changes to the perf test tool because we detected some issues introduced on the client side by the perf test tool itself; that is why we call this part regulating client behavior.
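As a side note on why the test tool's QPS matters: with client-go, the client's own rate limiter caps how many requests per second it can send, so a test tool with a low QPS cannot drive a large cluster to its real throughput. Here is a minimal sketch under that assumption; the kubeconfig path, namespace, image, and the QPS/Burst numbers are illustrative, not the actual cluster loader settings used in the tests.

```go
// Minimal sketch (client-go): the test client's own rate limiter caps how many
// requests per second it can send, so a low client-side QPS limits the pod
// creation rate you can measure against a large cluster. All values here are
// illustrative, not the cluster loader configuration from the talk.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// Raise the client-side rate limits so the test client itself is not the bottleneck.
	cfg.QPS = 100
	cfg.Burst = 200

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Create a stream of small "saturation" pods; the effective creation rate
	// is bounded by cfg.QPS above.
	for i := 0; i < 1000; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("saturation-pod-%d", i)},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.5"}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
		}
	}
}
```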
So we made several critical changes based on our observations. The observation is that a few factors really drive Kubernetes performance. First, reads and writes to etcd have a big impact. And listing large amounts of data from the API server also has a huge impact. Most of our changes are around these areas.

Here is some data. First of all, this is a change ported from Kubernetes, and I want to mention it because it gives some indication of how we should approach these performance improvements. We ported it from Kubernetes because this change was not there when we forked Arktos out of Kubernetes; they introduced it later, after we forked. This test was performed on 1 TP and 1 RP with QPS 105. Here is the cluster size. Before this change, with these latency numbers, it was impossible for us to claim we were supporting a 50K cluster. After this change, at 15K nodes we were below the bar for a successful run, and we were able to go further: we increased the cluster size to 25K after this change. This is the latency data. The change is in the node controller: it uses watch instead of list for pods. As we already said, listing large amounts of data has a huge performance impact, especially for pods, since pods are the objects with the largest count. It reduced the listing of pods, and this is the performance improvement we got, around 15%.

We were encouraged by that, so we did our own optimization and made some changes to Kubernetes. In the original Kubernetes, when it tries to patch a node, it always reads first, trying to get the latest node version so that it can patch. But we said: you are already consistently watching the node, so you do not necessarily need to read it again. We just use the watched node to do the patch, and in our observation there is almost no conflict. Again, with 1 TP and 1 RP at 25K nodes, the latency improved by about 20%. The get-node count here shows how many gets we were able to eliminate, because our change says: only if your patch fails due to a conflict do you have to read and then patch. This number shows that, in our test, there were almost no conflicts, so it makes sense to just use the watched data to do the patch.

The third one: as we said, the perf test tool has some issues. During its latency test phase, it does a list for each pod it is trying to create, so there is a whole bunch of lists. Instead of doing that, we said: just use one list; don't do a per-pod list, do one general list and watch from there. Each of these lists is actually relatively small, because it only returns the particular pod assigned to a node, but a large number of lists still has a big performance impact. By reducing this, we were able to get to around five seconds for our P99. This is a 25K cluster, so now we can claim we support 25K nodes in a single cluster with 1 TP and 1 RP.
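Here is a minimal sketch, assuming client-go, of the node-patch idea described above: build the patch from the node object already held from the watch instead of issuing a GET before every PATCH, and fall back to a fresh GET only when the patch hits a conflict. The package and function names are hypothetical; this illustrates the technique, not the actual Arktos controller code.

```go
// Minimal sketch, assuming client-go: patch a node condition directly from a
// watched/cached node object, and only GET when the patch reports a conflict.
// Illustrative only; not the actual Arktos controller code.
package nodepatch

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func buildPatch(resourceVersion string, cond corev1.NodeCondition) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		// Including the resourceVersion makes the patch conditional, so a stale
		// cached node surfaces as a Conflict error instead of a silent overwrite.
		"metadata": map[string]interface{}{"resourceVersion": resourceVersion},
		"status":   map[string]interface{}{"conditions": []corev1.NodeCondition{cond}},
	})
}

// PatchNodeCondition patches one node condition using a node taken from the watch cache.
func PatchNodeCondition(ctx context.Context, client kubernetes.Interface,
	watchedNode *corev1.Node, cond corev1.NodeCondition) error {

	patch, err := buildPatch(watchedNode.ResourceVersion, cond)
	if err != nil {
		return err
	}
	// Common path: patch straight from the watched object, no GET first.
	_, err = client.CoreV1().Nodes().Patch(ctx, watchedNode.Name,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	if err == nil || !apierrors.IsConflict(err) {
		return err
	}

	// Rare path: only on conflict do we pay for a fresh read, then retry once.
	fresh, err := client.CoreV1().Nodes().Get(ctx, watchedNode.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	patch, err = buildPatch(fresh.ResourceVersion, cond)
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(ctx, fresh.Name,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```

The design choice mirrors what was measured in the talk: since conflicts are rare in practice, the common path avoids one read per patch and the fallback read is only paid when it is actually needed.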
Then we went further. We wanted to try 2 TP and 2 RP and see whether we could support a bigger number. This is the change: we did some watch optimization for this. Right after we made the previous optimizations, we were able to support 25K with 1 TP and 1 RP, so we increased the cluster to 2 TP and 2 RP and went to 50K nodes, and of course we doubled the QPS. At this point our P99 was relatively high; compared to the standard of the perf test, we could not yet say we supported a cluster this big.

So we looked at what the problem was: what was blocking us from supporting 50K nodes? Here we give you the event watch optimization. What is it, and why does it give such a big performance improvement? OK, so this is the watch mechanism in the API server. How many of you are not familiar with this? Should I explain further? I will start from here; if you have questions, please raise your hand.

Kubernetes uses list and watch. Each watch connection only has a lifetime of several minutes; after several minutes, the client renews its watch connection. When it starts to watch, it has to provide a resource version it wants to watch from. When the API server gets this resource version, it scans its event cache and finds all the events from the resource version you are watching from up to the latest resource version, call it RV2. Not all of those events should be sent to the client, so they go through the event processor, which uses the watch predicates to determine whether each event needs to be sent to this client.

Why is this a problem? Because the client only uses the resource version of the latest event it receives to update the resource version it maintains. If it does not get any new events, it keeps using its old resource version. When the cluster gets bigger, the chance of each node getting a pod, getting an event, is very slim, so there are long periods in which you are not getting any new events. Your resource version gets older and older, and the range of events to scan gets bigger and bigger, which causes the API server to spend a lot of CPU time scanning and filtering all those events.

To solve this problem, we use bookmark events. This is not new; bookmark events are already supported by Kubernetes. The client library says: if I get a bookmark event, I only use this event's resource version to update the resource version I am maintaining. That means in the next watch session, when the client renews its watch, it will use a newer resource version. So we send the latest resource version to the client, and the next time it renews, it uses a much newer resource version, and we have successfully reduced the number of events the API server has to scan. OK? I hope this is clear.

From the log: the API server has this kind of log entry saying it is processing a big number of init events and it took this much CPU time. Before this change, the API server produced a large number of these init event processing log entries. After this change, we cannot get rid of them completely, but only about 20 to 30 percent are left. And looking at the CPU profiling of the API server: before the change, it took 62% of the CPU time in our sampling window to process these events; after the change, it is less than 19% of the CPU time.
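For reference, the bookmark mechanism is already visible from the client side in stock client-go: a watcher can ask for bookmark events and use their resource version to keep its own position fresh, so the next watch renewal starts near the head of the event cache. Here is a minimal sketch under that assumption; the Arktos change itself is on the server side of the watch cache, so this only illustrates how a client consumes bookmarks, and the package and function names are hypothetical.

```go
// Minimal sketch, assuming client-go: keep renewing a pod watch, advancing the
// resource version from both real events and bookmark events so renewals do
// not force the API server to replay a long backlog of events.
package bookmarkwatch

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func WatchPodsWithBookmarks(ctx context.Context, client kubernetes.Interface, namespace string) error {
	resourceVersion := "" // start from the current state on the first watch
	for {
		w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
			ResourceVersion:     resourceVersion,
			AllowWatchBookmarks: true, // ask the server to send periodic bookmarks
		})
		if err != nil {
			return err
		}
		for event := range w.ResultChan() {
			obj, err := meta.Accessor(event.Object)
			if err != nil {
				continue
			}
			// Both real events and bookmarks carry a resource version; remember it
			// so the next renewal starts close to the head of the event cache.
			resourceVersion = obj.GetResourceVersion()
			if event.Type == watch.Bookmark {
				continue // nothing to do; the bookmark only advances our position
			}
			fmt.Printf("%s pod %s at rv %s\n", event.Type, obj.GetName(), resourceVersion)
		}
		// Channel closed: the watch expired after a few minutes; renew it from
		// the freshest resource version we have seen.
	}
}
```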
Any questions? Yes? No, this is a single change, a very small change in the API server. Because we use the same code base for our TP and RP API servers, they both get this update. It is a small change in the API server watch cache processing; you do not need to change your client, there is no need to change that. I'm sorry, can you say that again? The perf test is a separate client, right? It just uses some kind of client library to say, I want to create a pod. It may use the watch client, or it may not. If it uses the watch client, it benefits from this. If it does not use the watch client, then the beneficiary is mostly the kubelet, because the kubelet is the one doing these watch renewals and causing a lot of the excessive init event processing. In any case, the change is done in the API server, and that is where we see the improvement: the API server has much better performance because it is not heavily loaded processing init events. From the client perspective, the client sees less latency when it renews the watch. So actually every client benefits from this. From the RP, yes. Because, going back to this connection: this is a watch to the API server, so the API server has a much better processing time, and the client gets a much faster response time.

So here is our performance improvement: we were able to reduce the P99 by 70%.

It is not there; upstream is not designed this way. With this bookmark event, what I am saying is that the client library already has logic saying: I am going to ignore this event and its object, and only use its resource version. That logic is already in Kubernetes. But in this case, upstream does not send that extra bookmark event, because they consider it extra and possibly not relevant to the client. We are saying that, in the sense of the resource version, it is relevant. Yes, of course. It seems easy, but Kubernetes has a long process for code review, PRs, and those things; that is the process. I personally think this is a very good improvement, and we can contribute it back. If we have the time and money, we would also like to see how Kubernetes itself improves after this change; we want to do further tests with this change on Kubernetes as well.

OK, again, this is our final performance data. Now we claim that we support a 60,000 node cluster, with real performance test data, using 3 TP and 3 RP. And going to, say, 5 TP and 5 RP, you can tell that each additional TP and RP has more management cost: you have more etcd, more API servers. So the more money you are willing to pay, the higher throughput you will get.

There is some CPU data. This is the Kubernetes CPU data; it is not very obvious, but we can say the CPU utilization is around 40% or 45%. For our TP, the CPU utilization here is around 35%. The jump here is because when we tear the cluster down, we use a very high QPS for the teardown, since we do not care about performance at that point; it uses a lot of CPU, but it should not affect our performance test. And this is the RP CPU usage; it is below 20%, which means our RP capacity has good potential to grow.

To summarize the main takeaways: the scale-out architecture is one of the most fundamental changes we made, and it builds a solid base for a big cluster. The list operation is very expensive; try everything in your control plane components to use as few lists as possible, and regulate your client behavior so that clients do not list arbitrarily. And the event watch improvement is very critical; it is actually a small change, but we got a big improvement from it.

Our next goal is to support over 100,000 nodes per cluster, by further reducing etcd reads, maybe making some changes to the API server and other control plane components, doing further API server profiling, and also thinking outside of Kubernetes. Here are some references. Any questions? No, it does not need to be equal.
You can have one tenant partition and multiple resource partitions, or one resource partition and multiple tenant partitions; it depends on the API server load in the different partitions. Do you mean at 25,000 nodes? Oh, 25,000 nodes. Actually, we have run 30K, but the performance is a bit lower; it is not official, yes.

It is because your customer can potentially use up all the resources across multiple resource partitions. And also, as a cloud provider, we want to support multiple tenants and unify the management; initially it is from a management perspective. It is a single cluster. It is a single cluster in the sense that you group multiple TPs and RPs, the partitions, together. We set up an API gateway in front of it, so there is one endpoint for the customers to deploy their applications. And each TP uses the resources across the RPs: each TP has a scheduler, as I mentioned, and the scheduler knows all the nodes from all the resource partitions. In that perspective, it is one cluster, not only logically but also down in the etcds, because you cannot take a partition away from it.

I think the other part of your question is that there are other ways to scale out clusters. Some people deploy multiple Kubernetes clusters and then split the workload across them. We found some challenges with that. One is that each cluster has its own node management, and they cannot leverage each other. The second is that if a customer deploys an application across multiple clusters, the communication between those clusters is pretty complicated; of course, there are solutions out there, but it is very complicated. So we took this approach, and we will see how it goes.

I think they said we have to stop. We can still answer questions here, right? We're out of time. We're out of time, yes, but... OK. So one more question, then we'll stop, yeah.