So welcome, everyone, to today's session. Today we are going to talk about KWOK, and how we can use it to rethink Kubernetes scalability testing.

First, a little bit about ourselves. My name is Wei Huang. I work for Apple in the Apple Service Engineering team. Upstream, I'm a co-chair of SIG Scheduling, and I'm also a maintainer of some Kubernetes sub-projects like scheduler-plugins and KWOK.

My name is Weiwei Yang. I'm also from Apple, working in AIML data infrastructure, and I'm the VP of Apache YuniKorn. I love Kubernetes and open source, and I've worked on many open source projects before.

Today's agenda has three parts. First, we will get you to know what Kubernetes scalability testing is and what its current state looks like. Second, we will see how we can use KWOK to approach scalability testing from a different angle. Lastly, we will show how, in the real world, KWOK helps YuniKorn run scalability tests.

The first part: what is a Kubernetes scalability test? The term "scalability test" sounds a little scary; you may panic when you hear it. But by definition, it is nothing but seeing how your components respond as the number of your Kubernetes API objects changes. It can be multi-dimensional: the number of nodes can change, or pods, services, endpoints, PVCs, etc. It can also be a particular combination that you care about most. For example, for Istio, maybe what you care most about is endpoint count, and how your components behave as it changes.

Why is this important? Because most companies right now have entered the day-2 world of running Kubernetes. For day-2 operations, measuring is a must, to be sure you know the limits and boundaries of the applications running there. Then you can plan your capacity ahead, and in turn control the cost. Your users will know the limits too, so they can plan their applications accordingly and get the best user experience.

In practice, how do we do scalability tests in a Kubernetes context? An oversimplified paradigm looks like this. You have the data inputs. The data inputs are nothing but a series of workloads. But you may not want raw, vanilla YAML files describing all the nodes, all the pods, and so on. Maybe you choose some tools that help you abstract those workloads, and also help you orchestrate your timeline, like describing a scenario: deploy 5,000 nodes, deploy 10K pods, and see how the API server behaves. In the market there are existing open source tools like ClusterLoader2, maintained by SIG Scalability. It offers a YAML-based directory layout for describing what the workload orchestration should look like; a sketch follows below.

Once you have the input, you put the data into the Kubernetes cluster (we will get back later to how we spin up that cluster). Then you run the workloads, wait for them to complete, and collect the data output. The output can be metrics, logs, anything you care about. Then you do the analysis and present it. You can leverage existing, well-known observability tools like Prometheus and Grafana, and you can also build custom dashboards.

So let's take a look at the Kubernetes cluster itself. Obviously, a straightforward idea is: it's nothing but a Kubernetes cluster, so I'll just run a real cluster. That's right.
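To make the ClusterLoader2 idea concrete, a test config looks roughly like this. This is a sketch from memory, so treat the exact field names as approximate and check the kubernetes/perf-tests repo for the real schema:

```yaml
# Sketch of a ClusterLoader2-style test: create 10K pods at a paced rate.
name: scheduling-load
namespace:
  number: 1                       # run everything in one test namespace
tuningSets:
- name: Uniform10qps
  qpsLoad:
    qps: 10                       # pace object creation at 10 objects/sec
steps:
- name: create-pods
  phases:
  - namespaceRange:
      min: 1
      max: 1
    replicasPerNamespace: 10000   # the "deploy 10K pods" scenario
    tuningSet: Uniform10qps
    objectBundle:
    - basename: test-pod
      objectTemplatePath: pod.yaml   # hypothetical pod template file
```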
In the upstream CI suite, there is a job that spins up 5,000 GCE nodes. You may be a little surprised, but it does exist, this kind of test with 5,000 real nodes. The interesting thing is in the top right, where it shows the job only runs on certain days. You may wonder: why doesn't it run every day? Why not every few hours, or on every PR merge? The answer is also obvious: that's a lot of money. Spinning up 5,000 VMs costs a lot, more than even upstream can afford to do frequently. Also, you have to wait for the cluster to really come up, and destroying it is not trivial work either. The third limitation is that you are restricted to the cloud provider's SKUs; you cannot configure arbitrary shapes, like exactly how many CPUs and how much memory you want. So those are the three limitations. Yes, this kind of test does have value, especially if you want to cover end-to-end behavior from API object creation all the way down to the kubelet. But most of the time, if you only want to test control-plane components, it's way too expensive to be affordable.

So for control-plane-only components, we usually run a simulated Kubernetes cluster. There is a tool called Kubemark, which is short for Kubernetes benchmark. The essential idea is: I don't want to really run the kubelet, because I don't care about the runtime behavior; I care about the control-plane behavior. So the essence is to implement the minimum kubelet interface and wrap it up as a "hollow node". The hollow node talks to the API server to represent itself as a node. If you want to compose a 5,000-node cluster, you just spin up 5,000 hollow nodes, and you get a fake 5,000-node cluster.

Let's use a diagram to illustrate. First of all, you need a control plane to test against. If you don't care about the control plane itself, just use the upstream one and plug in the specific component you want to test, for example a controller you wrote. This is the target you want to test. Then, since you don't have nodes, you spin up these hollow nodes. The hollow nodes, as I mentioned, register themselves as fake kubelets and talk to the API server in the target cluster.

But these hollow nodes do need compute resources; they don't come for free. You need a concrete physical place to run them. A popular pattern is to have another real Kubernetes hosting cluster. In the context of the hosting cluster, the hollow nodes are nothing but individual pods: not nodes, just pods, scheduled as usual. In the diagram, the yellow part on the right is the hosting cluster; on the left, those same processes, when talking to the target control plane, appear as nodes. This sounds a little complex, and yes, it is. That is the first pain point: complexity.

The other pain point is that it consumes resources. The initial memory footprint of each hollow node is around 100 megabytes. So if you want to compose 5,000 nodes, you need about 500 gigabytes of memory just to prepare for this kind of testing. That is not a trivial amount of resources; you may need a hosting cluster of maybe 100 nodes just to simulate a 5,000-node test. (A sketch of the hollow-node setup follows below.) So how can we make it better? That is the motivation of the project KWOK.
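To picture the hosting-cluster pattern, a hollow-node deployment looks roughly like the following. This is an illustrative sketch only, not the exact upstream manifest; the image tag and command-line flags are assumptions:

```yaml
# Hollow nodes run as ordinary pods in the hosting cluster, while each one
# registers itself as a Node in the cluster under test.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hollow-node
spec:
  replicas: 5000                   # one pod per simulated node
  selector:
    matchLabels:
      app: hollow-node
  template:
    metadata:
      labels:
        app: hollow-node
    spec:
      containers:
      - name: hollow-kubelet
        image: registry.k8s.io/kubemark:v1.27.0    # image/tag assumed
        command: ["/kubemark", "--morph=kubelet"]  # flags approximate
        resources:
          requests:
            memory: 100Mi   # ~100 MB each: 5,000 x 100 MB ~= 500 GB total
```

The `requests` line is where the math above comes from: the hosting cluster needs roughly 500 GB of allocatable memory before a 5,000-node simulation can even start.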
In this part, we will share what KWOK is, why it's performant, and how its design philosophy differs from other frameworks like Kubemark.

KWOK is short for Kubernetes WithOut Kubelet. You may be confused by the term "without kubelet", but believe it or not, you may already be using the idea without noticing it. For example, if you're an upstream developer, when you write an integration test, it doesn't spin up a kubelet; it just spins up etcd and the particular components you want to test. Likewise, maybe you use Kubebuilder to speed up your controller development. There is a tool called envtest working underneath it, helping you write very efficient test suites. envtest also doesn't spin up a kubelet, but it spares you a lot of work in controller testing.

These kinds of tools have their limitations, though. The first is that there is a bar: they are not designed for arbitrary users. They require certain knowledge of Kubernetes, and they are SDK-oriented instead of API-oriented, so you have to write code to compose these tests. Also, some of them don't have the concept of virtual nodes. For example, envtest doesn't give you an accessible mechanism to create nodes; it focuses on controller functionality testing. The other, more painful limitation is that they don't simulate the full lifecycle of the objects. Things get stuck at a certain stage because there is no kubelet backing them: pods will stay in the Pending state forever, and if you create a node, it will always be in NotReady status. So this may not fit many of your scenarios for scalability testing. As for hollow nodes, as I mentioned, by design there is a concrete entity representing each kubelet, so the fake nodes come with an O(N) memory footprint.

So how does KWOK solve all of these problems at once? First, this is a diagram of the classic Kubernetes architecture. On the right side, when KWOK is involved, it doesn't need node entities at all, just fake node API objects. You can use kubectl to create these node objects (an example spec follows below). When these nodes are created, the KWOK controller is notified, and it does the rest of the work for you: it simulates the rest of the nodes' lifecycle to make them look like real ones, maintaining the heartbeat of the fake node objects with the API server. I will show you this later in a demo.

Because everything in Kubernetes is an API object, these fake nodes exist as API objects, so the scheduler is aware of them. When your pods come in, they will be scheduled properly by the scheduler; they just land on fake nodes. KWOK then continues the rest of the lifecycle, moving pods from scheduled to Running or Completed, you name it. It also provides annotations you can inject to define what you want the rest of a pod's lifecycle to look like.

So why is it performant? The first reason is by design: there is no concrete entity representing each kubelet, so it has an O(1) memory footprint. And internally, it doesn't follow the common controller design philosophy of using informers to keep all objects in memory. It works more like stream processing of incoming objects: they come through and go away.
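As an example of what the scheduler actually sees, a KWOK-managed fake node looks roughly like this. It is modeled on the node template shown in the KWOK docs; the resource numbers are placeholders:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: kwok-node-0
  annotations:
    kwok.x-k8s.io/node: fake   # tells the KWOK controller to manage this node
  labels:
    type: kwok
spec:
  taints:
  - key: kwok.x-k8s.io/node    # taint keeps unrelated workloads off fake nodes
    value: fake
    effect: NoSchedule
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
```

Once this object is created, the KWOK controller picks it up, flips it to Ready, and starts maintaining its heartbeat, so to the scheduler it is indistinguishable from a real node.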
And because these are not real workloads or real nodes, the KWOK controller doesn't need to keep their current state in memory over its whole lifecycle. That saves a lot of memory; I will show you in a bit what the memory footprint looks like. KWOK itself, as in the diagram I mentioned earlier, has one core component, the KWOK controller. It's responsible for maintaining the heartbeats from the fake nodes to the API server, as well as the full lifecycle of pods and some other objects, and it's tailor-designed to optimize memory. It also comes with another handy toolkit called kwokctl, to help you set up a local testing environment very quickly.

So let me get into the demo. I have nothing running here right now. I have kwokctl, aliased to kw. For demo purposes, and because I'm running on macOS, the most convenient way to show you how it looks is running in binary mode. I have pre-built binaries of etcd, the scheduler, the API server, and the controller manager. So when I create the cluster, it comes up, and it has the KWOK controller running as well. Let's look into it: yes, this is the KWOK controller itself, only 16 megabytes. And the other components look as usual: kube-apiserver, controller manager, scheduler.

Right now we are running a bare-bones control plane: no nodes, no pods. So let me create some nodes. You can definitely kubectl-apply a node spec again and again to spin them up, but kwokctl has a handy command to scale the nodes to 20. Once it runs, you have 20 nodes.

And how about scheduling? I will start small before trying some crazy things. Suppose you have a deployment with replicas set to 100. As I mentioned, I have a fully functional bare-bones control plane, so the scheduler can schedule them, and then KWOK takes over the rest of the lifecycle. Let me check: looks fine, all 100 pods got scheduled onto these 20 fake nodes. One thing I forgot to mention is that when you use kwokctl to create things, it reads some pre-configured settings. Here I just customized the QPS, because I want to push the limit of the scheduler: there is a node template where I specify 32 CPUs, and for the scheduler I tweaked the QPS to 100. Nothing else. So yeah, that's pretty much it; let me delete this. (The whole flow boils down to a handful of commands; see the sketch below.)

Okay. Next, we will show you how the YuniKorn community benefits from KWOK. Hopefully by now you've got a basic idea of how KWOK differs from Kubemark. I will be talking more from a user perspective, because I'm from the YuniKorn community. What do we do? YuniKorn is a scheduler, and that has a lot to do with performance and scalability testing. In case you don't know, YuniKorn is a standalone Kubernetes scheduler that brings multi-tenancy readiness to Kubernetes clusters. It's widely used to schedule large-scale data processing, analytics, and ML workloads, and we have been operating really large-scale workloads on YuniKorn already. It comes with a lot of scheduling capabilities, such as hierarchical queues, resource fairness, job ordering, job preemption, and gang scheduling; you may have heard many of these terms, and YuniKorn comes with all those features. It can also be integrated very easily into Kubernetes clusters, including with different cloud vendors.
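Before going deeper into YuniKorn, for reference, the first demo above boils down to roughly these commands. This is a sketch assuming a recent kwokctl and the default cluster name; exact flags may differ between versions:

```sh
# Bring up a local control plane (etcd, kube-apiserver, scheduler,
# controller manager, plus the KWOK controller) from pre-built binaries.
kwokctl create cluster --runtime binary

# Scale to 20 fake nodes in one command instead of applying node specs by hand.
kwokctl scale node --replicas 20

# Schedule real API objects onto the fake nodes.
kubectl create deployment sleep --image=busybox --replicas=100
kubectl get pods -o wide    # all 100 pods land on the 20 fake nodes

# Tear everything down when done.
kwokctl delete cluster
```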
Actually, a lot of YuniKorn use cases involve cloud vendors' clusters, working with the cluster autoscaler or even with Karpenter. So why is performance testing so important for YuniKorn? Essentially, YuniKorn is a scheduler, so it needs to look at every pod and node and do the allocation from pods to nodes. When we operate at large scale, dealing with thousands of nodes or hundreds of thousands of pods, the throughput really matters. We don't want to run into unknown zones before our users do; we want to stretch to that level and discover how many nodes and how many pods we can scale to before it really happens in production. The throughput also makes a huge difference in cost efficiency and client-side latency. Just as an example: if we have 50,000 pods to be scheduled on a cluster, a throughput of 10 pods per second versus 20 means users submitting a job at the end of the queue wait roughly 80 minutes versus 40 minutes (50,000 / 10 = 5,000 seconds; 50,000 / 20 = 2,500 seconds). That's a huge difference.

When we did performance testing in the past, before KWOK, this is how it went. There are a couple of steps: we build a cluster, run experiments over a bunch of iterations, collect metrics, and assess the results. Then we go back and run more experiments, because we want to find out where the bottleneck sits in the code. In the early days, the only tool we could use was Kubemark. It's an excellent tool; it gave us the ability to spin up thousand-node clusters. We also developed a tool to do node simulation, plus some tools to collect metrics and draw charts. That's what happened over the past few years: we ran this three or four times. Why only three or four times? Because it's very expensive. A Kubemark-based solution is expensive in terms of both dollar cost and engineering cost. Every time we ran such a large-scale test, we needed around 20 physical servers to simulate something like 10,000 nodes and hundreds of thousands of pods, and that is very difficult to set up. Sometimes we even needed to tweak each server's configuration, for example to increase the limits on processes or file descriptors. A lot of things to work on. Each time, the overall turnaround was about a month, with a dedicated engineer working that month to get us ready for the performance run. And it's also extremely difficult to automate.

It really shouldn't be that hard, because for scheduler performance, what we care about is the steps in the box on the right: from pod creation to pod allocation. That is where the heavy lifting happens inside the scheduler, basically the calculations for scheduling. We don't care that much about the binding phase. What we really need is a real control plane, but a really lightweight Kubernetes that can scale. That's why KWOK changed everything for us. In the past, as I just mentioned, we set up 20 servers just for performance testing. Now we can do it on one node, even on one laptop. As the demo just showed, you can easily bring up a cluster with thousands of nodes on your local machine, which is amazing.

To use KWOK, there are basically two approaches. I've been using both, and I think they fit different use cases. One is to use a KWOK-managed cluster.
KWOK comes with kwokctl, which helps users bring up a cluster very, very quickly. What it does is basically bring up the control-plane components, and it gives you runtime options: what we just demoed used binaries, but you can also use a container runtime, Docker or Podman, and you can use kind as well. After you have the control plane, you can use kwokctl commands or node specs to create any number and any type of nodes you want.

The other way to carry out your tests is to bring your own cluster to KWOK, which is sometimes more useful for us, because we want to test against a real control plane. In that case we need a real Kubernetes cluster with some real Kubernetes nodes. One limitation of the previous mode, with binaries, is that you don't have even a single real Kubernetes node; in this mode, you can have a bunch of them. For the cluster, you can do it locally if you want, with kind, minikube, or even Docker Desktop if a single-node Kubernetes works for you. You can also use a remote cluster: basically connect to any cluster provided by a cloud vendor, or even on-prem. After that, similarly, KWOK can simulate any number of nodes. Note that those fake nodes come with taints, so you need to make sure your applications or workloads carry the matching tolerations and node affinity so they can land on the fake nodes.

The environment setup then becomes really easy: set up Kubernetes, either KWOK-managed or bring your own; install KWOK; create nodes; then install your app stack. By app stack I mean, for example for YuniKorn, we install YuniKorn, then Prometheus and Grafana so we can gather all the metrics we need. Then just start exploring. There are a few tips: you probably want to increase the QPS for the control plane in order to really push the cluster to its limit, and in some cases you may want to tweak the node specs into the different types your test scenarios need to cover. Then we collect the metrics from the dashboard. The query we use to directly gather the binding rate and the bound-pod counts from Prometheus is shown below.

Right now there is a JIRA the YuniKorn community is actively working on to replace our performance-testing toolkit with KWOK. We are even thinking about building it into a pipeline, so we can run the performance tests over and over again to catch performance regressions.

KWOK is a perfect fit for YuniKorn's use cases, but I believe it can suit many others. If you are doing development on Kubernetes, whether a controller or an application, you can use it for local testing: it's so easy to bring up and delete a cluster that local testing becomes very easy. And of course performance testing and evaluation, which I believe is what most people do with KWOK, because it's so efficient that we can do a lot on one single machine. Another part we won't talk about much in this talk: KWOK comes with some chaos-engineering capabilities, where you can randomly inject failures into pods and nodes. That covers maybe 90% of use cases for the common failures. And then the CI/CD pipeline: because we no longer need a set of 20 or more servers to stand up an environment, we now have the chance to build this into the CI/CD pipeline.
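The binding-rate query mentioned above is along these lines. It's a sketch based on the standard API server request metrics; label names can vary across Kubernetes versions:

```promql
# Pods bound per second: the scheduler POSTs to the pods/binding subresource.
sum(rate(apiserver_request_total{resource="pods", subresource="binding", verb="POST"}[1m]))

# Cumulative count of binding requests, i.e. total pods bound so far.
sum(apiserver_request_total{resource="pods", subresource="binding", verb="POST"})
```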
Getting back to CI: with GitHub Actions, you can trigger these runs per PR, or as a nightly run. That can help a lot to avoid performance regressions. Also, if you build your test cases modularly, you can run some of your functional tests this way too. So now let me hand it back for the second demo.

Okay, testing, testing. Next we are going to do some crazy things. I just showed you a 20-node cluster with 100 pods; as clusters go, that is just a toy, right? Let's do something serious. You name it: how many nodes do you want to scale up to? 40,000? No, that is beyond what my laptop can support, I'm afraid. But 5,000 I can give a try. 5,000, okay. This will take some time, because the operation is nothing but issuing node creations against the API server, though with some optimizations: the default kubectl has a default QPS limitation, and it doesn't use a list pager. Oh, sorry, not this one; I'll show you that later. We'll wait for a while. I think 5,000 nodes needs maybe 50 seconds to finish. At the same time, I can show you the memory footprint. These are the default 1.27 ARM64 binaries of the API server and controller manager, and you can see it's already getting interesting. Okay, finished. Good, there they are, running. Nice: 5,000 nodes on my laptop.

Next, let's do another crazy thing. How many pods do you want to test with? Name it: 10K, 20K? Maybe we can start with 20K pods. I have an in-house program to measure the throughput. That is also a tip when you test your own components: you can definitely use metrics, but for some specific cases you may need more fine-grained metric-collecting logic that you customize. Maybe I should open-source this; it's pretty easy. I specify, oops, I almost forgot, 20,000, right? The program just watches the scheduled pods and outputs the delta every second. Let's run it: 20,000 pods on 5,000 nodes. It's starting, and in the meantime you can see my CPU usage spiking, right? In the memory footprint, you can see the API server also growing as time goes on, to something like 200 megabytes in this case. You might expect some kind of pulse in the KWOK controller too, but there isn't one. That is what I mentioned: it's optimized by design, like stream processing, so it doesn't let lots of objects stay in memory in the KWOK controller. The API server is doing okay, and it's almost finished, I think.

Also, if you don't care about that level of fine-grained accuracy for your throughput, you can use the default metrics here, because I spun up Prometheus locally and you can use this chart. This is the API server chart for the binding operation: the resource is pods, the subresource is binding, and the verb is POST, which is how the default scheduler posts bindings to the API server. This binding rate is something other simulators cannot give you, because they don't simulate the rest of the pod lifecycle; they stop at the scheduled phase.

All right, that's pretty much it, but maybe one more thing: what if you want to dump the current state of the cluster? kwokctl helps you do that too, in one single command: a snapshot. It's an export, where you specify the local cluster name and a filter. The filter is node and pod, because right now scheduling only cares about those.
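Roughly what was typed here, with flag shapes approximate (check `kwokctl snapshot --help` for your version):

```sh
# Export the current cluster state as plain API objects,
# keeping only the kinds the scheduler cares about.
kwokctl snapshot export --path snapshot.yaml --filter node,pod
```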
And locally, let me see how quickly it gets generated. Oh, it's super quick. So basically it has the 5,000 nodes plus another 20K pods, 25,000 objects in total. This is in an uncompressed format; it's nothing but node and pod objects. See: nodes, and then the current pods. So yeah, that's it. This can be useful if you want to shut down your local cluster and replay it somewhere else. So that's pretty much the demo; I hope you enjoyed it. If you have any questions, there's a microphone.

There's one more page: where to go next. This is the official website of KWOK. It's still a pretty young project, just about one year old, I think, and you can ask any questions in the Slack channel. There are also two KubeCon talks related to doing simulation with or without KWOK. The second talk makes a good point: maybe we need some kind of trace rather than a single point-in-time snapshot, that is, snapshots along a time window so you can capture all the events that happened within that window. So yeah, that's all for my talk today. Thank you. If you want to give feedback, feel free to scan the QR code. And we have a couple of minutes to take questions.

Q: Cool, thanks for your talk. This is really interesting. I had two questions. Does KWOK have any sort of API endpoints, if you wanted to interact with it programmatically instead of through the command line?
A: No, the API is just the Kubernetes API.
Q: I see, okay. So whatever you'd do against a real cluster, you can do against the...
A: Right, the KWOK cluster.
Q: Got it. And does KWOK expose Prometheus metrics for any of the nodes or pods that are running on it?
A: There is some ongoing work on simulating the metrics that would come from the nodes. There are details on the website, and I think some work items are ongoing.
Q: Okay, perfect. Thank you so much.
Q: I don't know if this is viable or possible, but is there a way of running a service and simulating, say, just a hello-world text on a page? Is there something like that you can do with this, to simulate bringing up a lot of web services at a very basic level? Does that make sense?
A: If it involves concrete runtime logic, like, as you said, producing some output, then it needs compute resources. That is not in the scope of this kind of testing.
Q: But you can bring up the Services, can't you? You can have the Services, and they could point to the pods; they just wouldn't necessarily do anything.
A: There are work items in the community to make the Service reachable, but not to actually run the code behind it. So if you have a pipeline that tests whether the Service can actually be called, it can't help you with that, at least for now. Anything else?
Q: Yeah, excellent. Thanks a lot.
A: I think we are about out of time; we can take maybe one more question, and then I'll be outside to answer anything else.
Q: Have you thought about supporting other types of workloads, like virtual machines with KubeVirt?
A: Sorry, could you repeat that?
Q: You showed pods as the main workload. Have you thought about supporting virtual machines, using KubeVirt, and actually supporting that as a potential workload at test scale?
A: Supporting virtual machines... I'm not sure what layer you mean by supporting virtual machines.
Q: Sure.
Q: So with KubeVirt, the virtual machine actually runs inside a pod; it's just a Kubernetes process. KubeVirt has its own control plane that runs inside Kubernetes. So I guess the idea is: instead of pods, you'd be sending virtual machine YAMLs everywhere to test the control plane. It would just be testing with virtual machines.
A: Yeah, if you want to test a KubeVirt controller...
Q: Yeah, exactly.
A: ...you use the YAML. It's a perfect fit.
Q: So I guess another way to ask it: you can technically support any CRD or API extension, push the YAML through the API server, and test its scalability?
A: Okay, yes. That's a perfect fit for this kind of scenario. Okay, I will be outside if you want to ask more questions. Thank you, everyone.