Okay, good afternoon. Thanks to all of you for coming here today; it's very appreciated. My name is Chen Yixing. I'm from Intel Shanghai, China, and I'm very interested in system architecture, performance, and usability. I think everyone here cares about the Nova scheduler, especially its performance issues and possible optimizations, so that's what I'm here to talk about. I have also implemented a performance profiling tool for Nova scheduler measurements, and I will show how useful it is with the results I've got. This presentation is based on 58 rounds of experiments. They show how well the current Nova scheduler can perform under the standard single-scheduler deployment, which is the one recommended by the community and the scheduler subteam. Once you know the performance, you may wonder how to optimize it, so I will show the existing options that can break that limit. Of course, they all come with compromises, so I will walk you through the pros and cons. Finally, I will also talk about the future Nova scheduler designs the community is considering to optimize performance. All this work took me about two months. I was a little nervous about whether I could realize the idea, but it became much more enjoyable when I saw patterns emerging from the result data. First, some background that is better to know before the main content: the scheduler architecture, how to analyze the scheduler based on this architecture, and how to emulate a real OpenStack deployment to get accurate results. So this is the process of the Nova scheduler. There are four services involved: Nova API, Nova Conductor, Nova Compute, and the Nova scheduler. A request is first accepted and checked by Nova API, and then Nova API sends it to Nova Conductor.
Then Nova Conductor calls the Nova scheduler to get a decision, and Nova Conductor sends the request to the specific node based on that decision. When the request arrives at Nova Compute, the resource tracker checks whether the available resources can meet the request's requirements. If they can, the virtual machine is spawned on that compute node; if not, the request is retried again from Nova Conductor. So this is basically the whole process of Nova scheduling, but there is still an important part here: the database. The scheduler makes its decisions based on the resource view from the database, and it requests all the compute node states during the processing of each request. If the request succeeds on a compute node, the resource tracker updates the latest state to the database, and the resource tracker also periodically corrects the database, maybe every 60 seconds. So I can categorize the process into three phases. The first is the pre-scheduling phase, in which Nova accepts the request. The second phase is scheduling; during this phase a decision is made about which compute node will be used, and it will either succeed or fail. The post-scheduling phase spawns the virtual machine there. In my analysis, the basic idea is very simple: I fake the post-scheduling phase so that no real hypervisor is needed in my simulation. That way I can run 1,000 compute nodes, launch them within 10 seconds on a single machine, and manage them very easily. Secondly, all the important services are instrumented with tracepoints to generate logs, covering all the important calling points, including where the conductor calls the scheduler and where the scheduler calls the database, so logs are generated during the real scheduling. After the requests are delivered, the logs are parsed by state machines implemented in an offline parsing program to get accurate results from the raw logs.
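As a rough illustration of that offline parsing step, here is a minimal sketch, with invented event names and timestamps (the real instrumented services emit their own tracepoints), of how per-phase residence times can be recovered by replaying each request's log events through a simple state machine:

```python
from collections import defaultdict

# Hypothetical tracepoint events: (timestamp, request_id, event_name).
PHASES = [
    ("api_accept", "conductor_call"),       # pre-scheduling
    ("conductor_call", "scheduler_done"),   # scheduling (incl. message wait)
    ("scheduler_done", "compute_spawned"),  # post-scheduling (faked here)
]

def parse_phase_times(events):
    """Replay each request's log events and return seconds spent per phase."""
    seen = defaultdict(dict)
    for ts, req_id, name in events:
        seen[req_id][name] = ts
    return {
        req_id: {
            f"{start}->{end}": stamps[end] - stamps[start]
            for start, end in PHASES
            if start in stamps and end in stamps
        }
        for req_id, stamps in seen.items()
    }

# One request spending 0.3 s in its scheduling phase of a 0.6 s run,
# roughly the shape of the single-request measurements in the talk.
events = [
    (0.0, "r1", "api_accept"),
    (0.1, "r1", "conductor_call"),
    (0.4, "r1", "scheduler_done"),
    (0.6, "r1", "compute_spawned"),
]
per_request = parse_phase_times(events)
```

The same replay works per service boundary, which is how the stacked per-service bars in the later slides are produced.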
So that's the basic idea; now, how do we better emulate the environment? I ran all the controller services, including the Nova scheduler, MySQL, and RabbitMQ, on a controller server, and all the compute nodes on a separate compute server, so that their performance does not affect the controller. The only differences between a real OpenStack environment and the simulated one are that the post-scheduling phase is faked and some network interfaces are faked for convenience. If you have any concerns about that environment, remember your question and ask me later. So this is the main performance of the standard single-scheduler deployment under various conditions. First, I sent only a single tiny request to the simulated environment. This request is processed from Nova API to Nova Compute, and the logs are parsed offline to show how long the request stays in each service. For example, on the left side, in the 200-compute-node simulation, we can see that the request stays in the scheduler service for 0.2 seconds, while the whole processing time is 0.6 seconds. I conducted experiments from 200 compute nodes up to 1,000 compute nodes, so we can see the difference: only the scheduler overhead increases as more nodes are added, and it increases up to 64% of the entire scheduling time. That is very large. I was very curious about what is inside that blue bar, so I made a further analysis, which shows that nearly 90% of that time is in the get_all_host_states method. That is exactly where the Nova scheduler refreshes the states from the database. So it means that the cache refresh is a major bottleneck in this simulation, and it is blocking the entire scheduling process, because the scheduler overhead is very large during the entire scheduling.
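To make that scaling behavior concrete: the per-request cache refresh is essentially a full read of every compute node's state, so its cost grows linearly with the node count. A sketch with an invented table layout (not Nova's actual schema):

```python
import sqlite3

# Illustrative stand-in for the compute-node state table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compute_nodes (id INTEGER, free_ram_mb INTEGER, free_disk_gb INTEGER)")
conn.executemany(
    "INSERT INTO compute_nodes VALUES (?, ?, ?)",
    [(i, 4096, 100) for i in range(1000)])

def get_all_host_states(conn):
    """One full-table read per scheduling request.

    Every request pays for a row per compute node, which is why the
    scheduler-overhead share grows as the simulation scales from 200
    to 1,000 nodes.
    """
    return conn.execute(
        "SELECT id, free_ram_mb, free_disk_gb FROM compute_nodes").fetchall()

states = get_all_host_states(conn)
```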
You may wonder what happens if we add some more pressure to this environment, so I sent 50 concurrent requests to the same environment, and there is a different story. The scheduler overhead still increases with the number of nodes, but the message-processing time also increases, and it is even larger than the scheduler overhead. That is the time between the conductor sending the request to the scheduler and the scheduler accepting it. It seemed very strange that it is bigger, so I investigated: the scheduler has a limited bandwidth, so requests that exceed the scheduler's capacity are blocked before the scheduler. That is what the yellow bar represents. If I also take this wait phase into consideration, the scheduler overhead goes up to 91%, even bigger than in the previous experiment. I also investigated the scheduler service itself, and there the cache-refresh bottleneck is even larger: up to 98.5%. So this strengthens my conclusion that the cache refresh is the bottleneck of the entire scheduling. There is also a new metric to measure scheduling performance: throughput, meaning how many requests this environment can process per second. It decreases to 1.08 queries per second in the 1,000-node simulation. We can see there is a difference between the first and second experiments: the performance differs with concurrency. So I also did a third round of experiments. In the diagram, the leftmost point is the first experiment, where I sent only one request, and the right side is the 50 concurrent requests of the second experiment. I filled in all the request levels between them, and then I could draw this diagram. The first significant thing is that the scheduler overhead increases and then becomes steady.
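The throughput metric just mentioned is simply completed requests divided by the wall-clock span of the run. A hypothetical helper (the 46.3-second drain time below is an assumed figure chosen to reproduce the measured number, not a value from the experiments):

```python
def throughput(completion_times, start_time):
    """Requests per second: completed requests over the wall-clock span."""
    if not completion_times:
        return 0.0
    return len(completion_times) / (max(completion_times) - start_time)

# 50 concurrent requests draining over ~46.3 s would give roughly the
# 1.08 q/s measured in the 1,000-node run.
completions = [46.3 * (i + 1) / 50 for i in range(50)]
qps = throughput(completions, start_time=0.0)
```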
When the scheduler overhead becomes steady, the message wait increases significantly, so I call this saturated scheduling. The API overhead also increases with concurrency. It also confirms my conclusion that if I extend the load to 200 concurrent requests, the scheduler overhead does not increase further. This scheduler overhead, including the wait overhead, increases to 80% in this experiment. If we look into the scheduler service, we can see the obvious saturation there: the cache-refresh overhead increases to 98% before saturation, and after saturation the cache-refresh overhead becomes steady. There is another analysis to see whether the scheduler performance changes with concurrency, and the answer is no: the performance stays at 2.7 queries per second after saturation. If you think this is complicated, I will summarize. The current scheduling process is like a pipeline, and it can be saturated by requests. The thinnest part is the nova-scheduler service, so if more requests arrive than its capacity allows, they pile up in front of the scheduler service. The scheduler overhead of the entire scheduling is up to 90%, and it grows with more nodes and more pressure. The cache-refresh overhead stays steady at 98%, which means the bottleneck is the cache refresh from the database. The performance of the entire simulation ranges from 4.3 down to 1.1 requests per second. Now that we know the Nova scheduler's performance, we may wonder how to improve it if we are not satisfied. We have three choices. The first, obviously, is to add more schedulers. Second, we can add more workers for Nova API and Nova Conductor. The third choice is to switch to another scheduler, called the caching scheduler. If we add more filter schedulers, the first significant thing we see is that the waiting messages are consumed by multiple schedulers.
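The pipeline summary above can be reduced to a toy model (all rates invented): when the arrival rate exceeds the bottleneck stage's service rate, the excess piles up in front of it.

```python
def backlog_after(arrival_rate, service_rate, duration):
    """Toy pipeline model with constant rates.

    Requests arriving faster than the bottleneck stage (nova-scheduler in
    the experiments) can drain them queue up in front of it; below
    capacity, no queue forms.
    """
    return max(0.0, arrival_rate - service_rate) * duration

# Below the ~2.7 q/s scheduler capacity measured after saturation: no queue.
no_queue = backlog_after(arrival_rate=2.0, service_rate=2.7, duration=10.0)
# Above it, the queue grows linearly -- the "saturated scheduling" regime.
backlog = backlog_after(arrival_rate=5.0, service_rate=2.7, duration=10.0)
```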
That causes the scheduler overhead to decrease to 70%, and it results in a faster processing time for each individual query. But this diagram also shows that it is not true that more schedulers always means better performance. We can also see that the throughput improvement is up to 300%. The downside is that the decisions of multiple schedulers collide with each other: with 16 schedulers, all of their decisions collided onto only three compute nodes. That results in the retries shown in the first picture, and finally results in a performance overhead. If we add more workers instead, as this picture shows, there is very little difference. The only pattern we can see is that the API cost decreases slightly, but nothing else does. So the only choice left is the caching scheduler. As we saw in the filter scheduler experiments, the largest overhead is the cache-refresh overhead. What if the cache-refresh overhead were eliminated? That is what the caching scheduler does: it only reads cached states, and it does not read the database during request processing. We can see the result: the performance boost is very significant. The scale of the new diagram is only one fifth of the filter scheduler's scale, which means that with the bottleneck removed, one request can be up to eight times faster in the 1,000-node simulation. It is the fastest option. Using the caching scheduler minimizes the scheduler overhead, gives faster queries, and even gives the best throughput improvement, up to 800%. But there are still cons: the caching scheduler only updates its resource view periodically, every 60 seconds by default, which means it always has an outdated resource view.
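A miniature sketch of that trade-off (names hypothetical): decisions read an in-memory copy of the host states, and only a periodic job reconciles it with the database, so anything that changes between refreshes is invisible to the scheduler.

```python
import copy
import time

class CachingSchedulerSketch:
    """Decisions read an in-memory host-state cache; a periodic job
    (every 60 s by default in the real caching scheduler) refreshes it."""

    def __init__(self, fetch_from_db, refresh_interval=60.0):
        self._fetch = fetch_from_db
        self._interval = refresh_interval
        self._cache = fetch_from_db()
        self._last_refresh = time.monotonic()

    def host_states(self):
        # No database round trip on the request path: this is the
        # eliminated cache-refresh bottleneck from the experiments.
        if time.monotonic() - self._last_refresh >= self._interval:
            self._cache = self._fetch()
            self._last_refresh = time.monotonic()
        return self._cache

# The cost: between refreshes the view is stale.
db_state = {"node-1": {"free_ram_mb": 4096}}
sched = CachingSchedulerSketch(lambda: copy.deepcopy(db_state))
db_state["node-1"]["free_ram_mb"] = 0   # resources consumed elsewhere
stale_view = sched.host_states()        # still reports the old 4096 MB
```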
So it will affect placement accuracy if the resources are constantly changing, for example from deleting instances or migrations on the compute nodes. And of course you may not want to launch multiple caching schedulers. I also visually profiled the scheduling process under each choice. As you can see in the first graph, when the scheduler is processing 50 requests, it takes nearly 20 seconds to complete them all. The requests are first stacked up in Nova API, then they wait in message processing for the scheduler to process them all at a very slow speed. If we change to four filter schedulers, performance is boosted because the scheduler bandwidth is increased, but the scheduler overhead is still very large. If we change to the caching scheduler, we can barely see the scheduler overhead at all; the requests are consumed very quickly, and the new overhead is in Nova API. So the optimization strategy is this: optimizing the database is the most effective choice, but it is a little difficult to get much better performance that way. The second choice is to use the caching scheduler. It has the best performance improvement, but it is a bad choice to bring up multiple caching schedulers, and it will always have an outdated resource view. If we can't tolerate that, we can choose more filter schedulers, which also gives better performance. The last choice is to add more workers; it brings the least performance benefit because, in my investigation, the current Nova filter scheduler connects directly to the database without going through the conductor, which is why adding more conductors does not boost performance that much. So we know the performance; now what can we do as developers to improve the scheduler? The first thing the future scheduler should improve is the largest overhead. We have three choices.
The first is to improve the database itself, the second is to optimize the implementation's database query method, and the third is to exclude the database entirely. Those are the three choices: for the first, we can change to a memory-based database; for the second, there is a resource provider scheduler in the future; and for the third, there is shared-state scheduling. If the largest bottleneck is solved, we can then move on to optimizing Nova API, and the post-scheduling phase is also very important. I can introduce some of the future scheduler designs, but I will not go into details because they are still under consideration. The first is the resource provider scheduler. In this design, the scheduler does not refresh all the host states from the database; instead, it does the filtering inside the database and queries as few host states as possible to increase performance. The shared-state scheduler directly receives incremental updates from the compute nodes, so the database is excluded. If you want to know more about Nova scheduler performance, you can see all the Nova scheduler benchmarking data at this link. If you are curious about the race conditions, you can see the single-scheduler problem discussion there. The details of the shared-state scheduler and the resource provider scheduler are in these links. You can also try this tool in your own environment, optimize your environment, and see whether there are differences; you can ask me if you run into problems. If you have no test environment, you can also try OC cluster if you have good ideas. And if you want to contribute code to the Nova scheduler, you can attend the bi-weekly scheduler subteam meeting. Finally, I want to thank Jingming Tong, Zhou Zhenzhan, and Liu Junwei for their support. That's all of my presentation. Do you have questions about these experiments? Yes, I do.
Hi, I'm from ZeroStack. Can you go back to your slide number 19? I just want to know what you're highlighting there. Slide 19, please. So you have this matrix? Yeah. Next one. Next one. In the first matrix you have 1, 2, 8, 12, and 16. Yeah. Is 8 the number of API workers you recommend, or why isn't it purple? Sorry, it's not 1, 2; it's 2, 4. That's a mistake there. And the 8 is orange because the default setting in my environment is 8, so the previous experiments all had 8 workers. And which are your previous experiments? I conducted three sets of experiments: one request with various node counts, 50 requests with 200 to 1,000 nodes, and 1 to 200 concurrent requests with 400 nodes. Those are the three experiments in my presentation. Okay, well, I will take a look. So basically, across slides like slides 18 and 19, that purple line marks the connection between all those experiments, right? It's not something you recommend as a maximum number of API workers. I did get an answer. Yeah, thank you. Thank you. I apologize if this is a newbie question: when you talk about caching, where does that cache reside? Cache means the scheduler cache; the scheduler makes its decisions based on this cache. Do you mean on-die CPU cache? I mean, does CPU technology make a difference in terms of the performance? No. You mean how the caching scheduler improves the performance? Right, but you say cache; it has to be cached somewhere, right? Yeah, it keeps its cache in the scheduler's memory. Oh, so it's regular memory? Yeah, it's an in-memory cache. Okay. Yeah. I think your confusion was processor cache here; is that what you meant in your previous question, processor cache? Yeah, real memory, like the memcached kind of stuff that Keystone uses, and things like that. So you can have an in-memory cache. I think people might be curious to know if there's any forward path for this in Nova, any chance of these awesome scheduler optimizations being implemented?
Yes, there are the two schedulers I have introduced: one is the resource provider scheduler, which is proposed by Jay Pipes, and the other is the proposed shared-state scheduler design. They are being considered in the community, so you can look forward to them. Not yet, though. So, just to get your view: you have answered this partially, but I want to hear your opinion. When you have this lag in data synchronization, you have this cached state, and you said that whether it is one minute behind or one second behind, the state in the cache will be older than the database. Yeah, it's not up to date. Yeah. So that means if you are highly provisioned, the scheduler might say, "I'm able to allocate," but after one minute it will discover that it is not able to allocate because all the resources are consumed. Yeah. So how would you eliminate that? Is it okay to tell users, "Oh, I'm sorry, I made a mistake because of the cached state"? Or do you tune it, or do something about high-concurrency requests? I'm not asking for a definitive answer, but I'm curious to know your thoughts. Thanks. So you mean, what will happen if the resources are not available on the compute nodes? Correct. But before coming to know this, you have already told the user that you will be able to place these eight VMs on the nodes. No, the scheduler will not tell the user whether the decision is successful. Only when the VM is spawned on the compute node does its state become spawning and then active. Then your API time does not decrease. Yeah, the API directly returns that the VM is in scheduling, but it will not show whether it was successful; you must use nova list or nova show to query that. It won't show as active on nova list because it's not provisioned; it's in the spawning state, in the database record. I see. Yeah.
So what we are getting is that the end-to-end VM provisioning time is not decreasing, but the decision-making from the scheduling phase to spawning is decreasing. Yeah. Which is not really meaningful from the operator's perspective. Is that roughly right? Yeah, it means that the scheduling time is decreased, not the API return time. At least the provisioning time did not decrease, right, because you are still waiting for the VM to get allocated. Yeah, with your optimization. Yeah, I think I got my answer. Thank you. Thank you. I'd like to add one comment. Scheduling time does decrease, because he'll have the state of all the resources in memory. Today's Nova scheduler makes several round trips to the database to get the state of the resources, so that's definitely decreased. And on your other comment about what happens in a cloud that's already pretty crowded with lots of VMs: it's going to be hard to find a fit in that scenario for any scheduler. So I can anticipate that maybe your scheduling approach should change depending on how utilized your cloud is. If you have, say, five concurrent schedulers, it might have to dynamically adapt and drop to one or two so there are fewer conflicts; that kind of thing could be part of it, depending on how crowded the cloud is. Any more questions? Well, I want to ask two questions. Yeah. First, in your experiments, what scheduling mechanism did you use? What do you mean by scheduling mechanism? Do you mean the filter scheduler's filters and weighers? Yeah, yeah. They are the default settings used in DevStack. Okay, so you did not do any work to optimize the filters? Yeah, no optimizations; they are all the original OpenStack ones. Okay. The second question is: how do you measure the scheduling collisions?
For collisions, it is based on the logs. I can see from the logs which node a decision was made to, and then if it failed, I can go back through the logs to see when it was retried and where the second decision was made, so I can collect all of this to see how many nodes are involved. Did you analyze the impact of collisions on your experiment results? No, but every collision, every retry, costs an entire loop from Nova Conductor to scheduler to compute node, and we can calculate that. Yeah, okay, thank you. Thank you. He did see the impact of collisions, and in fact the collisions were reduced, because the window between capturing the resource usage and then scheduling to the resource decreased, so there were fewer collisions. There was a slide with the collisions if you want to show him. Is this the one? Collisions onto seven compute nodes cause the retries in the first diagram, and finally cause the throughput to decrease. I was curious about the query to get all hosts from the database that slows down the scheduler: is that on a per-project basis, or will it slow down if you have lots of VMs in a completely separate project as well? Separate projects? So if you have an OpenStack deployment with multiple tenants or projects, will it slow down based on how many VMs are in all of your projects, or just limited to that one project? No, the scheduling design is that it queries all the compute nodes. It is not based on projects or tenants; it always queries all of them. Thank you. Thank you.
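That final answer, that the filter scheduler always loads every compute node's state regardless of project, is exactly what the resource-provider design mentioned earlier changes by pushing the filtering into the database. A contrast sketch with an invented schema (not Nova's actual tables or queries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (name TEXT, free_ram_mb INTEGER)")
conn.executemany(
    "INSERT INTO compute_nodes VALUES (?, ?)",
    [(f"node-{i}", 512 * (i % 8)) for i in range(1000)])

def filter_in_python(conn, ram_mb):
    """Filter-scheduler style: load every host state, then filter in Python."""
    rows = conn.execute("SELECT name, free_ram_mb FROM compute_nodes").fetchall()
    return [name for name, free in rows if free >= ram_mb]

def filter_in_db(conn, ram_mb):
    """Resource-provider style: the database returns only candidate hosts,
    so far fewer rows cross the wire per request."""
    rows = conn.execute(
        "SELECT name FROM compute_nodes WHERE free_ram_mb >= ?",
        (ram_mb,)).fetchall()
    return [name for (name,) in rows]

candidates = filter_in_db(conn, 2048)
```

Both functions return the same candidate set; the difference is only where the filtering work, and the data transfer, happens.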