Hello everyone, welcome to my presentation. The topic for today is challenges in supporting virtual CPU hotplug in SoC-based systems like ARM64. I am a systems software architect working at Huawei Technologies, Cambridge, UK, and in the past I have done a lot of networking work and dealt with the system aspects of it. Here at Huawei I am also an official maintainer of the HNS networking driver across different SoCs, and besides this virtual CPU hotplug work, I have recently been involved in enabling XDP on the HNS driver.

On to the agenda. In the recent past, attempts have been made to add support for virtual CPU hotplug for ARM64-based systems in QEMU and the Linux guest kernel. We have received somewhat mixed reviews from the community: some vendors have practical reasons to want such support added, but a few community members have apprehensions about supporting it. So this presentation is an attempt to highlight the key issues in supporting the virtual CPU hotplug feature for ARM64-based SoCs.

As an outline, the presentation starts with a very quick overview. Then we go through the known challenges across the system architecture, the KVM host, the QEMU virtualizer, and the guest kernel, and discuss the workarounds. Implementation attempts have been made in the past and in recent times as well; we will look at the problems faced in them, then discuss the future work, and finally open the session for Q&A.

So why do we need CPU hotplug? In general, it can be used for provisioning. For example, in the case of pre-provisioned resources which are auto-scaled, orchestration frameworks like K8s can use a feature called vertical pod autoscaling, in which they add and remove CPU resources dynamically depending on occupancy. For example, during the night server systems are less occupied, so one might like to remove certain CPUs. It can also be done for load balancing. Then there is on-demand provisioning: a very good example is capacity upgrade on demand, in which certain customers would like to add or increase capacity later on, once their business has grown and they can afford that kind of resources. You need a basic CPU hotplug feature to be able to support these demands.

It can also be used to isolate offending CPUs within the system, identified due to errors, in order to stop the propagation of errors within the system, though I'm not very sure this is something which makes sense in the virtual world. Another reason could be onlining and offlining CPUs for suspend/resume. This kind of support has existed for ARM64-based systems for quite a long time; it is PSCI-based CPU hotplug, and you can see this support has been part of the kernel for quite a while. There is a bit of detail in the links below which you might like to read; I'll skip it for now. There are certain event sequences which happen while you plug in a CPU and while you unplug it. For the virtual CPU hotplug case, I've added these diagrams for completeness.
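As a quick aside on that long-standing online/offline support: below is a minimal user-space sketch of how a CPU is taken offline and brought back today through sysfs. The path is the standard one, but the CPU number and helper name are illustrative; underneath, the arm64 guest kernel ends up invoking PSCI CPU_OFF/CPU_ON through its cpu-ops backend.

```c
/* Minimal sketch: online/offline a CPU from user space via sysfs. */
#include <stdio.h>

static int set_cpu_online(unsigned int cpu, int online)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/online", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d", online ? 1 : 0);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Take CPU 1 offline, then bring it back; the kernel issues the
     * PSCI calls on our behalf. Requires root. */
    set_cpu_online(1, 0);
    set_cpu_online(1, 1);
    return 0;
}
```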
These plug and unplug event sequences are a very well understood framework and have been part of the kernel for quite a long time, so I'll just skim through them to get things started. We can see the different layers: hardware, host, KVM, QEMU, the ACPI interface, and the guest kernel. Whenever a CPU is plugged or unplugged, events are exchanged from QEMU through the ACPI interface to the guest kernel. Certain events are also sent by the guest OS back to QEMU through the ACPI control device, and for events from QEMU to the guest OS, the GED device is used to signal what kind of event has happened. The eventual result of these events is, first of all, to know what event has happened and then, in the case of a hotplug, to associate the physical ID, which in ARM terms we call the MPIDR, with a logical CPU ID. In the reverse case, that is unplug, you identify the CPU being unplugged, offline that particular CPU, and the eject action takes place. I won't go into detail; as I said, these are well understood concepts, and my intention for this presentation is to assume you already know these things and to move on to the known challenges.

So, in the past there have been attempts to add this virtual CPU hotplug support, but problems have been faced across different areas. In the case of the ARM64 system architecture, first of all, we know the architecture as such does not have any concept of physical CPU hotplug. There is no specification from ARM which defines a standard way to realize virtual CPU hotplug either, because there is no physical CPU hotplug specification to build on. On the other hand, ARM components like the GIC have also not been designed to realize a physical CPU hotplug capability; as such, the GIC requires all the CPUs to be present at initialization.

Now, the known challenges within the host KVM. KVM also requires all the vCPUs to be created at VM init time. You could say this reflects what the system architecture requires, but because all the vCPUs must be created up front, it has knock-on effects on other components within KVM. For example, each vCPU has its VGIC-related resources initialized and fixed at creation. Various VGIC per-CPU static data structures need to be initialized early. The configuration of the related private interrupts, for example SGIs and PPIs, also needs to be in place when the vCPUs are created, so they get initialized at that time. And resources like the memory regions for the redistributors also need to be set up at VM init time. So whenever the vCPUs are created, all these resources, memory regions, and data structures get initialized, and you cannot change them later on; everything related to the vCPUs has to be present and created during VM init. Furthermore, once the vCPUs have been created in the host KVM, their destruction is not supported yet. This is not ARM64-specific; the Intel architecture has the same limitation in KVM, but there are workarounds for it, so it's not a big challenge; people have already solved that particular part.
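Coming back to that plug-time association for a moment, here is a minimal sketch of the lookup the guest has to do: mapping the MPIDR delivered via ACPI to a logical CPU number. It is modeled loosely on arm64's cpu_logical_map[]; NR_CPUS, the example map contents, and the helper name are illustrative rather than the exact kernel symbols.

```c
/* Sketch: resolve a hot-added CPU's MPIDR to a logical CPU number. */
#include <stdint.h>

#define NR_CPUS        8
#define INVALID_HWID   UINT64_MAX
#define MPIDR_AFF_MASK 0xff00ffffffULL  /* Aff3..Aff0 bits of MPIDR_EL1 */

/* Populated at boot from the MADT; disabled slots stay invalid until
 * a plug event arrives for them. */
static uint64_t cpu_logical_map[NR_CPUS] = {
    0x0000, 0x0001, 0x0100, 0x0101,
    INVALID_HWID, INVALID_HWID, INVALID_HWID, INVALID_HWID,
};

int get_logical_cpu(uint64_t mpidr)
{
    mpidr &= MPIDR_AFF_MASK;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_logical_map[cpu] == mpidr)
            return cpu;
    return -1; /* no logical slot reserved for this MPIDR */
}
```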
Also, on the MPIDR: we know that the MPIDR uniquely identifies a CPU in the system, and in the virtual world, as of now, the value of this register is derived and set by KVM. Ideally it should be the responsibility of user space instead of KVM. Right now it is derived from the vCPU ID which QEMU passes to KVM; KVM applies a mapping, as you can see in the diagram, and derives the MPIDR value, which gets programmed into the VMPIDR_EL2 register for that particular vCPU.

At the host KVM level, because of the limitation imposed by the VGIC, KVM must create and initialize all the vCPUs at VM init time. This has a ripple effect: KVM must then also ensure complete initialization of the GIC, which includes initialization of all the redistributors and the ITS for all possible vCPUs. Realization of vCPUs and their threads in KVM might not be desirable for possible vCPUs which are in the disabled state. Also, the unwiring of the interrupt setup between the vCPUs and the GIC requires further consideration in KVM.

Then QEMU. This support is not present by default in QEMU; it has been added as part of the virtual CPU hotplug changes. For ARM64, QEMU lacks support to correctly specify the vCPU topology, that is, in terms of sockets, clusters, cores, and threads. This is required to uniquely identify a vCPU being plugged or unplugged, and ideally it should map to something like the MPIDR. The MPIDR is unique, remember, but for a physical system it is somewhat broken for this purpose: it does have affinity fields, but as suggested by the ARM people, those affinity fields cannot be used to derive CPU topology information; there is no such relationship. In the virtual world, though, within QEMU we do require some association between what we are trying to plug and where we are trying to plug it, correlated with the MPIDR through some kind of mapping. Perhaps this needs to be discussed and worked out. QEMU also lacks support for the PPTT table, which would be used to pass the vCPU topology on to the guest kernel.

Within the guest kernel, any ARM64 architecture-related changes done inside the kernel for the guest should run seamlessly on the host kernel as well. This is a big requirement with a lot of implications: it means you cannot place switches within the guest kernel code to distinguish whether the code is running as part of a guest and relates to hotplug of virtual CPUs.

vCPU hotplug might also benefit from some standardization at the architecture and firmware (ACPI) level, if we can manage that. And any future specification allowing physical CPU hotplug must not be unduly constrained by a vCPU hotplug interface defined now. Because physical CPU hotplug doesn't currently exist, whatever we standardize today should not be contradicted by a physical specification if one comes along in the future; we need to pay attention to that. That is a consideration, and a challenge as well. So, certain workarounds have been discussed for the challenges we just went through.
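Before the workarounds, here is a hedged reconstruction in C of the vCPU-ID-to-MPIDR derivation mentioned above, in the spirit of KVM's reset_mpidr() in arch/arm64/kvm/sys_regs.c. The shift macro is redefined locally so the snippet is self-contained; treat it as a sketch of the scheme, not the authoritative kernel code.

```c
/* Sketch: how KVM currently derives a vCPU's MPIDR from its vCPU ID. */
#include <stdint.h>

#define MPIDR_LEVEL_BITS      8
#define MPIDR_LEVEL_SHIFT(n)  ((n) * MPIDR_LEVEL_BITS)  /* Aff0..Aff2 */

uint64_t derive_vmpidr(uint32_t vcpu_id)
{
    uint64_t mpidr;

    /* Aff0 is capped at 16 vCPUs so that GICv3 SGI registers can
     * address each CPU directly when sending IPIs. */
    mpidr  = ((uint64_t)(vcpu_id & 0x0f))         << MPIDR_LEVEL_SHIFT(0);
    mpidr |= ((uint64_t)((vcpu_id >> 4) & 0xff))  << MPIDR_LEVEL_SHIFT(1);
    mpidr |= ((uint64_t)((vcpu_id >> 12) & 0xff)) << MPIDR_LEVEL_SHIFT(2);
    mpidr |= (1ULL << 31);                        /* RES1 bit of MPIDR */

    return mpidr; /* programmed into VMPIDR_EL2 for this vCPU */
}
```

The point of the challenge is exactly this: the layout above is an internal KVM choice, so moving the derivation to user space would let QEMU pick MPIDR values that match the topology it advertises to the guest.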
So, the various workarounds which have been discussed within the community over the years. All possible vCPUs are created in KVM by QEMU at VM initialization, a kind of pre-creation of all the vCPUs, but QEMU only realizes the possible vCPUs that are not disabled. Remember, realization of a vCPU means its thread also gets spawned; as per the new changes, threads are not spawned for the disabled vCPUs, so we save a bit there as well. This also helps the changes fit neatly within QOM and the way QOM already works, both for the normal case and for the CPU hotplug case.

QEMU pre-initializes the VGIC redistributors for all the possible vCPUs. Remember, the VGIC has the requirement that this be done as part of the VM init process, so QEMU does that, and the redistributors are created along with the vCPUs at VM init time. And of course, since there is a redistributor for each vCPU, they are all present together at VM init time. There have been certain discussions related to this which you can refer to below; I've provided the links.

QEMU and the host KVM should be enhanced to support user-space configuration of the MPIDR value of a vCPU. As discussed in the previous slides, right now this is done by KVM, which derives the MPIDR value from the vCPU ID; ideally, for the virtualized case, it should be done by user space, and KVM shouldn't be involved in deriving the MPIDR value beyond programming it into the VMPIDR_EL2 register, which is a privileged operation.

Also, on vCPU hot-unplug, QEMU parks and powers down the vCPU. Remember the limitation within KVM that you can create vCPUs but cannot destroy them, a limitation also present on Intel. So the common framework added to QEMU parks the vCPUs which have been unplugged rather than destroying them at the KVM level; in KVM they always remain alive, they are just brought down to a low-power state.

QEMU then provides a complete MADT table, including all possible vCPU interfaces and their redistributors, to the guest kernel. This information is used by the guest kernel to initialize its data structures with the base addresses and other parameters. At boot time, the guest kernel uses the info from the MADT to size its various data structures, including initializing redistributors for all possible vCPUs, as said. So the redistributors also exist within the guest kernel from the start; it's just that the disabled possible vCPUs don't have their vCPU instances, which get created and removed as the CPUs are plugged and unplugged.

Now, the recent and earlier attempts. The most recent were the ones posted in June by me: both the kernel changes, that is, the guest changes, and the virtual CPU hotplug changes for QEMU were floated. The host kernel didn't require any change, and these were found to work as such without any change to the host kernel. There is one item, the ability to configure the MPIDR from user space, that might require a bit of change, but as such these are pretty much agnostic of the host.
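To make the MADT workaround concrete, here is a small C sketch of how a guest could distinguish a present vCPU from a hot-pluggable-but-absent one when the firmware lists all possible vCPUs. The flag bit positions follow my reading of the ACPI 6.5 GICC definition, and the struct is trimmed to just the fields used here; treat both as assumptions for illustration rather than the exact table layout.

```c
/* Sketch: classifying GICC entries in an MADT that lists all possible
 * vCPUs, some enabled and some merely online-capable. */
#include <stdbool.h>
#include <stdint.h>

#define MADT_GICC_ENABLED        (1u << 0)  /* CPU is present and usable */
#define MADT_GICC_ONLINE_CAPABLE (1u << 3)  /* slot reserved for hotplug */

struct madt_gicc {
    uint32_t cpu_interface_number;
    uint64_t arm_mpidr;   /* physical ID matched against on a plug event */
    uint32_t flags;
    /* redistributor base address and other fields omitted */
};

/* A disabled-but-online-capable entry is a reserved slot: the guest
 * sizes its per-CPU and redistributor data for it at boot, but only
 * instantiates the CPU when a hotplug event arrives. */
static bool gicc_is_hotpluggable(const struct madt_gicc *g)
{
    return !(g->flags & MADT_GICC_ENABLED) &&
            (g->flags & MADT_GICC_ONLINE_CAPABLE);
}
```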
So all these workarounds take care of the limitations we just discussed in QEMU as well as in the guest kernel, and the changes in the guest kernel are non-intrusive. But again, since they are done in a particular way and we don't have a specification, that can be a concern; we'll come to it shortly. You might like to go through these references.

So what are the problems being faced in upstreaming now? The ARM64 system architecture does not support physical CPU hotplug, as we know, due to the absence of a suitable specification, and concerns have been raised about avoiding divergent system architectures being invented by different vendors. It's not just one vendor trying to support this; other people might come along with their own ideas, and if there is no standardization, things can get a bit messy. This has probably left the patches stalled across multiple attempts. The QEMU patches cannot proceed until the guest kernel patches are duly considered by the kernel community, which we currently feel has not happened, so I would like to request the community for more reviews and involvement in this.

Just to recap the summary, and the way forward. ACPI-based support for the virtual CPU hotplug feature is a much requested feature for the use cases stated earlier in the slides. As you can see, we have proposed a practical implementation, but it seems to be making little progress in upstreaming, for the reasons we have discussed. So the key question is how we overcome the resistance to implementing a virtual-only feature while minimizing possible clashes with, say, any potential future physical equivalent, if it ever comes, which we need to take care of now. Would some sort of minimal specification, ensuring consistency of implementation specifically in the virtual case, alleviate such concerns, and if so, how do we go about it?

The future work which is left is around live migration support, support of hotplug with the vCPU topology as mentioned earlier, PPTT support to hand over the right vCPU topology info to the guest, test cases, and lots of documentation, because there are a lot of workarounds which perhaps require proper documentation.

Well, I'd like to take this opportunity to thank all the individuals who have worked on this particular topic in the past and discussed their ideas. I'm just trying to carry the baton forward; the actual groundwork was done much earlier, and I have tried to implement the ideas which were discussed and to demonstrate that they were achievable and actually work. If there is anyone I missed from my list, please forgive me; as I said, this is a tribute to everyone who was involved in this effort. With this, I'd like to open the session for Q&A, and I would also like to thank you for listening patiently to my session. Thank you so much. I'm open to any questions now. Thank you.