If you're a system administrator or have experience upgrading hypervisors, you are all too familiar with long maintenance windows and VM downtime. Even scarier is facing a failed VMM upgrade after having spent a long time in the maintenance window. Hi, I'm Soham, and together with my colleague Prachatosh from Nutanix, I'm going to talk about these issues and our solutions to them. Today's talk covers the following: first, the problems with VMM updates; next, the design and implementation of our solution; and finally, our results and future work.

So what are the problems with VMM updates? First, the VMs need to be migrated off the host; only then can we update the hypervisor stack on it. Finally, the VMs need to be migrated back onto the host. This process has to be repeated for each and every host in the cluster, so we end up with a long maintenance window. Let me highlight a few issues we have often observed at Nutanix while performing VMM updates. The first is non-determinism: VM migrations can fail after an unbounded amount of time. Next is guest impact: during VM migration, to force convergence, we need to throttle the vCPUs of the VM, which has an adverse effect on guest performance. Finally, there is resource contention. Take the network as an example: during migrations, a large number of pages have to be transferred over the network, increasing network traffic. The same network might also be used by the storage service for I/O operations, so we can end up with network congestion. For reasons like these, we have often seen customers defer VMM upgrades.

So how do we solve this? We propose a novel approach of upgrading VMMs using local live migration in libvirt. With this approach, we migrate the VMs locally within the host using libvirt's existing migration workflow. No memory copy is required for the VMs, so no dirty logging or vCPU throttling is needed. We can migrate the VMs very quickly and with near-zero downtime. Since there is no need to transfer the VMs off the host and then bring them back, the upgrade workflow time is greatly reduced, and hence the maintenance window is minimal. We hope that with future work we can shrink the maintenance window to the point where it can be gotten rid of entirely.

Next, I will go over a high-level overview of the local migration workflow. Here we have a typical host setup with a QEMU binary installed, in this case QEMU 6.1, and on top of it a VM running, managed by the libvirt service. Say we want to upgrade the QEMU binary: we issue a yum update command, which updates the QEMU binary to the latest version, QEMU 6.1.1. From libvirt, we then issue a local migration command. This spawns a new QEMU process on the newly upgraded QEMU 6.1.1 binary and transfers the VM state from the old QEMU process to the new one. Once this is complete, we can finally destroy the old QEMU process. That completes the local migration workflow.
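From a management client's perspective, driving such an upgrade can be pictured with the public libvirt C API. The sketch below is illustrative only: VIR_MIGRATE_LOCAL is a hypothetical flag standing in for whatever the patched libvirt exposes for local migration (virsh's --local), and passing a NULL destination URI assumes the patched workflow spawns the new QEMU locally; the rest is the stock API.

```c
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Hypothetical flag for the local-migration support described in the talk. */
#define VIR_MIGRATE_LOCAL (1UL << 30)

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirt\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, "vm1");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    /* Source and destination are the same host, so no destination URI:
     * the (assumed) patched workflow spawns the new QEMU process locally
     * and hands the VM state over to it. */
    if (virDomainMigrateToURI(dom, NULL,
                              VIR_MIGRATE_LIVE | VIR_MIGRATE_LOCAL,
                              NULL, 0) < 0)
        fprintf(stderr, "local migration failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```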
Next, Prachatosh will go over the design and implementation and the remaining sections of the presentation.

Hi, I'm Prachatosh, and I'll discuss the design and implementation, as well as the results, of our work on VMM live updates using local migration. Before diving into the details of the implementation, let's discuss some of the design challenges we faced and some of our implementation goals. While implementing local live migration, we wanted to keep the management API unchanged. Specifically, this meant handling the fact that libvirt expects the name and UUID of a VM on a given host to be unique. Clearly, this does not hold during local live migration, because there will be a source VM and a destination VM sharing the same name and UUID, so we have to account for these conflicts and handle them appropriately. Other dependencies on name and UUID include the absolute paths libvirt uses for some files. For instance, the monitor socket path depends entirely on the UUID of the VM, so disambiguation is needed there as well.

Once we had addressed keeping the management API unchanged, we had to change the existing libvirt migration workflow to enable local migration. The current libvirt workflow has five phases: begin, prepare, perform, finish, and confirm. Of these, begin, perform, and confirm run on the source host, while prepare and finish run on the remote host. For local migration, however, the remote and source hosts are one and the same, so the remote steps must be made to run on the same host as the source. Finally, we also have to resolve and use the correct domain object: the remote object for the remote phases and the original object for the source phases of the migration.

One might ask what the benefit of local live migration is if we still incur the memory copy, the most expensive and most unpredictable phase of migration. We really want to avoid the memory copy, and avoiding it was a primary goal of this design. To bypass the memory copy, we implemented an FD transfer mechanism, through which the source QEMU uses libvirt as an intermediary to pass the FDs of its memory backends to the destination QEMU. It works as follows. First, we changed QEMU to add a new QMP command, fetch-backing-fd, which sends all the FDs for the memory backends to the client using SCM_RIGHTS. This is invoked during local migration in the begin phase: libvirt queries QEMU and receives all the FDs. Next, libvirt prepares to start the destination QEMU process, the destination VM, in the prepare phase. At that point, we use libvirt's existing FD-passing workflows to send the received backend FDs over to the destination QEMU. We again had to change the destination QEMU so that it accepts these FDs and knows not to open the files or allocate memory, but instead simply uses the already opened FDs received from libvirt for its memory backends. Now, when the iterative phase is supposed to start during the perform phase, the destination VM already has the FDs and all the memory it requires.
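As a minimal illustration of the kernel mechanism underneath this, here is a sketch of handing an open file descriptor across a Unix domain socket with SCM_RIGHTS; the helper name is ours, not QEMU's, but the ancillary-data plumbing is the standard POSIX pattern that fetch-backing-fd relies on.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open fd over a connected Unix domain socket. The receiver gets
 * a new descriptor referring to the same open file description, so the
 * memory backend is shared, not copied. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;            /* guarantee cmsg alignment */
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;        /* ancillary payload is an fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```

Because the passed descriptor refers to the same open file description, the destination QEMU sees exactly the guest memory the source was using, with no page ever crossing a copy loop.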
With the memory already in place, we can mark all the pages as transferred, so there are no memory copy iterations at all. In this way, we avoid the entire copy phase, which improves migration time as well as reliability, since failures during copying simply cannot happen. The FD transfer mechanism is not restricted to memory backends alone: it can be extended to other devices, such as block devices and character devices, which will improve migration times further.

However, FD transfer does not solve every problem in enabling local migration in libvirt and QEMU; there are cases where it does not work. For instance, the monitor socket is already open in the source VM, which keeps running during the migration, so its FD cannot be passed to the destination VM. Yet the monitor socket path depends on the VM's UUID: it is something like <some location>/domain-<vm1-uuid>-monitor.sock. When we do a local migration, we will have two VMs with the same UUID, which causes a conflict. Instead, we disambiguate between the two VMs using the numeric domain ID that libvirt assigns to each VM. Say VM1 has domain ID 1: we prepend the domain ID to the UUID, so the path becomes <some location>/domain-1-<vm1-uuid>-monitor.sock. For backwards compatibility, we create a symlink from the expected path, the one without the domain ID, to the newly created path. This way, the expected path is always available and points to the correct path in the initial phase. In the prepare phase, we launch the new QEMU process and create for it a monitor socket with domain ID 2, the domain ID of VM1'. VM1 and VM1' are the source and remote counterparts, respectively, and therefore share a name and UUID, so the domain ID is the only thing differentiating the two paths. In the perform and finish phases there are no changes, but in the confirm phase we kill the source VM and remove its monitor socket. The currently running VM is now the destination VM, VM1', so we switch the symlink over to VM1''s path. At each step of the migration, the expected monitor socket path thus points to the correct path in the new format: in stages where the source VM is running, it points to the source VM, and when the switch-over takes place and the destination VM takes over, the symlink is switched to point to the destination VM. This maintains backwards compatibility while letting us disambiguate between the two VMs, which is what we need to enable local migration.
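A minimal sketch of this switch-over follows; the directory layout and helper are illustrative, not libvirt's actual code. The key detail is doing the flip with rename(), which is atomic, so the legacy path always resolves to a valid monitor socket even mid-switch.

```c
#include <stdio.h>
#include <unistd.h>

/* Point the legacy, UUID-only monitor path at the domain-ID-qualified
 * socket for the given domain ID (1 for the source VM, 2 for VM1'). */
static int publish_monitor_path(const char *dir, int domid, const char *uuid)
{
    char real[256], legacy[256], tmp[256];

    snprintf(real, sizeof(real), "%s/domain-%d-%s-monitor.sock",
             dir, domid, uuid);
    snprintf(legacy, sizeof(legacy), "%s/domain-%s-monitor.sock", dir, uuid);
    snprintf(tmp, sizeof(tmp), "%s.tmp", legacy);

    unlink(tmp);                      /* clear any stale temporary link */
    if (symlink(real, tmp) < 0)       /* build the new link off to the side */
        return -1;
    return rename(tmp, legacy);       /* atomically replace the legacy path */
}
```

In this scheme, the begin phase would call publish_monitor_path(dir, 1, uuid), and the confirm phase would call it again with domain ID 2 to flip the link over to the destination VM.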
We see a similar situation when modifying the migration phases. Here the main challenge is to use the correct domain object and to ensure that the data structures libvirt stores do not conflict. libvirt keeps hash tables indexed by the UUID and name of the VM. In normal operation this is not a problem, since these are unique, but during local migration a query for a given name or UUID could fetch either of two different VMs, the source or the remote, if we made no changes. To separate the two, we create a remote hash table, which starts out empty and is used only during the remote phases of local migration; we call the existing hash tables the source hash tables, as they are used in the source phases.

The source phases change little: in the begin phase, a query for VM1's UUID returns the source VM, VM1. Before starting the prepare phase, we need to populate the remote hash table, since that is the table we want to use there. So we create an entry under the same UUID, that of the VM being migrated, but the entry is a duplicate of VM1 called VM1', the remote object corresponding to the source VM. This way, querying the remote hash table with VM1's UUID returns VM1', the correct object for the remote phases, while the same query on the source table returns VM1, the correct object for the source phases. By adding this extra table, we cleanly separate the phases and ensure they work correctly: prepare works on the remote hash table, perform on the source hash table, and finish on the remote hash table. After the confirm phase, the source VM, VM1, is killed and VM1' takes over. We have to make sure this is correctly reflected in the source hash table, since all libvirt interfaces and queries use the source hash table; the remote table is used only for the local migration work. To ensure this, we copy the entry from the remote hash table over to the source hash table. Now, queries for VM1's UUID return VM1', which is the correct VM at this stage, since it is the destination VM that has been successfully migrated. Note again that at every phase of the migration, the source hash table always contains the correct, running VM. Therefore, commands such as virsh list always return the correct VM and continue to work once the migration has completed.
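The table shuffle can be sketched roughly as follows, using GLib hash tables (libvirt itself is built on GLib); the vm_t type, the phase flag, and the helper names are illustrative, not libvirt internals.

```c
#include <glib.h>

typedef struct { char *uuid; char *name; /* ... */ } vm_t;

static GHashTable *source_vms;   /* normal operation + source phases */
static GHashTable *remote_vms;   /* shadow table for local migration */

static void init_tables(void)
{
    source_vms = g_hash_table_new(g_str_hash, g_str_equal);
    remote_vms = g_hash_table_new(g_str_hash, g_str_equal);
}

/* Remote phases (prepare/finish) consult the shadow table, source phases
 * (begin/perform/confirm) the original one, so the same UUID can resolve
 * to two distinct domain objects without conflict. */
static vm_t *lookup_vm(const char *uuid, gboolean remote_phase)
{
    return g_hash_table_lookup(remote_phase ? remote_vms : source_vms, uuid);
}

/* Before prepare: register the duplicate object VM1' under VM1's UUID. */
static void add_remote_entry(vm_t *dst_clone)
{
    g_hash_table_insert(remote_vms, dst_clone->uuid, dst_clone);
}

/* Confirm phase: the destination VM is now the real one, so move its
 * entry into the source table, where all normal queries look. */
static void promote_remote_entry(const char *uuid)
{
    vm_t *dst = g_hash_table_lookup(remote_vms, uuid);
    g_hash_table_remove(remote_vms, uuid);
    g_hash_table_replace(source_vms, dst->uuid, dst);
}
```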
In summary, we wanted to avoid the memory copy for local migrations, so we implemented an FD transfer framework from the source QEMU, via libvirt, to the destination QEMU. This avoids the iterative memory copy and improves performance. For cases where we could not use FD passing, we used the symlink approach: we created paths for the monitor socket and log files that embed the domain ID for disambiguation, plus a symlink that is switched from pointing at the source path to pointing at the destination path at the end of the migration. Finally, for the changes to the migration flow, we added a shadow remote hash table, used during the remote phases, whose entry is moved over to the source hash table once the migration completes. Note that we did not break any of the existing libvirt interfaces, yet we were able to implement local migration while having two VMs with the same name and UUID.

We will now go through a brief demo and then look at some results from our local migration implementation. Prior to the demo, here is the high-level workflow we expect. Using your favorite package manager, for example yum, you update the QEMU binary on the host. This replaces the QEMU binary on the file system with the new version; however, VMs started on the old version continue to run on it. Therefore, we have to locally migrate each VM to the new version. This is done with the local migration command available from virsh: we add the --local parameter, so the command is virsh migrate --local <domain>. Other fields are not needed for local migration. After local migration completes, all the VMs run on the updated QEMU binary.

We'll see this workflow in action in the demo. On the left is a guest VM running Ubuntu; on the right is the host it runs on. We can confirm this by issuing a virsh list command, which shows the VM. Next, we want to update the QEMU binary, so let us first check which QEMU version the guest is running on: it is 6.1.0. We have a version 6.1.1 with some fixes we want to bring to this host, so we install it using the package manager. Once it is installed, the QEMU version on the host has been bumped to 6.1.1; however, the VM is still running on 6.1.0 and is missing the changes between these versions. We therefore need to issue a local migration. Before that, as an example of the impact of local migration, we run a program with a very high write throughput: it writes to each page, dirtying memory at a rate we specify. A higher dirty rate leads to longer convergence times, since the iterative copy phase keeps going until few enough dirty pages are left to copy over. We execute this program and issue the virsh migrate --local command. Note that the dirty rate specified here is 8 GB/s, which means the program is dirtying the entire memory of the VM, itself 8 GB in size, every second. We perform the local migration, and it finishes within a second. Note that there is no dip in the dirty rate at all during the migration: the guest impact of local migration using FD transfer is essentially zero, since there is no iterative memory copy and no auto-converge algorithm throttling the guest's throughput, so the dirty rate stays the same throughout. We can now look at the metrics, specifically the migration time and the downtime incurred. Querying the migration statistics for the VM, we see that the total time taken is 830 milliseconds, which is very low for a VM running a high-throughput workload, and the downtime is only around 30 milliseconds. We'll elaborate on these results in the following slides.
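For reference, the dirtying workload used in the demo could look like the sketch below; the constants mirror the demo's 8 GB VM and 8 GB/s dirty rate, but the pacing logic is a simplification of ours, not the actual test program.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 8UL << 30;        /* 8 GiB working set */
    const size_t page = 4096;
    const double rate = 8e9;              /* target: bytes dirtied per second */

    uint8_t *buf = malloc(size);
    if (!buf)
        return 1;
    memset(buf, 1, size);                 /* fault every page in up front */

    for (uint8_t v = 2; ; v++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < size; off += page)
            buf[off] = v;                 /* one write dirties a whole page */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double took = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double want = size / rate;        /* how long one pass should take */
        if (took < want)                  /* sleep off any surplus time */
            usleep((useconds_t)((want - took) * 1e6));
    }
}
```

Under migration with an iterative copy, a workload like this keeps the dirty set at the full VM size, which is exactly why the external and iterative-copy lines in the results below grow with memory size while the FD transfer line stays flat.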
The first metric we'll look at is migration time. The graph on the right shows the migration time of three approaches for four categories of VM: the yellow line is external migration, the blue line is local live migration using an iterative memory copy, and the green line at the bottom is local live migration using FD transfer, the approach we propose. The VMs are in increasing order of size: VM 1 has 8 GB of memory and VM 4 has 32 GB. All of them run the high-throughput workload shown in the demo, dirtying their entire memory every second, so this is a pathological case. We see its impact on external migration in the yellow line: the time taken grows from 20 seconds to over a minute for the largest VM, and it keeps growing; migrations of even larger VMs may fail to converge. Local migration with an iterative memory copy is better, but there is still a linear increase, and the largest VM takes over 45 seconds. These are not viable approaches, as they are affected by the size of the memory and by the workload. In contrast, the FD transfer graph, the green line, is almost flat. On the left, we show the same FD transfer numbers in milliseconds: the time taken varies between 800 and roughly 850 milliseconds across all cases. Clearly, there is no dependence on the size of the VM or on the workload being run; even for larger VMs, the time taken should not grow. This is a clear benefit of implementing local live migration with the FD transfer mechanism: one can migrate any workload on any VM in a very short time, which shrinks the required maintenance windows, and reliability improves as well, since there is no chance of failure during an iterative copy.

The next metric is downtime, the time the guest is unavailable during the migration, which we want to minimize to improve the upgrade experience. We see a similar trend as for migration time: with FD transfer, the downtime is under 50 milliseconds in all cases, irrespective of workload or VM size. For the iterative memory copy, the downtime is around 150 to 200 milliseconds, and for external migration it varies widely, around 300 to 400 milliseconds, depending on several factors, for instance the network bandwidth during the iteration just before the downtime begins. Again, the trend confirms that guest size and workload are not a factor for our approach, which is expected, since no memory copy phase is involved. We also note that the guest performance penalty is zero, as we saw in the demo. We have therefore improved both performance and reliability by avoiding the most unpredictable phase, the memory copy, and this should significantly cut down VMM upgrade times.

In conclusion, we have enabled QEMU upgrades using local migration. We made changes to the existing libvirt migration workflow and extended the FD transfer flows in QEMU and libvirt to allow local migration without memory copies. With our approach, migration times are around one second and downtimes are under 50 milliseconds, irrespective of the size of the VMs, their workloads, or other characteristics of what is running in them. As future work, we want to extend the FD transfer framework to all types of devices: we have tested with a subset of devices, but we want to explore a wider set, including passthrough devices. Finally, we want to continue upstreaming our patches. We sent a prior version of local migration out for review upstream last year; we want to update it with the latest QEMU and libvirt changes we have made and continue the discussion upstream. We look forward to comments and to discussing our approach. Thank you for listening to our talk. Please feel free to reach out to us at the email IDs shared below.