Hello, everyone. I am Jae Yong, and I'm going to present the live migration that my team has developed. Before I start, I would like to thank Ian Campbell, who helped a lot with developing this migration. He gave very constructive comments, and the migration turned out to work very well. Thank you. The title says this is a performance evaluation of the live migration. Some people yesterday asked me what my optimization points for live migration are, but actually I was only engaged in developing this live migration for a short time, and I haven't done much about optimization at the moment. So this is basically just a working version of live migration, and this talk is live migration 101.

It starts with the motivation. I think you have already seen this chart a lot, so I'll go over it briefly. Data centers consume an enormous amount of energy, and the number of data centers is growing, so the electricity needed to run and operate them increases very sharply. This chart shows the total cost of ownership, and almost one third of the cost is electricity. That makes the financial managers unhappy. What we can do here is use low-power processors, for instance ARM cores. There is server hardware for this, as Stefano talked about this morning, and also software, virtualization, et cetera. Along with this trend, our team decided to develop live migration.

This slide shows the modules required for enabling live migration. Basically, live migration is moving a running guest OS from one physical host to another. That means we have to copy the state of the virtual machine to the destination host, which includes the memory contents, the VCPU registers, some device registers, et cetera.
One of the most important things about live migration is that while moving the guest, we must disturb it as little as possible. For that, we also perform dirty page tracing; this slide briefly explains what dirty page tracing is. This is the overall sequence of the live migration. Whenever the Xen toolstack, xl, or libvirt, or management software such as OpenStack, requests that a running VM be moved to another host, the migration process starts on the source host side, and it triggers the receive process at the destination side, so the destination can be ready to receive a VM. The source host first sends out the memory map of the guest, so the destination can prepare some memory. Based on that, the source host sends out the entire memory of the DomU. Note that the DomU keeps running, which means the DomU performs writes while the memory is being sent. Those writes have to be delivered to the destination host in order to keep the destination's memory contents up to date. For that, we use dirty page tracing: we trace all the writes from the DomU while the source host sends the memory contents. This dirty page tracing is performed iteratively until some stop condition is satisfied. After that, at some moment, we have to suspend the running guest OS; with this pre-copy approach, we cannot live migrate a virtual machine without suspending the guest OS. After suspending, the DomU does not perform any writes, so we copy the last dirty pages, the VCPU registers, the device registers, et cetera. That information is sufficient for resuming the guest OS at the destination side. Then the guest OS is resumed there, and the VM on the source host side is killed. Now I would like to talk in a bit more detail about dirty page tracing, which is not a trivial task to achieve.
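The sequence just described can be sketched as a toy simulation. Everything here, the Guest class and all the helper names, is invented for illustration; it is not the real Xen toolstack interface, just the shape of the pre-copy loop.

```python
class Guest:
    def __init__(self, num_pages):
        self.pages = {pfn: 0 for pfn in range(num_pages)}  # pfn -> contents
        self.dirty = set()       # pfns written since the last collection
        self.running = True

    def write(self, pfn, value):
        # The DomU writing to its memory while migration is in progress.
        assert self.running
        self.pages[pfn] = value
        self.dirty.add(pfn)      # what dirty page tracing would record

    def collect_dirty(self):
        # The toolstack asks which pages are dirty; the log is reset.
        dirty, self.dirty = self.dirty, set()
        return dirty

def live_migrate(src, dest_pages, workload, max_rounds=30, threshold=2):
    to_send = set(src.pages)     # round 0: the guest's entire memory
    rounds = 0
    while True:
        for pfn in to_send:      # copy while the guest keeps running
            dest_pages[pfn] = src.pages[pfn]
        workload(src)            # the guest dirties pages meanwhile
        to_send = src.collect_dirty()
        rounds += 1
        if len(to_send) < threshold or rounds >= max_rounds:
            break                # stop condition satisfied
    src.running = False          # suspend: the DomU writes no more
    for pfn in to_send:          # final copy of the last dirty pages
        dest_pages[pfn] = src.pages[pfn]
    return rounds
```

After the final copy the destination's pages are identical to the source's, which is exactly why suspending before the last round is required: without it, writes could race past the final transfer.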
Basically, it consists of two parts. The first is how we detect that the DomU is writing some contents. The second is how we tell the toolstack: the toolstack reads the DomU's memory contents and sends them to the destination side, so we somehow have to tell the toolstack which pages are dirty.

First, the detection. This slide shows the page tables inside Xen on ARM. There are three page tables. The top right is the page table of the guest. The bottom right is the P2M table maintained by Xen. And the bottom left is Xen's own page table for running the Xen binary itself. What we do is the following. Whenever the dirty page tracing process is started, we set all the pages in the P2M table to read-only. So whenever the DomU tries to write something, a write permission fault happens, and it is trapped by Xen. We can then record the page frame number, so we know which page is dirty. Then we have to set that page back to read-write in order to let the DomU continue executing. At the moment of the trap we have the guest physical address, which is mapped to a machine address, and we have to change the corresponding leaf page table entry for that guest physical address to read-write.

There are two choices for doing this. The first is that you can just manually walk the page table; it's very simple. The second is a virtual linear page table. Manual walking just means finding the proper index from the guest physical address, the IPA (intermediate physical address, the term from the ARM manual for the guest physical address; I accidentally mix the two terms). But these P2M tables are not mapped into the Xen page table all the time, so we have to map a level, read it, unmap it, map the next one, read it again, and so on. And consider that whenever the guest writes something, we have to detect it: this dirty page tracing happens really frequently.
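The trap-and-log mechanism above can be modeled in a few lines. This is a toy model, not Xen's actual P2M code: permissions are an array, the "fault" is just a branch, and page size is fixed at 4 KiB.

```python
READ_ONLY, READ_WRITE = 0, 1
PAGE_SIZE = 4096

class P2M:
    def __init__(self, num_pages):
        self.perm = [READ_ONLY] * num_pages  # tracing starts: all read-only
        self.dirty_log = []                  # where dirty pfns are recorded

    def guest_write(self, gpa):
        pfn = gpa // PAGE_SIZE               # guest physical -> frame number
        if self.perm[pfn] == READ_ONLY:
            # Write permission fault, trapped by the hypervisor:
            self.dirty_log.append(pfn)       # record which page is dirty
            self.perm[pfn] = READ_WRITE      # let the guest keep executing
        # (the guest's write then proceeds normally)

    def reset_for_next_round(self):
        # Once the toolstack has consumed the log, re-arm the tracing by
        # flipping the dirtied entries back to read-only.
        for pfn in self.dirty_log:
            self.perm[pfn] = READ_ONLY
        log, self.dirty_log = self.dirty_log, []
        return log
```

Note that a second write to an already read-write page causes no fault and no log entry, which is why each page appears at most once per round.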
This kind of map, read, and unmap is very inefficient. So what we can do, and actually this is Ian's idea, is prepare a virtual linear page table. The reason we have to walk the page table is that the page table is hierarchical: the real machine address exists in the third-level page table, and we have to walk down to it. But if we map those leaf page table pages into the virtual address space in linear order (the first one may be positioned anywhere in physical memory, but it is mapped into the first slot of the virtual range, the second one into the second slot, and so on), then just by a calculation from the guest physical address we can immediately find the leaf page table entry of the corresponding page, and then just change the read-only bit to read-write. This is the currently implemented version of the dirty page detection.

As I mentioned earlier, these dirty page faults happen very frequently compared to how often the toolstack asks which pages are dirty. So we have to temporarily store the dirty pages before the toolstack asks. Whenever the toolstack asks, we read the dirty pages from this temporary storage and set the corresponding bits in the toolstack's bitmap. There are several possible choices for this temporary storage, and currently we use a linked list: whenever a dirty page is detected, we store its address in the linked list. The reason I chose this method is that at the time the toolstack asks which pages are dirty, we tell the toolstack and we also have to reset all the dirty pages to read-only again, so that whenever the guest writes to that memory again, we can re-detect it. For that purpose, we have to reset all the corresponding P2M entries.
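The arithmetic behind the virtual linear page table is a single shift and add. The constants below are illustrative (4 KiB pages, 8-byte LPAE entries, an arbitrary base address), but they show why the multi-level walk disappears.

```python
PAGE_SHIFT = 12    # 4 KiB pages
PTE_SIZE = 8       # bytes per LPAE page table entry

def leaf_pte_address(vlpt_base, gpa):
    # Entry index = gpa >> PAGE_SHIFT. Because the leaf tables are mapped
    # contiguously in virtual space, the entry's virtual address is just
    # base + index * entry size: no walk, no map/unmap per lookup.
    return vlpt_base + (gpa >> PAGE_SHIFT) * PTE_SIZE
```

For example, the entry for guest physical address 2 MiB sits exactly one 4 KiB table-page (512 entries) past the base, regardless of where that table lives in physical memory.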
If we used some kind of bitmap, or some embedded bits in the page table, we would have to perform a full search over the entire memory frames of the guest domain, and I wanted to avoid that, so I decided to use a linked list. But it turns out that there are very optimized methods for finding zeros and ones in a bitmap, already implemented in Xen, and I got comments that I could use those instead of a linked list, because the linked list requires some memory allocation during the dirty page detection process, which is not that efficient, I think. But the current evaluation was performed with the linked-list version. And that is how my team developed the live migration.

About the performance evaluation: we did not just evaluate the live migration, but also wanted to see the energy efficiency of an x86 server versus an ARM server. Actually, I want to say ARM server, but the actual evaluation was performed on ARM boards. So what have we done? We set up an x86 server and prepared an Arndale board, installed Xen on both, ran a streaming server in the guest domain, and measured how many concurrent streaming clients those streaming servers can support. Meanwhile, we also used a power meter to measure how many watts are consumed to support those numbers. As you know, the hardware comparison is not fair, absolutely not fair, because one is a server and one is a mobile-class board. But apart from the hardware, we tried to be as fair as possible in terms of scheduling and the driver model; we used PV drivers on both, and so on.

So what is the number in the case of the Arndale board? The maximum number of streaming clients that one Arndale board can support on the Xen virtualized platform is 110. And if I try to increase the number of VMs, for instance running two VMs with the streaming server in both and streaming from both, the numbers decrease.
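The bitmap-plus-bit-scan alternative suggested in the comments can be sketched as follows. This is not Xen's code; it mimics a find-first-set style scan using a Python integer as the bitmap, so marking a page allocates nothing in the fault path.

```python
class DirtyBitmap:
    def __init__(self, num_pages):
        self.bits = 0
        self.num_pages = num_pages

    def mark(self, pfn):
        # No allocation here, unlike appending to a linked list.
        self.bits |= 1 << pfn

    def collect(self):
        # Yield dirty pfns by repeatedly isolating the lowest set bit,
        # the classic find-first-set trick, then clear the bitmap.
        pfns = []
        b = self.bits
        while b:
            low = b & -b                  # lowest set bit
            pfns.append(low.bit_length() - 1)
            b ^= low
        self.bits = 0
        return pfns
```

The trade-off the talk describes is visible here: collection scans only set bits rather than every frame, while the reset-to-read-only pass still needs the list of dirty pfns either way.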
That is because the Arndale board has only two cores, and one core for Dom0 and one for DomU is the best configuration at the moment. The consumed power is 14 watts. So the clients per watt, basically how many clients can be served per watt, is seven, as you see in the bottom row there. And the x86 server can support 11; higher is better. This result was quite disappointing, and my manager was not happy about it. We tried to figure out why, and currently the cores are really the bottleneck. If we increase the number of cores, and of course in a server case we are going to increase the number of cores, we think it would be much better. So we acquired a quad-core board, which also has some features for fast networking, et cetera, and the result is much better. As you see, for the quad-core board the best number comes when we use three VMs: the maximum number of streaming clients is 300, the required power is 18.9 watts, and the clients per watt is 15. And finally, my manager was happy about it.

I would still like to say that these are not the optimal values on either side, neither the x86 side nor the ARM side. There are many, many things that could increase this performance. Whatever we knew, we tried on both sides; whatever we didn't know, we of course didn't try. And it gives these values. Please keep in mind this is a streaming workload, which is an I/O-bound workload, and for certain types of I/O-bound workloads we believe that ARM servers could give quite nice performance.

OK, back to the live migration. One of the key use cases of live migration is server consolidation, or energy efficiency, and of course it works very nicely: if you can consolidate many servers, the energy saving ratio increases. This is quite obvious, so I'll just pass over it. For implementing the dirty page detection, I mentioned two methods.
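The clients-per-watt figures quoted above work out as follows; the 7 and 15 in the slides appear to be these ratios rounded down.

```python
def clients_per_watt(clients, watts):
    # Efficiency metric from the evaluation: served clients per watt.
    return clients / watts

dual_core = clients_per_watt(110, 14.0)   # Arndale, 2 cores: about 7.9
quad_core = clients_per_watt(300, 18.9)   # quad-core board: about 15.9
```

So moving from two cores to four roughly doubled the efficiency on this I/O-bound workload, which is consistent with the claim that the cores were the bottleneck.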
The first is page table walking and the second is the virtual linear page table. The virtual linear page table gives very nice performance; lower is better here. On the left is the elapsed time for handling one dirty page, and on the right is the elapsed time for telling the toolstack which pages are dirty. Both show very nice performance.

And service downtime: as I said, a suspend and resume happens during the live migration, and the interval in between is typically known as the downtime. It is around one second. I'm not happy with this result, and I would like to further profile how much time each function takes and then optimize it. The total time of the live migration is also an important metric. While I was migrating a streaming server with around 10 to 20 streaming clients attached, the quality of the streaming decreased during migration, because of the dirty page tracing and the many other things happening on the server. That means we have to somehow decrease these values. As you see, the iperf case has the highest value, over 80 seconds. The reason is that it eats up all the network bandwidth, and live migration also requires some network bandwidth, so they basically conflict. I remember someone at KVM Forum talking about RDMA-based live migration, which I think is quite a cool feature, and maybe we can apply it here to resolve this kind of problem.

This slide shows the dirty pages during the dirty page iterations, and we can see that the count immediately converges to the minimum value. About the stop condition, which I actually didn't mention before: I read the stop condition from the x86 code and applied the same one here, and it shows these results. I think the ARM server is not that powerful at the moment.
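The stop condition borrowed from the x86 code can be sketched like this. The threshold values here are from my recollection of the classic xc_domain_save heuristics, so treat them as illustrative defaults rather than the exact ones used.

```python
def should_stop(dirty_count, iterations, total_sent, domain_pages,
                dirty_threshold=50, max_iters=29, max_factor=3):
    # Stop iterating when few pages remain dirty, when we have done too
    # many rounds, or when we have already sent several times the guest's
    # memory (i.e., the workload is dirtying pages faster than we send).
    return (dirty_count < dirty_threshold
            or iterations >= max_iters
            or total_sent > max_factor * domain_pages)
```

The observation in the talk is that on this hardware the first clause almost always fires after very few rounds, which is why an additional "converged, so suspend immediately" condition could shorten total migration time.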
That means the write speed of the DomU guest is not that fast, which means the dirty page generation rate is not that high, and that gives very fast convergence. We may need some additional stop condition, because we could immediately stop the VM whenever it has converged and just send the remaining state, which would make the total time of the live migration smaller. OK, that's my talk on this live migration. Do you have any questions?

For the migration part, do you have data for ARM? Do you have the comparison with x86?

No, not at the moment. But I would also like to see the comparison with x86, especially this convergence behavior. I want to see what the convergence characteristics are on x86.

My question is actually about the linear page table. As you mentioned, a linear page table is made by programming one of the page table entries to point to the page table itself; that's how the linear page table is shaped. On x86, you have the full memory as page tables, there are many entries, and basically you grab one entry and point it to itself. Is it made like that here?

Yes, and of course the total size of the third-level page tables is very large, which makes the virtual linear page table require a very large amount of virtual memory.

So my question about this linear page table is: if you want to use it, this sort of page table must be mapped into Xen itself, and then you can use the virtual address to access it. So every time you try to read one of these page tables through this mapping... or maybe Xen is sharing some virtual address range with the guest kernel, I don't know.

Okay, I got it.

So I was actually trying to understand, because on x86 you are using HVM.
For example, if you're using shared page tables, you cannot place a linear page table in the shared page tables, because otherwise the guest could use a certain portion of the linear address space to access the shared page tables themselves, which brings security problems. So I was trying to understand... I don't... Oh, okay, okay. Maybe with this I can try it on the ARM side. Thank you.

Yes, in both x86 and ARM, you put the linear page table into Xen's page tables rather than into the page tables the guest is running on, because it's Xen that needs to access them.

Any more questions? Okay, we're running a little bit ahead, so the next talk will start in about five minutes. Thank you. Thank you.