Okay, I think we have most of the critical people here, so let's start. Let me introduce myself again, since some people have just joined us. My name is Jack Jackrain, from Intel. I'm actually not a virtualization guy; my main background is Android. Early this year I started working with our virtualization team, trying to apply virtualization technology to Android. Shantai and Dongxiao are both from the virtualization team, and together we implemented this prototype as an innovation project. It is not an official project.

This is the agenda. First, I will give an overview of what this project is and what our goal is. Then I will talk about the design details. Even with this design, which gets us close to native performance, we still found several gaps, so I will talk about those gaps and our analysis, including the optimizations we have done. Finally, a summary.

Okay, the overview. Let's go back to the Xen Summit 2019 in Seoul. Jun Nakajima from Intel, an architect on the Intel virtualization team, gave a talk there called "Mobile Virtualization using Xen Technology." His point was that mobile virtualization will become more important in the future, and Xen has a unique advantage there: it is small, a light hypervisor, which makes it a good choice for bringing virtualization to mobile devices. Jun proposed a configuration like this: Dom0 runs the host Android, and Dom1 (or DomU) runs the guest Android, both of them PV. He also proposed another configuration, which I don't show here, where the DomU runs as an HVM guest with a vendor OS.

I believe all of you are familiar with this diagram, because it is the traditional one: the native devices are passed through to Dom0, and Dom0 provides the backends of the paravirtualized drivers. DomU accesses the physical devices through its frontend drivers, which talk to the backend drivers in Dom0. That diagram is well known to all of you.

Today we propose another use case: run Android in Dom0 as the only guest, and use an HVM guest as a TEE, a trusted execution environment. That means we can run security workloads in the HVM guest. But we don't want to sacrifice performance and power too much, because both are key for mobile devices. If this solution is adopted in production, we cannot tell end users that, because we use virtualization technology, they have to accept the overhead it introduces. Users only care about the performance; they don't care what technology is inside the device. Not sacrificing performance and power too much is the main topic I would like to talk about today.

Design details. In this design, we run Android as natively as possible. Take the I/O: all the I/O is passed through to Android, so the Android drivers can access the physical devices directly. In this way we get near-native performance when Android accesses physical devices, even for graphics. The second item is the CPU: every virtual CPU is pinned to a physical CPU. In this way we eliminate the CPU scheduling overhead, and we disable the Xen scheduler. We also disable almost all of the timers in Xen, except the singleshot timer, which is kept as a service provided to the guest. For the MMU, we use the paravirtualized method.
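To make "paravirtualized MMU" concrete, here is a minimal sketch of a PV page-table update, modeled on Xen's real mmu_update hypercall. The pv_set_pte helper and the standalone declarations are illustrative, not the actual Linux code:

    #include <stdint.h>

    /* Xen's mmu_update request: a PV guest may not write live PTEs
     * directly; it asks Xen to validate and apply the change. */
    struct mmu_update {
        uint64_t ptr;   /* machine address of the PTE (low bits: command) */
        uint64_t val;   /* the new PTE value */
    };

    /* Provided by the guest's hypercall layer (e.g. Linux's
     * arch/x86/include/asm/xen/hypercall.h). */
    int HYPERVISOR_mmu_update(struct mmu_update *req, unsigned int count,
                              unsigned int *success_count, uint16_t domid);

    #define DOMID_SELF 0x7FF0

    /* Illustrative helper (the name is mine): update one PTE via Xen.
     * Real kernels batch many of these into one multicall to amortize
     * the ring switch -- the batching you will see in the profiles
     * later in this talk. */
    static int pv_set_pte(uint64_t pte_machine_addr, uint64_t new_pte)
    {
        struct mmu_update u = { .ptr = pte_machine_addr, .val = new_pte };
        return HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
    }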
Actually, that choice is forced by Dom0: the current Dom0 is still PV, not HVM. The PV MMU has good runtime performance, but it has its own problems. In particular, when we launch an application we can see a slow launch time, simply because the application allocates a lot of memory and causes a lot of page faults. Once the application is launched, the performance is very good; the impact is on the launch time.

For the IRQs, Xen owns the interrupts and dispatches them to Android the traditional way, via event channels. The main overhead is the ring switch: Xen takes the interrupt and injects it into the Dom0 Android. In most cases, 99% of the cases, that is perfectly fine, but in extreme cases the IRQ overhead is still visible. For example, when we ran the Wi-Fi throughput test, we found that the Wi-Fi could generate more than 10,000 interrupts per second, and the throughput came out a little bit worse than native.

Next, the FPU. The FPU is paravirtualized. We could also have passed the FPU through to Dom0, but we found that even with the paravirtualized method the performance is very good, so we kept it. The main reason is that we pin the virtual CPUs to the physical CPUs and disable the Xen scheduler, so FPU save and restore only happens during task switching inside Dom0; Xen itself never triggers an FPU save/restore.

The next three items are related to power. The first is CPU idle: we pass CPU idle through to Android. That means when Android is idle, its idle driver directly calls MWAIT and puts the CPU into an idle state. The next one is CPU frequency, which we also pass through to Android. The main reason is that Android has its own cpufreq governor, the interactive governor developed by Google. For example, when the user launches an application, the activity manager requests a high frequency; this boosts the CPU so the application launches with a quick response, and afterwards the frequency is lowered again.

The next one is standby, which we also pass through to Android. When the user presses the power button, Android puts the system into standby. Android has its own power management, based on wakelocks. In our design, because we have just one guest running on top of Xen, that is totally fine; we pass the standby and power management through to Dom0. But standby, S3, is a little bit tricky, and I will talk about it on the next slides.

In the original design, when Dom0 runs on top of Xen and the Dom0 idle thread tries to enter a low-power state, what actually happens is that the vCPU is scheduled out by Xen; the physical CPU is not really put into a low-power state. In our design, Dom0 runs the full standby logic, including suspending the devices and putting them into D3 or D3hot, and Xen provides a hypercall that lets the idle thread issue the real MONITOR/MWAIT. That puts the CPUs into C6, or into an even lower power state than C6: on Atom, for example, there is the so-called S0i3, which does not just put the CPU into C6 but turns the CPU off entirely and shuts down most of the devices.
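As a rough sketch of the mechanism involved, this is what a MONITOR/MWAIT idle entry looks like. The exact C-state hint encoding is platform-specific and an assumption here, and whether Dom0 executes this directly or through the hypercall just mentioned is exactly the design point of this slide:

    #include <stdint.h>

    /* Sketch of a MONITOR/MWAIT idle entry. The EAX hint selects the
     * target C-state (the encoding varies per platform; treat any
     * particular value, e.g. 0x20 for C6, as an assumption). ECX bit 0
     * asks the CPU to wake on an interrupt even if interrupts are
     * masked. */
    static volatile unsigned long monitor_word;

    static inline void mwait_idle(uint32_t cstate_hint)
    {
        /* Arm the monitor on a cache line... */
        asm volatile("monitor" :: "a"(&monitor_word), "c"(0), "d"(0));
        /* ...then halt in the requested C-state until a write to that
         * line or an interrupt arrives. */
        asm volatile("mwait" :: "a"(cstate_hint), "c"(1));
    }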
Another thing: when the user puts Android into standby, the boot CPU first offlines the non-boot CPUs. At that point, each non-boot CPU's idle thread issues the vcpu_op hypercall with VCPUOP_down, trying to put the CPU into a low-power state. In the original design, that hypercall just schedules the vCPU out instead of putting the physical CPU into a real low-power state; in our design, the non-boot CPU really enters C6. Once the non-boot CPUs are in C6, CPU0, the boot CPU, issues the command to put the whole system into the low-power state, and the system enters sleep. When the user presses the power button, or some wake-up event arrives over USB, Wi-Fi, or a phone call, the system wakes up and the boot CPU wakes first. It issues the command through Xen to wake the other CPUs; they return from the earlier hypercall, and the offlined CPUs are restored to their previous state. Then all the CPUs are online. With this path, our S3 cycle is roughly twice as fast as the native one.

That is the preliminary power data we got at the beginning. You can see that more than 90% of the test cases reach 95% of the native power. But we found several gaps. The first is browsing: web browsing over Wi-Fi, or HTML5 video streaming over Wi-Fi. Another issue is the home screen scroll: when the user scrolls the home screen, like this, the power consumption is higher. That is the power side; we still had several gaps to explain.

This page is about performance. You can see that 90% of the benchmarks reach 97% of the native performance, but again we found several gaps. One example is Coding I/O, a benchmark widely used on Android devices: it measures the I/O performance and publishes the data online, and users refer to that data to decide which device is best. Another issue is CF-Bench. CF-Bench has many different KPIs; one of them measures malloc performance, and we found that our result was just 33% of native. That's bad.

So we had identified several gaps in both power and performance, and we needed some tools to help us analyze them. First, we enabled VTune. VTune is Intel's in-house profiling tool; it is based on the PMU, the performance monitoring unit, and collects events such as cache misses, CPU cycles, and so on. We enabled it very quickly by simply passing the PMU through to Dom0. The VTune driver runs in Dom0, which means this tool can only profile Dom0; it cannot profile Xen itself. To profile Xen itself we used another tool, our xentrace. It is based on the original xentrace, but we modified it to record just the key events and hypercalls, such as page faults, IRQs, and hypercalls, and to tell us how many CPU cycles are spent in each of them. Other options are perf and xenoprof: perf is also PMU-based and can profile and tune Dom0, while xenoprof can be used to profile Xen. In our case we mainly used the first two, VTune and xentrace.

In the next two pages I will go through two case studies. The first is Coding I/O, which has a gap of about 21%. When we analyzed Coding I/O, we found that the storage data is cached in the page cache, which is allocated from high memory.
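To see why high memory hurts here, consider the standard Linux access pattern for a highmem page. kmap and kunmap are the real kernel APIs; the helper function around them is just for illustration:

    /* Sketch: reading from a page-cache page that lives in high memory.
     * A 32-bit kernel has no permanent mapping for highmem pages, so
     * each access needs a temporary one. Under PV, creating and tearing
     * down that mapping means PTE updates via Xen (mmu_update, often
     * batched in multicalls) plus TLB maintenance (mmuext_op). */
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void read_from_pagecache(struct page *page, void *dst, size_t len)
    {
        void *vaddr = kmap(page);   /* map the page: hypercall traffic */
        memcpy(dst, vaddr, len);
        kunmap(page);               /* unmap it again: more traffic */
    }

    /* A lowmem page needs none of this: it is permanently mapped, and
     * page_address(page) is already a valid kernel virtual address. */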
So each page-cache page has to be kmapped before access and kunmapped afterwards. That introduces a lot of PV MMU hypercalls, such as mmu_update, plus mmuext_op calls to flush the TLB. Instead of modifying Xen itself, we modified the Android side. First, we shrank Xen's memory footprint from 106 MB to 72 MB, which gives Dom0 more low memory. Then we forced the page cache to be allocated from low memory, so it can be accessed through the kernel's direct mapping, with no need for kmap to get a virtual address. In this way we reduced the gap from 21% to 8.5%.

The question is: can we keep optimizing and close that remaining 8.5%? Let's try again. First we used VTune to profile, and found that the Xen overhead is a bit more than 4% of the CPU cycles used by the application itself. Then we used xentrace to break that overhead down, and found that the PV MMU accounts for about 70% of it. Our conclusion is that it is very hard to close the remaining 8.5% gap, because it is dominated by PV MMU overhead.

The second case is the home screen scroll. When the user scrolls the home screen, the power is higher than native. So we used VTune and xentrace again. With VTune we found that the Xen overhead is about 1%, while the total power gap of the home screen scroll is 1.2%. Using xentrace to break down the overhead inside Xen, we found that the PV MMU accounts for nearly 60% of it. So it is the PV MMU overhead again, and it looks like we have no good way to close that gap. That is actually the question I would like to leave with the Xen community: can we find a better way to optimize the PV MMU and get better performance?

We have other gaps too, and some of them show a similar Xen overhead caused by the PV MMU or by TLS and stack switching. For example, in the guest, when a task switch happens, the scheduler switches the TLS, the thread-local storage, and the kernel stack, and in a PV guest both are done by hypercalls (see the sketch below). Some of these cases can be optimized by reducing the number of hypercalls, optimizing the guest instead of Xen itself. For example, our fix of forcing the page cache to be allocated from low memory avoids the unnecessary kmap and kunmap, and with them the PV MMU hypercalls they introduce.

But some cases are very hard to optimize because of the PV overhead, for example the CF-Bench malloc case. On Android, malloc is implemented in Bionic, and in Bionic's algorithm each memory block has a header and a tail that record the block size and whether the block is in use. When the user allocates memory with malloc, updating the header and the tail triggers two page faults, and when the user frees the block, that triggers another page fault: three page faults in total. We optimized the malloc algorithm to update only the header and eliminate the redundant tail update, which removes one page fault. That took us from 33% of native performance to 66%. A gap of roughly 34% remains, and we have no further fix for it. We think these issues can only be fixed by an HVM Domain Zero; there was a presentation yesterday about HVM Domain Zero.
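Here is the TLS/stack-switch sketch referred to above. On a context switch, a PV guest cannot reload GDT descriptors or its kernel stack pointer directly. stack_switch and update_descriptor are real PV hypercalls; the task structure, the GDT helper, and the selector value are simplified assumptions:

    #include <stdint.h>

    /* Real PV hypercalls (signatures as in the Linux Xen headers). */
    int HYPERVISOR_stack_switch(unsigned long ss, unsigned long esp);
    int HYPERVISOR_update_descriptor(uint64_t desc_maddr, uint64_t desc);

    /* Simplified stand-ins for illustration only. */
    struct toy_task {
        unsigned long kernel_sp;   /* top of the kernel stack */
        uint64_t tls_desc[3];      /* the three x86 TLS GDT entries */
    };
    uint64_t gdt_tls_entry_maddr(int i);  /* machine addr of a GDT slot */

    #define KERNEL_SS 0x18  /* assumed kernel stack segment selector */

    static void pv_context_switch(struct toy_task *next)
    {
        /* Tell Xen which kernel stack to use for ring transitions. */
        HYPERVISOR_stack_switch(KERNEL_SS, next->kernel_sp);

        /* Install the next task's TLS descriptors in the GDT; Linux
         * batches these updates into one multicall. */
        for (int i = 0; i < 3; i++)
            HYPERVISOR_update_descriptor(gdt_tls_entry_maddr(i),
                                         next->tls_desc[i]);
    }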
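And a toy model of the boundary-tag behavior described for the malloc case. This is not Bionic's actual allocator; raw_alloc is a stand-in for obtaining fresh, not-yet-faulted-in pages:

    #include <stddef.h>

    /* Toy boundary-tag allocator. Each block carries a header and a
     * tail recording its size and in-use state. On freshly allocated
     * (still unmapped) pages, writing the header faults in the first
     * page and writing the tail faults in the last page: two page
     * faults per malloc, plus one more on free. Updating only the
     * header, as in the optimization above, removes the tail fault. */
    struct block_tag { size_t size; int in_use; };

    void *raw_alloc(size_t bytes);   /* stand-in: fresh anonymous pages */

    void *toy_malloc(size_t size)
    {
        size_t total = size + 2 * sizeof(struct block_tag);
        char *raw = raw_alloc(total);
        struct block_tag *head = (struct block_tag *)raw;
        struct block_tag *tail = (struct block_tag *)(raw + total) - 1;

        head->size = size; head->in_use = 1;  /* page fault #1 */
        tail->size = size; tail->in_use = 1;  /* page fault #2 */
        return raw + sizeof(struct block_tag);
    }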
Actually, it was in this innovation project that we ran into these PV overhead problems, and we started to think about an HVM Domain Zero. Unfortunately, the Atom version in this project cannot support EPT. That means even with an HVM Domain Zero we would still need shadow paging, so we would still have a performance issue. The next-generation Atom will support EPT, and at that point we can avoid these issues entirely with an HVM Domain Zero.

That is the summary. First, Dom0 Android achieves near-native power and performance. But we still found some power and performance gaps caused by PV ops, for example the PV MMU and the TLS/stack switch. Those gaps could be fixed by an HVM Domain Zero on our next generation of Atom platforms. Okay, that's it. Any questions?

Q: One of the performance issues was TLS switching. Which kernel version were you testing?
A: 3.4.
Q: The guest kernel, right?
A: Yeah, 3.4.
Q: I can't remember when the optimization went in, but it has been improved, and we do a lot less TLS switching now.
A: Yeah, which kernel version are you referring to?
Q: And there have been other performance improvements that have gone in since 3.4 to reduce some of the PV MMU overhead, particularly around the two big gaps.
A: Okay, that's great. Thank you.
Q: A couple of times you had lists of things that were taking all the time, and you had multicalls which you attributed to PV MMU work.
A: Yes.
Q: But a multicall can contain any hypercalls. Did you actually dig down to see what the constituents of the multicalls actually were?
A: You are right. Let's go back to that slide...
Q: There was one where it was like 30-odd percent.
A: Yeah, the multicalls here. Most of the multicalls are used for the PV MMU updates. There are also some multicalls for the TLS, but in this case we just ignored those, because the TLS switching only happens in the context switch in the guest kernel, so we can ignore it.
Q: Okay, thank you very much.
A: Okay, thank you.