Hello everyone, my name is Liang and I'm from the infrastructure team of DD2C. Let's get started.

With the rapid development of artificial intelligence in recent years, the GPU plays a very important role and is widely used. It is usually provided in the form of virtual machine instances by cloud service providers. For some technical reasons, creating a virtual machine with a pass-through GPU card is very slow, which is a common issue faced by all cloud service providers. Today, I will share with you how we solved this issue.

The agenda of my presentation contains the following parts. The first part is the background, and the second part lists the main issues that slow down the creation of a virtual machine with a pass-through GPU card. The third part gives some details of the solutions, and then the effect of the optimization will be shown. The last part is the conclusion.

In this talk, I will call a virtual machine without a pass-through PCIe device a CPU virtual machine, and a virtual machine with one or more pass-through GPU cards a GPU virtual machine. Creating a CPU virtual machine instance is fast; it usually takes several seconds. Creating a GPU virtual machine instance with the same resource configuration, however, is slow; it may take several minutes if the virtual machine has a large RAM. In most cases, virtual machine creation time is not critical, but in some interactive scenarios, a long creation time means poor user experience and wasted computing resources.

Virtual machine creation time is defined as the time interval between the QEMU process starting to execute and the guest kernel starting to run. It is divided into two parts. The first part is virtual machine initialization time, which is the interval between the QEMU process starting to execute and the vCPU starting to run.
The second part is BIOS execution time, which is the interval between the vCPU starting to run and the first guest kernel log being printed. The chart on the right shows the creation time of different virtual machine instances before optimization. Some factors, like the virtual machine RAM size, the type of GPU card, and the number of GPU cards, affect virtual machine creation time.

You may be curious about why it takes so long to create a GPU virtual machine instance. What slows down the creation process? To find out, we can use perf to get the hotspot functions of the QEMU process. The flame graph on this page shows the hotspot is in the kernel function vfio_pin_pages_remote.

This page lists the main factors that slow down virtual machine creation. The key factor is that the function vfio_pin_pages_remote is slow. Besides that, you will find some repeated VFIO DMA map and unmap operations for the same IOVA area, which makes things worse. PCIe device reset, memory-slot metadata initialization in KVM, and some miscellaneous configurations also slow down virtual machine instance creation. I will describe these factors in detail.

For vfio_pin_pages_remote, there are two main issues that make it slow. The first one is the zero-out operation, which is required when allocating pages for user space. Zeroing out page content makes sure sensitive information is invisible to user space, because an allocated page may have been used by another process or by the kernel and may retain sensitive information. As a solution, we introduce a new feature called pre-zero-out free page to speed up page allocation. The idea is simple: zero out free pages in advance, so that when a page is allocated, the zero-out can be skipped. The second issue is that VFIO DMA map pins memory in a page-by-page way, which results in too many page table entry accesses. As a solution, pinning memory in bulk is used to reduce the cost.
Pre-zero-out free page is based on free page reporting. The page zero-out operation is done in a kernel worker thread. After a page is zeroed out, a zeroed flag is set in the corresponding page structure. When an allocated page needs to be zeroed out, the flag is checked first; if it is set, the zero-out operation can be skipped. When pages are freed, the zeroed flag is cleared and the zero-out worker is woken up. I have sent the RFC patch set upstream; you can find the implementation details with the link on this page, so I will not talk more about it.

To make VFIO DMA map pin memory in bulk, we add two functions to the kernel: get_user_cont_pages and get_user_cont_pages_longterm, which correspond to get_user_pages and get_user_pages_longterm. "Cont" here means physically continuous. The new functions try to pin memory and return information about bulk, i.e. physically continuous, memory. The chart on the right shows the different behavior of the new functions and the original ones. The new functions are friendlier to vaddr_get_pfn and make it easier to pin memory in bulk. To get more benefit, huge pages should be used.

VFIO DMA map is inefficient, for two reasons. The first is that the same IOVA mapping goes through a map, unmap, and then remap procedure, which is unreasonable. To solve this issue, the mapped IOVA area information is retained in QEMU to avoid unnecessary VFIO DMA map and unmap operations. For the whole scheme to work, the VFIO_IOMMU_UNMAP_DMA ioctl is only called when unmapping a conflicting IOVA area. "Conflicting" here means a newly added IOVA area that intersects an already mapped IOVA area but is not equal to it. The second issue is inefficient IOVA area mapping updates. For example, to change an attribute of a subregion of an IOVA area which is already mapped, the whole IOVA area needs to be unmapped first and then mapped again. This happens for an IOVA area containing the 1M address.
As a solution, the IOVA area is split into two parts: one part below 1M and the other above 1M. This prevents the high part of the IOVA area from being unmapped because of modifications to the low part.

For PCIe device reset, one reset takes about one second, so it's a slow operation. There are two issues in QEMU. The first: a device is reset twice during virtual machine creation, once in the function qemu_system_reset and once in the VFIO_GROUP_GET_DEVICE_FD ioctl path. Resetting twice is redundant; one of the resets can be removed. The second issue is that PCIe device reset operations are serialized. If a virtual machine has more than one GPU card, it takes even more time to reset the devices. Doing the PCIe device resets in parallel is more scalable, and furthermore, running the PCIe device resets in parallel with the VFIO DMA map maximizes the benefit.

In the KVM kernel module, there are some time-consuming operations which can be optimized. Dirty page logging initialization is one of them. A virtual machine with a pass-through GPU does not support live migration currently, so dirty page logging for the MMIO ranges backed by PCIe BARs is useless and can be skipped. When PML is enabled, setting the EPT entry D-bit is another time-consuming operation. It uses the reverse map to find the EPT entries whose D-bit should be set, and traversing all the reverse map entries is time-consuming if the virtual machine has a large RAM. During virtual machine creation, most of the EPT entries are empty, so processing the whole reverse map to find a few effective entries is not worthwhile. This can be improved by making the reverse map traversal more efficient; for example, we could introduce a sparse bitmap mechanism, like the HBitmap used in QEMU, to skip the empty reverse map entries. In our environment, we use a rather simple way to solve this issue: we count the effective reverse map entries, and the reverse map for a memory slot can be skipped if its effective entry count is zero.
This is not an ideal solution, but it is a balance between the benefits and the development complexity.

There are some configurations which affect virtual machine creation time; they are listed above. The SeaBIOS boot menu has a 2.5-second timeout by default; if the boot menu is unnecessary, please disable it. If a Linux guest is used, GRUB may be configured with a timeout for selecting a kernel; if it is unnecessary, please change it to zero. The NUMA memory policy affects the speed of page allocation when some of the nodes are under memory pressure, so be careful with it.

This page shows the QEMU process flame graph after optimization. You can see that the hotspot we saw before has disappeared. And this page shows the accumulated time taken by some key QEMU functions; as you can see, the time is greatly reduced after optimization. This page shows the GPU virtual machine instance creation time when the different optimization methods are used. Compared with a CPU virtual machine instance, it still takes a little longer. And this page shows the creation time of a GPU virtual machine instance with one GPU card and 48 GB RAM. Before optimization, it takes about 38 seconds; after optimization, it takes 3.3 seconds. The creation time is reduced by more than 90%. For a GPU virtual machine instance with 4 GPU cards and 192 GB RAM, it is reduced from about 2 minutes to 4.4 seconds, a reduction of more than 95%.

Now the concluding part. I have to point out that our optimizations are not limited to GPU virtual machine instances. They apply to virtual machine instances with other PCIe pass-through devices, and some of them apply to CPU virtual machine instances. The pre-zero-out free page feature has some limitations. Its current implementation is not friendly to hugetlbfs; extra work is needed for that case. And there is a corner case: if the free pages are not zeroed out in time, the boosting effect may be lost. So it's far from perfect. Our solutions also have some pros.
The first point is that they are transparent to the guest: nothing needs to be changed in the guest, so they are more appropriate for a public cloud environment. The second point: DMA operations in the BIOS stage are handled correctly, while other solutions based on parallelization need some workaround to forbid DMA operations during the BIOS execution stage.

About GPU virtual machine creation time, there is some room for further improvement. We find that Linux memory management seems inefficient in the device pass-through scenario. For example, all the features for page migration and memory overcommit are useless in this case, so it's possible to make things simpler and more efficient. That is in our future plan. One more thing: we will contribute our work upstream.

That's all for my presentation, and this is my email. If you have a question which I can't answer online, you can send an email to me. Thanks. Bye-bye.