Hello everyone, this is Vynan from Alibaba Cloud. 郭成 and I completed this practice, and I will present this session today. Our topic is speeding up the boot time of guests in Alibaba Cloud.

This is the agenda of today's session. First, I will introduce the background and why we need to do this. Second, I will describe our solution, which we call async DMA map. Third, I will show the guest boot process with async DMA map. Then I will list several optimization designs for this practice. At last, I will share our achievements with this solution.

Let's start with the background. So, what's the problem? As you know, we need to DMA-map all the guest memory when there is a pass-through device, since the device might DMA to any part of guest memory, and memory that is a DMA target cannot be swapped out. But we don't know in advance which memory will be a DMA target, so one simple solution is to pin and map all the guest memory. That might not be a problem when the guest memory is very small. But as you know, memory is not a scarce resource today, so a guest might have a few hundred gigabytes or more of system memory, and DMA map is a time-consuming process. Then this becomes a big problem.

Here are two charts showing the impact on guest boot time and QEMU initialization time as the VM's memory size increases. Let's start with the left one. The horizontal axis is the VM's memory size, and the vertical axis is the boot time in seconds. You can see that if we assign 8GB of memory to the VM, the whole boot time is around 20 seconds. The boot time increases as the guest memory grows. When the guest memory reaches 300GB, the boot time of the VM is over two minutes. During this time, the users don't know what is happening in the background, and they are not even sure whether the creation is still running. That is very bad for the user experience, so we need to figure out the root cause.

Then we have the right chart. It shows the QEMU initialization time versus memory size. As you can see, most of the boot time is spent in QEMU initialization, and most of the initialization time is spent doing DMA map.

Before we present the solution, we need to note some conditions of this problem. First, more memory means more time cost. Second, no DMA, no DMA map: if no device needs to do DMA, of course you don't need to do DMA map. But there is one important piece of information: DMA only touches a specific range of memory at a certain time. That gives us a chance for optimization; maybe we don't need to pin and map all the guest memory during the creation of the VM.

Based on these conditions, what options do we have? The first thing that comes to mind is a virtual IOMMU or a para-virtualized IOMMU. It would be a good solution, but the implementation is very complex and needs a lot of development effort. So we chose a simpler solution, which we call async DMA map. There are two key points. The first is to map only the necessary memory first, which is enough for the guest operating system to boot up. The second is to map the other memory asynchronously in the background. It might be slightly perceptible to the user, but overall it gives a better user experience.

This page shows an overview of the memory setup with a pass-through device. First, let's talk about the current status of KVM, using a GPU as an example. The host pins and maps all the guest memory before the guest OS boots. Then the GPU driver is loaded, and the application generates workloads for the GPU. When the workloads reach the GPU hardware, they might trigger DMA. If the DMA address hasn't been mapped, a hardware access error will occur.

Our solution is to add the virtio-balloon driver. The virtio-balloon driver and the GPU driver both allocate from system memory. If the balloon driver is loaded before the GPU driver, it can balloon some memory ranges first; then the GPU driver has no chance to allocate from these ranges, and DMA won't happen in them either. So it's not necessary to pin and map them during the creation of the guest; mapping them asynchronously is fine.

This page shows the architecture overview. The solution
touches three components of the KVM virtualization stack. The first one is QEMU, which is responsible for tracking DMA maps and balloon changes, and for tracking the balloon pages. The second one is the virtio-balloon driver, which is responsible for ballooning pages and notifying the host. The third one is the VFIO driver, which is responsible for doing the pin and DMA map.

During QEMU initialization, the backend virtio-balloon driver in QEMU initializes the balloon size of the guest. When the frontend virtio-balloon driver in the guest is loaded, it queries the balloon size configuration and tries to inflate the balloon to the target. This takes many loops, with a small change each time. On every balloon operation, it sends a notification to the host. The backend driver receives these notifications to track the balloon pages, which are used to generate the balloon page table in the host. The whole page table is generated once this process is finished. QEMU then calls DMA map for the memory ranges outside the page table first; the memory ranges inside the page table can be mapped later.

How about the communication channel of virtio-balloon? This page shows the related functions and structs. The communication channel is already there; the only thing we need to do is record the balloon page addresses. There are two virtqueues, the inflate VQ and the deflate VQ, which the frontend driver uses to send notifications to the host. In the backend driver, one handler is attached to both virtqueues: virtio_balloon_handle_output. We can get the virtqueue element in this handler, and it contains the guest PFN and page count information. So everything is ready; we just add a simple recording logic.

This page shows the balloon range tracking workflow. It is also very simple. In the inflate process, the frontend driver sends an inflate notification through the inflate VQ. The backend driver receives it and dispatches it to the handler. The backend driver tracks all the inflated pages, gets the GPA from the PFN, and adds them into the balloon page table. The deflate process is similar; the only difference is that the backend driver needs to remove the released pages from the balloon page table.

This page shows the whole picture of the guest boot process with async DMA map. There are three phases: phase one is initialization, phase two is asynchronous DMA map, and phase three is completion.

Phase one begins with QEMU initialization. QEMU initializes the virtio-balloon size, leaving only the necessary memory for the guest, and then performs the DMA map below 4GB as usual. Then let's turn to the guest side. The guest OS enables the virtio-balloon driver first. The driver queries the balloon configuration and begins to inflate the balloon toward the target: it calls fill_balloon to allocate pages and tells the host the PFNs of the balloon pages. The backend driver receives these notifications and uses the PFNs to generate the balloon page table. After the balloon process is finished, the host knows all the ballooned memory ranges. Since the ballooned memory won't be allocated by other drivers, DMA won't happen in these ranges, so QEMU only needs to map the memory ranges outside the page table. Then the pass-through device driver is loaded as usual.

In phase two, QEMU triggers deflating the balloon step by step. The frontend virtio-balloon driver receives this event and calls leak_balloon to deflate. Just as in the inflating process, QEMU receives the deflate notifications and gets the PFNs of the released pages. Then QEMU updates the balloon page table and triggers the DMA map of the released pages. After the balloon is fully deflated, everything is back to normal.

During this practice, we met several problems, so we have some optimization designs for them. The first one is auto-combination during the inflating process. The problem is that the balloon driver only allocates one small page at a time and sends a notification to the host for every one megabyte, so QEMU gets a huge number of pages. The best method is to combine the adjacent pages and create a bigger memory
range in the balloon page table. Actually, most of the memory ranges are adjacent, since the balloon driver is loaded very early and most of the memory is still free at that point. After the inflating process finishes, QEMU triggers the DMA map of all the memory ranges outside the page table. This reduces the number of DMA map calls.

The second optimization design is increasing the balloon page size. Here is the source code in the Linux kernel. You can see that balloon_page_alloc only allocates one page at a time. A 4KB page is too small for the current virtualization environment; it introduces heavy but unnecessary communication between guest and host if the guest has a few hundred gigabytes or more of system memory. Just make a small change to allocate a larger chunk of memory at a time. For example, allocating two megabytes instead of 4KB makes the communication much more efficient: in one round, virtio-balloon can inflate or deflate 512 megabytes of memory. This reduces the communication frequency significantly.

The third optimization design is premap. The async DMA map can start early, independent of the deflate notifications: QEMU triggers the async DMA map step by step. If a new notification arrives from the deflate VQ containing the released pages' information, we check whether they are already in the mapped range. If not, we map these pages and then ack to the guest. This optimization design speeds up the async DMA map process.

Last, let's see the achievements of this practice. This test result is based on the initial balloon size, which is set to 8GB. You can see the QEMU initialization time is still around 7 seconds even though the guest has more than 300GB of system memory, so the VM creation command can return very quickly. OK, then let's see the guest boot time versus memory size. The boot time of the guest no longer increases with the memory size: it stays around 20 seconds even when the system memory is above 300GB. So the result shows that this practice can speed up the boot time of guests significantly.

OK, that's it. Thank you.
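The balloon page table and the auto-combination optimization described in the talk can be sketched roughly as follows. This is a minimal illustration, not the actual QEMU code: the table layout, the names `balloon_table_add`, `table`, and the fixed-size array are all assumptions made for the example. The idea it demonstrates is the one from the talk: inflate notifications arrive as small PFN runs, and adjacent runs are merged into one larger range so far fewer ranges (and later far fewer DMA map calls) are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "balloon page table": a list of guest-physical ranges
 * built from inflate notifications. Adjacent runs of pages reported by
 * the guest are auto-combined into one larger range. */

#define MAX_RANGES 1024
#define PAGE_SIZE  4096ULL

struct balloon_range {
    uint64_t gpa;   /* guest-physical start address */
    uint64_t len;   /* length in bytes */
};

struct balloon_range table[MAX_RANGES];
size_t table_len = 0;

/* Record one inflate notification (guest PFN + page count).
 * Returns the number of ranges currently tracked. */
size_t balloon_table_add(uint64_t pfn, uint64_t npages)
{
    uint64_t gpa = pfn * PAGE_SIZE;
    uint64_t len = npages * PAGE_SIZE;

    /* The guest usually balloons contiguously (the driver is loaded
     * early, most memory is free), so try to extend the last range
     * instead of appending a new one. */
    if (table_len > 0) {
        struct balloon_range *last = &table[table_len - 1];
        if (last->gpa + last->len == gpa) {
            last->len += len;
            return table_len;
        }
    }
    assert(table_len < MAX_RANGES);
    table[table_len].gpa = gpa;
    table[table_len].len = len;
    return ++table_len;
}
```

With this merging in place, two back-to-back 1MB notifications collapse into a single 2MB range, while a notification after a gap opens a new range.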
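The premap optimization can be modeled with a simple high-water mark, under the assumption (mine, not stated in the talk) that QEMU maps memory asynchronously from low to high addresses and remembers how far it has gotten. A deflate notification for pages below the mark can be acked immediately; only pages above it need an explicit map before the ack. The names `premap_watermark` and `handle_deflate` are illustrative, not real QEMU/VFIO APIs.

```c
#include <assert.h>
#include <stdint.h>

uint64_t premap_watermark = 0; /* every GPA below this is already mapped */
uint64_t dma_map_calls = 0;    /* counts the explicit maps we had to do */

/* Handle one deflate notification for [gpa, gpa + len). Returns 1 if an
 * explicit DMA map was needed before acking, 0 if the background premap
 * already covered the range. */
int handle_deflate(uint64_t gpa, uint64_t len)
{
    if (gpa + len <= premap_watermark)
        return 0;      /* already premapped: ack the guest right away */
    dma_map_calls++;   /* would issue a VFIO_IOMMU_MAP_DMA ioctl here */
    return 1;
}
```

The further the background premap has advanced, the more deflate notifications fall into the fast path, which is why starting the async map early speeds up the whole process.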
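The arithmetic behind the page-size optimization can be checked directly. virtio-balloon carries at most 256 PFNs per notification (VIRTIO_BALLOON_ARRAY_PFNS_MAX in the Linux driver), so one notification moves 256 × 4KB = 1MB with 4KB pages, versus 256 × 2MB = 512MB with 2MB pages, matching the figures in the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum PFNs per virtio-balloon notification (matches the Linux
 * driver's VIRTIO_BALLOON_ARRAY_PFNS_MAX). */
#define PFNS_PER_NOTIFY 256ULL

/* Number of notifications needed to balloon `bytes` of memory when each
 * reported PFN covers `page_size` bytes. */
uint64_t notifications_needed(uint64_t bytes, uint64_t page_size)
{
    uint64_t per_notify = PFNS_PER_NOTIFY * page_size;
    return (bytes + per_notify - 1) / per_notify;
}
```

For a 300GB guest, that is 307,200 round trips with 4KB pages but only 600 with 2MB pages, which is why the larger page size reduces communication frequency so significantly.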