Good afternoon, my name is Wei Wang and I'm from Intel. In this presentation, I'm going to introduce using hardware accelerators to accelerate live migration of virtual machines. My colleagues listed here are also from Intel, and they are contributors to this project.

Let's have a look at the agenda. In total there are five parts. In the first part, I'm going to introduce the goals of this project. In the second part, I will give an introduction to the high-level architecture of the solution. In the third part, I will introduce some important features used in this solution. In the fourth part, I will show some test results. And in the last part, I will introduce some future work that we plan to do.

Okay, let's start with the project goals. Today there are several pain points in live migration. The first one is that virtual machines with memory-intensive, write-intensive workloads are difficult to migrate. This is because the guest keeps writing to memory, so it dirties more pages than can be transferred during migration. The second one is that VMs with a large memory size usually take a long time to migrate. The third one is that VM migration may consume a lot of network bandwidth.

There are some existing solutions in the current QEMU. For example, people may choose to use CPUs to do compression, such as multi-threaded compression with zlib. The problem is that it is slow, based on our experiments, and it consumes too many CPUs on the host, which is not what cloud vendors expect. Our solution is to offload the compression work to Intel QuickAssist Technology (QAT) in an efficient way. By efficient, I mean higher migration throughput, measured by how many pages we can transfer from the source to the destination during migration (the higher the better), and lower CPU utilization, because we don't want to waste more CPUs on the host.

The second goal is to have a common design that is ready for more accelerators to join in the future. The upcoming Intel Sapphire Rapids CPUs will have the Data Streaming Accelerator (DSA for short) and the In-Memory Analytics Accelerator (IAX for short) integrated into the CPU, and we want to take advantage of them to process guest memory during migration as well. We also want a smart selection technique which dynamically selects an appropriate accelerator to accelerate live migration. I will say a bit more about this in later slides.

Now let's have a look at the architecture of the solution. On the source machine, at the bottom, you can see that we can have multiple accelerators, and each of them has its own software stack, such as the QAT library and the driver. On top there are two threads. The migration thread is the one that already exists in QEMU, and I break its work into four steps. The first step is the migration setup: it does some preparation for migration, including accelerator device initialization and creation of the device polling thread. In the second step, the migration thread does the page searching: it searches for dirty pages to process and send to the destination. In the third step, it does the smart selection: the migration thread selects an appropriate accelerator based on the history of acceleration efficiency.
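To make these steps concrete, here is a minimal sketch of the source-side migration-thread loop as just described. All of the type and function names (accel_devices_init, find_dirty_runs, select_accelerator, and so on) are hypothetical placeholders for illustration, not the actual QEMU or QAT APIs.

```c
#include <stdint.h>

/* Hypothetical sketch of the source-side migration-thread flow; none of
 * these names are the real QEMU or QAT APIs. */
#define MULTI_PAGE_MAX 63                     /* pages handled per request */

typedef enum { ACCEL_NONE, ACCEL_QAT, ACCEL_DSA, ACCEL_IAX } accel_t;
typedef struct { uint64_t guest_addr; uint32_t nr_pages; } DirtyRun;
typedef struct AccelRequest AccelRequest;     /* opaque request data structure */

/* Assumed helpers, declared only to keep the sketch self-contained. */
void accel_devices_init(void);
void create_device_polling_thread(void);
int  find_dirty_runs(DirtyRun *runs, int max);
accel_t select_accelerator(void);
AccelRequest *compose_request(const DirtyRun *runs, int nr);
void accel_submit(accel_t dev, AccelRequest *req);

static void migration_thread_source(void)
{
    /* Step 1: migration setup - initialize the accelerator devices and
     * create the device polling thread. */
    accel_devices_init();
    create_device_polling_thread();

    DirtyRun runs[MULTI_PAGE_MAX];
    int nr;

    /* Step 2: page searching - find groups of consecutive dirty pages. */
    while ((nr = find_dirty_runs(runs, MULTI_PAGE_MAX)) > 0) {
        /* Step 3: smart selection - pick an accelerator based on how
         * efficient each one has been so far. */
        accel_t dev = select_accelerator();

        /* Step 4: dispatch - compose a request and submit it; the polling
         * thread collects the response and sends the compressed data. */
        AccelRequest *req = compose_request(runs, nr);
        accel_submit(dev, req);
    }
}
```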
Once it decides that, for example, using QAT is more efficient, it will choose QAT to process the memory for the following pages. The fourth step is to dispatch requests: once the accelerator is selected, the migration thread composes a request data structure and submits it to the device for processing. For example, with QAT there may be multiple compression engines, so it dispatches requests to those compression engines in a round-robin fashion.

The device polling thread's main job is to poll for responses from the device. A response here means that a previously submitted request has been processed. Once a response is obtained by the polling thread, it releases the request data structure, and it blocks when there is no response ready. The second thing the polling thread does is send the compressed data, along with the related header, to the destination over the network. So in this model we basically split the migration flow: in the current migration flow, the migration thread is responsible both for searching for dirty pages and for sending them to the network; in this model, the data transfer is handed to the polling thread.

Here is the picture on the destination side. It's similar to the one we saw on the source side: at the bottom it also has the accelerators, and on top it has the same two threads. The migration setup is similar to the source side, so it does some initialization work. The difference is that it has a page receiving step: it receives the data from the network, and the migration thread parses the migration protocol. For example, we added a multi-page protocol, so there is a multi-page header. It also selects an accelerator to process the memory: the header tells the migration thread on the destination side which accelerator to use. For example, if the source side used QAT to do compression, then the destination side selects QAT to do decompression. The device polling thread on the destination side has a simpler job: it just polls for responses, and once a response is obtained, it releases the request data structure. It also blocks when there is no response ready. The decompressed data are DMA-written by the device to guest memory directly.

Okay, let's go to the third part, where I will introduce some features used in this solution. The first feature we use is zero copy: it allows the accelerator device to directly access the guest memory via DMA. The second one is multi-page processing. The current migration flow only supports single-page processing, meaning that it finds one dirty page, compresses it, and sends it to the network. With this feature, the migration flow is able to process multiple pages each time. The third feature is acceleration request caching: it caches the acceleration request data structures for efficient memory allocation. I will now go into more detail on these three features.

For zero copy, at the migration setup step the guest memory needs to be pre-allocated and pinned. This is to prevent the memory from being swapped out during migration. The memory is unpinned when the migration is done.
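To make the pinning step a bit more concrete: with the current UIO-based flow, pinning roughly means locking the guest pages in RAM and, as I will mention in the future-work part, translating virtual addresses to physical addresses through /proc/self/pagemap, which needs root privilege. The snippet below is only a simplified sketch of that idea under those assumptions; it is not the actual QAT driver or QEMU code.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Simplified sketch (not the real QAT/QEMU code): keep a guest RAM region
 * resident and translate a host virtual address to a physical address via
 * /proc/self/pagemap, as the current UIO-based flow requires. */
static int pin_region(void *addr, size_t len)
{
    /* Pin the pages so the device can DMA to/from them without the kernel
     * swapping them out during migration. */
    return mlock(addr, len);
}

static uint64_t virt_to_phys(const void *vaddr)
{
    /* Needs root: each 8-byte pagemap entry stores the PFN in bits 0-54. */
    uint64_t entry = 0;
    long page_size = sysconf(_SC_PAGESIZE);
    FILE *f = fopen("/proc/self/pagemap", "rb");

    if (!f) {
        return 0;
    }
    fseek(f, (long)((uintptr_t)vaddr / page_size * sizeof(entry)), SEEK_SET);
    if (fread(&entry, sizeof(entry), 1, f) != 1) {
        entry = 0;
    }
    fclose(f);

    return (entry & ((1ULL << 55) - 1)) * page_size
           + (uintptr_t)vaddr % page_size;
}
```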
This pinning will not be needed in the future when we have a VFIO-based driver; I will come back to that in the future work. For the request composing on the source side, we set up the DMA buffers for the device to process the memory. The DMA read buffer points to guest memory, so the QAT device can directly DMA-read from guest memory and compress the guest data. The DMA write buffer is allocated by the library, for example the QAT library; when the compression is done, the data are DMA-written to that buffer. On the destination side, the DMA read buffer is allocated by the library (that is where the received compressed data sits), and the DMA write buffer points to guest memory. The device reads the compressed data from the library buffer, does the decompression, and once decompression is done it DMA-writes the result to guest memory directly. This is how we achieve zero copy.

For multi-page processing, here is an example with 12 pages. The migration thread finds multiple groups of dirty pages at one time. For example, it finds that pages 0 and 1 are dirty, so it records that the dirty run starts from page 0 with size 2, meaning there are two consecutive dirty pages. The second group of dirty pages starts from page 3 and has four pages, and the third group starts from page 8 and has three dirty pages. The migration thread then does the request composing: it composes a request that can be submitted to the device, and it also sets up the DMA buffer, which is a scatter-gather buffer. Here it has nine entries, and the DMA buffer points to these nine guest pages, chained together. The orange box in the figure is the buffer allocated by the QAT library. The QAT device fetches the data from guest memory and compresses it; once the compression is done, the compressed result of those nine pages is written into the orange buffer allocated by the QAT library. Then the device polling thread transfers the compressed data as a whole to the destination side. The data is associated with a multi-page header, which carries the addresses of the guest pages. On the destination side, when the data is received, it knows there is a multi-page header and that the payload is the compressed data. The migration thread on the destination side composes the request, and the destination addresses for the DMA buffer are calculated from the multi-page header, which tells it where the guest pages that should hold the decompressed data are. The device reads the compressed data, does the decompression, and writes the result to the destination guest memory.

Acceleration request caching is a common technique. During the device setup stage, the migration thread pre-allocates some number of acceleration request data structures and fills them into a cache pool. During the request composing stage, when the migration thread allocates a request, instead of doing a malloc it directly takes a request from the cache pool; only once the cached requests are used up does it fall back to malloc. It then initializes the request based on the new pages to send. On the response polling thread side, when a response is obtained, the thread frees the request back to the cache pool instead of calling free().
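To tie the multi-page and request-caching features together, here is a hedged sketch of how such a request might look and how the cache pool could work: the request's scatter-gather entries point straight at the dirty guest pages (zero copy), the header fields record the guest addresses for the destination, and requests are recycled through a simple free list rather than malloc/free. The structures and helpers are illustrative assumptions, not the real QAT data structures, and locking between the migration thread and the polling thread is omitted for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative definitions only; not the actual QAT/QEMU structures. */
#define MULTI_PAGE_MAX 63               /* the multi-page limit used in the tests */

typedef struct {
    void     *host_vaddr;               /* start of the dirty run in guest RAM    */
    uint64_t  guest_addr;               /* guest address carried in the header    */
    uint32_t  nr_pages;                 /* consecutive dirty pages in this run    */
} DirtyRun;

typedef struct AccelRequest {
    DirtyRun  runs[MULTI_PAGE_MAX];     /* scatter-gather input over guest pages  */
    int       nr_runs;
    void     *comp_buf;                 /* DMA write buffer owned by the library  */
    struct AccelRequest *next;          /* free-list link for the cache pool      */
} AccelRequest;

static AccelRequest *req_pool;          /* filled with pre-allocated requests at setup */

static AccelRequest *req_alloc(void)
{
    /* Take a cached request when one is available; fall back to malloc only
     * when the pool has been used up. */
    if (req_pool) {
        AccelRequest *r = req_pool;
        req_pool = r->next;
        return r;
    }
    return calloc(1, sizeof(AccelRequest));
}

static void req_free(AccelRequest *r)
{
    /* The polling thread returns the request here instead of calling free(). */
    r->next = req_pool;
    req_pool = r;
}

static AccelRequest *compose_request(const DirtyRun *runs, int nr_runs)
{
    AccelRequest *r = req_alloc();

    for (int i = 0; i < nr_runs && i < MULTI_PAGE_MAX; i++) {
        r->runs[i] = runs[i];           /* the device DMA-reads guest RAM directly */
    }
    r->nr_runs = nr_runs;
    return r;
}
```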
Okay, let's have a look at the test results. The tests were run on an Intel Xeon CPU E5-2699 v4 running at 2.2 GHz, which is a Broadwell CPU. For QAT, we used a PCIe Gen3 QAT card plugged into a PCIe slot; on the upcoming Sapphire Rapids CPUs, QAT will be integrated into the CPU and will use PCIe Gen4, so the speed will be much faster there. For the DRAM, we used DDR4 running at 2666 MT/s, and for the network card we used a 40 GbE NIC.

For the migration setup, the downtime limit we used is 300 milliseconds, the default downtime in QEMU. We didn't set a network bandwidth limit, so migration can use up to 40 Gbps, although in reality it doesn't consume that much bandwidth. The compression level is set to 1, which is the fastest. For multi-page, we added a new parameter called multi-page, meaning that migration can process that many pages each time; we set the value to 63, the maximum currently supported.

For the guests, we tested three types. The first type has 4 vCPUs and 32 GB of RAM, and it runs a workload that writes compression-friendly data. The second type also has 4 vCPUs and 32 GB of RAM, but it runs a workload that writes sequence numbers to memory; sequence numbers are not very compression friendly, but not difficult to compress either. The third type has 8 vCPUs and 128 GB of RAM, and it runs a memcached workload that writes random numbers; random numbers are relatively difficult to compress, as we will see from the compression ratio later.

Let's have a look at the first test. We run a dirty-memory workload inside the guest which writes data to the guest memory at a specified dirty rate; for example, we can set it to write 1,000 MB per second. A few things to understand about the data first. The throughput is the migration throughput, measured by how many pages are transferred from the source to the destination per second, so the higher the better. To keep the numbers simple, I put a multiplier here: for the no-compression case it can send around 17 x 10,000 pages per second, for the 16-thread CPU compression case it is around 29 x 10,000 pages per second, so almost two times larger, and for the QAT case it is around five times larger. You can see from the normalized throughput that the QAT case has a much higher migration throughput than the no-compression case.

The largest migratable dirty rate means that we tune the dirty rate inside the guest and find the largest rate at which the VM can still be migrated. For example, if we tune it to around 1,200 MB/s, this VM cannot be migrated without compression, so roughly 1,130 MB/s is the largest dirty rate the guest can have and still be migratable in the no-compression case. The 16-thread CPU compression case supports the number shown here, and the QAT case is much larger than the no-compression case.

The extra CPU utilization means how many CPUs are used in addition to the migration thread. For the no-compression case there is no extra thread, so it doesn't consume extra CPU.
For the 16-thread CPU compression case, it consumes about 678% CPU. It doesn't consume all 16 CPUs because this data pattern is easy to compress. For the QAT case it is less than 40%, so much less than the CPU compression case. For the compression ratio, you can see that QAT compression has a higher ratio than the CPU zlib compression. The compression algorithm we use is the same; the ratio is different because of multi-page compression. When we compress multiple pages together, and this workload writes all ones to memory, the repeated data across the pages can be reduced to, for example, a single token in the compressed payload, so the multi-page case gets a higher compression ratio. Without such repeated data, the compression ratio between QAT and CPU compression would be similar, as you can see in the next test.

Here, for the sequence-number workload, the data is not repeated; it is just numbers counting up from zero: 0, 1, 2, 3, and so on. We run this workload inside the guest, and for the migration throughput you can see that the QAT case is again much higher than the no-compression case. An interesting point is that with 16-thread CPU compression, the migration throughput is even lower than in the no-compression case. This is because the compression isn't efficient with CPU compression, so it actually slows down the migration. For the largest migratable dirty rate, the QAT case is still larger than the no-compression case. For the CPU utilization, the CPU compression case consumes all 16 CPUs, while the QAT compression case is less than 70%. The compression ratio between the CPU and QAT cases is similar here because they use the same compression algorithm.

For the third test, we set up a memcached environment. Basically, a memcached client writes random numbers to the memcached memory pages. Random numbers are much more difficult to compress than the previous data patterns, as you can see from the compression ratio here: it is only 1.6. But the QAT compression case still has an advantage over the other two cases; the migration throughput is more than two times higher. For the total migration time, infinite means the VM cannot be migrated: in the no-compression case and in the CPU compression case the VM cannot be migrated, but with QAT it takes around 60 seconds to successfully migrate the VM.

Okay, let's have a look at some future work that we plan to do. The first item is VFIO-driver-based zero copy. The current zero copy is implemented on top of the UIO-based QAT driver. This requires QEMU to have root privilege in order to get the virtual-to-physical address mapping through pagemap, which may be a problem for some cloud vendors because QEMU does not run with root privilege there, and it also requires QEMU to pin its memory. With a VFIO-based driver we will have VFIO support to handle this. The VFIO-based user-space driver for QAT is still a work in progress, and we will be able to switch to it later.

The second item we plan to do is smart acceleration support. This comes from the idea that DSA can do delta encoding of the dirty memory; the functionality is the same as XBZRLE in the current QEMU, so it finds out the exact bytes that the guest dirties within a page.
For example, if the guest only writes one byte, there is no need to send the entire 4 KB page to the destination; DSA can help find out that one byte. This works efficiently when the guest only modifies a small part of a page. But if the guest writes the entire 4 KB of a page each time, then DSA delta encoding might not be efficient; in that case we can use QAT to compress the 4 KB instead of doing the delta processing. So we want to have smart acceleration here. With this technique, live migration will be able to dynamically switch between accelerators: IAX is also an accelerator that can be used for compression, so it can select either QAT or IAX to do compression, or use the DSA device to do delta processing for live migration. This will rely on a prediction based on the compression ratio history and the delta encoding history. For example, we can choose to compress the first 10 requests and find out what the compression ratio is, and also use DSA to find out the delta encoding rate. If the delta encoding rate is higher, we choose DSA for the upcoming pages; if the compression rate is higher, we use QAT for the following processing. That's all for this presentation.
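As a rough illustration of the prediction idea described above, the sketch below samples a few requests with both QAT compression and DSA delta encoding, and then picks whichever engine produced the smaller payload for the following pages. The helper names, the sample size, and the policy itself are assumptions for illustration; the real selection logic is part of this future work.

```c
#include <stddef.h>

/* Hypothetical sketch of the smart-selection prediction; the helpers and the
 * policy below are made up for illustration, not real APIs. */
typedef enum { ACCEL_QAT, ACCEL_DSA } accel_t;

typedef struct {
    const void *data;                  /* dirty guest data carried by the request */
    size_t      len;
} SampleReq;

/* Assumed helpers: fetch the next request to sample, and report the payload
 * size each engine would actually have to send for it. */
SampleReq *next_sample_request(void);
size_t qat_compress_len(const SampleReq *req);   /* QAT (or IAX) compression */
size_t dsa_delta_len(const SampleReq *req);      /* DSA delta encoding       */

#define SAMPLE_REQUESTS 10             /* "compress the first 10 requests" */

static accel_t pick_accelerator(void)
{
    size_t qat_out = 0, dsa_out = 0;

    for (int i = 0; i < SAMPLE_REQUESTS; i++) {
        SampleReq *req = next_sample_request();
        qat_out += qat_compress_len(req);
        dsa_out += dsa_delta_len(req);
    }

    /* The engine with the smaller output wins for the following pages; a real
     * policy would keep re-sampling as the guest workload changes. */
    return (dsa_out < qat_out) ? ACCEL_DSA : ACCEL_QAT;
}
```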