Hi! My name is Denis Lunev. Today I am going to talk about taking snapshots and reverting to snapshots in QEMU.

Let us start with taking a snapshot. This operation currently works in the following way: the virtual CPUs are stopped, and the migration process is started; the migration stream is saved to one of the virtual disks of the virtual machine. The problem here is that the saving process can take far too much time, and all that time the virtual machine cannot respond to any external requests. For example, for a 1 TB VM on a very fast NVMe drive the process can take 20 minutes, or even several hours if you save to an HDD rather than to NVMe.

One could say that live migration works very well and gives a downtime of just a few seconds, so why not do the same for snapshots? The problem is that if we start a migration like that, the downtime will indeed be small, but the saved state will correspond not to the moment when the snapshot was requested, but to some point several minutes or hours in the future, when the migration completes. The user of a snapshot expects to revert to a consistent state captured exactly at the moment the snapshot was taken.

OK, what can we do? We should not merely track dirty memory, as conventional migration does, but actively protect the memory from modification. In that case, when a virtual CPU tries to modify memory, QEMU receives a fault, saves the affected page to the stream and write-unprotects it before the page is actually modified. We also need some background thread to save the memory that the guest does not touch at all. This is actually what commercial hypervisors do here; the approach is conventional and very well known.

Why had this not been done in QEMU before? The problem is that we need a way to be notified that the guest is really about to modify a page. This can be done in two ways: with a hypervisor-specific interface, or with the generic userfaultfd mechanism. In the mailing list the community came to the conclusion that userfaultfd with write protection is the right approach. It landed last year, in Linux 5.7.

There is one more problem. With the internal snapshot infrastructure there are several states in the image, and only one disk state can be writable at a time. Why is that a problem? The migration stream is written into a state which must be read-only once the snapshot is made, so we would need to be able to have two writable states in the QCOW2 driver. We tried to resolve this puzzle and were unsuccessful, so we had to change the architecture and stop using the QEMU internal snapshot infrastructure.

We came to the decision to use the migration infrastructure, where the migration stream is saved to disk by an external utility. Fortunately, this functionality is available through the libvirt iohelper, and that is useful for us. We have implemented a new migration capability, background snapshot, which sets all of this up; once this capability is specified for the migration, we can simply save the virtual machine as usual (virsh save) and get its state saved externally. This is also useful in terms of virtual machine testing. So far so good.

OK, let me recall one thing: dirty page tracking. As I have said, we are using userfaultfd with the write-protect feature. Once the guest writes to a page, that page is saved to the migration stream, and once the page is saved, it can be write-unprotected. A minimal sketch of this cycle follows.
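To make this concrete, here is a minimal standalone sketch of the write-protect cycle, assuming a Linux 5.7+ kernel with userfaultfd write protection. It is not QEMU's actual code: error handling is trimmed, a single anonymous page stands in for guest RAM, and the "migration stream" is just an in-memory copy.

```c
/* Minimal userfaultfd write-protect sketch (Linux >= 5.7).
 * Build with: cc uffd_wp.c -o uffd_wp -lpthread */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long page_size;

static void wp_range(int uffd, void *addr, size_t len, int protect)
{
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

static void *fault_handler(void *arg)
{
    int uffd = *(int *)arg;
    struct uffd_msg msg;

    /* One fault is enough for the sketch; QEMU loops here. */
    read(uffd, &msg, sizeof(msg));
    if (msg.event == UFFD_EVENT_PAGEFAULT &&
        (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        char *page = (char *)(uintptr_t)(msg.arg.pagefault.address &
                                         ~(uint64_t)(page_size - 1));
        char *saved = malloc(page_size);

        /* 1. Save the still-unmodified page "to the migration stream". */
        memcpy(saved, page, page_size);
        /* 2. Unprotect the page: the stalled "vCPU" resumes its write. */
        wp_range(uffd, page, page_size, 0);
        free(saved);
    }
    return NULL;
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    ioctl(uffd, UFFDIO_API, &api);

    /* "Guest RAM": touch it first so the page is present, then register
     * it for write-protect faults and arm the protection. */
    char *ram = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(ram, 0, page_size);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)ram, .len = page_size },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    wp_range(uffd, ram, page_size, 1);

    pthread_t t;
    pthread_create(&t, NULL, fault_handler, &uffd);

    ram[42] = 1;   /* the "guest" write: stalls until the handler is done */
    pthread_join(t, NULL);
    puts("page saved and unprotected before the write landed");
    return 0;
}
```

The important property is the ordering: the copy happens before the guest's write is allowed to land, which is exactly what keeps the snapshot consistent.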
Now let us switch to the asynchronous revert to the snapshot. Right now we have to load the whole migration stream, and only once the stream is loaded into RAM in its entirety can we start the guest. So the size of the stream increases the time until the first request to the guest can be served, and we need to improve something here. Fortunately, we do not have to change any code in QEMU for this. There is a well-known feature called post-copy migration: we start the guest with the CPU state and device state loaded, and any missing guest memory page is served through the page fault taken on a non-present page.

But in this case there is one problem. We need to be able to find a guest page by its guest physical address inside the migration stream, and that is not possible at the moment: the stream does not have any index. If we start to invent such an index, we will introduce an incompatibility into the QEMU migration protocol. We would have to build the index and save it into the migration stream; the index would not be small, and it would have to keep several instances of each single page, because during the migration pages are sent several times. That is really awful.

But we can do a little trick. We can convert the migration stream at the stage of saving it to the HDD, inside the I/O helper. In that case we need to replace the standard libvirt iohelper with some new tool; let us call it the QEMU snapshot tool. This snapshot tool converts the migration stream while saving, and it also works as a server for the post-copy migration on revert. Thus it initiates the migration in pre-copy mode, transfers the device state and maybe some memory, and after that it switches the migration into post-copy mode and actually starts the guest. We do not need a separate control channel for that, which is really good: the complexity of the migration stream storage is kept in one place.

Great. But not really great: there is still a problem. We have to invent a format for the migration stream storage. The memory in the migration stream is sparse, for two reasons. First of all, the physical memory inside the guest is sparse: some address ranges are simply not present. Also, the balloon driver inside the guest can kick some pages out of the guest, and those pages should not be stored. And besides the memory, we have to store the rest of the migration stream as well.

The shiny side of this is that we do not have to invent anything really new. There is a very good, well-known format which fits the purpose: we can reuse QCOW2 in a slightly different way. We store the RAM in the data area of a QCOW2 image as a virtual disk, with each page placed at the offset equal to its guest physical address, and we save the migration stream without the memory into the same image, just beyond the biggest physical address available.

All the code needed for this is already available inside QEMU, so it can be reused. The only thing we need here is a proper start of a post-copy migration. Fortunately, a post-copy migration can be started, with the server specified, by the conventional command sent through the QMP channel. There is nothing very special or new here; everything works out of the box. We just need to provide the tool that serves as the server. That's it.
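To illustrate why this layout needs no index, here is a toy sketch where a plain sparse file stands in for the QCOW2 data area (QCOW2 adds cluster allocation on top, so holes cost nothing there either). The file name, the page granularity and the 1 GiB guest size are made up for the example.

```c
/* Toy index-free snapshot layout: page offset == guest physical address. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define MAX_GPA   (1ULL << 30)   /* top of guest memory: 1 GiB */

/* Save side: each RAM page from the migration stream lands at the offset
 * equal to its guest physical address; a later copy of the same page
 * simply overwrites the earlier one, so no per-page index is needed. */
static void store_ram_page(int fd, uint64_t gpa, const void *data)
{
    pwrite(fd, data, PAGE_SIZE, (off_t)gpa);
}

/* The non-RAM part of the stream goes just beyond the biggest GPA. */
static void store_device_state(int fd, const void *buf, size_t len)
{
    pwrite(fd, buf, len, (off_t)MAX_GPA);
}

/* Revert side: the snapshot tool, acting as the post-copy server, can
 * fetch any requested page with a single positioned read. */
static void serve_postcopy_page(int fd, uint64_t gpa, void *out)
{
    pread(fd, out, PAGE_SIZE, (off_t)gpa);
}

int main(void)
{
    int fd = open("snapshot.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    char page[PAGE_SIZE], got[PAGE_SIZE];

    memset(page, 0xAB, sizeof(page));
    store_ram_page(fd, 0x200000, page);        /* a page at GPA 2 MiB */
    store_device_state(fd, "devstate", 8);

    serve_postcopy_page(fd, 0x200000, got);
    printf("byte at GPA 0x200000: 0x%02X\n", (unsigned char)got[0]);
    close(fd);
    return 0;
}
```

Sparse regions, whether absent address ranges or ballooned-out pages, simply stay as holes and occupy no space on disk.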
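And to show how little control plumbing is involved, here is a bare-bones QMP client for the save side. The socket path and the exec: destination are my assumptions, and replies are printed rather than parsed; the background-snapshot capability is the one merged in QEMU 6.0. On revert, a QEMU started with '-incoming defer' would be pointed at the snapshot tool with 'migrate-incoming' in the same conventional way.

```c
/* Bare-bones QMP client: enable background-snapshot and start the save. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static void qmp(int s, const char *cmd)
{
    char buf[4096];
    ssize_t n;

    if (cmd) {
        write(s, cmd, strlen(cmd));
    }
    n = read(s, buf, sizeof(buf) - 1);   /* naive: one JSON line per read */
    if (n > 0) {
        buf[n] = 0;
        printf("<- %s", buf);
    }
}

int main(void)
{
    struct sockaddr_un sa = { .sun_family = AF_UNIX,
                              .sun_path = "/tmp/qmp.sock" };  /* assumed */
    int s = socket(AF_UNIX, SOCK_STREAM, 0);

    connect(s, (struct sockaddr *)&sa, sizeof(sa));
    qmp(s, NULL);                        /* consume the QMP greeting */
    qmp(s, "{\"execute\": \"qmp_capabilities\"}\n");
    qmp(s, "{\"execute\": \"migrate-set-capabilities\", \"arguments\":"
           " {\"capabilities\": [{\"capability\": \"background-snapshot\","
           " \"state\": true}]}}\n");
    /* with the capability set, a plain migrate produces a live snapshot */
    qmp(s, "{\"execute\": \"migrate\", \"arguments\":"
           " {\"uri\": \"exec:cat > /tmp/vmstate.bin\"}}\n");
    close(s);
    return 0;
}
```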
Let us talk about the performance. We should measure two main things: the whole time of the snapshot operation, and the time when the guest is not accessible. For this purpose I have used Linux VMs of different sizes, and for the downtime I have measured not just a simple ping, but the time to fetch a small file from an HTTP server running inside the virtual machine.

As one can see, for the synchronous snapshot the downtime is equal to the time of the whole operation. For the background snapshot the whole operation takes a little longer: we have not yet done all the work in the snapshot tool to improve the performance of saving. But the downtime is really small, good and shiny: it is several milliseconds, not tens of seconds. That's good. For the revert to the snapshot the situation is similar. There we have spent more effort on I/O optimization but less on downtime optimization, and unfortunately the downtime is around half a second, which is really a big number. We will have to improve it in the future.

OK, what is the current state of the project? Userfaultfd write protection, as I have mentioned, is merged into Linux kernel 5.7. The background snapshot feature is merged into QEMU 6.0, and the QEMU snapshot tool RFC was sent in May of this year. We have to improve this stuff further.

What problems do we have? There are several. First of all, the guest does not work really well just after the revert to the snapshot: its memory is not populated yet, and we have to solve this puzzle. We can already track what memory is accessed after the snapshot is taken; we are good at that, because once we have made the snapshot, we monitor which memory pages are changed in the guest during the background snapshot operation. But that is not enough: we should also start tracking accessed pages, not only written ones. This could be done with userfaultfd, but it would be slow. I think we could have a bitmap of accessed pages of the guest; that information is available in the EPT or RVI structures inside the processor, and we will have to fetch it to really calculate the working set of the guest after the snapshot has been taken.

There is another problem: userfaultfd handling is single-threaded. If a virtual CPU faces a write-protected page or a non-present page, it is stuck until that page fault is satisfied inside QEMU, and at the moment only one thread can process such an operation. We can either make the userfaultfd handling multi-threaded, which is quite complex, or, as a first attempt, create several userfaultfds for separate pieces of the guest physical address space. At least we should think about that and make some improvements here; that is what we are actively working on right now.

This work should make QEMU work in the same way as commercial hypervisors do. That is all I would like to talk about today. Thank you very much for your attention. If you have some questions, I will be happy to answer them. Thank you again.