Hello, and welcome to my talk about container migration, and how the downtime during live migration can be decreased. This is basically about process migration, but I'm using containers to better demonstrate it. I want to show a few live demos, and I hope they all work. I'm using a runc-based container, I'm using CRIU to live migrate it, and in the container I have a server for Xonotic, a first-person shooter. I will migrate the server while my client stays connected, and I hope I can migrate the server almost around the world during this presentation.

So I'm going to start immediately with the first demo. The first step is only a local migration: I have two VMs on my system, and I will migrate the container from one VM to the other, hoping that the client stays connected the whole time. This is my first VM, called rhel1. I have a container there which is not running currently, so let's start it. Okay, now the container is running with a Xonotic server in it, so I'm starting the client. This is what it looks like: I'm in the game, I can run around, and I'm alone. We wanted to do this with a second client, but we didn't want to try that over Wi-Fi.

Now I want to migrate the container from one VM to the other. I have a script which basically calls the CRIU functionality to checkpoint the container, transfer it to another system, and restore it there. I say migrate, then the container ID and the destination. My container is named xonotic, I'm migrating it to my second VM, and I switch to the client. So now I cannot move... ah, it has already migrated; it's local, so it's pretty fast. I'm going to migrate it back, maybe we can see it better. So now the container is running here, on the other system, and I migrate it back: migrate xonotic to rhel1, and start the migration. You now see it says I'm disconnected; it's copying the data, and now I'm connected again. During the disconnection I cannot move around, and when we do a migration over a longer distance, we will see that it hangs for a longer time.

A few details now. I basically just did a yum install into a directory on a RHEL 7.3 system with the EPEL repository enabled, which lets me install the Xonotic server into that directory; that's the first line, the yum install. In the second line I'm configuring my container with oci-runtime-tool generate. The first argument is darkplaces-dedicated, the name of the binary I'm going to run, and the second is an argument to it, telling it to run in the /tmp directory. The next argument is network host: I'm using the host's network. Then I'm mounting two tmpfs file systems into the container, on /tmp and /run, and I'm mounting the root file system read-only. The reason for read-only is that when I start migrating the container around the world, I don't have to migrate the file system; I only migrate the memory of the container. And what CRIU fortunately does for me is migrate the temporary file systems I have mounted into the container, so /tmp and /run are also migrated for me around the world. Then I have to delete some seccomp configuration which oci-runtime-tool generates for me; I haven't figured out yet why it does not work during migration.
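To make the setup concrete, here is a minimal sketch of the steps I just described. The yum --installroot call and the config.json fields follow standard tooling and the OCI runtime spec, but the package name, the paths, and the migrate wrapper are assumptions for illustration, not the exact commands from my script:

    # Install the Xonotic server into a directory that becomes the container
    # rootfs (EPEL enabled; the exact package name is an assumption).
    yum --installroot=/containers/xonotic/rootfs install xonotic-server

    # The config.json that oci-runtime-tool generates boils down to roughly
    # this excerpt: read-only rootfs, host networking (no network namespace
    # entry), and tmpfs on /tmp and /run so CRIU carries their contents along.
    cat > /containers/xonotic/config.json <<'EOF'
    {
      "process": { "args": [ "darkplaces-dedicated", "-userdir", "/tmp" ] },
      "root":    { "path": "rootfs", "readonly": true },
      "mounts": [
        { "destination": "/tmp", "type": "tmpfs", "source": "tmpfs" },
        { "destination": "/run", "type": "tmpfs", "source": "tmpfs" }
      ]
    }
    EOF

    # Start the container; the demo's "migrate" script is then essentially a
    # wrapper around runc's CRIU integration, like this ("dest" is the target).
    runc run -d --bundle /containers/xonotic xonotic
    runc checkpoint --image-path /tmp/ckpt xonotic
    rsync -a /tmp/ckpt/ dest:/tmp/ckpt/
    ssh dest "runc restore -d --bundle /containers/xonotic \
              --image-path /tmp/ckpt xonotic"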
The third step is to start the container, and this is all running now. It's still RHEL 7.3: the runc is from an optional RHEL repository, and it's the CRIU from RHEL 7.3. So this is nice, I can migrate the container from one VM to another, but it's rather boring, and that's why I thought I'm going to migrate it to France, and I hope I still stay connected when it's in France. One thing which can be seen here: if I press the Tab key, I see a ping time of 33 milliseconds, the time it takes to get from my local system to the container on my VM.

So now let's try to migrate it to France. I'm starting the migration, and it takes much longer; I cannot move around right now. This is not what it should be, because it just takes much too long. I don't want to wait this long until my client can reconnect, and that's why I need an optimization to decrease the downtime of my container; my client has now even disconnected. So I'm going to restart the container on my original VM. Where is it? Here; I can say reconnect. Now the container is again running on my local system.

Because last time it took so long, I'm now going to use an optimization, what we call pre-copy, and start the migration again. In the background the memory of the container has been dumped, and I can still move around in the game. It's too fast; the network is too good. So what happened now? I'm in France. You can't really tell I'm far away, because the ping time didn't go up, but the container is running on my VM in France, so it definitely went there. What happened is this: first I dumped the memory once and transferred the whole memory while the container kept on running, and then I only dumped the differences that changed since that first dump. Only while I'm transferring those changes is my container actually down, so I could decrease the downtime of the container a lot. It took me about 30 seconds to migrate the container without the optimization, and I had a two-second downtime now that I did a pre-dump.

What did I do to make this possible? I'm using a layer 2 VPN tunnel to have the same IP address all over the world, so the IP address travels with the container from my laptop to France. I'm using keepalived to migrate the IP address, and I'm using pre-copy migration to first dump the whole memory of the container and transfer it while the container is still running, so that I only transfer the differences at the end and thus have a much shorter downtime during the migration. This is still RHEL 7.3 and the CRIU from RHEL; runc needs additional code which is available in a pull request, and I've included that in my demonstration to show the pre-copy optimization for container migration.

And this diagram tries to visualize what happens during pre-copy. Before, in the first try, the migration duration was also the whole process downtime: during the whole memory dump, memory copy, and memory restore, the container was down. Now the process downtime is only a much shorter part, because we can dump the memory beforehand, and what takes the most time, the transfer, happens while the container keeps on running.
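Roughly, the pre-copy path of my migrate script does the following. The runc flags here (--pre-dump, --parent-path) are the ones from the pull request I mentioned, and the keepalived snippet only sketches the floating-IP idea; the interface name and the address are made up:

    # First pass: dump memory only and leave the container running.
    runc checkpoint --pre-dump --image-path /tmp/ckpt/parent xonotic
    rsync -a /tmp/ckpt/ dest:/tmp/ckpt/   # bulk transfer, container still up

    # Final pass: dump only the pages dirtied since the pre-dump
    # (--parent-path is relative to --image-path), then stop the container.
    runc checkpoint --parent-path ../parent --image-path /tmp/ckpt/final xonotic
    rsync -a /tmp/ckpt/ dest:/tmp/ckpt/   # small delta: this is the downtime
    ssh dest "runc restore -d --bundle /containers/xonotic \
              --image-path /tmp/ckpt/final xonotic"

    # keepalived idea: the server's IP is a VRRP virtual IP on the layer 2
    # tunnel device, so it can move with the container (values made up).
    # /etc/keepalived/keepalived.conf:
    #   vrrp_instance xonotic {
    #       state BACKUP
    #       interface tap0
    #       virtual_router_id 42
    #       priority 100
    #       advert_int 1
    #       virtual_ipaddress {
    #           192.168.50.10/24
    #       }
    #   }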
So we only need to transfer what I called the process table entry; it's a few different things, but basically everything which is not the memory of the process. My migration needs to transfer around 200 megabytes, and with pre-copy the final transfer goes down to two or three megabytes; the rest I can transfer beforehand, while my client keeps on running.

The next step: I want to migrate it to Canada. The client is still running, which is good, and it says a ping time of 33. Now I use my VM in France and say migrate xonotic to rhel-canada, plus true, which enables the pre-copy functionality. Let's see... I can still move around. Okay, now I'm disconnected, now the differences are transferred, and now I'm reconnected. It now says a ping time of 133, so it takes much longer to reach the server, which is now running in Canada. And now I can migrate it back to my notebook, also using pre-dumping. We see here in the console the pre-dump: it says it dumped 182 megabytes, which is transferred using rsync. Once that is done, the second, final checkpoint is taken; the final checkpoint is only 7.6 megabytes which needs to be transferred. Then the floating IP address is handed over to my new VM, and the time during which I actually need to transfer the files is only 2.6 seconds, so this reduces the downtime enormously.

So, the last step; I already migrated it back. We have been working on further optimizations to reduce the downtime even more. We are using userfaultfd to do what we call lazy migration, or post-copy: we migrate only the main part of the process, everything besides the memory, to the second host, and restart the container or the process on the destination host. Then, once it accesses a missing memory page, we get a page fault, a userfault, which is forwarded to user space, and we can transfer the missing memory page over the network into the process, and it continues running (I'll show a sketch of the commands in a moment). I can demo this as well, but it's not very interesting, because it looks just like the rest. So the connection to the host is disconnected; in the background it's transferring the memory, which I cannot visualize, so it's happening in the background. The memory should now all be there, and the client is reconnected.

The combination of runc and userfaultfd is currently something which exists only on my system and in my backups, so it's safe. Currently there's a limitation in the kernel that we cannot migrate processes which fork; the patches to support this are in the linux-next tree, and we hope to get them into the mainline tree soon, so that we can support userfaultfd-based lazy migration also for containers with more than one process. I was lucky that this server only needs one process.

The last step, which would be the goal, is to combine pre- and post-copy: we first do a pre-dump of the memory, then a second dump without the memory, where we only transfer everything besides the memory, restore the container, and the missing pages are fetched via userfaultfd. This would be the ideal solution; it should probably work soon, and needs a few additional kernel patches.
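For reference, the lazy-migration flow with plain CRIU looks roughly like this; the runc side was, as I said, only on my machine at the time, so take this as a sketch of the mechanism rather than the exact demo commands ("src", "dest", "$PID", and the paths are placeholders):

    # Source: dump everything except the memory contents, then stay around
    # as a page server that hands out pages on demand.
    criu dump --tree "$PID" --images-dir /tmp/ckpt --lazy-pages --port 27 &

    # Transfer the small non-memory images to the destination.
    rsync -a /tmp/ckpt/ dest:/tmp/ckpt/

    # Destination: a lazy-pages daemon serves the restored process. Every
    # userfaultfd page fault is forwarded to the daemon, which fetches the
    # missing page from the source over the network.
    ssh dest "criu lazy-pages --page-server --address src --port 27 \
                   --images-dir /tmp/ckpt &
              criu restore --images-dir /tmp/ckpt --lazy-pages"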
So I'm through with my demo and my presentation. Any questions?

First of all, very nice. Thank you. I'm sure there are very many differences...

I don't know much about virtual machine migration. I know they used to do pre-copy, and they are now also doing post-copy with the same userfaultfd; I think it was basically made for them, to decrease the downtime during virtual machine migration. The technical machinery behind it is the same, so they're doing the same thing. I always think it's a bit easier for them, because they can control from within their QEMU process what to transfer and how it works, while for containers we have to work more closely with the kernel, I think. But I don't know virtual machine migration in that much detail. Yeah, we should talk. We talk, yes.

So we have the main developer of CRIU here. The question was how we make sure there are no in-flight requests during block access or something like that, so that we don't lose transactions in the kernel. Pavel, who's the maintainer upstream, said that a few things are working and a few things can be difficult. Any other questions?

So there was the question whether the dirty memory rate can be higher than... Yes, the memory can change faster than the pre-dump can transfer it. There is a tool called p.haul which has some logic to try to handle that: you can set thresholds up to which it keeps trying to transfer, and if it doesn't get below them, it just aborts. What I'm using here is just a script; it does one pre-dump and then the final dump, so there's no logic in it to handle the situation where the memory changes faster (I sketch what that loop looks like after this Q&A). I have migrated processes where the total downtime is larger because of the pre-dump, so instead of decreasing the downtime it increases it; this can happen, yes. This is all the same as with virtual machines, but I don't think we are that far yet; we don't have the frameworks around container migration to handle all this. It sounds like exactly the same problem, and it would be nice if we could collaborate there, but I'm not sure; maybe it's not too far away, I don't know.

Yes, please. So the question was why I needed to patch everything I was using. Basically yes: this is all still very much in development. The basic migration without any optimization works, and every optimization on top needs patches which are currently in review or currently being developed. This is all very new, yes.

Does it work for super-privileged containers that have access to the host themselves? It basically works for any process; containers make it easier because we have separation and isolation. For example, we have the problem that the restore will fail if the PID already exists on the destination system, because we need to restart the process with the same PID. If we're running in a PID namespace, the chance that anything else is running inside it is pretty low, so namespaces make it easier in some cases. Super-privileged containers? Probably yes; I don't know.

Okay, one last question. Oh yeah, please. Does it work with user namespaces? I don't know; I'm going to ask Pavel again: does it work with user namespaces? The answer was yes, it does. Okay, thank you very much.

I didn't understand the question; why didn't I...? No, I wanted to ask why you didn't use post-copy. Ah, why? Okay. But you can answer it; you are working on that. Yes, yes, yes, right. Yeah, you can use it. That can be critical. Yeah, fine. No problem at all. I'm not sure how many you have. I have one here. Cool. Question. Can you go in such a way? Yes. Do you request any connection? No, it's already connected.
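To make the p.haul answer above concrete: the logic it implements is roughly an iterative loop like the sketch below, and my demo script is just the degenerate case with a single pre-dump. The threshold values and the way the delta size is measured are made up for illustration, not p.haul's actual code:

    # Hypothetical iterative pre-copy: keep pre-dumping until the delta is
    # small enough (or we give up), then take the final dump.
    max_iters=5
    threshold_kb=4096
    parent=""
    for i in $(seq "$max_iters"); do
        dir=/tmp/ckpt/pre$i
        runc checkpoint --pre-dump ${parent:+--parent-path "$parent"} \
             --image-path "$dir" xonotic
        rsync -a "$dir" dest:/tmp/ckpt/
        delta_kb=$(du -sk "$dir" | cut -f1)
        [ "$delta_kb" -le "$threshold_kb" ] && break   # converged
        parent="../pre$i"   # next dump is incremental against this one
    done
    # Only now does the container stop; if the workload dirties memory
    # faster than we can transfer it, the loop above never converges and
    # pre-copy can make the total downtime worse, as mentioned.
    runc checkpoint ${parent:+--parent-path "$parent"} \
         --image-path /tmp/ckpt/final xonotic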