Welcome to my talk about container live migration. My name is Adrian Reber and I work at Red Hat. I have been involved in process migration, which is the basis for container migration, for at least the last 10 years. This is all based on CRIU, which I will give an introduction to here. I have been working on CRIU since at least 2012, and focusing on container migration since 2015. Everything I'm talking about here has already been written down in an article. It can be found here.

I want to start with the definition of what I mean when I say container live migration, because it's something people often ask me about in detail. Basically it's the idea of transferring a running container from one system to another. You could also say stateful migration. So the process just continues to run from the same point in time at which you stopped it before the migration. The basic concept is: I serialize the process, or the whole container, on my source system somehow. Then I transfer it to the destination system, and then I just restore it, restart it, and the container keeps on running from the same point in time at which I started the migration of the whole thing.

As already mentioned, this is all based on CRIU, Checkpoint/Restore In Userspace. There are multiple integrations of CRIU in different container runtimes, and I will give an overview later of which container runtimes have CRIU support right now. The main things I will demo will all be Podman-based. So this is about the integration of CRIU in Podman and how to use it to live migrate containers.

So I want to give you some details about how CRIU works. The first step is that you have to checkpoint your processes. You have a container with multiple processes running inside it, and you tell CRIU: I want to checkpoint this container.
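
Because a container holds multiple processes, a checkpoint has to cover the whole process tree below the container's first process. The following is a minimal sketch of how such a tree can be derived from the proc file system on Linux. It is an illustration only and not CRIU's actual code; the helper names `parse_stat`, `read_parents` and `process_tree` are hypothetical, and the PIDs in the example are made up.

```python
import os

def parse_stat(line):
    """Return (pid, comm, ppid) from one /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split at the last ')' instead of on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    fields = tail.split()  # fields[0] is the state, fields[1] the parent PID
    return int(pid), comm, int(fields[1])

def read_parents(proc="/proc"):
    """Map every visible PID to its parent PID (Linux only)."""
    parents = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat"), errors="replace") as stat:
                pid, _, ppid = parse_stat(stat.read())
        except OSError:
            continue  # the process exited while we were scanning
        parents[pid] = ppid
    return parents

def process_tree(root, parents):
    """Return 'root' plus all of its descendants, parents before children."""
    tree, queue = [], [root]
    while queue:
        pid = queue.pop(0)
        tree.append(pid)
        queue.extend(sorted(p for p, pp in parents.items() if pp == pid))
    return tree

# Synthetic example: PID 100 plays the role of the container's first process.
sample = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
tree = process_tree(100, sample)
```

On a Linux system, `process_tree(os.getpid(), read_parents())` would list the current process and its children in the same way.
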
You give it the PID of the first process in the container, and it will stop and collect the information of all processes in the process tree, so child processes are always checkpointed together with the first process. One possible way CRIU does this is by using ptrace to stop the processes. There is also the option of using the cgroup freezer to stop the processes. So CRIU stops the processes, collects all the information and writes it to disk.

The tool is named CRIU, Checkpoint/Restore In Userspace, and there's a reason for the name. Before CRIU was developed, there were multiple other checkpoint/restore implementations for Linux, and they were not in user space in the same way. They were either completely in kernel space, or they were even more in user space, with syscall something, so whatever. CRIU works in a different way. CRIU tries to use existing kernel interfaces as much as possible, so there's basically not one kernel interface added by CRIU which is only useful for checkpoint/restore. The interfaces CRIU added to the kernel are, most of the time, there to get more information about the running process from the kernel. So for a lot of the things CRIU has added since 2012, there are other use cases which are using this new information.

Once CRIU has collected all the information from the proc file system, there's the next step, which is called the parasite code. This is probably my favorite part of CRIU, because it's also the craziest if you go into the details. You wouldn't expect something like this when you start looking at a project like this; I don't know, I wouldn't expect anybody to do something like this at all. The parasite code is injected into the running process. The process has been stopped, paused using ptrace, and now CRIU extracts some code out of the process using ptrace and replaces this code with the parasite code.
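
The save-and-replace idea behind the parasite code can be illustrated with a toy model. This sketch only simulates the bookkeeping on a Python `bytearray`; it does not use ptrace and does not touch a real process, and the function names `inject` and `cure` are hypothetical.

```python
def inject(memory, offset, parasite):
    """Overwrite part of 'memory' with 'parasite' and return the saved bytes."""
    saved = bytes(memory[offset:offset + len(parasite)])
    memory[offset:offset + len(parasite)] = parasite
    return saved

def cure(memory, offset, saved):
    """Copy the saved original bytes back over the parasite."""
    memory[offset:offset + len(saved)] = saved

# Stand-in for the code region of a paused process.
text = bytearray(b"original program code")
before = bytes(text)

saved = inject(text, 9, b"PARASITE")   # original bytes are kept aside
during = bytes(text)                   # the region now holds the parasite

cure(text, 9, saved)                   # original bytes are copied back
after = bytes(text)                    # identical to 'before'
```

The point of keeping `saved` is that the region is byte-for-byte identical once the parasite is removed, so the process cannot tell that anything was ever swapped in.
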
Now that the parasite code is in there, CRIU restarts the process at the point of the parasite code. The parasite code is running inside the address space of the process it wants to dump, and it runs as a kind of daemon: the parasite code connects to the main CRIU process, and the CRIU process can send commands to it to do things from within the address space of the running process. One of the biggest things happening from inside the parasite code is dumping the memory of the process to disk, so that it can be restored later and the same memory content is there as before checkpointing. Although ptrace offers a way to extract memory out of the process, this used to be slow at the point when the parasite code was written. And if you look at migration times, the dumping of the memory is really fast; most of the time for your migration will always be spent on the network transfer of the checkpoint image from one system to another.

So the parasite code is used to write all this information to disk, and once the parasite code is done, it's removed from the process. CRIU calls this curing the process. The parasite code is removed and the original code which was there is copied back, so that if you want to continue to run your process, it will just run without ever knowing that it was under the control of the parasite code. At this point, the checkpointing is basically finished. All the information has been written to disk, and in the case of a migration you would probably kill the checkpointed process so that it is stopped, but it can also continue to run. This is whatever you feel is best for your use case.

What's also interesting about container live migration: if you're running with Podman, you're probably running on a system with SELinux, and CRIU with SELinux is especially interesting.
I gave a talk at the Linux Security Summit about this, because CRIU does things which the SELinux policy is not really happy about. So you have to invest some additional time to let CRIU do the right things if it's running under SELinux control. But this is just too much for my time here today.

Once the checkpointing is finished, you come to the second step, which is restoring the process. First, CRIU reads all the checkpoint images to see what is there, and then CRIU basically creates a process for each process which used to be in the process tree, and for each thread which used to be there. There was a talk I gave at the Linux Plumbers Conference about CRIU and the PID dance, because creating a process with a specific PID used to be complicated on Linux. There was an interface, and you had to write the PID you wanted to that interface, then be really fast with the fork and hope that no other process was created at the same time. But with the help of Christian we introduced clone3, and now we can create a process with a certain PID. This is available since Monday, with Linux 5.5, and CRIU also has all the code to use clone3 if your kernel has it. So now CRIU can create new processes with fewer calls and without any race where some other process might have been created in between.

Once all these processes have been created, they are morphed into the processes which should be restored. And here I like the example of file descriptors and their position. During checkpointing, CRIU tries to figure out all the file descriptors, which file they point to and at which position they are.
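
The file descriptor example can be sketched in a few lines. This is a simplified illustration for a regular file within a single process, assuming Linux and its `/proc/self/fd` links; it is not CRIU's implementation, and the helper names `describe_fd` and `restore_fd` are hypothetical.

```python
import fcntl
import os
import tempfile

def describe_fd(fd):
    """Record what is needed to recreate 'fd': number, path, flags, position."""
    return {
        "fd": fd,
        "path": os.readlink(f"/proc/self/fd/{fd}"),   # Linux-specific
        "flags": fcntl.fcntl(fd, fcntl.F_GETFL),
        "pos": os.lseek(fd, 0, os.SEEK_CUR),
    }

def restore_fd(info):
    """Reopen the file under the same descriptor number and position."""
    new_fd = os.open(info["path"], info["flags"])
    if new_fd != info["fd"]:
        # Move it to the original number; this sketch assumes that number is free.
        os.dup2(new_fd, info["fd"])
        os.close(new_fd)
    os.lseek(info["fd"], info["pos"], os.SEEK_SET)
    return info["fd"]

# Create a small file to work with.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
os.read(fd, 4)                 # advance the position to offset 4
info = describe_fd(fd)         # "checkpoint" the descriptor
os.close(fd)                   # the descriptor goes away

restored = restore_fd(info)    # "restore" it under the same number
data = os.read(restored, 3)    # reading continues at offset 4
os.close(restored)
os.unlink(path)
```

After the restore, the reader of the file sees the same descriptor number and the same offset as before, which is the property the process relies on when it continues to run.
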
CRIU writes that to the checkpoint images, and once the process is restored, the file is opened with the same file descriptor and it is seeked to the same position. Once the process keeps on running, the file descriptor is in exactly the same situation as it used to be before checkpointing. That's basically what CRIU does with all the other resources the process is using. All the memory pages are mapped back to the place where they used to be before checkpointing, and we load all the security settings: AppArmor, SELinux and seccomp. We do this as late as possible, as mentioned, so that those policies do not interfere with CRIU changing and restoring the process. Once the process has been set up in all the ways it has to be, we jump back into the original code, and the processes can continue to run from the same point in time where we checkpointed them before. So that's basically where the process restore is finished.

So now to container live migration, to the actual inclusion of CRIU in different projects. I think the first one I have to mention here is OpenVZ, because they invented CRIU for their container use case, to be able to live migrate their containers from one system to another. I never used it myself, but that's the group who invented CRIU.
Then one interesting user of CRIU is Google, which we were informed about around one and a half years ago. Google actually uses CRIU in their container runtime Borg to live migrate processes in production a lot. As far as we as upstream CRIU know, they are very happy with how it works, and it works reliably for them, so this is something we're pretty happy about as upstream. LXC/LXD has had an integration of CRIU for a very long time already. Then there's an integration of CRIU in Docker. You have to enable the experimental mode to use it, and at this point in time I would say it's basically unmaintained, so I'm not sure how well it works right now.

The thing I've been working on for the last two years is the integration of CRIU into Podman. We have seen a talk about Podman in the morning already; it's a container engine which is daemonless and rootless. I started to work on this at the beginning of 2018, and the first code was written in May and merged in October 2018. This was only the checkpoint/restore implementation, so you could checkpoint your container, reboot your system, restore your container, and it would continue to run from the same point at which you had checkpointed it. This required changes on all the levels: Podman, runc, conmon and also CRIU, because of how Podman handles network namespaces. After that I continued to work on container live migration for Podman. This was merged in 2019, last year. This also required changes on all the levels which are involved, and the SELinux changes were part of this as well.

And with this I'm already at my demo. I copied the commands from my demo onto the slides, but let's run them here. What I'm doing here is running a WildFly container. I have a stateful application there, so that container migration is at least somehow useful. So let's start a container here. The WildFly container is a nice use case because it actually
takes some time to start, because all the Java things need to be loaded, and restoring it from the checkpoint is actually much faster, like 50% faster, than starting the container fresh. So now I can access my Java container. I have the simplest application, which just returns an integer, and every time I read it, it's increased by one. I'm using curl to access the IP address of the container and my application, which is called Hello World. The first result is 0 and the second result is 1, so it's simple, but it's stateful.

Now I'm telling Podman to checkpoint the container. I'm using the flag -R, which tells Podman to keep the container running, so I'm making a checkpoint of my container while it keeps on running. So now Podman is telling CRIU to make the checkpoint. The checkpoint has been written to disk, and now I'm accessing my container again, and now it should say 2 and 3. So the container keeps on running while I made the checkpoint.

Now I'm transferring the checkpoint archive. The archive includes all the files about the running processes, all the memory pages which have been dumped, and all the changes which have been made to the file system of the container. So this includes all file system changes and all process state, which I'm now transferring to another VM on my system. Now I'm telling Podman on the other system to restore the container. This usually takes about four seconds, something like this. Now the container is restored, and I can access the container using curl again, and I'm getting back the 2 which I also got up there at the top, which is the value from the time of checkpointing the container. So after the checkpoint the original container changed its state, but I can continue the container from the same state it was in when I checkpointed it. That's my demo, and with that, I'm already done. Thanks.

Thank you.

Hi, thank you for your talk, it's cool technology. You mentioned that a year and a half ago Google integrated this into
Borg. My understanding about Kubernetes is that stuff is supposed to flow from Borg into Kubernetes, at least theoretically. Have you heard any noise about people being interested in checkpoint/restore in Kubernetes?

No, personally I haven't heard anything of that. Basically, integrating it into Podman is my first step towards getting it somehow into Kubernetes. So now I have to somehow get it into, I don't know, CRI-O or something like this, and then maybe Kubernetes. But the fact that Google uses it internally might make the discussion about the usefulness of container live migration for Kubernetes a bit easier, because that's probably one of the problems: containers are supposed to be stateless, so why would you have to live migrate them? So it might make it easier to get it into Kubernetes.

Hi, you talked about file descriptors being copied over. Can you talk more about sockets being copied over? Like how it works behind the scenes?

So this is probably a question about TCP sockets, something like this. CRIU can checkpoint and restore network sockets, so if you have a working TCP connection, it will still work on the destination host. The only thing you have to do is make sure the restored process has access to the same IP address, because without the same IP address you cannot restore a TCP connection. For UDP it doesn't matter, it just works, and for TCP you have to have the same IP address.

Other questions? Okay, there's another one.

Databases: could you please tell us more about how well it works with databases? We had this experience before, and databases usually need to be stateful, and it was a problem for us to handle the migration of active databases. So what is the progress right now with this?
Thank you.

So, databases. I guess this basically depends on how your database is laid out in your container. If all your database files are mounted into the container, then you migrate your container, you have to migrate your data directory as well, and then you restore it, and it should work. Many years ago we tried to migrate Oracle databases. This worked, but the database shut itself down after the migration, and we think that this is because the time is different on the different hosts. So with the time namespace, which was just accepted this week, and once it makes its way into CRIU, this could be a way for us to tell the process in the container that its time actually hasn't changed: you're still running on the same CLOCK_MONOTONIC as before, or something like this. So the work on the time namespace is probably the most important for databases, I would guess.

Thank you.