Welcome to my talk about container migration. My name is Adrian Reber. I have worked for Red Hat since 2015, and I have been working on process migration, which is the basis for container migration, for the last 10 years. I have been involved in CRIU, which is what we use to migrate processes here, since at least 2012. And since 2015, I have been focusing on migrating containers. Content similar to what I have here, especially the demo at the end, is also available in a blog post. It's a bit different because it's based on RHEL 8.1 beta, but it's pretty close to what I have here.

The first thing I want to do is define what container live migration is, because usually when I talk about container live migration, there are different expectations of what it is. From my point of view, it's more or less the same as virtual machine migration. You somehow transfer a running container from one system to another, which could also be called stateful migration, maybe, or live migration. Multiple definitions would probably work. The first step is that you serialize the container on your source system. This is what we use CRIU for. We write everything to disk. Then we transfer it to the destination system and restore it. And then the container keeps running with the same state it had on the source machine.

As already mentioned, this is based on CRIU, Checkpoint/Restore In Userspace. CRIU is one of many checkpoint/restore implementations from the last maybe 20 years. What is different about CRIU is that it tries to do most things in user space. That's why it has the name. CRIU is integrated into multiple container engines, and later in my talk I will focus on the integration of CRIU into Podman, which is what I worked on. Before going to the container engine integration, I want to give some details about how CRIU works. The first step is the checkpointing of the process or of the container.
CRIU uses ptrace to pause the process, or the container. Then it starts collecting information about the process and writes it to disk. One of the main interfaces CRIU uses to collect the information is /proc/PID. That's also one of the reasons why it's called checkpoint/restore in user space: it queries the information about the process from user space. When CRIU was initially developed, many interfaces already existed, but the CRIU developers added new interfaces to the kernel to get more information about running processes out of the kernel. And, at least for me, the interesting thing about this is that those interfaces are not checkpoint/restore only. They are already used for other things. It was always important for CRIU to get things into the kernel that are not checkpoint/restore only.

Once CRIU has collected all the information about the process using /proc/PID, the next step is what CRIU calls its parasite code. This is one of my favorite parts of CRIU, maybe also because it's one of the craziest parts of CRIU: how to retrieve information from a running process. The parasite code is injected into the running process using ptrace; some existing code is replaced, and then the parasite code runs inside the process. The parasite code is basically a daemon waiting for commands: it connects to the main CRIU process and waits for commands to do things for CRIU from within the address space of the process to be checkpointed.
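To make the user-space collection step more concrete, here is a minimal sketch of reading one piece of per-process information from the /proc directory: the memory mappings in `/proc/<pid>/maps`. This is not CRIU's code (CRIU is written in C and reads many more files); the names `Mapping`, `parse_maps_line`, and `read_mappings` are hypothetical and only illustrate what is visible from user space.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Mapping:
    """One line of /proc/<pid>/maps (hypothetical helper type)."""
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line: str) -> Mapping:
    # Line format: address-range perms offset dev inode [pathname]
    fields = line.split(maxsplit=5)
    start_hex, end_hex = fields[0].split("-")
    # Anonymous mappings have no pathname field.
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], path)


def read_mappings(pid: str = "self") -> List[Mapping]:
    maps_file = Path("/proc") / pid / "maps"
    if not maps_file.exists():  # not on Linux, or no such process
        return []
    lines = maps_file.read_text().splitlines()
    return [parse_maps_line(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Show the first few mappings of the current process.
    for mapping in read_mappings()[:5]:
        size = mapping.end - mapping.start
        print(f"{mapping.start:#x} {size:>10} {mapping.perms} {mapping.path}")
```

A checkpointing tool needs this kind of list to know which address ranges exist and which files back them before it can dump the page contents.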
One of the main tasks of the parasite code right now is to extract all the memory pages from the process, which are needed later to restore the process. There are ways of doing this with ptrace, but at the time CRIU was first developed, ptrace was really slow, and you want to get the memory pages out of the process and onto disk as fast as possible. If you actually migrate a process or a container from one system to another now, the time to get the memory pages out of the process is much shorter than the transfer time over the network, so this is in pretty good shape right now, thanks to the parasite code and how it works with CRIU and the process it is injected into.

Once the parasite code has dumped the memory and done the other things it does from within the address space of the process, the parasite code is removed from the process again. This is what CRIU calls curing the process. The process can then continue to run, and in most cases the process will never know that it was under the control of CRIU and the parasite code. Maybe if it checks, I don't know, CLOCK_MONOTONIC or something, it will see that it was paused for some time, but usually the process does not know that it was under the control of CRIU or the parasite code. At this point, the checkpointing is finished. All relevant information is written to disk, and the target process can be killed or it can continue to run, depending on how you want to use checkpoint/restore or migration.
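As an illustration of the remark about the monotonic clock, here is a small sketch of how a process could notice that it was frozen for a while: it samples a monotonic clock at regular intervals and looks for gaps that are much longer than the sampling interval. The function names and the threshold are hypothetical. On Linux, Python's `time.monotonic()` is typically backed by CLOCK_MONOTONIC, which keeps advancing while a process is stopped.

```python
import time
from typing import List


def find_pauses(timestamps: List[float], threshold: float) -> List[float]:
    """Return every gap between consecutive samples longer than threshold seconds."""
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap > threshold:
            gaps.append(gap)
    return gaps


def sample_clock(samples: int, interval: float) -> List[float]:
    """Collect monotonic clock readings, sleeping `interval` seconds between them."""
    stamps = []
    for _ in range(samples):
        stamps.append(time.monotonic())
        time.sleep(interval)
    return stamps


if __name__ == "__main__":
    stamps = sample_clock(samples=20, interval=0.1)
    # A gap far above the 0.1 s interval suggests the process was paused.
    print(find_pauses(stamps, threshold=1.0))
```

A process that never looks at the clock this way has no obvious signal that a checkpoint happened, which matches the point made above.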
Another interesting thing when talking about container live migration is SELinux. The processes in a container run under the SELinux label of the container, and when the parasite code inside the container tries to communicate with the outside of the container, this is something SELinux is not really happy about. Also, during restore there are multiple steps where you have to restore the policies in the same way they were before checkpointing. I'm also giving a complete talk about CRIU and SELinux at the Linux Security Summit this year in Lyon.

So once everything is written to disk, the last step in process or container migration is restoring the process. The checkpoint images are all read from disk into CRIU's memory. CRIU always operates on process trees: you point it to one PID, and it will checkpoint that process and all child processes, and of course it will do the same during restore to recreate the process tree. I gave a talk at Linux Plumbers about CRIU and the PID dance, how CRIU tries to recreate the process tree with the same PIDs it had during checkpointing, and there might be a new interface using clone3() to improve this in CRIU. But the basic thing is that CRIU morphs itself into the process to be restored. First it forks all the processes, and then all the processes are recreated in the state they were in during checkpointing.

One example I always like to give is the file descriptor. CRIU records the file descriptor number, which file it points to, and the position. During restore it opens the file with the same file descriptor number and puts the file descriptor at the same position. Once the process continues to run and writes to or reads from that file, the file descriptor will point to the same location.
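The file descriptor example can be sketched in a few lines. This is only an illustration of the idea under simple assumptions (a regular file, a POSIX system, everything in one process), not how CRIU implements it; `FdRecord`, `checkpoint_fd`, `restore_fd`, and `demo` are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FdRecord:
    """What is remembered about one open file (hypothetical type)."""
    fd: int      # descriptor number the process was using
    path: str    # file the descriptor pointed to
    flags: int   # open flags, e.g. os.O_RDONLY
    offset: int  # current position in the file


def checkpoint_fd(fd: int, path: str, flags: int) -> FdRecord:
    # lseek with offset 0 from SEEK_CUR reports the position without moving it.
    return FdRecord(fd, path, flags, os.lseek(fd, 0, os.SEEK_CUR))


def restore_fd(record: FdRecord) -> int:
    new_fd = os.open(record.path, record.flags)
    if new_fd != record.fd:
        # Move the descriptor to the original number. This assumes that
        # number is free; dup2 would silently replace anything open there.
        os.dup2(new_fd, record.fd)
        os.close(new_fd)
    os.lseek(record.fd, record.offset, os.SEEK_SET)
    return record.fd


def demo() -> bytes:
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    fd = os.open(path, os.O_RDONLY)
    os.read(fd, 4)                       # the "process" has read 4 bytes
    record = checkpoint_fd(fd, path, os.O_RDONLY)
    os.close(fd)                         # the original descriptor goes away
    restored = restore_fd(record)
    data = os.read(restored, 3)          # reading continues at offset 4
    os.close(restored)
    os.unlink(path)
    return data


if __name__ == "__main__":
    print(demo())
```

After the restore, the code using the descriptor sees the same number and the same position, so its next read continues where the previous one stopped.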
That's how CRIU tries to restore all the resources it can control. Another thing is that it maps the memory pages back to the right location, and it loads the security settings, as already mentioned. CRIU can handle AppArmor, SELinux, and seccomp. It does this as late as possible, because if it did it earlier, the security settings would partially be problematic for CRIU's restore process. That's why it's one of the very last steps before CRIU lets the process jump into the original code, and the code continues to run as it did before checkpointing.

Now to container live migration, after the short 30-slide introduction. Container live migration exists for multiple container engines. Maybe the first one using CRIU was OpenVZ. The company behind OpenVZ was also the company which invented CRIU, to make their containers live migratable. Another interesting user of container live migration is Google. At Linux Plumbers in the last two years, they talked about how they live migrate their containers in production. Everything which is not interactive, that is, long-running batch jobs, is live migrated using CRIU from one node to another if they are under resource pressure.

CRIU has also been integrated in LXC and LXD for some time now. Then there's an integration into Docker; I would call it basically unmaintained, from what I have seen of what was done with it in the last few months or year. And then there's the Podman integration, which I have been working on for the last one and a half years, to enable Podman to live migrate containers. Some keywords about Podman: it runs containers without a daemon, unlike maybe LXD or Docker, and you can run it without root, so just as a user you can run your Docker containers.
And the checkpoint/restore implementation for Podman, which I did, started sometime in the beginning of 2018. There was some code in May 2018, and it was merged in October 2018. At that time it was not yet live migration, it was only checkpoint/restore. So you could checkpoint your container, reboot your system into a newer kernel, and then restore the container with the same state, so it keeps running with the same memory and settings it had during checkpointing. This required many changes to runc, which is one of the container runtimes Podman can use, and it required CRIU changes and Podman changes, of course. Then, a few months ago in June, the changes to implement container live migration were merged into Podman. This again required changes in all of the involved packages in the stack, from Podman down to CRIU.

And now I want to give a short demo of how it works. The first thing I want to do is start a WildFly container. This is a Java application server, and it runs a really simple application. It basically returns an integer and then increases it. So it's stateful, but it's really simple.

Now I can say I want to checkpoint this container. The flags: -R leaves the container running after the checkpointing, -l tells Podman to work on the latest container, and the export writes everything about this container checkpoint into this file. The file contains metadata about the container. It contains the actual checkpoint image, which is mainly memory pages. And it contains the file system changes relative to the layer the container was started with. So all the files which were changed during the runtime of the container are also included in this checkpoint archive. Now I can copy the checkpoint archive to another machine. I'm using two virtual machines here. Both are RHEL 8.1 beta something, and I'm using Podman from Git.
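The checkpoint step from the demo, and the restore step on the other machine, could be scripted roughly as follows. This sketch only builds the command lines and does not run Podman. The archive path and the container name are hypothetical, and the long option names are assumed to be the spelled-out forms of the short flags used in the demo; check `podman container checkpoint --help` and `podman container restore --help` on your version.

```python
from typing import List


def checkpoint_command(archive: str, leave_running: bool = True) -> List[str]:
    """Build the command that checkpoints the latest container into an archive."""
    cmd = ["podman", "container", "checkpoint", "--latest"]
    if leave_running:
        # Keep the source container running after the checkpoint is written.
        cmd.append("--leave-running")
    cmd.append(f"--export={archive}")
    return cmd


def restore_command(archive: str, name: str = "") -> List[str]:
    """Build the command that restores a container from an exported archive."""
    cmd = ["podman", "container", "restore", f"--import={archive}"]
    if name:
        # A different name allows restoring the same archive more than once.
        cmd.append(f"--name={name}")
    return cmd


if __name__ == "__main__":
    archive = "/tmp/chkpt.tar.gz"  # hypothetical path
    print(" ".join(checkpoint_command(archive)))
    print(" ".join(restore_command(archive)))
    print(" ".join(restore_command(archive, name="second-copy")))
```

To actually execute one of these lists, it could be passed to `subprocess.run(cmd, check=True)`, with the archive copied between the machines in between, for example with `scp`.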
On the other machine I can now restore: I say podman container restore and I tell it to read from the checkpoint archive. Podman now unpacks the archive and tells runc and CRIU to recreate the container. The last access to the container returned a two, so if I now access the migrated container I should probably get a three, and I get a three. So the container was live migrated with its state. I can also restore the container a second time. With -n I give the container another name. And if I now access that container, I should again get the three, and yeah, so that's the live migration of the container.

Another interesting thing about this feature is that you can not only live migrate containers like I did. In my case, the Java application server with this really simple application takes around eight seconds to start up and be able to answer requests, and it takes around four seconds to restore it from the checkpoint. So in my really simple example, I can reduce the startup time of the container by around 50%, just by not having Java do all its initialization but by restarting it from the checkpoint, which already has all the libraries and memory set up and loaded the way Java wants it.

On the slides the example is written out, the same thing I just did in the live migration. And with that I'm already at the end of my presentation. There are lots of links to recordings, blog posts, and articles, all concerning this presentation. Thanks for your attention. Any questions?

Thank you for the talk.
Yeah, I've got a question regarding uninterrupted live migration. It was a very good example of suspend and resume that you demonstrated, but if I have a live cluster and I start live migrating things, how well supported is that outside of Podman in this case, the service that runs the pod? Like, what happens to my open TCP connections or anything like that? Is that managed yet, or is this work still to do?

So from CRIU's point of view, if you can somehow migrate your IP address, then open and established TCP connections will be migrated. I think the most interesting thing is established TCP connections, and if the IP stays the same, established TCP connections will stay connected. But yeah, you have to migrate within the TCP timeouts and so on.

If the machine you migrated to didn't have the container images already there, would Podman download them?

Yes, yes. I don't do this in my demos, but Podman sees from the metadata in the checkpoint image which container image this is based on. Podman would then go out to the registry, download the image, and then do the restore based on this downloaded hash or whatever it downloaded there. I'm not doing it in a demo because I don't know how long the download will take.

Any further questions? Thank you, Adrian.