Welcome to my presentation about container live migration. My name is Adrian Reber. I have been working at Red Hat since 2015, and for the last ten years I have been focusing on process migration. Process migration is the basis of the container live migration I'm showing here today. I have been involved in CRIU, Checkpoint/Restore In Userspace, since 2012, which is the basis for the process migration we are using here, and I have been focusing mainly on container migration for the last five years. The information from today's presentation can also be found in a blog post I wrote, with all the details I'm mentioning here and in my demos.

Today's agenda: first I want to go over a few use cases, why container migration may be interesting and how it can be used. Then I want to go into the details of how CRIU enables you to migrate processes and containers. Then I want to give a few additional demos, and finally a short outlook on where this process and container migration might be heading.

I'd like to start with a definition of container live migration, because this is not always clear to everybody I talk to. For me, it's basically the same as what happens when you migrate a virtual machine: you transfer your container from one system to another. You could also call it stateful migration, meaning the container continues to run in the same state on the destination system. From a very high-level view, container migration is just a few simple steps: you serialize your container somehow on the source system — in our case we write it to disk — then you transfer the result from the source system to the destination system, and there you just restore the processes in the container. That's it: you have container live migration. This is all based on CRIU, Checkpoint/Restore In Userspace. CRIU is one possible way to do checkpoint/restore on Linux.
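The three steps just described — serialize to disk, transfer, restore — can be illustrated with a toy sketch. This is only an analogy for the shape of the workflow, not how CRIU serializes real process state; all names here are hypothetical.

```python
import json
import os
import shutil
import tempfile

# Toy illustration of the three migration steps (NOT how CRIU works
# internally): checkpoint -> transfer -> restore.

def checkpoint(state, path):
    """Step 1: serialize the 'container' state to disk on the source."""
    with open(path, "w") as f:
        json.dump(state, f)

def transfer(src, dst):
    """Step 2: move the checkpoint archive to the destination system."""
    shutil.copy(src, dst)  # in real life: scp/rsync between two hosts

def restore(path):
    """Step 3: recreate the same state on the destination."""
    with open(path) as f:
        return json.load(f)

state = {"counter": 3, "cache": ["warm"]}
src = os.path.join(tempfile.mkdtemp(), "ckpt.json")
dst = os.path.join(tempfile.mkdtemp(), "ckpt.json")
checkpoint(state, src)
transfer(src, dst)
restored = restore(dst)
assert restored == state  # the state continues unchanged on the destination
```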
Today I would say it's one of the best options if you want to do checkpoint/restore on Linux, and the integration into container runtimes in particular works best in combination with CRIU. There are multiple integrations of CRIU into container runtimes and container engines; in my talk I will focus on the integration into Podman: how Podman can be used to checkpoint and restore containers, and how Podman can be used to migrate containers from one system to another.

I mentioned I want to give a few use cases where container migration, or the technology behind it, might be useful. One is a kernel update on the system where the containers are running, while you have stateful containers — maybe a database, something that has data loaded into memory and warm caches — and you do not want to lose the state of the application in the container. Podman in combination with CRIU can be used to checkpoint the container; you update your kernel, reboot the system, and restore the container from its previous state. I have prepared a short diagram showing this. The different colors are basically the memory of the processes and of the operating system: you have your host running with all its memory used by different programs and by the operating system, and one part of the memory is the container with its state. What we want to do is take the state of the container out of the running operating system, so we checkpoint the container onto disk, and the state of the container is now saved. Then we reboot the host, so the old host is gone, and we restore the container into the new host, which I tried to visualize using green — the host is now in a different state because it has been rebooted, with a new kernel for example — but the container and all its state and memory are still the same as before.

The demo looks like this. First I check whether any containers are running — there are none. Now I start a container I have prepared, a WildFly-based container; WildFly is a Java application server, and I created a really simple stateful application in there, called "hello" in my case. If you query the stateful application over HTTP, it returns an integer and increments it, so the next time you query the application you get the next integer. It's probably the most simple form of a stateful application. I use curl to query the container: podman inspect gives me the IP address of the container, it's running on port 8080, and then there's the path to the application. Running this gives me back a 0, running it a second time gives a 1, then a 2, and so on. If I rebooted the system and started the container again, it would start again at 0.

So what I do now is say podman container checkpoint and tell Podman it should work on the latest container I started. Podman talks to runc, runc talks to CRIU, and they write the checkpoint to disk. If I now look at the output, there is no container running anymore, and if I try to connect to it, it doesn't work, because there is no container anymore. Now I reboot the system and wait a few seconds until the virtual machine comes back up. The system is RHEL 8.2, with Podman, runc, and CRIU also from RHEL 8.2 installed — it's all just what comes with the operating system out of the box, no additional changes necessary. Now let's look: podman ps shows no container running, and if I try to connect to my container it of course doesn't work. So now I try to restore the container: I say podman container restore, and again work on the latest container. That did something, so let's see — if I connect to the container again, I should get back a 3, because that was the state the container was in when I did the checkpoint.
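The stateful "hello" application from the demo can be approximated by a minimal counter — this is only a sketch of the idea, not the actual WildFly application; all names here are made up.

```python
# Minimal analog of the stateful "hello" application: every request
# returns the current integer and then increments it, so the whole
# "state" is one in-memory value that a plain reboot would lose.

class HelloCounter:
    def __init__(self):
        self.value = 0            # state that lives only in memory

    def handle_request(self):
        current = self.value      # value returned for this request
        self.value += 1           # the state advances on every query
        return current

app = HelloCounter()
print(app.handle_request())       # → 0
print(app.handle_request())       # → 1
# a checkpoint taken now and restored later would continue at 2
```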
And indeed I get a 3, a 4, a 5. So I was able to do a checkpoint, reboot my system, and restore the container in the same state it was in before checkpointing, without losing any state or information the container had. That was one of the possibilities where checkpoint/restore, the technology behind container live migration, can be helpful.

The other one is quick startup. I have a container running on my system, and it takes some time to start up because it initializes memory and has to load libraries. What I can do is take a copy of my running container, write it to disk, and then create multiple copies of the same container which has already been initialized and does not need any further initialization or startup time. Going back to my demo system: I have a container, it's running, and it probably returns a 6 now. Now I want to make copies of this already-initialized container. I say podman container checkpoint, again tell it to work on the latest container, and tell Podman to keep the container running: it should not stop the container, it should just create a checkpoint while the container keeps running. Additionally, I tell Podman to export the checkpoint to a file, which can later be used to create a copy of that container. Now Podman, runc, and CRIU write the checkpoint of the container to disk, and I can still access my container — 7, 8 — it's still running as before. But now I can use the checkpointed container to create another instance of the same container, in the same state it was previously in.

I say podman container restore, this time with import, pointing Podman at the checkpoint archive I just created, so that it creates another container from the exported checkpoint information. Now Podman says "error creating container storage", because Podman tries to recreate a container with the same ID it had when I checkpointed it, and that ID is of course still used by the original, still-running container. If the original container had been stopped, this would have worked; as it is, I have to tell Podman to give the container a new name, so I add a name, hello1, and now I don't get an error. If I say podman ps, I now have two containers running, and they appear to be running the same image and the same command. Let's try to access the container I just restored: I should get back a 7, I think, because I got back a 6 the last time before the checkpoint. It doesn't work — let me check what the command should look like: I don't pass a name option, I just give the name directly. Now I access the restored container and get back a 7, 8, 9, 10, 11, 12, and if I access my original container, which is called "inspiring_beaver" here, I get back 10, 11. So I now have two containers running which used to have the same state at some point in time and now have different states.

I can restore an additional copy of the container: again restore, import the checkpoint archive, and give it a new name, hello2. Podman restores the container, and if I access hello2 I should again get a 7, because that's the state the container was in when I checkpointed it — and yes, that's it. With podman ps I see three containers running which are almost identical; since they answered different queries they are now in different states, but they used to be the same container. I call this quick startup, and admittedly it cannot really be seen in this demo, because my small container starts up pretty fast. But if you look closely at the startup times: the initial startup time of the WildFly container, until it can answer queries, is about eight seconds, and if I restore the container from the checkpoint it's about four seconds. So I already get something like a 50% reduction in startup time, because I start from an already-initialized container.
The last use case I want to present is container live migration itself. Again I have a container running on my source system; I take it out of the running system with a checkpoint, writing it to disk, and then I can restore the checkpointed container on the destination system by moving the checkpoint archive there — stateful container live migration. I will show a demo of this at the end of the talk.

Now some details about CRIU. The first step in the whole operation is to checkpoint a container, and one of the first things CRIU does during checkpointing is seize the process, using ptrace in most cases. The process is paused, all necessary information is collected from it, and then written to disk. For collecting information about the process, /proc is one of the main interfaces CRIU uses: there are a lot of files there which give details about the state of a process. CRIU goes over all processes in the container and collects this information from /proc. This is one of the reasons CRIU is called Checkpoint/Restore In Userspace, and it is different from other checkpoint/restore implementations which existed on Linux previously: CRIU tried from the beginning to use existing interfaces as much as possible to get information about the running process. CRIU also added additional interfaces to the Linux kernel, but these interfaces are not only useful for checkpoint/restore; they are also used in other situations to get information about running processes. So CRIU gets a lot of information from /proc/<pid>, and the information which is not available through /proc, CRIU extracts from within the process itself, with a concept called parasite code.

The parasite code is probably the part of CRIU people find craziest, because it's just something you do not expect if you have never dealt with CRIU before. The parasite code is injected into the target process, and the process then continues to run the parasite code. The parasite code is now a kind of daemon running inside the process to be checkpointed, and this daemon waits for commands from the main CRIU process. The main CRIU process can now talk to the process it wants to checkpoint through the parasite daemon and tell it: I want all your memory pages, write them to disk; I want the information which can only be retrieved from within the address space of the process. All the information which cannot be accessed from outside the process, CRIU can now access from within its address space by using the parasite code. Once all that information has been extracted, the parasite code is removed again; CRIU calls this curing the process. The process should usually never know that it was under the control of the parasite code or CRIU, and at this point it can continue to run as before.

A simple diagram shows how this works: we have the original process running in memory; the parasite code comes along, we take out a copy of the original code and store it outside of the process, we put the parasite code into the process and use it to extract information from within the address space, and once we are done the process is cured, as mentioned before: the original code is put back into the process, and the process never knows about the parasite code. At this point checkpointing is finished, all relevant information has been written to disk, and the target process is either killed or, as seen in one of my previous demos, it can continue to run.
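Some of what CRIU collects comes straight from files under /proc/<pid>/. A sketch that inspects only our own process — CRIU reads these files, and much more via the parasite code, for every process in the container; the helper name is made up.

```python
import os

def inspect_process(pid):
    """Collect a small subset of what /proc exposes about a process."""
    base = f"/proc/{pid}"
    with open(f"{base}/status") as f:
        # key/value lines such as Name, State, Uid, VmSize, Threads ...
        status = dict(
            line.split(":", 1) for line in f.read().splitlines() if ":" in line
        )
    with open(f"{base}/maps") as f:
        mappings = f.read().splitlines()      # the memory regions
    fds = os.listdir(f"{base}/fd")            # the open file descriptors
    return status, mappings, fds

status, mappings, fds = inspect_process(os.getpid())
print(status["Name"].strip(), len(mappings), len(fds))
```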
Whether it continues is simply the user's decision: for a checkpoint/restore because of a reboot you stop the process; if you want, say, an incremental checkpoint every few hours so you don't lose the state of your container, you just take a checkpoint and leave the container and its processes running.

Now that checkpointing is covered, I briefly want to talk about container live migration and SELinux. When I initially looked into getting CRIU support into Podman, it basically worked after a few months — but the SELinux labels were never restored correctly; they were not even touched. Part of the problem is that, under SELinux, CRIU does things SELinux does not expect: you have your process running in the container, CRIU comes along and injects the parasite code, and all of a sudden something from within the container is talking to the outside of the container, because CRIU talks to the parasite code in the process. SELinux does not expect that and does not allow it. So additional steps were necessary to make CRIU work in combination with SELinux.

CRIU has had LSM support since 2015, implemented with the main focus on AppArmor, to support CRIU in LXC/LXD use cases, migrating processes under AppArmor confinement. It's really not implemented in a complicated way: during checkpointing CRIU reads the current attribute of each process and stores it in the checkpoint image, and during restore those attributes are simply written back. This worked for AppArmor, while the SELinux support basically said: if you run under any SELinux context that does not start with "unconfined", CRIU will just tell you it has no idea how to handle it. So it worked if you were not running under some special confinement, but with labels — security contexts — like Podman uses, it was just not working.

It needed a few changes to work in combination with Podman. The necessary changes were: as already mentioned, we had to label the socket used for the parasite daemon communication correctly. Then, as with AppArmor, we had to read and write the attributes of the current process. Then it became a little more complicated: when CRIU restores the process, it runs in a context outside of the container, and at some point we have to transition into a different context — and that kind of transition is not something SELinux easily allows, so we needed additional policies to make it possible. We also had to make sure we change the context as late as possible, so that during restore we run outside the container context for as long as possible. We influence the PIDs of the restored processes using /proc/sys/kernel/ns_last_pid, and this also needed additional policies to be able to write to that file during restore. All TCP sockets need to carry the correct SELinux label to work correctly, and we had to create the log files with the appropriate labels, because once CRIU runs under the container context it can no longer write to its log files. And in a few places we had file descriptor leaks into tools CRIU calls, so we also had to make sure those file descriptors do not leak, because SELinux was not happy with those file descriptors hanging around either.
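The attribute handling is small enough to sketch: CRIU reads each process's LSM label from /proc/<pid>/attr/current during checkpoint and writes it back during restore. A read-side sketch — the helper name is made up, and it returns None on kernels where no label is exposed.

```python
import os

def read_lsm_label(pid):
    """Read the SELinux/AppArmor label of a process, or None."""
    try:
        with open(f"/proc/{pid}/attr/current") as f:
            # under SELinux e.g. "system_u:system_r:container_t:s0"
            label = f.read().rstrip("\n\x00")
            return label or None
    except OSError:               # no LSM support, or process gone
        return None

print(read_lsm_label(os.getpid()))
```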
So now we are through with checkpointing, and with SELinux, and we come to the second and last step of the container migration process: restoring the processes. What happens here is that CRIU reads all the checkpoint images it wrote previously and uses them to restore the container and the processes in it, and then CRIU does a clone for each PID and thread ID in the container. The thing is, CRIU always operates on process trees: you give CRIU a PID and it checkpoints all processes below that PID — the process and all its child processes — and it restores the processes with the same PIDs so as not to break the parent-child relations of the restored processes.

Another interesting thing about how CRIU does this is what I call the PID dance. To restore a process with the same PID, CRIU used to do the following: it opened /proc/sys/kernel/ns_last_pid, wrote the PID it wants minus one into it, closed the file, quickly did a clone, and then did a getpid to verify that the resulting PID actually is the one it wanted. If the PID is a different one, because some other process was created during that time, CRIU fails at that point. So this can lead to race conditions — someone else can create a process in that window — and it requires multiple system calls, so it's slow. We were therefore looking for another way to create processes with a certain PID. There was already a syscall implemented back in 2010, for an in-kernel checkpoint/restore effort that was never merged upstream; it was called eclone, and it had a parameter to tell the kernel to create a process with a given PID, but it was never merged. So we tried again to avoid the PID dance, this time with the help of clone3. clone3 was introduced because the parameters of clone were running out, and it was designed from the beginning to be extensible, so that additional features could be added later. That's what we did with CRIU and clone3: we extended clone3 in the kernel with the set_tid parameter, and now we can create a process with the desired PID in one single syscall — no more PID dance; it's not racy and it's not slow.
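The old PID dance can be sketched roughly like this: write the wanted PID minus one to ns_last_pid, fork quickly, then verify. Writing that file needs privileges, so this sketch falls back to a plain fork when the write is denied; it only illustrates the racy, multi-syscall shape that clone3 with set_tid replaces. The helper name is made up.

```python
import os

def fork_with_pid_hint(target_pid):
    """Old-style 'PID dance': hint the kernel, fork, then verify."""
    try:
        with open("/proc/sys/kernel/ns_last_pid", "w") as f:
            f.write(str(target_pid - 1))  # next allocated PID should be target
    except OSError:
        pass                              # unprivileged: no hint possible
    pid = os.fork()                       # the quick clone()
    if pid == 0:
        os._exit(0)                       # child: exit immediately
    os.waitpid(pid, 0)
    # CRIU had to verify the result and bail out on a mismatch -- any
    # process created in the window can steal the wanted PID (the race):
    return pid == target_pid

fork_with_pid_hint(30000)                 # True only if the hint was applied
```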
It basically looks like this: we do a clone3 and pass the desired PID directly to the syscall, and we either get the new process or we don't. We do not even have to verify the PID of the created process, because the syscall simply fails if the PID we want is not available. So we create all the processes, we recreate the process tree, and then CRIU morphs the recreated processes back into the state the restored processes should be in. A nice way to describe this is with a file descriptor: when CRIU does a checkpoint, it looks at all the resources the process is using, including its file descriptors — which number a file descriptor has, which file it points to, and the position the file descriptor is at. When CRIU restores the process, it opens the same file with the same file descriptor number and positions the file descriptor at the same location, so once the process continues to run, the file descriptor is exactly the same as it was before checkpointing, and if the file is accessed, it just works as before. In addition to restoring all the resources the process uses, all the memory pages are mapped back at the right locations, and we load back all the security settings — AppArmor, SELinux, seccomp — as late as possible to make restoring easier, so that we are not confined by any of these security mechanisms during the restore itself. Once everything is restored as it was, we jump into the restored process, which runs as it did before checkpointing.
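The file-descriptor part of the restore just described can be sketched like so: record the fd number, the path, and the offset at checkpoint time, then reopen, pin the same fd number with dup2, and seek back. A simplified sketch, not CRIU's actual image format.

```python
import os
import tempfile

# Checkpoint/restore of a single file descriptor, CRIU-style: record the
# fd number, the path, and the offset; later reopen the same path, pin it
# to the same fd number with dup2, and seek to the same offset.

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("hello container")

# --- checkpoint side: the process has already consumed 6 bytes
fd = os.open(path, os.O_RDONLY)
os.read(fd, 6)                                        # reads b"hello "
record = {"fd": fd, "path": path,
          "offset": os.lseek(fd, 0, os.SEEK_CUR)}     # offset is now 6
os.close(fd)                                          # "the process is gone"

# --- restore side: reopen, force the recorded fd number, reseek
new_fd = os.open(record["path"], os.O_RDONLY)
if new_fd != record["fd"]:
    os.dup2(new_fd, record["fd"])                     # same fd number again
    os.close(new_fd)
os.lseek(record["fd"], record["offset"], os.SEEK_SET)
rest = os.read(record["fd"], 100)                     # continues at "container"
```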
Now to container live migration. Container live migration has existed for what you could call a long time, because OpenVZ, which has provided containers for a very long time, are actually the people who invented CRIU, precisely to provide migration capabilities for their containers. So OpenVZ invented CRIU and they use it: if you are running OpenVZ containers, you can live migrate them using CRIU. CRIU is also integrated into Borg, Google's container engine; we in CRIU upstream had reports from Google at the last two Linux Plumbers Conferences about how they use live migration in their container engine in production, and it works pretty well for them, as far as we are told — so it is something used in production that works reliably for them. Then there is the integration of CRIU into LXC to migrate containers, the integration into Docker to checkpoint, restore, and migrate containers, and the one I want to show here: the integration into Podman, which I have been working on for the last two and a half to three years.

A few words about Podman: Podman is a daemonless container engine, so there is no daemon running; you just start your container, and there is nothing else you need to talk to — you always talk directly to your running containers. With Podman you can also run containers as non-root; this is not really relevant for checkpoint/restore yet, because CRIU does not work as non-root so far — we are on the way there. To get checkpoint/restore into Podman, I think the first discussions started around the beginning of 2018; I had the first code ready in May 2018, and it was merged in October 2018, so we had checkpoint/restore capabilities at that point. It required changes to runc and CRIU and of course Podman, but at that point it was only possible to checkpoint and restore a container — basically only support for the first use case: checkpoint the container, reboot the system, restore the container. It was not possible to live migrate a container or to make copies of it. So my next step was to get container live migration working, and by June 2019 those changes were merged into Podman. This again required changes in all the involved layers — Podman, runc, CRIU, SELinux — and with all these changes merged, you can now actually checkpoint, restore, and migrate containers, as you have already seen in my demos. The nice thing is that the checkpoint I export to a file includes all file system changes, so you do not have to worry about your container creating files in its local container file system: all those file system changes are applied on restore of the container. Podman handles all the necessary changes to the container, and you just use this simple export functionality.

Now to my last demo, showing the migration of a container from one host to another. The slides just show the commands necessary to do this, in case the demo doesn't work, but let's go back to the demo system. I still have my containers running, so let's take a checkpoint of the newest container: checkpoint with --latest, keep it running, and export it again to a checkpoint archive file. Now the container is written to disk. Let's see which result I currently get from Podman's latest container: it says 9, 10, 11, 12. Now I transfer the checkpoint to another system — I just scp it to another host — then I connect to the other host and restore the container there: podman container restore, import the checkpoint archive, and give it a name, let's call it hello5. If I now access the container, it should return a 9 — and it does. So I have migrated my container, a stateful migration from one system to another, and the container also just keeps running on the first host, because I didn't stop it there. That's the demo for container migration; a few more slides show the commands I'm using here, which can be viewed offline if necessary.

About the future: people often ask about Kubernetes, so a probable future is migrating containers there, and I'm actively working on this. Once I had it in Podman, I started to look at CRI-O, a container runtime Kubernetes can use, and I implemented checkpoint/restore for CRI-O.
Because this requires API changes to the CRI, the container runtime interface, I'm currently waiting for reviews on my Kubernetes enhancement proposal to add checkpoint/restore to the CRI API. So I have an implementation to checkpoint and restore containers using CRI-O — it's not yet migration, just local checkpoint/restore — but I think this is the correct step to get container migration up into Kubernetes. The other thing we are working on is non-root checkpoint/restore. Currently CRIU requires being root, or CAP_SYS_ADMIN — you need the right capability to run checkpoint/restore. There have been requests from the community every other month: is it possible to checkpoint and restore a container as non-root? We usually said it's almost possible, but we never actually tried to implement the necessary changes in the kernel. Now, with Linux kernel 5.9, there will be a new capability called CAP_CHECKPOINT_RESTORE, and the relevant interfaces in the Linux kernel are now protected by this capability. So if you give CRIU the capability CAP_CHECKPOINT_RESTORE, it should be possible to run CRIU as non-root to checkpoint the processes of your own user. The CRIU patches are not yet merged, but they exist; the kernel patches are merged. So this could be ready in a few months to actually be usable on your local system.

To summarize: CRIU can checkpoint and restore containers, it's integrated into different container engines, and it's used in production. The use cases I've shown: reboot into a new kernel without losing container state; start multiple copies of one running container to skip initialization time or to keep all your data loaded in memory; and of course migrate a running container, because this talk is about container live migration — that is its main goal. With this I'm at the end of my presentation. Here are links to the things I mentioned during the talk. Thank you for listening.