Welcome to my presentation, Container Live Migration. My name is Adrian Reber; I have worked for Red Hat since 2015. I have been involved in process migration, which is the basis for the container migration I am talking about today, for at least ten years, and since 2012 I have been involved in CRIU, the tool everything today is based on. Since 2015 I have focused on container migration rather than single-process migration. Everything I am talking about today has also been written down as a blog post; that is over a year old now, but the steps I am doing, especially in my demos, can be found in that article, along with the ideas behind the container migration steps I am showing here. The agenda for today's presentation: first I want to present a few use cases for container live migration and for the techniques it provides. Then I want to go into some details about CRIU, Checkpoint/Restore in Userspace, the tool that does all the checkpointing and restoring of processes for us in the case of container migration, and I want to show demos using the tools, container engines, and runtimes I am talking about. And then I want to take a look at the future of container migration, and also at what may be coming next for CRIU. I usually like to start with a definition: what is container live migration? There is often some discussion about what container live migration actually is. For me, container live migration is basically the same as VM migration: you take your container and transfer it from one system to another. You could also call it stateful migration. The steps to migrate a container are simple: you first serialize your container on the source system and write it to disk or some storage backend, then you transfer this checkpoint to the destination system, and then you use it there to restore the container.
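CRIU performs these steps at the process level, serializing memory, file descriptors, and registers without any cooperation from the application. Purely as a toy analogy in application code, and not how CRIU actually works, the serialize/transfer/restore cycle looks like this:

```python
# Toy analogy only: checkpointing application state by hand.
# CRIU does the equivalent one level down, for a whole process tree,
# without any cooperation from the application itself.
import os
import pickle
import shutil
import tempfile

state = {"counter": 6, "keys": ["Brno", "Brno", "Brno", "virtual"]}

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "checkpoint.bin")       # "source system" disk
dst = os.path.join(workdir, "checkpoint-dest.bin")  # "destination system"

# 1. serialize the state and write it to disk on the source system
with open(src, "wb") as f:
    pickle.dump(state, f)

# 2. transfer the checkpoint (in a real migration: scp to another host)
shutil.copy(src, dst)

# 3. restore the state from the checkpoint on the destination
with open(dst, "rb") as f:
    restored = pickle.load(f)

print(restored == state)  # → True: the restored state matches the original
```

The real thing replaces pickle with CRIU's checkpoint images and works on live processes, but the shape of the workflow is the same.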
As already mentioned, this is all based on CRIU, Checkpoint/Restore in Userspace, and there are multiple integrations of CRIU into different container engines and container runtimes. I will later give an overview of where CRIU is available and for which runtimes and engines. Two-thirds of my demos use Podman; one demo uses CRI-O and Kubernetes to show how to migrate or checkpoint containers with the help of CRIU. Now let's talk a bit about possible use cases for container migration technology. How can container live migration be useful? The first use case I want to talk about, which is also the first thing I implemented in Podman and am currently working on for Kubernetes and CRI-O, is reboot and save state. I have a container which is running; the container is stateful, say a database with data loaded into memory, or an application which takes some time to start up, and I do not want to wait for the application to initialize again or to reload data from disk or network. So I want to save the state of the container before rebooting, for example because I want to install a new kernel for security updates, something like this. So let's go to the demo. This demo is based on Kubernetes, so let's have a look at my Kubernetes checkout directory. This is all based on a local checkout; some of the things I am showing now are already available in upstream pull requests, but not everything has made it upstream yet. First let's see what is running on my Kubernetes cluster right now: I say kubectl get pods, and no pods are currently running. I can also look at the CRI-O level and say crictl ps to get a listing of the running containers, and there are three infrastructure containers started by Kubernetes.
The first thing I want to do is start a Redis container and fill it with a few entries. I say kubectl create deployment, select an image, the Redis image, and call it my-redis. Let's see what get pods says: there is now a pod with the Redis container, but it is not ready yet. If I look at the CRI-O level I see there is also a Redis container running now. And in addition, so that I do not just checkpoint one pod with one container, I am creating another pod with two containers. The containers in this pod are two simple applications: one runs inside WildFly, a simple HTTP application which returns a number and increases it, so the next time you query it via HTTP the number has increased by one; the other one is the same thing, but written in Python. If I now look at crictl ps, I see that one of my containers is called counter, that is the Python container, then wildfly-hello, which is based on the WildFly hello-world application, and then there is also the Redis container running.
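These counter containers are trivially small. A minimal Python stand-in (hypothetical, not the actual demo image) shows the important property: the count lives only in process memory, which is exactly what a checkpoint has to capture:

```python
# Minimal stateful counter service: every GET returns the next number.
# The count exists only in this process's memory, so a plain restart
# loses it -- while a checkpoint/restore preserves it.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Counter(BaseHTTPRequestHandler):
    count = 0  # in-memory state shared by all requests

    def do_GET(self):
        Counter.count += 1
        body = str(Counter.count).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Return a server bound to `port` (0 picks a free port)."""
    return HTTPServer(("127.0.0.1", port), Counter)

# To run it standalone: serve(8080).serve_forever()
```

Querying it twice returns 1 and then 2, which is the behavior seen with curl in the demo.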
Now let's look at the Redis container first. I do get pods again so that I know the pod name, and then I check which keys are currently loaded into Redis. I prepared all these long commands; I just have to replace the pod name with the new one. It should report an empty array. Now let's put some keys into Redis: I say echo and then set, DevConf 2019 was in Brno, replacing the pod name again; in 2017 I was also in Brno, and in 2018; and 2021 was virtual. Let's write this to Redis. Now let's look again at the keys in the Redis database, and it says there are four keys in there. Let's look at the keys themselves: I use mget to retrieve them all, again replacing the pod name, and it says virtual and three times Brno. So this is one of the containers that is running. The others are the simple HTTP containers, so I just say curl with the IP address of the pod; hello-world is the Java WildFly container, and it just returns a number and increases it, so it is definitely stateful. The Python-based one is the same; it runs in the same pod, and it just returns one, two, three, four, five. So now I am in the situation that I have pods running on my system, but I want to reboot the system. I want to tell Kubernetes to stop all my pods, so I say kubectl drain on the local node and then, and this is the new part, I add the checkpoint option. Kubernetes should now try to checkpoint the pods if it can; it will shut down the kubelet, or rather put it into a mode where no new pods are started, and it will checkpoint the running containers and pods. So let's tell Kubernetes to checkpoint the pods. It should now write the checkpoints for the two pods, plus some metadata, to var/lib/kubelet/checkpoints, and there they are: the checkpoints
and the metadata alongside them. Now I can stop my Kubernetes installation, reboot the system to restart Kubernetes, and hope that it restores the container state just as it was before checkpointing: Redis should still have its data, and my stateful counter containers should also remember their previous state. This is a VM on my local machine, so let's hope it reboots quickly. It looks good, it is already back. Let's start screen again; first I have to start CRI-O, then Kubernetes, and then we wait until Kubernetes has started up, which should take only a few seconds. I also change to the Kubernetes directory here. Now the kubelet seems to be running; it needs to talk to CRI-O, and as soon as it does, it will hopefully restart my pods and containers. Let's see: there are my pods, and my containers are running. If I now say kubectl get pods, I should see the same pods, and they are running. Now let's check whether my Redis server still has the data: using mget I get the same result as before, so the entries are still there, and if I list the keys I also get the same keys; the same values are still in the database. If I connect to my Python-based container I get counter six, which is the state it had before, and if I connect to my WildFly-based one I think it should be a three, and yes, it is a three. So the containers and pods were restored from the checkpoint taken previously with drain and the checkpoint option. Let's power off this VM. So this was the first use case, reboot and save state. I have a small diagram which tries to visualize what happened: we have the host with all its blue memory, and a container shown in red with its own memory, and when we do a checkpoint, the container,
its memory and its state included, is written to disk; then the host is rebooted, so the host is gone, and then the host starts up again. I colored it green now to show that this time the application, or the operating system, is using different memory, but we still have the same red container running as before. That was the first demo. Another possibility for using container migration technology is quick startup, and I will show a short demo of this as well. I have a VM running RHEL 8, whereas the previous one was running Fedora 33, and this time I am using Podman. Let's see: podman ps shows no container running. Let's start a container; this is again my WildFly-based container, because it is a nice container for showing stateful behavior. The container is running, and I can connect to it: I say podman inspect -l with a format string to get the IP address, then give it the port and the path to my hello-world application, and it returns one, then two, and so on. Now, for quick startup, we want to create a copy of the running container. What we do is podman container checkpoint; we say -l for the last container Podman knows about, and -R to keep the container running, so it is not stopped after checkpointing, and then we say export, because we want to export the checkpoint to an archive, checkpoint.tar. Now the container has been written to disk but keeps on running, so we can still talk to it just like before. Next we do podman container restore; because the original container is still running, we need a new name for the copy, since restore would otherwise try to restore a container with the same name, so we say counter1, and then import with the checkpoint archive. After that we should have two containers running, and podman ps confirms it. I can now connect to the container counter1 and I get a three, and if
I connect to the original container, it returns a bigger result, six, and the counter keeps going. So both are running independently, each counting on its own. For this example I was talking about quick startup, and it is indeed true: initially the WildFly container takes about eight seconds to start up, and the copy restored from the checkpoint takes only four seconds, so it is 50 percent faster. I guess for a bigger container with more Java libraries the difference would be even larger. This way you can easily create copies from an already running container without having to wait for the container to initialize. I also have a small diagram here: we again have our host with the container running on it; we take a copy of the running container and write it to disk, and then we can create multiple copies of the container on the same host from that one checkpoint, and they will all continue to run from the point where the checkpoint was taken. Then there is container live migration; I will show that demo at the end of the presentation, but the diagram is here: I have my source system with the container running, I take a copy of the container, write it to disk, and then I can transfer the copy to the destination system and start it there once or multiple times. So I can migrate the container and make copies on the destination, but also on the source system the container can continue running, or I can stop it, whatever is best for my use case. Now let's talk a bit about CRIU, Checkpoint/Restore in Userspace. The first step when using CRIU to checkpoint a container is, of course, checkpointing, and to checkpoint, CRIU uses ptrace to pause the process, and as long as the process is paused, CRIU collects information
about the process and writes it to disk. It uses details and information from /proc about the process, and this is perhaps why it is called Checkpoint/Restore in Userspace: CRIU uses existing userspace interfaces to get information about a running process out of the system. Once CRIU has collected all the details it needs from /proc and other external interfaces, the next step of checkpointing happens, and this is the parasite code. The parasite code is maybe my favorite part of CRIU, and the craziest, because it is kind of unexpected that something works the way the parasite code does. The parasite code is injected into the running process, and the process which was previously stopped by CRIU now continues to run, but it is no longer running its original code; it is running the parasite code. So the process is under the control of the parasite code, which is a daemon waiting for commands from the main CRIU process. The reason the parasite code exists is that, through it, CRIU can retrieve information about the process from within the process's own address space: everything the process can access, CRIU can now access. Another reason is that when CRIU was initially developed there was no other fast way to get large amounts of memory out of a process and onto disk. There was ptrace, which enabled reading process memory, but it was slow, and if you want to checkpoint a process with multiple gigabytes of memory, this has to be fast. Using the parasite code, dumping the process's memory from within its own address space is pretty fast: if you do a container live migration, or a process migration, writing the memory of the process to disk will always be faster than transferring the checkpoint over the network. So it is fast
and it works from within the process's address space, so the parasite code really is essential to CRIU. Once the parasite code has done all the work it needs to do from within the process's address space, it is removed; CRIU calls this curing the process. The original code of the process is copied back in, and in the end the process never really knows it was under the control of CRIU and the parasite daemon; it just continues to run as it did before, without knowing it was checkpointed. Again, a diagram to make this a bit easier: this is the original code of the process we want to checkpoint. CRIU first removes some part of the code from the process and stores it somewhere, so it can later, during curing, reinsert the original code into the process. During the parasite code's lifetime, the parasite code is injected into the process and runs from within the process's address space, and once it has finished, the parasite code is removed and the original code is restored just as it was before. At this point, after the parasite code has run and after the information from /proc has been collected, the checkpointing has finished, and all relevant information about the process has been written to disk. CRIU can now either kill the process it just checkpointed, or the process can just as well continue to run; this really depends on what the user wants. My demos earlier showed both ways; for CRIU it is more or less the same. Another important thing for container live migration, especially for getting it to work correctly with Podman, was the SELinux integration of CRIU, and one of the interesting things about SELinux and CRIU is, again, the
parasite code, because when we are talking about a container, the parasite code is injected into a process inside the container, and now all of a sudden the container is talking over a socket to the main CRIU process, and this is something SELinux does not allow. So we had to work around it, or rather fix CRIU and Podman and all the layers, to make it work correctly. At the Linux Security Summit 2019 I gave a presentation about what was necessary to make this work. One of the big things was how to talk to the parasite daemon in the container from the outside, but there are also a lot of files CRIU needs to access, so we had to fix CRIU and Podman in different places. In the end we can now checkpoint, restore, and migrate containers with full SELinux confinement enabled; this works without any problems today. Once the checkpointing has finished and all the data has been written to disk, the second and last step of migrating a process is restoring it. CRIU reads in all the checkpoint images on disk and tries to recreate the processes based on the information in those images. CRIU does a clone for each PID and thread ID, and it restores the process tree just as it was before checkpointing. CRIU always operates on process trees: you tell CRIU to checkpoint this PID, this process, and CRIU will checkpoint the process you told it about and all child processes of that target process, so it always works on a process tree. During restore, CRIU will always restore the processes with the same PIDs they had before checkpointing. Until LPC 2019, CRIU was using clone for each PID or thread ID to create a new process, and creating a process with a certain PID was not easily possible in Linux; there was no syscall directly supporting it. So CRIU had to do something we called the PID dance:
you had to do multiple system calls: write the desired PID minus one to /proc/sys/kernel/ns_last_pid, close the file, then clone, and hope that no other process was created between the write and the clone. CRIU checks this by calling getpid on the created process, and if the process has the correct PID, it continues. This is slow, because it takes multiple system calls, and it is a race condition, because some other process can be created in between, and then the checkpoint restore just aborts and it is over. With clone3 we have one single system call: we can tell the kernel we want this certain PID, please create a process with this PID, and if it works we get a new process with the correct PID, and if it does not work we get an error immediately, saying the PID cannot be used because it is already in use or for some other reason. So once the process tree is created using clone or clone3, depending on your kernel, CRIU morphs all those processes into the to-be-restored processes. File descriptors are a good way to explain this a bit. During checkpointing, CRIU records the ID of each file descriptor, the file it points to, and the position the file descriptor has in that file, and stores all of this in the checkpoint image. During restore, all this information is read back, and CRIU reopens the file with the same file descriptor ID it had before, and it also does a seek to the correct position, so once the process continues to run it finds the file descriptor pointing to the correct file at the correct location. The same happens for the other resources CRIU restores and can handle during checkpoint and restore. So file descriptors and memory pages are mapped back to the correct locations, and then security settings are loaded; we load security settings like AppArmor, SELinux, and seccomp as late as possible, because some of
those security settings would make it difficult or impossible to restore the process, so we do as much restoring as possible before applying the security settings, and apply them as late as possible. Once everything is in place, all the resources, we tell the restored process to jump to the original location where it was checkpointed, and the process just continues to run as it did before checkpointing, on another system or on the same one. So that was the introduction; now let's talk a bit about container live migration integrations. Container live migration is implemented using CRIU in different container engines, runtimes, and orchestrators. The first I have to mention is OpenVZ: they invented CRIU to provide the possibility to migrate the containers they offer, so that is the integration CRIU originally comes from. Then there is an integration of CRIU in Borg, Google's container orchestrator, used to live-migrate containers in production from one host to another: if resources are getting low, a container can be migrated to another system instead of being stopped and restarted, possibly losing hours of work. Then there is an integration of CRIU in LXD, where you can migrate stateful containers from one host to another; then there is CRIU integration in Docker; and then there is CRIU integration in Podman, which is what I have worked on for the last three years, and some of the demos use the implementation I did in Podman: the earlier one for fast startup and the later one for container migration. And what is really new is the container live migration integration in CRI-O: there is a pull request open from me, which I have been working on since September 2020, trying to get CRIU-based container checkpointing and migration working in Kubernetes. The first demo I showed
is based on this pull request, and in addition to the changes to CRI-O there are a couple of issues and pull requests open for Kubernetes. The interesting thing is that the first issue, the first link I have here, is from 2015, so people have been asking for container live migration in Kubernetes since 2015. I am hoping that, now that it works locally for drain and reboot, I can move this issue and my pull requests forward. For the Podman implementation, the first thing I did was checkpoint/restore, which is what we saw with Kubernetes: I checkpoint my container, I can reboot, and I can later restore the container. That was the initial implementation for Podman. The first discussion was actually at DevConf 2018, where I talked to some people about whether this would be interesting for Podman; the first code was available around May 2018, and in October 2018 it was finally merged. After that I continued to work on container live migration, which was made available in June 2019. The interesting, or special, thing about Podman's implementation of container migration is that the checkpoint archive Podman creates does not only include the checkpoint image written by CRIU; it includes all file system changes as well. So if your container has written to the file system, the Podman checkpoint archive will include those changes, and you can easily move it to any other machine without having to think about how to make sure the file system is in the same state as on the source machine. This makes container migration really easy. Now I want to show my last demo; on the slides I have just put the commands I am running, in case the demo does not work. We previously created a checkpoint for the fast startup demo, and now I can simply
use that archive, which is /tmp/checkpoint.tar, and copy it to another machine; this is again a machine running RHEL 8 and Podman. So let's go there and see whether any containers are running here: no, good. Now let's say podman container restore, with import and the checkpoint archive, and this is the last step of the container migration: I took a container checkpoint on my source system, it was written to disk, I transferred it to the destination system using scp, and now I have a container running here with the same name. I can also query it: I say podman inspect -l to get the IP address, and it continues to run from the same point in time at which it was checkpointed, maybe 30 minutes ago when I did the first checkpoint. So this is container migration and how it can be used with Podman, and now back to the slides; this is just what would be necessary to do it on your own. On to the future of checkpoint/restore, in containers and outside of them. As I mentioned, I am currently working on getting CRIU into Kubernetes. The problem is that there are so many use cases for CRIU in Kubernetes: I may want to live-migrate a container which is currently running on one node to another node, maybe based on a policy decision or a scheduling decision, so it touches a lot of things in addition to the actual checkpoint code. My approach right now is to get the basics of checkpoint/restore into Kubernetes implemented via the drain use case, because that is not a use case triggered by some policy or scheduling decision; it is triggered by the user, who wants to shut down the node, so it is really straightforward and not complicated. The user requests a drain with checkpointing, the checkpoints are written, and upon reboot the checkpoints are read from disk and restored. So my current goal is to get the drain
implementation merged into Kubernetes, and once that is done I can start to think about other CRIU use cases in Kubernetes. The other thing which might be interesting is non-root checkpoint/restore support for CRIU. Last year we worked on getting a new capability into the kernel, CAP_CHECKPOINT_RESTORE, and if a binary has this capability, so if CRIU has CAP_CHECKPOINT_RESTORE, it can checkpoint and restore a process without having to run as root. There are a few use cases where we heard upstream that people are interested in this; HPC is one of the topics that comes back regularly at CRIU. So we are working on it, it is on the way, and it should be available in some weeks or months. And with this I am at the end of my talk. In the summary I want to highlight that CRIU can checkpoint and restore containers, as we have seen a couple of times; it is integrated into different container engines, runtimes, and orchestrators; it is used in production; and among the use cases, you can reboot into a new kernel without losing container state, you can start multiple copies of a container, and you can migrate running containers from one system to another. Here are a couple of links about the things I talked about today, and with this I am at the end, and I want to thank you for your time.