Welcome everyone to my session, Kubernetes and Checkpoint/Restore. My name is Adrian Reber. I have worked for Red Hat since 2015, and I have been involved in process migration, which is the basis of checkpoint/restore, for at least 10 or 11 years now. I have been involved in CRIU in some way since 2012; CRIU is the tool we use today to checkpoint and restore our processes and containers. And I have been focusing on container migration since around 2015.

The agenda for today: first I want to give an introduction with a few definitions I am using, then I want to show a few use cases where checkpoint/restore and container migration might be useful, and then I want to give some details about how CRIU enables us to checkpoint and restore processes and containers.

First, the definition of container live migration, because the work to get checkpoint/restore working with containers always also brings the possibility to migrate containers from one system to another. The idea behind container live migration is to transfer a running container from one system to another system. You could also call it stateful migration, because the state of the application in the container is not lost but continues to be there. From a very high-level view, container migration means we serialize the container on the source system, write it to disk, and transfer the image we have written to a destination system. On the destination system we restore the processes of the container from the image, and so we have container migration.

As mentioned, everything I am talking about today is based on CRIU: Checkpoint/Restore In Userspace. The reason for the name is that, at the time CRIU was first developed, there were different approaches to checkpointing and restoring in Linux. There were kernel-only approaches, there were mixed kernel and user-space approaches, and there were approaches that intercepted system calls. CRIU took another approach and did it completely from user space, using existing interfaces to collect as much information about a running process as possible.

There are multiple integrations of CRIU in container engines and container runtimes, and I want to present some of them here. The one I have to present first is the OpenVZ integration, because they invented CRIU. They had a mechanism to migrate containers before CRIU, but they were not able to upstream it to the Linux kernel, so they worked on something that could be upstreamed and that everyone can use; that became CRIU, and they are now using it to migrate containers. Then there is Borg, Google's cluster manager. It also uses CRIU to migrate containers, especially for long-running jobs which take a couple of hours: if one of the nodes is out of resources, or is soon going to be, Borg can move a container from one system to another without stopping it and without losing the hours of work that may already have gone into the currently running job. There is an integration of CRIU in LXC and LXD, so LXD can migrate containers from one host to another using CRIU; this has also existed for some time. Then there is an integration in Docker: there is a checkpoint command and a restore command which you can use to checkpoint a container, transfer it to another system, and restore it there. And the same exists for Podman. That is what I have been working on for the last three or four years now, integrating CRIU into Podman.
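As a rough sketch of what these integrations look like on the command line: Docker's checkpoint support (an experimental feature) and Podman's checkpoint/restore commands work roughly like this; the container name counter is just a placeholder:

```
# Docker: checkpointing is an experimental feature
docker checkpoint create counter cp1       # checkpoint the container 'counter'
docker start --checkpoint cp1 counter      # start it again from that checkpoint

# Podman: checkpoint and restore as first-class commands
podman container checkpoint counter
podman container restore counter
```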
The work I have done for Podman I have also brought to CRI-O, and I have been working on this now for about a year. The problem is that the pull request cannot be merged in CRI-O, because it depends on changes to the CRI API, and this requires changes to Kubernetes. So it is a difficult situation: you cannot implement one without the other, so you have to start at some point, and then you can implement the other part, but the one thing does not work without the other. It is complicated, and we are discussing how to best integrate it into Kubernetes. Interestingly enough, an issue about container migration has been open in Kubernetes since 2015. There is a lot of discussion going on there, but until now nothing has been merged into Kubernetes to implement migration, or to at least get a step closer to implementing it.

So now to a couple of use cases. The first use case is reboot and save state. I have tried to visualize it here: we have a container running on a host, and the host's memory is blue while the container's memory is red, to show the difference. We want to reboot the host because we have to upgrade the kernel, but we do not want to lose the state of the container, because it takes forever to start. So what do we do? With checkpoint/restore, we checkpoint the container and write it to the local disk, then we reboot the host, so the original host with its memory is gone. Once the host has rebooted into the new kernel, it has a different color, but we have the same container running as before; it is still red.

I have prepared a demo to show this. Let's connect to a RHEL 8 system using Podman, and let me start a container with podman run. This is a WildFly-based container, a Java application server, and it has a really simple stateful application in it which just returns a number, and once the number has been returned it increases it, so the next time I ask I get a number increased by one. So it is stateful. Now I can talk to the container and ask for the value, and it gives me 0, 1, and 2. Now I want to create a checkpoint of the container, so I say podman container checkpoint, checkpoint the last container, and export it to a file. Podman will write the checkpoint archive to disk, and then I can reboot the host and later restore the container from that checkpoint file. The checkpoint archive does not only include the checkpoint of the processes in the container; it also includes the changes to the file system, because when WildFly starts it creates log files and a lot of files to manage its state, and all these changes to the file system are also part of the checkpoint archive. So now let's reconnect to the VM. We can see there is no container running, so let's restore the container: podman container restore, and I say import and give it the tar file from before. Now if we connect to the container, it should give us back a 3, instead of the 0 it would return when starting from scratch. So let's talk to the container, and we get a 3 and a 4. So we did a reboot and we were able to save the state of the container. Back to the slides.
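In shell terms, the demo looks roughly like this; the image name, address, and port are placeholders for whatever stateful container you use, and I am assuming a Podman with checkpoint support and CRIU installed:

```
# start a stateful container (image name is a placeholder)
sudo podman run -d --name counter quay.io/example/counter
curl http://<container-ip>:8080     # returns 0, then 1, 2, ...

# checkpoint the most recent container and export it as an archive
sudo podman container checkpoint --latest --export=/tmp/checkpoint.tar.gz

sudo reboot

# after booting the new kernel, restore from the exported archive
sudo podman container restore --import=/tmp/checkpoint.tar.gz
curl http://<container-ip>:8080     # continues with 3, 4, ...
```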
The next use case is what I call quick startup. Again we have a container on a host, and again the container takes a long time to initialize, and now we want to start the container faster. So we take a copy of the container: we take our checkpoint of it, but we leave the original container running, and then we can create multiple copies of the container on the host.

So let's go back to our demo system and create another checkpoint: podman container checkpoint, checkpoint the last container, and now I say keep it running, don't stop the container after the checkpointing, and I again export it to a file, overwriting the existing file because we don't need it anymore. So again the same thing: it creates a checkpoint, and the file system differences and everything are put into an archive. Now I can restore it. The original container is still running, I can still talk to it, and now I can restore a copy of the container. So I say podman container restore with import again, and it fails, because Podman sees that the ID and the name are already in use, and it tells me it cannot restore because the container already exists. So I tell Podman to use a different name, let's say counter1. Now it restores a copy of the container as counter1, and another copy as counter2, and then we should be able to talk to all the containers independently. Now I can say podman inspect counter1 to get the IP address of this one, and it says again 6, 7; this is the point where I checkpointed it. And now I can also inspect counter2, and again 6, 7, 8. So we created multiple copies of an existing container, and even in this small demo case I can already see improvements in the startup time: usually my WildFly container takes about 8 seconds until it can answer the first request, and it takes about 4 seconds to restore from a checkpoint, so I already have about a 50% improvement in startup time when restoring from a checkpoint.

The next use case is container live migration. This is basically a combination of the two things I have shown before: again I have a container running on a source system, I take a copy of the container and write it to disk, and then I transfer it to a destination system, and there I can create multiple copies if I want to. So let's go back: I already have the checkpoint, /tmp/dump.tar, so let's transfer it to another system. Now we are in the middle of the container migration. Now I am on the other host, and there is hopefully no container running here. I say podman container restore with import again, giving it the file I just transferred to this host. It restores the container, and now I should be able to talk to it, and I should probably get a 6 again, right? 6, 7, 8. So now we have migrated the container from one host to another host using Podman's feature to export checkpoints.
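Put together as commands, creating copies and migrating to another host looks roughly like this; host and file names are again placeholders:

```
# checkpoint, but keep the original container running
sudo podman container checkpoint --latest --leave-running --export=/tmp/dump.tar

# create additional copies on the same host under new names
sudo podman container restore --import=/tmp/dump.tar --name counter1
sudo podman container restore --import=/tmp/dump.tar --name counter2

# or migrate: transfer the archive and restore on the destination host
scp /tmp/dump.tar destination-host:/tmp/
ssh destination-host sudo podman container restore --import=/tmp/dump.tar
```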
I have one more use case here: forensic container checkpointing. This is also the use case we are currently working on in our Kubernetes enhancement proposal and the corresponding pull requests. The idea is that we want to introduce checkpoint/restore into Kubernetes, but, as I said, it is complicated because of the CRI API, so we are trying to implement a really simple use case of checkpoint/restore which requires minimal changes to Kubernetes; most of the changes happen in the container engine. What we want to do is checkpoint a container where we think something is not right, something seems strange, but we do not want to analyze the container in place, in case there is an attacker and the attacker has put something into the container to detect analysis. So we take a checkpoint of the container out of Kubernetes, and then we can transfer the container into a sandbox environment where we can start it without Kubernetes and without influencing the original container.

I also have a demo for this. This is now a Fedora system, no longer the system from before; I have prepared it here in a screen session. There is a Kubernetes running in combination with CRI-O. Let's see: crictl ps. Okay, there is no container running; this is just the DNS from Kubernetes. So let's start a pod with two containers. The pod with two containers is created, and we can see the first container is already starting up, the same WildFly-based container as before. There is now also a second container, called counter; it does the same thing, but this time it is a Python application, and it is the Python application we want to checkpoint. Let's first talk to the container and see what it returns: it returns 0, 1, 2, so it is the same as before, just implemented in Python. Now we want to checkpoint the container without stopping it. We created a kubelet interface which lets us checkpoint the container: it is called checkpoint, then we give it the default namespace, the pod name counters, and the container name counter. So we tell it: checkpoint it. The kubelet talks to CRI-O, CRI-O talks to runc, runc talks to CRIU, and the container has been checkpointed. If we access the container again, we see it is still running, so the container has not been changed.

Now let's stop the kubelet, which is running on this host, and remove the pod from this host. Okay, no more containers running here. Now let's create a new pod in CRI-O directly, and in this pod we can then restore the container. I say crictl to create a pod, and if I list the pods, I see there is now one pod running. Now let's restore the container: the checkpoint has been written to disk here, and I am telling CRI-O to restore it in the pod I just created. The container has been restored, and now I can access it again under a new ID, and it continues to run. You can see with crictl that the container is running on this host, but there is no kubelet running anymore; the kubelet has been stopped. So I took a container out of Kubernetes while it kept running in Kubernetes, and then I restored it in CRI-O directly. I used the same CRI-O instance here, but the idea is to take the checkpoint, for example, to another CRI-O instance and restore the container there, and this way we can analyze the container without changing
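The kubelet side of this demo is an HTTP endpoint. As a sketch, assuming a kubelet with the forensic-checkpointing feature enabled and using the pod and container names from the demo, triggering a checkpoint looks roughly like this; how you authenticate against the kubelet depends on your cluster:

```
# ask the kubelet to checkpoint container 'counter' in pod 'counters'
# in the 'default' namespace
curl -sk -X POST \
  "https://localhost:10250/checkpoint/default/counters/counter"

# the resulting archive ends up below the kubelet's checkpoint directory,
# e.g. /var/lib/kubelet/checkpoints/
ls /var/lib/kubelet/checkpoints/
```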
the original container.

So now I want to talk a bit about how CRIU works and how it enables us to checkpoint and restore processes and migrate them. The first step is, of course, checkpointing. What CRIU does is use ptrace to pause all the processes. CRIU always operates on a process tree: we point CRIU to a process, and CRIU will checkpoint this process and all its child processes. It collects all the information about the processes and writes it to disk; the processes are paused during that time and can continue to run after the checkpointing.

The first thing CRIU does, and where it got its name from, is collect information from user space using what is available in /proc. It goes through the information there; the initial idea was that CRIU uses as many existing interfaces as possible. Over time, CRIU added additional interfaces to the Linux kernel, but those interfaces are never checkpoint/restore-only; they can be used for other things as well, so there is not really any checkpoint/restore-specific change in the Linux kernel today.

So CRIU collects all the information from /proc, and once it has everything it can collect from the outside of the process, it goes into the process to collect information from within the process's address space. To do this, CRIU uses something called parasite code. The parasite code is injected into the process: it replaces some of the original code, the process is started again, and the parasite code then basically runs as a daemon inside of the process. Now the main CRIU binary can talk to this daemon running inside the process and collect information from within the address space of the process; this is how we can see inside the process. One important thing CRIU does using the parasite code is, for example, to dump all the memory of the process from within the process's own address space, and this way we can dump the memory really fast. If you look at migration times, most of the time is spent transferring the data over the network from one host to another; dumping the memory of the process to disk is fast, and reading it back is also fast, compared to the time it takes to transfer the process from one system to another over the network.

Once the parasite code has collected all the information and written it to disk, the parasite code is removed; CRIU calls this curing the process. The process will never know that it was running under the control of CRIU or the parasite code; it will just continue to run. I have tried to visualize the parasite code here: we have the original code of the process to be checkpointed, then we remove one part of the code and save it outside of the process, and we replace that part with the parasite code. The parasite code starts running in the process, we collect all the information we need from within the process's address space, and once we have collected everything, we replace the parasite code with the original code, and checkpointing is finished. At this point, after all relevant information has been written to disk, the target process can be killed, or it can continue to run; in the demos I showed, we have seen both ways, so this really depends on your use case.
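To get a feeling for the kind of per-process state CRIU reads from /proc, you can look at a process yourself; the PID and descriptor number here are just placeholders for whatever process you pick:

```
PID=1234                     # any running process you own
cat /proc/$PID/maps          # memory mappings and their permissions
ls -l /proc/$PID/fd          # open file descriptors and what they point to
cat /proc/$PID/fdinfo/3      # per-descriptor state, e.g. 'pos:' (file offset)
cat /proc/$PID/status        # credentials, signal masks, and more
```

The 'pos:' field in fdinfo, for example, is exactly the kind of information needed to put a file descriptor back at the right offset during restore, which comes up again in a moment.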
After checkpointing, the second step is restoring the process. This happens by reading all the checkpoint images from disk, and then we recreate the process tree: for each PID and TID, just as it was before checkpointing, we do a clone(), or clone3() depending on your kernel, for each process, and recreate the process tree just as it was before checkpointing. Once we have the process tree created, we morph all the processes into the original processes, like they were during checkpointing.

One good example, I think, is always the file descriptors. During checkpointing we look at the file descriptors: which file descriptor points to which file, and at what position in the file we are, and we write this information to disk. During restore we reopen all opened files with the same file descriptor numbers, and then we position each file descriptor at the correct offset. Once the process continues to run after CRIU has done all its work, each file descriptor is at exactly the same position, with the same number and the same file, just as it was before, so the code which accesses the file descriptor gets the same file, the same content, the same position. CRIU also maps all the memory pages back to their original locations, and it loads all the security settings like AppArmor, SELinux, or seccomp. We apply the security settings as late as possible to make restoring easier, because if we loaded them early, they might block something we have to do during restore. Then CRIU jumps into the restored process, and the process continues to run from the point where it was checkpointed.

With this, I am at the end of my CRIU part. The next few slides are just the details of what I did in my demos, in case you want to try it out for yourself; the steps are listed here in the slides: first the checkpointing, then the restoring, then restoring with a new name and creating copies.

And with this, I am at the summary of my talk. CRIU can be used to checkpoint and restore containers; it is integrated in different container engines already and is used in production. The use cases I presented: you can reboot into a new kernel without losing the container state; you can create multiple copies of your container for fast startup, if you have a container which is slow to start or if you want to react quickly to requests; and you can migrate running containers from one host to another using checkpoint/restore and CRIU. The thing we are currently working on for Kubernetes, Kubernetes enhancement proposal 2008, is forensic container checkpointing, with which we hope to enable the first steps of checkpointing containers in Kubernetes. With this, I am at the end of my talk. Thank you for your time, and I am happy to answer any questions you might have.