Good afternoon. Welcome to our session: Enabling coordinated checkpointing for distributed HPC applications. My name is Adrian Reber. I've worked at Red Hat for nine years now, and I'm involved in checkpoint/restore, which is the basis for everything we're doing here today. I've been doing checkpoint/restore for 14 years now. Everything we do is based on CRIU, where I've been involved for, I guess, 12 years. Since I've been at Red Hat, I'm mainly focusing on container migration. Here with me today is Radostin. — Hi, I'm Radostin, a PhD student at the University of Oxford, supervised by Professor Wes Armour. My research focuses on enabling checkpoint/restore for HPC clusters.

Okay, today's agenda looks something like this: I will go through the background of checkpoint/restore, a bit of the history, and why the names are the way they are. Then I'll talk about the integration into existing container runtimes, engines, and orchestration; then about use cases, why you would want to use checkpoint/restore in combination with containers. Then I will do a migration demo. This is not coordinated yet; it's just a simple pod migrating from one host to another host, with a distance of about 500 kilometers in between. And then Radostin will talk about coordinated checkpointing in an HPC environment, why it's necessary, and what the challenges are.

So, background. The tool we are using here is called CRIU: Checkpoint/Restore In Userspace. The reason for the name, especially the "in userspace" part, is that it's a tool and there were many different tools before it. Checkpoint/restore has been a technology in operating systems, and in Linux, for a long time. In the early years, 20 years ago, it was mainly done in high-performance computing, and the main use case was fault tolerance: if you have 1,000 nodes running and something stops working, you don't want to restart from the beginning.
So you checkpoint/restore your compute jobs regularly. At that time, around 20 years ago, two tools were developed. One was an external kernel module, which was never upstream and which you had to compile separately to be able to use it. The other required you to preload libraries. So neither solution was what you would call a transparent checkpoint, because you had to prepare your application, either by recompiling it or preparing it in some other way.

To solve this problem, there was an in-kernel approach to checkpoint/restore. It was a huge patch set, around 2006 or 2008, I don't know, some time ago, and it was a working solution for checkpoint/restore in Linux. But the problem was that the Linux kernel community was not convinced by the approach, so the approach kind of died. People then looked for another way to do checkpoint/restore in Linux, and the next one is now CRIU, Checkpoint/Restore In Userspace — mainly because the one before was in-kernel, so now we're in userspace. "In userspace" means there is no code specific to checkpoint/restore in the Linux kernel. It also means that CRIU was designed to use existing interfaces as much as possible. Over the last 10 years, CRIU has introduced additional interfaces, but none of the interfaces introduced into the Linux kernel are specific to checkpoint/restore. They're usually of the kind "how can I get more information about my running process?", so they are often also used for debugging and similar things.

So much for the background on checkpoint/restore and CRIU; now to the integrations. There are multiple integrations of CRIU in different container runtimes, engines, and orchestrators. The first one I usually want to mention is OpenVZ. I never used OpenVZ myself, but they were basically doing containers before the word "container" existed, and what they were doing you would probably call a system container today.
So not like an OCI container where you have just one application running inside; you had a whole operating system running in there. They wanted to give their customers, their users, a way to live migrate those system containers from one host to another, so they came up with CRIU. And the good thing to remember about CRIU here is that it was designed for containers. All the other checkpoint/restore implementations that existed were for high-performance computing — some of them are still used in HPC today — but CRIU was designed for containers, and that's why it works as well as it does today.

Another interesting integration of CRIU is Borg, Google's container orchestration engine. Also something I never saw myself, but Google came to us, the upstream CRIU developers, maybe seven years ago, and they talked at a couple of conferences about how they use it internally. Before using CRIU in their container engine, if a low-priority task was running on a node and resources were getting low, they just killed it and restarted it somewhere else, which meant redoing all the work. Today, from what we're told, they use CRIU to live migrate those low-priority containers from one node to another without losing any of the work. An example they like to give is video re-encoding in the background: if they need additional resolutions of a video, they do it in containers, and those are the containers they actually migrate in production all the time today, from what we're told.

Then there's the LXC/LXD/Incus ecosystem, which has also had support for checkpoint/restore for a long time. Docker introduced it many years ago. And I've been working since 2018 on the Podman integration of checkpoint/restore. One of the reasons I started looking at Podman is that CRI-O shares some of its code with Podman.
So I could experiment with Podman, get checkpoint/restore working there, and then bring it to CRI-O. And the reason to bring it to CRI-O was to bring it to Kubernetes. In Kubernetes, checkpointing is currently available under the label "Forensic Container Checkpointing". It took us a couple of years to get it into Kubernetes: it moved to alpha with 1.25, a couple of years ago — two years, I think, maybe three. And just a couple of weeks ago we were able to move it to beta with 1.30, which means that from now on checkpoint support is enabled by default.

I say only checkpointing is available, because the forensic container checkpointing story is basically: you have a container running, something unexpected happens, and you don't want to stop it immediately, but you want to take a copy. Forensic container checkpointing gives you the opportunity to take a stateful copy of your running container. The container will never know that it was checkpointed, and later you can analyze the container somewhere else or start it again in a sandbox environment.

The restore part in Kubernetes is currently a bit of a trick: we trick Kubernetes into restoring a container. We hook into CRI-O in the create and start calls, detect that the image is a checkpoint, and if it is, we do a restore of the container. So from Kubernetes' point of view it's the start of a container, but it is actually a restore. This is something I will show later in my demo.

And now a couple of use cases, just quickly going over why you would want to use checkpoint/restore in combination with Kubernetes, or containers in general. The first one is quick startup, kind of: you have a container that takes a long time to start, and you need to update your host because you need a new kernel.
So you can take a checkpoint, reboot the host, and the container will be restored much faster than it takes to initialize all the libraries. The multiple-copies use case is similar to that; this is also something people tell us they use in production. You have an application you want to offer to customers, and it takes a long time to start, maybe 10 minutes, because it loads thousands of libraries. So they take a checkpoint after the initialization, and whenever a customer requests the application, they restore from the checkpoint, and the customer only has to wait a couple of seconds instead of 10 to 15 minutes.

Container migration is one of the most obvious use cases from my point of view: you have a container on one host and you want to move it to another host without losing its state. Then yesterday we had a talk about checkpoint/restore and spot instances. Spot instances can go away at any time, so if you get a signal that the instance is going away, you take a checkpoint and continue to run the container on some other host.

All the examples so far have been for stateful containers. If you have a stateless container, none of this is really helpful, because there is no state you want to save. The forensic analysis use case is useful for all containers, though, because even if a container is stateless, you may want to analyze it without stopping it. And something people have talked to us about a lot in the last year is AI training: you have a container doing AI model training with a GPU, and you want to checkpoint it and run it somewhere else without losing all the work you've done so far.

So let's go to my migration demo. I have — you can already see it here — a really simple pod specification; the pod runs two containers, and this VM is about 500 kilometers away. It's not running yet, so let's start it. This is of course a stateful application. It's a really simple application.
It just has a state, nothing more, and I can talk to it using curl. So now I have a counter: it says one... zero, one, two — no, two. And now three. Now I can create a checkpoint. Checkpointing is currently only available as a kubelet-only API, and this command does everything I need, but it's basically just calling curl on one of the kubelet's APIs. Now the checkpoint is created and I have the name of the file in a variable; there it is.

We had a Google Summer of Code project with two great students who helped us write a tool to do the forensic analysis easily, called checkpointctl. One example of what checkpointctl can do is the inspect command. What you see here is basically the image the container is based on, the runtime, the dates when it was started and when it was checkpointed, the engine, the IP address, the size, and so on.

To migrate this container to another host I now have to convert this checkpoint into an OCI image. Let me do that at the top; I'm using Buildah for that. The first step is to add the tar file to the new container image. Then — no — then we add an annotation to it so that it can easily be detected as a checkpoint. Then we do a commit, and then I delete the intermediate image. Now I have to push the image to a container registry. Let's push it; I'm going to call it 51.

And now on another VM — this is now local on my machine here. Previously I had a pod with two containers, and now I'm migrating one of those containers into a pod with a single container. Again, really simple: I'm just replacing the 50 here with the 51, and now I say apply. What will happen is that CRI-O in the background will download the container image, see that it's a checkpoint image, and then do a restore. Kubernetes will believe it's a new container, but it's the old container, as we will hopefully see soon.
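The demo steps can be condensed into one script. This is a sketch under assumptions: the node address, pod/container names, and registry are placeholders; the kubelet authentication flags are omitted; and the annotation key is the one CRI-O is assumed to look for when detecting a checkpoint image.

```shell
#!/bin/bash
# Sketch of the migration demo flow; names, addresses, and tags are placeholders.
checkpoint_and_convert() {
    local node=localhost pod=counters ctr=counter ns=default

    # 1. Ask the kubelet (port 10250) to checkpoint the container.
    #    The response names the checkpoint tar archive on the node.
    ckpt=$(curl -sk -X POST \
        "https://${node}:10250/checkpoint/${ns}/${pod}/${ctr}" \
        | jq -r '.items[0]')

    # 2. Optionally inspect the archive for forensic analysis.
    checkpointctl inspect "$ckpt"

    # 3. Wrap the archive in an OCI image and mark it as a checkpoint.
    img=$(buildah from scratch)
    buildah add "$img" "$ckpt" /
    buildah config \
        --annotation=io.kubernetes.cri-o.annotations.checkpoint.name="$ctr" \
        "$img"
    buildah commit "$img" checkpoint-image:51
    buildah rm "$img"

    # 4. Push the image; a pod spec referencing it triggers the restore.
    buildah push checkpoint-image:51 quay.io/example/checkpoint-image:51
}
```

Applying a pod spec that references the pushed image on the destination cluster then makes CRI-O detect the annotation and restore instead of start.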
And now I will access the container, and it should say — what was the last count? I don't see it anymore; I think it should say four or five. Let's see: five... four. So the container has been migrated from one host to another host, over a large distance, maybe. And with this I'm at the end of the introduction part about CRI-O and checkpoint/restore. Over to you, Radostin.

— So, HPC applications are driven by scale and performance, and in order to process large amounts of data they need to be distributed, that is, running on multiple different servers. The HPC community has been increasingly adopting Kubernetes, mainly because of the features and benefits it provides. However, the current checkpointing implementation is not able to checkpoint multiple containers at the same time, and the fundamental reason is that CRIU was not designed to checkpoint distributed applications; it was designed to checkpoint a single process tree. So the question is: how can we extend CRIU to support distributed applications, and how can we enable this functionality in Kubernetes?

There is a large amount of research on distributed checkpointing and on checkpointing and rollback-recovery protocols. The fundamental concepts were created in the 1980s and 1990s, when the resilience of computer systems was fairly low. The most recent work that enables distributed checkpointing is, for example, DMTCP. DMTCP is a tool similar to CRIU: it provides system-level checkpoint/restore and works in user space, but it uses the LD_PRELOAD mechanism, which means it's not fully transparent — to be able to checkpoint and restore an application, you need to start it with the DMTCP launcher. DMTCP is also similar to CRIU in the sense that it has been integrated with container platforms such as Apptainer.
However, the main point is that DMTCP provides a coordination mechanism that synchronizes checkpointing between different instances, and this mechanism is also used during restore. Another example is Apache Flink, which provides a very advanced fault-tolerance mechanism to provide uninterrupted execution for data analytics, data stream processing, and batch jobs. However, Apache Flink is a framework; it cannot be used with existing HPC applications.

To be able to create a consistent global checkpoint, we first need to understand what it means for a checkpoint to be consistent. This diagram shows different containers running along the time axis, and the red circles are checkpoints. The checkpoint wave is formed when we create checkpoints for all containers, and the arrows are messages sent between containers. For a checkpoint to be consistent, all arrows should start on the left side and finish on the right side of the checkpoint wave. Messages that start on the right side and finish on the left are called orphan messages: the checkpoint has recorded the receive event of the message but not the send event. That is the definition of an inconsistent checkpoint.

Most distributed checkpointing implementations use a mechanism called barriers: points in the checkpointing process where we have to synchronize the checkpoint across different instances. CRIU doesn't provide this type of barrier, but it has functionality called action scripts: hooks, or places during the checkpointing process and during the restore process, where CRIU can execute an external utility and wait until that external utility completes.
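The consistency condition above can be written as a tiny predicate over logical timestamps. The timestamps are purely illustrative — CRIU has no notion of them — but they make the orphan-message rule concrete: a message is an orphan exactly when it was sent after the sender's checkpoint yet received before the receiver's checkpoint.

```shell
# is_orphan SEND_T RECV_T SENDER_CKPT_T RECEIVER_CKPT_T
# A message is an orphan (making the global checkpoint inconsistent) when
# its send event is NOT in the checkpoint (sent after the sender's checkpoint)
# but its receive event IS (received before the receiver's checkpoint).
is_orphan() {
    [ "$1" -gt "$3" ] && [ "$2" -lt "$4" ]
}

# Example: sender checkpointed at t=4, receiver at t=8.
is_orphan 5 6 4 8 && echo "orphan"      # sent after ckpt, received before it
is_orphan 3 6 4 8 || echo "consistent"  # send event is in the checkpoint
```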
To enable this synchronization, we developed a tool called CRIU Coordinator, which uses the action-script functionality to synchronize different CRIU instances. The coordinator has a client side and a server side. The client side registers the different instances with the server and specifies the dependencies — which CRIU instance depends on which others — and this is used to pause the checkpoint and continue only when all dependencies have been satisfied. The restore process works in a similar way: during restore, CRIU starts, contacts the coordinator, and waits for all other dependencies to reach the same point. When the dependencies have been satisfied, it continues to the next step. This allows us to provide distributed checkpointing without needing to modify CRIU or Kubernetes; we can reuse the existing projects as they are.

Here is a simple demo showing how the coordination mechanism can be used. We first start two counter applications that simply output a number every second. There is a configuration file for the coordinator tool defining the images directory where the checkpoint will be created. Then we start the server on the right side and begin the checkpoint on the left side: CRIU will start checkpointing, but it will stop until the dependencies for the checkpoint are satisfied. When the second CRIU instance starts, it connects to the server, and this triggers the synchronization mechanism to continue the checkpoint, so both checkpoints complete at the same time.
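A minimal version of such a barrier can be sketched as a CRIU action script. CRIU exports the current phase name in the CRTOOLS_SCRIPT_ACTION environment variable when it invokes the script; everything else here — the rendezvous directory, the instance count, and blocking at the pre-dump phase — is an illustrative assumption, not how the actual CRIU Coordinator client is implemented.

```shell
#!/bin/sh
# barrier.sh -- a minimal coordination hook, used as:
#   criu dump --action-script ./barrier.sh ...
# CRIU runs this script at each phase with $CRTOOLS_SCRIPT_ACTION set to the
# phase name. Here the dump blocks until all participating instances have
# announced themselves by touching a file in a shared rendezvous directory.
# RENDEZVOUS and NINSTANCES are illustrative knobs, not part of CRIU.
barrier() {
    dir="${RENDEZVOUS:-/tmp/ckpt-barrier}"
    case "$CRTOOLS_SCRIPT_ACTION" in
    pre-dump)
        mkdir -p "$dir"
        touch "$dir/ready.$$"
        # Wait until every instance has checked in, then let the dump proceed.
        while [ "$(ls "$dir" | wc -l)" -lt "${NINSTANCES:-2}" ]; do
            sleep 0.1
        done
        ;;
    esac
}
barrier
```

Because each CRIU instance blocks inside the hook until all peers arrive, the dumps effectively cross the barrier together, which is the behavior the coordinator provides in a more general, dependency-aware form.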
Similarly during restore: the restore process starts with the first application, then with the second, and the coordinator makes sure that both the checkpointing and the restore happen at the same time.

Now that we can synchronize checkpoints of applications running on different nodes, we also want to transfer these checkpoints to a single location — a single server — where we can form a global checkpoint that can be used during restore. For this we are using the image-streaming mechanism in CRIU, which allows us to transfer the images from all nodes during checkpointing and collect them on the destination side, where the server is running.

Some future work is to integrate this tool with the CRI-O and containerd engines, to automate the dependency detection in Kubernetes — that is, identifying which container depends on which other containers. This is very similar to the way Kubernetes starts a pod: it uses a list of dependencies to, for example, start init containers before application containers. Future work also includes integration with Kubernetes objects such as StatefulSets and Deployments. I think that's it, right? — Yeah, that's it; we're at the end. Any questions? I think you need to go to the mic, or a mic is coming to you. There is one coming, yeah. I will answer the basic checkpoint/restore questions about CRIU and Kubernetes, and he will answer the rest.

— First of all, thanks very much for the talk. Question one: in your demo, when you did the checkpointing and the restoring, if I understood right, Kubernetes knew that it's a restore by the annotation in the image? — Again, please? — How did Kubernetes know that this is a restore? — Okay, Kubernetes doesn't know; it's just the container engine that knows it. — So that was an annotation?
That's the annotation, yeah. So what we do: Kubernetes passes the image during create to CRI-O, CRI-O downloads the image, and then we check if it has the annotation. If it has the annotation, we take an extra code path to do the restore. So there is a simple if-condition in CRI-O; it's really just "if string exists", basically.

— And the second question: do you know of anything being done in this space for the networking — to, let's say, give the pod the same IP address, the same service, the same networking stack? — Yeah, good question. For the Podman migration work I did before, I actually restore the IP address of the container so that TCP connections can stay alive. For Kubernetes, I never really looked into it, because from my understanding the network stack can be really complicated. But there's actually a Google Summer of Code project again this year, in combination with P4. — Yes, P4 is a programming language for programming protocol-independent packet processors. It's essentially a language used for programming switches, such as the Intel Tofino, and the P4 organization participates in Google Summer of Code this year. We actually have a project that aims to solve this problem: how do we migrate containers and enable changing the IP address? The solution essentially uses a load balancer: all clients send packets to the load balancer, and the load balancer keeps track of the updated IP address. — But I basically ignored it so far, because the whole networking thing is too complicated for me to understand; I'm focusing on a very small part right now.

— Hi, this is Abhishek from IBM Research. I have a question related to AI workloads. — Oh, you have to speak up, I cannot hear you, sorry. — I have a question related to AI workloads.
Does it also checkpoint the state of the GPU memory and other things? — Okay, so the question is whether we can checkpoint containers that are using a GPU, right? Something like that. It depends. We can checkpoint processes which are using AMD GPUs, because AMD came to the CRIU project and provided a plugin to support their GPUs. Whenever you have external hardware that is not directly controlled by the kernel — which goes around the kernel, like InfiniBand or GPUs — there is additional state in the GPU or the InfiniBand card which we cannot easily extract from Linux, so we need a plugin like the one I mentioned. From my point of view, this means the vendor has to step up and write a plugin. AMD did it. There are also videos around of NVIDIA doing it: I think Microsoft had a video about ChatGPT where they claimed they're using CRIU with NVIDIA GPUs, and there was a video from MemVerge in combination with NVIDIA GPUs. So people are working on it, but they are not talking to the upstream CRIU developers — we are not aware of the details, we just see the videos on YouTube.

— What is necessary to enable GPU checkpointing is essentially the ability to save the state of the GPU and restore it back. The driver for AMD GPUs is open source, so AMD already implemented this, and you can checkpoint/restore GPU applications with AMD GPUs. NVIDIA also has a team working on this; they presented something at a conference a few days ago that hasn't been released yet, but it's a proof of concept demonstrating that you can checkpoint/restore GPU applications with CRIU. There is also a startup called Cedana that is focusing on solving this problem. So yeah, there are multiple people working on it.

— Hello, thank you for the presentation, that was great and interesting.
Indeed, you mentioned a couple of use cases like long-running jobs, and I could imagine using that kind of feature whenever a node is reclaimed — Kubernetes nodes are reclaimed or updated with new versions regularly, so it's a very ephemeral environment. I believe this is a very interesting solution for those use cases, but the demonstration you made was really manual. Is there any plan to automate all of this? Actually, I'm just asking whether it's planned. — Yeah, yeah. My goal is to have migration working and, in the best case, driven by the scheduler: if there's a low-priority container or pod, it would automatically be moved somewhere else. But the first part, which you saw now, took at least four years, and it feels like this is 10% of the way. So yes, but it's a long way.

— Thanks for the presentation. A question about the HPC part: how do you determine at what point in time all the processes or containers are at a place where you can take a checkpoint? — I think this is for you. — Can you repeat the question? Sorry. — You talked about Flink. In Flink, for example, they send a message to say: this is the point where all the processes need to take a checkpoint. Similar to that, do you have something for the HPC case when you do a checkpoint? — Yeah, so the question is how we synchronize, or initiate, the checkpoint. Flink, for example, sends messages between instances to initiate a checkpoint. With CRIU we can't really do that, because CRIU is not something that is continuously running on the server; it's something we need to start. So when we want to create a checkpoint, we start all the CRIU instances, and each of them has an action-script hook — pre-dump — which allows us to pause the dump until all other instances are running.
And this allows us to start the checkpoint at the same time. Did that answer your question?

— Hi, first, congratulations on your work. One question: in a distributed setting, when you're trying to achieve a consistent cut, how are you capturing the messages in transit? Are you using something like the Chandy-Lamport algorithm? — Yeah, we don't really capture the messages. In our case, we freeze the whole process tree: when all CRIU instances are synchronized at the pre-dump hook, CRIU in the next stage uses the freezer cgroup to pause all processes. So all applications will be paused, and when the checkpoint is complete, it again synchronizes the checkpoints and resumes them only when they are all synchronized. — Yeah, currently we don't capture in-transit messages, so messages could definitely get lost in that part. If we're talking about TCP, retransmission will hopefully save us; if we're talking about InfiniBand, we're out of luck. If you look at what MPI does — I know about Open MPI — there used to be a framework to do checkpoint/restore there, and what it did first was require all the processes in the MPI job to stop doing any transmissions. But if you're talking about InfiniBand, your message will currently probably be lost; with TCP we hope that retransmission saves us. — And this is research work, so it's not something that can be used in production yet. Maybe we'll introduce a two-phase commit so we can capture this type of issue.

— Yeah, thank you for the presentation. I was thinking of the use case where we run a job, like an AI workload, and maybe during the execution of the job, which has a high value, we don't want to lose it.
Something silly happens to the node — I don't know, the disk is full, or there's memory corruption — and we want a way to restore the job on another node. In which cases do you think we would still be able to do the checkpoint and restore after a node corruption, and in which cases would it not be possible? — So, hard question. I guess the thing is: as long as you can read memory and registers, it might work; but if your storage subsystem doesn't work anymore, then maybe you have to do it over the network. It really depends. And again, going back to HPC, there are lots of papers about fault tolerance using checkpoint/restore, and a lot of research on the optimal interval for taking checkpoints regularly — you don't want to create so much I/O that the system is only writing all the time and not doing any calculations. So it heavily depends on your error model and on the application; we basically cannot answer it. It depends. — Thank you.

— There's one more question in the second row. On the kubelet API path: for coordination you have pre-dump action scripts. When we use the kubelet API path, is there a way to tell CRIU to take some actions in the pre-dump state? Some applications require preparation, like deleting some files, to get ready for a checkpoint. — Yeah, so I guess your question is whether you can tell CRIU to do something differently — CRIU has many options, and you want to select one of them during your Kubernetes checkpoint creation. When I initially brought this up and we had to introduce the new CRI call, I had a couple of parameters there, like tcp-established and so on, but they were all removed during review, because the story currently is forensic container checkpointing, and for that they're not needed. I think it would be nice to have them in the future.
The question is whether you want to expose them one-to-one. There is also a pull request from me currently open to do this in "kubectl checkpoint"; that also exists, it's not merged. So the question is: do we really want to expose all parameters from CRIU in kubectl and then pass them all the way through? There are also the CRIU configuration files, but those are not really useful if you're talking about a Kubernetes cluster, because the configuration file has to be on the node where CRIU is running. So I'm undecided, but I totally understand that we need a way to pass additional parameters from kubectl all the way down through the kubelet to CRIU at the lowest level. — And actually, runc supports a custom configuration file, so technically you can define a configuration file for every container, which will be picked up when you checkpoint it. This allows you to specify additional CRIU options to be used for the checkpoint.

— Oh, okay, it looks like we're done. Thanks a lot for your questions and your time.