First of all, I want to thank the DoK community for inviting me to present to you today. My talk is about how to enable hot restarting of stateful applications running on both CPU and GPU pods, to accelerate AI/ML workloads. My name is Bernie Wu. I'm with a company called MemVerge. We're based in Silicon Valley, and we work on different types of memory virtualization and memory snapshotting issues related to AI and ML.

First, let me describe the problem statement. What we're trying to do is enable transparent CPU and GPU snapshotting of pods running AI/ML workflows. Now, a lot of people ask us: don't these AI/ML workflows already have built-in checkpoint/restore capabilities? They do. TensorFlow has it, PyTorch has it, and so on. But we nonetheless see use cases for being able to do this transparently at the Kubernetes operator level, to increase productivity, efficiency, and sustainability while at the same time lowering costs.

Use cases we've uncovered include being able to run AI/ML workloads on Kubernetes while also taking advantage of spot instances on public clouds. People also want to be able to hot restart and rebalance GPU workloads across compute resources. Third is to automatically save and restore, for example, users' Jupyter notebooks and machine data sets: if an instance out in the public cloud gets reclaimed, or the users go home at night, we can automatically save the state of their Jupyter notebooks and bring them back up the next day, that kind of thing. In addition, during normal operations of Kubernetes, people experience node evictions and pod evictions, need to drain nodes, and want to run autoscaling, so we wanted to build an operator that works alongside those existing capabilities. And finally, a lot of people are trying to introduce batch jobs, or long-running, not-so-fault-tolerant applications, and snapshotting will increase their resilience.

The way we did this is we started with CRIU. CRIU is an open-source project that I think came about around 2012, and I believe it's also available in alpha preview mode in Kubernetes 1.25 for forensic container analysis. So we started with CRIU and built on top of that. There's also another GPU vendor out there, AMD, which, in case you're not already aware, has a device-driver plugin for CRIU. What we've been doing is collaborating with NVIDIA on its CUDA driver. The CUDA 12.4 driver just got released, and we expect that in the near future, by 12.5 or before, you'll see a utility released by NVIDIA that allows the GPU to be snapshotted and checkpointed. At the same time we're doing this presentation, we're also showing this at GTC in the Bay Area. And of course, in this community, we built a Kubernetes operator to actually implement all of this.

So let me describe this utility that we've partnered with NVIDIA to develop, which allows checkpointing and restoring of the GPU. Right now the GPU is basically opaque, so they took a little bit of a different approach.
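Before getting into the mechanics, here is a minimal sketch of how a standalone GPU checkpoint/restore utility of this kind might be driven. The utility name and flags below are assumptions based on the behavior described in this talk (the tool had not shipped at the time of recording), so treat this as illustrative only:

```python
# Illustrative only: the utility name and flags below are assumptions
# based on the behavior described in the talk, not a shipped interface.
import subprocess

def toggle_gpu_state(pid: int) -> None:
    """Ask the (hypothetical) GPU checkpoint utility to move a CUDA
    process between its 'running' and 'checkpointed' states. In the
    checkpointed state, device memory has been copied to host memory
    and the GPU is released; toggling again restores it."""
    subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        check=True,
    )

# Checkpoint: already-submitted GPU work runs to a safe point, device
# memory is dumped to an allocated host buffer, and the GPU is released.
toggle_gpu_state(12345)

# ... at this point the CPU-side state can be checkpointed (e.g. with
# CRIU), because the process no longer holds GPU resources ...

# Restore: the GPU is reacquired, device memory is copied back, and
# mappings, objects, streams, and contexts are restored.
toggle_gpu_state(12345)
```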
Rather than building a device driver, NVIDIA has a utility that basically looks at which threads are running and implements its own freeze, checkpoint, and restore process within the GPU architecture. Any already-submitted work runs to a certain level of completion, and then the utility dumps the GPU memory to an allocated area of host memory and releases the GPU. So you have two choices: you can either stop the process completely, or just checkpoint it and continue running. On restore, there's a reverse process: the GPUs are reacquired by the process, device memory is copied from CPU memory back into GPU memory, mappings are restored, the objects, streams, and contexts are all restored, and then the CUDA APIs are unblocked. That's the general flow.

To implement this in conjunction with CRIU, we had to make some modifications to CRIU, which we will be contributing back. One of the things we have to do is perform the checkpointing in two stages. In the first stage of the checkpoint cycle, we freeze the GPU and CPU together, and then we unfreeze one GPU process, which allows us to run the checkpointing operation within the GPU and start copying the memory into the CPU. Then we copy everything, the CPU's memory, the GPU's memory, and any kind of associated ephemeral files or objects, into a checkpoint image, stored typically on a persistent volume on the Kubernetes cluster, and we can resume from there. That's the checkpointing process. The restore is the reverse, again a two-stage process: we have to restore the GPU first, and then let the CRIU utility restore the rest of the CPU state.

We found that implementing this on CRIU alone is not sufficient, because the checkpointing window, the overhead checkpointing time, is excessive. So we've made some other enhancements, such as asynchronous checkpointing, to reduce the quiescent period and allow the CPU and GPU to resume as quickly as possible. We've also implemented incremental checkpointing to minimize the amount of data transferred, along with compression technology to minimize the consumption of storage or memory as we take these checkpoints. And lastly, we've also addressed ephemeral files: some stateful applications still use local-disk ephemeral files.
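To make the two-stage flow and the incremental optimization concrete, here is a minimal orchestration sketch. The GPU utility invocation is the same hypothetical one as above; the CRIU options shown (`--track-mem`, `--leave-running`, `--prev-images-dir`) are real CRIU mechanisms for incremental dumps, but the way they are composed here is an illustration, not our operator's actual implementation:

```python
# Sketch of the two-stage checkpoint described above (requires root).
# The GPU utility call is hypothetical; the criu flags are real CRIU
# options for incremental memory dumps, but this composition is only
# illustrative.
import subprocess
from typing import Optional

def checkpoint_process(pid: int, images_dir: str,
                       prev_images_dir: Optional[str] = None) -> None:
    # Stage 1: flush GPU state to host memory (device memory is copied
    # into an allocated host buffer and the GPU is released).
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)],
                   check=True)

    # Stage 2: let CRIU dump the CPU-side state. --leave-running keeps
    # the application alive, so this is a checkpoint rather than a stop.
    cmd = ["criu", "dump",
           "--tree", str(pid),
           "--images-dir", images_dir,   # typically a persistent volume
           "--track-mem",                # track dirty pages for next dump
           "--leave-running"]
    if prev_images_dir:
        # Incremental checkpoint: only pages dirtied since the previous
        # dump are written, shrinking the amount of data transferred.
        cmd += ["--prev-images-dir", prev_images_dir]
    subprocess.run(cmd, check=True)

    # Resume GPU work once the dump has completed.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)],
                   check=True)
```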
We have to checkpoint all of that as well. And then we had to implement all of this as an operator, so you can pick your favorite stateful app and update its manifest so that this checkpoint/restore is automatically invoked; I'll show a sketch of what that manifest change can look like at the end.

What I'd like to do now is show a demonstration of this checkpoint/restore hot restart. We're going to drain a node and migrate the workload to another node. In this recorded demo, we're using an NVIDIA T4 and a TensorFlow training workload. So, if I can get this to go... yeah, sorry, this is a bit of an eye chart, but down below we're showing all the nodes; it's a small cluster with two workers. Up above, we're launching the operator and the TensorFlow job. In that corner you can monitor the usage of the containers being launched, and then we turn on logging in this upper-left panel so you can see what's going on. Basically, the TensorFlow application starts to run, it's compiling, and pretty soon it goes into a training cycle: you'll see the epochs ticking off in the upper right there. Then in the lower panel, we issue a node drain command and kill the job; right now it's at epoch seven. On the upper right you can see the pod being terminated by the scheduler and then restarted on the other worker automatically. And in a little while, you'll see on the upper left the job resume from where it left off at epoch seven and continue forward. So we're saving a lot of time by enabling hot restart of these GPU workloads on Kubernetes.

Very quickly, there are other recorded demos; you can just click on this square. We have TensorFlow on bare metal; a Parabricks demo, which is an HPC workload for computational biology from NVIDIA; our own curated Memory Machine Cloud batch platform, which is used for spot instances and what we call wave riding; and the same Kubernetes operator you just saw.

And then lastly, what's ahead: we are working with NVIDIA to finish up the modifications to this utility. Again, there'll be a preview release sometime between now and the 12.5 release of CUDA, and we'll be contributing the changes to CRIU. And then we hope to be collaborating with you folks on developing production-grade operators and applications for this.

Thank you very much, and please contact me if you have any questions.
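As promised above, here is a concrete, and entirely hypothetical, illustration of the kind of manifest change an operator like this might key on. Operators typically watch for an annotation or label on the pod spec; none of the annotation keys, names, or values below are real, they only show the shape of the change:

```python
# Hypothetical pod manifest: the annotation keys below are invented for
# illustration; a real operator would document its own opt-in mechanism.
import json

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "tf-training",
        "annotations": {
            # Opt this pod in to transparent checkpoint/restore.
            "example.com/checkpoint": "enabled",
            # Persistent volume claim where checkpoint images are kept.
            "example.com/checkpoint-pvc": "ckpt-images",
        },
    },
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:latest-gpu",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

# kubectl accepts JSON manifests, e.g.: kubectl apply -f pod.json
print(json.dumps(pod_manifest, indent=2))
```

With something like this in place, draining a node would trigger the operator to checkpoint the pod, and the rescheduled pod would restore from the images on the persistent volume instead of restarting its training from epoch zero.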