So, this talk is about resuming pods after spot instance shutdowns, with the cheeky title "The Party Must Go On." Who am I? I'm Muafak. I'm an engineer at QA Wolf and an emeritus maintainer of Crossplane, which is a CNCF project, and I've been involved in the CNCF community since about 2019. So what is QA Wolf? QA Wolf has a platform that runs all these QA tests, browser tests, and a group of in-house QA engineers who write and maintain those tests. Together we make up the QA Wolf team, and we can provide you high coverage with zero effort. Effectively, a company comes to QA Wolf and says, hey, we want our web apps tested, and we write the tests, we maintain them, we schedule them, and we just tell you about the bugs. That velocity gives you the confidence to ship much more frequently. This is an example playground: on the left you see that we use Playwright tests, the usual browser test code, and on the right you're seeing logs and the browser page. To do that at our scale, we are running over 2 million test runs every month. That means hundreds of nodes. For this to be affordable, we're using spot instances in Google Cloud, which give you about a 40 to 60 percent cost reduction. But the main thing with spot instances is that they fail randomly. They just restart, and GCP effectively tells you to pack up and go in 30 seconds. Because of that, up to 5 percent of our runs failed due to these shutdowns. And 5 percent is noticeable, especially because it doesn't choose customers: almost every customer sees that 5 percent, and long-running jobs are even more exposed to these failures. So there were a couple of options for overcoming this. One is: do not use spot instances. That was not fully an option, but it's something we partly did by retrying on standard nodes.
So if a test fails the first time, then on the second try we schedule it to a standard node, so that the customer sees fewer symptoms of that failure. The second option was to run the tests in VMs, using Firecracker for example, so that we're still in Kubernetes but use a micro-VM that has snapshot and restore capabilities. However, that requires nested virtualization to be enabled, or the cluster nodes to be bare metal, and neither is available in GKE; QA Wolf is fully on GCP using GKE clusters. In addition, that would introduce some friction with using Kata Containers: you move away from standard upstream Kubernetes, which may increase the bug surface. The third option was using a tool called Checkpoint/Restore In Userspace, CRIU, to migrate the process tree. Not the whole VM, just the process tree that we care about. And CRIU is not a new project; it's mature, over 10 years old now, and it has gotten a lot of Linux kernel changes in to make it work. With CRIU, there's also some effort already going on in upstream Kubernetes that is focused on forensic analysis. If you look it up, you're going to find blog posts where you use a kubelet alpha API to take a checkpoint of a container and then inspect it for forensic analysis, but it doesn't support restore yet. So one option was that, which is what we went for initially. The second is in-container execution of CRIU when the node shutdown signal is received. That doesn't use any of the kubelet API; it does everything inside a container, which requires elevated privileges. To explain that, we'll look into how CRIU works in a minute. So, the first option we went with was CRIU support in upstream Kubernetes. We started on that path, and right now the support is in runc, at the lowest layer on the right.
You can see what it takes to start a container in Kubernetes. runc can take a checkpoint and restore today using CRIU, and containerd can call runc over RPC to take that checkpoint and restore as well. The kubelet, even in alpha, supports taking a checkpoint, but it doesn't have an endpoint to restore it yet. The main reason this became a blocker for us was containerd: it did not have the wiring for the kubelet checkpoint API, and GKE uses containerd. We looked into setting up a separate CRI-O cluster, but that just meant a lot of effort to maintain at our scale of two-million-plus test runs. However, containerd support was merged two weeks ago. Of course, it's going to take time for that to be released in containerd and then become part of GKE, but that's one layer handled. The next thing was the storage medium support. When you take a checkpoint for forensic analysis, it's written to a host path on the node, which also doesn't work for us, because we are losing the node; we want it on another node, so we would have to copy it manually. Ideally, we would like it written to a persistent volume or a bucket mounted as a volume, but the API server doesn't know anything about checkpoint and restore. There's some progress being made here, but it did not work for us without the heavy lifting of diverging away from GKE. You can follow the progress in KEP-2008. So we went with what effectively started as a POC and then turned into a real thing: the second option, in-container execution, which requires elevated privileges. To understand how this works, and why it needs elevated privileges, we need to understand how CRIU works. What CRIU does is first infect the process: it effectively uses ptrace and freezes the process. This is very similar to an IDE where you put a breakpoint and it stops the process.
It does that by using ptrace, and because it needs to inject parasite code, it requires more capabilities, close to root, to do that. That parasite code then dumps all the memory, file descriptors, open connections; there are a lot of things surrounding the process. Not everything, but quite a few things that are required to restore the process. And that may include sensitive data as well. For example, if you have a secret mounted and your process reads it, that is going to be included in the memory dump. So it's quite sensitive data. Then there's restore. By the way, this is all at a very high level, there are lots of details I'm skipping over, but just to give you an idea of what it's doing. During restore, it loads all these files, prepares the environment, and creates namespaces, which we'll get to, to be able to recover the same processes with the same PIDs. Then it creates the processes of the tree one by one, maps them to that memory, and takes itself out of the process. So it's quite a bit of heavy lifting that it does, and it required a number of Linux kernel changes, which are all in the Linux 6 kernels we're using. But that's the general idea of how CRIU works. Now, this is a more complex example. We saw what happens with a single process, and that was a single-process tree, which is usually the case with most workloads. The thing with CRIU is that it takes care of a lot, but there has to be a point where it stops doing things for you, because they become so process-specific. For example, PID 10 here has a log file open. When you take a checkpoint of this tree and restore it, and that log file doesn't exist, it fails. The process has an open file descriptor, and because it's frozen at exactly the same place, it doesn't know whether the file ever existed; it doesn't have the knowledge to recreate it, unless it has specific logic inside it.
But at the same time, CRIU doesn't take a copy of that log file, because it just doesn't know about it, and even if it did, the file could be made available by the user. It could be a 10-gig file that is part of a volume that's already in place. It just doesn't know whether it's a generated file or a system file, for example a library file opened by the process. Because CRIU operates at such a low level, it just doesn't have those details. So when you restore a process, you need to make sure these surrounding resources are in place. The log file, for example: you have to copy it, or create another empty file so that the process continues from there. In the second example, PID 12 is a graphical user interface that talks to X11 over a Unix socket, which is how it sends the graphics. With sockets, every socket has two ends, and both ends must be in place for restoration to work. Because remember, the process doesn't know we're taking the checkpoint, so it expects exactly the same state. The same goes for open TCP connections. To make this all more concrete, let's do a simple, small demo of CRIU in a Linux environment. It's completely Linux-specific, so you can't really do it on macOS at all. This is the Linux VM that I have here. In this first demo, we have a counter. It just counts one by one and prints its PID as well. The checkpoint folder is empty. Let's start the process. Okay, every second it prints one, two, three, four. It has a PID of 51399. Now, in another tab, I am going to dump this process. I need to know its PID, I need to give it -j (which I will get to in a second), and then the directory to save all those image files. It needs sudo, root permissions. Okay, it's done. The count was at 35, and it killed the process. There's an option to leave it running, but by default, after dumping, CRIU kills it.
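For reference, the counter in this demo is just a loop that prints its PID and then an increasing number each second. A minimal Python sketch (my reconstruction, not the exact demo program; the function name and flags are mine):

```python
import os
import sys
import time

def count(n=None, delay=1.0):
    """Print this process's PID, then 1, 2, 3, ... every `delay` seconds.
    Each line is flushed immediately so no output is sitting in a
    userspace buffer when CRIU freezes and dumps the process."""
    print(f"pid={os.getpid()}", flush=True)
    i = 0
    while n is None or i < n:
        i += 1
        print(i, flush=True)
        if delay:
            time.sleep(delay)
    return i

if __name__ == "__main__" and "--run" in sys.argv:
    count()  # counts forever, until `criu dump` freezes and kills it
```

Dumping it then looks roughly like `sudo criu dump -t <PID> -j -D checkpoint/`, where `-j` is short for `--shell-job`, telling CRIU the process is attached to a shell's TTY.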
So if we look here, the checkpoint folder is populated with all these files about the process. CRIU also comes with a tool called crit that lets you inspect the checkpoint. For example, these are the processes included in the checkpoint. And let's look at the file descriptors that are open: as you can see, there are 0, 1, and 2, which are stdin, stdout, and stderr, all bound to a TTY. So this is the checkpoint that we have on disk, and now we're going to restore it. When restoring, I don't need to give the PID, because it's in the checkpoint, but I do need to give -j. And there it goes: it continues from the exact same spot where it left off. We left it at 35, and here it continues from 36. So that's a very simple loop of how CRIU works and what it provides you, which is quite nice. But that was an example in a Linux VM in a terminal, and we're not running our jobs in a terminal. They need to be inside a container, and that was the option we went with. So our first goal is running CRIU in a container, and we're going to go over the problems we faced when we tried to do that. The first problem is the PID collision problem. When you start a process in a container, it takes PID 1, and it may spawn PID 2 and other processes. When you trigger a CRIU dump, it takes a checkpoint of this process tree, and the PIDs have to stay the same. So on restore, CRIU has to start a process with PID 1, but PID 1 is already taken, so it's not able to restore it. In fact, as far as I remember, CRIU refuses to take a dump of a PID 1 process unless you override it. You can overcome this by, for example, spinning up some empty processes so that the first real process takes, say, PID 5, and when you restore, that PID is available, which is somewhat what we're going to do.
The next problem is a somewhat lesser-known fact. In the terminal, we had the file descriptors for stdin, stdout, and stderr all bound to a TTY. That's what happens when you run a process in a terminal. But when you run it in a container using runc, stdin is actually bound to /dev/null, and stdout and stderr to pipes. A pipe is effectively the vertical bar we use in the terminal, very similar to that, but every pipe gets a unique ID when it's created. So you take the checkpoint, and the process says, okay, I'm writing stdout and stderr to pipes 287 and 458. But when you restore, those pipes either need to be in place, and pipe IDs are randomly assigned, or there needs to be another end of the pair listening on that ID. When you start another container, it comes up with different pipe IDs, assigned on its own. I'm glossing over some details here, but these were the two initial blocking problems. (For example, if you restore a process that never writes to stdout or stderr, that part is fine; it runs well.) To orchestrate all this, to bump the PID and make sure the pipes are in place, we wrote an open-source command wrapper, crik, that orchestrates running CRIU. It effectively prepares the environment for taking the checkpoint and also for the restore, and it gives us room to do all these preparations to make it work inside a container. For the PID collision problem, what it does is this: crik starts as PID 1, and it pushes the PID counter to a very high number so that the workload process starts with a very high PID. There are other ways to do this, but this has been working for our use cases so far.
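The PID bump works because the kernel exposes the PID counter of the current PID namespace at /proc/sys/kernel/ns_last_pid: the next PID handed out is one more than the value stored there. A minimal sketch of the idea (my illustration, not crik's actual code; writing this file needs elevated privileges, which is why the Docker demo later uses --privileged):

```python
def bump_next_pid(next_pid, proc_file="/proc/sys/kernel/ns_last_pid"):
    """Make the kernel hand out `next_pid` to the next process created
    in this PID namespace, by writing `next_pid - 1` to ns_last_pid.
    Requires root; `proc_file` is parameterized only for illustration."""
    with open(proc_file, "w") as f:
        f.write(str(next_pid - 1))

# e.g. bump_next_pid(9001)
# the next process forked in this namespace then gets PID 9001
```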
So what happens is that the process you give it starts with PID 9001, say, while crik's own PID is not really relevant. On restore, crik scans the checkpoint folder; if it sees a checkpoint image, it runs the restore, and those high PIDs are obviously available. There might be some other processes around besides crik, because CRIU doesn't require the restored tree to start at PID 1, so you have room in between. For example, you can start logging processes and other things, but you shouldn't start 9,000 of them, and you shouldn't start them after the workload, so that the PIDs of the restored processes are not taken. Then the TTY-in-container problem: we said these pipes have IDs and the other end must be listening. What crik does when it takes the checkpoint, and this is the same mechanism runc uses for its upstream restore support, is take note of those pipes in a separate file in the checkpoint directory, so that on restore it knows pipes 287 and 458 should be bound to FDs 1 and 2 of CRIU. CRIU has this inherit-fd option: you give CRIU an open file descriptor, and CRIU tells the process, hey, that file you have open, you should write to this one. The process still thinks it's writing to the old file, but the writes end up in those open file descriptors. crik orchestrates that by noting the pipe IDs at checkpoint time, then during restore it tells CRIU to inherit those FDs instead of the pipes, and it forwards the output streams on to runc. It all happens in the container. Cool, so those were the two initial blockers to getting the simple counter running in a container. Let's see that in action.
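The effect of inherit-fd can be illustrated with plain dup2: the restored process keeps writing to the same FD number it always had, but that number is re-pointed underneath at the new container's pipe. A standalone Python illustration of the idea (not CRIU itself):

```python
import os

# "Old" pipe: stand-in for the pipe the process wrote stdout to before
# the checkpoint. "New" pipe: the one the fresh container hands us,
# with different kernel-assigned IDs.
old_r, old_w = os.pipe()
new_r, new_w = os.pipe()

# Re-point the old FD number at the new pipe's write end. This is the
# kind of remapping CRIU's inherit-fd arranges for a restored process.
os.dup2(new_w, old_w)
os.write(old_w, b"count 36\n")  # process writes to its old FD number...
os.close(old_w)
os.close(new_w)
os.close(old_r)

data = os.read(new_r, 100)      # ...and the bytes arrive on the new pipe
os.close(new_r)
print(data.decode(), end="")
```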
So we're building a Dockerfile: CRIU needs to be available, we install crik and copy the crik config, which just sets the image directory to /checkpoint, and then we run the counter by giving it as an argument to crik run. I built this image beforehand because I'm afraid of the Wi-Fi here. So we're going to run it. This --privileged flag is needed to write to ns_last_pid, and we're mounting the checkpoint folder. Let's see the Docker logs. Actually, we're in the wrong directory. Okay, empty checkpoint. Docker run... okay, so it started the command with PID 1001. It counts as before, and crik also set up a SIGTERM handler to take the checkpoint, so you don't have to exec in and take the checkpoint yourself. This is the same mechanism we use in Kubernetes; we're going to see that Kubernetes sends the SIGTERM signal and you have 15 seconds to shut down. So what we're going to do is stop it just like Kubernetes would: docker stop. It received the SIGTERM, and the checkpoint was taken in 47 milliseconds. If we look at the checkpoint folder now, we see the files; well, we need to chown them first, because they were written by root in the container. Okay, so you see the descriptors CRIU recorded: the first one is /dev/null for stdin, then the pipe IDs. Similarly, in the file crik wrote to the checkpoint folder, you see the pipes that were set up by runc, which we are going to override with file descriptors of CRIU. So the checkpoint is in place. We're going to run the same docker run command that we did before, and crik is going to see that this directory has checkpoint files, so we don't have to run the CRIU restore ourselves; it's going to run it for us. And once it sees that, it will try to restore. Oh, we ran it in the wrong directory.
It mounted the wrong directory, so we need to run it from here so that the working directory is correct. Okay, it found the checkpoint, and it continues from exactly the same count where it left off earlier. Cool. So we've been moving up one layer at a time: first we did it in the terminal, and now we did it in a container. The third layer is Kubernetes, where we're running much more complex processes: QA tests that include a browser and the whole thing. When you look at the screen, to make this work there's a pod running that includes a VNC server, websockify (which converts that TCP traffic to WebSocket), Node.js, Playwright, and the browser. So it's quite the complex process tree, and there are lots of external resources that need to be handled. Some of the problems we ran into, which we'll be going over, are runtime files and similar runtime resources, and then we'll get to the point where we do the demo that restores all of those processes. The first one is runtime files. As we said, the restored process doesn't know that it's being restored, so it expects the files it had open to be in place. With Docker images, we bring all the files as part of the image, but there are also runtime files generated during execution of the process. For example, the WebKit browser that we're going to see creates a root cache folder at runtime, and because it doesn't know it's being restored, it expects that folder, and when it doesn't find it, it fails. So what crik does is copy those files as "extra files". This is one of the problems that actually wouldn't happen in the ongoing upstream efforts, because runc knows about those files: when you start a container, there are image layers, and on top you have the writable layer, which runc is aware of.
So upstream can just take a copy of the whole writable layer and zip it, but in our case we don't know about it; we're running inside the container, and we don't even know it's an overlayfs. Ideally we would inspect the checkpoint for the paths, but right now the user has to tell us which paths they want to be available on restore. Then there are runtime resources like Unix sockets. This is a small annoyance: you have to make sure all the directories up to that socket path exist, which crik also takes care of. This is another one of those cases where you try to restore your process, see the error, and add app-specific workarounds as configuration. The third one on the list was cgroup v2 files. With cgroup v2, in Kubernetes, at least in GKE, some processes want to know the current memory usage and the max so they can optimize themselves; browsers do this. That path is dynamic: it includes the pod UID and container ID, which change because we're starting another container. So what crik does is inspect the checkpoint, scan for this path, calculate the new path, and then tell CRIU to override it: inherit the open file descriptor, effectively, so that the process thinks it's writing to the same file, but it actually ends up at the right path. And the last one is kind of a funny one. There's an event mechanism called inotify in Linux that processes can set up on a file to get events when that file changes; the time zone file, for example, for this specific browser. But overlayfs doesn't actually support that out of the box, so when you run these processes in a container, you never get an event.
But when CRIU restores, it tries to make sure those inotify handles are in place, and that fails, because overlayfs doesn't support them by default unless you configure it. So this is a small hack we had to put in place: we delete that inotify watch before taking the checkpoint. Of course, that requires the application to be able to handle the missing watch. There are a couple of other issues. One was a small bug in CRIU, which is resolved in that PR. And the last one, which we didn't really solve but worked around, is that pod IPs change. CRIU is able to lock the network connection and continue from where it was without even dropping the connection, but that requires both ends of the TCP connection to have the same IPs when you restore, which is not possible with pods: when you create a new pod, it gets a new pod IP. So our app had to handle the retries and reconnects when the connection drops. Cool, so let's see all of that in action with a real pod playground. This is an example test in our platform. At the same time, let's do kubectl get pods. This is our playground system; the network is slow, as usual. Okay. When I click to run the workflow, it's going to create a new pod. Okay, kubectl logs. This is a real GKE cluster, not a kind cluster, so I don't have to switch to the Ubuntu VM or anything. The process tree has started, a bunch of processes, and you see the webpage come up. I'm going to stop this execution and change the state here to visit the Kubernetes website. So we have this pod, and it has a persistent volume mounted, which the checkpoint will be dumped to. So what I'm going to do is delete the pod.
I delete the pod, and let's see in the logs: received SIGTERM, taking checkpoint, and it took the checkpoint in two and a half seconds. Here the connection dropped, and it's waiting. Our controller immediately creates a new pod with the same volume, and when crik sees that, it's hopefully going to run the restore. Let's see. Yeah. This is a log line we enable for debugging: the CRIU restore command we end up using, which has a bunch of the workarounds we talked about. Well, the demo gods are not with us today; I think it's the internet. We're running out of time, but it actually works. So that's been our journey to get this up and running in our GKE setup. There's another small tool included called node-state-server, which is what crik asks: hey, is this node shutting down, should I take the checkpoint or not? Otherwise we would end up taking a checkpoint even when the pod is deleted for other reasons. As for future plans, our general goal is to converge on the upstream efforts as closely as possible. That will reduce the dedicated effort, and as you can see, some of the problems (not all of them) can be solved much better by doing these things outside the container. We also want more automation to figure out app-specific workarounds. Chrome specifically is our next target, since it's the browser used in our tests. And we want to make crik more comfortable for other use cases as well: for example, we have some workarounds for X11, and there may be some Wayland or GNOME things we need to do. Then there's live migration without dropping the connection. That's actually possible with some CNIs, where you can assign the same pod IP to the new pod, but we're using GKE, and that relies on Cilium, so we're waiting for that feature to land.
In that case, the VNC session would just freeze for 10 seconds and then continue from where it was before. I would like to shout out the CRIU maintainers. They have been really, really helpful in getting us unblocked; they're really good people. Let's give them a round of applause for the awesome tool they built. That's all I had to say today. Find me if you have any questions, and you can leave feedback via the QR code. Thank you for attending.