Hello everyone and thank you for tuning in. My name is Ritesh Naik and I'm here to talk about using container checkpoint/restore in Kubernetes to achieve fast pod startup times. We at MathWorks have been using container checkpoint/restore at high scale since 2016. In this talk, I'll go over how we incorporated checkpoint/restore in Kubernetes to achieve a scalable system. Hopefully you can take something away from the session and apply it to your own environment.

On that note, before we get started, let me quickly introduce myself. My name is Ritesh Naik and I'm a senior software engineer at MathWorks. For those who are not familiar with MathWorks, MathWorks is a leading developer of mathematical computing software. Engineers, scientists, and researchers worldwide rely on its products to accelerate the pace of discovery, innovation, and development. I've been at MathWorks for over five years, working on various areas of infrastructure and scalability, and my team and I have been focusing on the Kubernetes space for the past few years. If you'd like to reach out to me after this talk, feel free to ping me on Kubernetes Slack or via email. I'm happy to talk about all things containers, Kubernetes, checkpoint/restore, travel, or anything else for that matter.

In this talk, I'll go over the motivation behind using container checkpoint/restore, give a quick introduction to checkpoint/restore along with a demo, walk through our solution for using container checkpoint/restore in Kubernetes, and cover some of the lessons learned, best practices, and future enhancements. Finally, we'll wrap it up with Q&A at the end.

As in any typical Kubernetes cluster, here we have a workload of containerized applications orchestrated by Kubernetes. This workload could include an application serving concurrent requests, like a web service, or each request could be handled by an individual containerized application, as in a function-as-a-service or serverless setup. Applications could come up pretty fast, or they might have initialization overhead. Now, keeping different workloads in mind, what are some of the things we need to consider to build a scalable system? First and foremost, the system should be able to react quickly and scale fast for any unexpected traffic spikes. Secondly, there should not be any first-use performance impact on latency due to a cold start. The cold start is basically the time spent setting up the container environment plus the application initialization needed to come up ready. It could be avoided by keeping the containerized application warm and ready to go, and part of warming the containerized application involves initializing the application. We'll talk about this more in the upcoming slides. Third, use resources efficiently by keeping cost and waste low.

So, how easy or difficult is it to achieve these goals? Let's start by addressing the elephant in the room: cold start. Yes, cold start is one of the main challenges that makes it difficult to achieve all three goals at the same time. This can also be seen in function-as-a-service or serverless use cases. Let me walk you through some of the options in the next couple of slides and why it becomes difficult to achieve all the goals with container or application cold-start time. Option one: scale out on demand with usage. Basically, wait for the usage demand and scale up the workload to meet it. Let's take this and stack it up against our goals. I'll give you a couple of seconds to guess. I'm sure most of you got it right.
So, as you see here, this does well on the goal of using resources efficiently, since new pods are spawned only when needed. However, it is prone to cold-start pains and would have poor first-use performance. It also wouldn't be able to scale quickly to meet a sudden burst in the workload. Can we do better? Let's see. How about we create a standby pool of pre-warmed pods? Basically, maintain a pool of pre-warmed containers, where the size of the pool could be calculated based on historical traffic trends. Now, let's do the same exercise again and stack it against the goals. It does address the cold-start pain: setting up the container environment and initializing the application is done ahead of time. But it still suffers when handling a sudden burst in the workload. And since we are maintaining these pre-warmed pods, there would be a lot of wasted resources if the pods are sitting idle while traffic is slow.

Now, can we achieve these seemingly contradictory goals? Can we get around the cold-start pains to provide fast performance on first use, with the ability to scale out fast to account for unexpected traffic spikes, while still using resources efficiently? It seems like a daunting task. But yes, we can. And the answer is checkpoint/restore. We used a Linux utility called CRIU, which stands for Checkpoint/Restore In Userspace, to perform checkpoint/restore of the containerized application. It's an open source project on GitHub with around 10,000 commits and hundreds of contributors. Now, the idea here is to checkpoint the warmed container that is already initialized and restore it as needed during scaling. With this approach, the startup time is reduced considerably. In some cases, we saw around a 70-fold improvement in startup time, going from a couple of minutes to a couple of seconds.

So, how does checkpoint/restore work? Checkpoint freezes a running application, or any Linux process for that matter, and serializes its state to disk as a collection of files. These files capture the process's state: open files, sockets, namespaces, memory, et cetera. Restore then creates a process tree by reading this collection of files and restores the process state.

Let's go over a quick demo of checkpoint/restore to understand more about CRIU. In this demo, I have a containerized Go web application that I call GoPause. Basically, GoPause simulates an app doing real startup and pre-warming that takes about 30 seconds before it starts listening on port 4000. You'll see how, with checkpoint/restore, this 30-second initialization has to happen just once, during the checkpoint, and every restore of the container skips the initialization and is ready in a couple of seconds. So, let's get started. We have a Go web application here that takes around 30 seconds to initialize. It waits for initialization to complete, starts listening on port 4000 once initialization is done, and returns the string "CR demo Go app". For this, we are using runc as the container runtime for creating and managing the container lifecycle, and CRIU to enable the checkpoint/restore functionality. runc itself needs a config.json file through which container options are passed, and a root filesystem to be used for the container. With that, let's take a look at the config file to see what is included in there. It includes details like the OCI spec version, the command to run at the start of the container, the path to the root filesystem, as you can see here, along with some of the mounts required on Linux. Roughly, putting such a bundle together looks like the sketch below.
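Here is a minimal sketch of how a runc bundle like the one in the demo could be prepared. The image tag, directory names, and binary path are illustrative assumptions, not the exact files used in the demo.

```bash
# Hypothetical bundle layout for the GoPause demo.
mkdir -p gopause-bundle/rootfs

# Populate the root filesystem from a container image of the app
# (assuming the image is tagged gopause:latest and Docker is available).
docker export "$(docker create gopause:latest)" | tar -C gopause-bundle/rootfs -xf -

# Generate a default OCI config.json. It carries the OCI spec version,
# the command to run at container start (the "args" array), the path to
# the root filesystem ("root.path"), and the standard Linux mounts.
runc spec --bundle gopause-bundle

# Edit "args" in config.json to point at the Go app binary, for example
# ["/usr/local/bin/gopause"] (the binary path is an assumption), and set
# "terminal": false so the container can run detached.
```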
Basically, any container option the container requires goes in this config file. Let's start a container running the Go web application. As you see, it's waiting for the application to finish initialization, and that will take around 30 seconds. While we are waiting, let's start polling port 4000. This should get a response back once the Go app finishes initialization and starts listening on port 4000. So, the container is ready; we get the string back as the response. Since the container is ready, let's checkpoint it using runc checkpoint. This stops the container, as you see, and writes the checkpoint files to disk. Let's check the status: yes, the container is stopped. Now let's restore the container from the checkpoint, and you should see that it comes up ready in about a second, unlike running the container fresh, which would have spent 30 seconds initializing before coming up ready. So you can see the advantage of restoring the container over just running it, as long as a checkpoint is available. If you look at the container lifecycle through a state-transition lens, with checkpoint/restore we were able to skip the initialization and jump straight to ready. To summarize, checkpoint/restore is the answer to fast container startup time. Wonderful, isn't it? Roughly, the commands from this demo look like the sketch below.
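This sketch retraces the demo flow. The container names, checkpoint path, and the assumption of host networking are illustrative, not the exact setup used on stage.

```bash
# Start the container; the Go app spends ~30 seconds initializing.
runc run --bundle gopause-bundle --detach gopause

# Poll port 4000 until the app finishes initializing and answers
# (assumes host networking, e.g. by dropping the network namespace
# entry from config.json).
until curl -sf http://localhost:4000/; do sleep 1; done

# Checkpoint: freezes the container, serializes its state to disk as a
# collection of files, and leaves the container stopped.
mkdir -p /tmp/gopause-checkpoint
runc checkpoint --image-path /tmp/gopause-checkpoint gopause
runc list    # the container now shows as stopped

# Restore: rebuilds the process tree from the checkpoint files; the app
# is ready in about a second, skipping the 30-second initialization.
runc restore --bundle gopause-bundle --image-path /tmp/gopause-checkpoint --detach gopause-restored
```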
Now, let's leverage this in our world of Kubernetes. Unfortunately, as of today, there is no native support for checkpoint/restore in Kubernetes. Let's see what options we have to bring checkpoint/restore functionality to Kubernetes. One option would be to add support for checkpoint/restore in Kubernetes natively. This would involve creating a Kubernetes Enhancement Proposal, design, and API review. The second option is to take a non-native approach without touching the Kubernetes code base, and in this talk I'll walk you through our non-native solution.

Checkpoint/restore has different use cases. It could be used to address the cold start we talked about, or to migrate a container from one host to another, maybe during maintenance, among many other use cases. Given the different use cases and the complexity of checkpoint/restore in Kubernetes, it touches some of the core workflows and components. So adding checkpoint/restore natively would involve making some core design decisions and might extend the timeline to general availability. Time to GA was one of the factors we considered in favoring the non-native approach over the native one. Having said that, there is a case to be made that non-native and native could coexist, given the specific problems each is trying to solve.

Our approach to introducing checkpoint/restore capabilities in Kubernetes involved adding checkpoint and restore microservices. The checkpoint service's basic job is to checkpoint a running container; the restore service's basic job is to restore a container from that checkpoint. Checkpoint accepts the container image and the configuration required to run the container, waits for the container to start and complete its initialization, and then checkpoints it. Restore uses this checkpoint image and restores the container to the state at which it was checkpointed. Now, Kubernetes doesn't expose checkpoint/restore functionality through its API, so in order to checkpoint or restore containers in Kubernetes, we had to think a little outside the box. We needed a way to introduce a container runtime into Kubernetes that supports checkpoint/restore and is not managed by Kubernetes.

Hence, we went with the design of a container runtime in a container runtime. As the name suggests, we run a container runtime inside a container of a pod spawned by the container runtime that is managed by Kubernetes. The parent container, which runs the checkpoint or restore service, runs the containerized application using the child container runtime. The parent container runtime could be any container runtime supported by Kubernetes, and the child container runtime needs to support checkpoint/restore. For this talk, I'll be using Docker as the parent container runtime and runc as the child container runtime, hence runc in Docker.

Let's take another look at the states from pre-checkpoint to post-restore. In the pre-checkpoint state, a pod running the checkpoint service is spawned, and it runs the inner container to be checkpointed. In the checkpoint state, the checkpoint image is made available. In the post-restore state, there is a pod running the restore service that has a restored container running inside the outer container, using the checkpoint image. Now let's look at it from a sequence-of-events angle. In the checkpoint sequence, the checkpoint service is deployed; it starts the application container as a child container and waits for the application to become ready. Once it is ready, it checkpoints the container and saves the checkpoint in a location that can be accessed by the restore service. Moving on to the restore sequence: the restore service is deployed, looks for the checkpoint, waits for it to become available if none exists, and once it is available, restores the container from the checkpoint.

Tying it all together, in the first phase of an incremental approach we made a design decision to do checkpoint and restore per node. The checkpoint service is scheduled as daemon pods that run on every node to run the containerized application and create a checkpoint. Meanwhile, the restore service waits for the checkpoint to become available on the node and restores the container from it. From then on, any new pods coming up on that particular node start from the checkpoint available on the node. This should reduce pod startup time from seconds to milliseconds, and from minutes to a couple of seconds.

Let's see the checkpoint and restore services in action. In this demo, we'll scale the application with and without checkpoint/restore side by side to see the difference checkpoint/restore makes while scaling the application. We'll use the same Go app from the previous demo and deploy it in the cluster both as a vanilla container and as a container with checkpoint/restore. We'll scale both sets of pods, and you'll notice that restored pods should scale around 30 times faster compared to vanilla pods. For this demo, we are using the same Go application as before, which takes around 30 seconds to initialize and listens on port 4000. The sample Go application is containerized so that it just calls the binary executable when run, and the same containerized Go application is handed to the checkpoint/restore service, which uses a runc-based image that has the helpers to run, checkpoint, and restore the container. Let's look at the deployment object. This is a deployment object for deploying the Go application as a vanilla container; as you see, it uses the vanilla Go app image, and the liveness and readiness probes are set to probe port 4000. Roughly, it looks like the sketch below.
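This is roughly what the vanilla deployment object could look like; the image name, replica count, and probe delay are assumptions for illustration rather than the exact manifest from the demo.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gopause-vanilla
spec:
  replicas: 1
  selector:
    matchLabels: {app: gopause-vanilla}
  template:
    metadata:
      labels: {app: gopause-vanilla}
    spec:
      containers:
      - name: gopause
        image: example.com/gopause:latest      # plain Go app image (assumption)
        ports:
        - containerPort: 4000
        readinessProbe:                        # the pod only turns Ready after the ~30s init
          httpGet: {path: /, port: 4000}
        livenessProbe:
          httpGet: {path: /, port: 4000}
          initialDelaySeconds: 40              # leave room for the slow startup (assumption)
EOF
```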
Next, let's look at the checkpoint object. The checkpoint here is a DaemonSet that uses the image with the runc helpers to run, checkpoint, and restore. The operation here is checkpoint, and we are mounting a location to save the checkpoint artifacts as well as the checkpoint/restore properties. Let's look at the restore object. It's a deployment object, exactly like the Go application's deployment object, but it uses the image with the runc helpers again. The operation to be performed here is restore, along with the location of the checkpoint artifacts and the properties. And this is the ConfigMap through which the properties are passed; it basically carries a probe to check the readiness of the container running inside the parent container.

Now let's deploy all the resources. On the right, you should see three pods. The first one is the checkpoint pod, which is starting the application, basically running the container with the Go web app and waiting for it to initialize. Similarly, the vanilla pod is running the Go web container, and the restore pod is waiting for the checkpoint to become available. While that is coming up, let's scale the vanilla deployment to two. As you see on the right, both pods have come up ready, and now let's scale the Go web app restore deployment. As you see on the right, the restore pod came up ready in a couple of seconds, while the vanilla pod is still waiting to come up because it is initializing; the restored pod skipped that initialization and came up ready in a couple of seconds. As you see, all the restored pods are up, but the vanilla pod is still coming up and initializing. So you can see how much of a difference this would make if you are scaling 2,000 or 10,000 pods and every pod is trying to initialize, whereas with restore you can skip that part and pods come up ready in a couple of seconds, or in milliseconds in fact.

Tying it all together at a high level, what does it look like to introduce a checkpoint/restore service for an existing application? Please note that this is just one of the ways you could achieve this, and there might be multiple ways, but the overall concept remains the same. Say you have an existing image that runs a Go app. You could create a respective checkpoint and restore image as a wrapper around the existing image; this is one way to hand over the existing app to the checkpoint/restore service. Now take the deployment object for that application. Configurable properties, like a way to check whether the application is ready or a way to pass in the container options, could be passed as a ConfigMap along with the respective representation for the checkpoint and restore services. If the checkpoint is per node, the Kubernetes object for checkpoint would be a DaemonSet, and restore would be a Deployment. If not, checkpoint could be a Deployment, or a CronJob that runs once per application deployment if the checkpoint is saved in external storage or performed as part of the CI/CD pipeline. Then there is the respective image used by checkpoint and restore. The checkpoint/restore service does require privileges, but the actual application container running inside it doesn't, so the application container itself shouldn't be impacted. Next is the location of the checkpoint artifacts; here it is local node storage, but it could be external storage or an OCI registry. Finally, there is some way to pass in the properties; again, here it is done locally. A minimal sketch of what the checkpoint DaemonSet and its ConfigMap could look like is shown below.
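The names, image, environment variable, and paths in this sketch are hypothetical, meant only to illustrate the shape of the objects; they are not the actual manifests used in the demo.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: gopause-cr-props
data:
  readiness_probe: "http://localhost:4000/"        # how the service decides the app is warmed up
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gopause-checkpoint
spec:
  selector:
    matchLabels: {app: gopause-checkpoint}
  template:
    metadata:
      labels: {app: gopause-checkpoint}
    spec:
      containers:
      - name: checkpoint
        image: example.com/gopause-cr:latest       # runc-based wrapper image (assumption)
        env:
        - {name: CR_OPERATION, value: checkpoint}  # hypothetical switch between checkpoint and restore
        securityContext:
          privileged: true                         # the C/R service needs privileges; the app itself does not
        volumeMounts:
        - {name: checkpoints, mountPath: /checkpoints}
        - {name: props, mountPath: /props}
      volumes:
      - name: checkpoints
        hostPath: {path: /var/lib/cr-checkpoints}  # node-local checkpoint location
      - name: props
        configMap: {name: gopause-cr-props}
EOF
```

A restore object would look similar, but as a Deployment, with the operation set to restore and the readiness check pointed at the restored container.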
With this, we wrap up the non-native solution for checkpoint/restore in Kubernetes, and this zero-to-checkpoint/restore-in-Kubernetes walkthrough should give you a complete picture of what it looks like. For native support, I would encourage you all to attend Adrian Reber's talk tomorrow to learn more about the vision of adding checkpoint/restore support to Kubernetes natively. There is also an existing KEP, created and owned by Adrian Reber, on adding checkpoint/restore support natively in Kubernetes. Adrian's talk and the KEP focus on one particular use case of checkpoint/restore, pod migration; they don't take on other use cases yet in this initial phase. SIG Node meetings are another good resource for getting involved in the conversation around checkpoint/restore. I've been involved in reviewing Adrian's proposal and design and have been attending the SIG Node meeting discussions around the native support. I highly recommend checking out these resources if this is something of interest to you, and you are welcome to review and contribute.

Let's go over some of the lessons learned using checkpoint/restore. One thing to note is that checkpoint/restore behavior is sensitive and can produce unexpected results if processes are added to or removed from the container, or if there are new network connections, at the time of the checkpoint. It can become especially challenging if the application is not owned by the team that consumes it, and that team has no control over what goes into and out of the containerized application. Checkpoint/restore has its capabilities and functionality sprinkled across CRIU itself, the container runtime, and even the kernel. So in the past, we had to look at different layers of the technology stack to understand the root cause of some of the issues we ran into. It has been a learning experience throughout.

Designing for a specific container runtime locks you into that runtime and its specific implementation, and any breaking change in basic functionality in one of the other components can break the workflow. For example, there was a change in the way the checkpoint location could be specified during restore in Docker. Even though there was no change in the lower-level container runtime, runc, it broke the Docker workflow. Since we were so tied to the Docker runtime, it took us some time to transition to a different runtime or a different version of Docker. Also, trading off between portability and optimization is a difficult task. The more optimization is in place, the more challenging portability becomes, which affects local workflows. This can create a gap and allow issues to slip through the cracks and be discovered late. The right balance prevents gaps between developers' local workflows and the cloud environment.

Some of the practices listed here should hold good for any cloud native application, and they hold good with checkpoint/restore too. I highly recommend enabling logs and metrics around checkpoint/restore; these logs and metrics are really helpful, and at the same time the maintainers are quite accessible if you run into any issues. Make these logs and metrics as easily accessible as possible. Having that repeatability factor and including checkpoint/restore in the CI/CD pipeline helps with quicker detection of any failures or changes in behavior.
Since there are different components involved in bringing checkpoint/restore together, always make sure to update to the latest versions in order to get all the required security patches, updates, and bug fixes. There are also some limitations and boundaries of CRIU to keep in mind so you can design checkpoint/restore around them. For example, one thing that comes to mind is the fsnotify subsystem. We ran into issues checkpointing a process that uses inotify, because the current fsnotify implementation makes it difficult for CRIU to resolve some of those handles. This is a known and documented limitation, but limitations like this are good to be aware of.

So what are some of the enhancements we envision in the future for the non-native approach? As I mentioned, in the initial approach we decided to make the checkpoint available on each node. This need not be the case: the checkpoint could be made available in external storage, saving us from checkpointing on every node. This is something we plan to take on moving forward. We can also go one step further and make it part of the CI/CD pipeline, bringing it closer to integration and avoiding late discoveries. The current non-native approach targets the fast startup use case, but it could eventually be extended to pod migration as well. And the other way around, the current native-support effort in Kubernetes targets pod migration, but it could be extended to support the fast pod startup use case too.

This brings us to the end of the talk, and I'm happy to share that MathWorks is hiring. If you are interested in being part of MathWorks, feel free to use this QR code, which should take you to the MathWorks careers page listing the job opportunities, or get in touch with me. I'd like to thank you all for attending, and I hope you got something to take away from this talk. We learned a lot during this journey, and I think it should be super helpful for building a scalable system. I'll be available for Q&A now. Thank you.