Welcome, everyone. Today I'll be talking about how to keep calm and containerd on. My name is Anusha Raghunathan, and I'm a principal software engineer at Intuit. Intuit is a fintech company that makes software for tax preparation, accounting, and consumer credit reports, so if you have ever used TurboTax, Mint, Credit Karma, or QuickBooks, that's us.

The agenda for today: we'll start with why we're even doing this talk, the background, and what CRIs are. Then we'll cover how we planned the migration to the new CRI, what happened during our great migration, and some performance analysis we did with the new CRI, and we'll finish with takeaways.

Before we get started, I want to give a quick introduction to our Kubernetes-based infrastructure. We run 220-plus clusters, averaging about 16,000 nodes, and that node count goes up quite a bit during our tax peak seasons. We have roughly 15,000 Kubernetes namespaces. We run about 2,000 production services on this Kubernetes-based infrastructure, serving about 5,000 developers, and we manage about 17,000 assets with it. Each Kubernetes cluster runs about 25 add-ons on top of the vanilla Kubernetes cluster we get. These are primarily around security and compliance and cluster lifecycle management; I want to call out the Keiko project (keikoproj), an Intuit-started open-source project that manages our cluster add-ons, instance management, and upgrade management. Then we have add-ons around observability (metrics, tracing, and logging), networking (CNI and service mesh), storage, reliability testing around chaos experiments, and finally container-native workflow engines, specifically Argo Workflows.

Now, what's a CRI? CRI stands for Container Runtime Interface. Kubernetes has interfaces for running your storage and your container networking, CSI and CNI. Similarly, for running containers and doing image management, there is a CRI interface that the kubelet calls out to. A high-level container runtime, like dockershim or containerd, manages the lifecycle of containers and images and in turn calls out to a low-level container runtime such as runc. The CRI is a well-established, gRPC-based interface that the kubelet calls over, and examples of container runtimes, like I mentioned, are dockershim, CRI-O, and containerd.

Now, why are we talking about containerd and the CRI at this point in time? Typically, your average Kubernetes operator doesn't have to worry about these things. We have to worry about it now because of something that happened in Kubernetes 1.20: dockershim, the primary container runtime used in Kubernetes, was deprecated. A little bit of history: the Docker daemon was the primary runtime for Kubernetes, but it was not CRI-compliant, mainly because the Docker daemon was written well before Kubernetes originated. So a thin shim layer was written on top of the Docker daemon to make it CRI-compliant. The problem was that the shim struggled to find maintainership, and there were rising container runtimes that were CRI-compliant. So in Kubernetes 1.20, dockershim was deprecated as the default CRI, and in upstream Kubernetes 1.24 it is going to be removed. That's why we all have to worry about CRI runtimes now. So we decided to pick containerd as our CRI runtime.
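If you're curious which runtime your own nodes are on today, a quick check along these lines will tell you; this isn't anything Intuit-specific, just standard kubectl output:

```bash
# The CONTAINER-RUNTIME column shows, for example, docker://19.3.x or containerd://1.4.x.
kubectl get nodes -o wide

# Or pull just the runtime version reported by each node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
```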
Why containerd? Because the Docker daemon eventually calls containerd anyway today, even when it's not the CRI. It's been battle-tested really well, it offers better performance in terms of CPU and memory consumption as well as pod startup times, and it's supported by our cloud provider. So we went with it; it was an easy choice. Here is a quick comparison of the container runtime invocation between dockershim and containerd. With dockershim, you'll notice the extra hops between the kubelet and dockershim before eventually reaching containerd. On the right side, the kubelet calls containerd directly, and the extra hops are eliminated.

So how did we go about planning our migration? First, you need to understand the CRI wiring in your cluster. Look at your worker nodes and see in how many places the Docker daemon and dockershim sockets are exposed to your cluster components. In our case, the blue Docker whale in the middle of the diagram was the heart of a lot of things, exposing both the dockershim socket and the Docker daemon socket. At the bottom of the diagram is a box with all the add-ons that were relying on the CRI socket: the CNI, our chaos engineering toolkit, the Falco add-on, which we use for security scanning of our container runtime, and the Argo Workflows controller. The other obvious clients of the Docker daemon were the kubelet, making its client connection over gRPC, and the Docker CLI and API. The settings on the left side were mainly Docker daemon configuration; as you can see, there were SELinux policies we were writing based on the Docker daemon's GPU configuration for our ML workloads, as well as container log management. Another indirect dependency was that we expected container logs to be in JSON format, which the Docker daemon provided, and we were using Fluentd to ship them out. So we had to rewire our worker nodes to use containerd or other mechanisms when we migrated.

Here is how we rewired for containerd. We baked in the containerd client crictl, which worked out well for us, and crictl talks to the containerd socket. The containerd socket is also what the add-ons at the bottom communicate with, so the CNI, chaos, and Falco plugins all ended up working really well in our new containerd setup. Notice that the containerd config file still carries the GPU config and SELinux settings, but container log management has now moved to the kubelet; it's no longer handled by the container runtime, you just have to configure your kubelet accordingly. Also notice that our Argo Workflows controller no longer depends on the CRI socket at all. It had depended on Docker mainly because it needed a primitive for sharing container namespaces: two containers needed to share the same process namespace. Since the kubelet provides process namespace sharing, our dependency on the container runtime was removed and became a dependency on the kubelet instead. And finally, the direct dependency I mentioned around the JSON log format: containerd doesn't have a concept of logging plugins, and it only writes container logs in a text format. We'll see what happened because of that in a bit.

A word about the containerd client crictl: we baked this client into all of our worker nodes.
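To make that wiring concrete, here is a minimal sketch of the node-level pieces described above; the socket paths, log-rotation sizes, and the use of KUBELET_EXTRA_ARGS are assumptions based on common defaults, not our actual bootstrap code:

```bash
# Point crictl at containerd instead of the Docker/dockershim socket.
cat <<'EOF' > /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
EOF

# Point the kubelet at containerd and let it own container log rotation.
# These flags apply to pre-1.24 kubelets, where dockershim is still the default runtime.
KUBELET_EXTRA_ARGS="--container-runtime=remote \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --container-log-max-size=50Mi \
  --container-log-max-files=5"
```

With that in place, running crictl info on a node is a quick sanity check that crictl and the kubelet are talking to the runtime you expect.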
And for the most part, we were able to find parity with the Docker CLI and API. The main handy commands we use are crictl ps and crictl images, and if you were used to docker inspect, you can get the same kind of information with crictl inspect or crictl info, which in our case we used primarily to work around SELinux policies. So do explore crictl as your containerd CRI client. We couldn't find exact parity for some use cases, though. One thing I'd like to call out is the docker system prune command, which we were used to for cleaning up leaked containers. Again, since Docker was more than just a CRI, it did a lot more than container and image management, so for pruning we had to put some hacky shell scripting and cron jobs on top of crictl, and we were able to get it done. The big takeaway is to get familiar with a containerd client so that it's helpful during your migration.

Another containerd client, the obvious one, is the kubelet. One thing that was handy for us during the migration is that the kubelet can be configured to use a different CRI; the options are the --container-runtime and --container-runtime-endpoint flags, as in the sketch earlier, so you can spin up a test cluster today by setting them and you will get a containerd cluster. It's easy to jump-start on that and see which parts of your bootstrap code you need to change in order to perform your migration. And, like I mentioned earlier, log management has now moved from the CRI to the kubelet, so get familiar with the kubelet's log rotation options for maximum file size and maximum number of files.

All right. So we planned our migration. We made all the code changes required in our cluster bootstrapper, reconfigured things to use the containerd configuration rather than Docker's, and changed the code for all our add-ons to accommodate containerd. A lot of our end-user teams and platform teams had also migrated their code paths. One thing I'd like to mention here is that at Intuit we go through monthly cluster upgrade cycles, mainly for security and compliance reasons, and also as a time to introduce new features to our Kubernetes platform. That is when we migrated from dockerd to containerd. And one thing to note is that we perform rolling upgrades of our clusters, which means you start out with Docker as the CRI for your cluster and you end up with containerd as the CRI, but there is a period of time during the cluster upgrade when you have a mix of both kinds of nodes in your cluster. And we guarantee our end-user developers zero downtime during the upgrade.

So we planned and planned and did the right things. Any guesses on whether it went smoothly, or did we hit any gotchas? Well, of course we had gotchas, and I'm going to talk about two of them today. The first was about the logging pipeline. Before we get into the details of the problem and how we solved it, a quick refresher on our logging pipeline. We use Fluentd as our node-level agent for logging. Fluentd is responsible for log collection, log parsing where needed, and shipping to our log aggregator, which is Splunk. Fluentd takes care of shipping pretty much all the logs on our nodes, but the critical ones are container logs, and we cannot afford to miss even a small gap of logs. Log loss is a big no-no in our infrastructure. And Fluentd runs as a DaemonSet; it's a cluster add-on.
So what was the first problem? Our entire logging pipeline had the assumption of JSON-formatted container logs, and containerd broke that assumption. From the container logs to Fluentd to how they reach our Splunk servers, everything was expected to be in JSON format, because JSON had offered better performance compared to the other formats we had considered. Because of the change in log format, there were portions of our pipeline that did not recognize the new format, and we ended up with log loss. And like I said, that was a big no-no.

How did we solve it? Luckily, even though the container logs were in text format, there is a predefined specification for that format: a timestamp, the stream, a tag, and the actual log message, all space-delimited. We could pretty much use Fluentd configuration to parse it out with a regex (there's a sketch of that parse block at the end of this section), so that's precisely what we did. We changed our bootstrap code so that each node recorded which CRI runtime it came up with, and our Fluentd bootstrap container would read that file and load the right Fluentd ConfigMap. For Docker, it remained the same JSON-based configuration; for containerd, it was a regex parser that extracts the log message and eventually sends it out to Splunk. One thing to note is that with this approach we did observe a 17% dip in log throughput, and that was because of the additional regex overhead we were seeing in Fluentd.

What was the second problem? The way our cluster upgrades work is that the nodes are rotated out first, and then the add-ons are upgraded. So there was a period of time when the nodes had been rotated out and the new CRI was in place, but Fluentd did not yet recognize that it needed the bootstrapping changes we had made. As a result, there was, again, log loss during that window while the clusters were getting upgraded. How did we solve it? It was pretty simple: we got our code into the platform release before the actual migration to containerd. That was one big takeaway for us: you can't just make all the wiring changes, you have to get them into a release at least one release, n minus one, before your actual CRI migration.
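For reference, the kind of Fluentd tail-and-parse block described above for the CRI log format looks roughly like this; the file paths, tag, and layout are assumptions based on common Fluentd-on-Kubernetes setups, not our production config:

```bash
cat <<'EOF' > /etc/fluent/conf.d/containerd-containers.conf
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type regexp
    # timestamp, stream, tag (F = full line, P = partial), then the log message
    expression /^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.*)$/
    time_format %Y-%m-%dT%H:%M:%S.%N%:z
  </parse>
</source>
EOF
```

The four capture groups mirror the four space-delimited CRI fields, with the log field carrying the message that ultimately gets shipped to the aggregator.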
The second gotcha I'm going to talk about today is regarding the CNI. Again, a quick refresher: the CNI we use, along with wiring up the pod and host networking stacks for container networking, also runs an IPAM daemon. The IPAM daemon is a local host daemon responsible for two things: maintaining a warm pool of IP addresses to hand out to pods, and allocating and deallocating IP addresses for pods. For IP allocation and deallocation, it queries the CRI socket to get the list of running pods. And the CNI runs as a DaemonSet on all our cluster nodes.

So what was the problem? To prepare our CNI for the containerd migration, we had mounted the containerd socket into the CNI pod spec, as expected by our CNI vendor. But the CNI is a bootstrap add-on for us, and bootstrap add-ons get upgraded before the nodes are rotated out. So you can see the problem, right? A Docker node would still be running, but the CNI would think it was a containerd node. It would query the containerd socket, get back an empty list of pods, and voilà, it would decide this node doesn't need any more IPs and start deallocating addresses from the Docker pods, which were actually live and running. This was, again, a big no-no for us.

How did we solve it? We created a generic symlink in our bootstrap code that points to whichever CRI socket the node has, containerd or Docker. If a node came up with the Docker daemon, the generic cri.sock would point to that; if it came up with containerd, it would point to containerd.sock. That cri.sock is what gets mounted into the CNI pod spec (there's a rough sketch of this just before the Q&A). And we made sure, again, that these changes got into a release prior to the actual migration, because you have to prepare your CNI code to handle this before the migration itself.

Let's talk about performance. We had already set the expectation that pod startup times and CPU and memory consumption would be lower with containerd, but we wanted to verify that and see how much of a performance gain we'd get. Our setup was Docker 19.03 and containerd 1.4.6 on Kubernetes 1.21. The test service is a Java Spring Boot application that very closely mimics our developer environments, and the test client generates about 6,000 transactions per second, mostly a combination of reads and writes against an in-memory database through GraphQL calls, which is a memory-heavy load. The test ramps up for the first 60 minutes and then holds a steady load of about 6,000 transactions per second for two hours, and we started with an HPA replica count of 3.

How did we do? containerd definitely fared well, especially on startup times: the maximum startup latency with containerd was only about 120 seconds in our synthetic workload, as opposed to Docker, which took about 200 seconds. We're pretty excited about this gain because it will come in handy during our tax peak seasons, when the HPA kicks in hard, we have a lot of replicas starting at the same time, and startup times really matter. And during the steady state of about 120 minutes, we noticed that because of slightly lower CPU consumption, pod and node usage was lower; containerd was just slightly more efficient, and the HPA kicks in a bit later than it does with Docker.

I'm going to wrap up with some takeaways. First, understand your cluster's CRI wiring. We thought we could just do a simple migration from Docker to containerd, and it turned out to be a several-month project for us. Second, plan to test and verify ahead of time, because Kubernetes 1.24 is coming up and there will not be a dockershim anymore. And the third big takeaway: you have to account for live cluster migration if you have a big platform like ours. With our 220-plus clusters and zero downtime, we had to make sure all the live-migration use cases were handled. Thank you very much. And we are hiring at Intuit, so if you're interested, please come and talk to me. I can take some questions now.
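Before the questions, here is the rough shape of the generic CRI socket symlink mentioned in the CNI fix; the socket paths are assumptions based on common defaults, not our exact bootstrap code:

```bash
# Runs during node bootstrap, before the CNI pod comes up.
if [ -S /run/containerd/containerd.sock ]; then
  ln -sf /run/containerd/containerd.sock /run/cri.sock   # containerd node
else
  ln -sf /var/run/dockershim.sock /run/cri.sock          # Docker node
fi
# The CNI pod spec mounts /run/cri.sock, whichever runtime the node booted with.
```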
Hello. You mentioned that your CNI needs access to the Docker socket, right? Yes. And do you know if the containerd socket is fully compatible, or did you need to change your CNI code to read it in a different format?

So the CNI basically queries the CRI socket mainly to get a list of containers and pods so that it can do IP allocation. As far as compatibility goes, the interface is pretty well established, so we didn't see any compatibility issues there. It was more about the live migration, where our CNI did not account for both kinds of CRI being present at a particular point during a cluster upgrade.

A question about GPUs. You say in the description you hit some issues with GPUs; can you say more about that, please?

The GPU issue was really a vendor issue more than a containerd issue. We get our GPU support from NVIDIA, and our cloud provider is AWS. The issue was that the NVIDIA GPU operator for containerd doesn't work out of the box on AWS; it's mainly around OS support. So we had to work closely with AWS to get a separate recipe and bake it into our AMIs. The other sort of glitch there was that the end-user license agreement between NVIDIA and AWS had to be sorted out. So far we've done some preliminary testing with GPUs and there have been no issues with containerd, although we haven't gone fully to production with it. But I think if you run the NVIDIA GPU operator on CentOS or Ubuntu, then you have out-of-the-box support. All right, thank you, everyone.