Hi, I'd like to thank everyone joining us here today. Welcome to today's CNCF webinar, 20,000 Upgrades Later: Lessons from a Year of Managed Kubernetes Upgrades. My name's Ariel Jatib. I'm a business development manager for Cloud Native Technologies at NetApp and also a CNCF ambassador. I'll be moderating today's webinar. I'd like to welcome our presenter, Adam Wolfe Gordon. He's a senior software engineer at DigitalOcean. A few housekeeping items before we get started. During the webinar, you're not going to be able to speak as an attendee. There is a Q&A box at the bottom of your screen. Please feel free to drop your questions in there and we'll get to as many of those as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF's Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that code. Basically, be respectful of all your fellow participants and presenters. Please note that a recording of this talk and the slides will be posted later today at the CNCF webinar page at cncf.io. And with that, I'll hand it over to Adam to kick off today's presentation. Great, thanks, Ariel. Thanks, everyone, for coming today. As you've just heard, I'm Adam Wolfe Gordon and I'm an engineer at DigitalOcean. Currently, I'm the tech lead for our managed Kubernetes and container registry products. And I'm going to talk today about Kubernetes upgrades and some of our experience with them. I'm going to talk about how we do upgrades and the things that we got right and wrong in that process. But more importantly, I want to talk about some lessons that we've learned from doing upgrades for about a year. And these are lessons for both cluster operators, so people who are doing upgrades on Kubernetes clusters, but also for developers and others who are deploying workloads to Kubernetes. And these are things that will help your upgrades go better, make them easier, and keep your workloads running as expected as you upgrade your cluster. And I want to start today with a little bit of background on how this talk came to be. So this talk really starts about a year ago in Barcelona. And there I am on the slide in Barcelona at KubeCon EU 2019. And in Barcelona, we at DigitalOcean announced the general availability of our managed Kubernetes product. And if you stopped by our lovely booth in Barcelona, we probably told you about how it was now GA, and you probably asked us what that actually meant. And we would have probably told you a bunch of things. I'm not going to go through all of the features of the product because this isn't really a marketing talk, but the important one that I would tell you, and that I definitely would have told you about at the booth because it was something I worked on, was that we had automated patch version upgrades. This was a very exciting new feature in our product. The thing that we probably didn't tell you at the booth was that you couldn't actually upgrade yet, because we hadn't enabled any upgrade paths for our customers. We had tested our upgrade process a whole bunch. We had run hundreds and hundreds of upgrades on test clusters. But if you went to your cluster page on DigitalOcean, you would still see that your cluster was up to date, regardless of whether it actually was. And the reason we hadn't enabled upgrades yet was because our upgrade process hadn't been exposed to the full richness of customer configurations and workloads that are possible in Kubernetes.
And we were pretty sure that we were going to find some unexpected things when we turned them on for customers. And we didn't all want to be in Barcelona enjoying KubeCon and have to deal with those things. We wanted to wait until we were back at normal work. So we waited a little bit to turn them on. As you can probably guess, once we turned on upgrades for customers, we learned a whole bunch of things. And that's why this talk is lessons from a year of managed Kubernetes upgrades. And that's really when I started thinking about giving this talk: when we turned upgrades on and started seeing what happened. When I wrote the proposal for this talk at the end of 2019, I ran some numbers and estimated that by the time KubeCon Amsterdam rolled around, where I was supposed to give this talk, we would have done about 20,000 upgrades. And that's a really nice big round number, so I put it in the title of the talk. In preparing for this webinar, I ran the numbers again, and we've actually accelerated our upgrades a little bit. We've done more like 35,000 upgrades now. And that's in about a year. So if you run the math, that's about a hundred upgrades a day across thousands and thousands of clusters. So we've done a lot of upgrades and we have a pretty good set of data to learn from. We've seen a lot of the possible things that can happen during an upgrade. So that leads to my favorite slide, which is disclaimers. I have two disclaimers for everything that I'm going to say today. First of all, the lessons I'm going to talk about today are lessons from our upgrade process at DO. And there are lots of different ways to upgrade Kubernetes. I'm going to talk about some of the variations in how you can do upgrades. But depending on how you choose to do your upgrades, you might see different things than we do. Some of the things I'll talk about today are going to be relevant; some of them are not going to be relevant. If what you take away from this talk is that you want to do upgrades a different way than we do upgrades, because you don't want to hit the same things that we hit, that's a totally valid takeaway. And I don't want to say that what we are doing is the right process for everyone. The other disclaimer is that the lessons we've learned are from upgrading our customers' clusters. And their workloads are probably not the same as your workloads. Their workloads are not the same as our workloads internally, which we also have experience with upgrading. Depending on how your workloads work and how they're configured, you might see different things during upgrades, different problems, different advantages. So let's start by talking about what you have to do when you want to upgrade a Kubernetes cluster. There are basically two parts to a Kubernetes cluster. There's the control plane, which is sometimes called the master, and there are the worker nodes. So upgrading actually sounds like a very simple process, and it fits on one small slide. First, you upgrade the control plane, and then you upgrade the worker nodes, and then you're done. That's it, you've upgraded your Kubernetes cluster. It sounds really very easy. Of course, in reality, it's bigger than that. So here's an expanded view of what you have to do. This is probably still incomplete, and it's going to vary depending on your exact environment, but when you upgrade the control plane, you're upgrading a bunch of things, and there's some ordering you have to be careful about, although some of these steps can be done in different ways.
So the first thing you're going to do is read the release notes for your new Kubernetes release, figure out whether you're using anything that's deprecated in your current version and not going to be supported in the version you're upgrading to, and you're going to update those things if you need to. So, updating any resources in your cluster that won't be supported in the new version. Then you're going to upgrade etcd, if you need a new etcd, and then you can upgrade the actual control plane components. You're going to upgrade your API server, and then your kube-controller-manager, and then your kube-scheduler. Then you can upgrade your CNI plugin for networking if you're using one, and then you can upgrade any provider-specific things. So assuming you're running in the cloud, you're probably going to have a cloud controller manager and a CSI controller, maybe some other cloud-specific or provider-specific pieces. Finally, assuming that you're running things on the master as pods or static pods, you're going to upgrade your kubelet on the master and your kubectl on the master for controlling those workloads. Once you've upgraded the control plane, you're going to upgrade your worker nodes. And this is a little bit simpler because the worker nodes don't run as much stuff, but it also takes a lot of coordination, because the worker nodes are where your workloads actually run. And your workloads are the things you care about in your Kubernetes cluster. Those are what you don't want to go down. That's your business, your applications. So the first thing you're going to do is cordon and drain a worker node. So get all of the workloads off of it so that they're running somewhere else. Then your node is empty. You can update the kubelet configuration, if there have been any changes that need to be made to the kubelet configuration. Once that's done, you can upgrade the kubelet, and you can uncordon the node and let workloads start being scheduled on it again. And you're going to rinse and repeat for each of the nodes in your cluster, however big your cluster is. You might do a few nodes at a time. If you've got capacity to drain a few nodes at a time, that can speed up the process. Assuming that you are running Kubernetes on VMs and not on bare metal, and we are running on VMs for our managed product, there's a bit of a shortcut you can take. Rather than upgrading each component individually in place on the nodes, you can just completely replace each of the nodes in the cluster. So before you start, you still need to do that initial step of making sure that everything you're using is supported in your target version. But once you've done that, to upgrade the control plane, you're going to destroy your old control plane node and create a new one that has the new versions of everything in it. This does assume that your etcd data is resilient to that. So either you have multiple etcd nodes and they can be rebuilt when you destroy one and create a new one, or your etcd is outside your cluster, or you're storing your etcd data on some kind of persistent storage. But assuming that your etcd data is safe, you can blow away your control plane node and create a brand new one that has all the new versions of everything. So that's a much simpler process than trying to update each individual component of the control plane in place.
Same thing for the worker nodes: you still need to do the draining, but once you've drained a node, you can destroy it and create a brand new node in its place. That new node is going to have the new versions of everything, the new kubelet configuration, et cetera. So if you've worked with Kubernetes a fair bit, if you've upgraded clusters before, you can probably already see some potential issues with doing an upgrade this way. And there definitely are some issues, and that's what I'm going to spend a lot of time talking about today. But there are also some advantages, and this is how we chose to implement upgrades in our managed product for our customers' clusters. We do full node replacement of each of the nodes in the cluster rather than upgrading things in place. The reasons we chose to do it that way are that there are a bunch of advantages. First off, if you upgrade by node replacement, then every node in the upgraded cluster is a clean slate. There's no chance that there was a customization made to that node that's going to persist across an upgrade and cause a problem in the new version. You know exactly what to expect on a node when it's been upgraded. And this is particularly important if you're managing lots of clusters. You want to know exactly what's going to be there, and you don't want to have to deal with all of the various customizations you can make to a worker node. You want it to really be predictable. The other nice thing about doing upgrades by node replacement is that it's easier to automate. There aren't that many steps. There are basically four operations in this process: draining a node, deleting a node, creating a node, and waiting for a node to become ready. If you've built automation for managing your clusters already, for example, automation to create a cluster, automation to do maintenance on a cluster, you've probably already automated these operations. So automating your upgrades is just combining those primitives that you already have in the right order. Finally, this process works regardless of what kind of upgrade you're doing. So you don't need to worry about whether a particular upgrade requires a CNI upgrade or not, whether it requires a new etcd or not, whether it's a minor version upgrade or a patch version upgrade. All of the upgrades to components that are going to happen are encapsulated in the images that you're using. So there's less variation between different upgrades. You're doing less version-specific work to get ready for each upgrade. And I say it mostly works for all of these types because it's not always that tidy. There are some situations where you do have to do really specific stuff, and I'll talk about some of those later. But in general, this process does work the same for any kind of upgrade that you're doing. So I said I would talk about things we got right and things we got wrong in our upgrade process. And I think this is the first thing that we really got right: choosing to do our upgrades by replacing the nodes in the cluster. It's a simpler process to understand than upgrading in place. It's easier to automate. And it was a really good choice for us, since we are managing thousands of customer clusters where we don't control the workloads running on them. It gives us a nice predictable process. It's easy to understand for the developers working on it. If you're going to manage a lot of clusters and you're building automation to manage a lot of clusters, I recommend this as an approach to at least consider, depending on your needs.
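To make those four primitive operations concrete, here is a rough Go sketch of an upgrade-by-replacement loop. The Provider interface and its method names are hypothetical placeholders, not DigitalOcean's actual automation; only the readiness wait uses real client-go calls.

```go
// A rough sketch of the drain / delete / create / wait-for-ready loop, under
// the assumption of hypothetical provider hooks. Not a real implementation.
package upgrade

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Provider abstracts the provider-specific primitives (hypothetical names).
type Provider interface {
	DrainNode(ctx context.Context, name string) error               // cordon + evict pods
	DeleteNode(ctx context.Context, name string) error              // destroy the VM
	CreateNode(ctx context.Context, version string) (string, error) // boot a replacement at the target version
}

// ReplaceNodes upgrades worker nodes one at a time by replacement, bounding
// each drain with a timeout so a stuck workload can't hold the upgrade forever.
func ReplaceNodes(ctx context.Context, p Provider, client kubernetes.Interface, nodes []string, version string, drainTimeout time.Duration) error {
	for _, old := range nodes {
		drainCtx, cancel := context.WithTimeout(ctx, drainTimeout)
		err := p.DrainNode(drainCtx, old)
		cancel()
		if err != nil {
			return fmt.Errorf("draining %s: %w", old, err)
		}
		if err := p.DeleteNode(ctx, old); err != nil {
			return fmt.Errorf("deleting %s: %w", old, err)
		}
		replacement, err := p.CreateNode(ctx, version)
		if err != nil {
			return fmt.Errorf("creating replacement for %s: %w", old, err)
		}
		if err := waitForNodeReady(ctx, client, replacement); err != nil {
			return err
		}
	}
	return nil
}

// waitForNodeReady polls until the node's Ready condition is True.
func waitForNodeReady(ctx context.Context, client kubernetes.Interface, name string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		node, err := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					return nil
				}
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("node %s not ready: %w", name, ctx.Err())
		case <-ticker.C:
		}
	}
}
```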
Of course, there are some problems with this process. And the basic problem is that it's a lot of change when you do an upgrade. You're totally replacing each node with a brand new, totally different node. So any custom configuration that you've done on a node, like changing sysctl values, for example, is going to be reset when you do your upgrade. And this bit some of our customers who were doing those kinds of customizations manually: when they did an upgrade, their worker nodes came back and they didn't have their custom configuration. Likewise, at least on our platform, when you do an upgrade, every node is going to have a new name in Kubernetes. It's going to have a new IP address, and it's not going to have any labels or taints that are on the old node. This really bit some of our customers who were scheduling their workloads directly based on node names or directly based on labels, or who were directly accessing their nodes by IP rather than using our managed load balancer. So this was a surprise for a lot of our customers, and we've had to work to fix a lot of those issues and make their use cases work. So some lessons for cluster operators here: if you're doing upgrades by node replacement, it is helpful to your users to reuse node names and IP addresses when you replace nodes, if that's possible. Workloads probably shouldn't expect that that's going to happen. It's not great to expect that a node name is going to persist forever in Kubernetes. But scheduling by node name is a tool that you can use, and someone's going to use it. So if you can make that work, it will reduce some problems. Regardless of whether you are going to do that or were able to do that, you definitely want to make sure that you're retaining labels and taints in some way. People who are deploying workloads to Kubernetes do want some level of control over how they're scheduled and which nodes they're scheduled on. And labels and taints are the right tools to use for that in Kubernetes. So providing some way to set persistent labels and taints that will survive an upgrade is an important thing when you're building upgrade automation. Finally, and kind of likewise, providing a good ingress or load balancing solution that works with your clusters is important. Getting traffic into a Kubernetes cluster is actually kind of tricky, and that's worth a whole talk on its own that I'm not going to give today. But almost everyone needs to do it. You usually have some kind of traffic coming into the workloads you're running in Kubernetes. The easier you make it for people to do that, the less likely people are to build their own solution for it and end up relying on node IP addresses or node host names. Things like that can cause problems during an upgrade or any other kind of maintenance once you start building on them. Some lessons for developers here: these are things that will really help make your workloads more resilient to upgrades and other kinds of node replacement. So first off, if you need to customize things about a node, like the sysctl values that I mentioned, you're best off using Kubernetes primitives to do that. Two good ways to do that are either using a privileged daemon set that's going to run on every node and make the customization that you need to make, or using an init container as part of a workload that requires a customization.
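As a rough illustration of the privileged daemon set approach, here is a minimal sketch in Go types. The names, namespace, image, and the particular sysctl are placeholders, not anything shown in the talk.

```go
// A minimal sketch of a privileged DaemonSet that applies a sysctl on every
// node. All names and values here are illustrative placeholders.
package nodesetup

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func boolPtr(b bool) *bool { return &b }

// SysctlTunerDaemonSet runs a privileged init container on every node to set a
// sysctl, then parks a tiny pause container. Because it is a DaemonSet, any
// node created during an upgrade gets the same customization automatically.
func SysctlTunerDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"app": "sysctl-tuner"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "sysctl-tuner", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					InitContainers: []corev1.Container{{
						Name:    "set-sysctl",
						Image:   "busybox:1.36",
						Command: []string{"sh", "-c", "sysctl -w net.core.somaxconn=4096"},
						// Privileged so the container can write node-level sysctls.
						SecurityContext: &corev1.SecurityContext{Privileged: boolPtr(true)},
					}},
					Containers: []corev1.Container{{
						Name:  "pause",
						Image: "registry.k8s.io/pause:3.9",
					}},
				},
			},
		},
	}
}
```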
Either way, what you're going to end up with is something that gets scheduled by the Kubernetes scheduler on each node that needs the customization, and that makes the customization, so that you're not doing it manually. And that way, if your node goes away and a new one gets created, it's going to get the customization. Secondly, don't use node names for scheduling. I mentioned you can do it, but it really is not a good idea. The Kubernetes philosophy is that nodes are livestock, not pets: nodes are going to go away at some point. If you're doing upgrades the way we are, they'll go away during an upgrade, but they might go away at other times, for maintenance or because of hardware failure or whatever. If you want some control over scheduling, you're much better off using labels and taints and learning how to set those in a persistent way in your environment. On some providers, or if you're managing your own cluster, that's going to mean just setting the labels or setting the taints on the node directly through Kubernetes. On other platforms, like our managed platform, you have to create a node pool or some other abstraction, or configure a label in the management layer, so that it gets applied to the new nodes that you're creating. So read your provider's docs if you're using a managed Kubernetes service, or ask your cluster operator if you're not managing your own Kubernetes cluster. Make sure you understand what happens when a node goes away and gets replaced, and how to get labels set appropriately. Likewise, you are always best off using a supported ingress or load balancing service that's provided by your cluster provider if you can. That's going to make sure that traffic keeps getting to your nodes when they are replaced. It's going to make sure that your traffic keeps flowing during an upgrade, and that's a good best practice to follow. There are always going to be use cases where it doesn't work and you need to build your own thing, and there are totally valid reasons to do that, but I would say take that as a last resort. Try not to point things directly at Kubernetes nodes; try to use Services, load balancers, et cetera. So, there are some things that we definitely got wrong in our upgrade process, and I want to talk about a couple of those now. The first big one is that we implemented our node replacement process in exactly the way that I described earlier, which is break before make. So we drain a node and delete that node, and then we provision a replacement for it. And we did it this way for some reasons specific to how our product works internally, but it really causes trouble, and this is actually something we're working on fixing right now in our product to make things better for our customers. There are a few basic problems that this causes. They're all basically related to draining nodes. So first off, if a customer or a user is running right at the limits of their cluster, their cluster is basically full to capacity, then it might not always be possible to drain a node's worth of workloads to another node. There might not even be another node. We do have some users who have single node clusters. Hopefully they're not using them for production workloads, but they do exist. And so if we try to drain their single worker node, there's just nowhere for those workloads to go. They're going to go down. Either way, because of capacity or because you decided to have a single node cluster, you're going to end up with downtime for your workloads if they can't be drained to somewhere.
And regardless of those issues, even if you don't have capacity issues at all, another issue with break before make is just the extra churn it causes for workloads. When you drain the first node in a cluster in our scheme, the workloads that are running on it are guaranteed to end up on another node that's still running the old version. And that node is going to have to be drained and replaced right away as well. So those workloads are going to be drained or evicted twice instead of just once. And that's just a little bit of extra churn, extra chance for things to go wrong. Not great for the workloads. So the lesson for operators here is pretty simple. If you're going to do upgrades by node replacement, it's best to figure out a way to create the new nodes before you delete the old ones. This might be a little bit more complicated to automate, like it is for us for various reasons. But it's really a much better experience. If we could go back in time, this is how I would build our upgrade process. If you really can't do that, for example, if you were running a hardware cluster, then you can't really add nodes before you drain nodes. You might want to consider reserving some capacity for upgrades. So, having a node that's not usually schedulable that you enable during an upgrade, which gives you just somewhere for workloads to drain to if you're near capacity. That might be kind of expensive, but it might save you headaches as well. For developers, the lesson here isn't really specific to upgrades. It's just that your Kubernetes platform is eventually going to lose a node. A node is going to have to be drained for an upgrade or for maintenance or for some other reason. So leave some capacity. Make sure that at least one node's worth of workload can be drained to somewhere, that there's somewhere for it to go when a node needs to go away. That's just a good practice to keep your workloads running smoothly through not only upgrades, but also node failures and other kinds of operations. A related thing that we got wrong was that we replaced nodes exactly one by one. So we destroy one node, create one node, destroy one node, create one node, until they're all replaced. And this is just fine for a three node cluster or a five node cluster. It's not great for a 300 node cluster, because it just takes a long time. And it gets really bad if you have a big cluster and the workloads don't evict quickly, so your nodes don't get drained quickly. We end up hitting a drain timeout. And we had initially set our timeout for drains to an hour. And then we scaled it back to 15 minutes, because we decided workloads should not need an hour to be evicted. But even on, say, a 20 node cluster, which is a very common size of Kubernetes cluster, if you take 15 minutes to drain each node, that's five hours just of draining. So your upgrade is going to take more than five hours. And that's a long time to be waiting for your cluster to do an upgrade. To restate that a little bit more concisely: replacing nodes one by one is just slow. And it can be even slower if you have workloads that get stuck. So, upgrades can only be so fast; they're going to take some time. We want to make them as expedient as possible. Most users of Kubernetes are going to want to sort of watch their upgrades or keep an eye on their cluster during an upgrade to make sure nothing goes wrong, because it is a very disruptive operation. And you don't want to leave them watching a cluster upgrade for five hours or 12 hours or something.
You want to make it as fast as you really can. So for operators, the lesson is really simple. Replace multiple nodes at once if you can. That will just help you upgrade a big cluster quickly. This kind of requires that you do make before break, not what we did, break before make. Users may have capacity in their cluster to absorb one node's worth of workload when you need to drain a node. They probably haven't set aside 10 nodes of capacity if you're going to drain 10 nodes at a time or replace them. So you kind of have to do the make before break if you're doing multiple nodes. The other thing is, set reasonable drain timeouts. Don't wait forever for a node to drain, because sometimes a node is just not going to drain. Workloads don't drain instantly. It takes some time for a process to respond to a signal and be evicted. But it also shouldn't need an hour. If you set a good timeout, and like I said, I think our current timeout is 15 minutes, that's going to help make sure that you at least have an upper bound on how long it takes to replace a node. And that upper bound is somewhat reasonable. For developers, there's not a lot you can do about how your cluster operator or provider replaces your nodes or how they do upgrades. But you can help with the draining aspect. There are two aspects to this. You want to make sure that your workloads can be evicted safely. So use pod disruption budgets and other mechanisms in Kubernetes to make sure that enough pods of your workloads stay up all the time. The other piece is trying to make sure that you can be evicted quickly. So respond to signals appropriately; try to make sure that your application shuts down quickly and safely, so that when a node does get drained for whatever reason, including an upgrade, it's a happy and fast process. And you should test this. It's really great to try draining a node with your workloads running on it and make sure that it doesn't cause any problems for your application. Make sure that it drains nicely. Eventually, a node is going to get drained if you are doing an upgrade. No matter how you do your upgrade, you are going to have to drain nodes. So it's really an eventuality. You want to make sure that your workload and your application can handle it. Back to the sort of positive side of things. This is something that we got wrong, but we are very happy we got it wrong. I mentioned earlier that when we did our GA, we only offered automated patch version upgrades. So for example, that would be 1.14.1 to 1.14.2, but not 1.14.2 to 1.15.0, which is a minor version upgrade. We started out with patch version upgrades because they're a bit simpler. Resources in Kubernetes aren't supposed to change between patch versions, so everything should basically continue working in your cluster when you do a patch version upgrade, without any changes to the stuff you have deployed. It was a good idea for us to start that way. They are simpler, and we learned some things that made our lives easier when we got to doing minor version upgrades. On the other hand, when we started testing minor version upgrades, we found that they mostly just worked, and we had probably worried a bit too much about them and took longer than we needed to implement them. The same basic process that we use for patch version upgrades works for minor version upgrades as well. They are just easier than we expected in general. There was very little that we had to fix or do for specific versions to make it work.
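Looping back to the eviction advice a moment ago, here is a minimal sketch of its two halves in Go types: a PodDisruptionBudget so a drain can never take the whole workload down at once, and a short termination grace period so evictions finish quickly. The app names, labels, and values are placeholders.

```go
// Illustrative objects for eviction-friendly workloads; names are placeholders.
package workloads

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func int64Ptr(i int64) *int64 { return &i }

// WebPDB keeps at least two replicas of the hypothetical "web" app running
// while nodes are drained, so an upgrade cannot evict the whole workload at once.
func WebPDB() *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(2)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		},
	}
}

// FastShutdownPodSpec sketches the "be evictable quickly" half: handle SIGTERM
// promptly in the application and keep the grace period short so a drain does
// not sit at the drain timeout.
func FastShutdownPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: int64Ptr(20),
		Containers: []corev1.Container{{
			Name:  "web",
			Image: "example.com/web:1.0", // placeholder image
		}},
	}
}
```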
I will talk in a bit about some of the things that we found we had to do specifically for certain minor versions. The lesson here for operators is really simple. Just don't worry so much about minor version upgrades. It turns out that all upgrades are disruptive, and minor versions aren't that much more disruptive than patch version upgrades. They are probably less scary than you are expecting. One decision we did make that really helped the minor version story go well was that we leave most Kubernetes alpha features disabled. They are disabled by default, and we don't change that configuration. Alpha features are the things that are most likely to change or be deprecated between releases. So if you leave them disabled, there is just a whole class of problems that you're not going to have to worry about. Specifically, you're not going to have to worry about changing things that are using alpha features to make them work in the new version. The lesson for operators here is, again, pretty simple. I would recommend leaving alpha features off by default. They are much more likely to change or break between releases. Like I said, you're going to just eliminate a class of problems by leaving them disabled. If you or your users do have a reason to use an alpha feature, just consider it as a trade-off. There's value in the feature; it's also potentially going to cause pain at upgrade time. It's something you're going to have to think about. There is one alpha feature that we did enable, which is CSI snapshots. We felt that offered a lot of value for our users. It's something they requested. So we did enable it in our clusters, and we're actually doing the work right now to migrate away from those alpha snapshots. And it is one extra piece we're going to have to take care of in a future minor version upgrade, to make sure that we migrate from the alpha version of that to the beta version of that. For developers, I think the lesson is similar, and this is something you have a lot of control over. Regardless of whether alpha features are enabled, you can decide whether to use them or not. I would be sort of reluctant to use them if you can avoid it. Use them as kind of a last resort. If you do need to use one, be extra vigilant around upgrades. Look at the release notes. Make sure that you know when your alpha feature is becoming beta or has a breaking change. And try to make sure that your usage is compatible with the next release before you do an upgrade, just so that you don't have any surprises and you're not counting on a process that you don't control, which is maybe the upgrade process, to take care of it for you. I want to spend basically the rest of this talk today talking about two common classes of problems that we've seen with upgrades. And the first one is issues with the Container Storage Interface, or CSI, component in Kubernetes. So for those of you who maybe aren't familiar with it, CSI is a pluggable way to provide storage to containers. It's an abstraction layer between Kubernetes, or other orchestrators, and storage providers. It's an orchestrator-agnostic framework. It's an abstraction layer that allows you to present storage to containers for them to use in a sort of abstracted way, so that you're not building directly against Kubernetes. Kubernetes clusters on DigitalOcean, whether they're using our managed offering or managed by a customer, are able to use our open source CSI plug-in to attach our persistent block storage to their workloads.
And this is the mechanism we recommend for any user that needs persistence in their Kubernetes clusters on DigitalOcean, because your CSI volumes are completely outside of your cluster. They're going to survive an upgrade. They can survive your cluster being deleted, et cetera. So we've seen a few different problems with CSI, and I'll talk about a couple of them specifically as they relate to upgrades. The first issue that we've seen in CSI is just that it was generally immature when we started using it. The first release where we supported upgrades in Kubernetes was Kubernetes 1.10, and that was the same release where CSI was promoted from alpha to beta. So in that 1.10 time frame, the Kubernetes components that support CSI were relatively new, and most of the CSI drivers, including ours, were relatively new. So unsurprisingly, there were some bugs in both of those things. Upgrading a cluster, like we've talked about a lot, requires draining the nodes in the cluster, and when you drain a node, if any workload on that node is using persistent volumes, those volumes are going to have to be detached and then reattached to another node so that the workload can run there. So there's a lot of CSI interaction going on in that process, and we hit a number of issues in both the upstream CSI components in Kubernetes and also in our own CSI driver that essentially resulted in the state of volumes being out of sync between Kubernetes and the real world. The symptoms that we would hit were that a node wouldn't be able to be fully drained, it would hit the drain timeout because CSI was trying to detach a volume that actually wasn't attached to the node; or we would drain a node and try to reschedule the workloads on another node and not be able to attach the volumes, because CSI thought the volume was still attached to a different node or thought it was already attached. Those kinds of issues. The nice thing is CSI has matured a lot in the last few Kubernetes releases, so in 1.14 and later we see very, very few CSI problems. Upgrades in 1.14 and later have been very smooth with regard to CSI, and it's really taken a lot of strides. So if you're on a newer release, I wouldn't really worry about it. The other problem we hit related to CSI is also kind of related to the fact that it was not that mature when we started. Every CSI driver has a name, and the convention that's defined in the CSI specification is to name them on a domain name basis, kind of like a Java class name, if you're familiar with those. In the early versions of the CSI spec, that convention was reverse FQDN naming. So if you were the example corporation, you would call your driver com.example.csi. In later versions, it changed to be forward FQDN, so now if you're the example corporation, you would call it csi.example.com. And we changed our driver's name when the spec changed, from the com.digitalocean form to the digitalocean.com form. And this name ends up being used in a bunch of places in Kubernetes. It gets used when your driver is registered with the Kubernetes subsystem that manages drivers. It also gets used in the storage class, and it's set as a field on all the volumes that are created by the driver. This is essentially how Kubernetes correlates a particular persistent volume to the storage driver that's supposed to manage it. And so, for good reason, it's immutable. Once you register a storage driver in Kubernetes, a CSI driver, you can't really change its name, because that name's been propagated to all the volumes it created.
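To show one of the places that driver name lands, here is a sketch of a StorageClass in Go types. The provisioner value is a placeholder, not DigitalOcean's real driver name.

```go
// Illustrative StorageClass; the provisioner string is a placeholder.
package storage

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ExampleStorageClass shows where a CSI driver's name ends up: the Provisioner
// field ties every volume created from this class back to the named driver,
// which is why renaming a driver strands the volumes it already created.
func ExampleStorageClass() *storagev1.StorageClass {
	return &storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "example-block-storage"},
		Provisioner: "csi.example.com",
	}
}
```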
So when we went to upgrade from a CSI driver release that used the old name to one that used the new name, we started hitting a problem. And the problem was basically that Kubernetes no longer knew that those persistent volumes we'd created with the old driver should be managed by the new driver. And those volumes became unmanageable. You could no longer attach them to workloads. And if you tried to drain a node, they wouldn't get detached and reattached. So our solution was to make the name configurable in our driver. It defaults to the new, spec-correct thing to do, which is the forward FQDN naming, but it is overridable by an environment variable. And what we do when we upgrade is we detect whether a cluster was using the old name. If it was, then we configure the new version to also use the old name, so that the name doesn't change. And unfortunately, this will probably be part of our upgrade automation forever, since we have to keep supporting clusters that have been upgraded through various minor versions of Kubernetes. And I guess that's one of the few version-specific things that we've had to build into our upgrade process: detecting that change and persisting it. So, a couple quick lessons which are applicable for both operators and developers. If you're using CSI, I would just recommend carefully testing your upgrades and seeing what can go wrong. There is a lot that can go wrong in coordinating volume moves between nodes. And the data on your volumes is probably important to you. That's why you put it on a persistent volume in the first place. So you want to make sure that your data is safe and your workloads are going to work as expected. Watch out for any workloads that get stuck, for nodes that get stuck draining, et cetera; those are a bunch of the common issues. And be especially vigilant if you're using an older Kubernetes release, I would say before 1.14. Upgrade your release if you can. The catch-22 is that if you upgrade, you'll have fewer CSI problems in the future, but you're also more likely to hit a problem during that upgrade. But like I said, in 1.14 and later, you're much less likely to hit these problems. We really have seen very few issues in newer releases. I've saved the big one for last. This is probably the most common problem we see in Kubernetes upgrades to this day. It's a problem with admission control webhooks. And these problems are possible in any environment with any upgrade process. So I'm going to spend a bit of time on them, because I think this is a problem that's been a big pain for us and a lot of people are likely to hit. So for anyone who hasn't seen admission control webhooks before, I'll give a quick overview. An admission control webhook is a configuration you can make in Kubernetes to have an external service determine whether a resource can be created or not. And there are two kinds of admission control webhooks. There are validating ones and mutating ones. The mutating ones can modify a resource before it's created. A validating one just determines whether it can be created or not. And for our purposes, in the rest of this, they're identical. There's no difference between them. So my example is going to be a validating webhook, but the same problem applies to mutating ones. The sequence diagram on the slide here shows what happens when you try to create something in Kubernetes with an admission control webhook in play. So you make your call to the API server to create your resource. And the API server is going to make a call out to your webhook service that you've configured.
And it's very common to run these webhook services inside your cluster as a Kubernetes workload. You can run them outside, and I'll talk about reasons you might want to do that. But it's very, very common to run them inside your cluster. The webhook is going to return a response that says allowed true or allowed false. And that's how the API server determines whether or not it's allowed to create the object. Assuming that it is allowed to, it's then going to go ahead, do its normal thing, create the object, and everything is good. I want to be really clear that there are lots of good use cases for admission control webhooks. Authorization is a common one. Validation and enforcement of best practices is a good one. Injecting sidecars for things like service meshes is also common. There's nothing wrong with using admission control webhooks, and you should definitely use them. They're a great tool. I'm going to talk about how to make them safe for upgrades and the problems they can cause during upgrades. The problems all relate to what happens if the webhook service is not running and it can't respond to the API server. So looking at our sequence diagram again, what happens if you go to create a resource with your API server, it calls out to the webhook service, and it just doesn't get a response? Well, what happens depends a little bit on how you've configured your webhook. First of all, it depends on the failure policy. The failure policy field can be either Fail or Ignore. If it's Fail, then if the webhook service isn't available and the API server doesn't get a response, it's going to act as if the webhook disallowed the creation. So your resource creation is going to fail. If you have it set to Ignore, then it's going to act as if the webhook just doesn't exist. It's going to go ahead and create the resource. We'll come back to that in a minute, but I want to talk about how this affects upgrades. So the problem for upgrades is that during a Kubernetes upgrade, we're going to update a bunch of system components that run in a Kubernetes cluster as workloads. These are mostly in the kube-system namespace, but they might be in other namespaces too, depending on how you configure your cluster. Some examples would be CoreDNS or kube-proxy. These are things that run on your nodes as Kubernetes workloads and are scheduled by the API server and various controllers. Webhooks can prevent these updates from happening. They can prevent the definitions of your system components from being updated. They can also prevent new pods from being created for your system components. And webhooks can prevent the services that back them from being scheduled. So if you're running your webhook service in your cluster, which I mentioned is a very common configuration, it can potentially prevent itself from being started, and then your webhook service is never going to work again. And that's a bigger problem. So coming back to this webhook configuration for a minute: this one applies to pod creation, and it applies to pods in any namespace. So when you try to create any pod in your cluster, this webhook is going to be activated, and it has a failure policy of Fail. So let's look at what happens during an upgrade. Say our webhook service is deployed in the cluster as a deployment, and we're going to start doing our upgrade. We have a node that's running the webhook service and also the other normal cluster stuff.
When we start our upgrade, we're going to drain this node, and the webhook service pod is going to be killed. The deployment controller is not going to be able to create a new pod for the webhook service, because when it tries, the API server is going to try to reach out to the webhook service to ask whether it can create the pod, and that call is going to fail; the failure policy is Fail, so it's not going to create it. So when we bring up a new node, the webhook service is not running, because the deployment controller was not able to create a new pod for it. And the daemon set controller is now going to try to create system components like kube-proxy and our Cilium CNI driver on the new node, and it's not going to be able to, again, because it's going to try to create the pod, it's going to go to the webhook service, the webhook service isn't running, and it's going to fail. So at that point, your cluster has nodes that are just completely unusable. You can see a simple solution to this, which is to set the failure policy to Ignore, and that actually causes another problem you might not expect, because of the timeout. It turns out that almost all of the default timeouts in Kubernetes are 30 seconds. That includes the timeout for webhooks. So even if you don't specify 30 seconds as the timeout for your webhook, it's going to get 30 seconds by default. It also includes the API server timeout. When you make a request to the API server, the default timeout for that request is 30 seconds. So if you set your failure policy to Ignore, but you leave the timeout at 30 seconds, you'll end up with actually the same effect as having the failure policy set to Fail, because the API server is going to wait 30 seconds to try to get a response from the webhook server before it ignores the failure. But by the time it hits that 30 second timeout, the request has also timed out, and so the Ignore doesn't even matter; the request has already failed. So I recommend keeping your timeouts much lower than 30 seconds, regardless of what failure policy you're setting. This is actually what's recommended in the official Kubernetes docs. So this isn't just me saying it; this is in the official documentation. The configuration I'm showing on the slide here will work just fine, where you have a timeoutSeconds of five and a failure policy of Ignore. That's never going to cause any problems during an upgrade. Let's say you really do need your failure policy to be Fail, because your webhook is very important. You can still avoid upgrade problems by having your webhook not apply to the kube-system namespace or any other system-critical namespaces. A good way to do this is to set a label on your kube-system namespace and have your webhook skip namespaces that carry that label, using a namespace selector. One strategy that some teams at DigitalOcean have used is actually to have a mutating webhook that mutates webhook configurations so that they are forced to ignore kube-system. That way you can never set up a webhook that's going to cause problems. We're considering this for our managed product as well. Your webhooks should also make sure to ignore whatever namespace their own services run in, if they're running in-cluster, and also any other namespaces that run system-critical components. So, lessons for operators out of this. First of all, check that your webhook configurations are good before you start upgrading a cluster. We have an open source tool called ClusterLint that includes a check for this.
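Here is a sketch, in Go types, of a webhook configuration along the lines just described: a five-second timeout, failure policy Ignore, and a namespace selector that skips any namespace carrying an exclusion label (which you would put on kube-system and the webhook's own namespace). All names, labels, and the service reference are placeholders.

```go
// Illustrative upgrade-safe webhook configuration; names are placeholders.
package webhooks

import (
	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// SafeWebhookConfiguration combines the mitigations discussed above: a short
// timeout, Ignore on failure, and a selector that excludes labeled namespaces.
func SafeWebhookConfiguration() *admissionv1.ValidatingWebhookConfiguration {
	failurePolicy := admissionv1.Ignore
	sideEffects := admissionv1.SideEffectClassNone
	path := "/validate"
	return &admissionv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-policy.example.com"},
		Webhooks: []admissionv1.ValidatingWebhook{{
			Name: "pod-policy.example.com",
			ClientConfig: admissionv1.WebhookClientConfig{
				Service: &admissionv1.ServiceReference{
					Namespace: "webhook-system",
					Name:      "pod-policy",
					Path:      &path,
				},
			},
			Rules: []admissionv1.RuleWithOperations{{
				Operations: []admissionv1.OperationType{admissionv1.Create},
				Rule: admissionv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
			// Much lower than the 30-second default, so a dead webhook cannot
			// eat the whole API request timeout.
			TimeoutSeconds: int32Ptr(5),
			FailurePolicy:  &failurePolicy,
			// Skip namespaces labeled with a (placeholder) exclusion label,
			// e.g. kube-system and the webhook's own namespace.
			NamespaceSelector: &metav1.LabelSelector{
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      "admission.example.com/exclude",
					Operator: metav1.LabelSelectorOpDoesNotExist,
				}},
			},
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
		}},
	}
}
```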
That ClusterLint check is what we use on our customers' clusters before we upgrade them. And you can use that tool as well; like I said, it's open source. The other thing, like I mentioned, is you might want to configure a mutating webhook that mutates webhook configurations to make them harmless. That's a great way to avoid the problem ever coming up in the first place. But if you're going to do that, you might want to consider running the service for that webhook outside of your cluster, just so that it's not susceptible to these same kinds of problems. For developers, the lesson is basically what I showed in the example: be careful about your failure policy and your timeouts. And be careful with webhooks in general. They can cause big problems for the important components in your cluster. Like I said, they're a great tool to use; just be really mindful of how you configure them, so that they're not going to cause problems for the kube-system namespace or any other system-critical namespaces. That's all my content for today. I have this slide with sort of everything we talked about. I'll run through it quickly just as a recap. My first lesson today was that you might want to consider upgrading your Kubernetes cluster by node replacement instead of upgrading your nodes in place. It is a simpler process. It helps with automation. There are some problems that can come up with that, so make sure to be aware of those. Consider retaining node names and IP addresses if that's possible in your environment. Have your workloads assume that nodes are going to go away, and not refer to specific node names or specific node IP addresses. And create new nodes before you destroy old ones if that's at all possible. That's really going to help with the draining problem and with having your workloads continue running through an upgrade. Secondly, make sure that your workloads can be evicted. They are definitely going to be evicted during an upgrade, regardless of how you do it. And the more prepared you are for that, the more you test that, the better your workloads are going to fare. The next lesson was: upgrade more than one node at once if you can. That's really going to help when you have a big cluster. It'll make the upgrade process faster and smoother, and that's a good thing. My lesson after that was that minor version upgrades are probably easier than you think, especially if you avoided using or enabling alpha features. Don't worry so much about minor version upgrades. There are many advantages to upgrading to the next minor version of Kubernetes. And while you need to be careful about any upgrade, minor upgrades are not that much harder than patch version upgrades in general. The last two lessons here were around specific problems that we've seen. One is that CSI is now becoming mature. I would say now it is quite mature. But on older versions of Kubernetes it was not. So take special care when you're upgrading if you use CSI. The final one was around admission control webhooks. They can cause all kinds of trouble during an upgrade. Like I said, this is the most common problem that we see with upgrades for our customers. So if you're using admission control webhooks, check your targets, check your failure policies, check your timeouts. Make sure that those are all configured according to the Kubernetes docs and what I've told you today. You can use our ClusterLint tool to check those if you want. I'm sure there are also other tools that can check them. And that's what I had for today.
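One last sketch to go with the "don't refer to specific node names" point in the recap: schedule onto classes of nodes via labels and taints that your provider can reapply to replacement nodes, rather than via a node name. The label key, taint values, and image below are placeholders.

```go
// Illustrative scheduling by label instead of by node name; values are placeholders.
package scheduling

import corev1 "k8s.io/api/core/v1"

// PinnedByLabelPodSpec targets a class of nodes via a label rather than a
// specific node, so the pod keeps scheduling correctly after nodes are
// replaced during an upgrade.
func PinnedByLabelPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		// Prefer this over PodSpec.NodeName, which stops working as soon as
		// the named node is replaced.
		NodeSelector: map[string]string{"workload-class": "high-memory"},
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated",
			Operator: corev1.TolerationOpEqual,
			Value:    "high-memory",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example.com/app:1.0", // placeholder image
		}},
	}
}
```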
We'll move on to Q&A, which I think Ariel was going to moderate. Yeah, thanks for a great presentation and talk, Adam. We have a couple of questions in chat. I'll start with the first one, which takes us to an earlier spot in your talk: can we use init containers for node customization? Yeah. So the way that works is, when you define a pod, you can have the normal containers for the pod and you can also have init containers. The init containers run before the normal containers start, and they're going to run on the same node. So if you need to set a specific sysctl value, for example, on a node, because your workload wants a really big TCP buffer or something, you can have an init container that goes and sets that value before your workload starts or before your application starts. So that's the basic mechanism. You just, yeah, set it up to do that. Cool. Next up, David Suarez asks: regarding the CSI FQDN order change, was it an option to update the etcd data to handle it? And if so, why was that option not ideal or selected? Yeah, that's a great question. That is something we considered and it's something I experimented with. The trouble with it is we would have had to do it directly in etcd. So we would have had to go around the API server to do it. The API server disallows you from changing some of the fields where that driver name gets persisted. And we really didn't want to reach directly into etcd and hope that we got all the right places. We much preferred to leave that in place and just handle the fact that it had changed. But yeah, that's definitely a strategy we considered, and it would probably have been the preferred strategy if it had been possible to make the change via a Kubernetes mechanism rather than going directly to etcd. Christian Roman asks: what kind of unit tests do you run prior to, during, or after the cluster upgrade to validate things went well? Maybe even at individual stages, such as after an etcd upgrade, control plane upgrade, et cetera. There's a variety of things that we do. The biggest thing we do is we make sure that after we replace a worker node, we make sure it becomes ready before we start on the next one. That ensures that we're not going to take down all the worker nodes in a cluster and have nowhere to schedule workloads. It's not 100% guaranteed that the workloads are okay. There's only so much we can do about that, since we don't own the workloads and we really try not to look at them, since they belong to our customers. But we do make sure that the nodes become ready. And same between the control plane upgrade and the worker node upgrade: we make sure that the control plane components are all up and healthy, that the CNI is healthy, our cloud controller manager is healthy, our Kubernetes scheduler is healthy, all those things. That's the biggest thing that we do: just rely on the Kubernetes health status to make sure that things are happy. If you control both your cluster and your workloads, then doing health checks on your workloads would make a lot of sense. That's not really something we can do, since, like I said, we don't control the workloads. And if someone wants to configure their workload really poorly, we don't want to sort of wind up with a stuck upgrade for that. It's a bit of a trade-off being a managed provider: we want our customers' workloads to be as safe as possible even though we don't have full control over them. So, trade-off there. So I guess this question follows up a little bit on that.
If there is an issue, and I think this speaks to the managed service aspect of it: if something goes wrong with the upgrade, is there a process for raising this or flagging it for a human to come in and kind of troubleshoot? Yeah, there definitely is. We lean heavily on Prometheus metrics for this. So the process we have that runs, that reconciles clusters and does the upgrade, exposes a bunch of metrics internally to us. For example, what's the cluster that's been reconciling for the longest? So, what's the slowest upgrade that's currently in progress? And we have alerts on those things that go to our ops team and then eventually get escalated to us if there's a problem. So that's our most basic mechanism. The most common thing we see is that upgrades just get stuck because, for example, we upgrade the control plane and it never gets healthy. So that's how we catch those kinds of things. How do you, is there a good way, a good practice that you guys employ to determine, and at StackPoint we used to struggle with this a little bit, whether the customer didn't deploy an application to the cluster leveraging best practices, and then upgrades can potentially become problematic because of that? Is there some practice that you all employ to evaluate whether it's, you know, on the customer or whether it's something in the platform? I would say that determination is mostly manual at this point. When we do have something get stuck or run across problems, it's really kind of human intervention. We'll go and look. And there are some problems we'll just fix for customers, like the webhook ones, for example: we'll temporarily disable the webhook, let the upgrade proceed, and then set it back. That's not something we like to do, because it is touching the customer configuration, but if necessary we'll do it. We'll also, you know, get back to a customer and say, hey, if you change this thing in your workload, the upgrade's going to proceed. We do have a mechanism to pause the upgrade which, I mean, leaves the cluster in a kind of broken state, but at least doesn't block our visibility into other upgrades that are going on. So that's something we also sometimes do: get back to the customer and say, hey, change this and then it can proceed. Cool, yeah, very consistent with how we used to do that back then. Tan Prokowski asked: did you have to roll back an upgrade mid-upgrade? If so, why, and how did you get visibility and control the process? Yes, we can roll back upgrades mid-upgrade, especially if, for example, we upgrade the control plane and it never comes up; that would probably be kind of the last point at which we'd do a rollback. We don't have any automation for that. That's a manual thing. We've had to do it; I can probably count on one hand how many times we've actually done it, because it's not a great thing to do. In particular, if some of the control plane has come up and it has started converting resources to new formats in etcd, or things like that, there's a lot of chance for things to go wrong if we roll back. So we try not to do it. But our basic mechanism for that is we do take etcd snapshots and also VM snapshots before we start the upgrade, and we make sure that those snapshots are in place, so that if we need them and we need to roll back to them, we can. Cool. We have a couple more minutes and a couple more questions. Dejan Vederik asked: do you change control plane IPs too when replacing?
In my experience, this is a pretty complicated task. Yeah. So we do change the public IP for the control plane, like the IP that the customer connects to to use it. The way we handle that is that we never actually expose that IP directly to customers. They connect through a host name that we manage, and we just set the TTL on that host name super low. It's like 30 seconds. So that's the basic way around it. In terms of internal IPs, that's all handled by the CNI. So as long as someone's not hard coding a control plane IP internally, they shouldn't have a problem. Cool. Should Ding Zhao asks: how can you configure the control plane components, kube-apiserver or kubelet, during the upgrade? Sure. So the way that we do it is we bake that configuration into our VM images that we're going to use for provisioning or upgrading clusters. So it's actually exactly the same regardless of whether you're creating a new cluster on a version or upgrading to a version. We do some templating in there, just using Go templating, Go template strings. Nothing very fancy. But we bake those configurations right in. Like I said, that's kind of one of the advantages I see of our process: we're not mutating those configurations. They're sort of write-once objects, which is helpful and just keeps the process a little bit simpler. Cool. Let's roll through a couple more as they keep rolling in. Yerli Romanov asked: what percentage of nodes do you update at a time, then? And how do you check user settings like anti-affinity and such that will allow pods to be rescheduled? Yeah. So at the moment, we are still doing nodes one by one. We're working right now on moving toward a system where, A, we create new nodes before we delete old nodes, and that then also allows us to increase the number that we upgrade at once. I think we're still up in the air on exactly how we're going to set those numbers. I think it's going to take a little bit of experimentation on our side. It also depends a little bit on, and this is kind of a detail of our product, it's not going to be applicable elsewhere, but in our product, the worker nodes are owned by the user. The user has full access to them. And that means that they count against the limit on the number of VMs we allow a particular user to have on our platform. So we have to be kind of mindful of that and allow for the case where they hit their limit and we can't create any more for them. So those are some of the complexities that you might have to manage if you have quotas or something to manage. Yeah. And I think the same holds true at other cloud providers, believe it or not, since we had a lot of experience with multi-cloud, where limits on what could be used impacted operations, if you will. An anonymous attendee asked: do you upgrade a certain percentage of nodes at a time? For instance, upgrade a third of the nodes each day over a three-day period. So, like I said just now, at the moment we're just doing one by one. But in terms of the time scale, we start an upgrade and we don't pause until it's finished. So we don't take multiple days or anything like that. Our goal is to do an upgrade in less than an hour. And right now, if you have a 100 node cluster, it's definitely going to take more than an hour, but hopefully in the future that'll come down. Cool. Great. And then I think we have one final one that we can ask here. What problems do you think the other approach has? I guess, a little bit of context:
the approach of upgrading each component separately on each node instead of replacing the complete node. I'm asking because in my previous project we used to see a lot of downtime for our workloads, as unfortunately we didn't have breathing room in our cluster to do make-before-break for upgrades. Yeah. So I think the big challenges of doing an in-place upgrade are, A, it's just a lot of components to coordinate. And it's going to be somewhat different between different upgrades. Sometimes you're upgrading your CNI; sometimes you're not. Sometimes you're upgrading your cloud controller manager; sometimes you're not. So you have to build the automation differently depending on each individual upgrade you're doing. The other thing is that you're going to have to upgrade configuration in place, and you're going to have to deal with any customizations on the nodes, anything that's changed sort of beyond your control. So depending on your environment, if you have a tightly managed environment where you really control the workloads and control the cluster, an in-place upgrade could be much less disruptive. And like you said, you don't have to deal with that capacity concern quite as much. But for our environment, where we're managing thousands of clusters and we don't control the workloads on them, the node replacement strategy seemed safer to us, and I think that's played out pretty well. Okay. Great. I see another question came in, but we are unfortunately actually a little bit past time. On the screen, you'll notice you can reach out to Adam via email; he's also on Twitter, in the lower left-hand corner. So feel free to reach out there, right, Adam? And I want to thank all of you for joining today. The webinar recording and the slides will be online later today, and we're looking forward to seeing you at the next CNCF webinar. Have a great day. Thanks everyone.