Welcome to KubeCon 2022 in Valencia. This session is about the current state of the Kubernetes cloud provider. I'm Steve Wong of VMware, and I'm joined by Nick Turner of Amazon. We'll quickly cover what the cloud provider does, in case we have some audience members who are new to Kubernetes, then move on to general project status, followed by lightning talks from individual cloud provider implementations. Finally, we'll wrap up with futures and coverage of how to join the SIG Cloud Provider community.

The cloud provider is the mechanism that allows Kubernetes applications to be portable across various public and on-prem clouds. The goal is an experience where any well-written app can run anywhere and, for the most part, can't even tell where it's running. SIG Cloud Provider works closely with a few other SIGs to make this happen. For example, most of the storage-related abstraction is managed by SIG Storage, with a little integration activity across the SIGs being monitored by SIG Cloud Provider.

In the early days, Kubernetes was a monolith when it came to cloud providers. The Kubernetes binary included a bunch of cloud-specific provider code, and it was bigger than it needed to be. This had a number of suboptimal aspects, listed here. For example, a user investigating startup logs on Google Cloud might see entries saying "could not find AWS EBS." The old model also slowed the time to feature and patch delivery, and we've been moving to out-of-tree code for cloud providers for about two years now. New deployments should be using out-of-tree versions of the cloud providers, and if you're running a legacy version, you need to be planning for a migration. Nick is going to tell you more about that migration status. So Nick, take it away.

Yeah, so I'm going to talk a little bit about the general status of the cloud provider migration and what components it involves. It's actually kind of funny: a couple of hours ago I ran into Lucas Käldström, who was a very early member, or even the genesis, of SIG Cloud Provider, and we were talking. He actually started doing this work six years ago, and at the time he wondered how long it was going to take. His pessimistic estimate was about a year, a year and a half, and it ended up not being quite that short. There are a lot of reasons for that, but it was more complex than people realized, and I think one problem was that there wasn't a lot of motivation for cloud providers to move out of tree, because everything worked in tree. Why should we go out of tree?

So the first component I want to talk about is the kube-controller-manager. This is what I think of when I think of the cloud provider in Kubernetes. It runs the control loops that act on all of the objects, and those loops also make API calls to the cloud APIs, for example to manage instances, load balancers, routes, volumes, et cetera. So when we're talking about migration out of the kube-controller-manager, the node lifecycle controller, the service controller, and the route controller are the primary controllers we're referring to, but there are also the volume controllers, which are going to be replaced by, for example, the CSI drivers. And when you pass cloud-provider=external to the kube-controller-manager, those all get disabled.
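To make that concrete, here is a rough sketch of the flags involved. These are not complete command lines, just the cloud-provider-relevant pieces: setting --cloud-provider=external disables the in-tree cloud code, and a vendor cloud-controller-manager (the binary name varies by provider; AWS is used here purely as an example) runs the cloud loops instead.

```sh
# Disable the in-tree cloud code on the core components. The kubelet will
# taint its node with node.cloudprovider.kubernetes.io/uninitialized until
# an external cloud-controller-manager initializes it.
kube-controller-manager --cloud-provider=external ...
kubelet --cloud-provider=external ...

# Run the vendor's out-of-tree cloud-controller-manager instead; it hosts
# the cloud node, node lifecycle, service, and route control loops.
cloud-controller-manager --cloud-provider=aws ...
```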
And so if you're running a cluster that is HA, for example, and you don't want to experience downtime when you do this migration, then you're going to want to take advantage of the leader migration support that we've built. It basically allows the kube-controller-manager and the component that replaces those cloud loops, the cloud-controller-manager, to use an additional leader migration lock so that they can coordinate, and you won't have a situation where you have two leaders, two service controllers working at the same time, for example. So that's something you might want to take advantage of if you have an HA cluster. If you're using a vendor, the vendor is probably going to deal with that for you, and you're not going to have to worry about it.
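For anyone wiring this up by hand, here is a minimal sketch of that coordination, assuming a recent release (the apiVersion and the exact controller-to-component assignments below are illustrative). Both managers are started with --enable-leader-migration and --leader-migration-config pointing at a file like this:

```yaml
apiVersion: leadermigration.config.k8s.io/v1   # v1alpha1/v1beta1 on older releases
kind: LeaderMigrationConfiguration
leaderName: cloud-provider-extraction-migration
resourceLock: leases
controllerLeaders:
  # Both managers agree on which component owns each cloud control loop,
  # so an HA upgrade never ends up with two active service controllers.
  - name: route
    component: cloud-controller-manager
  - name: service
    component: cloud-controller-manager
  - name: cloud-node-lifecycle
    component: cloud-controller-manager
```

Once every control plane node is running the cloud-controller-manager, the extra lock and flags can be dropped again.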
And as I said, the cloud-controller-manager is this vendor-specific component that replaces those control loops in the kube-controller-manager. But there are other components involved in this whole migration effort, the first being the API server. There are some lesser-known features that are considered part of the cloud provider migration effort, one being SSH tunnels, which was something that really was only used by Google. It doesn't involve cloud-provider-specific SDK code; there are no calls to a cloud SDK there. But because it was specific to one cloud, we're combining it with this effort. We've already extracted it from the API server, and it's being replaced by the network proxy. So if you are interested in that effort, take a look at the network proxy.

And the second piece in the API server is the persistent volume labeling admission controller. This is not a super common use case, but it's for when you need to create persistent volumes without a PVC for them and you still want the topology labels attached to those volume objects. This is an effort that's actually still underway. We are replacing that admission controller with a webhook that will be built into the CCM, or, if your distribution wants, you could run it separately. The idea is that you have a webhook that replaces the admission controller there.

And then the next is the kubelet. The kubelet also has some cloud provider code that needs to get extracted. There is node addressing functionality: when the kubelet starts up, it needs to figure out what addresses to attach to the node object so that communication can happen between the node and the API server. That is being replaced by a brand new controller that's being added to the cloud-controller-manager, the cloud node controller. Not to be confused with the cloud node lifecycle controller or the cloud node IPAM controller, which I'm not even going to talk about. So node addressing is one piece in the kubelet that's important and needs to move out.

The next is that there is actually some volume-plugin-specific code in there. I'm only familiar with the AWS stuff, but AWS has some EBS mounting logic in there that needs to come out. That volume plugin code is being replaced by CSI migration; the CSI driver is going to do all of that work.

And finally, when the kubelet starts a pod, it needs to pull the container image for that pod. So there are a couple of plugins in there, for example ECR, GCR, and maybe a few others, that have the SDK logic that actually gets the credential so that it can pull the image. That's being replaced by something that we're calling the kubelet image credential provider, which is just going to be a binary that sits next to the kubelet. You have a configuration file you give to the kubelet, and then the kubelet just execs that binary to get the credential it needs to pull the image.

And some random things to know. For example, if you're doing this migration and you don't want to use CSI migration, but you do want to disable the cloud provider in general, there's a little workaround. You can pass cloud-provider=external, and you also need to pass external-cloud-volume-plugin to prevent those volume loops in the kube-controller-manager from being disabled. So again, that's only when you want to do this without CSI drivers enabled. I wouldn't really recommend it. I would say just do everything at once and enable CSI at the same time that you do your cloud provider migration, or even before. But that's an option if you need it.

And another sort of interesting piece: you may or may not be familiar with a feature in the kubelet, the node IP flag. This is a flag you can pass to the kubelet to give it a little bit more information about which IP address or addresses it needs to use. What happens is, when you pass that flag, the kubelet will use it to filter down the addresses that it gets from the cloud provider when it asks for the node addresses it should be adding to the node object. And this introduced a little wrinkle: when the CCM's cloud node controller is trying to figure out node addresses by calling whatever cloud API it needs, on AWS that would be EC2 or something, it didn't have that node IP to do the filtering. So there's actually an annotation you can use, which is the equivalent when you're running the CCM versus the kubelet, and it allows you to have that same filtering logic. And there was a period of time during upgrade, there was a bug, where the kubelet and the cloud node controller both try to reconcile at the same time. So we have a fix where we will always provide this annotation that allows the filtering to happen so that they agree. And there's a slightly better long-term fix that we're considering, which is to basically prevent them from ever trying to do that reconciliation at the same time. That would be the ideal future fix, but currently, as long as they agree, you're not going to have a flapping situation where addresses go back and forth.
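As a rough sketch of that handshake: the annotation name below is the one current kubelets use, but the node name and addresses are made up. The kubelet records its node IP on the Node object, and the cloud node controller reads it back so both sides filter the cloud-reported addresses the same way.

```yaml
# Started as: kubelet --cloud-provider=external --node-ip=10.0.1.5
# the kubelet records the chosen IP on its Node object, roughly:
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-5.ec2.internal        # hypothetical node name
  annotations:
    alpha.kubernetes.io/provided-node-ip: "10.0.1.5"
```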
So here are some slides from cloud providers that go over the status of their migration process, where they are, et cetera. And did you want to do IBM? I'll give it a try. We've got these slides submitted by people who couldn't be here. Obviously, Kubernetes runs on a lot of clouds, and we invited the people behind the cloud provider implementations for these clouds to give the community updates. You can see the density here. I'm just going to show these rather than read the slides to you; you could read them a lot quicker than I could pronounce them. And we are publishing this deck so you can download it later. At the very end of this deck you'll get a URL and a QR code where you can download it. So in this case, this is so dense I'm going to assume you read it. They're covering interfaces and recent work. And this one I can talk about, and I'll do it very quickly. This is the vSphere cloud provider. The summary is that the recent releases bumped the level of support up to match recent versions of Kubernetes. And the other big news is that support for dual stack, which means running IPv4 and IPv6 simultaneously, moved into alpha status, and we anticipate that it will move to GA pretty soon.

Azure: the maintainers couldn't be here. They're based in Shanghai, so travel difficulties, obviously. But the summary here is that the recent work is related to storage and networking, including, once again, dual stack support, which I think is something that a lot of the cloud providers are either out with or coming out with soon.

AWS: I guess I'll give this back to you, Nick. Yeah, so just a really quick update for AWS. We have stable releases of the cloud-controller-manager going back to 1.20, up through 1.24 now. We have a component called the AWS load balancer controller, and this is where we're focusing our load balancing support. So we have NLB support and ALB support in the load balancer controller, and we recommend moving to it over the built-in service controller. That's just because we have a lot of people maintaining it and a lot of effort there, so you get better support, and it supports the newer load balancers, ALB and NLB, versus the classic load balancer, which is what's in tree. And an update about usage of the CCM: EKS has migrated to use the CCM in 1.22, currently just the service and node lifecycle controllers. And in kOps, I'm not sure if it's by default, but you can enable the CCM in, I believe, 1.22 and later. And then CSI migration is offered via add-ons in EKS and kOps to use the EBS CSI driver.

Google: I guess I can try to do Google. So, focusing on Google, Walter here is bragging about how many features are only tested on GCP and in tree. So there's a lot of testing coverage there, and that's actually something that makes this whole process difficult. If we disable the in-tree cloud provider, then all of these tests start failing. So that's something that we need to work on. We need to get these tests out of tree so that we can actually disable the in-tree cloud providers and finish this process. But yeah, there's more GCP-specific code in k/k, the kubernetes/kubernetes repo, than for any other cloud provider. The slide mentions that the SSH tunnels have been removed, that there's more to do, a link to the reference implementation for the GCP cloud provider, and a little bit about what's left.

Okay, so just a couple of notes on what is coming for SIG Cloud Provider. We have these feature gates that will eventually flip to beta, and when they do, the in-tree cloud provider code is just not going to work. It's going to be gone, it's going to be disabled. You can still flip the gates back for a certain amount of time and re-enable it, but it's a very strong message that this will be removed in the future. That's not going to be flipped in any of the immediately upcoming releases, because we have to finish some of these things first, like getting the code out of the API server, but look for it in the future. It's eventually going to happen.
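The slide doesn't spell the gates out, but these are presumably the DisableCloudProviders and DisableKubeletCloudCredentialProviders feature gates that landed as alpha in Kubernetes 1.22. Turning them on ahead of the default flip looks roughly like this:

```sh
# Force-disable the remaining in-tree cloud provider code paths.
kube-apiserver --feature-gates=DisableCloudProviders=true ...
kube-controller-manager --feature-gates=DisableCloudProviders=true ...
kubelet --feature-gates=DisableCloudProviders=true,DisableKubeletCloudCredentialProviders=true ...
```

Once the gates default to true, setting them back to false is the temporary escape hatch Nick mentions.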
And as I mentioned before, testing is something that we're thinking about, so we're figuring out a plan. There's a proposal, "last known good," which I think we discussed in the last year, and it's essentially about how we get those tests out of k/k. We want to bring those tests along; we don't want to just get rid of them, we still want that coverage. So we want to move those tests to each cloud provider's CCM repo, but we still want those tests to matter. So the question is, are they release blocking, and how do we get that signal to the community so that we know when some change that was merged in k/k breaks all of the cloud providers? So that's a proposal that is probably worth taking a look at and giving feedback on. And some other random things: there's the cluster directory in k/k, which Walter has been trying to get rid of for a very long time, so we'll see if he makes any progress on that in the next year. And just in general, making all of the cloud providers work better, right? We all need to increase testing, and we all need to put in a lot of effort to make sure this whole migration process goes smoothly.

And just some information about our SIG and when we meet. We have a couple of bi-weekly meetings. There's the regular SIG meeting, which is Wednesdays at 9 a.m. Pacific time, and we also have an extraction-focused meeting, which is bi-weekly at 9:30 on Thursdays, and they alternate, so every other week you have the extraction meeting and then the regular SIG meeting. There's also the Slack channel and video recordings. And I think that is... yeah, I think we've got some time for Q&A, so please raise your hand so we can get it on the audio for the remote audience. And then if you want to download this deck to get all those links, use the QR code or this URL. Anybody with a question? Oh, let me turn it on, just a minute. There you go. Can we get the hand mic turned on?

I think now you can hear me, right? Okay. Yeah, so thank you for the talk. This is, I think, an underestimated topic that many people don't look at, and it's going to cause some pain for some organizations. The question that I have is mostly around the cloud providers themselves, and how this migration is going to look for customers of these cloud providers. Because I know that some of them try to do the lift and shift on their own, and some others try to push the responsibility of running this component onto the customer. Which, to me as a customer, looks like there was something that was working and was managed before, and now I need to manage it. And specifically, this is the AWS load balancer controller: before, it was integrated into the managed control plane; now you need to run it as a component, and you have to maintain it, basically.

Yeah. Yeah, ideally, you're right, it should be as easy as possible for the users. So I'll start with just kind of the general picture. Most things users are not going to see; your distribution is going to handle a lot of these things for you. Like, the leader migration is hopefully going to be invisible, unless something goes wrong, I guess. Depending on how much control you have over your nodes, if you build your own node images, machine images, then you might have to do the work to put the kubelet image credential provider on the image, right? So there are some things that you might notice, but in general, it should be pretty opaque, I guess. Let's see. And then the load balancer controller, yeah, I'd like to have a better story with that, where someone who's setting up their own cluster doesn't have to worry quite so much about installing it as a separate component. So we've talked about different things, like combining it with the CCM, that's still an option, kind of turning it into a library or something and adding it to the CCM. I like that idea, but we're not settled on it yet.
Or just having a really easy installation method that covers it, whatever that is, you know, like a Helm chart or something. I don't know if Helm is the best option for that, but something like that. So, yeah, I guess I can just say I recognize the problem there. Does that answer your question?

Yeah, and I want to add on this. I think the problem is not with new clusters, because new clusters are going to have their load balancers managed by the new AWS load balancer controller. It's mostly about the old load balancers, and how you get them all out of in-tree management. Because, as far as I know, the EBS CSI driver will provide a smooth migration path with those two feature flags on the API server, but for the AWS load balancer controller, you basically need to migrate to a new load balancer managed by it, which might be a pain if you don't have the correct mechanism in the org to be able to do that shift transparently.

Right, yeah. I think the idea is that, at this point, you run both of them, and they shouldn't conflict. But if and when we do deprecate the in-tree one, and I don't know that it would ever really be deprecated, because it's kind of tied to classic load balancers, but if and when we did, there would be a migration path for sure. All right, thank you.

And another question, sorry. So I saw these two flags that you have to completely disable the cloud provider. Are the cloud providers planning to expose these in their APIs? Like, as a customer again: say we have migrated to the new system, and we would like to completely get rid of the old one, so that we can make sure that our customers do not create load balancers using the old system, for example. I'm not sure if that was clear. You got it. No, they really shouldn't be exposed. And in terms of load balancers, we're really just talking about the in-tree service controller, again speaking specifically about AWS, just the in-tree service controller for the migration. And it shouldn't change, right? You're still creating the same load balancers before and after the migration. If you choose to install the AWS-specific controller, that's separate, and you can do that whenever you want. Right, thank you.

We've got time for one or two more questions if somebody's got one. Okay, I'll take that as a no, but you can track down Nick and me, our names are in this deck, if anything comes up, and you can find us on the Kubernetes Slack in the SIG Cloud Provider channel. So thank you for coming.