Hi, welcome to KubeCon 2021 in Los Angeles. You're about to get updates from a bunch of people who work on the Kubernetes cloud provider. I'm Steve Wong with VMware, and I'm joined by Walter Fender of Google and Nick Turner of Amazon. We also have people representing many other popular cloud providers who couldn't be with us here physically, but who graciously prepared lightning talk recordings or slides, which we'll be sharing in the middle of this talk. If you're wondering about the Cloud Suite, of course it's SIG Cloud Provider, but it's just a demonstration that in spite of Walter ruling with an iron fist as co-chair, we do have some fun and levity in the group. So come to meetings and enjoy yourself. The way we'll start out is I'll quickly explain what a cloud provider is, just in case anybody here wandered in and isn't familiar with the Kubernetes architecture. Then we'll move on to some background on our strategic mission to move the cloud-provider-specific code out of the main Kubernetes source tree. Then Nick's going to give a general status, followed by the cloud-provider-specific lightning talks. And finally, Walter will close, covering the roadmap, interesting topics, and how you can get involved with the group.

So the cloud provider, what is it? Well, this is the part of the architecture that allows Kubernetes apps to be portable across various public and on-prem clouds. The goal is to allow a well-written app to run anywhere and, for the most part, not even be able to tell where it's running. SIG Cloud Provider works closely with a few other SIGs to make this happen. For example, most of the storage-related abstraction is actually done by SIG Storage, but there are cross-SIG integration efforts and design reviews that go into this.

The effort in play now is moving out of tree. In the early days, Kubernetes was a monolith when it came to cloud providers. The Kubernetes binary included a bunch of cloud providers, and it was bigger than it needed to be. This had a number of suboptimal aspects, listed here. For example, a user investigating the startup logs on a cluster running on Google Cloud might see a startup log entry saying it could not find AWS EBS. The old model also slowed time to feature and patch delivery, and we've been moving to out-of-tree cloud providers for a while. New deployments are probably already using the out-of-tree providers now if they're picking up distros, but anybody still on legacy deployments needs to be planning for migration, and Nick is going to tell you more about that.

Hey, so I'm going to give a general status update on cloud provider migration by going through the various components that are affected by the migration, starting with the Kube Controller Manager. As many of you probably know, the Kube Controller Manager is basically a collection of control loops, and those control loops might be acting on Kubernetes API resources, and in some cases they're also going to be creating, updating, or deleting cloud resources as well. So the whole effort to migrate the cloud-related code out of tree affects the Kube Controller Manager, and the way we're going to do that migration is by moving those cloud loops, any of the loops that actually touch cloud resources, to a vendor-specific binary called the Cloud Controller Manager. So this effort is pretty far along.
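To give a concrete picture of what that vendor-specific binary contains, here is a minimal, hedged sketch of an out-of-tree provider built against the k8s.io/cloud-provider library: it registers a factory and implements the provider interface, opting in only to the InstancesV2 piece that the cloud node controller uses to fill in node metadata such as addresses (Nick comes back to that in a moment). The provider name "example" and all of the values returned are hypothetical; a real provider would call its cloud's APIs here.

```go
// Hedged sketch of a minimal out-of-tree cloud provider, not any vendor's real code.
package exampleprovider

import (
	"context"
	"io"

	v1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

const providerName = "example" // hypothetical provider name

type cloud struct{}

func init() {
	// The vendor's Cloud Controller Manager binary imports this package so the
	// factory is registered and can be selected with --cloud-provider=example.
	cloudprovider.RegisterCloudProvider(providerName, func(config io.Reader) (cloudprovider.Interface, error) {
		return &cloud{}, nil
	})
}

func (c *cloud) Initialize(clientBuilder cloudprovider.ControllerClientBuilder, stop <-chan struct{}) {}
func (c *cloud) ProviderName() string { return providerName }
func (c *cloud) HasClusterID() bool   { return true }

// Only InstancesV2 is supported in this sketch; everything else is opted out of.
func (c *cloud) InstancesV2() (cloudprovider.InstancesV2, bool)   { return c, true }
func (c *cloud) Instances() (cloudprovider.Instances, bool)       { return nil, false }
func (c *cloud) LoadBalancer() (cloudprovider.LoadBalancer, bool) { return nil, false }
func (c *cloud) Zones() (cloudprovider.Zones, bool)               { return nil, false }
func (c *cloud) Routes() (cloudprovider.Routes, bool)             { return nil, false }
func (c *cloud) Clusters() (cloudprovider.Clusters, bool)         { return nil, false }

// The cloud node controller calls these to initialize nodes that kubelets
// registered with --cloud-provider=external.
func (c *cloud) InstanceExists(ctx context.Context, node *v1.Node) (bool, error)   { return true, nil }
func (c *cloud) InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error) { return false, nil }

func (c *cloud) InstanceMetadata(ctx context.Context, node *v1.Node) (*cloudprovider.InstanceMetadata, error) {
	// A real provider would look the VM up via node.Spec.ProviderID or the node name.
	return &cloudprovider.InstanceMetadata{
		ProviderID:   "example://instance-123", // hypothetical
		InstanceType: "m-small",                // hypothetical
		NodeAddresses: []v1.NodeAddress{
			{Type: v1.NodeInternalIP, Address: "10.0.0.5"},
			{Type: v1.NodeHostName, Address: node.Name},
		},
	}, nil
}
```

A vendor's Cloud Controller Manager imports a package like this so the provider registers itself, and operators then select it by name with the CCM's cloud provider flag.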
Many vendors already have Cloud Controller Managers at various stages; some are GA, some are alpha, some are beta. For the vendors you're interested in, you can find those repositories, look at them, and see how far along they are. So specifically for the Kube Controller Manager, you would set the cloud provider flag to external, and when you do that, those control loops are disabled; you would do that when you start the Cloud Controller Manager alongside it. An interesting little note is that the volume controllers are also going to be acting on cloud resources. Those are not actually disabled by setting cloud-provider equals external; they just won't have access to some of the cloud volume plugins. So you'd want to run CSI along with your Cloud Controller Manager, or there's a sort of backdoor you can use, which is to pass a flag called external-cloud-volume-plugin and set that equal to your provider, AWS, GCP, whatever it is.

Another component that is relevant here is the API server, and the API server has two parts of the code that are cloud related. The first is the SSH tunnel functionality, and this, as of 1.22, has already been dropped and replaced by the Network Proxy project, also known as the Connectivity Agent. If you have any questions about that, Walter is your guy. The second area is the persistent volume labeling admission plugin. This, unfortunately, is taking a little bit longer. It's probably going to be the long pole in extracting cloud-provider-related code, but luckily we did get the KEP merged in 1.23, which will make this migration a little bit easier. It's going to build some framework that allows cloud providers to build webhooks as a replacement for this admission plugin, and that's related to labeling manually created persistent volumes with these topology labels.

Another component that is important is the Kubelet. The Kubelet has a number of areas of cloud-provider-related code, the first being node addresses. For quite a while now, node addresses can also be set by some other process, which in most cases is the cloud node controller. So a Kubelet can initialize a node without setting the addresses, and the cloud node controller can asynchronously populate those addresses. Another aspect of cloud-provider-related code in the Kubelet is the external image credential providers. This is, I think, going beta in 1.23, and we're adding a feature gate to sort of push people along, which will start out being disabled, but it will eventually become enabled, which will then disable the in-tree credential plugins. So when that reaches beta, you'll have to actually go and flip that feature gate to false if you want to continue using them. So take a look at that KEP. And the in-tree volume plugins are the final piece of the Kubelet; these are being replaced by CSI.

In terms of Kube Controller Manager migration, there is a feature that's important for HA clusters. If your cluster cannot tolerate downtime, you should check out the Leader Migration KEP. This is beta in 1.22, and it will help you migrate a live cluster without taking down the Kube Controller Manager.
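A quick sketch of how those external image credential providers work under the covers: the Kubelet execs a provider binary, writes a CredentialProviderRequest as JSON on stdin, and reads a CredentialProviderResponse back from stdout, which it then uses to pull the image. The structs below only approximate the wire format (a real plugin should use the published kubelet credential provider API types), and the registry and credentials are hypothetical.

```go
// Hedged sketch of a kubelet image credential provider exec plugin.
package main

import (
	"encoding/json"
	"os"
)

// These types approximate the credentialprovider.kubelet.k8s.io request and
// response shapes; use the published API types in a real plugin.
type request struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Image      string `json:"image"`
}

type authConfig struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

type response struct {
	APIVersion    string                `json:"apiVersion"`
	Kind          string                `json:"kind"`
	CacheKeyType  string                `json:"cacheKeyType"`
	CacheDuration string                `json:"cacheDuration"`
	Auth          map[string]authConfig `json:"auth"`
}

func main() {
	// The kubelet execs the plugin and writes a CredentialProviderRequest to stdin.
	var req request
	if err := json.NewDecoder(os.Stdin).Decode(&req); err != nil {
		os.Exit(1)
	}

	// A real plugin would exchange cloud credentials (an IAM role, a managed
	// identity, a service account) for a short-lived registry token for req.Image.
	resp := response{
		APIVersion:    req.APIVersion,
		Kind:          "CredentialProviderResponse",
		CacheKeyType:  "Registry",
		CacheDuration: "10m",
		Auth: map[string]authConfig{
			"registry.example.com": {Username: "token-user", Password: "short-lived-token"}, // hypothetical
		},
	}

	// The kubelet reads the response from stdout and uses it to pull the image.
	_ = json.NewEncoder(os.Stdout).Encode(resp)
}
```

The Kubelet is pointed at plugins like this through its image credential provider config and binary directory flags, which is the mechanism the cloud providers' ECR, ACR, and GCR plugins hook into.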
So now we're gonna go into the lightning talks, and you will hear a recorded talk by me. Oops, clicked on the picture. I think it should play.

Hello, my name is Nick Turner. I'm on the Amazon EKS team and I'm gonna be talking about the AWS cloud provider. The AWS cloud provider is comprised of, or related to, a number of components, including the AWS cloud controller manager, the Kubelet image credential provider, the CSI drivers (EBS, EFS, and FSx), and the AWS Load Balancer Controller. So what's new? Well, the AWS cloud controller manager has a new release, v1.22.0-alpha.0, which is comprised of mostly bug fixes and the like. There's also a kOps setup example on the repo now, which is a great way to get started and get a feel for what is required in order to run the cloud controller manager in a cluster. The AWS Load Balancer Controller recently released version 2.2.0, which added a number of awesome features, so take a look at that release on GitHub if you're interested. And the EBS CSI driver recently released version 1.3.1, with multi-arch OS image manifests in ECR, so take a look at that as well if you're interested.

And what's coming soon? For the cloud controller manager, we're working on an improved upstream test framework; we'll be taking advantage of the last known good testing proposal. And you can look forward to the cloud controller manager being enabled on EKS in a version soon. The AWS Load Balancer Controller has a release coming out pretty soon, with a bunch of features loaded into that release as well, so give it a look. For CSI, we'll be enabling CSI migration on EKS in one of the next couple of releases. And for the Kubelet image credential provider, we're working on the first release of that, with binaries and documentation. If you wanna build it yourself, it's there in the GitHub repository; otherwise you can keep an eye out for the first release. And here's just a table to give an understanding of when things are graduating to beta and GA; you can look at 1.22. The HA migration framework went beta in 1.22, and we're working on the cloud controller manager getting a beta release after that. For the credential provider framework, that is going beta in 1.23, and so we'll be following with the ECR credential provider beta release around 1.23. And that's it, that's a very condensed version of our roadmap.

All right, my turn to talk, assuming this is on. Yeah, okay. We'll get to find out quickly whether or not, as a Googler and the chair, I'm capable of being somewhat neutral on these slides. So the Azure folks actually provided these slides. Andy and Pengfei work out of China, so it was a little hard for them to show up, but they've been doing a lot of work trying to make sure that the abstraction layer under Azure works and that the Microsoft code is being pulled out. So we see a lot of stuff from them for the CSI drivers, and for the cloud provider abstraction, the reference implementation for Microsoft, so that any of you can go and look at how to bring up a Kubernetes cluster on Microsoft and get it all working. They've had multiple patch releases, and they have just recently gone GA with a 1.21 release. And then one of the things that Nick mentioned is this idea of the credential provider, and their own external credential provider, which needs to go to ACR, should be coming out very soon.

GCP, that's me. So let's start with looking at where we've come from on GCP. GCP was one of the very first Kubernetes implementations, and as a result, most of Kubernetes and GCP have been developed together. This means that there's a lot of legacy code that is very tied to GCP, and also a large portion of our testing code base is tied to GCP. So where are we right now?
We have a cloud provider GCP implementation, and it implements most, although not all, of the pieces needed. CSI is coming soon; in fact, I'm hoping it has been checked in at this point. Otherwise we need to do things like make certain portions of this the default. So if I look at the cloud provider GCP repo, currently it builds with Bazel. We are hoping to switch that to make very soon, to conform with the rest of Kubernetes, but right now it is Bazel; there's a PR to fix that. As far as bringing the cluster up, right now we use the kube-up script that has been the standard for Kubernetes for quite some time. There are newer technologies out there and we are looking at using one of the alternates, but that's an investigation that is still ongoing.

Right now I'm bringing up a control plane and three nodes, and so we will see that come up soon. Once it comes up, we'll go ahead and check that all the nodes I'm expecting are there, and then we'll take a look at the pods and make sure that the key pods that are needed for extraction are there. So great, we're up and we're running. Now we'll go ahead and take a look at the nodes. Okay, all four of my nodes are there and look to be ready. Take a look: we see the CCM is there, we see the connectivity agents and connectivity servers are there. We'll try to fetch a log to check that connectivity works correctly, kind of meta, but it works. Now we'll go ahead and install a pause pod, and that works. Great. So now we'll take a look at what's left.

So the question now is, what's left to do? LKG testing: please go ahead and pull the slides and take a look at this document. This is a document that explains how we plan on doing development across repos and how we plan to get the testing working. GCP is gonna have a prototype, and then we're hoping to roll this out to all the other cloud providers. We'd also like to more immediately start dealing with the issue of the cloud provider and credential provider feature gates, which are currently blocking the extraction effort, and after that we should be good.

All right, and we also had some folks from IBM give us a slide. Sadev is actually at this conference, but unfortunately I believe he's at a different talk right now. IBM has also been doing a lot of work to get out of tree and get that whole system working for them. They have a CCM implementation, and interestingly their implementation has a couple of control loops that actually come from their managed offering and have since been open sourced, so anyone who is interested in knowing how IBM does their managed node control or load balancing can now see how they do that. Their CSI stuff is also coming along quite well, and they've got some interesting things coming up with the API provider. So if you are at all interested in that, or in OpenShift, I would definitely take a look at what they've got coming.

All right, and now do you wanna do the vSphere one quickly? This is a recording from Nicole Han, so let's just kick it off.

Hello, my name is Nicole. I work on a Kubernetes team at VMware. Today I will share a few updates on the cloud provider this year. Since v1.18, there are a lot of new features implemented in this project. The first one is zone and region support, so that we can initialize a node with cloud-specific zone and region labels. This is significant because vSphere has its own concept of zones and fault domains. The cloud controller manager maps those vSphere zone concepts to Kubernetes zone concepts. vSphere tags are used to identify zones and regions on vSphere data center objects. The same tags are mapped to labels in Kubernetes, allowing placement of nodes, and thus pods and persistent volumes, in the appropriate zone or region.
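As a rough illustration of the seam that mapping goes through: the provider implements the Zones piece of the cloud provider interface, and the controllers turn the returned zone and region into the standard topology labels on the Node (topology.kubernetes.io/zone and topology.kubernetes.io/region). This is a hedged sketch against the k8s.io/cloud-provider library, not cloud-provider-vsphere's actual code, and the zone and region values are hypothetical.

```go
// Hedged sketch of the Zones portion of the cloud provider interface.
package examplezones

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	cloudprovider "k8s.io/cloud-provider"
)

type zones struct{}

// Compile-time check that the sketch satisfies the library interface.
var _ cloudprovider.Zones = zones{}

// GetZone returns the zone and region of the instance the caller runs on.
func (z zones) GetZone(ctx context.Context) (cloudprovider.Zone, error) {
	return cloudprovider.Zone{FailureDomain: "zone-a", Region: "region-1"}, nil // hypothetical values
}

// GetZoneByProviderID would normally look the VM up by provider ID and read
// the zone/region tags attached to it (or to its host, cluster, or datacenter).
func (z zones) GetZoneByProviderID(ctx context.Context, providerID string) (cloudprovider.Zone, error) {
	return cloudprovider.Zone{FailureDomain: "zone-a", Region: "region-1"}, nil
}

// GetZoneByNodeName does the same lookup keyed by node name.
func (z zones) GetZoneByNodeName(ctx context.Context, nodeName types.NodeName) (cloudprovider.Zone, error) {
	return cloudprovider.Zone{FailureDomain: "zone-a", Region: "region-1"}, nil
}
```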
We also added new instance type labels, which you can see on nodes. This is useful if you want to target certain workloads to certain instance types, like small or medium sizes; instance types offer different compute, memory, storage, and network capabilities. We also added initial support for NSX-T routable pods, so that pod IPs are routable and accessible externally from the cluster. We introduced initial support for the vSphere paravirtual cloud provider. So currently we have two modes in the cloud provider vSphere project, which are called vSphere and vSphere paravirtual; they have different implementations of the cloud provider interfaces, like instances and routes. We added initial support for NSX-T routable pods in the paravirtual mode as well. Another thing we added is Helm chart installation, so now you can install the cloud provider vSphere by Helm chart; this is currently supported.

So Kubernetes is currently moving away from maintaining the cloud providers in tree, and I suggest that everyone use the out-of-tree cloud provider vSphere. We have a few docs in the cloud provider vSphere repo that can guide you through installing the out-of-tree cloud provider vSphere.

I'd like to share the roadmap of the vSphere cloud provider as well. First, there is a plan for the CPI migration. The cloud provider vSphere runs its node, route, and load balancer control loops in two modes, which are called vSphere and paravirtual. We want to merge those two modes into a single one, so that we remove one mode and then freshly install the other one. To do this we might also need new manifests for the different resources in the future. We will also support dual stack in the future; IPv4/IPv6 dual stack allows your resources to use both IPv4 and IPv6 addresses for network communication. So yeah, those are the updates for cloud provider vSphere since 1.18. Thanks, good morning.

Wow, that was fun. So we've had the lightning talks for the cloud providers. We've seen what each of the individual cloud providers is interested in doing in the effort to get extracted, but I think it's also interesting to look at what Kubernetes would like to get done for the cloud provider in upcoming releases. We'd like to get the code out. Why? I think quite a bit of that was gone over by Steve, but there are other concrete things to think about. We ran an experiment on how large the Kube API server is if we don't put any cloud provider code in it, and we save about 30 megabytes in binary size. That's kind of a big deal, and you can get similar savings in both the Kubelet and the Kube Controller Manager. Also, if anyone's tried to merge anything from vendor recently and looked at just what a headache that is, the amount of vendored stuff that is needed goes down dramatically if all the cloud provider code is gone. So this is important; we'd like to get this done.

So what are the immediate things that we as the cloud providers and we as the community need to do? Well, we've added two new feature gates, and the purpose of these feature gates, DisableCloudProviders and DisableKubeletCloudCredentialProviders, is to see what happens if you delete that code.
We do it by turning off everything cloud related within the binaries, and if you turn the gate on but don't turn the cloud provider off, then we kill the binary. So any cloud provider can just turn those feature gates on and see how it affects them. But we, the community, can do this as well, and we have, and all of our tests break. So one big effort we have is that we need to start moving those feature gates to beta and get all the tests to pass.

Another one is that we know, when we say we're gonna get those tests fixed, some of those tests are gonna have to go out of tree. As I discussed under GCP, a lot of the tests were built for Google. They test core Kubernetes functionality, but they do it in a way that is Google specific. So we either need to rewrite those tests to run in a non-Google-specific way, or we need to just accept that some of our testing has to happen on the individual cloud providers. And if that's true, then what that means is I've gotta build Kubernetes as a kernel, and then I've gotta find a way to take that kernel, put it into a cloud provider, bring that cloud provider up, and then run my tests there. We have a proposal for this; I mentioned it earlier, and I'm mentioning it again: it is the last known good proposal. I strongly recommend anyone who's interested in this read the proposal. Joe Betts and Kermit put a lot of effort into this proposal. It's gonna affect a lot of us in the community, and it's really important we get this right, so please go take a look at that.

We've also got assorted cleanup. We really do do a lot of things that cover a lot of the SIGs, as was mentioned earlier. One of these is the cluster directory. Anyone who's dealt with the cluster directory, anyone who's had to support it, and I'm one of them, knows there are about four SIGs that deal with that directory. We all wanna delete it. We want it gone. But we all have dependencies there, and so all those dependencies need to be fixed. We have all of these reference implementations. We've gone through a bunch of the reference implementations that gave us slides, but Huawei and Rancher and a lot of other cloud providers have their own reference implementations, and we wanna work out how to help them integrate into Prow. We want to standardize on how branching works. We want to work out how to get good docs, so any user who wants to bring up Kubernetes on a cloud can go to the Kubernetes docs and find the page that says here's your cloud provider, here's how you build it, here's how you bring it up. We also know that, whether it's IPv6, new ways to think about load balancers, new ways to do firewalls, et cetera, there are going to be things that we wanna have as a standard part of Kubernetes that require deep integration with the cloud provider, and working out how to do that in a neutral way is what our SIG is for. So even when this extraction effort, which has gone on for a few years now, finally gets done, we're gonna have this ongoing need to keep making sure that Kubernetes remains neutral and functional across all the cloud providers.

Now, if any of that sounds interesting to you, I would highly recommend that you join the SIG. So, the standard how-do-you-join-the-SIG slide. The first thing I'm gonna say is join the Google group, right, the Kubernetes SIG Cloud Provider group. As soon as you join that group, you'll be on the mailing list, you'll get calendar invites, and most of the information you need to become a contributing member is just gonna show up. The other thing I'll mention is join our Slack channel.
If you wanna know how to get started, you want a mentor, anything like that, go to the Slack channel and just say hi. I tend to monitor it pretty well, as does my co-chair, as do people like Steven and Nick, so there's a lot of support you can find on the Slack channel. For the meetings, we keep playing with Europe times and Asia times; most of ours are America times, but go to the Slack channel and say, hey, I'm in Asia, or I'm in Eastern Europe, and I really would like to join, and we'll set up meetings for you; we have been doing that. And take a look at the recordings; we have recordings of previous sessions and even deep dives into various bits of the technology, like the CCM and the API server network proxy, or connectivity system.

So with that, does anyone have any questions? There is a link there to download this deck. Yeah. And I don't bite, it's safe to ask questions, I promise; the iron fist was a joke. And also, for all the people online, if you wanna ask questions, there's a live Q&A button down there, so feel free to just click that and drop your question, and we'll monitor those as well.

So, do you wanna speculate on what release we will actually have all of the cloud provider stuff out in? Yeah, great question. So the question, and I'm gonna repeat it, excuse me, is: when will this actually be deleted? So I'll give you a little note: I deliberately lie about the answer to this question. Why do I lie about the answer to this question? I'm giving away secrets. I lie about this question because if I give a realistic estimate, then all the cloud providers relax and they stop doing work. So I keep telling them a date that's deliberately earlier than I think it's going to be, just so that there's some pressure. Having said that, if I just walk out the releases: if we could somehow, and we can't, but if we could somehow in the next three weeks get the features, both the webhooks and the CCM KEP, implemented, and could get the feature gates to beta in this upcoming release, it would still be a couple of releases to go. And that's obviously gonna get pushed out. So my hope at this point is that by 1.25, everything works without cloud provider code. And in fact, for those who actually do that, there is a little bit of a carrot: Ben the Elder from SIG Testing, and also I think to some degree SIG Release, a couple of releases ago gave me a special build flag, and what that build flag will do is compile all of the binaries without any of the cloud provider code in them. So for those of you who are good and get your work done early, you can actually start realizing the benefit early. But my general thought is, if we can get everything working in 1.25, we're still gonna need a release, probably two, of sitting in the everything's-working state, waiting for the signal, because as we know, most of the signal is gonna come from the managed folks, and the managed folks tend to be about a release or two behind, to know that we've actually gotten it all right and that it is then safe to delete. So I'll let you do the math; the answer is later than I'd like.

Cool, any other questions? Anyone interested in any of the technologies here? We can talk about the webhooks and the CCM, about the CCM itself, about the HA migrator and how it works and what it does, the connectivity system, or what the credential provider is doing under the covers and how that works. There's an online question asking for a little more information around the replacement for SSH tunnels. Sure, the connectivity system.
So SSH tunnels, the bane of my existence for quite a while, were a very specific Google solution to a Google problem having to do with where the control plane lived and where the customer's cluster lived. It was a lot of difficult code that no one liked having to support, not even the Googlers, and everyone wanted it gone. It took a couple of revs: we put out a KEP and we built the connectivity system. The way the connectivity system works is that instead of the API server creating an SSH tunnel down to an sshd daemon running on the cluster and then funneling all of its egress through it, we have these agents, and it depends on the cloud provider how they get installed, but they run next to the Kubelet. They connect to the connectivity server, which is running next to the API server, and then the API server connects to that, and that is how the traffic is tunneled. It can be done either via gRPC or HTTP CONNECT, and there are a couple of options for how the authentication works.

One of the interesting things here is that this was originally designed as an SSH tunnel replacement, and what you'll find in the API server network proxy repo is a reference implementation. But that reference implementation is actually used by Google, and it's actually used by several companies other than Google as well, which is different from the SSH tunnel, which was a Google-only solution. And in fact, because it's a little more flexible, there are several things that have been done with it, such as the ability to have it intelligently route your traffic, which the SSH tunnel could never do. It can even understand if you have failure zones that cannot communicate with each other; it understands that and will do the routing correctly with that set of provisions. There's even one of the companies in the deck, though I won't say which one, who is looking at: what if I wanna run my clusters very remotely, and the connectivity has to be over the internet, not my private cloud network but the actual internet, and I'd like to have the same level of security going in the other direction? Can I make the traffic from the Kubelet or from the pods go back to the API server over that same set of tunnels? The KEP is written, it's approved for alpha, and in fact the PR is still being reviewed, but that is one interesting set of work that is happening on that front.

Excellent, thank you. Well, I think we're just about at time. So, Walter, Nick, Steven, thank you very much. Fantastic talk. Thank you for coming. Thank you all for coming.
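For readers who want a feel for the tunneling Walter describes, here is a minimal, generic sketch of the HTTP CONNECT handshake that one of the connectivity modes is built on. This is not the API server network proxy's actual code; the proxy address and the target (a kubelet's address, for example) are hypothetical.

```go
// Generic illustration of dialing a target through an HTTP CONNECT proxy,
// the same primitive one mode of the connectivity (network proxy) system uses.
package main

import (
	"bufio"
	"fmt"
	"net"
	"net/http"
)

// dialViaConnect opens a TCP connection to the proxy and asks it to establish
// a tunnel to target; once the proxy answers 200, bytes flow end to end.
func dialViaConnect(proxyAddr, target string) (net.Conn, error) {
	conn, err := net.Dial("tcp", proxyAddr)
	if err != nil {
		return nil, err
	}
	// Ask the proxy to open a tunnel to the target.
	fmt.Fprintf(conn, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n", target, target)

	// Read the proxy's reply; 200 means the tunnel is established.
	resp, err := http.ReadResponse(bufio.NewReader(conn), &http.Request{Method: http.MethodConnect})
	if err != nil {
		conn.Close()
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		conn.Close()
		return nil, fmt.Errorf("proxy refused tunnel: %s", resp.Status)
	}
	return conn, nil
}

func main() {
	// Hypothetical addresses: a proxy alongside the API server, a kubelet as the target.
	conn, err := dialViaConnect("127.0.0.1:8090", "10.0.0.5:10250")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("tunnel established")
}
```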