Hello, and welcome. We're here to talk at KubeCon EU 2024 about how Kubernetes is finally removing in... yeah, saying words is hard. Why don't you say the words? Removing the in-tree cloud providers. Say that three times fast. Or just try to do it, and you'll be on this journey with us. OK. I'm Chris Privitere. I'm a software engineer over at Equinix. I work in the developer relations section doing integrations engineering, hooking up Kubernetes and other things to our APIs. I'm a longtime K8s user. I've been there, done sysadmin, done storage, been in the K8s ecosystem for, I don't know, five years? You can find me on GitHub; there's my name, cprivitere. Excellent. I'm Bridget Kromhout, and I'm in product at Microsoft. I focus on the upstream open source ecosystem, Kubernetes, WebAssembly, and all of the fun things. I've been in the Kubernetes ecosystem about seven years now, and I'm bridgetkromhout on GitHub. All right. So we have three things we're going to talk about today. We're going to do a quick overview of what exactly we mean when we say cloud provider, because I think that's one of those semantically overburdened terms; it's like everything is a cloud provider. So we're going to help you figure out what it is. Then we're going to talk about this migration from in-tree to out-of-tree. And then we're going to give you a few actionable takeaways that you can take and use, even if you're not actually making a cloud provider or doing this transition. All right. So let's see. Should we start with, what is a cloud provider? Well, as you mentioned, it's an overloaded term. So we're not referring here to the cloud providers like Azure and Google and Equinix Metal. Here we're talking about two things. The first is the specific cloud provider infrastructure manager component used by Azure or Equinix Metal, like cloud-provider-equinix-metal, cloud-provider-azure, cloud-provider-gcp.
And then the second part is the cloud controller manager, which is the shared component that all those individual vendor ones use. So for this talk, we're mostly focusing on that shared component. When we say cloud provider, we're referencing the cloud controller manager and its related code. And you will see the term cloud provider used elsewhere in Kubernetes, in, say, SIG Cluster Lifecycle and the Cluster API world, et cetera. We're not exactly the same as those usages, so don't worry too much about that. Just be aware that you may see the term and it could mean a lot of different things, but this is what we're talking about here. All right. So what is a cloud controller manager, Bridget? OK, so this is the CCM, and then you're immediately thinking, wait a minute, there's also the KCM, the kube-controller-manager. And yeah, there are a lot of things. It's fine. So many things. But this specifically is what's going to translate and link events or actions between your Kubernetes cluster and the infrastructure it's running on. Specifically, we've got controller loops; we'll go into a little more detail about those. And the key difference, and this is where you start to see why we want to be out of tree, is that, for example, cloud credential providers can be different between clouds. And you'll usually see your CCM run as a pod in Kubernetes. So yeah, should we go into a little bit more detail? Yeah. So there are three main focus areas, and a cloud controller manager has a controller loop for each of them. The first is the node controller, which will update nodes based on infrastructure changes. This is how your cloud can let Kubernetes know that a node is gone and it's not coming back. It's also where the node's provider ID gets set somewhere you can find it inside Kubernetes. The second one is the one most people come to the cloud controller manager for, which is the service controller, which is how you make load balancers.
So real big, real important. Otherwise, traffic doesn't get to your apps. And then finally, we have the route controller. That one will configure networking routes so that you can actually get your traffic where it needs to go. All right. OK. So there's more that it can do, though, right? There's so much more. And if you're thinking at this point, wait a minute, I'm familiar with at least one of the CCMs, and it doesn't work the way you just said: you're right. And that's because not all of those controllers actually need to be implemented. For example, for the service and route controllers, it depends on what the cloud provider offers in terms of networking. And it gets a little more nuanced and interesting when we start diving into, for example, what Equinix worked on, because you can also add controllers that are specific to your infrastructure. We have some examples on the slide here: the registry credential controller was necessary for GCE, or Azure needed a node manager. So that's why, if you're starting to think, oh, so you're saying the in-tree code base is absolutely enormous: yes. There's a reason we're moving out of tree. We also, at KubeCon last year, went into a lot more detail about this specific topic. The slides have already been uploaded. Thank you, Chris. So if you're thinking, wait, I need a lot more information about this, go follow the link already available to you. OK. OK. So at this point, if you want to build a new CCM, how's that going to go? Well, it's real easy. You go to kubernetes/cloud-provider, and you download that stuff, and you make a Go package. You satisfy the cloudprovider.Interface. You take the example main.go from our upstream repo, and you just copy it, link in the package that you just created in step one, and then you run the init command for your cloud provider. And that's it. You're done. You run the init command, and that's it.
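Those steps can be sketched in code. This is a heavily hedged illustration: the real interface is `cloudprovider.Interface` in the `k8s.io/cloud-provider` module and has more methods (Instances, Zones, Clusters, Initialize, and so on); the trimmed-down types and names like `exampleCloud` below are stand-ins invented for this sketch so it runs without the full dependency.

```go
// Trimmed-down stand-in for the cloudprovider.Interface pattern.
package main

import (
	"fmt"
	"io"
)

// LoadBalancer stands in for cloudprovider.LoadBalancer; the service
// controller calls into it to create load balancers for Services.
type LoadBalancer interface {
	EnsureLoadBalancer(service string) (address string, err error)
}

// Interface mirrors the heart of the pattern: a provider names itself and
// reports which optional controllers it actually implements.
type Interface interface {
	ProviderName() string
	LoadBalancer() (LoadBalancer, bool) // (implementation, supported?)
	Routes() bool                       // simplified: the real API returns (Routes, bool)
}

// exampleCloud supports load balancers but not routes, which is a perfectly
// valid CCM: not every controller has to be implemented.
type exampleCloud struct{}

func (c *exampleCloud) ProviderName() string               { return "example" }
func (c *exampleCloud) LoadBalancer() (LoadBalancer, bool) { return &exampleLB{}, true }
func (c *exampleCloud) Routes() bool                       { return false }

type exampleLB struct{}

func (l *exampleLB) EnsureLoadBalancer(service string) (string, error) {
	// Pretend the cloud API handed back a virtual IP for this Service.
	return "203.0.113.10", nil
}

// newExampleCloud plays the role of the factory a real CCM registers via
// cloudprovider.RegisterCloudProvider(name, factory).
func newExampleCloud(config io.Reader) (Interface, error) {
	return &exampleCloud{}, nil
}

func main() {
	cloud, _ := newExampleCloud(nil)
	fmt.Println("provider:", cloud.ProviderName())
	if lb, ok := cloud.LoadBalancer(); ok {
		addr, _ := lb.EnsureLoadBalancer("my-service")
		fmt.Println("load balancer at", addr)
	}
}
```

In a real CCM, the factory gets registered with the shared library and the upstream sample main.go wires it into the controller loops; the "study the documentation" part the speakers mention next is figuring out what actually belongs in each of those methods.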
I feel like I've been told this before, and it was always a lie. It's 100% a lie. Yeah. No, it's totally a lie. Mostly what you need to do is study. You need to go check the documentation before you're going to be able to figure out what you need to put into all those functions and inits. And I feel like every time somebody tells me to read the friendly manual, read the fine manual, they're telling me to go do a ton of pre-work, and I'm not 100% sure which parts are important and where I should start, and can this please be shorter? So we thought we'd make this shorter for you. We went and picked the exact places in the documentation where you'll want to start, and why, so that you know which stuff you're looking at and why you're looking at it. All right. So first off, the basics of how the cloud controller manager works: what it's doing, how it's designed, what its functionality is. You go right to the Kubernetes docs. We've got the link on the slides; go check that. So that's your foundational knowledge, right? Then we have how to administer a cloud provider, or a cloud controller manager. You want to know this because this is how your users are going to interact with it, right? So you kind of need to know what they're going to be doing. This is where you realize, I want ours to run as a DaemonSet versus a Deployment. You get a sense of how this thing is going to run in the cluster. So yeah, again, here's a look at that. And I think the troubleshooting steps here will probably be the most useful if you're starting to dive into this and quickly running into weird errors without knowing what's going on. Walking through this part of the documentation will probably be the most useful. Then we have the framework itself. This is the thing I talked about earlier that you download and just satisfy the interface. These are Go modules containing the core cloud controller manager. This is what the SIG maintains.
This is the code you've got to study if you're going to implement a CCM. So this is the code part. You're going to have to read some of it, unfortunately. Or fortunately. Or fortunately. But important here: you're looking at this, you're looking at that URL, and you're thinking, oh, OK. So this is right in the core Kubernetes org. That means this is the official cloud provider. I don't have to look at any of these other SIGs and things you're talking about. Oh, yes, you do. This is just an instructive sample. This is not the one that you can run. Because, as mentioned before, we're moving everything out of tree, so there's not going to be a singular one. Which is good. This is a good thing. We'll explain. And then finally, the best way to figure out how to do it is just to look at what somebody else did. They took care of it for you. Get inspired. So we've got a link for a search for cloud-provider- inside the GitHub org. And that dash in there actually matters. That is a load-bearing dash. By having that dash in there, it finds all the cloud-provider-something repos that have already been created inside Kubernetes SIGs. So you'll get cloud-provider-equinix-metal. You'll get cloud-provider-azure, cloud-provider-aws. You'll get all those. And then you can look through: how did they handle starting up their load balancers? How did they handle initializing the cloud provider? It's extremely helpful. I'm inside your code base all the time. We're all looking at each other, and that's the joy of open source. And one other thing you're now thinking: well, wait a minute. I saw a very important cloud provider code base, and it wasn't in Kubernetes SIGs. Sure, absolutely. Every cloud provider can put their code base wherever they want. A lot of them are in Kubernetes SIGs, so that's the central location we're sending you to. But you certainly can go look at any of the ones that aren't there, on GitHub or anywhere else. And there are links.
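If you want that load-bearing-dash search without the slide link, here's one way to express it; the URL shape is just the GitHub org repository-search page, and the commented `gh` invocation assumes you have the GitHub CLI installed.

```shell
# Build the "load-bearing dash" search: repos in the kubernetes-sigs org
# whose names contain cloud-provider- (the trailing dash filters out noise).
org="kubernetes-sigs"
query="cloud-provider-"
url="https://github.com/orgs/${org}/repositories?q=${query}"
echo "${url}"
# With the GitHub CLI installed, roughly the same search from the terminal:
#   gh search repos "cloud-provider-" --owner kubernetes-sigs
```

Opening that URL turns up cloud-provider-aws, cloud-provider-azure, cloud-provider-equinix-metal, and the rest.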
So how does a cloud provider find its way into that centralized location? Well, over at Equinix, we had ours outside of the Kubernetes SIGs org, and we said, gosh, we'd sure like to play better with the other kids in the pool. How do we get in with them? How do we do all that? So here's how you can bring your cloud controller manager, your cloud provider, into the Kubernetes SIGs org. First thing: show up to the SIG meeting and say, hi, I'd like to bring our CCM in. That's what I did. Honestly, I think that right there, that show-up-to-the-SIG-meeting part. I've had discussions with people throughout the conference who say, well, it sounds really intimidating. I'm not a fill-in-the-blank of whatever fancy thing they think we are. And it's like, I'm not going to go to the SIG meeting and bring my code, which might not be good enough. I mean, I can't do that. And actually, the SIG meetings are not scary. They're literally us and a couple of other people. And we would love to talk to you. We're lonely. We're friendly. We're friendly, yeah. So once you declare your intention, we'll point you to the link that we've got here on the slide. Follow the rules. Kubernetes already has a big old process for how to contribute code. It's a useful checklist to get your repo moved into the Kubernetes SIGs org. And it's not difficult. No, I said it was big. But really, it's these three things. Make sure you've got the contributor license agreement, because you need that. Make sure your license is compatible. Apache 2 is great. That's easy, but there are other ones that you could use. And then some documentation things: they want to make sure all the code copyright stuff is correct and whatnot. And then, oh yeah, after that, it's really just submitting an issue on the Kubernetes GitHub org, then coming back to the meeting and telling us you made the issue, and we'll push it through. Yeah, and the one that Chris himself did is right there. You can go look at it.
4259, go check it out. You can see what we had to do, which was not much. And then you could do it too, and join us. And then you get the benefits. If we ever do any kind of global search to see, is anyone else doing this in a cloud provider before we make any changes, your code will be right there and easy for us to check. And people being able to find your code is a real nice one, because it's in the centralized location. Awesome. So why did I contribute it? Well, partially I contributed it because I felt a little over my head. I came to Equinix Metal to work under some wonderful people who were leaders in the Cluster API and cloud provider space and just doing this thing. And I was going to kind of bootstrap my career and get to the next level and start contributing back to this Kubernetes ecosystem I love so much. Then those people left within six months. It happens to the best of us. Yeah. So I looked around and realized, oh no. I'm the Kubernetes admin here now. I'm the maintainer of the code. Yay. So I had to figure out, what does that look like in terms of the cloud provider? What is my job now? Because I didn't know. So I've spent some time figuring out what I do, so that I can tell you, in case you find yourself in that situation, all of a sudden maintaining a cloud provider: what does that look like? What are the things I should think about? Here's what I think the four big areas are. First of all, issues. If someone comes in and says, hey, MetalLB is no longer getting IPs for my ingresses, that's an Equinix-specific problem. But whatever your users have, they submit a ticket on your GitHub repo. That's the first thing you've got to solve, right? They're having a problem. The next one would be adding new features to your cloud provider. Over in Equinix Metal, we've got a brand new load balancer as a service option that people have been asking about for years. And we finally launched it.
So we expanded the cloud provider to be able to allocate those load balancers, so you can use it with your Kubernetes clusters. And another useful way to learn from each other in the community is to go to the community call. If you're finding out that other people are getting feedback or requests or whatever from their users, you can start putting it in for yours. Yeah, you can see it coming before it's a deadline. The third thing you can do is make things easier for your end users. Over on Equinix Metal, our instances would not default to having BGP turned on. But we found that with Kubernetes clusters, our users pretty much always wanted BGP turned on. So we started having the cloud provider turn it on for them. And it just made their lives a little bit easier. And then, of course, your standard code maintenance chores: upgrading Go modules, getting ahead of CVEs. So many Go modules. So many Go modules. Dependabot, your friend. Improving and maintaining your CI/CD, making sure your images are publishing to Quay. I found out yesterday it's pronounced "kway." I thought it was "key." I know. Or Docker, or wherever you host your images. Compatibility with K8s, of course: if there's a new version, you need to do something to update a function call somewhere in the libraries. Got to stay on top of that. Great. So hopefully that gives you the foundational material to get to, perhaps, the excitement that you came for, which is: we're going to change everything on you. Happily, you didn't have to go through that change, because you came in already out of tree. So we're good. I feel like we started out of tree. We're good. Too bad for all you legacy ones. So what does that look like if you're in tree? Well, when Kubernetes started, obviously a lot of things were in tree. And those infrastructure-specific components we talked about, those cloud controllers: in tree.
The KEP landed around release 1.11, which I think was possibly in, is it the Mesozoic era, the Paleolithic era? I'm not really sure. But it was the exciting days of 2017, 2018, when I came into this ecosystem, that we started talking about removing the in-tree cloud provider code. Did you hope it would just take a couple of releases back then? Everything is going to take two releases, and it's been going on for five years, obviously, right? So then we started that process of figuring it out, starting with the KEP. Shout out to Tim and Dims yesterday for their great talk about the things people complain about taking too long. Well, we start a KEP and we work through it together. Because it turns out you don't want to break everyone's ability to use clouds. That's a terrible idea. So OK. And for this, it's in beta right now (it went to beta for 1.29), and we're hoping to get it to stable soonish. Right now is the time of tests, of seeing if it still works for you and seeing which tests are broken. And it is a pretty significant code base. We're talking many hundreds of thousands of lines of code, a lot of it generated, that are definitely coming out of core Kubernetes. So there are a few things to look at. If you're going to plan your migration, we have a pretty detailed blog post that you can go check out that says, again, if you were using an in-tree provider, you have to change some command line flags. And you are going to have to run the CCM, and there's a variety of ways to do that, so we're not going to try to enumerate them all here. I'm just pointing you to the thing you need to look at if you think you're going to be doing this migration yourself. So how do I know if I'm affected? Yeah, that's the zillion-dollar question. And the answer is: if you are operating Kubernetes on AWS, Azure, GCE, OpenStack, or vSphere, you could be affected. Because those are the ones that had in-tree code.
Everyone else, like Equinix and others, is definitely not affected, because they never had in-tree code. And as of the next release of Kubernetes, only GCE will still be affected, because everyone else has yanked their code out, and we have an eye chart in a bit that will show you exactly when everything happened. And if you're thinking to yourself, surely every infrastructure provider can coordinate and make this change, their individual in-tree to out-of-tree migration, at the exact same moment: oh dear, that's not going to happen. But we are all communicating with each other. And yeah, we won't have to worry about it much after this coming moment. And of course, everyone is only running the version of Kubernetes that just came out, right? Yeah, no, absolutely not. So, eye chart. Yeah, here's what we've got for you, to try and give you a sense of: am I OK today? Will I be OK tomorrow? So if you are on one of these five providers, you are eligible to be running on an in-tree provider. That doesn't mean you are; you'll have to do a little work to see if you are. But just in case: if you're on one of those in-tree providers and you're on 1.25, for example, you know you're fine, because all the code's still in there. If you're on OpenStack, though, and you're about to upgrade to 1.26, you might want to make sure: oh dear, I'd better make sure I'm on the out-of-tree version. And it's the same story for Amazon on 1.27, and then Azure going into 1.30. But there's a little ripple. And that's that the KEP went to beta for 1.29, which means the default changed, which is totally fine. And that means that if you were still staying on the in-tree provider and you're about to do the 1.28 to 1.29 transition, then you need to go see if you need to set these feature gates, because the default changed. If you need to say, oh, I'm not quite ready to change yet, I still need to stay in-tree a little longer: starting in 1.29, you might have to explicitly request that. Yeah.
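For the "explicitly request to stay in-tree" case the eye chart describes, a sketch of the relevant flags. Assumptions labeled up front: these are the KEP-2395 feature gates, DisableCloudProviders and DisableKubeletCloudCredentialProviders, which defaulted to true once the KEP hit beta in 1.29; the exact flag plumbing depends on how your cluster components are launched, so treat this as a config fragment rather than a recipe, and check the migration blog post for your setup.

```shell
# Hedged sketch: opting back into the in-tree provider on 1.29+ means
# flipping the (double-negative!) beta feature gates back to false on the
# components that consume cloud provider code. The "..." stands for the
# rest of your existing flags.
kube-controller-manager ... \
  --cloud-provider=<your-in-tree-provider> \
  --feature-gates=DisableCloudProviders=false
kubelet ... \
  --feature-gates=DisableCloudProviders=false,DisableKubeletCloudCredentialProviders=false
```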
Of course, everybody should be moving to the out-of-tree version of their providers as soon as is reasonable for you. But we want to make sure you know how long you can stay, in case that works better for your production schedules. Absolutely. Oh, and I have a note here: you should read the Kubernetes release notes. It's so important. And I feel like this is the always-read-the-release-notes moment. Do people always read release notes? Nobody reads the release notes. You should. They're well written. Sometimes they have jokes. They do. The people who write the release notes want you to have a good time with your upgrades and not a bad time. Did you see the oonetties? Oh, that's so cute. So great, yeah. OK, so. All right, so if you are making all this change, what do you get for it? I guess the main thing is just that my cluster still works? I mean, your cluster still works, yes. But also, think about all of these out-of-tree providers and the in-tree providers that we're still cherry-picking patches to. I cannot wait until we no longer have to cherry-pick patches back in-tree. Think about all of those code bases for the provider-specific features: the provider-specific teams are maintaining them, and they're smaller and lighter weight. If they need to do a bug fix or get something out that works for their cloud's new features, they don't have to wait until the next Kubernetes release to get the change out. It's real nice for us. We can just cut a new version, and people who are ready to upgrade can take it and get the new feature. Whereas in-tree, well, it's in-tree; you release with the Kubernetes cadence. And then the other major benefit, as we alluded to before: the Kubernetes code base, as you may have noticed, is large and complex, and has many bugs lurking within it from interactions that nobody anticipated a million years ago, or a decade ago, which feels like a million years ago. I'm not going to lie.
But this change makes the code base, in general, more maintainable. So, for your interests: if you have a feature that you're rolling up to SIG Network and you would really like them to implement, and they look at it and say the interactions between this and all of the in-tree cloud provider code are going to be a hot mess and very difficult to troubleshoot, you may have to wait a while on that KEP. It'll be much easier to get your features in at this point, because the code will be more maintainable. Absolutely. All right, so finally, what did we learn from all this? And if you're thinking at this point, oh, lessons. Yes, lessons, but also, for those of you who came to this because it was close to the coffee and it's the last day and you don't want to walk very far, and you're wondering what is going to be useful to take back to your actual workplace: I think we have a few lessons here. All right. Starting with, I would say this is going OK, but there are things that didn't go super well. For example, as you may have noticed from that eye chart, changing the default on a feature gate, and having it be confusingly labeled with disable, enable, true, false. There's a giant feature gate. Yeah, we didn't even mention it. It's a double-negative feature gate. It's, oh my gosh. So it's a little confusing. And also, as you noticed from that chart, we didn't all do it at the same moment, because some teams really wanted to move forward and were not ready to keep backporting bug fixes forever, and other teams weren't ready. We also found, in the extensive testing we talked about earlier, that there were some issues in the CI/CD workflow, which we have now found and fixed. But that took time. And there are some absolutely delightful bugs, found mostly by Antonio, who we're going to shout out and thank along with other people later. But Antonio gets a double shout-out, because he has found and identified and come up with solutions for a great deal of race conditions.
Remember we talked before about the complexity inherent in really large, really old code bases. So there were some race conditions where, once the CCM behaviors that we had just kind of ported out of tree were running out of tree, it turned out other things that were still in tree had been influencing them. And when you pull the code out, suddenly you find the places where it does not do what you thought. And then, yeah, again, we talked a little bit about the coordination. And then probably one big area is that, because the interactions between CCMs and TestGrid are a lot, not everyone on k/k has the context to fix the significant problems that we found. So it's just a bottleneck of people and time. And if you're sitting here thinking, I'm good at testing and Kubernetes: we would very much like to talk to you. All right. So what went well? Well, we did it. We got to beta? Yeah, we got to beta. We got to the stated goal. We did it, got to beta status, with the feature gates, and we're now on the path to stable. So all that process is going the way it's supposed to. We used the Kubernetes processes and the workflow, and it worked out. There were those road bumps we mentioned, but overall it had good guardrails and kept us on track. And even though it was a very complex task, the processes helped keep confusion to, I guess, a minimum. And also, it was delightful to see all these people who helped make it happen. You want to give some shout-outs, Bridget? Absolutely. There are so many people, but I had to put a few names on the slide, for bug fixes and for tracking down interactions and differences between in-tree and out-of-tree for various providers: Antonio, Andrew Sy Kim, Dims, Joel Speed, elmiko, just to name a few. There are plenty more. But these are people who have worked really hard to make sure this is a good experience for everyone.
And what still confuses us? I think the interaction between SIGs. I think we definitely need more people in SIG Cloud Provider, which is, of course, one of the more difficult SIGs to join and find good first issues in. Because, well, first, get really familiar with the code base at one of the big providers. Good luck. But we are still working on a bunch of those open issues. And we need more contributors, like every other part of Kubernetes. And we're not 100% sure how we grow the way we need to, but we do need to. So if you have ideas, come and share them with us. Because we are totally looking for help changing things up and moving forward. All right, so here are some action items we've got. We've added some temporary reviewers; that's one of the things we're trying. And we're reaching out to build consensus on areas of common maintenance that we might share with some of the other SIGs, to help avoid some of these surprises in the future. We're also finding a time to get some of our EMEA and APAC contributors more plugged in, rather than just the US-centric meeting time that we have now. And then finally, we're trying to keep detailed info about our activities shareable, so you all know what's going on and can easily feel like you can join in. And, oh, go ahead. Oh, and shout out to Michael McCune, elmiko: go take a look at the SIG Cloud Provider spotlight that he was interviewed for. He was supposed to be here to give this talk as well. He prepared and helped us prepare a lot of this material, and did a great interview. So go take a look at that spotlight and see if there's anything in there that speaks to you. And then the other couple of things: we talked a little about testing before. If that's of interest to you, we basically need to de-provider the tests; we need common end-to-end tests for all cloud providers.
But then also clean up the hard-coded reliance on individual providers. So the TL;DR is that there's a bunch of test cleanup in process right now. If that's of interest to you, go take a look at the open issues. Yeah, and I think we're also working on an end-to-end framework that I, as a third-party cloud provider, could just pick up and use to make sure my code works the way it's supposed to. Because we really don't have that now. And if that's your jam, we need your help. So how can you contribute? Come to the meetings. That's really the big thing. Just show up to the meeting. It's Wednesdays at 9 Pacific, noon Eastern. You can convert it to your time zone. You can also join the community Slack; #sig-cloud-provider is our channel. Come say hi, and let's talk cloud providers. And of course, you can get involved in the code base itself. That's the main code base right there; click the link. Yeah, basically, SIG Cloud Provider is what we make it together. And that's what we want to do with you. So that's it. I think that's our time. Yeah. We thank you, appreciate you. And we will hopefully see you all in the community. Take care. Thanks.