Welcome everyone. This is the Kubernetes SIG K8s Infra updates and intro talk. Or as we're calling it: Kubernetes infrastructure, the next frontier. Better? Maybe? No? It's been a long conference. Let's start over. Hello everyone. Welcome to the Kubernetes SIG K8s Infra talk. I'm Benjamin Elder. I work at Google. This is Arnaud Meukam. He works at VMware. Maybe Broadcom. We'll see.

So what is SIG K8s Infra? SIG K8s Infra manages the Kubernetes project's own infrastructure: the cloud-native infrastructure that we need to run Kubernetes, the project. That includes things like registry.k8s.io, where you're fetching our container images, or dl.k8s.io, where you might be fetching a kubectl binary, and all the cloud assets and CI resources that we need, in conjunction with some of the other SIGs, like SIG Release and SIG Testing, that manage those specific assets using the resources we provide.

Importantly, we've been working on migrating the project to exclusively run on open, community-controlled infrastructure. A couple of companies, Red Hat and Google, were involved very early on and bootstrapped the project, and in the course of that, a lot of the infrastructure was set up by Googlers on GCP, or even leveraging Google's own non-product offerings, like the internal package host used for Google Cloud, for the Debian and RPM packages for Kubernetes. We want to get away from that. It's a liability: it means that you depend on people inside of a particular company, and it's hard for the project to maintain continuity, to know how these things work and to make sure they're affordable and sustainable. So there's been a very long process that the SIG has been running to make sure that all those things from eight years ago, when we got the project up and running, get to a place where anybody participating can run them, fully open and transparent, and through committed funding from the different vendors. And we've made a lot of progress there.

So that's our theme for what we've been up to this past year: getting things more sustainable. We went into this year in January with a bill projection for the Kubernetes-owned GCP account of over $4 million, against a $3 million commitment of credits provided at the beginning of the year. That's not gonna work. At the end of last year we ran over and we got some extra credits, but that was sort of an emergency measure, and we need to get things back to an actually sustainable place. So, "this is fine."

So what did we do? We worked again with the GCR team to try again at putting code in Google Container Registry to forward traffic to the community's own modern multi-cloud host. We have a lot of traffic that comes from users on other clouds, mostly Amazon, that are using the community-provided images as opposed to some external distro-hosted images, and it's really, really expensive to host something in one cloud and serve almost all of your egress to another cloud. It's just not a cost-effective answer; you at least want to be using a CDN offering or something like that. GCR wasn't intended for that purpose. It's primarily meant for hosting container images that you're going to run on GCP, and it served us well, but it just was not a cost-effective answer, and it has been replaced by Artifact Registry.
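To put rough, illustrative numbers on that cross-cloud egress problem (the rate and volume here are assumptions for illustration, not the project's actual billing), the arithmetic looks something like this:

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions, not actual project figures: public
	// internet egress from a cloud is commonly priced somewhere around
	// $0.08-0.12 per GiB, and registry traffic for a project this size
	// runs to petabytes per month.
	const egressUSDPerGiB = 0.09     // hypothetical list price
	const trafficGiBPerMonth = 2.0e6 // ~2 PiB/month, hypothetical

	monthly := egressUSDPerGiB * trafficGiBPerMonth
	fmt.Printf("monthly: $%.0f  yearly: $%.2fM\n", monthly, monthly*12/1e6)
	// ~$180,000/month, ~$2.16M/year: the same order of magnitude as the
	// "north of $2 million a year" figure that comes up shortly, and why
	// serving AWS users from inside AWS (or via a CDN) matters so much.
}
```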
So the community built registry.k8s.io. registry.k8s.io has a small application in front that allows us to securely load-shed the traffic for the image layers to alternate hosts. So most of the time, for people that aren't on GCP, we're going to forward you to the nearest AWS region for the layers, now that we have credits from Amazon.

But users weren't switching. Everyone was still pulling from k8s.gcr.io, and we were seeing spend go up on both of these. registry.k8s.io runs much more cost-effectively, but we really needed to get people over there. So we looked at the highest-traffic images and came up with a plan: put some safety measures in place, but get traffic forwarded and move people over so we can get to a more reasonable spend. And it has stuck this time, and that was a huge part of it. We were spending north of $2 million a year just hosting the container images that everyone uses.

We also cut some spending in some other areas temporarily. The Kubernetes project runs 5,000-node scale tests to make sure that Kubernetes is performing and scaling, and some of those take 14 hours with 5,000 VMs plus the control plane. That's really expensive. We cut back on the frequency of some of these tests temporarily while we work on other options for running them.

And we started adopting a bunch of new resources, which we talked a little bit about with the registry. At the end of last year, Amazon announced that they would also provide $3 million in credits. It's a little bit more complicated than that, but essentially, yes, Amazon is providing us a large amount of credits to run things on AWS now, and we're onboarding. So suddenly we have a ton more resources to work with, but we can't just snap our fingers and get things over there. Starting in January, we had a huge increase in the amount of resources available to us. We've been pointing image downloads that way with the registry, and we've been working on a large community effort to get the CI running more on Amazon. All the CI runs on GCP, on GKE, spinning up tests there. So we've been making sure that if a CI workload is a Kubernetes pod that doesn't need external assets, because it's running something like unit tests or a build, we can schedule it on any provider with a Kubernetes cluster, and right now that's Amazon.

So we've been moving these, not just between the clouds, but also out of the things that are still running on the Google-internal projects that the community hadn't migrated, because we halted that work while the container registry costs blew up on us. We've been running an effort to get more of it onto the different clouds, and we've particularly been working on getting the scale tests to work on different clouds. They used to use some bash scripts from SIG Testing, my other SIG, that only run on GCP and that no one wants to maintain anymore. So we've been working with the scale team in the Kubernetes community and with the testing team to get the scale tests running on GCP and AWS using kOps in a nearly identical way, so that we can interchangeably choose where to run those resources depending on where the rest of our costs are coming from, because it's one of the larger costs. And like the container image bandwidth, we should be able to shift it wherever it makes sense. Shout out to Dims and Todd Neal at Amazon for helping us with this effort.
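Going back to the registry for a moment: the load-shedding described above follows a pattern roughly like this minimal Go sketch. This is not the actual registry.k8s.io code; the IP ranges, mirror URL, and handler names are made up for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

// Hypothetical values: the real service maintains per-region mirrors and
// consumes published AWS IP ranges; these are stand-ins for illustration.
var awsRanges = []string{"52.0.0.0/11", "54.64.0.0/12"}

const mirror = "https://example-mirror.s3.us-east-2.amazonaws.com"

func fromAWS(ip net.IP) bool {
	for _, cidr := range awsRanges {
		_, block, err := net.ParseCIDR(cidr)
		if err == nil && block.Contains(ip) {
			return true
		}
	}
	return false
}

func handler(w http.ResponseWriter, r *http.Request) {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	ip := net.ParseIP(host)
	// Redirect only blob (layer) downloads; manifest requests and
	// everything else stay on the default backend.
	if ip != nil && fromAWS(ip) && strings.Contains(r.URL.Path, "/blobs/") {
		http.Redirect(w, r, mirror+r.URL.Path, http.StatusTemporaryRedirect)
		return
	}
	fmt.Fprintln(w, "served from the default backend") // placeholder
}

func main() {
	http.HandleFunc("/v2/", handler)
	http.ListenAndServe(":8080", nil)
}
```

The design point is that the big, cacheable objects (the layers) can be served from whichever cloud is cheapest for that client, while the small application in front keeps control of the namespace and the security-sensitive requests.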
We love Fastly. Fastly came to an agreement with us through their Fast Forward program this year, so they're giving us a very large amount of CDN bandwidth for our other bandwidth problem, which is the binary downloads: kubectl, kubeadm, kubelet. So dl.k8s.io, our download host, is now powered by Fastly. We have the Fastly CDN in front, and you can see it's soaking up a huge amount of traffic for us. This all used to be served out of a GCS bucket in one region, because that's what someone set up a long time ago, and it turns out it's really hard to move these things once everyone's using them. So, Hannah, thank you. We're on Fastly.

And we have other sustainability issues: not just the costs, but the people. We've had the same people running this for a long time, and they've had a lot of other things to do. So this might seem a little bit familiar, but myself and Mahamed are the new tech leads for SIG K8s Infra, to help guide these efforts. And we want to thank the emeritus tech leads, Tim Hockin and Aaron Crickenberger at Google, who ran this effort for many years. We have a lot more to do.

Yes, so the title of the slide is "the next frontier" because, from Kubernetes v1.0 to now, we went through a journey from bootstrapping Kubernetes on Google infrastructure to the point where the community owns the infrastructure. We now own a lot of workloads, maintained mostly by the community. So what needs to be done, and what's the next frontier for us as a SIG?

I think the first thing is to fully own the entire SIG infrastructure, because historically, like Ben said, a lot of things were built inside Google. In the case of the Kubernetes project, we have a build and test orchestrator called Prow whose control plane basically still runs inside Google, but the majority of the tests, thanks to Ricky here, run inside the community infrastructure. We still need to migrate the rest. The SIG itself can't take all of that on, because it's too much, so we reach out to the community to advocate for this and push people to do the migration. That's something we still need to do for 2024, and hopefully by the end of 2024 we'll have everything migrated. That's a prerequisite, from a technical perspective, for moving the control plane to the community infrastructure, so we don't need to rely on Google employees to run it. So: have everything community-owned, and have an on-call team made up only of people interested in being on-call. We don't necessarily want people to carry the pressure of 24/7, but it would be interesting to have a group of trusted people on-call for the control plane.

The second thing, the other part of the CI, is TestGrid. TestGrid, at testgrid.k8s.io, is basically a visualization tool to see the results from the tests. When you visit that site, you can see the results from all the tests running for the Kubernetes project. And when I say that, I mean the Kubernetes project itself and all the sub-projects, but also tests coming from third parties, companies using Kubernetes as the baseline for their products, AWS and others, those kinds of things.
So the thing is, the backend has been rebuilt by Google in order to transfer ownership to the community, but the UI is still in progress. There's a new effort to rebuild the UI in open source so we can migrate away from Google infrastructure there as well. And that's a huge effort, because ultimately we rely on knowledge from people, and some of us are not the best front-end developers. So that's why we need help on this: to have people come and help us build an amazing front-end based on modern web technology, Flutter and Dart, those kinds of things. Not based on Google-internal technologies. Exactly: normal technology, not Google-internal stuff.

Another aspect of the problem is maintenance, the day-to-day operations: you have this infrastructure, but you need to operate everything. The GKE clusters for the builds, the Fastly services we use, AWS, all those kinds of things, plus the interactions between them, the contracts, testing, releases. All of this is a lot to maintain, and we are not staffed enough for all of it. I think we have about seven people actively working on it, supporting an infrastructure serving more than 3,000 people. That's a lot. So I think the effort for us in the coming years is to revisit the way we do day-to-day operations: how we operate the infrastructure, decoupling things so we don't have a big blast radius when there's a misconfiguration somewhere. We've had that in the past, where we trusted a maintainer to ship something and something broke at a larger scale. It happens; when you build something, there's always an issue somewhere. But it costs us a lot of time, and at some point it's a risk of burnout. It's a liability for the community itself when you don't have enough people to operate the infrastructure. So for us, it's about decoupling the different public and critical workloads we have. That helps us ramp up and empower more people to be experts and leaders on those different workloads, so we can delegate operation and ownership to trusted members of the community and become a larger group helping the effort to run the infrastructure.

One thing about infrastructure is that we don't ship features in the software engineering sense; we don't build a feature. We maintain and improve infrastructure so the community can thrive on building features. That's what we do. We sometimes have requests coming at us about specific requirements, like being able to host OCI artifacts, because there's work in progress to define WebAssembly extensions as OCI artifacts. How do we make that happen, so the community can host those and build on top of them?
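As a rough sketch of what that hosting request involves: an OCI registry stores a Wasm module as a blob referenced by a manifest with artifact-specific media types. The sketch below builds such a manifest with the OCI image-spec Go types; the Wasm media types shown are ones some current tooling uses, and are an assumption rather than a settled standard.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/opencontainers/go-digest"
	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

func main() {
	module := []byte("\x00asm\x01\x00\x00\x00") // minimal Wasm module header bytes
	config := []byte("{}")

	manifest := ocispec.Manifest{
		MediaType: ocispec.MediaTypeImageManifest,
		Config: ocispec.Descriptor{
			// Media types used by some existing Wasm-on-OCI tooling;
			// the ecosystem has not fully standardized these yet.
			MediaType: "application/vnd.wasm.config.v1+json",
			Digest:    digest.FromBytes(config),
			Size:      int64(len(config)),
		},
		Layers: []ocispec.Descriptor{{
			MediaType: "application/vnd.wasm.content.layer.v1+wasm",
			Digest:    digest.FromBytes(module),
			Size:      int64(len(module)),
		}},
	}
	manifest.SchemaVersion = 2 // OCI image manifest schema version

	out, err := json.MarshalIndent(manifest, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

From the registry's side, serving an artifact like this is just more manifests and blobs, which is why the request is plausible; the open question for the SIG is the bandwidth and operational commitment it implies.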
Those are the kinds of subjects we get, among the other problems we have. So for us, it's about maintaining the infrastructure and keeping it up to date, and also preventing any kind of attack. Like the keynote said today, there are attacks everywhere; security. We need to reduce the threat model for the community. We are a successful project; we have become an ecosystem. Over time, we have become a target for any bad actor out there trying to inject something in order to take over a component or the community. All of this takes time, so when we think about operations and maintenance, we need to make sure that any attack has, I would say, minimal impact on the infrastructure.

We're also working, like I said, with different SIGs inside the community. Currently we're working on an effort to improve the way we run E2E tests. There's a KEP, you can check the link, driven by Mahamed, who is now, congrats again, a TL for SIG K8s Infra. The context is that at the beginning of the project, all the E2E tests relied on GCP. Until now, you basically could not fully run the E2E tests outside of GCP, which is a kind of vendor lock-in for us, because at some point we want to be able to welcome other providers building managed Kubernetes services to run E2E tests upstream; some are doing that downstream today. So in order to do that, there are conversations happening between the different SIGs to migrate away from the existing tooling and use something that makes us agnostic of the cloud provider. GCP is good right now; we're making progress on AWS. We hope to see more cloud providers coming to talk to us about this. It would be interesting to have, like we do with EKS, upstream tests running on Azure or Huawei, those kinds of things, because they provide managed services. By running upstream, we get data about how the tests run differently in those environments, and it's a good feedback loop for the different SIGs: how does Kubernetes behave on, what's its name, the OS from Microsoft, I forgot, Flatcar, for example. We could run Flatcar on Azure and see how Kubernetes behaves in that environment, those kinds of things. That's one of those efforts.

Then there's transparency. Transparency in the sense that we are an open source project, and one of the principles for us is to be able to show how we run the infrastructure, what we consume, and, most interestingly, the costs, the usage of that infrastructure. We are now trusted with six million dollars of credits between two cloud providers, and Fastly is also trusting us with basically unlimited bandwidth, but we need to be able to report on that. It keeps us in check, so we don't burn money for no reason. We also track those costs because we don't want to abuse the credits. There's no intent for us to use all three million by the end of the year; you need to keep some margin in case an incident happens. So being transparent about how we use resources helps us, first, like I said, stay in check and track the effort, and also report to the cloud providers making these donations, because they support us, but they also need to trust us with the donations they make. In order to do that, we need to be transparent. We've worked over the past few years to provide that. For GCP, at the link I'm showing here, you can see the costs over any kind of timeframe, from one year down to, I think, a minimum of one day. You can see the cost of the GCP infrastructure over one day and see: okay, this is what we used over one day.
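As a sketch of the kind of check this reporting enables (entirely hypothetical numbers, and not the project's actual tooling): compare the latest week's spend against the trailing average and alert on a spike.

```go
package main

import "fmt"

// spike reports whether the most recent week's spend is well above the
// trailing average of the previous weeks.
func spike(weeklyUSD []float64, factor float64) bool {
	n := len(weeklyUSD)
	if n < 2 {
		return false
	}
	var sum float64
	for _, w := range weeklyUSD[:n-1] {
		sum += w
	}
	return weeklyUSD[n-1] > (sum/float64(n-1))*factor
}

func main() {
	// Hypothetical weekly GCP spend in USD; the ~$6k/week average matches
	// the figure mentioned in the talk, but the individual numbers and
	// the alert threshold are made up.
	weeks := []float64{6100, 5900, 6050, 6200, 9800}
	if spike(weeks, 1.3) {
		fmt.Println("cost spike: check the CI for broken or runaway jobs")
	}
}
```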
Is it a bad thing? Is it a good thing? We have weekly reports, and we track costs over time. We have an average of about $6,000 per week now, so we're trying to track that. If there's a regression, there's a spike, and it alerts us, because it's a sign that something is broken in the CI. So by having cost reports, we're also able to identify CI issues, because they show up as an abuse of resources. And that's not the only thing: we're working on making the AWS costs accessible too. Currently you need to be part of the SIG leadership to see those, or come to the meetings, because we also bring reports to the bi-weekly meetings. So we're working on making that accessible.

In doing that, we end up in a situation where providing access to the community is tricky, because as an open source project, as a community, we can't ask people to provide personal information in order to create an identity with a cloud provider. We can't force people to do that, for legal reasons, I would say: GDPR, PII, California data privacy law. We can't tell people: use your email to connect to a cloud provider. We can't say that. So for us there's a struggle, because as a VMware employee, my company provides me an identity solution to authenticate to any internal service. As a community member, I don't have that, because the community doesn't have, I would say, a central system where we can authenticate. To be a member, you need to provide a GitHub user; we don't create an identity in Keycloak or whatever other system to access the services we use. It's mainly GitHub. So one idea is to define GitHub as the identity provider for the community and go through something like Okta or Auth0 to provide SSO. We have conversations with Okta currently, they've provided access to the platform, and we're trying to ramp that up: put Okta in the middle, with GitHub as the identity provider and Okta as the authentication and authorization platform, to access all those services we use. Because as a community, as a maintainer, you now use a minimum of five services: GCP, Slack, GitHub, all those kinds of things. So one of the problems, and it's kind of a blocker for us, is providing an identity management solution to the community.

So like I say, we're working on interesting problems, and we hope next year to continue on that, to push the frontier to the point where we are completely independent and have full ownership of the infrastructure. So please join us, and provide any help you want, on your own time. We don't force people to do a specific number of hours, those kinds of things. So thank you.

We left some time for questions if anyone has one. Are there any questions? I don't think we have a mic, so... No, we have one. Okay, wow. So, any questions? Going once, going twice. Thank you everyone for coming. Thank you all for coming.