Hi, everyone, and welcome to the intro and update for SIG K8s Infra. This is mostly an intro and an update on the SIG. It's the first time we're doing this, because historically we were a working group, so we didn't have the opportunity to give an intro. We became a SIG in October of last year, which means that, in terms of community, we now have the authority to provide information about what we're doing. So welcome to everyone in this session. OK, let's go.

My name is Arnaud Meukam. I'm a senior software engineer, and I work for a company specialized in Kubernetes distribution. I'll be presenting with Benjamin Elder, a senior engineer from Google. Dims, whom I think everyone knows, is actually in a different meeting at the moment, so we're going to be the two people doing the presentation. We actually just came from the governing board meeting, talking to them about the K8s Infra situation, and Dims is still there representing us. It's not open invite, sorry. Fortunately, most companies have people over there to talk about the different K8s Infra subjects.

So what is SIG K8s Infra? We are the group with full ownership of the community infrastructure. I want to put things in context. Kubernetes is a seven-year-old project. It initially started inside Google, using Google infrastructure, and over the different stages of the project there were conversations about making the community aware of the infrastructure it runs on and giving the community full ownership. That's why we started as a working group: to have conversations about how we transfer all the resources and assets of the project from Google infrastructure to community-owned infrastructure.

So the main responsibility of the SIG, and this was the first step for this group, is the successful migration from Google infrastructure to community-owned infrastructure. Community-owned in the sense that it is fully controlled by the community and supported by the CNCF, with no hard dependency on people from Google. We also have the authority to establish policy about where we run the infrastructure, based on different aspects of the environment; it might be an ethical question, it might be a social issue. We try to reach consensus on infrastructure-related decisions, and we have different policies around running the community infrastructure. We also report on spend: we use cloud providers, so we have to pay for everything we use, and we publish annual and monthly reports so the community can see we are transparent about how we use the budget allocated to it. We didn't put it on the slide, but maybe at the end of the talk: we have a Data Studio dashboard that gives you different views of what we consume per year, per week, or per day, depending on the query you put into the dashboard.

So what exactly are we doing? Like I said, we started with the migration. Currently we have migrated about half of the infrastructure running inside Google to the community infrastructure, but we still have a lot of jobs inside Google. That's a constant effort, and it's not only up to the SIG.
It's also up to the community, because by definition SIG K8s Infra and SIG Testing are not responsible for the jobs and the different tests running inside the infrastructure, whether it's Google's or ours; the SIGs have the full context of their tests, why they're running them, and what's required to run them. So this is a combined effort, and we try to talk to the different SIGs about it. Last year we migrated the jobs from SIG Scalability, the famous job running one cluster with 5,000 nodes. That was an interesting challenge. We also talked to SIG Cluster Lifecycle about migrating. Beyond that, we also have to migrate the contributor tooling around the CI infrastructure: Prow, TestGrid, the things you use day to day to get information about the tests. Those still run inside Google, so we also work with Google on that migration. It's not an easy exercise, because sometimes there's a lack of time, a lack of bandwidth, or not enough information to migrate. So like I said, this is a combined effort. If you are in a specific SIG, or really anyone: please check whether your jobs run inside the community infrastructure. We might talk later about how to verify that, but if you have a job still on Google infrastructure, please reach out to us; I'll give you information about how to contact us.

We also have the system packages. Almost all the artifacts produced by the community are provided to users and companies: system packages, binaries, container images. Currently the system packages are still inside Google. It takes time; we have to discuss this with Google, because first there's an issue with signatures and how we handle that. We also have the binaries: when you install kubectl and the other tools using dl.k8s.io, that's Google-hosted, so we have to transfer that too. Like I said, most of the subjects now are about migration: how we migrate to the community infrastructure, across multiple cloud providers. It's a difficult subject, because you need expertise with each specific cloud provider, you try to maximize usage of that provider, and you also do cost optimization, because we use a lot; we're even bigger than the Apache Software Foundation. It's a continuous exercise, so we talk to different cloud providers, and to people with specific expertise.

I'm going to throw something in there, Arnaud. One thing that I'm realizing is probably confusing for anyone not familiar with the project: when we talk about migrating these things out of Google, we're talking about migrating them either from some special internal infrastructure, or from google.com-organization GCP projects, to kubernetes.io GCP projects. We are still almost entirely on GCP, but we are moving things to where they sit inside an organization the community controls, with public billing funded through the CNCF, as opposed to something some Googlers stood up at the beginning of the project that maybe no one knows about, that limited people have access to, and whose cost no one knows. Another very important thing, which we couldn't have had in these slides until this morning and didn't have time to add: Amazon has announced that they're going to be joining in and providing credits.
But up until this point, the official CNCF credits program has been $3 million a year from Google, plus some unofficial resources from other providers, as far as we know. OK.

Like I said, we collaborate with specific SIGs. For example, we worked with SIG Release on moving the container images from Google to the community. For anyone who went to the keynote this morning: we made the announcement that registry.k8s.io is now GA. That's one of the efforts we had with SIG Release, because SIG Release owns the release process and is responsible for pushing the different container images to the registry endpoint. So we have those kinds of conversations with SIG Release, for example about how we handle artifacts across different cloud providers for distribution, to simplify distribution and improve latency for anyone consuming them. We also talked with SIG Testing about moving the CI infrastructure to us; I see the chair in the room. We've had interesting conversations about how we can migrate Prow, the CI system of the project, from Google to the community infrastructure. Those are the conversations we're having with those SIGs at the moment.

So like I said, in the keynote we announced a new registry domain. We had k8s.gcr.io; now we have this one, fully owned by the community. No need to talk to Google about it, no need to escalate to Google when something happens; we handle whatever issue comes up. Since we rolled it out, we've had two incidents fully handled by us, and they were quick and easy to identify. So that's one of the things. I'll let Benjamin present.

So this has been the current state. Everyone's clusters that aren't using EKS or GKE, or someone's distro where they provide their own images, are all running on this thing. It's something that the GCR team provided for us, the community side of it is maintained by a few volunteers, and it's serving a ton of traffic. Now we're deprecating it. We'll leave it up; this isn't the typical Google deprecation, and I'll probably take shit for that later. We're going to keep it up, but we have to keep it funded, too. So one of the things we're trying to do here is make it possible for other vendors and the community to participate in controlling how this works, cost-wise and in how we distribute things. We don't need a single-vendor solution; Kubernetes is all about hybrid cloud. How can we safely offload these things?

So it's GA. We've had a couple of container runtimes move their defaults for the pause image, the one that's baked in, that every pod needs, that gets pulled from us. And before Amazon did the credits program, they had already made an official agreement with us to help host traffic from AWS users, and we've been working on that. I'll talk a little more about that.

So this is what it looked like before. The GCR team provided a special alias for us that helped us with some of the traffic, the latency, and geolocation: they gave us an endpoint that looks at the incoming traffic and routes it to an Asia multi-regional mirror, an EU multi-regional mirror, or a US multi-regional mirror, and those backing mirrors are owned by the Kubernetes project and fully controlled by it. But they provided this infrastructure for us because we didn't have anything at the time.
Previously, we were on a single container registry in the US, so everybody using Kubernetes was pulling from the US and we had traffic going cross-continent. With registry.k8s.io, we've moved to Artifact Registry, the redesign of Google Container Registry, and we are running regional instances ourselves in 20 regions around the globe. The community controls all of this, and it also works with S3. We have GCP infrastructure in place because that's what we had, and it runs the front end, but we are able to offload most of the traffic when one of these additional clouds can provide us credits. We have a very lightweight application that just inspects the incoming traffic and directs it to the right place. We use Google Cloud Load Balancer to do geolocation to our back ends, which run in those 20 regions as well, and from there our application inspects the incoming request and determines where to send it.

So this is what it looks like. To understand how this works, you have to understand a little bit about how pulling a container image works. When you pull a container image, the first thing the client does is make a request to the v2 Docker registry API, just /v2/. Depending on the response, it will either go get an auth token through some negotiation protocol, OAuth or similar, or, if it gets a 200 OK, which is what we return, it concludes: OK, I don't need an auth token; this is a publicly readable registry. And that enables us to do some interesting things. Because everything is publicly readable and there are no auth tokens to worry about, we can hybridize where these things are served from. You no longer have a credential or a scoped token to pull an image on GCP; you have no token, you're just reading. We may send you to GCP, we may send you to Amazon, and that can differ per request.

From a trust perspective, the Kubernetes project has a bunch of infrastructure built around making sure no one can mess with our production registry. No humans have direct push access. There's a manifest in GitHub: you commit "I want this digest from this staging registry to be promoted to this tag," and it has to be peer reviewed and merged. Mutations are never permitted. In an emergency we might possibly let someone delete something, but that would set off our infrastructure alerts. So we have this really locked-down, community-controlled registry, and we're not prepared to fully hand it off to a bunch of mirrors and figure out how to secure all of them. But we need to offload this bandwidth. Well, the bandwidth isn't in most of the registry API. Asking whether you need auth, or asking what a container image is, gets you a small JSON response that describes the image. The bulk of the bandwidth is downloading the layers of the container image, which contain the content. And those are content-addressed: they are addressed by the hash of the layer, and clients verify that hash. In all major container clients, containerd, CRI-O, Docker, Podman, all of these tools, the digests arrive in the initial API responses, and then the clients request the layers by those digests and validate them.
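To make that concrete, here is a small illustrative Go sketch of what a client does, written against the public registry.k8s.io endpoint. This is a simplified walk through the standard registry v2 protocol, not the project's actual code, and it assumes the pause image is still published under the 3.9 tag:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Step 1: probe /v2/. A 200 OK (instead of a 401 auth challenge) tells
	// the client this registry is publicly readable and no token is needed.
	resp, err := http.Get("https://registry.k8s.io/v2/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("GET /v2/ ->", resp.Status)

	// Step 2: fetch a manifest (the pause image every pod uses) and check
	// that its body hashes to the digest the registry reports. Clients run
	// the same verification on every content-addressed layer they pull.
	req, err := http.NewRequest(http.MethodGet,
		"https://registry.k8s.io/v2/pause/manifests/3.9", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept",
		"application/vnd.docker.distribution.manifest.list.v2+json, "+
			"application/vnd.docker.distribution.manifest.v2+json")
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	reported := resp.Header.Get("Docker-Content-Digest")
	computed := fmt.Sprintf("sha256:%x", sha256.Sum256(body))
	fmt.Println("reported:", reported)
	fmt.Println("computed:", computed)
	if reported != computed {
		log.Fatal("digest mismatch: content differs from what was promised")
	}
}
```

If the registry, or any mirror behind it, returned tampered content, the computed hash would not match and the client would reject it; that property is what makes serving blobs from quickly stood-up mirrors safe.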
So now we can have a very quickly stood-up, untrusted mirror. AKA: we are abusing the fact that GCR is backed by a GCS bucket that contains the layers, we rclone them to S3, and ta-da, we have a layer mirror. We have a very small application doing this, on Cloud Run for the moment, because we don't have much of an ops team here; it's just a couple of volunteers. In the future, if we have more staff, we could pivot to Knative and run on Kubernetes clusters run by the community. That tiny Cloud Run app, running in all those regions, looks at the traffic and says: oh, this is an Amazon IP address, and you're requesting a layer; let's see if the layer has been synced to the bucket, and send you there instead. For Amazon users, this means a lower-latency download for the bulk of the content. For the project, it means the actual cost is lower, because we're serving from all of our popular regions directly, so there isn't even cross-region traffic for the bulk of our traffic, and we can split the bill between the providers now that we have credits from more than one of them.

So this is the future we're headed toward, and we need to get all of you to move your traffic here. We are also working with SIG Release on all the tooling, because the promotion tools have to support this. And we're working with the Node community, which runs the kubelet, and with the cluster lifecycle folks on things like kubeadm, kops, and kubespray, to shift the default toward this registry. One remaining problem: we would like to just shift all of this traffic, but we've found that users are very sensitive about filtering exactly which endpoints they contact, and they come to us and say, oh, you broke us. We tried working with the GCR team to redirect the existing endpoint straight onto the new backend, but we found people don't understand that this is Kubernetes, and it breaks them because they're filtering their traffic. So another important thing, as we roll this out, is that we're clarifying expectations as a project: the registry API will be stable, these images will be available, and we will do everything we can to keep it up. But if you need tight compliance over where you're pulling things from, please mirror these images somewhere under your own control, where you know what the back ends are. That way, at any time, if Azure jumps in and decides to give us some credits, we can start using them as quickly as possible, because we now have a very repeatable pattern for setting up mirrors for the bulk of our bandwidth.
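For anyone who does need that tight control, here is a minimal mirroring sketch using the crane package from go-containerregistry. The destination registry name is hypothetical, and crane picks up push credentials for it from your local Docker keychain:

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// A couple of images most clusters need; adjust to what you actually run.
	images := []string{"pause:3.9", "coredns/coredns:v1.10.1"}
	const src = "registry.k8s.io"
	// Hypothetical destination registry under your own control.
	const dst = "registry.example.internal/k8s-mirror"
	for _, img := range images {
		// crane.Copy pulls the image from src and pushes it, digest for
		// digest, to dst, preserving the content addresses.
		if err := crane.Copy(src+"/"+img, dst+"/"+img); err != nil {
			log.Fatalf("mirroring %s: %v", img, err)
		}
		log.Printf("mirrored %s", img)
	}
}
```

You would then point your nodes at the mirror instead of the public endpoint, for example via kubeadm's imageRepository setting.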
So I think that's all for the major issue of the year. I want to call it an issue because it's a cost issue; we had to talk about this on Monday during the Contributor Summit. We spend over a million dollars per year on distribution. Anyone running, say, "kind create cluster" pulls from us, and we pay for that. We're close to $2 million this year; by the end of December we will be close to $2 million. That's why we insist that people use the new endpoint, so we have better control over the traffic. If you're interested in helping, in convincing your managers and leaders to help us, we have a bi-weekly meeting on Wednesdays at 10 p.m. Central European time (I'm from France, so I put it in French time first), which is 4 p.m. Eastern time. You can also check the charter; it goes over in detail what we said at the beginning about responsibilities, roles, and the different things in scope and out of scope. We also have a Slack channel; come ask questions about anything related to infrastructure for the project. We centralize everything in one GitHub repo: issues, configuration management, and so on, except for registry.k8s.io, which is a separate repo. I might put that later in the information. We have an email list; you can send an email if you have a question. And you can reach the GitHub repo for the registry by visiting the registry in your browser. Yeah, it redirects you there. Exactly.

So I think we have time for one more thing, something I neglected to mention that is an update from today, from the meeting we were just in. The Kubernetes project has been about one to two weeks away from exhausting our credits for running all this GCP infrastructure, as we've been a victim of our own success in everyone pulling these images. Google announced in the governing board meeting (I'm not sure when it'll go out wider) that they'll be providing an additional $600,000 this year to cover the overage for the next 45 days while we figure out what's next. And there is now a recurring $3 million a year, from Google and now Amazon, going into next year, as well as engineering support. So we've come very close to Kubernetes shutting down here, but we should be good. To help us going forward, we really need people to move toward the community domain, where folks like us who work in the project, no matter what company we're at, can work together to make this sustainable. Any questions?

Hey, folks. I was just wondering whether moving to S3 buckets can generate egress costs later, and whether that could become a problem for us in the future as well?

Well, at the moment we're using very little on Amazon, so we have quite a bit of headroom there. Also, we've analyzed the traffic, and these S3 buckets are in all the regions where we have significant traffic. Now that we have a larger donation, it probably makes sense to put them in most regions, or all regions; we'll have to revisit that. But for the moment, for by far the bulk of our traffic, we've actually made the infrastructure a bit more complex to avoid exactly that, because we haven't known whether we'd have more resources. We're using published Amazon data to map, within the code, to a specific bucket based on the region the AWS traffic appears to be coming from, which may not match the Cloud Run region. Going forward, since we have more resources and could maybe afford a little bit of slack, we'd like to move to the same model we use for the Artifact Registries, where each Cloud Run region (we have 20 currently) maps directly to one backend from each provider. That also allows us to make this a reusable project for the rest of the community while keeping the complexity low, and it should be a pretty reasonable approximation: if you're hitting this particular GCP region, we know which Amazon region is close to it. I also want to add that this diagram explains the logic behind the endpoint, because we decide where to redirect the traffic. Based on that, we managed to establish a map from IP addresses to S3 buckets, which means egress costs on S3 are going to be low, because by definition, if your EC2 instance is pulling from an S3 bucket in the same region, it's free for us. So we're trying to minimize the cost by optimizing the IP routing. That's why we might not have that problem even if we use different backends: we want to make sure we minimize costs within each specific cloud provider.
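As a rough illustration of that routing logic (not the actual registry.k8s.io implementation, which lives in its own repo), here is a hedged Go sketch. It loads Amazon's published ip-ranges.json, and for layer (blob) requests arriving from an AWS address it redirects to a same-region mirror bucket; the bucket naming scheme and the upstream host are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net"
	"net/http"
	"strings"
)

// awsRanges matches the shape of the published AWS IP range data at
// https://ip-ranges.amazonaws.com/ip-ranges.json.
type awsRanges struct {
	Prefixes []struct {
		IPPrefix string `json:"ip_prefix"`
		Region   string `json:"region"`
	} `json:"prefixes"`
}

type cidrRegion struct {
	cidr   *net.IPNet
	region string
}

func loadAWSRanges() ([]cidrRegion, error) {
	resp, err := http.Get("https://ip-ranges.amazonaws.com/ip-ranges.json")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var raw awsRanges
	if err := json.NewDecoder(resp.Body).Decode(&raw); err != nil {
		return nil, err
	}
	var out []cidrRegion
	for _, p := range raw.Prefixes {
		_, cidr, err := net.ParseCIDR(p.IPPrefix)
		if err != nil {
			continue
		}
		out = append(out, cidrRegion{cidr: cidr, region: p.Region})
	}
	return out, nil
}

// awsRegionFor returns the AWS region that owns ip, if any.
func awsRegionFor(ip net.IP, ranges []cidrRegion) (string, bool) {
	for _, r := range ranges {
		if r.cidr.Contains(ip) {
			return r.region, true
		}
	}
	return "", false
}

func main() {
	ranges, err := loadAWSRanges()
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/v2/", func(w http.ResponseWriter, r *http.Request) {
		host, _, _ := net.SplitHostPort(r.RemoteAddr)
		ip := net.ParseIP(host)
		// Only blob requests carry the bulk of the bandwidth; those coming
		// from AWS clients are redirected to a same-region mirror bucket.
		if region, ok := awsRegionFor(ip, ranges); ok && strings.Contains(r.URL.Path, "/blobs/") {
			// Hypothetical per-region bucket naming scheme.
			target := fmt.Sprintf("https://k8s-mirror-%s.s3.%s.amazonaws.com%s",
				region, region, r.URL.Path)
			http.Redirect(w, r, target, http.StatusTemporaryRedirect)
			return
		}
		// Everything else falls through to the primary registry backend;
		// the upstream host here is a placeholder.
		http.Redirect(w, r, "https://primary-registry.example.com"+r.URL.Path,
			http.StatusTemporaryRedirect)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```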
The one constraint, and we're open about this, is that we rely on object storage services. If someone wanted us to, say, run on bare-metal instances, we'd have a different conversation about how we do the IP routing. Yeah.

Is the bulk of our cost as a project in the image registry, or should other SIGs that own tests start looking into reducing things like load balancer costs and computation costs? SIG Network, for example, does a lot of load balancer tests, and I have no idea how much that actually costs us.

That's a good question. You can look at the dashboard if you want the details, but I can tell you at a very high level. For the things that are billed to the community today, the things we have migrated to full community control, container image hosting is north of two thirds of our costs. We're over $2 million a year because it's all in GCP, serving every Kubernetes user, and over half of our requests were from Amazon, so it's a lot of egress; thankfully they're stepping up on that. There's an additional complication, which is that we actually spend even more on internally billed things that are still running inside Google's own projects, and again, around $3 million of that, actually a bit north, is downloads. There's only about another million that's CI. We believe the things running internally, and CI, have room to optimize a bit, but the budgeting is different, and we've been more focused on the sustainability of the downloads at the moment. When we go to pivot those resources, when we have some headroom again to start migrating more things, there will be some room to improve our CI costs. I believe we also have some CI running that people aren't monitoring, but it's just not a dominant factor in our costs. The dominant factor, by far, is serving the traffic to end users. Okay, thank you.

One thing I want to add to that: I think the theme for next year is going to be two things. Use the new registry endpoint, and, if you're a Kubernetes contributor, make sure your jobs run inside the community infrastructure. At some point that will help us do better capacity planning over the years for how we consume resources, because inside Google it's mostly unlimited: you don't have to set up resources, you put your job there and it runs magically because Google covers the bill. When it comes to community infrastructure, it's a different conversation, because now we have a budget, so we need to optimize resource consumption. Even if we only reduce the cost of one job by $10 per month, that's a win for us.
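To show what checking your own jobs might look like, here is a hedged Go sketch of the kind of inspection tooling mentioned during the Q&A below. It reads a Prow job config file and reports which jobs set a cluster field pointing at a community-owned build cluster; the cluster name k8s-infra-prow-build is an assumption based on what the community build cluster was called around this time, so verify it against the current config:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"gopkg.in/yaml.v3"
)

// job is a minimal view of a Prow job entry: we only care about its name
// and the cluster it is scheduled to.
type job struct {
	Name    string `yaml:"name"`
	Cluster string `yaml:"cluster"`
}

// config covers the three kinds of Prow jobs; presubmits and postsubmits
// are keyed by org/repo.
type config struct {
	Periodics   []job            `yaml:"periodics"`
	Presubmits  map[string][]job `yaml:"presubmits"`
	Postsubmits map[string][]job `yaml:"postsubmits"`
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: checkjobs <prow-jobs.yaml>")
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	var c config
	if err := yaml.Unmarshal(data, &c); err != nil {
		log.Fatal(err)
	}
	jobs := append([]job{}, c.Periodics...)
	for _, js := range c.Presubmits {
		jobs = append(jobs, js...)
	}
	for _, js := range c.Postsubmits {
		jobs = append(jobs, js...)
	}
	for _, j := range jobs {
		// An empty cluster field means the Prow default, which at the time
		// of this talk still pointed at Google-owned infrastructure.
		if j.Cluster == "k8s-infra-prow-build" {
			fmt.Printf("OK    %s runs on community infra\n", j.Name)
		} else {
			fmt.Printf("CHECK %s runs on cluster %q\n", j.Name, j.Cluster)
		}
	}
}
```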
Another new thing, now that we have these Amazon resources, and this is something the community has to understand and that we'll need to communicate more widely, is that we'll need people to help us figure out how to start using those resources for something, including support from Amazon; they've said as much to some people. If you work at a cloud provider and you're providing these kinds of credits, it's much easier for whoever manages those programs to keep approving them if the credits are actually being used. So we're going to have a very interesting time over the next year figuring out how we can very quickly start taking advantage of more of these resources. For example, the binary downloads that are hosted entirely in Google today: we might consider just moving those to Amazon, because they're primarily pulled via dl.k8s.io, which is actually a small NGINX instance that already redirects you.

So in general, we also need to look around the project and ask: where are we directly pointing people at, say, a cloud storage bucket or some resource that we can't just move, and start getting in front of those. We've been focused on the obvious, very large ones, but there's a long tail of more things. For some of the smaller sub-projects that may not need this degree of infrastructure, we should also be looking at whether we can leverage other options; you could host your binaries on GitHub, for example. The kind project has actually been experimenting with this. Since we didn't have these expectations up front, and now that Docker Hub has two-factor authentication and so on, we host the images on Docker Hub and the binaries on GitHub. It's not the same scale as Kubernetes, but whenever a user uses that tool, it doesn't cost the project anything. In fact, the node images from kind contain a complete Kubernetes release that works offline, so we're not paying for any of that traffic. There are probably other places in the project where something like this makes sense.

One question here. A containerd maintainer? Do you want the mic? We've got to get you on mic. Michael Ward, thanks. I'm one of the containerd maintainers, and I'm also a maintainer for some of this infrastructure, and I can't figure it out. Is there any documentation, or a map of who owns what? We can go to Slack and ask, sure, but it usually breaks on a Friday afternoon.

Yeah, we're trying to come up with a document for next year that explains how to do this for the community, and maybe some tooling to inspect your job, and to establish some field in the spec of the Prow job to drive the migration.

One document for how the Prow jobs work? Yeah, we're working on that. And another one for who owns the actual images, like the COS images and things like that? We run the actual CI/CD stuff, like the node e2e jobs, right?

Most of these things are centralized in the k8s.io repo, kubernetes/k8s.io. For everything that's run in the community, there's Bash or Terraform or something in that repo that specifies how we bring these things up, and what the other resources are. Wherever possible, the entire Kubernetes project operates on GitHub, so there's somewhere in Git that has the spec for these things, and there are OWNERS for them. Maybe what's missing is a pointer to these places, like for the Prow jobs, right? Indeed.

Kubernetes is obsessive about GitOps and community infrastructure, to a fault. If you want a tweet from the contributor account that the community actually runs, not the main Kubernetes account but @K8sContributors, which you should all follow, you write a YAML file in a repo that contains your tweet; someone reviews it, it merges, and then there's a tweet. Everything is like this, wherever we can manage it. There are a few places where someone has to manually run a script or something, because we just haven't gotten it running in automation yet. This infrastructure, we terraform apply.
We've had some issues where, if we had just put it in automation, it didn't quite reconcile correctly and we needed to manually fix things, but we're always pushing toward a state where everything is in Git, there's a history for everything, and there are owners documented, at least through the OWNERS file system that we have. Most of that is going to be in k8s.io. Sometimes the implementations live somewhere else, but anywhere we're spinning up cloud infrastructure, there's a hierarchy of directories in the...

That's the part I usually get confused on, right? I understand I can scrape the YAMLs and figure it out myself, but then who owns these things is just difficult.

Yeah, we're also working on it. We hear you, it gets hard. Okay. Any other questions? Thanks, guys.

Where do you need the most help? "All of it" is not a valid answer.

I think we need subject matter experts in specific cloud providers, like AWS. Because, like you said, we got an announcement this morning: we're getting $3 million, and we need to make use of that next year. People who can help us maximize usage of a cloud provider would be really helpful, because then we can take a multi-cloud approach to the CI infrastructure and also move some artifacts to AWS to offload the cost. So subject matter experts are the first thing we want. The other is help with maximizing automation of the infrastructure, because we do a lot of things manually: we run terraform apply by hand, we deploy things by hand, and we want to stop that. I know there's Terraform Cloud and things like that. So maybe those are the two subjects: automation, and the multi-vendor cloud provider approach. That's where we need the help. We've even talked to Fastly about how we could leverage Fastly to minimize the cost; we have some money from Fastly, but not enough to build something that would optimize the cost down to the point where we pay only a couple of thousand per month.

I think we can take one last question. Yeah, the last question. We do expect, I believe it's still being clarified, that Amazon will send us some subject matter experts, and Google has provided some people, but we can always use more. The biggest thing, from my personal point of view, is that we just need people who will consistently show up and become trusted enough that we can say: yeah, we'll let you run the thing that everyone pulls container images through, and that's fine. We operate quite a bit on people volunteering, even when they work full time, and we just need people who can show up, pick anything, and become a trusted part of this process.

Okay, last question. Yeah, thank you very much for the talk. A question about this particular solution for the container image registry, and maybe I just missed it: any plans in the future to release it as a separate project, so other communities can use the same approach?

It's in its own repo. It is expressly not reusable at the moment, just because we needed to ship very quickly, and, like I said, since we didn't have as many credits, we're doing the direct mapping to each backend in the code. But going forward, we're moving toward a reusable thing. We've had some conversations with some sister projects.
There are a couple of projects that already reuse a lot of our infrastructure, Istio and Knative, and we've started talking to them about their interest in these sorts of things as well. Nice, thank you. And we'd be happy to have the help. If you visit registry.k8s.io you can see the backend code, and if you visit the k8s.io repo under the Kubernetes GitHub organization you can see the cloud configuration. Thank you. Okay, I think that's it. Thank you, everyone. Have a good day.