Hello, everyone. Thanks for coming at the end of KubeCon. It's been a long KubeCon for me, and I'm sure it's been a long KubeCon for you all, so bear with me a little bit. I'm losing my voice here, but I should be able to get through this. Welcome to the SIG K8s Infra update. We'll be talking about what's been going on in SIG K8s Infra and where we're going.

My name is Benjamin Elder. I'm a software engineer at Google. This is Mahamed Ali, an SRE at Cisco, and we are leads in the SIG. Mahamed is a SIG K8s Infra tech lead and a Knative maintainer. Is it Knative now? It's Knative. And I'm on the Kubernetes Steering Committee, I'm a SIG K8s Infra and SIG Testing tech lead, and a maintainer of kind.

So what is SIG K8s Infra? SIG K8s Infra manages the Kubernetes project's own infrastructure. That includes registry.k8s.io, which serves the container images; Prow, which is our CI system (we'll talk more about that later); dl.k8s.io, which serves binaries; and all of the expansive CI/CD infrastructure that we use for running E2E tests, and all of the things that make sure the Kubernetes project is reliable.

So what are our priorities this year? We're migrating everything left that is running in some company's account over to accounts in the Kubernetes project. We've been working on this for a long time, since the beginning of the SIG, and we've hit one bump after another. We're confident that this year we'll finish the last lingering pieces: by the end of the year, everything will be in accounts owned by the CNCF and SIG K8s Infra together, with continuity through the project, where anybody can show up and participate. We're going to adopt Okta SSO to make it easier to onboard people into participating. We've had access-management infrastructure for GCP, because that was the first vendor we had, but we're going to do multi-vendor access management so that when contributors want to contribute, we can get them onboarded quickly. We're going to improve our observability stack so people can see what's going on with the infra. We're going to do some cost optimization, and we're going to finish migrating from GCR to Artifact Registry ahead of that deprecation.

I want to take a moment to thank all the vendors who've been sponsoring us: Google Cloud and Amazon, who are each providing $3 million a year in cloud credits; Fastly, who is providing a huge amount of bandwidth for us; Equinix and DigitalOcean, who are providing resources that we use for testing as well; and more soon, which you'll hear about later.

So I said we'd complete the migration into the community. That includes the CI. There's been a huge effort to move the CI, but it's not quite all migrated yet. So, for real this time, we're going to finish this year, and we're going to move the test triage pipeline that goes with the CI, and the release bucket; I'm going to talk some more about each of these.

So this is our CI. It happened by accident: we had a Jenkins, and then we added some extensions, and then we thought, why don't we run things on Kubernetes, since we're just running containers anyhow, and the next thing you know, SIG Testing has a CI tool. SIG K8s Infra provides the resources used to run that CI tool, which lives at prow.k8s.io and in the kubernetes/test-infra repo. As you can see, we run a lot. Up at the top here, that is a flame graph of the time to complete and the success-or-failure state of the jobs over the past day. We run tens of thousands of CI jobs daily.
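Since I mentioned we've been migrating the CI into community accounts: at the job level, that is mostly about which build cluster each Prow job targets, since a job's YAML config can declare a `cluster` field, and jobs that don't set one land on the default cluster. Here's a hypothetical audit helper, a sketch rather than actual SIG tooling, and the job-detection heuristic is invented for illustration:

```python
#!/usr/bin/env python3
"""Hypothetical helper: list Prow jobs that still target the legacy
default build cluster. A sketch only; the real migration tooling in
kubernetes/test-infra is more involved."""
import sys
import yaml  # PyYAML

def iter_jobs(node):
    """Recursively yield every dict that looks like a Prow job definition.
    (Heuristic: job entries have a name plus a pod spec or agent.)"""
    if isinstance(node, dict):
        if "name" in node and ("spec" in node or "agent" in node):
            yield node
        for value in node.values():
            yield from iter_jobs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_jobs(item)

def main(paths):
    for path in paths:
        with open(path) as f:
            for doc in yaml.safe_load_all(f):
                for job in iter_jobs(doc):
                    # Jobs with no explicit cluster fall back to "default",
                    # which here means the legacy google.com cluster.
                    if job.get("cluster", "default") == "default":
                        print(f"{path}: {job['name']} still on default cluster")

if __name__ == "__main__":
    main(sys.argv[1:])
```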
So we put out an announcement earlier this year: action required. The default cluster that we schedule workloads to is a cluster inside a google.com GCP project, as in Google's own GCP accounts. We still have things tangled up in that. We have been migrating them to run on Kubernetes clusters run by the Kubernetes project, and we are going to finish that by August: all jobs must be migrated or they will be removed. We've made a huge effort, and many of them are moved already. When that's done, we'll be able to look at migrating the control plane, and we're starting to plan that out. A shout-out to Ricky. Are you here in the audience? No, Ricky couldn't make it today. Ricky has been putting in a massive amount of effort to coordinate migrating CI jobs to the community infrastructure, and we're getting really close.

So we also have this test triage pipeline. Basically, we collect all the logs from our tests, we extract all the errors, and we put them into BigQuery; that BigQuery pipeline, Kettle, is the first part. On top of that, we have a tool brilliantly named Triage. After doing some normalization on the logs, it does k-nearest-neighbors clustering to find the common failure modes happening across tests, so we can track that something has become flaky in Kubernetes, that something has started to break. If we just looked at one job, we might not notice, but we'll see trends where some failure mode has started to increase. This tool is used by some of our most prolific contributors to find where Kubernetes has regressed and to fix those things. This is still running inside Google. It's a pipeline that was pieced together over the years and meant to be temporary, so of course we've been using it for about a decade now, and we'll need to finish migrating it. Since it runs inside Google, and it's the only known instance (as opposed to our CI, which other projects also use), we'll be looking at how we move it out as part of migrating the CI.

The release bucket. So we previously announced that dl.k8s.io is powered by Fastly. That's true, and we're very thankful for it, but it's still backed by the kubernetes-release GCS bucket, which is in the google-containers google.com GCP project. A little bit of Kubernetes trivia: google-containers was actually the placeholder name for Kubernetes before it was Kubernetes. That was the IRC channel on Freenode, and it's the GCP project where literally all of the infrastructure was. We're still depending on that, and there are a few other miscellaneous things in there, so we've been working to flip that over. The other thing that has happened here is that, because we've been using this particular GCS bucket for years, there are a whole lot of places out there just using the bucket directly, because it's public-read behind the redirect. So this time we're going to be publishing to a bucket shielded by the CDN. It will not be public-read, so we can guarantee we're not getting ridiculous amounts of inefficient traffic: the bucket is in a single region, and when you get people fetching from the other side of the planet, it's really expensive, and we don't want that. So we've been working on making sure we can smoothly flip this over, without busting the cache, to a bucket that's controlled by the community, and we've been partnering with Fastly to plan that out. It's not too complicated, but it's one of those things where people need enough time to finish the work.
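One aside before the handoff: to make the Triage idea from a moment ago a bit more concrete, here is a minimal sketch of why normalizing log lines makes common failure modes group together. This is not the real tool's code; the real Triage does fuzzier nearest-neighbor clustering over the normalized text, and all the sample log lines below are invented.

```python
"""Minimal sketch of the normalize-then-cluster idea behind Triage.
The real tool (in kubernetes/test-infra) is far more sophisticated."""
import re
from collections import defaultdict

def normalize(error: str) -> str:
    """Strip run-specific details so similar failures compare equal."""
    error = re.sub(r"0x[0-9a-f]+", "<ADDR>", error)  # hex addresses
    error = re.sub(r"\S*\d\S*", "<ID>", error)       # tokens with digits: pod hashes, IPs, durations
    return error

def cluster(errors: list[str]) -> dict[str, list[str]]:
    """Group errors by normalized form; a stand-in for the real
    nearest-neighbor clustering."""
    clusters = defaultdict(list)
    for e in errors:
        clusters[normalize(e)].append(e)
    return clusters

if __name__ == "__main__":
    logs = [
        "timed out waiting for pod nginx-7d4f8 on node-3 after 300s",
        "timed out waiting for pod nginx-9c2a1 on node-7 after 300s",
        "connection refused dialing 10.0.0.12:6443",
    ]
    # The two timeout lines normalize identically, so they form one
    # cluster of size 2: the kind of trend Triage surfaces across jobs.
    for key, members in cluster(logs).items():
        print(f"{len(members)}x  {key}")
```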
So now I'm going to hand it off to Mahamed to talk about some of the other things we're doing this year.

Hi there, I'm Mahamed. So, as Ben said, I'm a SIG K8s Infra lead, and I also work at Cisco as an SRE.

So, cost optimizations. Towards the end of last year, as you might have heard, we kind of ran out of money and had to scramble for credits. So we did some cost optimization last year, and we're planning on doing more this year. For example, we use Google Cloud, which offers flexible committed use discounts if you commit to spending X amount of dollars a year. We have a base amount of CI that we run every day, so we've bought a commitment for that, and we're saving some money there. We want to explore the same thing for AWS: there we also have a fixed base amount of CI every day, so we're seeing what we can do with reserved instances and so on. We also look at our costs every month and see what we can optimize, in case we're doing something inefficient that needs to change.

etcd. At the end of last year, etcd joined Kubernetes as SIG etcd. etcd is very critical to Kubernetes, but they also run some infrastructure that we're trying to take over and manage on their behalf. One of the things etcd is looking for that we can help with as the Kubernetes project, for example, is better visualization of their tests. They're leveraging TestGrid, and they've got more CI that they're planning and running, so we're going to help them out with that.

Let's see. Okta SSO. As you saw in the diagram earlier, we've got access to a lot of services from different vendors, including some that are not on the screen. These are enterprise products that use the kind of enterprise identity we're all familiar with at work. Okta was kind enough to give us access to their IdP, so we can onboard and offboard contributors to these services very easily, with minimal work. We've deployed this, and there are a couple of services accessible through it already. I need to add additional services and then start rolling this out to the wider Kubernetes community. There are a couple of systems where I'm planning to work with the stakeholders to enable that.

Our monitoring stack. We've got a couple of Grafana instances monitoring different things. We have a really old one tracking Prow that still lives inside Google, so we need to fix that. More recently, though, if you were at the contributor summit, we showcased new Grafana instances that visualize how our jobs are performing. I've got some screenshots. Those are quite important for us as we look to right-size jobs so they're efficient: you shouldn't be requesting 10 cores when you only need two. So there are a couple of pictures here. Right there, at that URL, you can see the jobs scheduled to run for a particular repository, how often they run, and how many run concurrently. And over here, you can see the CPU and memory usage of those jobs, so if a job isn't quite sized right, we can go and amend it. This information has also been very helpful for the CI migration we've been working on, because people are moving jobs, and many of these jobs initially didn't even set resource requests and limits. That was a fun challenge to work through.

Let me see. One, two, three. Oh, and here's the legacy instance. If you look at the Grafana UI, it's a bit old, but it tracks the usage of Boskos. Boskos is software that we wrote that lets CI jobs lease Google Cloud projects: you lease a project, run some CI in there, and then return it; we clean it up, and then another job can borrow it.
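To illustrate that lease-and-clean lifecycle, here is a toy, in-memory version. The real Boskos is a Go service (kubernetes-sigs/boskos) with its own HTTP API and janitor processes; every name in this sketch is invented.

```python
"""Toy, in-memory illustration of the Boskos lease-and-clean lifecycle."""
from collections import deque

class ProjectPool:
    def __init__(self, project_names):
        self.free = deque(project_names)  # cleaned, ready to lease
        self.leased = {}                  # project -> owner (CI job)
        self.dirty = deque()              # returned, awaiting cleanup

    def acquire(self, owner: str) -> str:
        """A CI job leases a clean project."""
        if not self.free:
            raise RuntimeError("no clean projects available; job must wait")
        project = self.free.popleft()
        self.leased[project] = owner
        return project

    def release(self, project: str) -> None:
        """The job is done; the project stays dirty until a janitor cleans it."""
        del self.leased[project]
        self.dirty.append(project)

    def janitor_pass(self) -> None:
        """Clean up leftover resources, then return projects to the free pool."""
        while self.dirty:
            project = self.dirty.popleft()
            # (real cleanup would delete VMs, firewalls, buckets, ...)
            self.free.append(project)

pool = ProjectPool(["gce-project-01", "gce-project-02"])
p = pool.acquire(owner="pull-kubernetes-e2e-gce")
pool.release(p)
pool.janitor_pass()  # p is clean again and can be leased by another job
```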
Back to monitoring: we have some improvements planned. Those are three different Grafana instances, so we want to unify them and make sure the stack is up to date and patched all the time. We also need to gather metrics from some clusters that we aren't monitoring today. And the other thing is that we need to make those metrics persistent for a few months, so they survive cluster restarts and the like, and we can see job usage and cluster health over a period of three to six months.

GCR to Artifact Registry migration. If you're a Google Cloud customer, you might have heard that GCR (Container Registry) is deprecated. That happened a few years ago, but now the deprecation is for real, as in they're going to get rid of it in a year's time, so we need to finish moving to Artifact Registry. In 2023, we adopted Artifact Registry for our production registry, so if you're pulling images from registry.k8s.io, that's what serves them today. However, there's one remaining piece of the puzzle: when we build container images, we stage them somewhere first, and then we copy them over to production once we're ready to release a specific build. That staging still happens in GCR, and it needs to be fixed within 12 months. Ideally we want to do this by the end of the year, but it's real engineering work, so we need to work out how to do it.

So, as you heard this morning, Oracle has given us some credits; that was announced at last KubeCon in Chicago, and we're trying to get that tenancy set up. Primarily, we want to do some arm64 testing. We have a lot of jobs that don't really care about architecture: for example, we check out a repository and run unit tests, and those could run on any architecture. And arm64, as you might have heard, can be cheaper than amd64 for some workloads, so we're going to try that. We also have some jobs that actually want to run E2E tests on ARM hardware; etcd is one of them, so we need to enable that functionality.
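To sketch what that staging-to-production promotion step looks like in spirit: the real promotion in Kubernetes is driven by the image promoter tooling with reviewed manifests, but at its core it is copying images between registries. The registry names and script below are made up for illustration; it assumes the `crane` CLI from go-containerregistry is on your PATH.

```python
"""Hypothetical sketch of promoting staged images to production.
Registry paths are invented; the real process uses the Kubernetes
image promoter, driven by reviewed, digest-pinned manifests."""
import subprocess

STAGING = "us-central1-docker.pkg.dev/example-staging/images"  # assumption
PRODUCTION = "us-central1-docker.pkg.dev/example-prod/images"  # assumption

def promote(image: str, tag: str) -> None:
    """Copy one staged image to the production registry."""
    src = f"{STAGING}/{image}:{tag}"
    dst = f"{PRODUCTION}/{image}:{tag}"
    # `crane copy` copies the image (all platforms in a manifest list)
    # registry-to-registry, without a local pull/push round trip.
    subprocess.run(["crane", "copy", src, dst], check=True)

if __name__ == "__main__":
    promote("kube-apiserver", "v1.31.0")  # placeholder image and tag
```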
So yeah, we have a lot of work to do. There are many things on fire, and we need more contributors to turn up. I'm very interested in hearing from SREs like myself who work elsewhere and have worked on fun challenges.

Let me see. So here's where you can find us. If you're not in the Kubernetes Slack, you can go to slack.k8s.io to join; we're in the #sig-k8s-infra channel. We also have a GitHub repository where you can see our charter and the projects and systems we're responsible for. And we have a repository (I forgot to add it to the slide) called k8s.io in the Kubernetes org that you can come take a look at; there are a hundred-odd issues there that we need some help with. We also meet every other week, on Wednesdays at 9 p.m. London time, 1 p.m. Pacific time, 10 p.m. Central European time. And yeah, thank you. Do you have any questions? Someone's coming to the mic.

No question, I just said keep up the good work.

Yeah, thank you so much.

Hello. Thank you for the excellent presentation. You mentioned that you have a backlog of issues that you would love to have people work on. Can you elaborate a little more on what you think are the areas, generally, where it would be great to have more attention, or people from which skill sets or interest areas?

It's not just one. Okay, so we have things where we need people with some expertise in actually running cloud infrastructure. And we have engineering challenges where we have some custom software and maybe need to make some patches so it's easier to move around. We also just have a bunch of places in our CI system where we have job configurations and maybe no expert owner on hand, but you can take a look, and we have people who will help you figure out the patterns to migrate them; there's some very mechanical work to do. And then it's just following up: did it break? Okay, it broke, we need to roll that back, reach out to the owners, and sort it out, but we're taking a first pass at whether we can just mechanically move things over. Those tasks don't need any particular special skills, just time. There are a lot of things like that.

I'll give one simple example. Kubernetes came out almost 10 years ago, and the k8s.io repository has been around for some six years. We have some infrastructure that was deployed using bash scripts, and it's really awkward to work with that today. At the same time, though, we have infrastructure that we deployed in the last few years using Terraform. I need someone to come along, rewrite that bash in Terraform, and merge it with the rest of the modern Terraform code that we have. That's just one simple example, and there are a bunch of others that you'll see. I think we have an umbrella issue somewhere tracking all of our infrastructure technical debt. So yeah, if you're like me and you do SRE and DevOps work, I'm very interested in hearing from you. As mentioned earlier, the k8s.io repo has a long backlog, and there's a pretty wide variety.

Well then, thanks for coming. Safe travels. Thank you so much. Thank you.