Hello, hello, KubeCon, and welcome to the great k8s.gcr.io vanity domain flip. My name is Stephen Augustus. I am a senior open source engineer at VMware. I'm also one of the SIG Release co-chairs and a Kubernetes release manager, so we're generally responsible for maintaining all of the release infrastructure as well as pushing all of the Kubernetes releases that you consume day to day.

And hello, my name is Linus Arver. I am a software engineer at Google. I am the main author of the container image promoter, which we'll be getting into more later today, and I also contribute to the working group for K8s Infra. We do lots of different things there; the image promoter was probably one of the biggest projects to happen recently in this group.

So a quick overview of what we're going to be covering today. We're going to be talking about the historical context and the rationale for image promotion and the vanity domain flip overall; the infrastructure changes that were required, walking you through how the container image promoter works as well as how it's tested; and then finally some lessons that we learned along the way.

So what is k8s.gcr.io? k8s.gcr.io is a vanity domain, or a human-friendly name, that essentially points to a folder within a Google Container Registry. In our case, in the before times, that was gcr.io/google-containers, and today it points to k8s-artifacts-prod. So why a vanity domain name? It allows us to make infrastructure changes behind the scenes with minimal, and hopefully no, downtime for operations that depend on these images. That flip is something that hopefully you didn't notice; it happened in July, and that's what we're going to talk all about today.

So here is a highly scientific diagram of how we've done the container image flip. If you imagine changing a DNS record, you kind of stand up the old thing, stand up the new thing, and slowly move traffic away from the old thing. That culminated in essentially a flip of the backing registries for that vanity domain name. But is that it? There were some infrastructure improvements that we had to do that were absolutely not trivial. Fortunately, that gave us an opportunity to improve the way we do production images, the way we test these images, and the overall security posture of the project, including things like business continuity: knowing that we can create backups, that we can recover from those states, and that we're able to audit all of the changes that are happening in that system. So Linus is going to jump into some more details on the historical context from the Google side.

Thanks, Stephen. So yeah, the entire process probably took around two years. The initial idea of promoting images based on a configuration internally at Google, while we were still using the old google-containers folder, happened in 2018. Then between 2018 and 2020 we basically did a rewrite of the internal promoter for the open source community. It's basically the same idea, although it's implemented entirely differently. There's also auditing and backup involved, which were separate infrastructure pieces. And these changes basically led up to the flip, which happened in July.

So why did Google need an internal promoter to begin with? Stephen has kind of hinted at some of the reasons; it was basically to make it more secure for Google to promote these things.
Basically, to give you a little backstory: at the beginning, before the internal promoter existed, we had roughly 60 or 70 Googlers who had production access, who had write access to production. That was considered not a good security practice, so there was basically a mandate saying we need to lock this down to very few people. Meanwhile, we didn't want to make these 60 or 70 people unable to push images on their own. So let's do something about it. We created an internal bot, basically, that does the pushing of images from staging to production for the Googlers. This also made things better in that there was less human error involved, and it made the changes, who pushed which image, auditable in the history, in the source code.

So this is just an illustration of what was happening circa 2018, before the internal promoter: people were manually copying images from staging to production, to google-containers at the time. And obviously, this was not a great idea. I won't spend too much time on this slide, but basically we needed a change.

And this is what we did internally. We basically had a bot with production access; it held the keys to production, so humans weren't allowed to individually make changes to production, and because of that, it made things more secure. The history of the changes to production was auditable, because all the changes to production were done in source code. This is like the configuration-as-code concept. And we also had presubmit checks for any new changes going into the promoter manifest, which is the green box you see in the illustration, just to guard against human error and to check that the images are okay and all this other good stuff.

So fast forward about a year, to late 2019, and basically we decided to do an open source version of the promoter. But we couldn't just copy-paste the code from Google's code base to GitHub to donate it. I mean, we could have technically, but it wouldn't have worked anyway, because we needed performance guarantees that really weren't there, because the scale is completely different. If you recall the slide from earlier, basically it's an image copy, but for the internal Google case we were talking about a handful, maybe a couple hundred images, whereas for the open source case we're talking about everything. The open source version tracks a lot more images, roughly 30,000 unique ones to be exact, so the manifest is basically that much bigger; there's a lot more state, or intent, that is tracked. To handle that, the promoter has to be fast: it takes roughly 30 seconds, or a little bit less than that, to read all of these images from GCR, which is our production registry, and to reference that against the manifest to do the promotion. So that's pretty cool. Golang makes that a bit easier because we can use concurrency; it's very easy to do in that language. I'll also be talking a little bit about the edge data structure, which we use for encapsulating the idea of promotions, to make it a bit easier to reason about, because when you're talking about 90,000 edges, or promotions, you basically need to simplify the problem a little bit to make debugging easier.
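To give a feel for the kind of concurrent read Linus mentions here, the following is a minimal sketch, assuming standard-library Go only. It is not the promoter's actual code: listDigests and readProduction are hypothetical placeholder names, and the registry call is stubbed out; only the fan-out pattern is the point.

```go
package promoter

import (
	"fmt"
	"sync"
)

// listDigests stands in for a call to the registry API that returns all
// image digests under one repository. It is a hypothetical placeholder,
// not a real client.
func listDigests(repo string) ([]string, error) {
	return []string{"sha256:..."}, nil
}

// readProduction fans out one goroutine per repository and collects the
// digests it finds, so that reading tens of thousands of images stays fast.
func readProduction(repos []string) (map[string][]string, error) {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		found = make(map[string][]string)
		errs  []error
	)
	sem := make(chan struct{}, 10) // cap the number of concurrent registry calls

	for _, repo := range repos {
		wg.Add(1)
		go func(repo string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it

			digests, err := listDigests(repo)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			found[repo] = digests
		}(repo)
	}
	wg.Wait()
	if len(errs) > 0 {
		return nil, fmt.Errorf("read failed for %d repos", len(errs))
	}
	return found, nil
}
```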
So as for the performance optimization, these are the four steps, basically. When the promoter first starts up, it reads in the images in the promoter manifests. That's the stuff checked into GitHub by humans, currently; it just describes the intent of what we need to promote. That's the blue circle there. It then reads all the stuff already in production; that's the part that takes roughly 30 seconds. Then it does a delta: it removes all the stuff that's already been promoted, that's the purple part, and we only promote what's left, the remaining blue part. For the eagle-eyed people in the audience, the red stuff here, you might ask, what's that? What do we do there? Those are images that are not tracked yet in the manifests. That used to be a pretty big chunk, but we added what's called a legacy or backfill manifest to track all of those as well, so in actuality this red part is pretty much non-existent today.

So, the edges. This is a pretty simple concept. Basically, we treat each edge as a timeless encapsulation of an image copy. The edge has three parts: we have the staging name and the production name, which are vertices, and then we have a connection between the vertices, and that connection is populated by the digest. It sounds pretty simple, but it does make it easier to reason about. I'll give you an example. Let's say somebody wants to promote from staging to production for this particular image called foo. You might notice that the staging name, staging/foo, is a little bit different from the production name, which is the production path to foo; so the destination endpoint is slightly different. You might also notice that the tag 1.0 in production is not the same in staging. This is kind of by design: we don't care about the staging tag, because we only care about images by their content. Every image has a digest, and the digest is unique because it's a secure hash. So it really doesn't matter what the image is named in staging, as long as you can find the hash, the blob.

So these are the cases that the edge helps detect. We check against overwrites, where an overwrite means putting a different image into the same production name, or endpoint. That's the first example on the bottom left there. We don't want to promote two different things to the same endpoint; that's basically a disaster, right? You don't want one totally different image to somehow magically replace another one. That would be really, really bad, so that's a definite no. However, we don't mind having multiple copies of the same thing promoting to the same endpoint; that's the bottom right. If you have two different staging projects, let's say R and S, and they both have the exact same image and both want to promote to the same exact endpoint in production, that's fine. In reality we actually do filter these out as well, to reduce the performance cost, although that is negligible; strictly speaking, it's not a bad thing. And the middle picture there is the normal case, where you have different images from different staging areas going into different production endpoints.

So this is how it really works as an overview. It gets all the promoter manifests, then creates these edges (well, after it reads all the data from production). Once we have these edges, we check for illegal edges, et cetera. That's the simplification step, where we go from the 90,000 possible edges and reduce it down to just a handful, maybe the 10 or 20 that we can promote in one step, or rather one pull request from GitHub. And then we just actuate each promotion. That's also done in parallel as well, because why not?
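To make the edge idea, the overwrite check, and the delta step concrete, here is a minimal, hypothetical sketch in Go. The type and field names (ImageRef, Edge, checkOverwrites, delta) are illustrative simplifications based on the description above, not the promoter's actual API.

```go
package promoter

import "fmt"

// ImageRef is one vertex of an edge: a registry path, an image name, and a tag.
type ImageRef struct {
	Registry string // e.g. "gcr.io/my-staging-project" (illustrative)
	Name     string
	Tag      string
}

// Edge encapsulates a single intended image copy: the staging vertex,
// the production vertex, and the digest that identifies the content.
type Edge struct {
	Src    ImageRef
	Dst    ImageRef
	Digest string // e.g. "sha256:abc..."
}

// destination is the fully qualified production endpoint of an edge.
func (e Edge) destination() string {
	return fmt.Sprintf("%s/%s:%s", e.Dst.Registry, e.Dst.Name, e.Dst.Tag)
}

// checkOverwrites rejects edge sets where two different digests target
// the same production endpoint; identical digests aimed at the same
// endpoint are harmless duplicates and are simply collapsed.
func checkOverwrites(edges []Edge) ([]Edge, error) {
	seen := map[string]Edge{} // destination -> first edge claiming it
	out := []Edge{}
	for _, e := range edges {
		prev, ok := seen[e.destination()]
		if ok {
			if prev.Digest != e.Digest {
				return nil, fmt.Errorf("overwrite of %s: %s vs %s",
					e.destination(), prev.Digest, e.Digest)
			}
			continue // same content, same endpoint: drop the duplicate
		}
		seen[e.destination()] = e
		out = append(out, e)
	}
	return out, nil
}

// delta keeps only the edges whose destination is not already populated
// in production, so the promoter copies just what is missing.
func delta(edges []Edge, inProd map[string]bool) []Edge {
	todo := []Edge{}
	for _, e := range edges {
		if !inProd[e.destination()] {
			todo = append(todo, e)
		}
	}
	return todo
}
```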
There are also two other pieces that I alluded to earlier, which are auditing and backup, so I'll cover those briefly.

For auditing, this is actually an interesting slide; there are a lot of different pieces involved here, and we use a lot of different cloud components. Before I get into how it works: the point of the auditor is to detect any changes in production that happen outside of the normal promotion process. We can have a team of people using the promoter day to day, that's fine, but what if somebody has access to production and does something on their own? It could be a mistake, it could be a hacker, anything; any change that happens in production, we want to know about, right? So in this example here, the auditor is designed to detect any change. The production registry, where all our images are stored, lives in GCP, in GCR. GCR has a feature where any change that happens in GCR can be surfaced via Pub/Sub: anytime an image gets pushed there, anytime it gets deleted, or anytime a tag is created, all of those things are individual events that Pub/Sub generates if you listen for them. That's what's happening in step one. So in this example there's a bad image, that's the red Docker image there. Because the auditor is listening to Pub/Sub, it spins up (this is in Cloud Run), and as soon as it spins up it pulls the promoter manifests from GitHub. It says, okay, I've got these promoter manifests here, that's the official state, everything in the master branch; does this Docker image that I see match something in the manifests? If not, then it alerts Error Reporting, which is another cloud component. If it does match, then nothing happens. Error Reporting then gets fed into Slack; there's a Slack channel for this, but I kind of ran out of room and didn't want to add too many things to this slide. So that's how that works.

And for backups: we also have backups of production. This is pretty simple. We run it every 12 hours, and there's a full copy of production in a backup registry. I guess there are a couple of things to note here. The main thing is about quota constraints: the initial implementation of the backup was actually a bit naive. We tried to do a full snapshot of all 30,000 images every time, and that basically ate up all of the quota for GCR reads. The GCR API only allows you to make a certain number of API calls per hour and per day; there are different limits. So when we first did our version of this, the backup job was eating up all of the quota, and we actually brought down the Prow jobs and other stuff that was running, so sorry about that, Prow team. But we've since fixed it; we just do incremental backups today. I guess the other thing to note here is that production only grows; we never delete images, because deleting images in production would be a very bad thing. Kubernetes, as you know, is used worldwide, it's used everywhere, so you never know who's using which image at which time. For that reason, we always add; we never subtract or change or modify. That kind of makes the backup easy, because production only ever gets more stuff and only grows, so every time you do a snapshot, you really only need to snapshot the new images. You don't need to think about deletions or changes to existing data; all of that is unnecessary. So this is pretty simple.
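Here is a rough sketch of the auditor's shape, assuming a Cloud Run service that receives GCR change notifications through a Pub/Sub push subscription. The payload fields and the manifestAllows helper are simplified placeholders I am assuming for illustration, not the real auditor's code or the exact notification schema.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// pushEnvelope is the JSON body a Pub/Sub push subscription POSTs to us.
type pushEnvelope struct {
	Message struct {
		Data []byte `json:"data"` // base64 payload, decoded by encoding/json
	} `json:"message"`
}

// gcrEvent is a simplified view of a registry change notification.
type gcrEvent struct {
	Action string `json:"action"` // e.g. "INSERT"
	Digest string `json:"digest"` // e.g. "gcr.io/.../img@sha256:..."
}

// manifestAllows stands in for "re-read the promoter manifests from the
// master branch on GitHub and check whether this digest is declared".
func manifestAllows(digest string) bool { return false }

func handle(w http.ResponseWriter, r *http.Request) {
	var env pushEnvelope
	if err := json.NewDecoder(r.Body).Decode(&env); err != nil {
		http.Error(w, "bad envelope", http.StatusBadRequest)
		return
	}
	var ev gcrEvent
	if err := json.Unmarshal(env.Message.Data, &ev); err != nil {
		http.Error(w, "bad event", http.StatusBadRequest)
		return
	}
	if !manifestAllows(ev.Digest) {
		// In the real system this would surface to Error Reporting
		// (and from there to Slack); here we just log it.
		log.Printf("ALERT: untracked change in production: %s %s",
			ev.Action, ev.Digest)
	}
	w.WriteHeader(http.StatusOK) // ack so Pub/Sub does not redeliver
}

func main() {
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```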
So that's the overview of the three major pieces, I guess. And Stephen is going to talk about some of the tests that we do here.

Yeah, so, you know, I think the natural next step is to ask: how do we test all of this, right? And it's pretty simple. We test it as you would test any Go program: standard unit tests, no extra sauce, just using the standard Go testing framework. Now, what's a little special about what we do, and maybe it's less special if you like testing in prod, is that there is a custom end-to-end test framework which is built around being able to replicate the actions that we would do moving from staging to prod. What becomes a little trickier about these systems is that when you're operating a system whose whole idea is to handle promotion from an essentially non-prod bucket to a prod bucket, you have to test against endpoints that look very similar to prod and have the same sort of restrictions that we would place on a prod registry. So our testing happens in a kind of near-prod environment, where there are GCR endpoints set up, Pub/Sub, Cloud Run, Error Reporting, everything that Linus mentioned before, as well as a fully replicated backup stack. And here we see a pretty diagram of what that looks like. Again, it's very similar to what we saw in the previous slides, but for end-to-end testing instead: the promoter is in play, there's a staging and a near-prod bucket, as well as verifications against GitHub, Cloud Run, and Error Reporting to handle the auditing pieces, plus that production backup component, or rather that near-prod backup component.
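To give a flavor of what those standard unit tests might look like, here is a minimal, hypothetical table-driven test for the overwrite check sketched earlier; it is illustrative only and builds on the simplified Edge types above, not on the promoter's actual test suite.

```go
package promoter

import "testing"

// TestCheckOverwrites exercises the three edge cases from the slides:
// a clean promotion, a duplicate of identical content, and an attempted
// overwrite of one production endpoint with different content.
func TestCheckOverwrites(t *testing.T) {
	dst := ImageRef{Registry: "prod.example/registry", Name: "foo", Tag: "1.0"}

	cases := []struct {
		name    string
		edges   []Edge
		wantErr bool
	}{
		{
			name:  "single promotion is fine",
			edges: []Edge{{Dst: dst, Digest: "sha256:aaa"}},
		},
		{
			name: "same content twice is collapsed, not an error",
			edges: []Edge{
				{Dst: dst, Digest: "sha256:aaa"},
				{Dst: dst, Digest: "sha256:aaa"},
			},
		},
		{
			name: "different content to the same endpoint is rejected",
			edges: []Edge{
				{Dst: dst, Digest: "sha256:aaa"},
				{Dst: dst, Digest: "sha256:bbb"},
			},
			wantErr: true,
		},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			_, err := checkOverwrites(tc.edges)
			if (err != nil) != tc.wantErr {
				t.Fatalf("got err=%v, wantErr=%v", err, tc.wantErr)
			}
		})
	}
}
```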
So now I'll talk about the actual flip. We just talked about the infrastructure, all the different pieces that were necessary to make this happen, to make it more robust, more secure, all this good stuff. But what about the actual flip itself? The DNS change basically wasn't rocket science, but it has its own history, so I'll go briefly over what actually happened.

The very first attempt happened on April 1st, no joke. Unfortunately, this was rolled back, and it was on a Friday. It was a nice Friday afternoon; we had started on Monday, and then by Friday it was like, oh God, what is this weird signal that I see? And basically we had to roll it back. Long story short, there was a hard-coded configuration in our code base, and it basically resulted in an incident; that's where I'll leave it. But the good news is that because of that incident, we had a lot more eyes on this whole project. We had a lot more people come in, just from the Google side, and I think we also had more people interested in it after this attempt that we had to undo. More people in the community were like, hey, what's wrong? Anything we can do to help? So that was a cool little event, you could call it.

Then the second attempt happened in June. Unfortunately, it's like we're kind of cursed or something, right? Because reasons. This was a purely non-technical issue. Basically, if you recall, the new production backing store has a different name; it's called k8s-artifacts-prod. It's in a different project, and a different project means it's billed differently, basically a different credit card. The issue here really is that we kind of got carried away by the lightness of the change: the domain flip is a very simple pointer flip, that's very easy, but what we kind of forgot was that that little pointer serves millions and millions of API requests every day. It's something on the order of hundreds of millions per week or something like that; it's a huge amount of traffic. So what we realized at the last moment was, hey, if we flip this, can the new project, can that new owner, the community basically, pay for it? And if they don't have the credits or whatever set up beforehand to take all this traffic, what if it automatically gets shut down? Because that's how GCP works, or rather that's how most cloud providers, I imagine, would work, right? If you use Amazon and you've just spun up 30,000 VMs or EC2 instances and you don't have the money to pay for it, I'm sure they will shut you down. So in order to avoid that scenario, we rolled it back immediately; I think we spent a few hours on day one of what is actually a four-day rollout. So anyway, we kind of realized this a few hours in and just preemptively undid it. There was no technical issue there; we didn't hear about any issues.

And the third attempt finally happened on July 24th, or I think it wrapped up on the 24th, because I think that was a Friday. It took four days, it's a four-day rollout, so we had to wait. But, true to our intentions at the beginning, no systems noticed it. It was a nice under-the-radar change; that's how it's supposed to work, that's what the vanity domain name is for. So that was a nice moment, when we finally realized that it did work on the third attempt. So yeah, go ahead.

Yeah, so again, a summary of an incredibly simple process, wouldn't you say, Linus: from one registry to another. But it again begs the question, is that it? And no, it's not. For us to be successful in this endeavor, lots of work had to happen in the community, adjacent to the build-out of the container image promoter. Some of that work included developing tooling for staging projects. Staging projects are a fundamentally newer concept in the community; given the background that Linus has told you about, pushing artifacts to google-containers kind of involved finding the right person who had access to do it at any point in time. The release process is a little different, because the release process has the keys it needs to write into the google-containers registry. But for everyone else, for every repo across the multiple orgs, kubernetes, kubernetes-sigs, kubernetes-client, and so on and so forth, all of these orgs need a way of being able to produce images and have those promoted, within GitHub and within GCR, right? So, to our trusty friend Bash: a lot of clever Bash was written to enable us to generate new staging projects, as well as delegate IAM to the various component leads, the various owners of code, SIG chairs and technical leads as well as sub-project owners, to grant them access to write to these staging projects or to wire up automation to write to these staging projects.

The second component of that is: well, now that I have access to push to one of these projects, how can I do it in a safe way? How can I ensure that what I'm doing is not a step backwards, not doing something similar to what would have happened prior to the image promoter existing, right? And the way we solve that problem is a combination of Prow, which is the CI/CD solution for Kubernetes and several other projects in the ecosystem, as well as Google Cloud Build. So Google Cloud Build, and a cloudbuild.yaml file in your repo, maybe hooked into a make target or a bash script that you've written, is a common way of telling us, or telling GCB, how to build your image and how to push it to your staging repository. What we get out of this is an opportunity to, again, be able to audit some of these changes that are happening, right? These changes are no longer happening on a developer's laptop. They're happening as a result of a PR being approved and merged within a component area, and subsequently a Prow job kicking off a post-submit after that PR has merged, which in turn kicks off a GCB, or Google Cloud Build, job that eventually pushes the image. Finally, you bring the human into the process and have them generate a manifest for promotion, or an update to a manifest for promotion.
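To give a rough idea of what such a promoter manifest update encodes, here is a simplified, hypothetical sketch in Go of the kind of data a contributor's pull request adds: a staging source registry, a production destination, and the digests plus tags to promote. The field names and layout are illustrative assumptions, not the exact schema of the real manifests.

```go
package promoter

// Registry describes one endpoint named in a promoter manifest:
// either a staging source or a production destination.
type Registry struct {
	Name string `yaml:"name"` // e.g. "gcr.io/my-staging-project" (illustrative)
	Src  bool   `yaml:"src"`  // true for the staging source
}

// Image declares the intent to promote: for each image name, the
// digests that should exist in production and the tags to attach.
type Image struct {
	Name string              `yaml:"name"`
	DMap map[string][]string `yaml:"dmap"` // digest -> tags
}

// Manifest is one promoter manifest file checked into GitHub.
type Manifest struct {
	Registries []Registry `yaml:"registries"`
	Images     []Image    `yaml:"images"`
}

// exampleManifest shows roughly what a contributor's update expresses:
// "promote the image foo at this digest, and tag it v1.0.0 in production".
var exampleManifest = Manifest{
	Registries: []Registry{
		{Name: "gcr.io/my-staging-project", Src: true},
		{Name: "gcr.io/k8s-artifacts-prod/my-subproject"},
	},
	Images: []Image{
		{
			Name: "foo",
			DMap: map[string][]string{
				"sha256:aaaa": {"v1.0.0"},
			},
		},
	},
}
```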
There was also lots of bash script cleanup; bash is pretty popular, and our release process runs on about 5,000 lines of bash, which is changing every day. In addition to that, we had to make sure that the components we had already built out were able to support pushing into a new registry. Again, lots of hard-coded references to consider. So that was maybe a few quarters of work and testing across the release engineering sub-project to wire some of that stuff up.

So what's next, right? There's lots of exciting discussion around tooling and how we get all of this done, and it's always good to look into the future and get an idea of what we want to accomplish next with this tool. I think ultimately we want consolidation. We have a promo-bot tool that was also written for file promotion, right? Very similar concepts; you'll see some of the same concepts in file promotion that happen on the image promotion side, and they're similar enough that they should be consolidated. So we have started work on building one tool to rule them all for promotion, for file and image promotion. We're working on deduplicating the release engineering libraries: lots of great work has happened on the image promotion side, on the file promotion side, as well as on the elimination of bash scripts from the release engineering process, so we're bringing all of that knowledge together into common libraries for all of us to reuse. Now, Google has recently announced Artifact Registry, which is going to be the next generation of Google Container Registry, so we need to make sure that the tools we use support any new APIs and that we're prepared for any potential deprecation cycles that we need to consider, right? We want to make sure that if we need to do this again, we do it in a safe way, in a way that minimizes downtime for the community and for the consumers. And then image vulnerability scanning is coming soon; we have active work on that.
And finally, finding people who are interested in this kind of work and welcoming them into the community to do release engineering work, to do work around these promotion tools that we've built out over the last few years.

So I'll talk a little bit about the lessons learned here. It's a big project; what did we learn? Basically, infrastructure changes for legacy code are hard. I mean, I don't think the existing release engineering stuff is really "legacy code"; it works. But tweaking even just a small bit of it, even just little references and such, takes time, because you have to coordinate all of the pieces together, right? So that was really hard, but the rewards, I think, are worth it, because it's so much simpler now. Anybody who has an interest in contributing new images and such can just come in, make a PR, get a staging sub-project, and get their images into production. It's pretty straightforward.

I will also repeat a quote from Tim Hockin, a principal engineer for Kubernetes at Google. He said to me, as I was writing some of this stuff: if it is not tested, it is broken. You can actually find that in a GitHub discussion comment somewhere. That was really eye-opening when he said it, and really, all of the tests that we have today help bring some sanity into this kind of chaos of all these images flying around everywhere, basically every day.

And also, it takes a village, because none of this happened from just one person's efforts. It was the community, it was other Googlers, it was, behind the scenes, Google security helping out. All of these people, like SRE and GKE developers; there are so many people that I should actually name here, but I didn't have the time to add them to the slides. But yeah, it really does take a village. Sorry to all the people I didn't name; I'd forget names if I tried, so I won't. We'll mention them in the Q&A, right? We can talk about that on Slack with you all later.

So to wrap up, we wanted to give you an idea of how to get involved, right? Again, the container image promoter and the other artifact promotion tools are maintained both by SIG Release's release engineering sub-project as well as by the working group, WG K8s Infra. The promotion tooling can be found in kubernetes-sigs/k8s-container-image-promoter, and some of that tooling has already started to be migrated into what we affectionately call k/release, that is, kubernetes/release. And then there are some links on how to contact us: SIG Release, WG K8s Infra, the SIG Release repo, as well as where you can find all of the promoter manifests that we've been talking about today.

So thank you again for taking the time to hang out with us at KubeCon. It's always a thrilling journey, and it's been really exciting to work on this project with all of the community.