Hello, welcome to the SIG Auth deep dive. My name is Tim Allclair. I'm a SIG Auth sub-project owner and emeritus chair. I'm Rita Zhang, I'm also a SIG Auth co-chair. I'm Mo Khan, I'm also one of the co-chairs. Welcome. So for our topics today, we're going to go through a bunch of the work that SIG Auth has in the pipeline: a few things that are basically finished, so a couple of things that recently graduated to stable that we're going to talk about; a bunch of things that are in the process of being implemented; and then a few provisional KEPs that we're still working through the design of. So, show of hands, who here is familiar with Pod Security Admission? All right, pretty good. Yeah, so Pod Security Admission graduated to stable in 1.25, which is the same release that Pod Security Policy was removed from the tree. We designed Pod Security Admission to be super simple security out of the box for Kubernetes, enabled by default. It's not exactly, well, it's not at all a feature-complete replacement for Pod Security Policy. So this was covering the 80 or 90% use case, and for everything else, we're referring people to the whole ecosystem of policy engines and other options around that. Yeah, oh wait. You want to talk about the migration tool? Yeah, so I did a talk on migrating from Pod Security Policy to Pod Security Admission yesterday, so if you didn't manage to catch that and you're interested, I recommend watching the replay. Oh, sorry. Yeah, we also have some guides published on how to do that migration, and there's a Kubernetes SIGs project called PSP Migrator, which automates a bunch of the tasks in the migration workflow, like identifying mutating PSPs, which can potentially cause problems when you disable PSP because things no longer work.
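For readers following along, Pod Security Admission is configured entirely through namespace labels. A minimal sketch, where the namespace name is invented and the level choices are just one reasonable combination:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app            # hypothetical namespace
  labels:
    # Reject pods that violate the baseline profile
    pod-security.kubernetes.io/enforce: baseline
    # Warn (and audit-log) on anything that would violate restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Each mode can also be pinned to a specific policy version with labels like `pod-security.kubernetes.io/enforce-version: v1.25`.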
So if you're familiar with certificates and the CSR API within Kubernetes, traditionally, when you asked for a certificate, you were completely at the whim of the signer as to the lifetime of that certificate, which for all intents and purposes meant somewhere between one and five years for a cert that came out with no revocation attached, so, amazing security. But we fixed that not too long ago: now, if you want, you can request the duration of your certificate to be shortened, as low as 10 minutes, and all the signers built into the controller manager honor this, the GKE signer honors this, the cert-manager signers honor this, and I think I found all the open source ones and fixed them. So it's optional, but well supported now, I think. All right, now we're going to talk about the implementable KEPs that the SIG is currently working on. How many of y'all are using KMS encryption, or encryption at rest at all? Just a show of hands. Cool, thank you, great. Thank you for doing that for your companies. So yeah, for both v1 and v2, there are some changes coming. For example, the default crypto algorithm for KMS data encryption is now GCM, so for those who didn't want to use CBC, that is now available by default. In 1.24, CBC is still used for writes, and both CBC and GCM are supported for reads. Starting in 1.25, we write GCM by default, but reads still support both GCM and CBC for backward compatibility. And starting with KMS v2, the default will only be GCM. And for those who may be following the SIG Auth PRs, we just merged CRD encryption yesterday, so we're really excited about this. What this means for you is that starting in 1.26, you can encrypt your custom resources: just add your custom resource to the list of resources in your encryption configuration. And your pandas are happy.
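The shortened certificate lifetimes mentioned above are requested through the `spec.expirationSeconds` field on the CSR object. A sketch, with the request payload elided and the object name invented; signers may honor, clamp, or ignore the requested duration:

```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: short-lived-client-cert        # hypothetical name
spec:
  request: <base64-encoded PKCS#10 CSR>   # placeholder, not a real CSR
  signerName: kubernetes.io/kube-apiserver-client
  # Ask for a 10-minute certificate (600 seconds is the minimum allowed)
  expirationSeconds: 600
  usages:
    - client auth
```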
And as we look at encryption at rest, we're also wondering: does the community actually want to encrypt all resources? We currently have an issue open for that. If this is important to you, a wildcard that encrypts everything, which we think should be the default behavior for Kubernetes, secure by default, please plus-one it or let us know there's interest. If so, we'd like to push for that and perhaps make it the default behavior going forward. So please look for that issue. We also have hot reload for the encryption configuration. For folks who have done this in the past, you know how hard it is to make updates there, which also requires an API server restart, which is very problematic and could put your cluster at risk. With hot reload, changes to the configuration will actually be watched; starting in 1.26 we'll have this. For now, the DEK cache is only reset on reload. Again, for those of you who've used KMS encryption v1, I'm sure you've known about these gaps for a while. I think we introduced KMS v1 in 1.10 or something like that, so years ago, and it's still v1beta1, and we'd like to change that. Some of the gaps we've learned about along the way: performance. This is why we couldn't push this to stable. With v1, performance is pretty bad if you have a large number of secret objects you want to encrypt, which can slow down your cluster startup time, because all of those requests are making remote calls to the remote KMS, and your requests can also be rate-limited. Again, that's not great if you're trying to create a cluster. Key rotation is another problem. If you're an operator who's been managing a Kubernetes cluster, you know how problematic this is.
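Concretely, adding a custom resource to the encryption config looks roughly like this; the plugin name, socket path, and custom resource are invented for illustration, and per the talk, custom resources are supported starting in 1.26:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - crontabs.stable.example.com   # hypothetical custom resource (plural.group)
    providers:
      - kms:
          name: my-kms-plugin               # hypothetical plugin name
          endpoint: unix:///var/run/kms.sock
          timeout: 3s
      - identity: {}   # fallback so old, unencrypted data stays readable
```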
Today it's a very manual process, very error prone, and it's hard to determine which key is actually in use. Again, we'd like to change that. We also thought about what kind of health check and status operators would love to have. Today, whenever we want to check the health of a KMS plugin, the API server has to make an actual encrypt or decrypt request, which is not ideal, and the health data is not very useful. And last but not least, we want observability. Today, because there's no tracing ID, it's very hard to correlate events in the logs, so you don't actually know which request went from the API server to the plugin and then to the actual remote KMS. We'd like to change that too. So here we're introducing KMS v2alpha1. This went into alpha in v1.25, so if you want to check it out, the KEP is there, as well as some usage guidance on the Kubernetes docs website. To use this feature, you enable the API server feature gate for KMS v2. Again, this comes with GCM by default; it's alpha in 1.25 and 1.26, and we're hoping to target beta in 1.27. If you want to get involved, please join the SIG Auth KMS dev channel on Kubernetes Slack. Some of the benefits: we talked about the gaps, so what do you get with v2? For performance, we have a reference implementation to help all the KMS plugins out there see how to leverage the recommended approach, which has a key hierarchy design: you have a local key encryption key so that you reduce the number of remote requests you have to make to the remote KMS. We're targeting 1.26 for this reference implementation. For key rotation, we now add metadata to track the actual KMS key being used for each request, which allows automated rotation without having to restart the API server. And we've introduced a new status API.
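A rough sketch of what that status call looks like on the plugin's gRPC interface; this reflects the v2alpha1-era shape and field details may have changed since, so treat it as illustrative rather than authoritative:

```proto
// Sketch of the KMS v2 plugin interface; not the authoritative definition.
service KeyManagementService {
  rpc Status(StatusRequest) returns (StatusResponse) {}
  rpc Encrypt(EncryptRequest) returns (EncryptResponse) {}
  rpc Decrypt(DecryptRequest) returns (DecryptResponse) {}
}

message StatusResponse {
  string healthz = 1;  // "ok" when the plugin can reach the remote KMS
  string version = 2;  // API version the plugin speaks, e.g. "v2alpha1"
  string key_id = 3;   // current remote key; a change here signals rotation
}
```

The point is that health, version, and the in-use key ID come from a dedicated RPC, so the API server no longer has to issue a throwaway encrypt/decrypt just to probe the plugin.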
So the API server can just call this status API to check the health of the plugins, separate from actual encryption and decryption requests. Last but not least, we also added a new UID that is generated for each envelope operation; this helps us identify which requests actually went through the entire workflow. And we also have a new proto format for the data stored in etcd, which gives operators the ability to use a protobuf viewer to actually see the data that's stored there. All right, this is going to be really hard to see, so just bear with me. I want to walk you through a day in the life of an encryption and decryption request. What does that actually look like? Imagine you have an encryption request coming from the API server. What we're introducing is this key hierarchy concept. Today, with existing KMS plugins, the request comes in, the plugin sends it to the KMS, the remote KMS encrypts the data encryption key, and the encrypted key is returned to the API server. That's how it works today. With the new key hierarchy design, you, the plugin developer, can add a local KEK cache that basically says: for all encryption and decryption, I'm going to use this local KEK, which is itself encrypted by the remote key stored in the remote KMS. That's the key that will actually encrypt the data encryption keys, and it's also the key that is cached in the plugin. That data gets returned in the encryption response, and the API server says: here's my encrypted data encryption key, here's the remote key I'm using, and here's the local key encryption key that's cached by the plugin. Similarly for decryption: with this key hierarchy design, the API server says, it looks like this request is actually using a local KEK.
So I'm going to ask the key cache to look for the local key encryption key, and if it's found, I'm going to use that instead of making a remote request. That's also the key used for decryption, and that's what's stored in etcd as well. What this means is you're reducing thousands and thousands of requests to the remote KMS down to n requests, depending on how you set up the configuration. So if you have three API servers, potentially you only make three requests to your remote KMS, which is beautiful: you save money and you reduce the performance impact. So what does that protobuf format look like? As you can see, for the data actually stored in etcd, there's a prefix that looks very similar to v1; it says KMS v2, and then the actual object is stored in this beautiful structured protobuf format. There's the encrypted data, and then the encrypted data encryption key, which are encrypted. However, the key ID and the annotations are not stored in encrypted form, so that as an operator you can actually go look at this data and understand which key was used for encryption and decryption. And there's also a little room where, if you want to add things, you can add to the annotations. A very, very tiny space there. All right. Thank you, Rita. So I guess as a member of SIG Auth, at some point I got exhausted by not knowing who I was on a Kubernetes cluster. So we added an API so you can do that. It's alpha in 1.26, and I expect basically no changes to this API, because there are basically no inputs and I don't really know what else it would do. So, hope to see that one soon. So, anyone here familiar with client-go credential plugins? Anyone seeing all the fun warnings coming out of, like, the GKE CLI thing? It's going to get deleted soon. The same is true for AKS, don't worry.
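Going back to the key-hierarchy design for a moment, here is a toy Python sketch of why caching a locally wrapped KEK collapses remote KMS traffic. No real cryptography is performed, and every class and function name here is invented for illustration:

```python
# Toy model of the KMS v2 key hierarchy (illustrative only: no real crypto).
import secrets

class RemoteKMS:
    """Stand-in for a cloud KMS; counts remote round trips."""
    def __init__(self) -> None:
        self.calls = 0

    def encrypt(self, plaintext: bytes) -> bytes:
        self.calls += 1
        return b"remote-wrapped:" + plaintext  # placeholder for real wrapping

    def decrypt(self, blob: bytes) -> bytes:
        self.calls += 1
        return blob.removeprefix(b"remote-wrapped:")

class KMSv2Plugin:
    """Generates one local KEK, wraps it remotely once, then wraps all
    per-object DEKs locally instead of calling the remote KMS per object."""
    def __init__(self, kms: RemoteKMS) -> None:
        self.local_kek = secrets.token_bytes(32)
        self.wrapped_kek = kms.encrypt(self.local_kek)  # the only remote call

    def wrap_dek(self, dek: bytes) -> bytes:
        # Placeholder for AES-GCM encryption of the DEK under local_kek.
        return b"kek-wrapped:" + dek

kms = RemoteKMS()
plugin = KMSv2Plugin(kms)
wrapped = [plugin.wrap_dek(secrets.token_bytes(32)) for _ in range(1000)]
print(kms.calls)  # 1 remote call for 1000 objects, instead of 1000
```

A real plugin would of course use authenticated encryption for the local wrap and refresh the KEK on rotation; the point of the sketch is only the call-count arithmetic.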
But if you're familiar with the interface that mechanism has, it requires that the credential used for authentication be directly passed to client-go or kubectl, as an example. That means it cannot be based on, say, a hardware-backed credential, because the hardware doesn't give you the credential. So this is an extension to that existing API that will allow us to support something like mTLS with keys that are stored in a KMS, as well as, if you want to run a front proxy in front of your API server and you want signed requests, it'll be able to do that. Or maybe you just want to add some custom headers for logging or audit purposes. All those things will be possible. So instead of just being able to pass a credential, you'll be able to basically fully control the request before it's sent to the API server. Continuing with the certificates theme: we want to make consumption of certificates and signers within the Kubernetes ecosystem easier. As a first small step in that direction, we're working on adding a dedicated API called ClusterTrustBundle to let you specify CA bundles that can be reused in various contexts within the Kubernetes API. The initial implementation will just be available as something the kubelet can mount into your pods for you. Going forward, there's still some uncertainty about exactly which places will use this. One obvious place is the Kubernetes API server. For example, if you've ever had to configure CA bundles for admission webhooks, it would be really great if you could configure that in one place and update it in one place, and not have to remember every other place you had to update it. One of the open questions, still undecided, is that the API server, unlike most clients, is high memory, high CPU.
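Since the ClusterTrustBundle API was still being designed at the time of this talk, the following is only a sketch of roughly the proposed shape; the API version, field names, and the signer name are all assumptions and may differ from what ships:

```yaml
apiVersion: certificates.k8s.io/v1alpha1
kind: ClusterTrustBundle
metadata:
  name: my-org-ca                      # hypothetical name
spec:
  # Optional: tie the bundle to a signer so consumers can select by signer
  signerName: example.com/my-signer    # hypothetical signer
  trustBundle: |
    -----BEGIN CERTIFICATE-----
    ... PEM-encoded CA certificates go here ...
    -----END CERTIFICATE-----
```

The idea is that a CA bundle becomes a first-class, centrally updated object instead of being copy-pasted into every webhook config and pod volume.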
We can kind of hold it to a higher bar, and traditionally we haven't had revocation support for certificates in Kubernetes, but this is one place we're exploring being able to add something like CRL support, so that if you wanted to revoke a cert, at least the API server would honor it for you. And I suspect that's a client many people care about. Yeah, so legacy service account tokens are the secret-based service account tokens that Kubernetes originally had. I can't remember when we introduced bound service account tokens. I think it was 1.20? Something like that. I think it went GA in 1.22. Okay, so in the last two years. Yeah, it's been a long time. We have bound service account tokens, which are generated on the fly for a pod running with that service account, or for anything that wants to create a TokenRequest. Those also expire: they have an expiration time, they need to be refreshed, and they're also bound to the pod lifecycle. But we still, in Kubernetes, had these legacy service account tokens that were still generated for each service account. So you still had these secrets that, if they got leaked, could still be used; they never expired. So we're excited to be able to get rid of those for cases that don't need them. As of 1.24, we're no longer auto-generating them. And in 1.26, we're adding an alpha to make it easier to migrate clusters off of legacy service account tokens. If this feature is enabled, there are a couple of tracking features that get added: if you do use a legacy service account token, you get a warning, and labels get added to the secret itself to show that it has been used. That makes it easier to identify the usage and migrate off of those tokens.
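On current clusters, the replacement workflow for those legacy secrets is to mint bound, expiring tokens on demand. For example, with invented service account and pod names:

```shell
# Request a bound service account token valid for 10 minutes
# (uses the TokenRequest API under the hood)
kubectl create token my-service-account --duration=10m

# Optionally bind the token to a specific pod's lifecycle
kubectl create token my-service-account \
  --bound-object-kind Pod --bound-object-name my-pod
```

The printed token expires on its own and, if pod-bound, stops validating when the pod goes away, unlike the legacy secret-based tokens.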
So, as an explanation of what provisional means here: the things we're about to talk about are completely up in the air at this exact moment. There isn't necessarily agreement among the SIG or the community that this is where we're going, but there are KEPs open for them. And if you want to get involved, this is a great place, because we haven't written any code yet. So, for the folks familiar with the existing OIDC integration the Kubernetes API server supports: it's very limited. You can only have one provider, and the control you get over the identity asserted in the Kubernetes API is pretty limited. It's also hard-coded to support ID tokens from the OIDC spec, even though the OIDC folks disagree with our use of them. Some of the things this new API fixes: you can have as many providers as you want, and, following along with the rest of the Kubernetes community, we're hoping to use CEL to give you much more flexibility around how you actually assert the identity to the Kubernetes API server, as well as how you validate constraints on those identities. You can see some examples on the bottom of how you would validate that an identity is allowed to be on this cluster, as well as how you would actually do the final extraction of claims. One of the things I know we don't have agreement on right now is whether this can be used outside of Kubernetes, or outside of OIDC ID tokens. So, I see SPIFFE shirts, I see SPIRE shirts, hi. I see SPIFFE people. Yeah, I mean, I'm looking. So maybe we could use this for SPIFFE JWTs. That'd be cool. I haven't convinced Jordan yet. I know, but another thing I want to be able to do is this: today, tokens have a hard-coded 10-second cache, just to completely prevent a DoS against the API server. But otherwise, if you have a webhook, it's a pure hard-coded cache of some number of minutes, whatever you configure.
In the case of JWTs, we can have a claim that tells us the expiration time, so you can imagine a much more intelligent cache that caches up to that duration or until a key rotation occurs, whichever happens first. Just all-around better semantics. Following in that same trend: today you can only have one authorization webhook configured on the API server. We want to be able to support as many as people want, in whatever order they want to run them. If folks don't know: today, if your authorization webhook is down, we fail open. So if it was a webhook that could have denied a request, we won't know; we'll just keep going, which is not exactly what people expect. Another thing we want to make really easy: unlike admission, which only runs on create, update, and a few other places, authorization runs on every request. So if you want to have many webhooks, but without any single one of them being down being completely catastrophic for your environment, we want to let you scope when a webhook is invoked. If you know a webhook only interacts with users from a particular IdP that all have a prefix, we want to support that kind of use case. Or if you want to run the webhook on the API server itself, well, you're going to have a bit of a cycle in your dependency graph unless you can filter certain identities out from being targeted by that webhook. So we're considering CEL-based filters there, similar to KMS and various other features, and this is true for the OIDC thing earlier as well: we don't want any API server restarts for this unless absolutely necessary. So even though these things are CLI flags, we want to reload the config, so that if you change the file on disk, you don't have to worry about anything else. Yeah, and last but not least, we also want to talk about some of the cross-SIG work that's been going on. I think all three are actually part of API Machinery. And, as Mo mentioned earlier, multiple times.
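Because this KEP was provisional at the time, there is no real API yet; the following is a purely hypothetical sketch of what a multi-webhook authorizer chain with CEL scoping and fail-closed behavior could look like. Every field name here is invented for illustration:

```yaml
# Hypothetical configuration file, hot-reloaded from disk; not a real API.
authorizers:
  - type: Webhook
    name: idp-policy
    webhook:
      failurePolicy: Deny        # fail closed instead of today's fail open
      matchConditions:
        # Only invoke this webhook for identities from one IdP prefix,
        # so its downtime cannot take out unrelated requests
        - expression: "request.user.startsWith('idp:')"
  - type: Node
  - type: RBAC
```

The two ideas from the talk are both visible: ordering multiple authorizers explicitly, and scoping each webhook with a CEL predicate so a self-hosted webhook can exclude its own identity and avoid the dependency cycle.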
CEL: this is the CEL for Admission Control KEP that, I think, recently merged for 1.26. This is going to touch on policies for admission, validation, and mutation, as well as authorization, so there's a lot of overlap between the two SIGs. Then we also have the storage version API KEP and the API server identity KEP. Both are crucial to ensuring we have HA on the API server, which is the foundation for actually being able to rotate the KMS keys, so they're part of the KMS v2 future as well. Anything you want to add? And of course, we couldn't do this without all of the new contributors who made it possible. We just want to give a quick shout-out to all the folks who really, really helped us drive a lot of these changes. Anish, Christof, Damian, Nealik, Andrew, Tahari, if you're listening, thank you for all your contributions. Thank you. And thank you for joining this talk; please use this QR code and give us your feedback. Thank you. Any questions? We'll come around with a mic. You described some changes to the client-go plugin interface, and you mentioned that you'll have more control over adding headers and such. And you said there was some concern about passing the data back through the client-go library, so it's a little confusing: how does the request make it out containing the contributions from the plugin? So, we went through a few designs on this. I think the initial version of this KEP was opened almost a year and a half ago now. What we settled on was that it would be really hard to make an API that served everyone's use case, and also we were tired of extending this part, so we wanted to be done. So effectively what we're planning to do is this: the plugin can give us a proxy, and we will just connect to that proxy, and we would expect you to basically fully terminate the request. Then you can do whatever you want with it. So please actually send it to the API server. But if you don't, that's also fine, I guess.
I mean, yeah, you get to do whatever you want, because you make the request, not us. So it's meant to be an incredibly powerful extension point. Hopefully most of the time people don't need it. But Nick from AWS wanted to make it so their kubelets could do request signing back to AWS infra, for example. It's a great use case, totally not supported by the existing extension point, but we're hoping that others can use it too. Are you all excited about the new features that are coming? Oh, one quick question. Thank you, great talk, and I'm so excited about all of these new features that are coming. I would love to hear from any or all of you: which ones do you think would be a great choice for new contributors to jump in on? We were just talking about that, actually. Say what? We were talking about how hard it is to get started in SIG Auth. I think the stuff that's provisional is certainly easier, because we haven't written much or any code for a lot of it. The bar is a little higher than maybe we want it to be, but it is kind of hard. For the KMS v2 stuff, we have a dedicated GitHub board where there are quite a few issues that are not even started. So if you're familiar with encryption at rest and want to suffer through API server wiring, then yeah, that's for you. Also, to get involved, coding is one part, but the other part is giving us your feedback. As we write the designs and the KEPs, we would love more feedback and reviews, because y'all are the ones that are going to use them. For example, we have to work with all the KMS plugins to make sure the migration journey isn't going to be too terrible; that's why we're writing the reference implementation. So please get involved, give us feedback, or just tell us: yeah, this makes sense to me, I'd prefer to use it this way, or maybe this doesn't make sense and here's how I'd like it. We definitely welcome those types of contributions.
If I could just add: I think those are good answers for people who are experienced in this space, maybe not Kubernetes specifically, but who understand a lot of the nuances of SIG Auth and have a lot to add on the design aspects. For more junior contributors, or people who aren't as experienced in this space, I think the stable features are actually a really good place to jump in. I have a bunch of audit cleanups that I'm working on in the next release; it would be great to get some help improving some of the testing around auditing and some of the other graduated features. There's always room for cleanup and testing improvements as well. That's a great point; that reminds me: on the OIDC provisional KEP, one of the prerequisites is an open issue right now with, I don't know, 20 things on a list of missing test coverage for OIDC. And I have effectively blocked new changes to all that code, because I don't trust it not to break if you touch it. Okay, quick question. Thank you for the presentation. Great design on the key caching; I like the key wrapping part. I think that's a really, really cool thing and will solve a lot of latency issues. Maybe just a general question: what is the scope of the SIG? What are you focused on? What is in scope and what is out? Because I see authentication and authorization at the same time in there. Okay. So SIG Auth owns basically all authentication, authorization, and policy across Kubernetes, as well as things like service accounts. So basically, if there's a security aspect to some feature, either we're a participating SIG or we're the owning SIG. Also policy? Yeah, policy as well. To contrast us with SIG Security: SIG Security exists to help people who, for example, work in IT security, are consumers of Kubernetes, and are building tools to run on Kubernetes. They do a lot of that work and outreach and help the community grow that way.
We focus more on the core code within Kubernetes, like Pod Security Admission and all those things, and how those mechanisms work. But certainly there's a lot of overlap between our SIGs. There's also a lot of overlap between SIG Auth and API Machinery, a lot. Yeah, none of our stuff works without it. There was a question over there. Yeah, first, thank you for the talk, it's very nice. I have a quick question about upgrading. It looks like some features are breaking changes without backward compatibility. Is there any plan to put some automation in the code, so that when users do the upgrade, they don't break their cluster and cause an outage in the production environment? Was there a specific feature that looked breaking? Sorry, it looks like the algorithm you use, GCM something. Oh, okay. So we were very careful about that. Yes, if we had done it poorly, it would indeed have been a breaking change. But the way we wrote it: we do require that you follow the Kubernetes release guidelines of no skipped releases, so you upgrade from one version to the next minor version, or major version, whatever you want to call it in Kubernetes land. But we always did it in a way so that if you did an upgrade, used the cluster, realized something was wrong, and did an immediate downgrade, neither direction would cause any data corruption. So we staged it out over many releases, and we have retained, basically until the end of time, the capability of reading the old formatted data. So KMS v1 actually has two legacy formats that are still supported in our code base. Even though we never write them out, and we haven't written them out in ten releases, we just assume there might be some old data that's never been rewritten. So we are very careful. I think the closest thing we have to a breaking change is not auto-generating secrets anymore for service accounts. But even then... what? Yeah, Pod Security Policy, yeah. Oh, we have a webhook.
But that one, I mean, you can run it as a webhook if you really want to. That's fair, we did remove a beta feature without graduating it, that's fair. I don't think we'll end up doing something similar for KMS v1beta1, mostly because, I don't know, if you're happy with it, I guess it doesn't really hurt us if you keep using it, whereas Pod Security Policy did cause all sorts of confusion and, just, mistakes. I just want to call out that you see "no user action required"; that's very, very important. And this is exactly why we want to make GCM the default for writes, while for reads you can still do both. But the hope is that slowly, everybody will just migrate to GCM. We're at time, but we're happy to stick around and take more questions in the hallway. Thank you for coming. Thank you.