Hey everyone, we're super grateful for y'all making the time to listen to us talk. My name's Tyler Lasowski, this is Cody Glosser, and we're extremely excited to talk to y'all today about implementing an access control system based on CA rotation. I'm going to turn it over to Cody to kick us off. Awesome, thanks Tyler for the introduction. So let's take a look at a typical Kubernetes cluster deployment. There's one CA across the cluster. The CA issues all the certificates for the cluster components. Each certificate grants a component the ability to do specific actions, like a kubelet's ability to create node data. But let's imagine we have a Kubernetes cluster and someone has access to your CA. Whether it's an administrator with admin access who has left the organization, or a malicious attacker who gained access to the cluster and the CA, or security controls in your company or organization that require CA certificates to be rotated: how do you prove that their access was revoked if you can't rotate the CA and the client certificates don't expire? What is your strategy, and what do you do? Without CA rotation, they basically own your cluster. One strategy is to delete the cluster and recreate it to get a new CA, but that would require you to migrate your applications, which is not an ideal solution. We need a way to revoke the certificates with zero downtime for the cluster and our applications. We're going to talk today about a CA rotation process that allows you to meet all these objectives: rotating the CA without taking the workloads offline, with zero downtime for the cluster and its applications. I'm going to hand it over to Tyler to talk about this process. Thank you, Cody. So now let's picture us collectively in this room as an organization that operates Kubernetes clusters for our business. As part of this organization, Cody and I are going to be the administrator team.
As part of the administrator team, we will be responsible for all operations associated with our Kubernetes clusters. In order to perform those operations, Cody and I are going to be granted admin certificates to the cluster that will allow us to do operations like cordon and drain nodes, view utilization across cluster components, debug software-defined overlays, and so on. Everything is going to go great with our operations for a year. Business applications continue to get onboarded that drive business value, and there's zero downtime across our Kubernetes cluster fleet. Then at that year mark, Cody is going to hit the lottery, and in this fictitious world Cody is going to decide that living in Hawaii is better than working with me, so he's going to retire, buy property out there, and move away. Now we're going to fast forward three more months, and as an organization we're going to enter into an audit. The auditors are going to ask for our administrator group and the history of that group. They're going to notice that Cody has left our organization, and they're going to proceed to ask us the following questions: What credentials and secrets did your administrator group have access to? Show me evidence that you've rotated them, and show me evidence that you have re-approved your existing administrators to have access to these newly rotated credentials. That actually brings us to the start of the CA rotation process. Picking up where Cody left off, we started with a Kubernetes cluster that had one CA across all the components, issuing certificates for all the components of the cluster. In the first step of the CA rotation process, we introduce a new cluster CA that is going to issue all certificates for all cluster components going forward. We're going to introduce that CA in a well-defined multi-step process, targeting a subset of components in each step.
The first step, and the first set of components that we're going to target, are the server-side components: things like the kube API server and so on. When updating these components, we're actually going to update two configuration pieces of each. One is the CA bundle associated with each component. We're going to move from having the one initial CA to having two CAs in that CA bundle. Additionally, we're going to produce two extra certs that are cross-signed CA certs, where each CA signs the other's CA cert: you'll have the new CA signed by the old CA, and the old CA signed by the new CA. That's going to be updated and rolled out across all your server-side components. The next piece is that all the server certificates, client certificates, and peer certificates are going to be updated to be issued from the new CA. Additionally, in that chain, we're going to include the new CA certificate cross-signed by the old CA. The reason that's critical is that it's what allows us to keep all our existing mTLS connections validating across this phase, even with clients that have had no updates yet. Let's look into that a little more by analyzing the connection flow from the administrator in the bottom right to the kube API server. When that administrator reaches out to the kube API server, it's going to pull down the kube API server certs. We're going to see that the server certificate is signed by the new CA, noted by that blue key in the bottom right. Now, if just that certificate were served, we'd look at that administrator and see that they only have the old CA in their certificate bundle. That would not validate. But when we include that cross-signed cert of the new CA signed by the old CA, the client will use it as part of the chain, link back to the CA bundle locally, and validate that side of the TLS connection.
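As a rough illustration of why the cross-signed cert matters, here's a toy Python model of chain building. The names (new-ca, old-ca, kube-apiserver) are hypothetical, and real validation is done by your TLS library; this just shows how the chain dead-ends without the cross-signed cert:

```python
# Toy model of X.509 chain building during CA rotation (illustrative only).
# A "cert" is a (subject, issuer) pair; a bundle is a set of trusted CA names.

def chain_validates(leaf, intermediates, trusted_bundle):
    """Walk issuer links from the leaf until we hit a CA in the local bundle."""
    current = leaf
    seen = set()
    while True:
        subject, issuer = current
        if issuer in trusted_bundle:
            return True
        if subject in seen:  # loop guard
            return False
        seen.add(subject)
        # Look for an intermediate whose subject matches our issuer.
        nxt = next((c for c in intermediates if c[0] == issuer), None)
        if nxt is None:
            return False
        current = nxt

# Server cert now issued by the new CA...
server_cert = ("kube-apiserver", "new-ca")
# ...served together with the cross-signed cert: new CA signed by old CA.
cross_signed = [("new-ca", "old-ca")]

# An administrator that still only trusts the old CA:
old_only_bundle = {"old-ca"}

# Without the cross-signed cert the chain dead-ends; with it, the chain
# links back to the old CA in the client's local bundle and validates.
print(chain_validates(server_cert, [], old_only_bundle))            # False
print(chain_validates(server_cert, cross_signed, old_only_bundle))  # True
```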
Now, let's go to the reverse side, where the client presents its certificates to the kube API server. The server is going to see that the client certificate is signed by the old CA, noted by the green key in the bottom right of the administrator bundle. Then we go to the kube API server's CA bundle, see that the old CA is still present, and that side of the TLS connection is also validated. This can be done across all components, and effectively we're able to introduce a new CA that allows us to issue new certificates to all our cluster components without causing any regressions with our existing clients. Now that we've updated the server-side components, we're going to move to updating the configuration of all our pod workloads. Again, there are two main configuration items. We're going to ensure that the pods' service account CA bundles are updated to include the four unique CA certificates. Additionally, we're going to update any server certificates for backend pods that are used to serve API services or webhooks, et cetera. How that's typically performed is that once the kube controller manager is updated with the new CA bundle, it will go and resync the CA certs across all service accounts. If your application reloads that automatically, it happens seamlessly. If not, sometimes you have to go in and restart some pod workloads, and on first boot they'll load the new CA certificates and have them active. Then the server-side certificates will just be issued out from the CA, and the application will load those as well. Again, just to note, you're going to see a very similar pattern, where the CA bundle is updated to have the four certs the same way it was for the server-side components, and the server certificates are issued by the new CA and include the cross-signed CA cert.
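One simple sanity check during this phase is to count the unique certificates in the service account bundle mounted into a pod (at /var/run/secrets/kubernetes.io/serviceaccount/ca.crt): you'd expect all four unique CA certs to be present. A minimal sketch, with the PEM payloads faked purely for illustration:

```python
import re

def unique_certs(pem_bundle: str) -> int:
    """Count unique certificate blocks in a PEM bundle."""
    blocks = re.findall(
        r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
        pem_bundle, re.DOTALL)
    return len(set(blocks))

# Hypothetical bundle contents during the addition phase: old CA, new CA,
# and the two cross-signed CA certs (payloads faked here for illustration;
# a real bundle contains base64-encoded DER).
bundle = "\n".join(
    f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----"
    for body in ("old-ca", "new-ca", "new-ca-signed-by-old", "old-ca-signed-by-new"))

print(unique_certs(bundle))  # 4 -- all four CAs present before revocation starts
```

In practice you'd run this against the mounted file on a sample of pods before moving to the next phase.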
It's going to be a very similar pattern that we'll see across every component we update. Now that we've updated the pod workloads, we move to updating the configuration on the nodes themselves, which is mainly your kubelet client and server certificates, and the associated CA bundles on each of those nodes that the kubelet uses to validate the API server connections. The certs are updated in a similar fashion, with the server and client certificates issued by the new CA, and that chain also including the new CA cert cross-signed by the old CA, which again just ensures that our existing clients can still validate the mTLS connection. How this is typically performed is that we go across a subset of nodes at a given moment in time and execute the workflow to update those certificates, typically through a CSR-based workflow where you put a CSR up to the Kubernetes API server, get it approved, and then download that configuration back down, additionally updating the associated CA bundles in the appropriate configuration as well. Once we've done a subset of nodes, we bring those back online, and continue in a well-defined fashion until we've updated every node in our cluster. Lastly, in the addition phase, we move to reissuing the certificates associated with our administrator clients and external automation. Think of external automation as any sort of Jenkins automation that might be doing debug operations on your cluster, or external disaster recovery automation that needs access into the cluster to perform its functions. Again, in this phase, certificates are issued that are all signed by the new CA, that chain is updated to include the cross-signed cert, and the CA bundle is updated to include the four unique certificates.
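The rolling, subset-at-a-time node update can be sketched as a simple batching loop; the per-node steps in the comment follow the CSR-based workflow described above (the batch size and node names here are made up):

```python
def batches(nodes, batch_size):
    """Yield nodes in fixed-size subsets so only a fraction is updated at once."""
    for i in range(0, len(nodes), batch_size):
        yield nodes[i:i + batch_size]

nodes = [f"node-{n}" for n in range(1, 8)]
for subset in batches(nodes, 3):
    # For each node in the subset: cordon/drain, submit a CSR to the API
    # server for the new kubelet client/serving certs, wait for approval,
    # install the issued certs plus the updated CA bundle, restart the
    # kubelet, then uncordon and move to the next subset.
    print(subset)  # three subsets of 3 + 3 + 1 nodes
```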
At this point, having done those operations, we are prepared to start the revocation process. But first we're going to take a second, pause, and do an inventory of all our active systems to make sure they're updated, because as soon as we start the revocation phase, any component that is not updated will start to see failures on its mTLS connections. So we've taken that pause, we've ensured all our components are updated, and now we proceed to start the revocation process. You're going to see that the revocation process follows a very similar workflow to our CA addition process. To start, again, we target the server-side components first. We'll be updating the same two configuration pieces that we did initially with the CA addition. Specifically, the CA bundle moves down to containing just the new CA cert. Then for our client and server certificates, we remove that cross-signed CA from the chain, to remove any references to the old CA at all. Following the same pattern, we then move to updating our pod workloads so the CA bundle associated with the pods contains just the new CA, and again, for the server certificates, we remove the cross-signed CA from the certificate chain as well. Next, we move to the kubelet side of the components. Again, in the same controlled fashion, going across a subset of nodes, we update the CA bundle to just have the new CA and remove the cross-signed CA from the chains of the client and server certificates. Lastly, we move to our administrator and automation components: in the same fashion, update the certificate chain to no longer have the cross-signed CA, and have just the new CA in the CA bundle. At this point, we've rotated out all references to the old CA.
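To make the effect of the revocation concrete, here's a minimal sketch. It uses a bare issuer-in-bundle check, which glosses over real chain building, and all the names are hypothetical:

```python
def validates(cert, bundle):
    """Minimal check: does the cert's issuer appear in the trusted CA bundle?"""
    _subject, issuer = cert
    return issuer in bundle

admin_old = ("departed-admin", "old-ca")   # cert issued by the CA being revoked
admin_new = ("current-admin", "new-ca")    # cert reissued under the new CA

bundle_during = {"old-ca", "new-ca"}       # during the addition phase
bundle_after  = {"new-ca"}                 # after the revocation phase

print(validates(admin_old, bundle_during))  # True  -- still accepted mid-rotation
print(validates(admin_old, bundle_after))   # False -- departed admin locked out
print(validates(admin_new, bundle_after))   # True  -- reissued cert keeps working
```

This is also why the pre-revocation inventory matters: any component still presenting an old-CA cert flips from the first case to the second the moment the bundle is trimmed.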
With the well-defined steps in this process, we were able to seamlessly, with zero downtime, update all components to pick up the new CA and then successfully handle the revocation of the old CA. At this point, I'm going to hand it back to Cody to talk about some considerations and challenges when actually executing this flow in production. Awesome. Thanks, Tyler. Some things to consider here when you're doing the CA rotation. During that pause period Tyler talked about, you're going to have applications, like Go applications, that do not automatically reload these certificates while they're running. You need to make sure you restart those applications during that period so they pick up the new CA. If you revoke the old CA cert before they've reloaded, your applications will stop working, so it's important you take those steps. Another big one is webhooks. Webhooks that have a hard failure policy and sit in front of critical components of your cluster, like intercepting Secret or ConfigMap data, have to be running; if they're unavailable while requests are going through the API server, that can cause a deadlock during your CA reload process. Also, ad hoc automation is often dependent on stored configs. During that pause period, you want to make sure you look at all of your Jenkins jobs and everything else that reaches into this cluster, because those old configs will no longer work once you revoke the old CA. Also, if this was a security event, where someone broke into your cluster and stole that CA, you should do an evaluation of your cluster to make sure they can't do it again, because otherwise you'd just have to revoke the CA all over again, which is not an ideal situation. Make sure you evaluate that as well. Then the service account token keys can also follow a similar pattern.
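For applications that load their certs once at startup, one alternative to restarting the pod is a small watcher that triggers a reload when the resynced bundle changes on disk. A minimal sketch, assuming your application exposes some reload callback (the class and callback here are hypothetical, not part of any Kubernetes API):

```python
import os

class CertReloader:
    """Invoke a reload callback when a cert or CA bundle file changes on disk.

    Many runtimes (a Go server built with a static tls.Config, for example)
    read their certs once at startup, so during the rotation's pause period
    the resynced CA bundle is never picked up unless the app rereads it, or
    you restart the pod. This sketch polls the file's mtime; call check()
    periodically, e.g. from a timer thread.
    """
    def __init__(self, path, reload_cb):
        self.path = path
        self.reload_cb = reload_cb
        self._mtime = os.stat(path).st_mtime

    def check(self):
        """Return True and fire the callback if the file changed since last check."""
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            self._mtime = mtime
            self.reload_cb(self.path)
            return True
        return False
```

Inotify-style file watching, a SIGHUP handler, or simply restarting the pod are equally valid ways to get the same effect.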
Thank you to our sponsor, Uptycs, for making this event a reality. Now we can open up to questions. Thank you. First of all, thank you very much. For me, this is the first session that goes into this level of detail, and that's pretty nice. Simple question: isn't this a bit of an overkill? If I have something like HashiCorp Vault in place, I could use it for the same approach and say the certificates have only a short time to live, like a day, an hour, or whatever, based on what I need. Then I only trust the certificate authority from HashiCorp Vault, for example. I also get the auditability. For sure, it still doesn't solve that at some point I also have to change my certificate authority, so this is still needed, and that's why I really like it, but I simply question whether it's needed to rotate the whole authority for every user that leaves the company. No, you bring up an excellent point on the tradeoff in that approach. The validity window of those certs is really the delta you're looking at, because once a cert expires, that authentication goes away, which matters especially for an administrator. So there are mitigating controls you can put in place. Also, typically, especially with some of your managed providers, you're going to see a lot of integration directly with their identity service, where you're not going to be using certificates as much. However, we typically always see that there's some sort of break-glass mechanism that you want to have: in case your identity service has an outage, you want that long-lived certificate to fall back on to reach those pieces.
But you're exactly right about those tradeoffs. You have it right: if this cert is only signed for a day and someone who had access to it leaves, I really just have that day-long window where I have to figure out how to reconcile it. But what we saw is that enabling a zero-downtime process takes away the fear of doing the rotation, so it can become a workflow that's implemented regularly. Typically, what we see is annual: people will go through this process and do that revocation once a year. No, excellent points you brought up, and thank you for the question. No problem. Hi, my name is Andre, from London. The question would be, when you're talking about the revocation, what's the process? What do you mean by revocation? Do you add the old CA to a certificate revocation list, or are you physically removing it from the bundle? And the second question: when you were talking about enumerating all of our workloads, making an audit of what we have, are there any clues on how to do that? Do you have any thoughts on how we can implement it? Because it could be a really challenging task to identify all the workloads that have not yet changed their certificates. Thank you. Yeah, excellent question. So let me talk about that first piece. Specifically, it is not actually a certificate revocation list. There are config flags on the components, like --client-ca-file, and what they're saying is, effectively, these are the CAs that I trust. So you'll move through a period of time where you have both CA certificates in that bundle.
And then once everything gets updated to be issued by the new CA, you start updating those flags to remove any references to the old one. Now, certificate revocation is absolutely another valid policy. Kubernetes does not support it out of the box in those API components, and it's often focused less on a specific CA and more on the individual certs that have been issued to your components, which can start to get unwieldy when you have 500 nodes across your cluster and a bunch of different clients. So hopefully that answers question one. Then I think question two is another excellent one, which is: okay, this is great, and I see the well-defined steps, but how can I actually keep track of what that CA is issuing, what the full set of components is, and monitor that? Typically it's twofold. When you're doing operations within the cluster, like the kubelet certs, the client and the server, if you're executing a CSR-based workflow you'll get audit events associated with that, like, hey, node A is joining and created a CSR request. That audit event gets forwarded, and you can log it in whatever your logging backend is. Now, when you start to get into some of the administrator clients that you issue, that can get a little tricky, because sometimes those aren't issued through the kube API server at all. It's just: hey, I have a CA, I'm going to take that CA key, sign this certificate, and issue it out.
Typically with those, whatever automation you have, what we'll see is, maybe as a final step before it hands the certificate to its component, emitting a log to an auditable history: hey, we've issued this with common name, say, administrator-one, with the date and everything associated with it. What I think is nice, and it's absolutely important that you brought this up, is that there's typically always some creep that gets around the process: someone's in a rush and issues a cert outside the workflow. But you can still have confidence when you go through this revocation process that you're starting from ground zero. Even if one cert slipped through the cracks that you didn't document, once you revoke, you're back to ground zero again, and then those processes kick into gear. So hopefully that answers your question, which was an excellent one, by the way. Just a small point of clarification: when you're talking about zero downtime, you're mainly talking about zero downtime of the platform; your applications may need to be able to handle things like restarting their pods. You're exactly right. We're just talking about the platform itself, and ultimately some things are going to be restarted. There is an assumption that your application can handle that, whether through a rolling deployment or whatever it might be. Kubernetes is obviously going to do some of that heavy lifting for you, as far as taking things out of rotation, but if there are stateful protocols going on, it does get a little tricky. So that's an excellent point. Yeah, and that's an excellent question.
We sort of left that out, because we know a ton of different organizations use different tools, whether it's kubeadm, or Ansible automation across a suite of VMs, whatever the tooling is. Simply broken down, I'd say it falls into two categories. If you look at a lot of the managed providers, say you're on AKS or EKS or the IBM Cloud Kubernetes Service, those rotations are going to be automated for you through APIs. But if you're managing your own fleet, it's typically going to be, from what we see, either Ansible-based automation, if you're running your components on a VM-based system, where you go on a per-VM basis, update the CA certificate wherever it lives on your file system, update the client certs, and then restart the component; or, what we sometimes see, more of a Gardener-based model, where everything is Kubernetes-centric and even the API servers you're running run as Kubernetes deployments, and there it's sometimes just a ConfigMap update and a rolling restart of the kube API server. Does that answer your question? Awesome. Excellent question, by the way. Well, thank you so much, y'all, for your time. We're really appreciative of it. Hopefully this was extremely helpful for y'all. I hope y'all have a fantastic trip and a wonderful, safe flight home. Thanks for your time. Yeah, thank you, everyone.