My name is Mark. I'm an open source tech lead at Cisco, and I'm also a CNCF ambassador as of this March. For the last couple of years, my primary job has been to help engineering teams run their business applications on top of Kubernetes without worrying about things like configuration management, secret management, or deployment pipelines.

Before we jump in, I want to ask you a couple of questions. Who here uses, or plans to use, Kubernetes? OK, so who uses Kubernetes secrets? OK, so who rotates those secrets periodically, in an automated way? How can you sleep at night?

All right, I'd like to start by telling a story which I believe will be familiar to a lot of you. A couple of years ago, I was in a debug session in the middle of the night. I probably had a couple of beers in me at that point. And I figured out the problem (it was a release that went out that day), fixed it, pushed it to Git, and I was about to head home when all the bells went off. It turned out I had managed to commit and push AWS credentials to a public repository, which is obviously not good. And I had been so lazy and reckless that I had actually taken those credentials from an active instance in a development environment. So I rotated the credentials (it was a dev environment, that was fine), deployed new ones, and I was about to go home when even more bells went off. It turned out that those exact same credentials were used in three other environments, including a production one. So what started as a simple debug session basically ended in a production incident. Does that sound familiar to anyone? Oh, OK.

So obviously there are lots of ways humans can leak secrets and cause production incidents, but coming from the Kubernetes direction, there are other ways to compromise secrets. For example, there is a common misconception that Kubernetes secrets are insecure because they use base64 encoding. The real reason they can be insecure is that there is no encryption at rest configured by default in Kubernetes. If you just spin up a Kubernetes instance on any cloud provider, those secrets will be stored in plain text in etcd, or in whatever other database the provider uses for their Kubernetes offering. So that is a problem: if you can't trust your provider (and it doesn't necessarily have to be a public cloud provider), any secrets you store in Kubernetes are sitting in plain text in etcd.

Another way to compromise secrets, and this one is actually my favorite, is not setting up RBAC properly. A lot of people think about taking away developers' access to Kubernetes secrets, but they still grant execution privileges into the pods for debugging reasons, which means that any developer with those permissions can just exec into the pod and read the same secrets they would otherwise have gotten from Kubernetes.
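To make that RBAC point concrete, here is a minimal sketch of the kind of Role that creates this hole; the Role name and namespace are made up:

```yaml
# Hypothetical Role: secrets access is withheld, but "exec" is allowed,
# so anyone bound to it can still read secrets from inside running pods
# (environment variables, mounted files, process memory).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: my-app
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]   # this alone re-opens the door to secrets
    verbs: ["create"]
  # note: no "secrets" rule here, yet secrets remain reachable via exec
```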
So it's really not about when your secrets will be compromised; it's about what you are going to do about it. There will always be a human factor, and there are a bunch of other ways for secrets to get compromised. And what can you do about it? Lots of things, obviously, but if you read the title of this talk, today's topic is rotating secrets.

When I talk to people about rotating secrets, they often get scared: do we really have to rotate all the secrets we have? In a lot of applications we have a bunch of different secrets. The answer is that you only need to rotate the secrets that are not rotated automatically anyway. In many cases, if you use workload identities for example, those already use short-lived tokens and you don't need to think about rotating them; what you need to think about rotating are the long-lived credentials.

So why is it important to rotate secrets? We already talked about one reason: if you leak something, for example by pushing it to a public repository, then you have to rotate the secret. You may also have to rotate secrets for compliance reasons: if there is a rule in your company that all secrets have to be rotated every 90 days, you just do it, you don't ask. But the absolute worst scenario is when you don't even know that your secret has leaked, and you just keep using it while a malicious actor is maybe running Bitcoin mining on your EC2 instances, or worse, stealing your data.

Now, think about the challenges of rotating secrets. Depending on your environment, you may have multiple Kubernetes clusters, and you may run different applications in different scenarios. So it's generally a complex process, and it doesn't work well with humans: we are not good at complex tasks, and we tend to screw them up, which in the worst case ends up disrupting your service availability.

All of this points to the fact that secret rotation, it goes without saying, should be possible. And "possible" here means possible within a reasonable timeframe: if you can't rotate your secrets quickly enough when a secret actually leaks, you have a problem. You either expose your system to a potential attack, or you end up disrupting your own service availability. So it should be possible; it should be automated, because it's not something humans should do; and, thinking back to the scenario where you don't even know your secret has leaked, it should be done periodically.

So how does this look in practice, not in the context of Kubernetes but in general? You have some sort of secret store; this is where you store your secrets, and this is where you deploy configuration and secrets from, to one or more environments. You have some sort of secret provider, which may be AWS, GitHub, whatever; you generate secrets at those providers and store them in your store. On the deployment side, you need something that watches your secrets and, when something changes, deploys the new secret to your production or dev or whatever environments you have.

So how does this look in Kubernetes? First of all, you have to decide whether you really want to use Kubernetes secrets or not. As I mentioned before, if you don't plug the holes, Kubernetes secrets may not be for you. So make sure to turn on encryption at rest; I believe it's called EncryptionConfiguration, and on a lot of cloud providers you will see it as envelope encryption. Make sure you configure RBAC properly. And generally, just go to the Kubernetes secrets best practices page, go through the recommended steps, and apply them. If you want to use Kubernetes secrets, do the most you can to secure their use.
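For reference, here is a minimal sketch of what that encryption-at-rest configuration looks like on a self-managed control plane; managed offerings usually expose this as an envelope encryption or KMS option instead of a raw file:

```yaml
# Passed to kube-apiserver via --encryption-provider-config.
# The first provider in the list is used to encrypt new writes;
# "identity" at the end keeps previously stored plaintext readable.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>  # generate one and guard it
      - identity: {}
```

On a cloud provider you would typically use a kms provider entry here instead, backed by something like AWS KMS, so the encryption key itself never sits next to the data.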
All right, so how do we deploy secrets to Kubernetes? Obviously, you can just kubectl apply a Secret object to the Kubernetes API, but if you have multiple Kubernetes clusters and multiple different environments, that's probably not something you want to do, especially if you have a GitOps workflow where you want to automate the whole secret deployment process.

This is where the External Secrets Operator, or just external-secrets, comes into play. It's an operator: you configure it through custom resources, and it basically synchronizes secrets from an external secret store, like HashiCorp Vault or the various cloud provider secret stores, into Kubernetes secrets. That's great, because you can actually put these custom resources into your GitOps workflow; they don't contain the secret itself, just the instructions for how the secret should be deployed to the Kubernetes cluster. And once ESO deploys the secret into the cluster, you can use it as you usually would: mount it as environment variables, or mount it as a file.

So very quickly, how it works behind the scenes. As I mentioned, you configure ESO through custom resources. Similarly to how you would configure cert-manager, you can configure a cluster-level secret store or a namespace-scoped one. The naming is a little unfortunate: you call your external secret store, where you actually store the secrets, a secret store, and the custom resource is also called SecretStore. That custom resource is basically the configuration for your external secret store: how external-secrets can reach the store and synchronize secrets from it into your Kubernetes cluster. Then you have the ExternalSecret object, which tells the External Secrets Operator to synchronize a particular secret from your secret store into a Kubernetes secret in a certain format; you have a lot of options there if you want some sort of templating, or if you want to generate a file, or something like that. The ClusterSecretStore is a cluster-scoped resource, while everything else here is namespace-scoped; it's similar to how cert-manager's ClusterIssuer and Issuer work.
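To make that concrete, here is a minimal sketch of the two custom resources, assuming a Vault backend; the server address, Vault role, paths, and names are all made up:

```yaml
# SecretStore: how ESO reaches the external store (no secret data here).
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-store
  namespace: my-app
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: my-app                 # hypothetical Vault role
---
# ExternalSecret: which secret to pull and what to materialize it as.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: my-app
spec:
  refreshInterval: 1h                  # how often ESO re-checks the store
  secretStoreRef:
    name: vault-store
    kind: SecretStore
  target:
    name: db-credentials               # the Kubernetes Secret ESO creates
  data:
    - secretKey: password              # key inside the Kubernetes Secret
      remoteRef:
        key: my-app/db                 # path in Vault
        property: password
```

Neither resource contains secret material, which is exactly why they are safe to commit to a GitOps repository.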
All right, so there are alternatives you could use if you wanted to; Sealed Secrets is a pretty popular one. The problem with Sealed Secrets, and with SOPS in general, is that they don't really work well in a multi-cluster or multi-tenant environment. Yes, you can share the key between clusters, and you can absolutely do that, but you still have to expose the key for encryption to the developers, or whoever manages the secrets, and you essentially can't revoke it without re-encrypting everything. So generally speaking, I find ESO to be the better solution at this point, but if someone needs a quick-start solution, Sealed Secrets may work as well.

So, I've deployed the secret into the cluster, the application is running, and then something changes. ESO can actually synchronize changes: you can specify an interval at which the operator re-checks the secrets, and it will propagate changes as well. So now what? Well, if you mount the secret as a file into your pod, then obviously your application needs to take care of reloading that file and picking up the new secret. But if you inject the secret as environment variables, you can't really do that at the application level; what you need to do there is trigger a new rollout.

Reloader is a pretty new component on the market, one or two years old, and before that we didn't really have a good solution here. Reloader can trigger standard workload rollouts whenever any of the Kubernetes secrets referenced in those workloads change, which is pretty cool. And going back to our earlier process: now we have Kubernetes in the middle as our environment, we have external-secrets deploying the secrets to the Kubernetes cluster, and we have Reloader watching for those changes and triggering workload rollouts. So basically with this pipeline, whenever something changes in the secret store, you never have to touch anything: it automatically deploys the secret to the Kubernetes cluster, and it automatically reloads the workloads. Sounds good so far, and it actually works pretty well too; we've had this in place for almost two years now.
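Here is a minimal sketch of the Reloader side of that pipeline; the reloader.stakater.com/auto annotation is Reloader's opt-in switch, while the workload, image, and secret names are made up:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # roll this Deployment whenever a referenced Secret/ConfigMap changes
    reloader.stakater.com/auto: "true"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/my-app:1.0   # hypothetical image
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials        # the Secret ESO materialized
                  key: password
```

Reloader also has more targeted annotations for watching specific secrets rather than everything the workload references.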
But the next question is obviously: how can all this go wrong? And the first answer is, who knows? So you have to monitor your entire secret management pipeline, and you have to have tools that let you notice when something goes wrong. Fortunately, ESO exposes a bunch of different metrics, and they also have SLI recommendations for you; I do recommend reviewing those and tuning them to your needs. Depending on how you deploy secrets to Kubernetes, you may want to observe different metrics. They also have a very nice Grafana dashboard that's good for an overall view of how the operator is doing, but it's not that great for debugging when something is actually going wrong. So I've opened an issue, and I'm working with the ESO team to come up with a better dashboard that can help with debugging issues. One additional note here: some of the metrics include resource names as labels, which in very large environments can cause problems because of the high cardinality. So you may want to drop those labels if you don't use them, or drop the affected metrics entirely. This is something we actually had to deal with.

Now, another common problem with ESO is that when you change something, say a secret store configuration or some authorization details in HashiCorp Vault, that change doesn't actually take effect until the next synchronization of an external secret. So the problem is, when you change something, you don't immediately know whether you broke anything or not. What we did to solve this is keep test secrets for every single secret store we have, and every time we change something, we make sure that test secret gets deployed again. So we know if something goes wrong: if any of those test-secret synchronizations fails, an alert goes off and we can fix the problem. This is not something that's immediately obvious, and it's something we learned the hard way.
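Here is a sketch of that test-secret trick, reusing the hypothetical Vault-backed SecretStore from earlier; the canary path and names are made up:

```yaml
# One canary ExternalSecret per SecretStore: if this stops syncing,
# the store configuration or its credentials are broken.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: store-canary
  namespace: my-app
spec:
  refreshInterval: 5m          # short, so breakage surfaces within minutes
  secretStoreRef:
    name: vault-store
    kind: SecretStore
  target:
    name: store-canary
  data:
    - secretKey: ping
      remoteRef:
        key: my-app/canary     # dummy entry that exists only for this check
        property: value
```

Alert on the canary's sync status (ESO reflects it in the resource's status conditions and in its metrics), and re-deploy the canary whenever you change the store configuration, so you immediately see whether synchronization still works.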
Another issue we had recently, which I'm told may not be an issue anymore in newer versions of ESO, but I want to talk about it anyway because it's such an edge case and it still happened. In the secret store configuration there is an option called store validation, which tells external-secrets not to synchronize anything until it can reliably communicate with the store. So if the store goes down, the operator doesn't keep hammering the broken external secret store while trying to synchronize all those external secrets, which is great. We had that turned on, and naturally our internal store went down for about five hours. The store validation retries with a nice backoff, which after five hours had grown to somewhere around a one-day interval, and that meant our complete synchronization pipeline stopped working. We learned this the hard way, because some secrets should have been rotated and were not synchronized at all. The whole pipeline had stopped, which caused a production incident for us. So I'm told this may not be a problem anymore, but a seemingly simple event, the store going down, actually caused the entire synchronization pipeline to stop working. Our solution: we use GitOps, we use Argo CD to deploy all of the configuration to all of our clusters, so what we did was basically add an annotation to all of the store configurations, which forced the store validation to run again, and that restored everything. Doing that across dozens of clusters manually would have been hard.

So to sum up ESO: it's a great solution, together with Reloader. You really don't have to touch anything for it to work, but you still need to understand how and when changes take effect, and you still have to monitor and alert on everything to make sure it doesn't silently stop working.

Now, you may decide that you don't want to use Kubernetes secrets at all, because you can't trust your provider, because you can't change the configuration to plug all those holes I talked about, or just for whatever other reason. You can do that if you want to. In that case, naturally, you will have to talk to the secret store directly somehow. Option A is to integrate that communication into the application directly, but that's not really what we want: keeping configuration and secret management apart from the application is usually a good idea, because you may have to deploy the application into different environments, and you may not want to tie your configuration management to your application. So we need something else, and the alternative is to inject those secrets into the application somehow, for example as environment variables.

In the case of Kubernetes, there are some solutions available, and they generally work the same way. They usually have a mutating admission webhook that mutates pods as they are being created in Kubernetes; they inject a custom init binary into the container through a mounted volume; they change the entrypoint so the custom init runs first; and then they inject environment variables into the application somehow. Different solutions use different strategies for determining which secrets need to be injected.

One of those solutions is called Bank-Vaults. Anyone heard about Bank-Vaults before? OK. Bank-Vaults started as a project at Banzai Cloud and is now being developed at Cisco. I'm one of the original engineers, and we like to call Bank-Vaults the Swiss Army knife of HashiCorp Vault, because it's not just a secret injection solution: it can run Vault on top of Kubernetes for you, it can configure Vault for you, and it can do the secret injection I talked about.

So how does that work in the case of Bank-Vaults? We use so-called secret references and set them as environment variables for the application. We use a mutating webhook as well: it goes through the pods, the secrets, and all the custom resources that take part in creating the pod, and scans for these secret references. If there is such a secret reference, it mutates the pod and injects the custom init, and that custom init replaces the secret references with the actual secrets from Vault. As you can see, a secret reference consists of the vault: prefix, the path to the secret, and then optionally a key if there is a specific field you want. The custom init replaces those references with the actual values.
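Here is a minimal sketch of what that looks like in a pod spec. The vault: reference format is the Bank-Vaults style described above; the annotation name is quoted from memory of the Bank-Vaults docs, and the paths, names, and image are made up, so check the current documentation before copying this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # assumed annotation: tells the webhook which Vault to talk to
        vault.security.banzaicloud.io/vault-addr: "https://vault:8200"
    spec:
      containers:
        - name: app
          image: ghcr.io/example/my-app:1.0
          env:
            # the webhook's injected init swaps this reference for the
            # real value from Vault before the application starts
            - name: AWS_SECRET_ACCESS_KEY
              value: "vault:secret/data/accounts/aws#AWS_SECRET_ACCESS_KEY"
```

The actual secret never appears in the manifest or in the Kubernetes API; it only exists in the process environment of the running container.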
Now, the only downside of this solution at the moment is that it doesn't detect changes. Once you've started the pod with a secret, if that secret changes in Vault, the solution doesn't detect it and won't trigger a workload reload yet. This is something we are working on right now, but currently it's not part of Bank-Vaults. What we did in the past with Bank-Vaults is basically trigger a workload reload every day, or on an interval just shorter than the interval at which secrets change.

Now obviously, as with ESO, this solution comes with its own risks and trade-offs. For example, if the secret store goes down, you can't start applications, because you don't have access to the secrets. Whereas with ESO, you have all the secrets there in the Kubernetes cluster, so even if the store goes down you can still schedule applications, you can still scale your applications, because the secrets are locally available in the cluster. In a similar spirit, you can actually start a Vault instance with Bank-Vaults within the cluster, synchronize the secrets from your central secret store to that cluster-local instance, and talk to the local instance instead of your central external secret store. That actually comes with a bunch of other advantages: you're not hammering your central secret store with requests, you just talk to the cluster-local instance, which is nice. And obviously, if the cluster-local instance goes down you'll have to fix that, but it still reduces the risk if something happens to the central secret store.

The other problem is the webhook itself. If the mutating webhook goes down, and it can very easily go down if you are not careful with its configuration, then you won't be able to schedule pods at all, depending on the failure policy you define in your mutating webhook. And even if you ignore the error, your application will still not be able to launch, because none of the secrets will have been injected into it. So if the webhook goes down, so does your ability to schedule applications. There's a list of best practices you can follow; I actually wrote a blog post about that. Make sure your webhook is highly available, and make sure it's spread across your cluster, so that if one node group goes down, for example, your webhook is still available.

And of course there are alternatives for this kind of operation as well, Kamus being one. It actually stores the secrets encrypted within the cluster, and an injected custom init binary inside the container talks to Kamus's central service running in the cluster and decrypts the secrets on the fly. That sounds somewhat better, because you can still schedule pods if something happens to your central secret store, but its development has died down over the last couple of years, so it's not really actively used in the community. The other solution that's coming up is the Secrets Store CSI Driver, but you can mostly use it for mounting files into containers; you can't really use it for environment variables. That's fine for some use cases, but if you want environment variables, you can't use it. And as far as I know, it can't detect changes yet either.

So, a couple of things about Bank-Vaults. We are trying to revive the community around it, and we are moving it to a new GitHub organization, so if anyone wants to contribute, it's going to be way easier. As I mentioned, we are going to work on the workload reload feature: we are going to detect secret changes and trigger workload rollouts on them. We also plan to support more providers; currently we only support HashiCorp Vault, but the secret injection can actually work with any kind of secret store out there. And we are also planning to add a new feature to the Bank-Vaults suite: secret synchronization. I mentioned earlier that you can run a cluster-local Vault instance if you want to and synchronize secrets into it; we don't really have tools for that at the moment, and we plan to give you some so you can do it more easily. And if you have any feature requests, or if you're interested in Bank-Vaults, please just talk to us, because we definitely want to hear from the community, and we want to hear how you use Bank-Vaults.

Now, I actually prepared a little demo. I'm not sure if we have... oh, we have time. So I just want to quickly show you how external-secrets works. Let's see. I have a local kind cluster here; it's running Vault inside it using the Vault operator, ESO is already installed, and the Reloader component is already installed. And I can show you that, hopefully I still have the port-forward running, yeah. So I can talk to Vault, I can grab the secret from it, and I already have the application deployed, I believe. So I have the application running, and the "world" here is coming from a secret; I can easily show that to you. As you can see, the environment variable "hello" is injected from a secret that is synchronized by ESO from Vault.

Now, if I go ahead and change the secret to something else, let's say "hello everyone", and I go back to the service, nothing has changed. But if I take a look at the Kubernetes secret, the secret has changed: previously it said "world", now it says "everyone". So what I can do here is manually trigger a rollout for the application. Obviously I have to restart the port-forward. I wish there were something that could do that automatically. And if I take a look at the output now (by the way, can you see the console? OK), it obviously uses the new secret.

Now let's enable Reloader. I can do that by annotating the deployment: I'm going to tell Reloader to start watching for any kind of secret changes and to trigger a rollout when something changes. So this is going to change the deployment, and I'm going to change the secret as well. If I take a look at the Kubernetes secret, it has changed, and I probably can't talk to the application right now because the port-forward is down, so I'm going to restart that. And if I go back to the application, it now shows the value I changed the secret to the last time, and I didn't do a manual rollout this time; it was all done by Reloader.
If we take a look at the deployment itself, we should see that the revision increased. It works basically the same way as kubectl rollout restart: it increases the revision count on the deployment.

All right. If you would like to see a more detailed demo, I have one for Bank-Vaults as well, but I don't think we have time for that. I'm happy to talk about it, though, and you can actually try this out yourself: it's on my GitHub, it's very simple, you just need some Kubernetes cluster, which can be kind, and I have all the instructions detailed in the README so you can easily try it. But I can also show you in the hallway after the presentation if you want.

So to kind of summarize all this: it may not be the best life advice or the key to happiness, but certainly in secret management it's always better to assume the worst. Prepare yourself for the worst possible scenario, and then you can actually sleep at night. Thank you very much, and I'm happy to answer any questions. There's a microphone over here, I believe.

Hi, I had a question about Kubernetes secrets itself; I want to get your viewpoint on this. I was told that Kubernetes secrets are not secure because I can just base64 decode them to see the real text. But I'm told that apparently now we can actually encrypt them with some salt, some hash?

OK, so the base64 encoding is not actually the problem, because it's just encoding. And if you think about the types of secrets you may store, for example certificates, you absolutely need the ability to encode those; that's why base64 encoding was added to secrets. The problem is that by default there is no encryption at rest in Kubernetes, so the default setting is plain text. And the reason for that is that, depending on the provider you are running on, you have to configure different KMS providers, for example, if you want to encrypt those with KMS. But you can configure your own encryption at rest; I think it's called EncryptionConfiguration. You can tell Kubernetes to encrypt the secrets before it stores them in etcd. So basically that's how you can protect your secrets from being stored anywhere in plain text. When you are running on a cloud provider, this is usually called envelope encryption, and you can use KMS: for example, if you are running on AWS and EKS, you can use AWS KMS to encrypt those secrets. Obviously you are still trusting AWS not to abuse your KMS solution and decrypt your secrets, but that's how you can make sure they are not stored in plain text in etcd.

Thank you.

Any other questions? Oops. All right, if you have any more questions, feel free to reach out to me. You can find me in the hallways here, or you can send me an email or find me on Twitter. And yeah, thank you very much for your attention.