So, we're the only ones between you and lunch, so we'll try to make this quick. The talk is how we learned to stop trusting registries and love signatures. The title was loosely based on Dr. Strangelove, which is why the title slide is a bit off. We also used DALL-E to generate all the images, just to make it a bit more intriguing and interesting. We hope you like it.

Right, so let's start by introducing ourselves. Hi, I'm Tyson Kamp. I'm a security architect at InfluxData. There are actually two Tyson Kamps on LinkedIn, and mine has a one on the end. The other, easier way to discriminate is that the other guy has hair. So if you get the hair guy, you've got the wrong guy.

Right, and I'm Wojciech Kotzian. I'm a platform engineer at InfluxData, and I was working with Tyson more on the technical aspect of this. These are my GitHub and LinkedIn profiles as well. There are multiple Wojciech Kotzians too. I don't know if we all have hair, but the link is there, so you should be able to find us.

Right, so let's quickly introduce InfluxData and InfluxDB. We come from a company that's been mostly known for InfluxDB, which is a platform for building time series applications. At its heart, it's an open source database built for keeping timestamped data: metrics captured at specific points in time. We also provide InfluxDB Cloud, a multi-tenant software-as-a-service solution built on Kubernetes across multiple clouds and multiple regions. And this is what we were adding the signing solution to.

So, why are we adding signatures? I'm going to really bite my tongue and just go with the slide, because I tend to ramble. We don't trust the registries; that's the answer.

Okay, so at a high level, what were we trying to do? Number one, the first bullet point, is the one I just said. Number two: I'm a dev in my bones, and when I hear "key rotation" it just makes me cringe. So I wanted it to be really quick, really trivial, and as easy as possible, and you'll see how we did that later. We also have many, many clusters, so this had to scale and be really simple. And the last one just repeats the second bullet point, so let's skip it.

Okay, so, non-goals. I did my best to solicit as many different groups around InfluxData as I could, because for one, I'd only been there for a year, and for another, I wanted as many opinions as I could get; I certainly can't think of everything. And when you start talking to people about security, as you probably all know, the what-if questions start happening really quickly: what if, what if, what if. So while gathering input, I kept reminding people: we don't trust the registry. That's the only problem we're trying to solve right now; we'll deal with the rest later. That was number one. And number two, we were getting pretty jazzed about all the stuff that Sigstore has to offer, so we had to contain ourselves and, like I just said, keep it to registry trust.

Okay, so, when are we done? When every single container we have is being signed and verified in the clusters, that's when we're done. And from my experience in domains outside of cyber and outside of engineering, I'm a big believer in training and being able to perform under stress.
And so, for our SRE team, the people who might have to handle the breach when it happens, not if it happens, I need to know that it's easy for them, really easy and quick, and that they're trained in doing it. So that's another definition of done. This can't be something that Wojciech and I think up that looks neat, with some diagrams, that we sock away somewhere, and then we move on and 18 months later we're gone. Somebody has to be able to handle this really quickly and really easily when there's a breach. So that was a big criterion. And number three, our dev teams need to be able to onboard really easily.

Okay, so we can stop switching the mics now. Thanks so much.

Right, so, a bit more detail about InfluxDB Cloud, just enough to understand the problem we were trying to solve. As I mentioned, it's Kubernetes based. It's partially stateful; well, I guess any database has to be stateful in some places, but most of the workloads are stateless microservices. For example, when we get an API call, there are a lot of microservices that parse it, fan it out, and then talk to the storage tier, and this is where we get to the interesting bit of InfluxData. The storage tier basically uses PVCs, volumes at the Kubernetes level, with the data also persisted to cloud-native object stores like S3 or Google Cloud Storage, depending on the cloud it's running on. We also use managed databases for the metadata, so anything like users and organizations we try to put in SQL. Some things live in other places as well, but most of it is in SQL. And there's Kafka with ZooKeeper for the write-ahead log: when someone starts writing data, it lands there first and then gets written to the proper storage tier.

And the thing we're really happy with is that, because this was built as cloud native from day one, it has a fully automated pipeline. We're using infrastructure as code, GitOps, proper CI/CD, all in place. I think this is similar to previous talks. We have the concept of phases, so anything we want to introduce first runs in the staging or testing environments that don't have any external customers. Once all the post-deployment tests pass and everything looks right, it gets deployed into production, so that's integrated into our pipelines as well. And again, we're on all three major clouds and in multiple regions, so there are a lot of deployments happening and a lot of things changing all the time. It's actively developed; I honestly don't know what the numbers are, but deployments are happening all the time, basically.

So, we have multiple sources of images, and they're treated differently. We have CI that builds the code we've written specifically for InfluxDB Cloud, and that's a lot of code; a lot of the images we run are that. Those images are relatively simple: they get built, they get signed, and we'll get to the details of how they're signed later. But that's just a subset of the images we use. InfluxData also provides open source code and open source-based containers. For example, we have an agent called Telegraf that we use to collect our own metrics from our own product, because it would be nice to use our own solutions ourselves.
And we use a lot of third-party container images. For example, we use Argo Workflows for some of the tests, so we want to use those images and be able to validate them. As I mentioned, for each of those types of images the signing is a bit different. For CI, we can just sign when we build a new image that's on the main branch and is going to be used in our clusters. People's PRs are not signed and probably never will be, unless there's a good reason. For all the other images, we're managing the updates ourselves, meaning that if we're updating Argo Workflows, we also need to update the argoexec image that actually runs some of the logic in cluster. Same with Telegraf, for example: we try to follow the major releases of Telegraf, but that doesn't mean we update the same day. So we sign them periodically, meaning we have a list of images, in specific charts, that we've decided to trust, and those get signed periodically.

Maybe I'll get into the details of this. So, this is a diagram of how it looks. We have a Kubernetes cluster that has all of the logic in it. We have Vault, which was also mentioned in an earlier talk; we're using key versions in Vault's transit engine, and we use that to generate the signatures. Then we built a service on top of Cosign's code. We had to extract some of it away from the CLI into our own library; we could probably contribute that back now that it's more or less stable. So we have code that can sign those images, and we built a small API service in front of it that uses authentication and requires some metadata about what's being signed. That service gets called from the jobs that periodically sign the images I mentioned, but also from CI.

And the reason we built it as a service is interesting. In most CI systems, it's possible to change the CI logic on a branch, and we wanted to avoid a common pitfall in CI: if CI holds the secrets for signing images and someone who really wants to do something nasty has access to the Git repository, they'd be able to change the CI logic. In our case, our service gets the metadata about the commit and the branch the build is on. It can cross-check with the Git repository whether what we're signing is actually on the main branch, and check that it's actually that digest by comparing against the OCI image registries.

For both of these cases, the signatures land in the image registries. One thing I didn't mention previously: for our own images, we just store the signatures alongside the images in the same registry. For everything else, like Telegraf, argoexec, and all the other images we don't control, we have a dedicated external signatures registry, which is something Cosign and policy-controller support. So we basically sign them and put just the signatures in another location. And regardless of whether a tool uses tags, SHA digests, or anything else, and whether the tool provides the ability to force digests to be used, we sign the very specific digests of those images and keep those signatures in our own registry, so we manage that.
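To make that check-then-sign idea concrete, here is a minimal sketch in Go of the kind of gatekeeping service described above. Everything in it is illustrative: the request shape and the three helper functions (isOnMainBranch, digestMatchesRegistry, signWithVault) are hypothetical stand-ins, not InfluxData's actual service, and a real implementation would go through Cosign's libraries and Vault's transit engine.

```go
// Sketch of a signing service that refuses to sign unless CI's claims check out.
// All names are hypothetical; this is not InfluxData's actual code.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// SignRequest carries the metadata CI has to present before we agree to sign.
type SignRequest struct {
	Image  string `json:"image"`  // e.g. "registry.example.com/app"
	Digest string `json:"digest"` // e.g. "sha256:..."
	Commit string `json:"commit"` // git SHA the image was built from
	Branch string `json:"branch"` // branch CI claims the commit is on
}

func handleSign(w http.ResponseWriter, r *http.Request) {
	var req SignRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	// Cross-check with the Git repository: is this commit really on main?
	if req.Branch != "main" || !isOnMainBranch(req.Commit) {
		http.Error(w, "refusing to sign: commit not on main", http.StatusForbidden)
		return
	}
	// Cross-check with the OCI registry: does this digest exist for the image?
	if !digestMatchesRegistry(req.Image, req.Digest) {
		http.Error(w, "refusing to sign: unknown digest", http.StatusForbidden)
		return
	}
	// Only now sign, via Vault's transit engine. The private key never leaves
	// Vault; the service only ever receives the signature back.
	sig, err := signWithVault(req.Image, req.Digest)
	if err != nil {
		http.Error(w, "signing failed", http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, sig)
}

// Fail-closed stubs standing in for the real checks and the Vault call.
func isOnMainBranch(commit string) bool        { return false } // query the Git hosting API
func digestMatchesRegistry(img, d string) bool { return false } // HEAD the manifest
func signWithVault(img, digest string) (string, error) {
	return "", fmt.Errorf("not implemented") // POST to Vault transit /sign
}

func main() {
	http.HandleFunc("/sign", handleSign)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point of the design is that the refusal logic lives server-side: changing CI logic on a branch gets an attacker nothing, because the service independently verifies what it is being asked to sign.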
And in every single cluster that runs our product, we deploy policy-controller. Policy-controller is a Sigstore project that allows verifying a lot of things on the images side, and the thing we use today is just image signatures. We also have what is more or less a cron job today, though it may become more complex in the future, that queries back the main cluster, the one managing the keys, for the list of public keys that should be used. That's deliberately an unauthenticated API, because these are public keys and there's no reason not to make them available to all the clusters. And we have ClusterImagePolicies, a policy-controller CRD that defines rules for which images can be used. So we keep rotating the keys, as I mentioned, and make it simple and transparent for everybody using the system. We have a thin client on the CI side that can be integrated, which calls our image signing service, and in each of the clusters we have systems that make sure all the clusters have an up-to-date set of public keys.

Because of that, we didn't have to set up any CA certificates. And because most of the images we use are internal, we didn't want to use the full solutions that include a transparency log and keyless signing. We chose frequent key rotation and plain key-based signatures, with Vault providing them. A nice upside is that the private key never leaves Vault; it's only ever used for signing, so it's relatively secure.

As for adding signatures: when Tyson and I were tasked with this, we didn't know a lot about Sigstore at that point, so we started by looking at the available tooling. The obvious choice was Cosign, because that seemed well-established and standard. We looked at ways of validating signatures, like Connaisseur, but we decided to go with policy-controller because it has much more powerful features that we may want to use, and it's also part of Sigstore, so it seemed like something we'd want to be using. We decided not to go with the full Fulcio and Rekor setup because we wanted a simple solution based on frequent key rotation, and we also didn't want to make some of the metadata about the images public; they're private, we don't provide those images to customers, and we run our own product in our own clusters, so it just seemed easier. We started getting involved in Sigstore, and we hope to get more involved. My goal is to contribute something back as soon as possible, maybe even in the next weeks.

So we started small, deploying this alongside and experimenting manually. Then we switched to Vault, just in dev mode, but it was enough to play with key rotation and all the aspects of the system we wanted to build. Then we moved to scaling out: we used the Vault we already use for a lot of other things in our infrastructure and switched to a proper production Vault setup. We spent some time thinking about the trust boundaries, and as I mentioned, we decided to build a dedicated image signing service. It uses Cosign's logic internally; it's just exposed as an endpoint we can call from CI. One reason, again, is that we can do additional validation of whether we're actually signing the image we want to be running in production. The other is that CI systems don't have access to the private keys at all; they just have credentials to talk to the image signing service. So key rotation doesn't mean we have to rotate secrets in all the CIs; we just rotate the credentials used to talk to the image signing service. I'll say more later about how we'd handle that in case of a security incident. And we also have additional logic inside the image signing service for checking whether an image should be signed at all: if the CI for one type of image calls in and says "I want to sign a different type of image," the image signing service may just refuse, and that provides an additional layer of security.
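Going back to the key-distribution piece mentioned above: the cron job that pulls the currently valid public keys into each cluster can be very small. Here is a hedged sketch; the /v1/public-keys endpoint and the file-based handoff are assumptions for illustration, and a real job would feed the keys into the ClusterImagePolicy objects that policy-controller reads.

```go
// Minimal sketch of a per-cluster key-sync job. Endpoint and paths are
// hypothetical, not InfluxData's actual setup.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func syncKeys(url, outPath string) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url) // unauthenticated: these are public keys
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	pem, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	// Atomically replace the key set on every run; deprecated keys simply
	// stop appearing in the response and drop out of the verification policy.
	tmp := outPath + ".tmp"
	if err := os.WriteFile(tmp, pem, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, outPath)
}

func main() {
	if err := syncKeys("https://keys.example.internal/v1/public-keys", "trusted-keys.pem"); err != nil {
		log.Fatal(err)
	}
}
```

Because the job overwrites the full set each run, rotation and deprecation need no coordination with the clusters: they converge on the next tick.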
In terms of policy-controller, we had to do some tweaks and some settings changes, and we also opened some issues; like I said, we're hoping to contribute back to policy-controller at some point. It was mostly about trimming down what policy-controller was doing, so that we don't end up in a situation where policy-controller is fighting us and making it more difficult to keep managing the cluster. We also have a plan to wind it down in case of any issues. We haven't run into any, but it's always good to have a plan for this; people who carry pagers want to know that information.

I got it. Can you hear me? Okay. So, like I said, key rotation is really important. So important, and I wanted it so easy, that nobody does it. Cron does it. It's handled by config variables, it doesn't even have to happen by hand, and it's happening all the time. We basically have dials to say how quickly you want to deprecate the keys and how often you want to create new ones. You can create them twice a day, or once every two weeks. You can say throw them away after a month, or after a week. It's just in GitOps. It's that easy. That's how easy I want my life to be.
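Those dials boil down to two durations. Here is a minimal sketch of the decision logic under assumed names (RotationConfig, shouldRotate, validKeys are all hypothetical); the actual key material would live in Vault as transit key versions, with only the public halves distributed to clusters.

```go
// Sketch of config-driven rotation dials: how often to mint a new key and
// how long old keys stay valid. Names are illustrative.
package main

import (
	"fmt"
	"time"
)

type RotationConfig struct {
	CreateEvery    time.Duration // e.g. 12h for twice a day, 336h for every two weeks
	DeprecateAfter time.Duration // e.g. 168h to throw keys away after a week
}

// shouldRotate returns true when the newest key version is older than the
// configured creation interval; cron evaluates this on every run.
func shouldRotate(newestKey time.Time, cfg RotationConfig, now time.Time) bool {
	return now.Sub(newestKey) >= cfg.CreateEvery
}

// validKeys filters key-version creation times down to the versions whose
// public keys should still be distributed to clusters for verification.
func validKeys(created []time.Time, cfg RotationConfig, now time.Time) []time.Time {
	var out []time.Time
	for _, t := range created {
		if now.Sub(t) < cfg.DeprecateAfter {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	cfg := RotationConfig{CreateEvery: 24 * time.Hour, DeprecateAfter: 7 * 24 * time.Hour}
	now := time.Now()
	keys := []time.Time{now.Add(-10 * 24 * time.Hour), now.Add(-2 * 24 * time.Hour)}
	fmt.Println("rotate now?", shouldRotate(keys[1], cfg, now)) // true: newest key is 2 days old
	fmt.Println("still valid:", len(validKeys(keys, cfg, now))) // 1: the 10-day-old key is dropped
}
```

Keeping both durations in GitOps-managed config is what makes the "nobody does rotation, cron does" story work: changing the cadence is a one-line commit.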
Integration with the automation and processes: that was some heavy lifting. Wojciech mostly handled that, but I'll just say, make sure you set some time aside for it; that was difficult. And again, just to highlight, I tried to work as cross-functionally as possible so everybody's cool with this, especially the people who carry the pagers. I don't want to see any concern on their faces when I'm rolling this out, because usable security is really important to me; people find workarounds really fast if something you did seems like another hoop they have to jump through. So, working with other teams: the dev teams, but really the SRE team and the people downstream from me. I don't want them to feel like I'm just giving them another thing to do, because that's often how security is viewed, right?

Handling incidents: since we're so close to lunch, I'll keep it brief. I looked at this slide this morning and then reviewed my own documentation, because I can't remember anything from two weeks ago, and here it is. When there's an incident, you check out a file, change a flag from false to true, and check it in. That change issues a new signing key and deprecates all the old ones, all the old public keys used to validate the signatures. You watch Argo CD for everything to redeploy, then check out the file, change the flag back, and check it in. That's it. That's what I wanted: a super easy solution for the SRE team. I can explain it to our CFO and tell them why we're doing it, and they love to hear that kind of stuff. So that was a big concern of mine.

So that's about it. Is that the last slide? It is, yep. I think we're done ahead of time, so we'll take questions if anybody has any. Thank you.

[Question from the audience.] Using what for key rotation? No. It's just more than we needed to do for our scenario. It's just more stuff, you know what I mean? Any introduction of complexity has to have a really huge upside. We could, and we may do that, but right now this involves setting up almost nothing else. It's got to be really easy, and it can't leave a big footprint, because I've seen a lot of people come and go and build systems, and then they're gone and we're trying to figure out how those systems work. There are a lot of smart people and a lot of stuff being done at InfluxData, so we're really sensitive to exactly why you'd want to do something. We have a lot of those tug-of-wars: everything's got to be really justifiable, any addition in complexity. And I just finished a book about the collapse of complex societies. Complexity breeds fragility. I'm not saying it doesn't have a reason; I'm just saying we don't have one right now.

And just to share another interesting issue we run into sometimes. We rotate the keys, but the validity of the keys is relatively long, because we sometimes see engineers having to roll a specific service back to a previous version, for example because of incidents. I wouldn't say it happens on a daily basis, but we've seen cases where, say, someone introduced a refactoring of a piece of code, and then for a specific customer, on a specific cluster that's used in a non-standard way, it generates issues. So there's a relatively easy mechanism for what we call pinning an image: instead of just rolling to the latest one, for a specific microservice in a specific location, people pin a version. I suspect pinning by digest is also possible, but that was one of the things we didn't want to make more difficult. There are also procedures in place for exceptions, like if you need to use code that's on a branch because you have a fix for an incident; there are mechanisms for that, they're just not as simple. Rolling back to an image from two days ago is something you want to allow; rolling back to an image from six months ago is something you want to prevent, because that one may have vulnerabilities in it.

Yeah, and that's bookended by the replay attack scenario, where we want to keep the window of accepted public keys as small as possible. The replay-attack view says, well, why don't you make it an hour, so a two-hour-old key is deprecated? We haven't really figured out where the middle ground is yet.

Cool, that's it. All right, let's eat.