OK, so yeah, hello everyone, and welcome to Cloud Native Live, where we dive into the code behind cloud native. I am Mohamad Shari, a CNCF ambassador, and I will be your host tonight. Every week, we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions. In today's session, I'm stoked to introduce Andy and Stevie, who will walk us through the whole session. This is an official livestream of the CNCF, and as such, it's subject to the CNCF Code of Conduct. So please do not add anything to the chat or questions that would be in violation of the Code of Conduct; basically, please be respectful of all of your fellow participants and presenters. With that, I will hand it over to Stevie and Andy and let's kick off the session. OK, so yeah. Hey, Andy and Stevie. How are you? Doing well, how are you? Yeah, I'm fine. So I think you two can introduce yourselves, starting with Stevie, and then Andy, and we can start the session, right? Yeah, Stevie, would you like to introduce yourself? Yeah, so I'm Stevie. I am an SRE tech lead here at Fairwinds, and I have been working in tech for longer than I will admit, so that's to hide my age. But I've been working in this industry for a bit, starting off with desktop support, going through system admin stuff, network admin stuff, all the way up to now, where I deal primarily with Kubernetes, containers, microservices, and the things that love them. So that's me. How about you, Andy? Well, I'm Andy. I think we've covered that. I'm the CTO at Fairwinds. I've been working in Kubernetes and cloud-native technologies for about eight years now, a reformed sysadmin who moved into the cloud-native space. I don't have to rack servers anymore, and I'm not really complaining about that.
And I'm also an author and maintainer of several open-source projects that Fairwinds makes: Pluto, Goldilocks, Nova, Polaris. And I spend my time running Kubernetes clusters and talking about the different ways that we solve these problems. So yeah, you guys can start the session right now. All right. Stevie? Yeah, so today we're going to be talking about managing infrastructure as code, or IaC, at scale, and discussing how we tackle, here at Fairwinds, some of the challenges that come along with that. So we're going to be talking about how you can run Terraform at scale, about using Argo CD for a GitOps flow, about using Vault for a secret store, and a lot of other stuff. So Andy is going to kick off the demo now. Cool. Yeah. So we mentioned that we both work at Fairwinds, and what Fairwinds does, and has been doing for about eight years now, is run Kubernetes clusters for other people. But we've done this in maybe not a unique way; well, it was unique eight years ago, but it's not unique anymore. We run customers' infrastructure inside of their own infrastructure. And we try to build it in a way that is portable to them, so that they can use it after they move on from us. And that produces some interesting challenges in how we manage Kubernetes infrastructure, because we're managing across three different cloud providers and dozens of different cloud accounts that aren't owned by the same people. So there's no central governance over those accounts. And over the years, we've had to solve some of our problems in interesting ways because of that structure. So when we looked at how we were going to continue to manage this infrastructure, improve the way that we manage it, and build more automation around it, we had a sort of unique set of requirements that we put on our infrastructure management.
And so we wanted to share a little bit of that today, because what it does is bring together a whole bunch of different cloud native open source technologies into our management workflow. So I think we'll go ahead and share my screen. I've got up here just a list of requirements that we made for ourselves when building out this infrastructure. So first of all, I already mentioned this, we have to authenticate... Sorry, the screen share isn't showing yet. Yes, I hadn't added that yet. Can you see it now? All right, cool. This is the only slide I will show today, I promise. It's just easier to put bullet points into a slide. So we have to authenticate into multiple cloud accounts across multiple cloud providers. And actually, this is a sort of interesting requirement: there have been times where we've worked inside of AWS GovCloud as well. So we really have to have a super flexible authentication mechanism that works everywhere. Second, we wanna be able to automate Terraform using a Git-based workflow, whatever that looks like. We can dive into what that looks like in a minute, but we had the requirement to automate our Terraform based on pull requests, so that we can do code review and we can see the plans in CI, and we don't have to be running Terraform on individual engineers' machines. The third requirement: we're big fans of open source, so wherever possible and reasonable, we'd like to use open source tools over commercial variants. This obviously got a little bit interesting when we were talking about automating Terraform, and changing how we automate Terraform, right around the time that HashiCorp did their licensing change. So we had to deal with that from a business perspective. But in general, we just try to keep it to open source wherever possible.
And also, that helps us to remain as Kubernetes... sorry, as cloud provider agnostic as we can. By using open source tools that are Kubernetes native across the different cloud providers, we avoid having too many differences between our different clusters and the different cloud providers. Obviously you have to have some, but we try to avoid that. Next requirement, and this one introduces more problems than probably any other, but it is an interesting one: instead of doing centralized infrastructure as code that pushes out to all of our customers, we need to be able to manage individual infrastructure as code repositories that can be shared and handed over to individual clients. So we have customers that view their infrastructure as code repositories, we have customers that have left and taken their infrastructure as code repositories with them, and we have customers that audit the infrastructure as code repository and come back to us with audit results and things like that. And so we needed to be able to make this sort of multi-repo, which is kind of interesting. If you were to, say, inside of a single company, build a platform that managed a fleet of clusters, you probably wouldn't spit out a new infrastructure as code repository for every single cluster that you build, or every single subunit of your infrastructure as code, because that would get really messy really fast. So this was an interesting requirement. We have to have secure secrets management; that's obviously a requirement. And then we wanted to use a GitOps model for managing all of the things above the Kubernetes cluster level. So that's our base set of requirements. Did I miss anything, Stevie? No. Yeah, I think the interesting thing to point out, and I think you started off with this, is that some of our requirements are a little unique to us because of what we do.
And some of these things are challenges that you might not necessarily come across if you are working in a single organization. You'll probably have multiple clusters and things like that, and you might even be spread across different clouds. But some of these things, like the IaC repo proliferation, won't necessarily be a challenge for you, while a lot of the other things would be useful in just streamlining your own workflow as your infrastructure expands. Yeah, definitely, definitely. Cool, well, let's talk now a little bit about how we solve these problems and all the different tools that we put together. So I love a good visualization; I can't think without one most of the time. As Stevie can attest, half the time, in order to communicate, I just have to open a diagram and start drawing. Oh, nice, and it's great, it's super useful. This is sort of a high-level overview of all the different tools that we use to manage the different clusters under our purview. So on the left here, we have our own internal infrastructure. So if I maybe highlight that, I don't know, orange? Why not, orange is good. So that's our own internal infrastructure, and this is an internal Kubernetes cluster that we manage for running our own internal tools. And then here we have an infrastructure as code repository. This would be one client. So we might have 12, 15, 20 of these, however many clients we have at the time. And within that is an opinionated file structure that dictates where everything lives. So we organize things into inventories, which roughly map to cloud accounts or individual clusters. And then within that, we have Terraform, we have the GitOps manifests, and we have all of that good stuff. And then in order to run the Terraform, we've settled on Atlantis, but we're not running one single Atlantis. And that's why it says Atlanti here; that's my fun made-up plural for Atlantises.
In order to segment the cloud credentials that are used to run the Terraform, as well as the execution environments and the permissions of Atlantis, we decided on actually running one Atlantis per customer. And so that Atlantis has access to the credentials only for that customer, and has access to only the code for that customer. And so we've got half a dozen Atlantis instances running that each manage a different customer; however many, I should stop saying numbers, I don't know how many there are. But those run the Terraform that creates the customer environments. So they manage the VPC, the IAM roles, and all of that good stuff that you need underneath your Kubernetes cluster. So then once we have a Kubernetes cluster, an EKS, GKE, or AKS cluster that's created by Terraform, we need to install stuff on it. So we need add-ons, like external-dns and metrics-server and all of those different things. And to do that, we use Argo CD. So in each client cluster, we install Argo CD, and it's able to pull manifests from that infrastructure as code repository. And the huge problem with all of this is secrets management. How do we get secrets into Argo CD? How do we get secrets into Atlantis so that it can authenticate? How do we authenticate with each of the different cloud providers in an automated fashion? And really the best tool for that job is Vault. So Vault contains a ton of different stuff for us. It's got the credentials for a user that's able to assume roles and hand out assumed roles to both Atlantis and our SREs for each of the cloud environments. It contains any secrets that need to be installed into client clusters, like API keys and things like that. And so that's sort of the heart of all the secrets management, and it really enables us to automate all of these things.
And so we use the external secrets operator here, which is a really cool project that pulls secrets out of Vault, so that we don't have to keep secrets in our infrastructure as code repository. And we pull those manifests again via Argo CD. And yeah, that's kind of a high-level overview. Yeah, and where the SRE sits in all of this: when changes need to happen, the first place is the infrastructure as code repo, in the Terraform. And then we have a whole system of checks and balances that it goes through, for PR reviews and stuff like that. But that gives the SRE a single place to pretty much make all the changes that they need to for client infrastructure, right? Exactly, exactly. So most of what the SRE does is via pull requests, using our internal tooling that manages the templates that go into this infrastructure as code repository, rather than using cloud credentials and kubectl or the CLI. They have very limited permissions by default in the CLI, so that they can do investigation and troubleshooting and things like that. And then they're able to escalate roles, and because we're using Vault to do that, with just a sort of wrapper around Vault to provide that access, we're able to control the roles that they have, and we're also able to audit log all the access that we have into our customer environments through Vault. So Vault's kind of the beating heart of this whole thing really, if we're honest; it's a huge, huge part of it. So now that I've shown the overview, let's do a little bit more hands-on and show what this looks like. What does Atlantis look like running in your pull request? How do we make changes to Terraform? How does the external secrets operator work in order to get the secrets in, and then how are we syncing things with Argo CD?
So I'll hop over here into my terminal. I've got one of our clusters open here, and you'll notice sort of this opinionated folder structure that I was talking about, where we keep stuff. So I have this sandbox cluster. This would be like any client cluster, but this is one of our own internal ones that we use for demos. And first I'm in the Terraform directory. So if we look at really any of these Terraform files, what we're gonna find is that these are actually automatically generated and I shouldn't be editing them. We're not gonna talk too much about our templating, but we do have a centralized templating engine that spits out all of this Terraform, or most of it, right? But there may be things that I need to add on top of it for a particular client or something like that. And so I can go into this infrastructure as code repository and I can make a pull request on it. And just to answer a quick question from the chat: we are using the open source Vault, not the enterprise Vault, in AWS with an AWS backend for it; we're not keeping the data for Vault in the cluster. So I have this module here that adds IAM permissions for a developer team. It really doesn't matter what I'm changing; we just wanna see that something is changing. So I've made this change, and I'm gonna add it: "add dev team two for demo." I'm gonna make a pull request on this repo. Indeed, it is HashiCorp Vault, to the latest question asker. All right, we added that. Let's hope it doesn't yell at me. All right, so I've made a pull request on our infrastructure repo, and if we go here to our Fairwinds Ops infrastructure, we see our pull request, and hopefully we'll see Atlantis do some stuff. So we are using the vault-k8s plugin to populate the secrets in the Atlantis pod, and that's actually keeping those credentials up to date. It then needs to copy the current Vault token and AWS credentials into the working directory in order to do that.
And we should see an Atlantis plan. So Atlantis has commented here and said, hey, I'm gonna do some stuff. Do we wanna tackle that last question? I think it's fine, yeah. Okay, yeah, so we do have an internally developed templating tool that creates those files for us. It does a template generation. Yeah, I can elaborate a little bit more on that, because there's not really too much secret sauce here, right? It's called Terrafish, and there are a lot of weird historical reasons for that. We are engineers at heart, and engineers like to name things in funny ways, and that history goes way, way back to when I started at this company six years ago. But it is called Terrafish, and it's called Terrafish because it is based on Terraform. So we're actually using Terraform templating to generate more Terraform. And as we'll see in a few minutes, we're using it to generate manifests for the GitOps repo that we manage every client infrastructure with. So Terrafish is the internal name for the tool, but it is really just a templating engine at heart. That's probably as far as I'll go explaining that. So anyway, we see Atlantis has planned here. Stevie, I'm gonna need your approval on this, because we're using a Git-based workflow. I made this pull request, and I could tell Atlantis to apply it, but it's gonna yell at me and say, you can't apply it because it's not approved. And luckily, Stevie here is one of the approvers on this repository, and she's gonna go approve that PR for me. And so without ever leaving my Git workflow here, I can make changes to infrastructure that get automatically planned, approved, applied, and all of that good stuff. I can make it a little bit bigger to try and help with the fuzziness for folks. Streaming video is a hard problem; it's something I worked on in my last job, and still the issue is there, I guess. I think a little bit more zoom would help. All right.
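For readers, an approval gate like the one just demonstrated is typically expressed in Atlantis's server-side repo config; a minimal sketch, with a hypothetical repo name, might look like:

```yaml
# Server-side Atlantis repo config (repos.yaml); the repo id is hypothetical
repos:
  - id: github.com/example-org/client-infrastructure
    # Block "atlantis apply" until the PR is approved and mergeable,
    # letting GitHub branch protections and CODEOWNERS do the gating
    apply_requirements: [approved, mergeable]
```

With this in place, an `atlantis apply` comment on an unapproved PR is rejected, which is the behavior shown in the demo.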
Can we dive in just a little bit more into this particular workflow? Because what we're showing is that Atlantis automatically runs a plan when you push your commit or when you open your PR. But also, can anybody just go in and run atlantis apply on these? Or have we locked that down? I'm asking because I believe we've locked that down; in our internal workflow, so maybe not in this demo, we have it set up in such a way that you can't run an atlantis apply without the PR being approved, right? Correct. Yeah. Yeah. So our default workflow does require the PR to be mergeable, which allows us to use GitHub's RBAC controls to control who can apply, or when it can be applied. Technically, after it's been approved, anybody could come in and type atlantis apply and it would work, but it still has to be approved. And we're using GitHub's code owners functionality to do that. So you can see here, I don't know why it doesn't have the badge next to Stevie's name, but we have code owners assigned for this repository and one of them has to approve. So we're using built-in branch protections from GitHub to do that. The other thing we're getting, that I didn't mention, was this Atlantis UI. We're actually using a subpath off the backend of this ingress to give us access to the Atlantis UI. This is all behind an OAuth proxy tied to our internal authentication, but you can actually go view the run as it happens in this UI if you don't want to view it in the PR. And so I have applied it: I said atlantis apply, it ran the apply, and then it went ahead and merged the PR. So we've got automated Terraform running for all of our customers. And then, since we're centrally templating out all of the Terraform files for most things, if I want to make a change across all the infrastructure, I just go make a change to that central templating engine, and it actually creates PRs on all of these repos.
And then they can be assigned out to individual teams or SREs, who can go in and review them, approve them, and apply them. So that's the Terraform side of things. I don't think I see any new questions. Anything to add on the Terraform side of stuff? No, no, I don't think so. It's just a very easy way to, if you're using one of the upstream modules for EKS or something like that, and you need to upgrade your EKS version across a bunch of different clusters, or in our case clients, then there's one upstream place where we just change a module value and then create PRs across all of those repos. And that makes it pretty simple to make those changes, as opposed to checking out each different repo. Which is how we used to do it, for the record. So we're getting better. We're getting better, we're getting better. All right, so now let's go talk about the Argo CD side of things and the GitOps side of things. So in this same repository, if I go up a directory, we go into the clusters directory; we have some historical reasons for this nesting, don't worry about it. But if we go into the resources for this cluster, we are going to see a whole pile of YAML, which is not interesting; everything's a whole pile of YAML underneath, right? It's just YAML all the way down. But here's where we start talking about add-ons. And so we have a whole bunch of different add-ons installed in this cluster. We've got KEDA, we've got cert-manager, we've got the EBS CSI driver, because we need that for an EKS cluster. We've got VPA running, we've got that OAuth2 proxy running. And what we used to do: we have an open source tool called Reckoner. And Reckoner allows you to define a whole bunch of Helm charts in one place. So we've got a bunch of Helm releases, we specify their values files, we specify versions.
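A Reckoner course file along those lines might look roughly like this (the chart, version, and values are illustrative, not the actual Fairwinds config):

```yaml
# course.yml: declare Helm releases, repos, versions, and values in one place
namespace: kube-system
repositories:
  fairwinds-stable:
    url: https://charts.fairwinds.com/stable
charts:
  vpa:
    chart: vpa
    repository: fairwinds-stable
    version: "1.7.2"          # pinned version shows up in PR diffs
    namespace: vpa
    values:
      recommender:
        enabled: true
```

Running Reckoner against this file installs or templates every release it declares, so one file governs all the add-ons for a cluster.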
And then the latest versions of Reckoner will actually spit out, or Helm template, all of the manifests into a directory path. And so if we look here in the manifests directory, we're going to see manifests/keda, and then all of the templated-out manifests from a helm template command in this directory. Now, I know Argo CD applications can install Helm charts, but in a true GitOps fashion, I want to see, in my Git pull request, the actual YAML files that are going to get applied, not just a change in version in a Helm chart. Because we all know a change in version in a Helm chart can obfuscate a lot of changes in the underlying YAML, and we don't necessarily want to go read all the Helm templates and see what those changes are. And so we prefer to template out all the YAML into a directory. Like I said, we used to do this with Reckoner, so we could do a reckoner template and that would spit out all the things in that course file, but we're actually moving to that tool that I mentioned, the internal templating engine. And so we are doing a helm template and then creating pull requests with all of those files. So if I go to manifests/keda and look at the deployment for the KEDA operator... hmm, is KEDA not installed this way? Maybe not KEDA, hang on. One of these is managed with the new tool, Terrafish. So anyway, we'll see that file header. It doesn't really matter how the manifests get there; the key is that we are pushing the manifests into the directories and making those changes. And so we can do our same pull request workflow, and we can see the changes that are gonna happen. And then again, we have Argo CD running in each cluster. So this is the Argo CD UI for this particular sandbox cluster. And if I go to, let's say, the external secrets operator here, I can see that Argo is syncing that. So it is a pull model for Argo CD.
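An Argo CD Application that syncs one of these pre-templated directories might look roughly like this (the repo URL and paths are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
  namespace: argocd
spec:
  project: default
  source:
    # A plain-YAML directory rendered ahead of time by helm template,
    # so the PR diff shows the actual manifests that will be applied
    repoURL: https://github.com/example-org/client-infrastructure.git
    targetRevision: main
    path: inventory/sandbox/resources/manifests/keda
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: keda
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Because the source is a directory of rendered YAML rather than a Helm chart, every change lands as a reviewable file diff before Argo CD syncs it.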
We're running it in each of the individual clusters, which means we only have to give Argo CD a limited GitHub token that has access to the infrastructure repository for that specific client, and even that specific cluster if we wanted. And then it pulls down all of these manifests. So if we go into our infrastructure as code repository, maybe it's easier to see it here: inventory/sandbox, resources, manifests, external-secrets, and actually this one is managed by the tool. So we'll see all of those here. So Argo CD is pulling them from here. And then, go ahead. Just wanna say real quick, there was a question about whether we're using an on-prem cluster to run Terraform instead of Terraform Cloud, and I wanted to answer that before it scrolled off the screen. Essentially, we're not doing on-prem, we're in the cloud, but yes, we are sort of hosting our own modules for various things; we have, like we said, a templating engine that's Terraform, we're managing the modules that we use locally, and we're managing Terraform state in S3. And along with that, I think we had another question: is this a push model or a pull model? Yes, I saw that question; it is definitely a pull model. So we don't do... So this is sort of a philosophical question about credentials and GitOps and how you wanna run GitOps. I know a lot of folks do centralized Argo CD implementations, and I don't think there's anything wrong with that model, except in our case, because we're managing across multiple clients, we need to segregate our access as much as possible. And so we could do a centralized Argo CD on our side that pushes out to all of our customers, but then, A, we wouldn't be able to give them access to the UI; well, we could, but we'd have to do a lot of complex RBAC, it'd be interesting. But then we would be responsible for a centralized UI.
And then that Argo instance would have access to all of the client clusters, which makes it a large target. If we keep all of the credentials for all the customers in Vault, that keeps them in probably the most secure place that we can. And then the Argo CD implementations don't need any access, because they're running in the client cluster, they're pulling from our GitHub repository, and they only need a limited GitHub token that has access to the code. So I do think it actually is a little bit cleaner security model to do the pull model in our particular case. Am I missing anything in that reasoning, Stevie? I'm so sorry, what did you say? Did I miss anything in that sort of reasoning that we have? No, no, I think that is what we have discussed. Like I said, we have a unique situation where we have a bunch of different clients, and so we want to make sure that we're not bleeding over, and that we're segregating the environments as much as possible to prevent any sort of issues. Cool, so that covers most of the infrastructure as code except for secrets. We do have to be able to get secrets into the cluster, and we don't want to keep them in our Git repos, because that's a terrible idea. So I did talk a little bit about the external secrets operator; let's take a look at how that works. So it's a fairly standard default install of the external secrets operator, but we have to give the client cluster access to a path in Vault so that we have a place to keep secrets. And so, if I can remember the exact CRD name, because it just escaped me: secret store. Secret store. So if we get secret stores, and I'll actually take the cluster secret store in this particular case. The difference being the same delineation as between a bunch of other things, like cluster roles and roles, or cluster role bindings and role bindings: it's a cluster-level secret store. Precisely, precisely.
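A ClusterSecretStore along the lines being described might look roughly like this (the Vault address, mount path, role ID, and secret names are all hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  # Only namespaces we manage may reference this store
  conditions:
    - namespaces:
        - argocd
        - kube-system
  provider:
    vault:
      server: https://vault.example.com
      path: clients-sandbox       # per-client KV v2 mount
      version: v2
      auth:
        appRole:
          path: approle
          roleId: example-role-id # per-cluster AppRole
          secretRef:
            name: vault-approle
            key: secret-id
            namespace: argocd    # ClusterSecretStore refs need an explicit namespace
```

The `conditions` block is what limits which namespaces can use the store, matching the segregation goals discussed here.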
So if we take a look at the cluster secret store: for a while, we were using the Argo CD Vault plugin to do this, and we're moving towards the external secrets operator because of some issues with the Argo CD Vault plugin, and also just to separate concerns between Argo CD and our secrets management. So we're able to specify here our Vault backend and the namespaces that are allowed to use this cluster secret store. We limit this to essentially the namespaces that we manage in the cluster, because we don't manage all of our customers' secrets; we just manage our secrets and the ones needed for infrastructure. And so that just pulls the AppRole secret ID from, well, a secret; that's a lot of secrets, and it gets a little circular there. And that's the existing one that we're using for the Vault plugin. Like, you just reuse the... Yep, just reuse the same secret. So in order to migrate from the Vault plugin to the external secrets operator, I was able to just reuse the same secret in the Argo CD namespace, but this could be any Kubernetes secret that has the secret ID. And this particular client's Vault AppRole only has access to a very specific path of secrets, which is important because we're trying to separate things, and we don't want one client being able to access another client's secrets, right? So we have a specific role for each client. So we create this cluster secret store, which grants access to Vault, and then we're able to use that in our manifests. So we can create a manifest called an ExternalSecret, and what that's going to do is reference a specific path in Vault, pull out different secrets, and then use that to spit out a Kubernetes secret. It might actually be easiest to view this in Argo CD, now that I think about it. But essentially, you know, we say: here's a Spotinst token, and we're going to put that in this secret, and we're going to put it in the key token.
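As a rough sketch, an ExternalSecret like the one being described could look like this (names and Vault paths are hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: spotinst
  namespace: kube-system
spec:
  refreshInterval: 1h        # the controller re-checks Vault on this interval
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: spotinst           # the Kubernetes Secret that gets created
  data:
    - secretKey: token
      remoteRef:
        key: spotinst        # path within the client's Vault KV mount
        property: token
    - secretKey: account
      remoteRef:
        key: spotinst
        property: account
```

When a value is rotated in Vault, the operator rewrites the target Secret on the next refresh, with no change to the manifest itself.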
Pretty simple and straightforward. So if we go back to our applications and we look at this; the application itself is not super important. But if we look, we have our ExternalSecret that's applied by Argo CD, and then it owns the actual Kubernetes secret that was generated from it. You can see it has the two values, account and token, and it's got some annotations to say that it was reconciled by external secrets. And so now, if I need to make changes to this token, I can go into Vault, rotate it, and the external secrets operator will update this secret with the new value. And so we're doing GitOps for everything except secret values. And that's a really cool feature of this, because I ran into this with the plugin; as Andy said, we're still using the plugin in some places, and we're working on migrating off. Because of the way the Vault plugin works, it's like a path placeholder, and so when you need to update a value in the Vault secret store, you then also have to either wait 24 hours or do a hard refresh of Argo to get it to actually pick it up, because the manifest itself hasn't changed. But the way this works, as I understand it, is that there's a controller that checks every so often for an update in that secret value, and it'll update it for you. So that's a little more seamless for when you need to make an update. Yeah, yeah, definitely. And actually, one of the biggest problems we had with the Vault plugin was that if it lost access to Vault for any reason, it would sort of just stop templating that manifest, and so it would actually delete the secret out of the cluster until it got access back to Vault. And so that was a huge problem. So to answer the last question that came in: we're talking about migrating from the Argo CD Vault plugin, which is an Argo CD templating plugin that accesses Vault, to the external secrets operator.
And so if we go look back at our diagram here, we'll see we have two things in client clusters that are pulling from our infrastructure: one of them is Argo CD, and the other is the external secrets operator, which is able to populate secrets for things like, in this case, the Spotinst controller, or whatever else we might be running in the cluster. I think we have a couple more questions. How do you accomplish separating role ID and secret ID into separate paths in your app roles? That's a good question, and it's a lot of Vault policy and Vault configuration. So we actually generate a separate AppRole for each client cluster. And then there is a policy that grants that AppRole access to the individual client's secrets, and we actually have a separate KV store for every client. And so that cluster-based AppRole has access only to the subpath of that client, via Vault policy. Did I say all that right? Yeah. Okay. And that's really indicative of how complex this whole system can be. So I feel like one of the things that we should probably talk about are some of the challenges that we have come across trying to set all of these different pieces up to work together, because it has been a labor of love over a whole lot of time. Yeah, yeah, definitely. Vault configuration is probably one of the trickiest pieces of all of this, right? Running Terraform, automating Terraform: fairly straightforward. Argo CD, pulling manifests out of a directory: fairly straightforward. Templating Helm charts: fairly straightforward. But when you start trying to work secrets and credentials into all of that, it gets tricky. One of the challenges, actually, and this isn't shown here, is that all of our Terraform state is kept in an S3 bucket inside our infrastructure, right? So there's an S3 bucket that has all of our Terraform state, and it lives inside this AWS account, and each individual Atlantis needs access to just the state for the client that it manages.
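A per-client Vault policy of the kind just described might look roughly like this (the mount name is hypothetical; KV v2 mounts nest reads under data/ and listing under metadata/):

```hcl
# Grants one client cluster's AppRole read access to that client's KV v2 mount only
path "clients-sandbox/data/*" {
  capabilities = ["read"]
}

# Listing secrets on a KV v2 mount goes through the metadata/ path
path "clients-sandbox/metadata/*" {
  capabilities = ["list"]
}
```

Attaching a policy like this to each cluster's AppRole is what keeps one client's credentials from ever reading another client's mount.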
And because it's performing actions on a different AWS account, we have two different sets of AWS credentials that it needs: it needs one set to access the state, and it needs another set to run the Terraform. And so managing that can get very complex, and understanding where it's coming from, which is why we use Vault to broker all of the various credential access, because it allows us to do those fine-grained controls and do short-lived credentials all at the same time. So yeah, I think we had a question: could you use Atlantis along with Crossplane? I believe that would be redundant. I'm not super familiar with Crossplane. I have evaluated it briefly, but Crossplane has the ability to run Terraform for you. So I don't know that I would run them both simultaneously. But again, I'm no expert in Crossplane, and so I will defer to more seasoned Crossplane experts to answer that question. Which are not here, at least not between the three of us, I don't think. No. We did evaluate Crossplane when we, I think early on, were learning about some of the Terraform Cloud changes, and I seem to recall that it also does a very similar thing. Let's see, we have another question here: how do you test shared-repo Terraform code? I think that is an interesting question. So, testing Terraform code: not something that we've spent a lot of time doing. We spend a lot of time running things in lower environments, running them in sandbox clusters, testing our changes before we roll them out to production. But as far as, you know, testing Terraform code in a manner that is non-disruptive, that's a tricky problem. And that's something that I've not spent a lot of time on, mostly because we have the luxury of cloud environments where we can just spin up a non-production environment and go run Terraform against it and see how it behaves.
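The two-credential arrangement described above can be sketched in Terraform: state access uses one set of credentials via the S3 backend, while the provider assumes a role in the client's account. Bucket names, account IDs, and role names are hypothetical:

```hcl
# Hypothetical sketch of the two-credential setup: state lives in a central
# S3 bucket, while the AWS provider operates on the client's account.
terraform {
  backend "s3" {
    bucket = "fw-terraform-state"              # hypothetical central state bucket
    key    = "clients/example/terraform.tfstate"
    region = "us-east-1"
    # Atlantis reads this bucket with its first credential set
  }
}

provider "aws" {
  region = "us-east-1"
  assume_role {
    # second credential set: a role in the client's AWS account
    role_arn = "arn:aws:iam::111111111111:role/infra-admin"
  }
}
```

In the setup described, Vault would issue both short-lived credential sets, so neither lives permanently on the Atlantis host.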
And so we put a lot of focus on making sure our Terraform is consistent between clients and consistent between our non-prod and our prod environments. But as far as actual, you know, built-in testing, we haven't explored a ton of that. We are able to test some of our individual modules that we use, mostly the templating modules. We're able to run the templates and test that they function, which is useful. But as for the actual Terraform that affects cloud infrastructure, it's a tricky problem that we haven't spent a lot of time on. One commenter said Terratest has some frameworks that allow you to unit test in Go. I have heard that. I imagine it's very cool. Not something that we've explored just yet. As another person said, it's an evolution, not a destination. I appreciate that comment. So much this, so much this. So I think, you know, we could tell a short history lesson here of, you know, the eight years of doing this, or six years of doing this for us. You know, we started out with no consistency across clusters, multiple teams kind of doing whatever they wanted for each individual customer. And then we had this Python-based templating tool that managed some of the infrastructure repositories, but it was sort of difficult to maintain and we didn't have constant maintainers. And now we're in this model, which will probably get yanked out and replaced by something else in five years after I'm gone or whatever. So it is definitely an evolution. We've gone through many iterations of this. So I'd just like to share some of the other challenges we had. The credentials one is definitely the biggest. And I feel like one of the biggest challenges of a framework like this is, you know, there's a lot of automation that just works once you get it all going. But sometimes things don't go correctly.
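For the module-level testing mentioned a moment ago, Terraform 1.6+ ships a native test framework that can exercise a templating module at plan time without touching cloud infrastructure. This is a hypothetical sketch; the module's variables and outputs are invented for illustration:

```hcl
# Hypothetical *.tftest.hcl file exercising a templating module at plan time.
run "renders_expected_manifests" {
  command = plan

  variables {
    cluster_name = "sandbox"   # hypothetical input to the templating module
  }

  assert {
    condition     = length(output.rendered_manifests) > 0
    error_message = "templating module produced no manifests"
  }
}
```

Because `command = plan` never applies anything, a test like this can run in CI on every pull request without risking disruption, which addresses the "non-disruptive" concern raised in the question.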
And then there can be, you know, the more complex you make your automation, the more difficult it can be to bounce out of it if you need to do something. And so if you need to do something outside of the automation, and that's one of the challenges that we also had to look at here, which is: how do we allow our SREs, you know, at midnight, if they need to make a change to something and there's no one around who's going to approve a PR that allows them to then run the Atlantis apply, like, what then? And so we've had to also build in some back doors to allow people to do some manual changes, which goes back to the credentials challenge. Right. Yeah, no, that's a really good point. You know, we've got all this great automation, but, like, I need to go do a kubectl edit in the cluster because it's two AM and something's broken. How do I handle that, right? Argo CD is going to take that change away from me. So we have mechanisms that allow our SREs to pause syncing on different apps in Argo CD. And then if they need to run Terraform manually, they need access to this Terraform state bucket in order to do that, and they also need the credentials for the client. But we don't necessarily want to encourage that behavior, right? We don't want to be just always falling back to manual Terraform. So we provided a mechanism where they can ask Vault for access to this S3 bucket and run the Terraform apply, but we've got a monitor watching the audit logs on Vault that says, hey, somebody assumed one of those, we call them break-glass roles. And it drops a message into a channel that says, hey, so-and-so, can you tell me why you did that? And it's just, it's a snitch. Yeah, we wrote a Datadog monitor that's a snitch, right? But it's a nice way for the team to keep themselves accountable for not following the happy path, and for us to also understand why, right? They can say, hey, you know, I had to break glass to do this.
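Pausing Argo CD's self-healing as described above comes down to the Application's sync policy. This is a hypothetical sketch, with invented app and repo names; the mechanism Fairwinds uses may differ:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-addons                    # hypothetical app name
spec:
  project: default
  source:
    repoURL: https://github.com/example/infrastructure  # hypothetical repo
    path: gitops/manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy: {}   # no `automated` block: Argo CD syncs only when triggered,
                   # so a 2 AM `kubectl edit` is not immediately reverted
```

The same effect is available from the CLI with `argocd app set cluster-addons --sync-policy none`, and automated sync can be switched back on once the incident is over and the fix is in Git.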
It's like, well, can we fix that going forward and build it into our automation rather than requiring it? So it does sound a little annoying to get pinged every time you want to run Terraform manually, but it definitely helps encourage that team culture of: everything goes into Git, we make our changes in Git, and when we have to go around that, we allow it, but then we talk about why. Which I think is a great model. So that's a great call-out. So is there anything, I mean, you know, I know all the blood, sweat, and tears that went into getting this going, anything that you would do differently now that you know? To be honest, if I were to go back to the very beginning, I might remove the requirement to manage individual infrastructure repositories and go to more of a centralized management method, just because I think that's, yeah, we did talk about that. And, you know, there are some benefits to having individualized infrastructure repositories: it allows us to control when things get rolled out, and allows individual teams to be responsible for subsets of clients, which you could do in other ways. So I might consider changing that. I would start with the external secrets operator instead of trying the Vault plugin. Yeah, 100%. Other than that, you know, I'm fairly happy with the stack as it is. It's been pretty good. How about you? What would you change? The only friction point for me is actually more in our templating engine. For templating. So nothing here; you know, I think like you said, the Vault piece of it has been probably the most complex, the hardest for folks to wrap their heads around and understand, and onboarding our SREs onto the solution, you know, is one of those things you really have to repeat and get people into the habit of doing. And you can see the pain points that they run into because everything is locked down. And that can make it a friction point.
And that's the last thing you want to do, because then people try and find ways to work around your solution. Yeah, yeah. Your first point, the templating piece, you know, there's definitely some work to be done on that front, on making that easier. It can be difficult, which is part of the reason why I would maybe consider going back to a centralized method, so that you don't have to have a templating engine. You can, you know, build all of that in code. And frankly, I'd probably build all of it in something like Pulumi, so I can write it all in Go rather than Terraform. But this is actually an interesting conversation to have, right? That decision depends a lot on your team, right? Do you go with an infrastructure as code tool that even non-developer SREs are commonly familiar with? Or do you go with a more flexible, in my opinion easier to use, framework like Pulumi at the expense of having to train your team on the language that you choose? And in our case, it made sense to go with Terraform because the team already had knowledge of Terraform. For us, it's actually easier, you know, to train people on our Terraform than it is to teach them to write Go, because a lot of folks we hire aren't necessarily from a developer background. If you have a team that's all experts in Go or whatever language you want to choose, TypeScript, whatever, Pulumi is a great choice; huge fan. So there are so many decision points in this entire process that have nothing to do with the actual technology. They have everything to do with people, process, and business decisions. Which are, in some cases, almost even more important, because there are multiple solutions out there in the open source world to solve the same problem. And it's really about finding the solution that's going to work for your team, that's going to be easy to adopt, quick to adopt, and that's going to provide the least friction for people to use.
And so, you know, we joke, I remember you were really into using Pulumi because you code in Go. I know how to code in Go to some extent, but we did not have a strong Go basis on the team. And so Terraform, everybody knows, you know. Yeah, well. Yes, there are definitely problems with the Terraform state file, but we won't go into that here, to address that comment. There's no doubt about that. We definitely have to reconcile that. Someone asked us to please explain the Atlantis, Terraform, and Argo CD connection. Ah, interesting question. I think the key is there is no connection. They are two very separate sets of tools. So for everything that lives outside of the cluster, your VPC, your subnets, your, you know, DNS zones, your IAM roles, all of that stuff that doesn't live in your Kubernetes cluster, we prefer to build that with Terraform, and actually we build the cluster itself using Terraform. And so that's all managed by Atlantis, which is why, it's a little bit hard to see, but this arrow ends here at the cloud provider, basically. So that interacts with the cloud API, and then everything that goes inside the Kubernetes cluster is managed with Argo CD. And so that is the, not connection, but distinction between the things that those manage. I like this other question here. Yeah, go ahead. The one about interesting problems when you need to evolve environments. Yeah. So the question is, do we have interesting problems when we need to evolve client environments rather than greenfield deployments of them? Yeah. And that's a really interesting question, because while we've been doing all of these tooling changes over the years, most of what we do is evolve client environments. We manage upgrades of all of these add-ons. We manage updates of the cluster. And so the majority of our work is modifying existing customers rather than building greenfield. We only build greenfield when we bring on a new client.
And so a lot of the problems that we have encountered in the past, and that we're overcoming going forward, are about that exact question, which is: how do we bring existing repos into new tooling patterns? Or how do we reconcile that sort of thing? Or how do we make large sweeping changes to the structure of our repositories? Which is, for example, why the GitOps manifests live in this highly nested directory: because in this cluster's sandbox directory, there used to be a bunch of other files that were kops configuration. But we don't run kops clusters anymore, so we don't need that. In most of our clusters, this will be the only path in this repo. But it's not really worth it at this time to try and go back and change that for all the customers. So there are some interesting organizational changes that are difficult there. As far as other problems: moving to using Argo CD, for example, and the GitOps approach. Like, we used to run strictly out of, Andy showed the Reckoner course file that we use, and that was a whole thing that was also built into our previous Python-based templating tool. And so we would use Reckoner to directly install add-ons into client clusters. And now not only are we using it strictly as a templating engine to put out manifests for Argo to grab, but even the way that we fill out that course file itself has changed: you know, we're using values YAML files in a lot of places as opposed to doing inline YAML definitions. And so, yeah, we are constantly, and that is definitely one of the big struggles: you know, you can't just tell someone, yeah, we're doing something different, so we're going to have to knock everything down and start over, because that doesn't have a benefit for them. So you have to find ways to accommodate and just pull what exists into your new pattern. And you have to determine whether it is worth it, because it can be a lot of work to do that. Yeah, yeah.
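The values-files-over-inline-YAML change described above might look something like this in a Reckoner course file. The chart, repository, and file paths are invented for illustration, and the exact schema should be checked against the Reckoner docs:

```yaml
# Hypothetical course.yml sketch: the chart references a standalone values
# file instead of a large inline values block, keeping diffs small.
namespace: default
repositories:
  fairwinds-stable:
    url: https://charts.fairwinds.com/stable
charts:
  polaris:
    repository: fairwinds-stable
    version: "5.0.0"            # hypothetical pinned version
    files:
      - values/polaris.yaml     # values live beside the course file in Git
```

Splitting values out this way also means the rendered manifests that Argo CD picks up change only when the values file itself changes, which keeps pull request reviews focused.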
And keeping the system as flexible as possible to accommodate those changes, right? Across different clients, because we have clients that have different requirements and ask for different things. And so the design of the system is intentional: you know, we dictate sort of what gets into the cluster, but we don't necessarily dictate how that gets into the cluster. So for example, like I showed with Terraform, we have the templated Terraform that's centrally managed, but I can add more Terraform on top of that for a particular client if I need to. Ideally that goes into the internal tooling and it's encapsulated, but if a client needs something tomorrow, I may not have time to go, you know, template it out in a generic way that works across everybody. We have the flexibility to add those things to the repo for each individual client, which is possibly one benefit of running the multi-repo setup that we have. So, trade-off. Yeah, I think we had a question: would you share any thoughts on immutable infra via IaC? That's an interesting question. I feel like most of what we talked about is in that direction. So, you know, I'm a huge fan of immutable infrastructure via infrastructure as code, right? Ideally in this model, right, our SREs are never doing this or this. They're just making pull requests to the infrastructure as code. Now, that's an ideal world that is maybe, you know, not a hundred percent achievable in some environments, but it is largely achievable and we're huge fans of it, which is why we run Argo CD. You know, ideally I would have Atlantis running on a schedule as well as running on repos, right? So that we would be essentially, you know, doing some sort of lightweight drift detection for Terraform code. And we're working towards that. So I think it is the end goal, to summarize my answer to the question. And to the listener who thanked us: you are welcome. Thank you for asking the questions. We much appreciate the interaction.
I think that makes these things much more interesting than us just talking at a screen. Well, I think we're about out of time here. Well, we appreciate you having us on again. I know it's not our first time on the livestream. It's always a pleasure. Yeah, so, any questions? If you have any questions, feel free to add them in right now, but other than that, I think it was a very awesome session. So yeah, folks are saying, yeah, great insights there. Thank you. So thank you so much, Stevie and Andy, for your session; it was very insightful. So yeah, thank you so much and see you again, right? Thank you. Bye. Okay. So yeah, thanks everyone for joining this episode of Cloud Native Live. We enjoyed the interaction and questions from you guys, obviously. Thanks for joining us today and we hope to see you again soon.