Hey everybody. Last session of the day, so I appreciate you coming out. My name is Marc Boorshtein. I'm the CTO of Tremolo Security, and we're going to be talking about Kubernetes in public safety systems: what we had to do that was different from maybe your typical enterprise, and our journey over the course of about a decade from running on virtual appliances all the way to containers. So like I said, my name is Marc Boorshtein, CTO of Tremolo Security. I've got 20-plus years of identity management experience, and I've been working in the Kubernetes world since 1.3, when OpenID Connect and RBAC were coming out. So if you've ever read the documentation for OIDC on the Kubernetes website, I helped redo that a few years ago. I'm also the co-author of Kubernetes: An Enterprise Guide, second edition, and a contributor to multiple projects: Kubernetes, the dashboard, Kiali. But you don't really want to hear about me. Let's talk about the case study.

So what is the National Capital Region IAMS program? IAMS stands for Identity and Access Management. It is an identity service for the Washington, D.C. area that provides identity and access for regional systems and applications across 22 jurisdictions. So in D.C. we have local governments, we have state governments in Maryland and Virginia, we have the D.C. government, and we also have the U.S. federal government, which includes the civilian agencies as well as DoD. So there are a lot of different cross sections of identity that need to be managed. What's kind of interesting about our platform is that we do both authentication and authorization. Users can log in and authenticate using their jurisdiction credentials, but they can also request access to applications in a way that's independent of the jurisdiction. So it takes a little bit of the load off the jurisdictions' IT systems.
The applications that are integrated with IAMS run a gamut from content management and collaboration systems, SharePoint sites, WordPress sites, Drupal sites, et cetera, all the way to emergency response. For instance, the Fairfax County, Virginia emergency response system uses us for authentication. A lot of the emergency response agencies will use our system for authentication to get regional folks, like folks from universities, into their systems during a crisis. So we run that entire gamut, from content management all the way to emergency management. And we've been in production since 2013, so it's been quite a long journey. The entire platform is built on top of Tremolo Security's OpenUnison, so it's all open source. Tremolo Security, my company, makes OpenUnison.

So where does IAMS begin? We go back to September 11th, 2001, when terrorists flew an airplane into the Pentagon. Units from Arlington County, where the Pentagon is, Fairfax County, the City of Alexandria, D.C., and Maryland all came to respond to the attack. And out of that, from a public safety perspective, there were a lot of lessons learned: the region's systems need to be interoperable, radios need to work together. One of the things that came out of it was an attempt to build a network that would survive an outage on the commercial internet, so that if the commercial internet went down, public safety officers would still be able to communicate with each other. This network was built, and folks didn't want to put any applications on it because they didn't want to have yet another username and password. So that's where we came in.

So when we're talking about public safety, the priorities are a little different than you might be used to in your typical enterprise. The first is availability. You might think security would be the first one, but it isn't, because a lot of the information you can hear if you just have a radio.
So what's far more important is making sure the system is up and available. An example: if a jurisdiction's identity system goes down and they need to notify the other jurisdictions that there's a problem, well, if their identity system went down, how can we log them in so that they can get into those applications and send the notification? So we had to build in a way for them to do that securely without their jurisdiction being online.

For the stuff that isn't publicly available information, privacy becomes a big concern. It's probably the second highest concern after availability. As an example, you get arrested and your mug shot goes into a database. There are a lot of security controls around that to maintain your privacy, and those controls are important. The ACLU comes up in almost every conversation when we start talking about how we're going to manage access to these things and make sure that everything is kept above board. So we have to be very mindful of that.

Now, of course, security does play into this. You can't really have a good privacy system if you don't have a security system underneath it. So we use MFA, and we use OpenUnison itself to control access to the clusters. And then finally, performance. It's got to perform and it's got to be usable, but we're not talking hyperscale. There are about 10,000 users across the region, maybe a few hundred logged in at any given time. So it's got to be available, but we're not exactly worried about serving millions of people.

So what are some of the challenges? The first is that we have to deal with a lot of legacy systems. "If it ain't broke, don't fix it" has a really special meaning in the public safety world because of the need to maintain availability. It's been a slow space to bring into the constant movement of upgrading and keeping things patched. And there are always exceptions to the rule.
So as an example, a lot of the police applications we work with, the first thing they say is, we want to make sure only sworn officers have access. And then the second thing they say is, well, actually, crime analysts need to be able to access it too, and they're not sworn officers. So we have to be able to accommodate that. There are silos of silos: every jurisdiction maintains a silo for the police department and one for fire, and even inside police departments you have silos, because certain parts of the police shouldn't be talking to other parts of the police, by law, to maintain privacy. And then you run into the technology and regulations that feed into those challenges.

So we've been talking a lot about what public safety is like. Let's get into the fun part, which is technology. When we originally launched in 2013, we were running in virtual appliances. We thought it would be great to just have an ISO with the system on it, drop it into a VM, and get up and running, and nobody was going to log in and put anything on it. That lasted about 15 minutes, until we were told, no, we need SNMP on there, we need this agent on there, we need that agent on there. So what was a virtual appliance really just became another VM. It was a mix of Windows and Linux. Our product was running on Linux, but the jurisdictions wanted to be able to take ownership of it once it was up and running, and most local jurisdictions run on Windows, so they wanted things running on IIS, SQL Server, and Active Directory.

After a few years, we decided to migrate off of that platform. We were able to move to a different data center that made it a little bit easier to run VMs. So we moved, still on virtual appliances, but everything was Linux. We got rid of IIS, we got rid of SQL Server, and we moved over to MySQL, or actually MariaDB with Percona for high availability.
We had set up a FreeIPA service to manage authentication and authorization into our servers, and for the most part it worked fine. That was when Docker was starting to get a little popular, and we said, all right, we're going to go ahead and try using Docker containers for all this. And we made every mistake you could possibly make with containers. We treated them like VMs. We didn't really know how we wanted to handle persistent storage. We didn't have a good way to handle networking or scheduling. We did everything manually. And we got it up and running. It worked, but we took one look at it and said, a stiff wind is going to knock this thing over. This is no good.

So we decided we were going to go to Kubernetes, but we didn't want to run our own Kubernetes. This was about 2017, and we were supposed to be moving to a new data center in a few months, actually into a public cloud. So we said, all right, we'll just wait until we move and use that public cloud's managed Kubernetes, because we didn't want to manage it ourselves. Well, it took a little longer than we had hoped, and we ended up moving to the public cloud in 2019. We didn't have as long an on-ramp as we'd hoped, so instead of going straight to Kubernetes, we first went to Ansible-managed VMs, which we'll talk about, and then finally we did make it into Kubernetes.

So the first major move was virtual appliances to Ansible. Our original product, Unison, was not open source, and it had a GUI for configuring things. When we created OpenUnison, we got rid of the GUI and went to configuration as code. We liked that a lot more; it was a lot easier to manage. And we externalized the environment configuration. So now we were able to generate something that kind of looked like a container. It wasn't a container, but it was very easy to deploy. We were able to use Ansible playbooks, and it worked.
We were able to take a VM and get a system up and running in five or six minutes, which was fine for us. The other major change here was that we moved off a legacy monitoring solution the region had been using. We, as a company, Tremolo, had taken over management, and so we wanted to do our own monitoring, which we'll talk about. And we did all of our patching via Ansible.

When it came to monitoring, we didn't have the capacity to just start deploying Prometheus and getting it up and running; we're a little budget constrained. So what we decided to do was run Prometheus and Grafana inside of Tremolo's cloud, but do all the scraping against the VMs in the data center and in the other cloud. On the left there, that's the Tremolo cloud. We would have a Prometheus instance scrape OpenUnison. That OpenUnison would put a message on an Amazon SQS bus, which would get picked up inside the network. The OpenUnison inside the network would then scrape all the OpenUnisons there, compile all of the responses, put them back on the bus, and send them back up to the cloud, where the cloud OpenUnison would hand them back to Prometheus and Grafana. That way we were able to monitor everything without having to worry about an ingress into the network.

So now we're starting to get to the fun part: moving from VMs to Kubernetes. We had a few different challenges. The first was a split network. Our applications are on a combination of that NCR net I mentioned and the public internet. Some people come in from the public internet; folks on the jurisdictions' networks come in from the NCR net. So we had to make sure we had dual entrances, and we narrowed down where things could go based on ingresses. We also still had to deal with legacy security: all of the security rules across the region are still built on firewalls, IPs, and NAT.
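To make the relay concrete, here is a rough sketch of what the cloud-side Prometheus scrape configuration might look like. This is purely illustrative: the job name, hostname, and path are my assumptions, not the actual NCR configuration. The key point is that Prometheus only ever talks to the cloud-side OpenUnison, which fans the request out over SQS behind the scenes.

```yaml
# Illustrative only: the cloud-side Prometheus scrapes a single
# OpenUnison endpoint, which relays the request over Amazon SQS to the
# OpenUnison inside the network. All names here are hypothetical.
scrape_configs:
  - job_name: "ncr-openunison"
    scheme: https
    metrics_path: /metrics          # the relay endpoint on OpenUnison
    static_configs:
      - targets:
          - openunison-relay.tremolo.example.com
```

No inbound firewall rule or ingress into the network is needed; the inside OpenUnison only polls the queue.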
So that was a bit of an interesting conversation with the security folks, but they saw what we were doing and were able to sign off on it. And then we had to get into CI/CD. This is where we really started to have some fun, because what had been generating what was essentially an executable WAR file now became: how do we generate a container and keep that container up to date? And then the last couple of things that were challenging for us were logging and monitoring. We take all of our logs from OpenUnison and put them into a database. We started this before Splunk really became a thing, and when Splunk did become a thing, nobody was going to give us budget to actually run Splunk. And because of the volume, it was actually easier just to do it in real time using message buses than to try something fancy with persistent volumes.

So let's talk about our pipeline. Tremolo Security manages multiple base images in our Docker repo, and we keep them up to date. We use Grype from Anchore to scan those containers every single night. All of our images are built on Ubuntu, so if Canonical releases a patch for a CVE, we get triggered to go ahead and rebuild the base container. That rebuild then triggers another rebuild to build our NCR container: we execute a webhook that pushes a patch into the GitHub repo that stores the application code, which then triggers the cloud CI/CD pipeline to do a rebuild. That rebuild then patches the API server deployment in the test environment to load the new container, and because we have continuous monitoring, that tells us that everything is still working. Once we're happy that everything is still working, we go ahead and update our production API server with the new tags to run the new version.

So we've got Kubernetes up and running, everything is going great, and the next question becomes: what's next? Are we done yet?
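The nightly scan-and-rebuild loop could be wired up along these lines. This is a sketch, not the actual pipeline: I'm using GitHub Actions syntax purely for illustration, and the image name, severity threshold, and webhook secret are all assumptions.

```yaml
# Hypothetical nightly scan job. If Grype finds a high-severity CVE in
# the base image, fire the webhook that kicks off the rebuild chain.
name: nightly-base-image-scan
on:
  schedule:
    - cron: "0 2 * * *"            # every night at 2 AM
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - name: Scan base image with Grype
        run: |
          # grype exits non-zero when --fail-on matches a finding
          if ! grype tremolosecurity/ubuntu-base:latest --fail-on high; then
            curl -X POST "${{ secrets.REBUILD_WEBHOOK_URL }}"
          fi
```

The rebuild triggered by the webhook then cascades into the application-image build and the test-environment rollout described above.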
Going back to our original requirement, we need to have multiple instances. We all know public cloud data centers do go down. It doesn't happen often, but it does happen. So we wanted to be able to fail over to a different data center, a different region entirely, and we didn't want to have to maintain two different clusters from nothing. So GitOps seemed like the right approach. Well, before we could get to GitOps, working backwards here, we needed to figure out how we wanted to manage our secrets. We weren't willing to put secrets in Git, and we'll talk about that in a second, so we needed to figure that one out. And, of course, we were on managed Kubernetes, so we had that part done.

So let's talk about secrets management. You need to be able to externalize your secrets for GitOps. Secrets should never go in Git. I know there are projects out there, like Sealed Secrets, for encrypting secrets, but there are a couple of problems with that approach. The first is that Git is specifically designed to be easily distributed, and to lose track of copies. So if I pull my Git repo and accidentally push it up to GitLab or GitHub, I have now lost control of it. There's nothing I can do at that point other than try to go delete it, but once it's out there, there's no centralized control. You've lost control of it. So that's issue one. Issue two, and this is the big point: the biggest value of a secrets vault isn't the fact that you're locking down the secret itself. I mean, it's good to lock that down, don't get me wrong. It's that you now have control over, and visibility into, who's accessing it, and the ability to quickly rotate it. Which you can't do if you're using encrypted secrets in Git. If you're using encrypted secrets in Git, you now have a key that you have to maintain, and if that key gets lost and you need to rotate it, you have to go find every secret that was encrypted by that key.
So I understand why people do it, but for operations, we were not going to do it. Another point: we wanted to use Vault as a centralized database for secrets, but we were not worried about using Kubernetes Secrets. It's a common API, and all the systems we use know how to use it. So Kubernetes Secrets were fine, and the blog post I linked here, which came out early in 2022, does a really great job of looking at the different potential threats to Secrets in Kubernetes and explaining why it's fine to put your Secrets, for the most part, into etcd. Most of the value you would get out of Vault, beyond what syncing the Secrets into etcd gets you, would be defeated by automation anyway, especially now that Kubernetes has a way to provide an identity to workloads to talk to your vaults without having to have a specific password or unlock things.

So we decided to go with the CSI Secrets Store Driver. There are other projects that let you synchronize from a vault into Kubernetes Secrets; we liked this one because it was part of the CNCF. The trick with it was that you still had to have the secrets mounted to a pod, even though you were generating Secret objects. So that was a little bit of fun to figure out. The first thing we did was write a script to convert our existing Secrets into Vault. We automated it: the script went through each Secret and generated the correct CSI object. The CSI object is kind of standard, but every vendor has its own implementation of the CSI secrets store provider, so the object for Amazon has different things in it than the object for Vault, for AKS, et cetera. It's a little bit hard-coded from that perspective. But once we did that, we were able to connect our cluster to Vault, generate the correct objects, and then synchronize the secrets into our cluster.
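To give a feel for what those generated objects look like, here is a minimal sketch of a Vault-backed SecretProviderClass, with the `secretObjects` section that mirrors the mounted content into a native Kubernetes Secret, plus the kind of throwaway pod that forces the mount. The Vault address, role, paths, and resource names are all placeholders, not the actual NCR configuration.

```yaml
# Hedged sketch: one secret pulled from Vault and mirrored into a
# Kubernetes Secret. All names and paths here are illustrative.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: openunison-secrets
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.example.com:8200"
    roleName: "openunison"
    objects: |
      - objectName: "db-password"
        secretPath: "secret/data/openunison"
        secretKey: "db-password"
  secretObjects:
    - secretName: openunison-db      # Kubernetes Secret created on mount
      type: Opaque
      data:
        - objectName: db-password
          key: password
---
# The driver only syncs when the volume is actually mounted, so a pod
# has to reference the SecretProviderClass for the mirror to happen.
apiVersion: v1
kind: Pod
metadata:
  name: secrets-sync
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: openunison-secrets
```

The mirrored `openunison-db` Secret can then be consumed like any other Kubernetes Secret by the actual workloads.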
So the way we handled that was with the pause container, or the pause image. It hasn't changed in something like six years. We created a deployment with pause: it takes almost no resources, and you don't have to worry about patching it. That deployment would mount the secrets, and that would get the CSI Secrets Store Driver to synchronize the secrets from Vault into Secret objects in etcd. So that way we get to have our cake and eat it, too.

Once we had our secrets figured out and working, we could start having fun with GitOps. And the first thing we learned very quickly is that eventual consistency is a lie. The reason I say that is that if you've ever run a Helm chart and it failed halfway through, Helm doesn't try again; you've got to make the decision to do that. If you have stateful systems, if you have monoliths, a lot of applications require specific sets of steps to happen in the right order, and things to be ready before they can even get started. So that was really our first big challenge: figuring out how we were going to coordinate that.

We have a combination of Helm charts and static manifests. My original thought was that we were going to use Helm for everything and just have a different values file for each environment. We ended up deciding to go with environment-specific repos, though. Our dev environment is much more complex than our production environment. I talked about 22 jurisdictions across the region; in dev, I have to replicate that so we can test and make sure things work before we push to prod. The infrastructure that's needed to provide those testing identities doesn't need to exist in production, and we're not going to sync it into production. So we decided to go with separate repos, and that made for a really easy migration from dev to prod. And for a GitOps controller, we decided to go with Argo CD using the app-of-apps pattern.
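A minimal app-of-apps root looks roughly like this. The repo URL, path, and names are placeholders, not the actual NCR repos; the idea is that the `apps` directory contains nothing but more Application objects, one per component.

```yaml
# Sketch of an Argo CD app-of-apps root Application. Syncing this one
# object pulls in every child Application under apps/, which in turn
# deploys the rest of the infrastructure.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ncr-prod-apps.git   # placeholder
    targetRevision: main
    path: apps                       # directory of child Applications
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With environment-specific repos, dev and prod each get their own root pointing at their own repo.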
So when we create Argo, we create a single application. That application points to a Git repo that has a bunch of Application objects in it; those get synced in, and that powers the rest of the infrastructure. I really like Argo CD. I like having a GUI. It makes it easy to track what's going on visually, and it simplifies day-to-day operations. One of the things that was interesting to deal with was learning how to use waves properly in Argo CD, making sure, like the sequencing I talked about, that certain ConfigMaps were deployed before the deployments, or that certain CRs were deployed before the deployments relying on them. So that was an interesting journey, seeing how all that fit together.

Once we had our environment ready to go, the next thing we had to do was move all of our configuration out of our old cluster into our new cluster. And there are surprisingly few tools that'll let you do that. If you have a cluster and you use the kubectl get-all plugin, which literally just gets everything from a namespace, you've got a ton of objects that are ephemeral in nature: Endpoints, Pods, ReplicaSets, these intermediary objects. So the first thing we wanted was something that would strip all that out. We also wanted to strip out all of the metadata that was very cluster specific: UIDs, resource versions, the annotation that stores all of the state management so you can do kubectl apply, I'm blanking on the name right now. I could not find a tool that would do it, so I wrote one. There's the URL for it; you're welcome to use it. It is not a polished CLI tool, but you're perfectly welcome to use it. What it does is go through the environment namespace and export everything, object by object, into folders that I can then easily clean up, go through, and import into a Git repository.

So, some of the challenges we ran into with GitOps. The first was CI/CD.
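The wave ordering mentioned above is driven by a single annotation. Roughly, it looks like this (resource names are illustrative, and the specs are elided):

```yaml
# Argo CD syncs lower waves first and waits for them to be healthy, so
# the ConfigMap (wave 0) exists before the Deployment (wave 1) starts.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openunison-config          # illustrative name
  annotations:
    argocd.argoproj.io/sync-wave: "0"
data:
  config.yaml: "placeholder"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openunison                 # illustrative name
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# deployment spec elided
```

The same annotation works for CRs that have to land before the controllers or workloads that rely on them.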
When we talked about our CI/CD pipeline previously, we were using the cloud vendor's CI/CD system to build and generate the container, and then patch the API server deployment to roll out the new container. Well, in the GitOps world we don't want to do that; we want to patch a GitOps repo. Most of the CI/CD platforms out there make it really easy to pull from a single repo, but if you're going to pull from and write to a different repo, you basically have to do it manually, as if you were sitting at your laptop. So that was a little interesting to figure out. One thing that's really good is that you can use kubectl patch against a local manifest to do the patching, so you don't have to worry about using sed or grep or awk or any of the other text-based tools to update it. You do kubectl patch the same way, just locally against the YAML file in your manifest.

Monitoring was a little interesting. When we moved to Kubernetes, we stopped using the Tremolo-hosted monitoring solution and moved instead to the Prometheus stack running inside the cluster. The issue we ran into, though, was that the Prometheus stack isn't really designed to be customized, and we had specific dashboards and specific metrics that we wanted to track. So we ended up writing a script that would download the source from Git, make our changes, regenerate the manifests, and bring those into Git. That worked pretty well; it actually wasn't as bad as I thought it was going to be.

I talked about leveraging waves in Argo, and the last piece is, once the repo is updated, making sure that Argo synchronizes in the proper order. We ended up writing some scripts to do that, and I'm looking at doing something with Argo Workflows to make that a little more automated. And that's pretty much it. I'm going to do a little shameless self-promotion here. You can find me on Twitter, either myself or Tremolo Security.
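Concretely, the local patch trick looks something like this inside the CI job. The file names and image tag are placeholders; the important part is `--local`, which makes kubectl apply the strategic merge patch to the file without ever contacting a cluster.

```shell
# Patch the image tag in the manifest stored in the GitOps repo,
# entirely client-side (no cluster connection). Names are illustrative.
kubectl patch --local -f openunison-deployment.yaml \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"openunison","image":"docker.io/tremolosecurity/openunison:1.0.40"}]}}}}' \
  -o yaml > openunison-deployment.patched.yaml
mv openunison-deployment.patched.yaml openunison-deployment.yaml

git commit -am "roll out new OpenUnison image"
git push   # the GitOps controller picks up the change from here
```

Because it's a strategic merge patch, only the named container's image changes; the rest of the manifest, including comments-free structure and ordering, is regenerated by kubectl.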
And we've got a 20% discount on Kubernetes: An Enterprise Guide, second edition. You can get it off of Amazon; that's what the QR code is for, and you've got the discount code there. Any questions? All right, oh, sorry, one more. We got one. I'm sorry, I didn't hear that. The question was how big the team is. We have about three people on the team. Any other questions? All right, well, thanks for making it to the last session of the day.