Hey everyone, good morning, good afternoon, wherever in the world you may be. I'm Michael Medellin. I am the director of engineering for a software engineering unit inside the Air Force called Kessel Run, and joining me today is Gordon, who will introduce himself.

Hi folks, my name is Gordon Tillman. I'm a principal engineer with F9, and I've been working with Kessel Run for almost a year now.

Awesome, thank you, Gordon. Today we're going to spend about 20-25 minutes talking to everyone about how the Department of Defense is using Kubernetes and Flux to achieve the compliance and deployment consistency we're looking for in our efforts to develop the capabilities our users need from us on a day-to-day basis.

Just for context, I want to explain a bit more about what Kessel Run is and what we're doing. Our vision is to deliver combat capability that can sense and respond to conflict in any domain, any time, anywhere. We are part of the mission just as much as the exercises and operations that involve flying planes or sailing aircraft carriers around the world. We are building the capability necessary for the airmen and women who serve the United States and our coalition partners to win the next war, and the software teams, both the platform teams Gordon and I work with and the application teams that take a dependency on our work, are shipping that software every day to our users around the world.

I want to take a moment to touch briefly on Kessel Run and that mission overall. If you dive into it, what really is Kessel Run? Kessel Run is an acquisition and development environment. Traditionally within the government, we have acquisition units that go out and award contracts to software engineering teams and other companies to actually build capability, much like you would go out and build an airplane, a ship, or some other type of capability for the Department of Defense. We are part of that acquisition infrastructure for the Department of Defense, but internally we are very much a software engineering organization. In the buzzwordy term, we're doing DevSecOps: we're trying to unify a way to ship software to our users around the world, do it securely, do it reliably, and do it continuously, closing that loop.

A bit more about Kessel Run's mission: what is our main focus, and what are we actually trying to solve? We have to talk a bit about the Air Operations Center and how we do command and control operations for the United States Air Force and our coalition forces around the world. Back when the Air Force first started, this was very much a manual process, working with maps and communicating over radios and telephones to direct resources and manpower to effect capability around the world based on the mission we were being asked to serve. Naturally, over time, this became more software driven. In the 1990s, you saw the introduction of computers and digital technology to help coordinate resources. This is very much a normal enterprise resource planning problem: how do you get people and air assets to the locations best suited to the mission and actually execute the mission? As of today, this is more digitally driven.
We've got these legacy systems that our users around the world, Airmen and Airwomen, are accessing to conduct missions, planning, and execution, but these systems were developed long ago by engineering teams working in ways that wouldn't be familiar to us as a modern, quote unquote, DevSecOps engineering environment: shipping software quickly and iteratively to our users on a day-to-day basis to prove out the capability and deliver what's most necessary and important at the point in time we deliver it.

I think it's worth defining the problem more so folks have an understanding of the problem space we're working with. We've got the mission laid out: we're trying to solve the command and control problem for our users around the world. If we step back a bit, actually delivering that command and control capability is fairly difficult. We have a globally distributed user base, and by distributed I mean fairly evenly distributed. We have users and operations around the world at all the different combatant commands that we need to support with the operations and capability we're shipping.

We have multiple air-gapped networks that we have to support. Deploying our clusters, applications, services, and configuration, and controlling all of that configuration and deployment uniformly across all of these air-gapped networks, is a very difficult challenge. It leads us toward some of the solutions we're going to be talking about a bit later in the presentation.

We also have to support commercial and on-prem infrastructure. We're using commercial cloud, and we're using our own globally distributed on-premise infrastructure that we put Kubernetes clusters on, that we put applications and services on, and across which we distribute that workload around the world to serve the mission.

We also need to support on-demand operational, exercise, and test environments. There are always exercises and tests being conducted with our software, and we don't want to battle-test our production or operational environments; we want to use separate, isolated environments to run tests and exercises without interfering with normal operations in an actual real-world use case.

Finally, where the title of this presentation was leading: all of this also has a compliance and regulatory challenge that I want to follow up on here a bit as well. These five major areas really define the type of complexity that Gordon's team and our teams in our platform engineering environment have to solve alongside our application development teams, so that they can deliver their capability effectively to the end user, to our coalition partners, and to the United States Air Force conducting missions and operations around the world. These sets of challenges inform a lot of what we've chosen as part of the technology stack.

Diving in real quickly on compliance and regulatory, one thing I'd be remiss not to mention is the major challenge we face in this space, and how Kessel Run got its original claim to fame, if you will, within the Department of Defense: the idea of improving the way we field what we call an ATO.
Basically every system or information system inside the Department of Defense legally needs this authority to operate to be able to take on operational missions, whether that's managing employee payroll data or actually conducting air missions. All of those systems have to have an authority to operate showing that we've reduced the risk, proven that the model is secure, and implemented the controls necessary to mitigate as much risk as possible before we field the system in production. Previously this process was very, very tedious, and that's what led us to what we call the continuous authority to operate, which I'll briefly touch on here to set the stage for Gordon to talk a bit more about what we're doing with Flux.

Traditionally, the ATO was something you earned over a months-to-years timeframe, depending on the complexity of the system. It was very paperwork driven and very tedious, and major architectural changes were pretty much out of the question because they would impact the ATO. That ATO would last about three years, and then you'd revisit it at that three-year interval to either renew or change it, depending on the circumstances. Obviously, this pattern has left a lot of stagnant, stale systems within different parts of the Department of Defense, because not many teams want to take on architectural changes and deliver new capability given the tediousness involved in earning that authority to operate.

So where we are now, in phase one of the continuous ATO, is that Kessel Run pushed the ball forward and said: we're going to reduce the time to get an ATO down to days and months. We're going to leverage rapid assessment with application security and assessment teams, and we're going to implement guardrails and static scanning tools to help identify vulnerabilities in our code bases before we ship them to production. And we're going to leverage best-practice commercial products, in our founding things like Pivotal Cloud Foundry, to help ship software more quickly and more uniformly across our environments.

And now, where we're going in this transition to Kubernetes, containers, and declarative infrastructure is what I think of as phase two of our continuous authority to operate. How can Kessel Run take what it did originally in pushing the bounds of the continuous authority to operate and do the next great thing for application security, DevSecOps, and the DoD? That's what we're talking about today. How can we go from taking days to months to ship code to production to doing it in hours or days? How can you show up at Kessel Run on your first day and ship a bug fix to production that same day? How can we use techniques like policy as code, configuration management tools like Flux, and other GitOps patterns to help drive the compliance and configuration management expectations that our customers and our authorizing officials have for the platform?
Can we leverage open-source toolchains, products like Kubernetes and Flux, in our stack to enable more transparency and more security within our infrastructure? I'm fully on board with leveraging open, battle-tested systems that have had security reviews, open penetration testing, and vulnerability assessments. And how can we leverage more open source contributions in our systems and in turn contribute back to the open source community as we move down this path? Not only do we want to leverage what the community has done, we want to take what we're learning from operating in these complex environments and contribute back to tools like Flux and Kubernetes, to help improve the direction and capabilities of those systems to support not only our complex environments but others in the community who face similar challenges.

So Gordon and I work on the platform team. We work on what we call the All Domain Common Platform, and we're trying to make it easy for teams to ship secure, reliable, resilient software. Defining this a bit more, we're building a multi-network, multi-region, hybrid-infrastructure platform that solves these problems for application teams so they can stay focused on mission outcomes while we handle the operational complexities of supporting multiple networks, multiple regions, and the hybrid infrastructure we run today. This complexity is the reason we've ended up down this route, using patterns like GitOps and tools like Flux and Kubernetes to power our infrastructure. They help solve the problems around this complexity, and to tell you a bit more about that complexity and how we're using these tools to manage it, I'd like to pass it over to Gordon.

Okay, thank you, Michael. Next slide. Okay, so the goal here is to use GitLab as the single source of truth for everything that's being deployed. This includes networking, Kubernetes itself, our common or baseline services that we'll talk about, and the applications that teams at Kessel Run deploy. It also includes maintaining the appropriate access to the cluster itself as well as the applications deployed in it. One common factor in all of this is the use of Flux. Next slide.

Okay, so why Flux? There's some measure of additional security, which fits in nicely with our DevSecOps model. Nothing outside the cluster can update what is running in the cluster. Flux verifies and will reject any Git commits that are not signed properly, which prevents unauthorized code from being applied. Obviously, we have a full audit history in Git of everything that is applied to a cluster, and we eliminate the configuration drift that can happen when you let folks manually apply or tweak things that are running in the cluster. Next slide.

Okay, starting with infrastructure. We are using Cluster API to manage the deployment of the workload clusters themselves. Cluster API is a Kubernetes project that provides declarative APIs for managing the lifecycle of worker clusters. We have management clusters that are responsible for the worker clusters deployed in their respective regions. Here's a snippet of a Rancher dashboard showing one of them and the kinds of resources they're maintaining. Next slide.
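As a rough illustration of the commit-signature verification mentioned above (this is not Kessel Run's actual configuration): in Flux v1 this behavior is enabled with the daemon's --git-verify-signatures flag, and in Flux v2 the rough equivalent lives on the GitRepository source. The repository URL and secret names below are hypothetical.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitlab.example.internal/platform/cluster-config.git  # hypothetical repo
  ref:
    branch: main
  secretRef:
    name: gitlab-credentials          # hypothetical Git credentials secret
  verify:
    mode: head                        # reject commits whose GPG signature cannot be verified
    secretRef:
      name: gpg-public-keys           # secret holding the trusted GPG public keys
```

Commits not signed by one of the trusted keys simply never get applied to the cluster, which is the property Gordon describes.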
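To make the declarative Cluster API model concrete, here is a minimal sketch of the kind of custom resources a management cluster reconciles into a worker cluster. The names, CIDR, and AWS provider are assumptions for illustration; the referenced control plane and machine templates are omitted, and this is not the actual Kessel Run manifest set.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-east-1                # hypothetical cluster name
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:                     # control plane defined in a separate resource (omitted here)
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-east-1-control-plane
  infrastructureRef:                   # provider-specific details such as VPC and subnets (omitted here)
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: workload-east-1
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workload-east-1-md-0
  namespace: clusters
spec:
  clusterName: workload-east-1
  replicas: 3                          # horizontal scaling: change the worker node count here
  template:
    spec:
      clusterName: workload-east-1
      version: v1.21.5                 # bumping this rolls workers to a new Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workload-east-1-md-0
      infrastructureRef:               # swap the machine template to change instance types
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: workload-east-1-md-0
```

Because Flux applies resources like these from Git, a reviewed commit is what scales, patches, or upgrades a cluster, which is the point Gordon expands on next.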
Now, before we move on, I would like to emphasize that Cluster API does more than just deploy worker clusters. It provides no-downtime security patching and Kubernetes upgrades. It can change instance types, so it can scale workers both horizontally, by scaling the number of nodes, and vertically, by upgrading them to use larger instance types. In addition, as a safety measure, fields in the cluster resources that could cause a cluster failure if they were changed accidentally are immutable, so it prevents that from happening. Next slide.

Okay, so let's take as an example what happens if we want to deploy a brand-new Kubernetes cluster. Well, we kick off a GitLab pipeline, and it first runs some Terraform that creates the networking itself. If, for example, we were deploying into something like AWS, this would include the VPC itself. Cluster API can actually do this step, but we wanted an easy way to encapsulate all of the best practices from our cloud team based on the various security audits they've had to go through. For example, it deploys a bastion instance running a hardened image. It deploys endpoints that allow one cluster to talk to another, perhaps, or to talk to various internal or external services. And depending on where it's deployed, it may configure access to the environment via something like Zero Trust, and lots of other things as well. The pipeline then reads the output from Terraform and generates the custom resources that are required for a new worker cluster. These are committed into Git, and a Flux instance running in that same management cluster applies these new custom resources and triggers the Cluster API controller to deploy them, ending up with a new worker cluster. Next slide, please.

Okay, so now we have a running workload cluster. Since we did not let Cluster API deploy the networking, we did that with Terraform, it cannot automatically associate the control plane nodes and worker nodes that it created with the appropriate load balancer targets. So we have a Kubernetes cron job running in the management cluster that watches for new nodes and does that for us. Then a post-workload-cluster-deployment hook does the following. First, it populates some secrets that we need for the new cluster. Then it deploys some basic things that have to be in place before we do anything else. In particular, we deploy Bitnami Sealed Secrets, we deploy Flux, no surprise there, we deploy the Helm operator, and we deploy a service mesh, which currently is Istio.

Just a quick note: among the Vault secrets that are prepopulated are the private key and signing key that the Bitnami Sealed Secrets controller is initialized with. This lets the follow-up processes that we'll talk about here generate sealed secrets that can be committed to Git safely for subsequent application by Flux, and the controller is able to decrypt them successfully. Next slide.

Okay, so let's talk about the baseline services I mentioned earlier. This part is actually kicked off by a human currently, because different clusters have different requirements, but it's not very involved. We have a repository that manages all of what I call baseline services. To register a new cluster, we just add it to a manifest that specifies what we want to be deployed. Some examples: a logging stack, which is a standard EFK stack based on Open Distro, or monitoring, which is based on the kube-prometheus-stack.
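A quick sketch of the sealed-secrets flow mentioned above: tooling encrypts an ordinary Secret with the kubeseal CLI against the controller's prepopulated key pair, and only the resulting SealedSecret is committed to Git for Flux to apply. The names here are hypothetical and the ciphertext is a placeholder, not real output.

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: app-db-credentials             # hypothetical secret name
  namespace: team-app                  # hypothetical team namespace
spec:
  encryptedData:
    username: AgB3...                  # placeholder ciphertext produced by kubeseal
    password: AgA9...                  # placeholder ciphertext produced by kubeseal
  template:
    metadata:
      name: app-db-credentials         # the plain Secret the controller creates in-cluster
      namespace: team-app
```

Only the controller holding the private key can unseal it, so the object is safe to version in Git alongside everything else Flux applies.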
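And to give a feel for how one of those baseline services, say the monitoring stack, can be declared for the Flux v1 Helm operator, here is a hedged sketch using the helm.fluxcd.io/v1 HelmRelease API. The chart repository, version pin, and hostname are illustrative assumptions, not Kessel Run's actual values.

```yaml
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: monitoring
  namespace: monitoring
spec:
  releaseName: monitoring
  chart:
    repository: https://prometheus-community.github.io/helm-charts  # in practice likely a private mirror
    name: kube-prometheus-stack
    version: 16.0.0                    # illustrative version pin
  values:
    grafana:
      ingress:
        enabled: true
        hosts:
          - grafana.region-1.example.internal  # hypothetical per-region hostname
```

Per-cluster differences (regions, hostnames, which services are enabled) then become differences in committed values rather than manual tweaks in the cluster.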
Other things in this category include various host-based threat detection agents, NGINX ingress, everybody needs an ingress, and potentially a lot of other things that fall under baseline services. That is, things that aren't deployed by the teams themselves. When this is committed in GitLab, it kicks off a pipeline that results in the creation of some directories in Git that are monitored and applied by Flux. Next slide, please.

Okay, so here's a tiny sample of what that monitored deployment directory looks like. I've deleted a lot of stuff just to make it fit, but I want to emphasize that not only are we able to configure and deploy whatever Helm releases are required, we can also handle other customizations. For example, we can do RBAC that limits the scope of what a given team can access in the cluster, regardless of whether they're accessing it with kubectl, the Rancher dashboard, or a service like Kibana or Grafana. We can pre-initialize namespaces for various team applications. We can customize the appropriate host names to use based on the region for things like GitLab, the various private Docker registries that we host, and the Nexus repositories for Helm charts and other artifacts. Basically, any resource that you want managed in clusters spread across multiple regions can be handled in a similar fashion. Next slide, please.

This is just a tiny sample of a Flux v1 config. We're actually in the middle of upgrading to Flux v2 and the new Helm controller from the GitOps Toolkit. I showed the Rancher dashboard in an earlier slide; well, one of the things it likes to do when it's monitoring clusters is add labels and annotations to namespaces. But if you let Flux control namespace resources in its normal fashion, it will just happily revert any changes made by Rancher. Here, we've altered Flux's normal behavior. Normally, it runs all the generator commands, concatenates their output, and applies it. But the first command you see here generates nothing to standard output. Instead, it runs a script from the Flux pod that will create all the required namespaces if they do not exist, but otherwise leaves them alone. So we're able to let Flux control the initial deployment of the namespaces and at the same time let Rancher do its thing. Next slide, please.

All right, I mentioned team applications earlier. Well, we have a tool called RAD, which stands for Release and Deployment Dashboard. It is our internal self-service dashboard for teams to use. Behind the scenes, RAD generates Kubernetes manifests that are committed to a specific Git repository for a given cluster. Flux is running in the cluster; it tracks these changes and applies them. It's been a great help for our application developers as they transition from Pivotal Cloud Foundry to native Kubernetes deployments. Now, as part of this, we include what we call a platform manifest. It declares various resources that may be required by an application, like databases, caching services, whatever. We have a controller running in the cluster that parses and processes this and deploys those required resources automatically. We'll talk about it more in a sec. RAD is an abstraction point in the truest sense, in that it is able to evolve from a user-interaction perspective while completely abstracting the deployment environment. And soon, it will also handle deployment to our air-gapped environments.
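Circling back to the Flux v1 generator configuration Gordon describes, where the first command only ensures namespaces exist and prints nothing while the second emits the manifests to apply, such a .flux.yaml can look roughly like the sketch below. The script path is hypothetical; this is an illustration, not the actual Kessel Run config.

```yaml
version: 1
commandUpdated:
  generators:
    # Hypothetical script shipped in the Flux pod: creates any missing team
    # namespaces but writes nothing to stdout, so Flux never "owns" the
    # namespace objects and Rancher's labels/annotations are left untouched.
    - command: /home/flux/scripts/ensure-namespaces.sh
    # Normal generator: emit the rest of the manifests for Flux to apply.
    - command: kustomize build .
```

Because the namespace script produces no output for Flux to apply, the namespaces are created once and then left alone on every subsequent sync, which is how the Rancher annotations survive.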
For those air-gapped environments, RAD will package up the appropriate artifacts, securely transmit them to the appropriate environment, and we'll be good to go over there. Next slide, please.

Okay, this is a small example of the platform manifest that I mentioned on the previous slide. Notice in the right-hand column, in the baseline dependencies for this particular app, that it's asking for an instance of MySQL, declared there with some parameters. So when RAD deploys this manifest along with the other ones, the controller sees it, parses it, and automatically deploys this and any other dependencies that are requested here. So it's very easy for teams to get up and running in Kubernetes with this.

I think on the next screenshot, yes. Okay, so here we have a somewhat redacted screenshot that shows the development releases of an application we call KRID. And on the following screenshot, again redacted, are the production deployments, including which versions are deployed where. We can have different versions of a given application deployed in different environments: staging, production, whatever. And RAD will also report the results of a deployment to an environment to tell you whether it was successful or not. It will not only tell you that the pipeline succeeded, but also whether or not the pods came up, et cetera. And Michael, I'm going to throw it back to you for the summary, please.

Awesome, thank you very much, Gordon. So I want to pull it all together, since we have a limited amount of time, and get to Q&A with everyone. To sum up what we're doing at Kessel Run: these GitOps patterns, and technologies like Flux, really do help solve pain points with fleet and configuration management for our particular problems. Going back to some of the slides I showed earlier, we operate in a very complex environment, and I'm sure that's not new to anyone here at the conference. But for the particular challenges we face, when we looked at the problem space and the solutions that were out there, the declarative paradigm that Kubernetes introduced, and the controller patterns that have sprung out of that project and influenced technologies and patterns like GitOps and Flux, really do help solve a lot of the nightmares that teams previously had to deal with in managing this type of complexity across multiple air-gapped networks and truly worldwide infrastructure.

Also, from a compliance perspective, being able to take our end-to-end configuration, from cluster provisioning and baseline services all the way to the application teams themselves, and have end-to-end, version-controlled auditability really helps organizations like ours that operate in a very regulated, policy-driven environment make the argument we're trying to make: that developers can move quicker and ship applications with more freedom, backed by technologies like this that provide the control and auditing needed to understand who is introducing changes and to monitor those changes as they go out the door.

And lastly, one thing I'd be remiss not to say is that there's commonly a misperception about what government software engineering can look like. In the US, we have a very big Fortran problem that a lot of really smart people are having to go solve.
But there are places like Kessel Run and others in the Air Force and the DoD that are solving very interesting, complex problems using technologies hosted by the CNCF and the Linux Foundation to drive forward really interesting solutions to really complex problems. And with that, I really appreciate you letting us tell you a bit more about the problems we're solving and why these types of technologies are so useful to us in solving them. We will take questions, and we'll be around here for the next 10 to 15 minutes, I think. Thank you very much.