the internal developer platform at the New York Times. From there, it will dive into feedback and how it impacts developer productivity before going through an OPA Conftest demo. We'll talk about how we've implemented Conftest at the New York Times and how this has allowed us to take a trust-but-verify approach to application deployment, including moving toward automatic merging of automatically generated files. Finally, we'll take a peek into what we're doing next with schema validation using kubeconform.

So I wanna start with some background and definitions just to make sure we're all on the same page, and the first of those is GitOps. GitOps is an operational framework that utilizes Git as a single source of truth. It's used to manage infrastructure and application deployments and to prevent configuration drift, which occurs when the actual state of an application's infrastructure gradually diverges from its declared state. The next term is Argo CD, a declarative continuous delivery tool that utilizes GitOps principles to simplify application deployment to a Kubernetes cluster. The next term I wanna talk about is policy, which refers to a set of rules that govern how a system operates in a specific scenario. An example of a policy might be prohibiting the use of the latest tag on container images, or restricting a Kubernetes container's ability to run as root. And lastly, I wanna give you some background on OPA, the Open Policy Agent. This is an open source, general purpose policy engine that unifies policy enforcement across the cloud native stack. Our talk today pertains to a utility from OPA called Conftest, which allows you to write policy tests against structured configuration data. This in turn allows you to offer fast feedback by enabling policy enforcement locally or in the CI process, without ever needing to connect to a Kubernetes cluster.

Now I wanna tell you a little bit more about the internal developer platform at the New York Times: why we needed it, what our mindset was with respect to governance, and how the platform has improved developer productivity. As technology changes and evolves, application developers need to learn more tools to keep up. Application developers spend less time developing features as they also integrate and manage containerization, infrastructure as code, testing, building, monitoring and more. And seeing as these technologies can be broken down even further, you can understand that this quickly piles up, and even if it were reasonable to expect an application developer to learn all of these on their own, doing so would almost certainly impact their productivity and their ability to develop features.

Before the IDP, the internal developer platform, teams managed their own cloud environments and infrastructure. There was little standardization, with individual teams taking different approaches to infrastructure and cluster architecture. And while this gave developers broad freedom in how they designed and implemented their systems, at scale it led to infrastructure sprawl that was untenable to manage, with too much overhead, decreased efficiency and increased costs. So to tackle the issue of sprawling infrastructure, we adopted a shared, centralized Kubernetes architecture. Developers no longer have to manage or maintain their own Kubernetes clusters, and routine infrastructure tasks are handled by dedicated operations and platform teams that make up the New York Times delivery engineering mission.
Distinct application developer teams, which are known as tenants, are given isolated spaces on these shared clusters. This abstraction has improved developer productivity by allowing developers to focus on features instead of operations. To use these shared clusters, teams were invited to onboard using the IDP. The IDP abstracts away much of the deployment and operations process by handling certain repository creation, CRD creation and configuration related to running and routing applications. Those of us on the CI/CD team wanted deployment to be easy and relatively painless, because we wanted feature developers to get to spend more time developing features and less time learning a new CI or CD tool.

But while using a shared infrastructure has many benefits, there are also important considerations that we had to keep in mind. The first of those was security. Ensuring that all teams adhered to the same security policies and regulations was critical to preventing potential vulnerabilities, especially when multiple teams are working on the same infrastructure, even if those teams are isolated from one another. It's also critical to consider how moving to a shared infrastructure, and how over-governance or over-prescriptiveness in our policy writing, could lead to decreased developer autonomy. Since devs had to rely on others to set up and manage their shared resources, they might have had to make changes to the way that they implemented or designed their systems. Adopting a shared infrastructure should be a balance, and deployment should always be a collaboration.

Feedback was another crucial consideration. Our live shared Kubernetes clusters are governed with OPA Gatekeeper, an admission controller that intercepts API requests and admits or rejects them according to your declared policy. While this tool is important in a robust policy system, it provides late feedback to developers on whether their Kubernetes objects comply with policy. And if policies and requirements aren't clearly communicated to developers early and often, this slower feedback cycle can lead to increased frustration and increased time to problem resolution.

So I wanna talk more about feedback: how it interacts with GitOps, how it impacts developer productivity, and how we can leverage OPA Conftest as a feedback tool in a centralized policy hub. And this starts with a pop quiz. Raise your hand when you hear what you think is the answer to: when is the ideal time for developers to begin getting feedback in a GitOps operational framework? A, during the PR process, before merging to a main branch. B, during automated testing in a CI process. C, after merging to the main branch, during the sync process. D, while developing locally. Thank you to the one person who's raised his hand so far. E, after any user or stakeholder testing. Or F, developers should get feedback as frequently as possible throughout the development life cycle. Thanks, Marco. So it was a bit of a trick question. D is correct for when feedback should begin, but F is right in that it should come as frequently as possible. Feedback should be continuous in a GitOps operational framework, but it should start locally, before you've even pushed any code. From a GitOps standpoint, the later you receive feedback, the worse off you are. If a developer is receiving their first compliance feedback at the time of deployment, their code is already merged into that single-source-of-truth Git repository.
And since this code can't be applied to the live cluster, the application state has drifted from the declared state. So feedback needs to come much sooner than this. But we also have to be realistic. When you have an organization with tens or hundreds or thousands of developers, how do you communicate your policies to them? A common practice is to have policies passed down by security teams in internal documentation or ticketing systems or pinned Slack messages or emails or some other channel, like carrier pigeons, I don't know. And the truth is, developers don't know where to find those. If they do know where to find them, they might be doubtful that they're up to date and reliable. If they're new, they might not know where to look. And while alerting tools are great, they are also late feedback, informing security teams of non-compliance only after resource creation or application deployment. The later you receive feedback, and the more policies you put in place that developers need to remember and implement, the more it is going to impact their productivity. And remember that prior to the internal developer platform at the New York Times, teams were deploying to and managing their own infrastructure. So if we made the process much more difficult and complicated than it had to be, it was going to be harder to convince them to onboard and give the shared cluster a shot. We should focus on making policies easy to access, start giving feedback early, and after that give feedback continuously.

So how can we do this? We can use OPA Conftest. We can write policies, package them into signed OPA bundles and make them available to an organization. We can set any policies that we want, from high level to deeply detailed or technical. We can write warnings for upcoming policy changes. And we might not be able to force developers to leverage the tools for their local development, but we can integrate Conftest into CI, and we can make sure that non-compliant configuration can't pass to the next phase of development. This feedback is fast and continuous and gives operations teams some peace of mind, especially when we're dealing with configuration files that are automated and automatically created. We can know that those automatically created files meet a minimum level of compliance. And with time and attention to detail, we can turn this into automatic merging, making these systems more scalable. No longer does every Argo AppProject spec need to be reviewed manually by a CI/CD engineer. We write the policies we need and we let Conftest do the rest.

So with that, I wanna move into a demo of actually writing policies for OPA Conftest. I'll show you the anatomy of a Rego policy, I'll teach you a little bit of Rego logic, and then we'll write some progressive policies for an example that I'll lay out. We'll look at some Kubernetes specs. We're also gonna look at policies that govern a Drone pipeline, which is a continuous integration tool that we use at the New York Times. And finally, I'll show you a policy that can be used to govern namespace resources in an Argo AppProject. And just as an aside, if you'd like access to the code I'm about to go over, there's a public GitHub repository of examples that are used in this presentation. You can scan this QR code. These slides are also already available and will be available after the talk, so you can look at them, and there's a command at the bottom that, if you have Conftest on your machine, will pull down the policies now.
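As a rough illustration of what that kind of CI integration can look like, here's a sketch that assumes Conftest's pull command and a go-getter style Git URL. The repository URL and paths are hypothetical, not the command from the slide:

    # Pull a shared policy bundle, then test generated configuration against it;
    # a non-zero exit code fails the CI step.
    conftest pull git::https://github.com/example-org/policies.git//conftest
    conftest test --policy policy/ manifests/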
And I also wanna talk a little bit about Rego, which in its documentation describes itself as easy to read, write and understand. Which is not true. There is a learning curve. I think once you get over the learning curve, it's fine and it is those things, but the logic can be a little unintuitive. Within Conftest, we can really write two kinds of tests: a warning or a failure. And within failures, we can write deny policies that return a string message, or violation failures, which return structured data errors.

To write our policies, we're gonna use this example scenario of prohibiting the use of latest tags for container images in non-dev environments. Using the latest tag can lead to unexpected and potentially harmful changes, because it can point to a newer version of the compiled software without any explicit action on the user's part. And we all know that every time you use a latest tag in production, a security engineer gets very sad and a hacker gets very happy. So we're gonna write a Rego policy that governs the Kubernetes Deployment spec that you see here. The spec may contain multiple containers, which are defined in a list that's in the red box. To write this policy, we first need to decide what kind of rule we need. In this case, we want to deny the use of latest image tags, so we're going to write a deny rule.

To define our rule, we're gonna create a variable called image, which specifies the path to the container images in the Kubernetes Deployment spec. This path is prefixed with the word input, which refers to the input data that is provided at runtime. Since there can be multiple containers defined in a single Kubernetes Deployment spec, we need to iterate through the images of each container in the list. And we're going to use a special iterator character, the underscore, instead of a variable like i. This is a best practice in Rego when the iterator variable has a local scope and is not referenced elsewhere; I will come back to this later. Next, we'll use Rego's built-in endswith function to check if the image ends with the tag latest, and if it does, we'll return this message. We can then test our configuration against this policy with the command conftest test, and the result, as you can see and as we expected, is a failure. A rough sketch of this first rule follows at the end of this part.

So the example that I originally laid out was that latest tags can't be used in production or staging environments. How can we update our policies to reflect that these latest image tags can be used in a dev environment, just not staging or production? Once again, we're gonna start by looking at two Deployment specs. One is for a dev environment and one is for a staging environment, and both have one container with one image that is tagged latest. We wanna write a deny rule for the staging deployment, but for the dev environment, let's write a warning message that just conveys the information that, hey, this is okay now but not in higher environments. So let's go through how we would write these policies. I'm gonna show you each part one at a time, and then I'll show you all of them together at the end. First, we'll define a rule that requires the use of these env labels, and then we'll have another rule that enforces the label to be equal to one of the three values in this set. And you'll notice that in these examples, instead of just using the term deny or warn, I'm suffixing the policies to give them more descriptive names.
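Here's that rough sketch of the first, simple deny rule described above. The exact slide code lives in the linked GitHub repository; the rule shape and message text here are approximations, assuming the image tag check is a simple suffix match on ":latest":

    package main

    # Deny any container image whose tag is "latest".
    # "input" is the manifest under test, supplied by Conftest at runtime;
    # the underscore iterates over every container in the Deployment spec.
    deny[msg] {
        image := input.spec.template.spec.containers[_].image
        endswith(image, ":latest")
        msg := sprintf("container image '%s' uses the 'latest' tag", [image])
    }

Saved as something like policy/latest.rego, this would be exercised with conftest test deployment.yaml, which fails whenever any container image ends in :latest.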
Next, we're gonna write a very simple function, is_dev, and all this does is compare the env label to the string dev. Then we have this deny latest tags rule. What this rule is really saying is: if the image is tagged latest, or ends with latest, and the label indicates this is a non-dev environment, deny it. And finally, if the image is tagged latest and it is a dev environment, we'll return this warning message as a heads up to developers. Here's that entire policy together, minus the set that enforces the env values, because that was just too much code to fit on one slide, but it is in that GitHub repository, and a rough sketch also follows at the end of this part. And here are the outputs of those rules: a warning and a failure, as we would expect.

So what if you wanna allow a certain image to be tagged latest? In this case, you can use an exception to define the images that you would like to allow, and you can see that in the upper right quadrant of this slide. This defines a set of allowed images and creates an exception to the latest tags rule if the container image is in that set. And these exceptions are reflected in the testing output.

All of our examples thus far have used the image path for Deployments and ReplicaSets in Kubernetes. If we wanted to capture Pods as well, we could write a different deny rule to check for that specific input path, and this gives me a chance to talk a little bit more about logic in Rego. A policy is essentially a collection of Boolean conditions. In order for a policy to be true, all of the conditions defined within that policy have to be true. So in this case, the deny no env label rule will be true if metadata.labels.env does not exist. But I also wanna pay special attention to the message as well, because if the kind and the metadata name aren't defined, the message will not evaluate to true and the deny policy won't be triggered. This has happened to me before. If we add other conditions into the deny policy, all of those conditions must be true for the deny to be triggered. So joining multiple expressions together in a single rule is a logical AND between those conditions, which might make you think: what about a logical OR? To express a logical OR in Rego, you define multiple rules with the same name. So in this example, the deny latest tags rule is defined twice, once for Pods and once for Deployments and ReplicaSets. As in earlier examples, it's completely fine to just name your rules deny, but that could be viewed as deny this or this or this or this. If you have a desire to suffix your policy names for clarity, you can do that too.

OPA also provides a framework that allows you to write unit tests for your policies, because I bet everyone in this room loves writing tests for their tests. The test written on the left is testing our deny latest tags rule. The with input as language overrides the input data with the testing input on the right, which is just a Kubernetes Deployment YAML that I've translated into JSON. You can also use the OPA CLI to check your code coverage, and it will give you a detailed output of which lines are covered and the total code coverage, as well as code coverage of specific policy files. Overall, writing unit tests helps ensure the correctness and reliability of your policies and enables more confident decision making in your system.

So that's the basic anatomy of a Rego policy and some basic information about how to write policies that govern Kubernetes objects, but Conftest isn't limited to Kubernetes manifests. Conftest supports governance of any kind of structured configuration data.
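Before moving on, here's that rough sketch of the env-aware rules, the exception, and a small unit test. Rule names, the allowed env values, the exempted image and the messages are all illustrative assumptions, not the exact slide code:

    package main

    allowed_envs := {"dev", "stg", "prd"}   # assumed label values

    deny_missing_env_label[msg] {
        not input.metadata.labels.env
        # Note: if kind or metadata.name is missing, this msg expression is
        # undefined and the rule silently never fires, the gotcha mentioned above.
        msg := sprintf("%s '%s' is missing the required 'env' label", [input.kind, input.metadata.name])
    }

    deny_invalid_env_label[msg] {
        env := input.metadata.labels.env
        not allowed_envs[env]
        msg := sprintf("'env' label '%s' must be one of %v", [env, allowed_envs])
    }

    is_dev(env) {
        env == "dev"
    }

    # Latest tags are denied outside of dev...
    deny_latest_tags[msg] {
        image := input.spec.template.spec.containers[_].image
        endswith(image, ":latest")
        not is_dev(input.metadata.labels.env)
        msg := sprintf("image '%s' uses the 'latest' tag in a non-dev environment", [image])
    }

    # ...and only warned about in dev.
    warn_latest_tags[msg] {
        image := input.spec.template.spec.containers[_].image
        endswith(image, ":latest")
        is_dev(input.metadata.labels.env)
        msg := sprintf("image '%s' uses the 'latest' tag; allowed in dev, denied in staging and production", [image])
    }

    # Exempt specific images from the latest_tags deny rule; Conftest exceptions
    # return the rule name suffixes (without the deny_ prefix) to be excepted.
    allowed_latest_images := {"registry.example.com/debug-sidecar:latest"}

    exception[rules] {
        allowed_latest_images[input.spec.template.spec.containers[_].image]
        rules := ["latest_tags"]
    }

    # A unit test in the style described above, run with opa test (or conftest verify);
    # "with input as" injects the hand-written test input.
    test_latest_tag_denied_in_staging {
        staged := {
            "kind": "Deployment",
            "metadata": {"name": "example", "labels": {"env": "stg"}},
            "spec": {"template": {"spec": {"containers": [{"image": "nginx:latest"}]}}}
        }
        count(deny_latest_tags) == 1 with input as staged
    }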
For example, beyond Kubernetes manifests, you can write policies that govern Terraform configurations, Helm templates, Dockerfiles, Argo CRDs, CI pipeline configuration; the list goes on. So we'll look at a few additional examples of just how you can implement it. Here's an example of some policies that can be written for a Drone pipeline in YAML format. These policies follow our example scenario of latest tags. However, you will notice the syntax is a little bit different. These rules were written using future keywords, to give you an example of different ways to draft Rego policies. I also wanna bring attention to the use of the every keyword in this last block of code. This keyword will return true only if all items meet the condition, so it's useful if you need everything in a list to conform uniformly.

And finally, I wanna look at an example of governing namespace resources for an Argo AppProject CRD, specifically this namespaceResourceWhitelist portion of the spec. We wanna be able to check that these lists of mappings are a pair, and what I mean by that is, instead of just checking the groups, or instead of just checking the kind, I wanna say: here are the allowed groups, and here are the allowed kinds within each group. To do that, we're going to use a map and two rules: one that checks the correct groups and one that enforces the kinds belonging to each group. First I've set up this allowed namespace resources map that tells us the group and what kinds are allowed to belong to that group. This deny namespace group rule is pretty simple and just checks the keys in our map. This deny namespace kind rule then takes the group, creates a smaller list of the allowed values and then compares the input kind to that list. A rough sketch of these two rules follows at the end of this part. And now, quickly, you'll notice that here I have used some i instead of the underscore, and that's because an underscore instantiates a separate iterator each time it's used. So if you use the underscore in this case, you're going to end up comparing each kind to each group independently, which isn't what we want.

And if you don't believe me, Rego provides a handy print function, which is my favorite way of debugging. So I've added a print call to the exact rule we were just looking at to compare the group, the kind and that smaller allowed kind set. This is the behavior that we would expect using some i; were we to use an underscore, this would be the behavior. Each time an underscore is specified, a new iterator is instantiated, and under the hood, OPA translates this character to a unique variable name that doesn't conflict with variables and rules in scope. Therefore, using it twice in this manner is equivalent to instantiating two iterators, i and j, and you'll end up comparing every kind against every group, which will lead to unexpected behavior. If you don't like using the print function, you can also do a query trace, which is useful for debugging purposes sometimes. Together, these rules and tools can help you understand the underlying logic of your Rego policies.

So now that we've gone over how to write Conftest policies, I wanna bring everything we've talked about together and tell you about how we've implemented Conftest and where we're going next. My team started by writing specific high level rules that governed basic aspects of our Argo CD source repository, as well as our shared repositories for projects and applications.
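Here's that rough sketch of the AppProject namespace-resource rules. The map contents, rule names and messages are illustrative assumptions, and the some i binding is what keeps each group paired with its own kind:

    package main

    # Allowed API groups mapped to the kinds permitted within each group
    # (illustrative values only).
    allowed_namespace_resources := {
        "apps": {"Deployment", "ReplicaSet", "StatefulSet"},
        "networking.k8s.io": {"Ingress"},
        "": {"Service", "ConfigMap"}
    }

    deny_namespace_group[msg] {
        group := input.spec.namespaceResourceWhitelist[_].group
        not allowed_namespace_resources[group]
        msg := sprintf("group '%s' is not an allowed namespace resource group", [group])
    }

    deny_namespace_kind[msg] {
        # "some i" binds one index, so group and kind come from the same entry;
        # two underscores here would iterate independently and compare every
        # kind against every group.
        some i
        group := input.spec.namespaceResourceWhitelist[i].group
        kind := input.spec.namespaceResourceWhitelist[i].kind
        allowed_kinds := allowed_namespace_resources[group]
        not allowed_kinds[kind]
        # print(group, kind, allowed_kinds)   # handy while debugging
        msg := sprintf("kind '%s' is not allowed for group '%s'", [kind, group])
    }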
Those first policies were very bare bones, with the hope of checking agreement, preventing silly mistakes, and helping us streamline code reviews. While we own the app and project repositories that Argo syncs to the live clusters, there are multiple sources that might commit or merge to those repos. So we set up minimal, and then more exhaustive, levels of policy. These help ensure some simple checks on the configuration code being checked into the repositories before it is merged and synced to the live clusters. When onboarding new applications to our shared infrastructure, our internal platform handles the creation of certain Git repositories and configuration. Part of this includes Argo objects like projects and applications. Our IDP opens PRs against these repositories, and with the policies that we have in place for those repositories, we feel comfortable allowing those PRs to automatically merge if they pass our policy tests. This allows us to scale deployment, since there are many more engineers at the New York Times who are deploying applications than there are engineers on the CI/CD team. We've been able to avoid bottlenecks because we don't need an engineer to review every PR, or to review them as comprehensively.

Policies can be shared by packaging them into an OPA bundle and serving that bundle from an OPA server. These bundles can be signed. They can also be pulled via the Conftest CLI using a URL or a specific protocol like git. This has allowed us to govern multiple kinds of configuration data as they come through our CI tool. Policies can also be used to help new engineers onboard more quickly by giving them fast and continuous feedback about our organization-wide policies while they're developing. Kind of like guardrails at bowling. So that's how we've used OPA Conftest to automate configuration and permissions testing within the GitOps operational framework. And now I'm going to turn this talk over to Mike so he can tell you a little bit about our next steps of delving into schema validation with kubeconform.

Thanks, Eve. I'm gonna wrap this up with something a little bit more dry, which is manifest schema validation. Validation is a critical aspect of ensuring that our Kubernetes clusters behave as expected and deliver the functionality that we intend. I have this diagram to sort of illustrate the separation of concerns. Unlike governance, which focuses on ensuring your application as a whole conforms to specific policies and rules, validation is focused on ensuring that individual Kubernetes objects are defined and configured correctly. When we talk about a manifest being valid, we mean that it conforms to an expected structure and syntax and can be safely deployed to your Kubernetes cluster. Without this, misconfigured manifests can lead to unintended behaviors. This is the key reason why validation is so important. For example, if we define a Kubernetes manifest with incorrect syntax or values, we can encounter unexpected behaviors when we try to deploy it, like resource deletion or an application that doesn't come up. YAML is really simple, but it is all too easy to make mistakes when writing manifests by hand. I'm gonna give some specific manifest examples, but this still applies when you're using templating tools to generate manifests. So to help mitigate these risks of misconfigured manifests, we use a tool called kubeconform. Kubeconform is an open-source tool developed by Yann Hamon.
Hope I'm saying that right, but thank you for this tool. It's designed specifically for validating manifests and can easily integrate into your CI pipelines. It's just a one-liner that you can throw all your generated manifests at. It allows us to validate our manifests against a source of schemas that get matched up to their definitions, and to ensure that they meet an expected syntax and structure. And we do this, again, because shifting this left gives faster feedback to developers on misconfigured manifests earlier in the development process. This can reduce the risk of errors and minimize the time and effort required to troubleshoot and fix issues. I also wanna note that, in addition to Kubernetes primitives, we also have custom resource definitions, which allow us to define our own custom resource objects within Kubernetes. Unlike Kubernetes primitives, which are included out of the box with Kubernetes, CRDs must be defined explicitly and are not a part of the Kubernetes core. And because they are not a part of the Kubernetes core, they require a little extra attention and detail when validating, to ensure that they are defined correctly. We'll get a little bit into this later.

So now that we've explained why validation is important, let's look at some invalid manifests. For instance, a not-so-recent update, the Kubernetes 1.22 release, involved the complete removal of the v1beta1 Ingress API. That's very specific, but it meant that you had to update all your Ingress manifests to the new version, because the older version was no longer supported. This happened to us. In fact, we had a lingering cluster in Sandbox, and when we upgraded the cluster, many of the services just didn't come up. This was fine because it was Sandbox, but it did take our developers some time to understand what wasn't working. Adding this one-liner to one of our projects immediately surfaces the issue. So following up on our earlier example of the deprecated Ingress version, a developer may take a stab at just updating the API version, and kubeconform can help identify these issues by flagging any errors in the manifest, allowing us to address them before deploying to our Kubernetes clusters. Here's an example of what that error message would look like using kubeconform to validate the manifest. We can see that pathType doesn't exist at that location in the newer version, and we can make the necessary edits and continue iterating on this manifest until kubeconform reports zero errors.

So here's another example of an incorrect manifest. Raise your hand if you can spot the error. Cool. It's just a really small thing: the syncPolicy is not under .spec. If you don't spot that, it's fine. This is really a job for CI or your IDE. Kubeconform can also easily validate custom resources. To validate an application CRD, such as the one from Argo CD, you're able to provide a catalog of CRD schemas stored in a remote repository. Datree, for instance, offers a public repo with a collection of popular CRDs from Kubernetes controllers like cert-manager and AWS Karpenter. You can see that we add just the schema location flag followed by a URL to the raw file, with some placeholders that tell kubeconform how to build the URL for the custom resource definition.
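For reference, a typical invocation along the lines the slide describes might look roughly like this; the manifests path is just an example, and the catalog URL pattern is the one publicly documented for the Datree CRDs-catalog:

    # Validate both Kubernetes primitives and custom resources; the second
    # -schema-location tells kubeconform how to build URLs into the Datree
    # CRD catalog from each resource's group, kind and apiVersion.
    kubeconform -strict -summary \
      -schema-location default \
      -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
      manifests/*.yaml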
The New York Times also runs a number of in-house Kubernetes controllers that handle certain operations specific to our Kubernetes architecture. Using kubeconform, we can host our private CRD schemas and reference them in CI, allowing developers to validate in-house CRDs. So once again, we find kubeconform really easy to use for validating both Kubernetes primitives and CRDs, whether they're public or private. This tool is really small but powerful. It's just a one-liner to add to your CI, and you can ensure that all your manifests are correctly defined and structured, reducing the risk of misconfigurations and errors when deploying to your cluster. And with that, I'm gonna bring it back to you.

Okay, so as tech evolves, and as we find ourselves moving toward these centralized architectures and toward using more automatically generated manifests and configs, it's more and more important to have robust governance and validation in place. Tools like kubeconform and OPA Conftest, and validation and governance more generally, have allowed those of us in the delivery engineering mission at the New York Times to take more of a trust-but-verify approach to deployment. These tools also shift the feedback timeline, providing clear guidelines and guardrails for feature developers early in the development process. Feature devs get to implement more quickly and with fewer roadblocks, and when they do deploy, they can feel confident that their code is compliant and valid. The use of these technologies creates a more robust, collaborative ecosystem of software development and keeps the relationship between dev and ops cordial and transparent. So thank you for coming to our talk. If you're interested in other talks about the internal developer platform, these happened earlier in the week, but you should be able to catch them on video in a few weeks. And if you'll be at cdCon and GitOpsCon next month, my colleague and I will be giving a keynote there and we hope you'll attend. Thank you.